    An Information-Extraction Approach to Speech Processing: Analysis, Detection, Verification, and Recognition

    The field of automatic speech recognition (ASR) has enjoyed more than 30 years of technology advances due to the extensive utilization of the hidden Markov model (HMM) framework and a concentrated effort by the speech community to make available a vast amount of speech and language resources, known today as the Big Data Paradigm. State-of-the-art ASR systems achieve a high recognition accuracy for well-formed utterances of a variety of languages by decoding speech into the most likely sequence of words among all possible sentences represented by a finite-state network (FSN) approximation of all the knowledge sources required by the ASR task. However, the ASR problem is still far from being solved because not all information available in the speech knowledge hierarchy can be directly integrated into the FSN to improve the ASR performance and enhance system robustness. It is believed that some of the current issues of integrating various knowledge sources in top-down integrated search can be partially addressed by processing techniques that take advantage of the full set of acoustic and language information in speech. It has long been postulated that human speech recognition (HSR) determines the linguistic identity of a sound based on detected evidence that exists at various levels of the speech knowledge hierarchy, ranging from acoustic phonetics to syntax and semantics. This calls for a bottom-up attribute detection and knowledge integration framework that links speech processing with information extraction, by spotting speech cues with a bank of attribute detectors, weighting and combining acoustic evidence to form cognitive hypotheses, and verifying these theories until a consistent recognition decision can be reached. The recently proposed automatic speech attribute transcription (ASAT) framework is an attempt to mimic some HSR capabilities with asynchronous speech event detection followed by bottom-up knowledge integration and verification. 
    In the last few years, ASAT has demonstrated good potential and has been applied to a variety of existing applications in speech processing and information extraction.

    Adaptation to New Microphones Using Artificial Neural Networks With Trainable Activation Functions

    Model adaptation is a key technique that enables a modern automatic speech recognition (ASR) system to adjust its parameters, using a small amount of enrolment data, to the nuances in the speech spectrum due to microphone mismatch between the training and test data. In this brief, we investigate four different adaptation schemes for connectionist (also known as hybrid) ASR systems that learn microphone-specific hidden unit contributions, given some adaptation material. This solution is made possible by adopting one of the following schemes: 1) the use of Hermite activation functions; 2) the introduction of bias and slope parameters in the sigmoid activation functions; 3) the injection of an amplitude parameter specific for each sigmoid unit; or 4) the combination of 2) and 3). Such a simple yet effective solution allows the adapted model to be stored in a small-sized storage space, a highly desirable property of adaptation algorithms for deep neural networks that are suitable for large-scale online deployment. Experimental results indicate that the investigated approaches reduce word error rates on the standard Spoke 6 task of the Wall Street Journal corpus compared with unadapted ASR systems. Moreover, the proposed adaptation schemes all perform better than simple multicondition training and compare favorably against conventional linear regression-based approaches while using up to 15 orders of magnitude fewer parameters. The proposed adaptation strategies are also effective when only a single adaptation sentence is available.
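
Schemes 2) to 4) above can be pictured as a sigmoid unit with a few extra per-unit parameters. The following is a minimal sketch, not the authors' implementation: the abstract does not give the exact parameterization, so the form `amplitude * sigmoid(slope * x + bias)` and the names `adaptable_sigmoid`, `adapt_layer`, `amplitude`, `slope`, and `bias` are assumptions made for illustration.

```python
import math


def adaptable_sigmoid(x, amplitude=1.0, slope=1.0, bias=0.0):
    """Sigmoid with per-unit adaptation parameters (assumed form).

    With the default values this reduces to the standard sigmoid, so an
    unadapted unit behaves exactly like the original network.
    """
    return amplitude / (1.0 + math.exp(-(slope * x + bias)))


def adapt_layer(pre_activations, unit_params):
    """Apply one (amplitude, slope, bias) triple per hidden unit.

    `unit_params` is a list of dicts, one per unit; only these few values
    would need to be stored per microphone, not the network weights.
    """
    return [
        adaptable_sigmoid(x, **params)
        for x, params in zip(pre_activations, unit_params)
    ]


if __name__ == "__main__":
    hidden_inputs = [0.0, 1.5, -2.0]
    # Hypothetical adapted values for three hidden units.
    params = [
        {"amplitude": 1.0, "slope": 1.0, "bias": 0.0},   # left unadapted
        {"amplitude": 1.2, "slope": 0.8, "bias": 0.1},   # scheme 4: all three
        {"amplitude": 0.9},                              # scheme 3: amplitude only
    ]
    print(adapt_layer(hidden_inputs, params))
```

Under this assumed form, the adapted model adds at most three numbers per hidden unit, which is the kind of small storage footprint the abstract emphasizes.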

    Hermitian Polynomial for Speaker Adaptation of Connectionist Speech Recognition Systems

    Model adaptation techniques are an efficient way to reduce the mismatch that typically occurs between the training and test conditions of any automatic speech recognition (ASR) system. This work addresses the problem of increased degradation in performance when moving from speaker-dependent (SD) to speaker-independent (SI) conditions for connectionist (or hybrid) hidden Markov model/artificial neural network (HMM/ANN) systems in the context of large vocabulary continuous speech recognition (LVCSR). Adapting hybrid HMM/ANN systems on a small amount of adaptation data has proven to be a difficult task, and has been a limiting factor in the widespread deployment of hybrid techniques in operational ASR systems. Addressing the crucial issue of speaker adaptation (SA) for hybrid HMM/ANN systems can thereby have a great impact on the connectionist paradigm, which will play a major role in the design of next-generation LVCSR considering the great success reported by deep neural networks - ANNs with many hidden layers that adopt the pre-training technique - on many speech tasks. Current adaptation techniques for ANNs, based on injecting an adaptable linear transformation network connected to either the input or the output layer, are not effective, especially with a small amount of adaptation data, e.g., a single adaptation utterance. In this paper, a novel solution is proposed to overcome those limits and make adaptation robust to scarce adaptation resources. The key idea is to adapt the hidden activation functions rather than the network weights. The adoption of Hermitian activation functions makes this possible. Experimental results on an LVCSR task demonstrate the effectiveness of the proposed approach.

    Joint optimization of event detectors and evidence merger for continuous phone recognition

    In recent years, different data-driven methods have been proposed to detect articulatory features (AFs) from short-term spectral representations. The main motivations for the AF-based approach are as follows. First, AFs in general can more accurately and parsimoniously characterize the acoustic variability associated with conversational speech. Further, while not explored in this work, AFs are more language universal than phones, and therefore they can generalize better and are easier to adapt to new languages. For use in phone-based systems, the AF scores are input to an evidence merger which produces phone posteriors as outputs. Several classifiers are usually built, and each classifier is trained to detect a single articulatory feature (describing manner and/or place). We believe that joint optimization of all the classifiers and the subsequent phone evidence merger may be beneficial for the classification performance. This work is a preliminary study in this direction, and it is validated on the continuous phone recognition task. A bank of articulatory detectors, designed using hidden Markov models (HMMs), learns the mapping from the MFCC space to the articulatory space. The detectors' outputs are then combined by the evidence merger. The AF-based phone posteriors are integrated into an existing ASR engine and applied to N-best rescoring. Experimental results show promising performance on the TIMIT corpus.

    A Multi-Objective Programming-Based Approach to Language Model Adaptation

    In this paper, we present a multi-layer learning approach to the language model (LM) adaptation problem by making use of multi-objective programming (MOP). The overall objective function of conventional MAP-based LM adaptation is implicitly a composition of two objective functions: the first objective is concerned with the maximum likelihood estimation of the model parameters from the in-domain data, while the second objective is concerned with an appropriate representation of prior information obtained from a general-purpose corpus. In this paper, we separate these individual objective functions, which are at least partially conflicting, and take an MOP approach to LM adaptation. The resulting MOP problem is solved in an iterative manner such that each objective is optimized one after another with constraints on the others. This iterative solution can be represented as a multi-layer learning problem, in each layer of which only one objective is minimized with constraints on the others. In estimating an n-gram LM, the number of layers is given by 2 × n, with one hidden unit per layer. The inputs to the hidden units are LMs of order up to n that are estimated either from the general-purpose corpus or from the in-domain data. When solved this way, the target LM is in the form of a log-linear interpolation of component LMs. In our preliminary experiments with bigram LMs, the proposed approach slightly outperformed linear interpolation. In our ongoing work with trigram LMs, we expect the proposed approach to outperform linear interpolation in terms of both the perplexity and the automatic speech recognition word error rate.
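
The contrast between linear and log-linear interpolation of component LMs can be sketched in a few lines. This is a generic illustration, not the paper's MOP solver: the weights are fixed by hand rather than learned, the function names are hypothetical, and the toy unigram probabilities are invented for the example. Every component is assumed to give non-zero probability to every word (that is, the models are smoothed).

```python
import math


def linear_interpolation(dists, weights):
    """P(w) = sum_i lambda_i * P_i(w); already normalized if weights sum to 1."""
    vocab = dists[0].keys()
    return {w: sum(lam * d[w] for lam, d in zip(weights, dists)) for w in vocab}


def log_linear_interpolation(dists, weights):
    """P(w) proportional to prod_i P_i(w) ** lambda_i, renormalized over the vocabulary."""
    vocab = dists[0].keys()
    unnormalized = {
        w: math.exp(sum(lam * math.log(d[w]) for lam, d in zip(weights, dists)))
        for w in vocab
    }
    z = sum(unnormalized.values())  # normalization constant
    return {w: score / z for w, score in unnormalized.items()}


if __name__ == "__main__":
    # Invented toy distributions over a three-word vocabulary.
    general = {"the": 0.5, "stock": 0.1, "market": 0.4}
    in_domain = {"the": 0.2, "stock": 0.5, "market": 0.3}
    weights = [0.3, 0.7]
    print(linear_interpolation([general, in_domain], weights))
    print(log_linear_interpolation([general, in_domain], weights))
```

Unlike the linear mixture, the log-linear form needs an explicit renormalization step, which for a conditional n-gram model would have to be carried out separately for each history.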

    An experimental study on continuous phone recognition with little or no language-specific training data

    We study continuous phone recognition with little or no language-specific speech training data. The phone recognizer integrates three levels of information from: (1) frame-based speech attribute detectors, (2) artificial neural network based phone event mergers, and (3) decoding-based evidence verifiers. With a set of acoustic phonetic attributes defined over a number of available languages, a collection of attribute-to-phone mapping rules can either be specified in a language-dependent way, one for each language, or even independently for all languages if the attribute specification is complete enough to cover all phones and the phone definition is universal enough to cover all spoken languages. We report experimental results on Japanese phone recognition with the OGI Multilingual Speech Corpus. It is interesting that good performance can be achieved without using any Japanese speech training data, and the phone accuracy rates vary depending on how the attribute detectors and phone mergers are configured. Further improvement is observed by adding a small amount of Japanese data to train the attribute-to-phone mergers.

    A study on lattice rescoring with knowledge scores for automatic speech recognition

    We study lattice rescoring with knowledge scores for automatic speech recognition. A frame-based log likelihood ratio is adopted as a score measuring the goodness-of-fit between a speech segment and the knowledge sources. We evaluate our approach in two different applications: phone recognition and connected digit recognition. By incorporating knowledge scores obtained from 15 attribute detectors for place and manner of articulation, we reduced the phone error rate from 40.52% to 35.16% using monophone models. The error rate can be further reduced to 33.42% for triphone models. The same lattice rescoring algorithm is extended to connected digit recognition using the TIDIGITS database, without using any digit-specific training data. We observed that the digit error rate can be effectively reduced to 4.03% from 4.54%, which was obtained with the conventional Viterbi decoding algorithm with no knowledge scores.
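
The absolute error rates quoted above are easier to compare as relative reductions. The short sketch below only does that arithmetic on the figures given in the abstract; the helper name `relative_reduction` is ours, and the triphone figure is left out because its baseline is not stated.

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction, in percent of the baseline error rate."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    phone = relative_reduction(40.52, 35.16)   # monophone phone error rate
    digit = relative_reduction(4.54, 4.03)     # TIDIGITS digit error rate
    print(f"phone error: {phone:.1f}% relative reduction")
    print(f"digit error: {digit:.1f}% relative reduction")
```

Both tasks come out at a relative reduction of roughly 11 to 13 percent, even though the absolute gains (5.36 and 0.51 points) look very different.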

    Bayesian Unsupervised Batch and Online Speaker Adaptation of Activation Function Parameters in Deep Models for Automatic Speech Recognition

    We present a Bayesian framework to obtain maximum a posteriori (MAP) estimation of a small set of hidden activation function parameters in CD-DNN-HMM based automatic speech recognition (ASR) systems. When applied to speaker adaptation, we aim at transfer learning from a well-trained deep model for a “general” usage to a “personalized” model geared towards a particular talker using a collection of speaker-specific data. To make the framework applicable to practical situations, we perform adaptation in an unsupervised manner, assuming the transcriptions of the adaptation utterances are not readily available to the ASR system. We conduct a series of comprehensive batch adaptation experiments on the Switchboard ASR task and show that the proposed approach is effective even with a CD-DNN-HMM built with discriminative sequential training. Indeed, MAP speaker adaptation reduces the word error rate (WER) to 20.1% from an initial 21.9% on the full NIST 2000 Hub5 benchmark test set. Moreover, MAP speaker adaptation compares favourably with other techniques evaluated on the same speech tasks. We also demonstrate its complementarity to other approaches by applying MAP adaptation to a CD-DNN-HMM trained with speaker adaptive features generated through constrained maximum likelihood linear regression (fMLLR), which further reduces the WER to 18.6%. Leveraging the intrinsic recursive nature of Bayesian adaptation and mitigating possible system constraints on batch learning, we also propose an incremental approach to unsupervised online speaker adaptation by simultaneously updating the hyperparameters of the approximate posterior densities and the DNN parameters sequentially. The advantage of such a sequential learning algorithm over a batch method is not necessarily in the final performance, but in computational efficiency and reduced storage needs, without having to wait for all the data to be processed. So far the experimental results are promising.

    A Theory on Deep Neural Network Based Vector-to-Vector Regression With an Illustration of Its Expressive Power in Speech Enhancement

    This paper focuses on a theoretical analysis of deep neural network (DNN) based functional approximation. Leveraging two classical theorems on universal approximation, an artificial neural network (ANN) with a single hidden layer of neurons is used. With modified ReLU and Sigmoid activation functions, we first generalize the related concepts to vector-to-vector regression. Then, we show that the width of the hidden layer of the ANN is numerically related to the approximation of the regression function. Furthermore, we increase the number of hidden layers and show that the depth of the ANN-based regression function can enhance its expressive power. We illustrate this representation with recently emerged DNN-based speech enhancement. We first compare the expressive power by varying ANN structures and then test the related regression performance under different noisy conditions in various noise types and signal-to-noise-ratio levels. Experimental results verify our theoretical prediction that an ANN with a broader hidden layer and a deeper architecture can jointly ensure a closer approximation of the vector-to-vector regression functions in terms of the Euclidean distance between the log power spectra of noisy and expected clean speech. Moreover, a DNN with a broader width at the top hidden layer can improve the regression performance relative to those with a narrower width at the top hidden layer.

    Hierarchical Bayesian combination of plug-in maximum a posteriori decoders in deep neural networks-based speech recognition and speaker adaptation

    We propose a novel decoding framework by dynamically combining K multiple plug-in maximum a posteriori (MAP) decoders, with each solving for a sequence of symbols in a state-by-state manner in time and according to a set of constraints on the symbol sequences in space. The score combination occurs at the state level with the set of K combination weights either chosen to be equal (i.e., equal weighting scheme) or learned from a collection of data through a hierarchical Bayesian setting. When applied to automatic speech recognition (ASR), leveraging some characteristic differences in computing acoustic probabilities with both feed-forward deep neural networks (DNNs) and Gaussian mixture models (GMMs) at the hidden Markov phone state level, these scores can be discriminatively combined in plug-in MAP decoding. The DNN and GMM parameters can be trained from a large collection of speaker-independent (SI) speech data and further refined with a small set of speaker adaptation (SA) utterances. The per-speaker, per-state combination weights can be learned from SA data through the proposed hierarchical Bayesian approach. Experimental results on the Switchboard ASR task show that an ad hoc fixed-weight combination already reduces the word error rate (WER) to 16.9% from an SI WER of 17.4%. Model adaptation with 20 utterances can reduce the WER to 16.7%, which is further reduced to 16.1% using the SA models and fixed-weight combination decoding. The best WER of 15.3% is attained by using the proposed hierarchical Bayesian learned weights combining the two SA and two SI systems. Finally, we contrast the proposed technique with a state-of-the-art static system combination approach based on multiple word lattices generated by different ASR systems, and minimum Bayes risk. The experimental results demonstrate that static system combination cannot boost system performance of the individual systems, and the proposed dynamic combination scheme is needed.
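
State-level score combination with K weights can be illustrated as follows. This is a rough sketch under stated assumptions, not the paper's decoder: the abstract does not say how scores are combined, so a weighted sum of log scores is assumed here, the weights are supplied by hand rather than learned through the hierarchical Bayesian procedure, and all names and toy numbers are hypothetical.

```python
import math


def combine_state_scores(log_scores, weights=None):
    """Weighted sum of per-state log scores from K systems (assumed form).

    `log_scores` is a list of K dicts mapping state -> log score.
    With `weights=None` the equal weighting scheme (1/K each) is used.
    """
    k = len(log_scores)
    if weights is None:
        weights = [1.0 / k] * k
    states = log_scores[0].keys()
    return {
        s: sum(w * scores[s] for w, scores in zip(weights, log_scores))
        for s in states
    }


def best_state(combined):
    """Return the state with the highest combined score."""
    return max(combined, key=combined.get)


if __name__ == "__main__":
    # Invented acoustic scores for two HMM states at a single frame.
    dnn = {"s1": math.log(0.7), "s2": math.log(0.3)}
    gmm = {"s1": math.log(0.4), "s2": math.log(0.6)}
    print(best_state(combine_state_scores([dnn, gmm])))              # equal weights
    print(best_state(combine_state_scores([dnn, gmm], [0.1, 0.9])))  # GMM-heavy
```

In this toy case the two weightings pick different states for the same frame, which shows why the choice of combination weights matters. Per-speaker, per-state weights as described in the abstract would replace the single weight list with one list per state.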