A formant tracking system toward automatic recognition of speech
This paper describes the implementation of a Speech Understanding System component which tracks the formants of pseudo-syllabic nuclei containing voiced consonants. The nuclei are isolated from continuous speech after a precategorical classification in which feature extraction is carried out by modules organized in a hierarchy of levels. FFT and LPC spectra are the input to the formant tracking system. It works under the control of rules specifying the possible formant evolutions, given previously hypothesized phonetic features, and produces fuzzy graphs rather than the usual formant patterns, because formants are not always evident in the spectrogram pattern
Overview of speech research in Italy: National and European projects
A survey of speech research being conducted in Italy in both academic and industrial institutions is presented. Several groups are currently working on speech technology and research in Italy, but no efforts to integrate their activities toward common goals have so far been pursued, mostly because no nationwide project has been organized and financed to promote their cooperation. A few commendable exceptions can be listed for which the collaboration between national groups has produced satisfactory results. Support for large‐scale man‐machine vocal interface programs has instead been granted by a number of European projects. Italian groups are participating in these projects in close cooperation with industrial and academic partners from different countries. Among these programs, probably the best known is the ESPRIT project, which is now at the beginning of its second phase. All aspects of speech and natural language processing have been addressed, from speech analysis and synthesis to speech recognition system assessment; some of the most relevant results will be presented along with some proposals for prospective research. Another European EUREKA project in which Italian groups are involved is PROMETHEUS, an acronym for PROgraM for a European Traffic with Highest Efficiency and Unprecedented Safety, where processing of noisy speech, speaker independence, and dialogue systems are the main topics in the man‐machine communication are
Analysis of Large-Scale SVM Training Algorithms for Language and Speaker Recognition
This paper compares a set of large-scale support vector machine (SVM) training algorithms for language and speaker recognition tasks. We analyze five approaches for training phonetic and acoustic SVM models for language recognition. We compare the performance of these approaches as a function of the training time each requires to reach convergence, and we discuss their scalability towards large corpora. Two of these algorithms can be used in speaker recognition to train an SVM that classifies pairs of utterances as belonging either to the same speaker or to two different speakers. Our results show that the accuracy of these algorithms is asymptotically equivalent, but they behave differently with respect to the time required to converge. Some of these algorithms not only scale linearly with the training set size, but are also able to give their best results after just a few iterations. State-of-the-art performance has been obtained in the female subset of the NIST 2010 Speaker Recognition Evaluation extended core test using a single SVM syste
Method and apparatus for efficient i-vector extraction
Most speaker recognition systems use i-vectors, which are compact representations of speaker voice characteristics. Typical i-vector extraction procedures are complex in terms of computations and memory usage. According to an embodiment, a method and corresponding apparatus for speaker identification comprise determining a representation for each component of a variability operator, representing statistical inter- and intra-speaker variability of voice features with respect to a background statistical model, in terms of a linear operator common to all components of the variability operator and having a first dimension larger than a second dimension of the components of the variability operator; computing statistical voice characteristics of a particular speaker using the determined representations; and employing the statistical voice characteristics of the particular speaker in performing speaker recognition. Computing the voice characteristics, by using the determined representations, results in significant reduction in memory usage and possible increase in execution spee
Large scale training of Pairwise Support Vector Machines for speaker recognition
State-of-the-art systems for text-independent speaker recognition use as their features a compact representation of a speaker utterance, known as an "i-vector". We recently presented an efficient approach for training a Pairwise Support Vector Machine (PSVM) with a suitable kernel for i-vector pairs on a fairly large speaker recognition task. Rather than estimating an SVM model per speaker, according to the "one versus all" discriminative paradigm, the PSVM approach classifies a trial, consisting of a pair of i-vectors, as belonging or not to the same speaker class. Training a PSVM with a large amount of data, however, is a memory- and computation-intensive task, because the number of training pairs grows quadratically with the number of training i-vectors. This paper demonstrates that only a very small subset of the training pairs is necessary to train the original PSVM model, and proposes two approaches that allow discarding most of the training pairs that are not essential, without harming the accuracy of the model. This dramatically reduces the memory and computational resources needed for training, which becomes feasible with large datasets including many speakers. We have assessed these approaches on the extended core conditions of the NIST 2012 Speaker Recognition Evaluation. Our results show that the accuracy of the PSVM trained with a sufficient number of speakers is 10-30% better than that obtained by a PLDA model, depending on the testing conditions. Since the PSVM accuracy increases with the training set size, but PSVM training does not scale well for large numbers of speakers, our selection techniques become relevant for training accurate discriminative classifier
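The quadratic growth in training pairs mentioned in the abstract above can be illustrated with a small counting sketch. It is not taken from the paper: the function name `count_trial_pairs` and the speaker and utterance counts are hypothetical, and it assumes that every unordered pair of training i-vectors is a candidate trial.

```python
from math import comb


def count_trial_pairs(utterances_per_speaker):
    """Count unordered i-vector pairs for a training set.

    utterances_per_speaker: list with one entry per speaker, giving the
    number of i-vectors (utterances) available for that speaker.
    Returns (total, same_speaker, different_speaker).
    Assumption: every unordered pair of i-vectors is a candidate trial.
    """
    n = sum(utterances_per_speaker)           # total number of i-vectors
    total = comb(n, 2)                        # n * (n - 1) / 2 pairs overall
    same = sum(comb(k, 2) for k in utterances_per_speaker)
    return total, same, total - same


# Hypothetical set sizes: 10 utterances per speaker in both cases.
small = count_trial_pairs([10] * 100)    # 1,000 i-vectors
large = count_trial_pairs([10] * 1000)   # 10,000 i-vectors

# Ten times more i-vectors gives roughly a hundred times more pairs.
print(small, large, large[0] / small[0])
```

Under these assumed sizes, the different-speaker pairs dominate the total, which is consistent with the abstract's point that most training pairs can be discarded; which pairs to keep is the subject of the paper and is not attempted here.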
