Generative pairwise models for speaker recognition
This paper proposes a simple model for speaker recognition based on i-vector pairs, and analyzes its similarities and differences with respect to the state-of-the-art Probabilistic Linear Discriminant Analysis (PLDA) and Pairwise Support Vector Machine (PSVM) models. Similar to the discriminative PSVM approach, we propose a generative model of i-vector pairs, rather than a usual i-vector based model. The model is based on two Gaussian distributions, one for the “same speakers” and the other for the “different speakers” i-vector pairs, and on the assumption that the i-vector pairs are independent. This independence assumption allows the distributions of the two classes to be independently estimated. The “Two-Gaussian” approach can be extended to Heavy-Tailed distributions, still allowing a fast closed-form solution to be obtained for testing i-vector pairs. We show that this model is closely related to the PLDA and PSVM models, and that, tested on the female part of the tel-tel NIST SRE 2010 extended evaluation set, it achieves accuracy comparable to that of the other models, which are trained with different objective functions and training procedures.
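The two-Gaussian idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes full-covariance maximum-likelihood Gaussians over stacked pair vectors, uses synthetic 3-dimensional vectors in place of real i-vectors, and all function names (`fit_gaussian`, `log_gaussian`, `llr_score`) are hypothetical. A pair is scored by the log-likelihood ratio between the "same speakers" and "different speakers" models.

```python
import numpy as np

def fit_gaussian(pairs):
    """Maximum-likelihood mean and covariance of stacked pair vectors (one per row)."""
    mean = pairs.mean(axis=0)
    centered = pairs - mean
    cov = centered.T @ centered / len(pairs)
    return mean, cov

def log_gaussian(x, mean, cov):
    """Log-density of a multivariate Gaussian evaluated at a single vector x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)

def llr_score(x1, x2, same_model, diff_model):
    """Log-likelihood ratio: same-speaker Gaussian versus different-speaker Gaussian."""
    pair = np.concatenate([x1, x2])
    return log_gaussian(pair, *same_model) - log_gaussian(pair, *diff_model)

# Synthetic stand-ins for i-vectors (an assumption, not data from the paper):
# each vector is a speaker factor plus session noise.
rng = np.random.default_rng(0)
dim, n = 3, 500
speakers = rng.normal(size=(n, dim))
others = rng.normal(size=(n, dim))

def noise():
    return 0.3 * rng.normal(size=(n, dim))

same_pairs = np.hstack([speakers + noise(), speakers + noise()])
diff_pairs = np.hstack([speakers + noise(), others + noise()])

# Each class is estimated independently from its own pairs.
same_model = fit_gaussian(same_pairs)
diff_model = fit_gaussian(diff_pairs)

same_scores = np.array([llr_score(p[:dim], p[dim:], same_model, diff_model)
                        for p in same_pairs])
diff_scores = np.array([llr_score(p[:dim], p[dim:], same_model, diff_model)
                        for p in diff_pairs])
```

On this toy data, same-speaker pairs receive positive scores on average and different-speaker pairs negative ones, because the same-speaker covariance captures the correlation between the two halves of the stacked vector.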
Fast and Memory Effective I-Vector Extraction Using a Factorized Sub-Space
Most state-of-the-art speaker recognition systems use a compact representation of spoken utterances referred to as i-vectors. Since the "standard" i-vector extraction procedure requires large memory structures and is relatively slow, new approaches have recently been proposed that are able to obtain either accurate solutions at the expense of an increase in the computational load, or fast approximate solutions, which are traded for lower memory costs. We propose a new approach particularly useful for applications that need to minimize their memory requirements. Our solution not only dramatically reduces the storage needs for i-vector extraction, but is also fast. Tested on the female part of the tel-tel extended NIST 2010 evaluation trials, our approach substantially improves the performance with respect to the fastest but inaccurate eigen-decomposition approach, using much less memory than any other known method.
Method and apparatus for efficient i-vector extraction
Most speaker recognition systems use i-vectors, which are compact representations of speaker voice characteristics. Typical i-vector extraction procedures are complex in terms of computations and memory usage. According to an embodiment, a method and corresponding apparatus for speaker identification comprise determining a representation for each component of a variability operator, representing statistical inter- and intra-speaker variability of voice features with respect to a background statistical model, in terms of a linear operator common to all components of the variability operator and having a first dimension larger than a second dimension of the components of the variability operator; computing statistical voice characteristics of a particular speaker using the determined representations; and employing the statistical voice characteristics of the particular speaker in performing speaker recognition. Computing the voice characteristics, by using the determined representations, results in significant reduction in memory usage and possible increase in execution speed.
Large scale training of Pairwise Support Vector Machines for speaker recognition
State-of-the-art systems for text-independent speaker recognition use as their features a compact representation of a speaker utterance, known as "i-vector". We recently presented an efficient approach for training a Pairwise Support Vector Machine (PSVM) with a suitable kernel for i-vector pairs for a quite large speaker recognition task. Rather than estimating an SVM model per speaker, according to the "one versus all" discriminative paradigm, the PSVM approach classifies a trial, consisting of a pair of i-vectors, as belonging or not to the same speaker class. Training a PSVM with a large amount of data, however, is a memory- and computation-intensive task, because the number of training pairs grows quadratically with the number of training i-vectors. This paper demonstrates that only a very small subset of the training pairs is necessary to train the original PSVM model, and proposes two approaches that allow discarding most of the training pairs that are not essential, without harming the accuracy of the model. This dramatically reduces the memory and computational resources needed for training, which becomes feasible with large datasets including many speakers. We have assessed these approaches on the extended core conditions of the NIST 2012 Speaker Recognition Evaluation. Our results show that the accuracy of the PSVM trained with a sufficient number of speakers is 10-30% better than that obtained by a PLDA model, depending on the testing conditions. Since the PSVM accuracy increases with the training set size, but PSVM training does not scale well for large numbers of speakers, our selection techniques become relevant for training accurate discriminative classifiers.
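The quadratic growth mentioned in this abstract follows from simple counting: n training i-vectors yield n(n-1)/2 unordered pairs, and only the pairs drawn from within a single speaker are "same speaker" pairs. The sketch below is illustrative only; the function name `pair_counts` and the toy speaker labels are assumptions, and it does not implement the pair-selection approaches the paper proposes.

```python
from collections import Counter

def pair_counts(speaker_labels):
    """Return (total, same_speaker, different_speaker) counts of unordered pairs."""
    n = len(speaker_labels)
    total = n * (n - 1) // 2
    # Within each speaker, m utterances give m * (m - 1) / 2 same-speaker pairs.
    same = sum(m * (m - 1) // 2 for m in Counter(speaker_labels).values())
    return total, same, total - same

# Toy example: speaker "a" has 3 utterances, speaker "b" has 2.
small = pair_counts(["a"] * 3 + ["b"] * 2)   # (10, 4, 6)

# Doubling the number of utterances per speaker roughly quadruples the pair count.
large = pair_counts(["a"] * 6 + ["b"] * 4)   # (45, 21, 24)
```

With many speakers and few utterances each, the different-speaker pairs dominate the total, which is why discarding non-essential pairs matters for large training sets.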
From adaptive score normalization to adaptive data normalization for speaker verification systems
Domain and trial-dependent mismatch between training and evaluation data can severely affect the performance of speaker verification systems, and is usually addressed either at embedding level, with methods that try to match the distributions of in-domain and out-of-domain data, or at score level by means of calibration and score normalization approaches. In this work we propose an alternative to score normalization that leverages the adaptive cohort selection of Adaptive S-norm (AS-norm), but performs normalization at embedding rather than at score level. Experimental results on SRE 2016 and SRE 2019 show that the proposed method is able to outperform other approaches in the presence of severe mismatch, and achieves similar performance in scenarios where score normalization is less important. Furthermore, in contrast with AS-norm, our approach allows independently normalizing the enrollment and test segments, and has negligible computational cost at scoring time.
Index Terms: speaker recognition, score normalization, adaptive score normalization, speaker embedding
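For orientation, the score-level baseline named in this abstract, AS-norm, can be sketched as follows. This shows only the commonly used form of adaptive S-norm, not the embedding-level method the paper proposes. Cosine scoring, the cohort size, the default `k`, and all function names are assumptions made for illustration.

```python
import numpy as np

def cosine_scores(x, cohort):
    """Cosine similarity between one embedding and each row of a cohort matrix."""
    x = x / np.linalg.norm(x)
    cohort = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    return cohort @ x

def top_k_stats(scores, k):
    """Mean and standard deviation of the k highest cohort scores."""
    top = np.sort(scores)[-k:]
    return top.mean(), top.std()

def as_norm(enroll, test, cohort, k=10):
    """Adaptive S-norm: average of two z-normalized versions of the raw score,
    each using the k cohort embeddings closest to one side of the trial."""
    raw = float(cosine_scores(test, enroll[None, :])[0])
    mu_e, sd_e = top_k_stats(cosine_scores(enroll, cohort), k)
    mu_t, sd_t = top_k_stats(cosine_scores(test, cohort), k)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)

# Random vectors stand in for speaker embeddings (illustrative only).
rng = np.random.default_rng(1)
cohort = rng.normal(size=(50, 8))
enroll = rng.normal(size=8)
test = rng.normal(size=8)
score = as_norm(enroll, test, cohort)
```

In this sketch the cohort statistics depend on both sides of the trial and are computed when the trial is scored, which is the per-trial cost the abstract contrasts with its embedding-level alternative.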
Impostor Score Statistics as Quality Measures for the Calibration of Speaker Verification Systems
Analysis of Large-Scale SVM Training Algorithms for Language and Speaker Recognition
This paper compares a set of large-scale support vector machine (SVM) training algorithms for language and speaker recognition tasks. We analyze five approaches for training phonetic and acoustic SVM models for language recognition. We compare the performance of these approaches as a function of the training time required by each of them to reach convergence, and we discuss their scalability towards large corpora. Two of these algorithms can be used in speaker recognition to train an SVM that classifies pairs of utterances as either belonging to the same speaker or to two different speakers. Our results show that the accuracy of these algorithms is asymptotically equivalent, but they behave differently with respect to the time required to converge. Some of these algorithms not only scale linearly with the training set size, but are also able to give their best results after just a few iterations. State-of-the-art performance has been obtained on the female subset of the NIST 2010 Speaker Recognition Evaluation extended core test using a single SVM system.
