
    Activation gap generators in neural networks

    No framework exists that can explain and predict the generalisation ability of DNNs in general circumstances. In fact, this question has not been addressed for some of the least complicated of neural network architectures: fully-connected feedforward networks with ReLU activations and a limited number of hidden layers. Building on recent work [2] that demonstrates the ability of individual nodes in a hidden layer to draw class-specific activation distributions apart, we show how a simplified network architecture can be analysed in terms of these activation distributions, and more specifically, the sample distances or activation gaps each node produces. We provide a theoretical perspective on the utility of viewing nodes as activation gap generators, and define the gap conditions that are guaranteed to result in perfect classification of a set of samples. We support these conclusions with empirical results.

    Language identification of individual words with joint sequence models

    Within a multilingual automatic speech recognition (ASR) system, knowledge of the language of origin of unknown words can improve pronunciation modelling accuracy. This is of particular importance for ASR systems required to deal with code-switched speech or proper names of foreign origin. For words that occur in the language model, but do not occur in the pronunciation lexicon, text-based language identification (T-LID) of a single word in isolation may be required. This is a challenging task, especially for short words. We motivate the importance of accurate T-LID in speech processing systems and introduce a novel way of applying Joint Sequence Models (JSMs) to the T-LID task. We obtain competitive results on a real-world 4-language task: for our best JSM system, an F-measure of 97.2% is obtained, compared to an F-measure of 95.2% obtained with a state-of-the-art Support Vector Machine (SVM). This work was supported by the South African Department of Arts and Culture (DAC) and the National Research Foundation (NRF). Any opinion, findings and conclusions or recommendations expressed in this material are those of the author(s) and therefore neither DAC nor the NRF accepts any liability in regard thereto.

    Predicting vowel substitution in code-switched speech

    The accuracy of automatic speech recognition (ASR) systems typically degrades when encountering code-switched speech. Some of this degradation is due to the unexpected pronunciation effects introduced when languages are mixed. Embedded (foreign) phonemes typically show more variation than phonemes from the matrix language: either approximating the embedded language pronunciation fairly closely, or realised as any of a set of phonemic counterparts from the matrix language. In this paper we describe a technique for predicting the phoneme substitutions that are expected to occur during code-switching, using non-acoustic features only. As a case study we consider Sepedi/English code-switching and analyse the different realisations of the English schwa. A code-switched speech corpus is used as input and vowel substitutions are identified by auto-tagging this corpus based on acoustic characteristics. We first evaluate the accuracy of our auto-tagging process, before determining the predictability of our auto-tagged corpus, using non-acoustic features. This work was partially supported by the National Research Foundation. Any opinion, findings and conclusions or recommendations expressed in this material are those of the author(s) and therefore the NRF does not accept any liability in regard thereto.

    N-gram based language identification of individual words

    Various factors influence the accuracy with which the language of individual words can be classified using n-grams. We consider a South African text-based language identification (LID) task and experiment with two different types of n-gram classifiers: a Naïve Bayes classifier and a Support Vector Machine. Specifically, we investigate various factors that influence LID accuracy when identifying generic words (as opposed to running text) in four languages. These include: the importance of n-gram smoothing (Katz smoothing, absolute discounting and Witten-Bell smoothing) when training Naïve Bayes classifiers; the effect of training corpus size on classification accuracy; and the relationship between word length, n-gram length and classification accuracy. For the best variant of each of the two sets of algorithms, we achieve comparable classification accuracies. The accuracy of the Support Vector Machine (88.16%, obtained with a Radial Basis Function kernel) is higher than that of the Naïve Bayes classifier (87.62%, obtained using Witten-Bell smoothing), but the latter result is associated with a significantly lower computational cost. Index Terms: text-based language identification, smoothing, character n-grams, Naïve Bayes classifier, support vector machine. http://www.prasa.org/index.php/2012-03-07-10-55-1
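
    As a rough illustration of the character n-gram Naïve Bayes idea described in this abstract, the following minimal sketch classifies an isolated word by summing smoothed log-probabilities of its character trigrams per language. It is not the authors' implementation: the function names, the `#` padding symbol, the uniform language prior and the toy training words are all assumptions, and simple add-one smoothing is swapped in for the Katz, absolute discounting and Witten-Bell schemes the abstract compares.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with '#' (assumed boundary symbol)."""
    padded = "#" * (n - 1) + word.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Count n-grams per language from (word, language) pairs.

    Returns the per-language counts and the size of the shared n-gram vocabulary.
    """
    counts = {}
    for word, lang in samples:
        counts.setdefault(lang, Counter()).update(char_ngrams(word, n))
    vocab = set()
    for lang_counts in counts.values():
        vocab.update(lang_counts)
    return counts, len(vocab)


def classify(word, counts, vocab_size, n=3):
    """Pick the language with the highest add-one-smoothed log-likelihood.

    A uniform prior over languages is assumed, so the prior term is omitted.
    """
    best_lang, best_score = None, -math.inf
    for lang, lang_counts in counts.items():
        total = sum(lang_counts.values())
        # The extra +1 in the denominator reserves mass for unseen n-grams.
        score = sum(
            math.log((lang_counts[g] + 1) / (total + vocab_size + 1))
            for g in char_ngrams(word, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy data, purely illustrative (not taken from the paper's corpus).
toy_samples = [
    ("water", "en"), ("weather", "en"), ("mother", "en"),
    ("vrou", "af"), ("vriend", "af"), ("vroeg", "af"),
]
toy_counts, toy_vocab_size = train(toy_samples)
```

    With such a model, short words contribute only a handful of n-grams to the score, which is consistent with the abstract's interest in the relationship between word length, n-gram length and accuracy.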

    Synthetic triphones from trajectory-based feature distributions

    We experiment with a new method to create synthetic models of rare and unseen triphones in order to supplement limited automatic speech recognition (ASR) training data. A trajectory model is used to characterise seen transitions at the spectral level, and these models are then used to create features for unseen or rare triphones. We find that a fairly restricted model (piece-wise linear with three line segments per channel of a diphone transition) is able to represent training data quite accurately. We report on initial results when creating additional triphones for a single-speaker data set, finding small but significant gains, especially when adding additional samples of rare (rather than unseen) triphones. Human Language Technology Research Group, CSIR Meraka, South Africa. Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa. CAIR, CSIR Meraka, South Africa.

    Text-based Language Identification of Multilingual Names

    Text-based language identification (T-LID) of isolated words has been shown to be useful for various speech processing tasks, including pronunciation modelling and data categorisation. When the words to be categorised are proper names, the task becomes more difficult: not only do proper names often have idiosyncratic spellings, they are also often considered to be multilingual. We therefore investigate how an existing T-LID technique can be adapted to perform multilingual word classification. That is, given a proper name, which may be either mono- or multilingual, we aim to determine how accurately we can predict how many possible source languages the word has, and what they are. Using a Joint Sequence Model-based approach to T-LID and the SADE corpus – a newly developed proper names corpus of South African names – we experiment with different approaches to multilingual T-LID. We compare posterior-based and likelihood-based methods and obtain promising results on a challenging task.

    Bilateral G2P accuracy: measuring the effect of variants

    We would like to acknowledge Charl van Heerden for his assistance with the ASR experiments, as well as Ulrike Janke for her editing assistance. Incorporating pronunciation variants in a dictionary is controversial, as this can be either advantageous or detrimental for a speech recognition system. Grapheme-to-phoneme (G2P) accuracy can help guide this decision, but calculating the G2P accuracy of variant-based dictionaries is not straightforward. We propose a variant matching technique to measure G2P accuracy in a principled way when both the reference and hypothesised dictionaries may include variants. We use the new measure to evaluate G2P accuracy and speech recognition performance of systems developed with an existing set of dictionaries, and observe a better correlation between G2P accuracy and speech recognition performance than when utilising alternative metrics. National Research Foundation (NRF).

    Kullback-Leibler divergence-based ASR training data selection

    Data preparation and selection affect systems in a wide range of complexities. A system built for a resource-rich language may be so large as to include borrowed languages. A system built for a resource-scarce language may be affected by how carefully the training data is selected and produced. Accuracy is affected by the presence of enough samples of qualitatively relevant information. We propose a method using the Kullback-Leibler divergence to solve two problems related to data preparation: the ordering of alternate pronunciations in a lexicon, and the selection of transcription data. In both cases, we want to guarantee that a particular distribution of n-grams is achieved. In the case of lexicon design, we want to ascertain that phones will be present often enough. In the case of training data selection for scarcely resourced languages, we want to make sure that some n-grams are better represented than others. Our proposed technique yields encouraging results. European Media Laboratory GmbH, Heidelberg, Germany. Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa.
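
    One way to picture the goal of steering selected data towards a particular n-gram distribution is a greedy loop that repeatedly adds the utterance which most reduces the Kullback-Leibler divergence between a target distribution and the pooled n-gram counts. The sketch below is only an illustration under stated assumptions: the abstract does not give the authors' algorithm, and the greedy strategy, the function names, the `eps` floor for unseen n-grams and the toy phone sequences are all hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n=1):
    """Count the n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def kl_divergence(target, counts, eps=1e-9):
    """D_KL(target || q), where q is the relative frequency of each n-gram in counts.

    Unseen n-grams are floored at eps (an assumed choice) so the result stays finite.
    """
    total = sum(counts.values())
    kl = 0.0
    for gram, p in target.items():
        if p <= 0.0:
            continue
        q = counts.get(gram, 0) / total if total else 0.0
        kl += p * math.log(p / max(q, eps))
    return kl


def greedy_select(utterances, target, budget, n=1):
    """Greedily pick up to `budget` utterance indices that minimise the divergence."""
    selected, pooled = [], Counter()
    remaining = list(range(len(utterances)))
    for _ in range(min(budget, len(utterances))):
        # Evaluate the divergence that would result from adding each candidate.
        best = min(
            remaining,
            key=lambda i: kl_divergence(target, pooled + ngram_counts(utterances[i], n)),
        )
        selected.append(best)
        pooled += ngram_counts(utterances[best], n)
        remaining.remove(best)
    return selected


# Toy example: a target that wants phones "a" and "b" equally represented.
toy_target = {("a",): 0.5, ("b",): 0.5}
toy_utterances = [["a", "a"], ["a", "a"], ["b", "b"]]
toy_selection = greedy_select(toy_utterances, toy_target, budget=2)
```

    In the toy example the second pick is the utterance containing "b", since adding another all-"a" utterance would leave the pooled distribution far from the target.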

    Comparing Transformer-based and gradient boosted decision tree (GBDT) Models on Tabular Data: A Rossmann Case Study

    Heterogeneous tabular data is a common and important data format. This empirical study investigates how the performance of deep transformer models compares against benchmark gradient boosting decision tree (GBDT) methods, the more typical modelling approach. All models are optimised using a Bayesian hyperparameter optimisation protocol, which provides a stronger comparison than the random grid search hyperparameter optimisation used in earlier work. Since feature skewness is typically handled differently for GBDT and transformer-based models, we investigate the effect of a pre-processing step that normalises feature distribution on the model comparison process. Our analysis is based on the Rossmann Store Sales dataset, a widely recognised benchmark for regression tasks.

    The predictability of name pronunciation errors in four South African languages

    Personal names are often pronounced in very different ways depending on the language background of the speaker. We seek to determine whether some of these pronunciation 'errors' are systematic and if so, in which ways. Specifically, we analyze some of the typical errors made by speakers from four South African languages (Setswana, English, isiZulu) when producing names from the same four languages. We compare these results with the pronunciations generated by four language-specific grapheme-to-phoneme (G2P) predictors trained on generic words from the four languages. We find that the G2P predictors are able to predict at least some of the typical errors humans make and, in fact, that these errors are slightly more predictable than the correct pronunciations themselves. Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa. Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa.