Activation gap generators in neural networks
No framework exists that can explain and predict the generalisation ability of DNNs in general circumstances. In fact, this question has not been addressed for some of the least complicated of neural network architectures: fully-connected feedforward networks with ReLU activations and a limited number of hidden layers. Building on recent work [2] that demonstrates the ability of individual nodes in a hidden layer to draw class-specific activation distributions apart, we show how a simplified network architecture can be analysed in terms of these activation distributions and, more specifically, the sample distances or activation gaps each node produces. We provide a theoretical perspective on the utility of viewing nodes as activation gap generators, and define the gap conditions that are guaranteed to result in perfect classification of a set of samples. We support these conclusions with empirical results.
Language identification of individual words with joint sequence models
Within a multilingual automatic speech recognition (ASR) system,
knowledge of the language of origin of unknown words
can improve pronunciation modelling accuracy. This is of particular
importance for ASR systems required to deal with code-switched
speech or proper names of foreign origin. For words
that occur in the language model, but do not occur in the pronunciation
lexicon, text-based language identification (T-LID)
of a single word in isolation may be required. This is a challenging
task, especially for short words. We motivate the
importance of accurate T-LID in speech processing systems and
introduce a novel way of applying Joint Sequence Models to the
T-LID task. We obtain competitive results on a real-world
4-language task: for our best JSM system, an F-measure of 97.2%
is obtained, compared to an F-measure of 95.2% obtained with a
state-of-the-art Support Vector Machine (SVM).
This work was supported by the South African Department of
Arts and Culture (DAC) and the National Research Foundation
(NRF). Any opinion, findings and conclusions or recommendations
expressed in this material are those of the author(s) and
therefore neither DAC nor the NRF accepts any liability in regard
thereto.
Predicting vowel substitution in code-switched speech
The accuracy of automatic speech recognition
(ASR) systems typically degrades when encountering code-switched
speech. Some of this degradation is due to the
unexpected pronunciation effects introduced when languages
are mixed. Embedded (foreign) phonemes typically show more
variation than phonemes from the matrix language: either
approximating the embedded language pronunciation fairly
closely, or realised as any of a set of phonemic counterparts
from the matrix language. In this paper we describe a technique
for predicting the phoneme substitutions that are expected
to occur during code-switching, using non-acoustic features
only. As a case study we consider Sepedi/English code-switching
and analyse the different realisations of the English schwa.
A code-switched speech corpus is used as input and vowel
substitutions are identified by auto-tagging this corpus based on
acoustic characteristics. We first evaluate the accuracy of our
auto-tagging process, before determining the predictability of
our auto-tagged corpus, using non-acoustic features.
This work was partially supported by the National Research
Foundation. Any opinion, findings and conclusions or
recommendations expressed in this material are those of the
author(s) and therefore the NRF does not accept any liability
in regard thereto.
N-gram based language identification of individual words
Various factors influence the accuracy with which the language of individual words can be classified using n-grams. We consider a South African text-based language identification (LID) task and experiment with two different types of n-gram classifiers: a Naïve Bayes classifier and a Support Vector Machine. Specifically, we investigate various factors that influence LID accuracy when identifying generic words (as opposed to running text) in four languages. These include: the importance of n-gram smoothing (Katz smoothing, absolute discounting and Witten-Bell smoothing) when training Naïve Bayes classifiers; the effect of training corpus size on classification accuracy; and the relationship between word length, n-gram length and classification accuracy. For the best variant of each of the two sets of algorithms, we achieve comparable classification accuracies. The accuracy of the Support Vector Machine (88.16%, obtained with a Radial Basis Function) is higher than that of the Naïve Bayes classifier (87.62%, obtained using Witten-Bell smoothing), but the latter result is associated with a significantly lower computational cost. Index Terms: text-based language identification, smoothing, character n-grams, Naïve Bayes classifier, support vector machine.
http://www.prasa.org/index.php/2012-03-07-10-55-1
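The character n-gram Naïve Bayes approach described in this abstract can be illustrated with a minimal sketch. Note the assumptions: the abstract reports Witten-Bell smoothing as the best variant, whereas this sketch substitutes plain add-one (Laplace) smoothing for brevity; the class name `NgramNaiveBayes`, the helper `char_ngrams`, the boundary marker `#`, and the toy word lists are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n):
    """Return the character n-grams of a word, padded with '#' boundary markers."""
    padded = "#" * (n - 1) + word.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Toy word-level language identifier (add-one smoothing, NOT Witten-Bell)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.word_totals = Counter()        # language -> number of training words
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, labelled_words):
        for word, lang in labelled_words:
            grams = char_ngrams(word, self.n)
            self.counts[lang].update(grams)
            self.vocab.update(grams)
            self.word_totals[lang] += 1
        return self

    def log_score(self, word, lang):
        # Log prior from the share of training words in this language.
        score = math.log(self.word_totals[lang] / sum(self.word_totals.values()))
        gram_total = sum(self.counts[lang].values())
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen n-grams
        for gram in char_ngrams(word, self.n):
            score += math.log((self.counts[lang][gram] + 1) / (gram_total + v))
        return score

    def predict(self, word):
        return max(self.counts, key=lambda lang: self.log_score(word, lang))


# Hypothetical toy data for illustration only.
TOY_DATA = [
    ("thing", "en"), ("think", "en"), ("those", "en"), ("there", "en"),
    ("ngiyabonga", "zu"), ("umuntu", "zu"), ("abantu", "zu"), ("ukuhamba", "zu"),
]

if __name__ == "__main__":
    model = NgramNaiveBayes(n=3).fit(TOY_DATA)
    for w in ("thick", "abantwana"):
        print(w, model.predict(w))
```

Replacing the add-one estimate in `log_score` with a back-off scheme such as Witten-Bell would be the natural next step toward the configuration the abstract reports; the rest of the pipeline would stay the same.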
Synthetic triphones from trajectory-based feature distributions
We experiment with a new method to create
synthetic models of rare and unseen triphones in order to supplement
limited automatic speech recognition (ASR) training
data. A trajectory model is used to characterise seen transitions
at the spectral level, and these models are then used to create
features for unseen or rare triphones. We find that a fairly
restricted model (piece-wise linear with three line segments per
channel of a diphone transition) is able to represent training
data quite accurately. We report on initial results when creating
additional triphones for a single-speaker data set, finding small
but significant gains, especially when adding additional samples
of rare (rather than unseen) triphones.
Human Language Technology Research Group, CSIR Meraka, South Africa.
Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa.
CAIR, CSIR Meraka, South Africa
Text-based Language Identification of Multilingual Names
Text-based language identification (T-LID) of isolated
words has been shown to be useful for various speech
processing tasks, including pronunciation modelling and data
categorisation. When the words to be categorised are proper
names, the task becomes more difficult: not only do proper
names often have idiosyncratic spellings, they are also often
considered to be multilingual. We, therefore, investigate how
an existing T-LID technique can be adapted to perform multilingual
word classification. That is, given a proper name, which
may be either mono- or multilingual, we aim to determine how
accurately we can predict how many possible source languages
the word has, and what they are. Using a Joint Sequence Model-based
approach to T-LID and the SADE corpus – a newly
developed proper names corpus of South African names – we
experiment with different approaches to multilingual T-LID.
We compare posterior-based and likelihood-based methods and
obtain promising results on a challenging task.
Bilateral G2P accuracy: measuring the effect of variants
Incorporating pronunciation variants in a dictionary
is controversial, as this can be either advantageous or
detrimental for a speech recognition system. Grapheme-to-phoneme
(G2P) accuracy can help guide this decision, but
calculating the G2P accuracy of variant-based dictionaries
is not fully straightforward. We propose a variant matching
technique to measure G2P accuracy in a principled way, when
both the reference and hypothesised dictionaries may include
variants. We use the new measure to evaluate G2P accuracy
and speech recognition performance of systems developed with
an existing set of dictionaries, and observe a better correlation
between G2P accuracy and speech recognition performance
than when utilising alternative metrics.
We would like to acknowledge Charl van Heerden for his
assistance with the ASR experiments, as well as Ulrike Janke
for her editing assistance.
National Research Foundation (NRF)
Kullback-Leibler divergence-based ASR training data selection
Data preparation and selection affect systems across a wide range
of complexities. A system built for a resource-rich language
may be so large as to include borrowed languages. A system
built for a resource-scarce language may be affected by how
carefully the training data is selected and produced.
Accuracy is affected by the presence of enough samples of
qualitatively relevant information. We propose a method using
the Kullback-Leibler divergence to solve two problems related
to data preparation: the ordering of alternate pronunciations in
a lexicon, and the selection of transcription data. In both cases,
we want to guarantee that a particular distribution of n-grams
is achieved. In the case of lexicon design, we want to ascertain
that phones will be present often enough. In the case of training
data selection for scarcely resourced languages, we want to
make sure that some n-grams are better represented than others.
Our proposed technique yields encouraging results.
European Media Laboratory GmbH, Heidelberg, Germany
Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa
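The idea of selecting training data so that a particular n-gram distribution is achieved can be sketched as a greedy search that repeatedly adds the utterance whose inclusion brings the pooled n-gram distribution closest, in Kullback-Leibler divergence, to a target distribution. This is only one plausible reading: the abstract does not specify the search procedure, so the greedy strategy, the epsilon smoothing, and the names `ngram_counts`, `kl_divergence` and `greedy_select` are assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count the n-grams (as tuples) in a token sequence, e.g. a phone string."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def kl_divergence(target, counts, vocab, eps=1e-6):
    """D_KL(target || q), where q is the eps-smoothed distribution of counts over vocab."""
    total = sum(counts[g] for g in vocab) + eps * len(vocab)
    divergence = 0.0
    for gram in vocab:
        p = target.get(gram, 0.0)
        if p > 0.0:
            q = (counts[gram] + eps) / total
            divergence += p * math.log(p / q)
    return divergence


def greedy_select(utterances, target, budget, n=2):
    """Greedily pick up to `budget` utterance indices minimising KL to `target`.

    Assumed procedure for illustration; not the selection algorithm of the paper.
    """
    per_utt = [ngram_counts(u, n) for u in utterances]
    vocab = set(target).union(*per_utt)
    pooled = Counter()
    chosen = []
    remaining = set(range(len(utterances)))
    while remaining and len(chosen) < budget:
        # Evaluate each candidate by the divergence of the pool after adding it.
        best = min(
            sorted(remaining),
            key=lambda i: kl_divergence(target, pooled + per_utt[i], vocab),
        )
        chosen.append(best)
        pooled += per_utt[best]
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical toy example: aim for a uniform unigram distribution over two phones.
    target = {("a",): 0.5, ("b",): 0.5}
    utterances = [list("aaaa"), list("aabb"), list("aaab")]
    print(greedy_select(utterances, target, budget=2, n=1))
```

The same scoring function could in principle be reused for the lexicon case mentioned in the abstract, by treating each alternate pronunciation as a candidate and ranking candidates by the divergence they produce.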
Comparing Transformer-based and gradient boosted decision tree (GBDT) Models on Tabular Data: A Rossmann Case Study
Heterogeneous tabular data is a common and important data format. This empirical study investigates how the performance of deep transformer models compares against benchmark gradient boosting decision tree (GBDT) methods, the more typical modelling approach. All models are optimised using a Bayesian hyperparameter optimisation protocol, which provides a stronger comparison than the random grid search hyperparameter optimisation utilized in earlier work. Since feature skewness is typically handled differently for GBDT and transformer-based
models, we investigate the effect of a pre-processing step that normalises feature distribution on the model comparison process. Our analysis is
based on the Rossmann Store Sales dataset, a widely recognized benchmark for regression tasks.
The predictability of name pronunciation errors in four South African languages
Personal names are often pronounced in very different ways depending on the language background of the speaker. We seek to determine whether some of these pronunciation 'errors' are systematic and if so, in which ways. Specifically, we analyze some of the typical errors made by speakers from four South African languages (Setswana, English, isiZulu) when producing names from the same four languages. We compare these results with the pronunciations generated by four language-specific grapheme-to-phoneme (G2P) predictors trained on generic words from the four languages. We find that the G2P predictors are able to predict at least some of the typical errors humans make and, in fact, that these errors are slightly more predictable than the correct pronunciations themselves.
Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa
Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa
