Implications of Sepedi/English code switching for ASR systems
Code switching (the process of switching from one language to another during a conversation) is a common phenomenon in multilingual environments. Where a minority and dominant language coincide, code switching from the minority language to the dominant language can become particularly frequent. We analyse one such scenario, Sepedi spoken in South Africa, where English is the dominant language, and determine the frequency and mechanisms of code switching through the analysis of radio broadcasts. We also perform an initial acoustic analysis to determine the impact of such code switching on speech recognition performance. We find that the frequency of code switching is unexpectedly high, and that the continuum of code switching (from unmodified embedded words to loan words absorbed in the matrix language) makes this a particularly challenging task for speech recognition systems.
http://www.prasa.org/index.php/2012-03-07-10-55-1
Context-dependent modelling of English vowels in Sepedi code-switched speech
When modelling code-switched speech (utterances that contain a mixture of languages), the embedded language often contains phones not found in the matrix language. These are typically dealt with by either extending the phone set or mapping each phone to a matrix language counterpart. We use acoustic log likelihoods to assist us in identifying the optimal mapping strategy at a context-dependent level (that is, at triphone, rather than monophone level) and obtain new insights into the way English/Sepedi code-switched vowels are produced.
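As an illustration only, likelihood-based selection of a mapping per triphone context could look roughly like the sketch below. The helper `best_mapping`, the phone labels, the contexts and all scores are hypothetical and are not taken from the paper. The sketch assumes that summing log likelihoods per candidate is an acceptable selection criterion.

```python
from collections import defaultdict

def best_mapping(scores):
    """Pick, for each triphone context, the candidate phone with the
    highest summed acoustic log likelihood.

    scores: iterable of (context, candidate, log_likelihood) tuples,
    where context is a (left, phone, right) triple.
    """
    totals = defaultdict(float)
    for context, candidate, loglik in scores:
        totals[(context, candidate)] += loglik

    best = {}
    for (context, candidate), total in totals.items():
        # Keep the candidate whose summed log likelihood is largest.
        if context not in best or total > best[context][1]:
            best[context] = (candidate, total)
    return {context: cand for context, (cand, _) in best.items()}

# Hypothetical scores: an embedded schwa ("@") in two triphone contexts,
# scored against two made-up matrix-language candidates "a" and "e".
scores = [
    (("t", "@", "n"), "a", -41.0),
    (("t", "@", "n"), "e", -38.5),
    (("t", "@", "n"), "e", -40.0),
    (("t", "@", "n"), "a", -39.0),
    (("b", "@", "l"), "a", -35.0),
    (("b", "@", "l"), "e", -37.2),
]
mapping = best_mapping(scores)
```

Because the choice is made per context, the same embedded phone can map to different counterparts in different triphones, which is the point of working at triphone rather than monophone level.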
Predicting vowel substitution in code-switched speech
The accuracy of automatic speech recognition (ASR) systems typically degrades when encountering code-switched speech. Some of this degradation is due to the unexpected pronunciation effects introduced when languages are mixed. Embedded (foreign) phonemes typically show more variation than phonemes from the matrix language: either approximating the embedded language pronunciation fairly closely, or realised as any of a set of phonemic counterparts from the matrix language. In this paper we describe a technique for predicting the phoneme substitutions that are expected to occur during code-switching, using non-acoustic features only. As a case study we consider Sepedi/English code switching and analyse the different realisations of the English schwa. A code-switched speech corpus is used as input and vowel substitutions are identified by auto-tagging this corpus based on acoustic characteristics. We first evaluate the accuracy of our auto-tagging process, before determining the predictability of our auto-tagged corpus, using non-acoustic features.
This work was partially supported by the National Research Foundation. Any opinion, findings and conclusions or recommendations expressed in this material are those of the author(s) and therefore the NRF does not accept any liability in regard thereto.
The Analysis of the Sepedi-English Code-switched Radio News Corpus
Code-switching is a phenomenon that occurs mostly in multilingual countries where multilingual speakers often switch between languages in their conversations. The unavailability of large-scale code-switched corpora hampers the development and training of language models for the generation of code-switched text. In this study, we explore the initial phase of collecting and creating a Sepedi-English code-switched corpus for generating synthetic news. Radio news and the frequency of code-switching on read news were considered and analysed. We developed and trained a Transformer-based language model using the collected code-switched dataset. We observed that the frequency of code-switched data in the dataset was very low at 1.1%. We complemented our dataset with the news headlines dataset to create a new dataset. Although the frequency was still low, the model obtained the optimal loss rate of 2,361 with an accuracy of 66%.
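A code-switching frequency such as the one reported above could be estimated along the following lines. This is a rough sketch under stated assumptions: the abstract does not say whether the rate is measured per token or per sentence, so both are computed. The function `code_switch_stats`, the tokenizer pattern, the embedded-language word list and the placeholder sentences are all hypothetical.

```python
import re

# Simple letters-only tokenizer (an assumption; the study's tokenizer is not stated).
TOKEN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")

def code_switch_stats(sentences, embedded_vocab):
    """Return token-level and sentence-level code-switching rates.

    A token counts as code-switched if it appears in embedded_vocab,
    a (hypothetical) list of embedded-language words.
    """
    total_tokens = switched_tokens = switched_sentences = 0
    for sentence in sentences:
        tokens = [t.lower() for t in TOKEN.findall(sentence)]
        hits = sum(t in embedded_vocab for t in tokens)
        total_tokens += len(tokens)
        switched_tokens += hits
        switched_sentences += hits > 0
    return {
        "token_rate": switched_tokens / total_tokens if total_tokens else 0.0,
        "sentence_rate": switched_sentences / len(sentences) if sentences else 0.0,
    }

# Placeholder data: "xa", "xb", ... stand in for matrix-language words.
sentences = ["xa xb budget xc", "xa xd xe", "xf minister xb xg"]
stats = code_switch_stats(sentences, embedded_vocab={"budget", "minister"})
```

A plain word-list lookup will misclassify loan words and shared spellings, so a real analysis would likely need manual checking or a language identification step.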
The Development of a Sepedi Text Generation Model Using Transformers
Text generation is one of the important sub-tasks of natural language generation (NLG), and aims to produce human-readable text given some input text. Deep learning approaches based on neural networks have been proposed to solve text generation tasks. Although these models can generate text, they do not necessarily capture long-term dependencies accurately, making it difficult to coherently generate longer sentences. Transformer-based models have shown significant improvement in text generation. However, these models are computationally expensive and data hungry. In this study, we develop a Sepedi text generation model using a Transformer-based approach and explore its performance. The developed model has one Transformer block with causal masking on the attention layers and two separate embedding layers. To train the model, we use the National Centre for Human Language Technology (NCHLT) Sepedi text corpus. Our experimental setup varied the model embedding size, batch size and the sequence length. The final model was able to reconstruct unseen test data with 75% accuracy: the highest accuracy achieved to date, using a Sepedi corpus.
Southern Africa Telecommunication Networks and Applications Conference (SATNAC) 202
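The causal masking mentioned in the abstract above can be illustrated with a small, dependency-free sketch. It is not the authors' model: `causal_attention_weights` is a hypothetical helper that only shows how attention weights are restricted so that each position attends to itself and earlier positions. The input scores are arbitrary.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] with future positions (j > i) masked.

    scores: square list of lists of raw attention scores.
    Returns a matrix of the same shape whose rows sum to 1.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]            # position i may only see positions 0..i
        peak = max(visible)               # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        norm = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / norm for e in exps] + masked_tail)
    return weights

# Arbitrary raw scores for a sequence of three tokens.
scores = [
    [0.0, 1.0, 2.0],
    [0.0, 0.0, 5.0],
    [1.0, 1.0, 1.0],
]
weights = causal_attention_weights(scores)
```

The large score of 5.0 in the second row has no effect because it sits at a future position. That is what lets such a model be trained to predict the next token without seeing it.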
Developing a Code-Mixed Sentiment Analysis Dataset of Xitsonga-English Music Reviews
Sentiment analysis is the process of classifying the emotion expressed in a text as positive, negative or neutral. Code-mixed sentiment analysis refers to classifying the sentiment of text that contains two or more languages. Few sentiment analysis studies exist for South African code-mixed languages, owing to the absence of annotated datasets. The purpose of the study was to collect code-mixed text data for the Xitsonga-English language pair. The study collected Xitsonga-English code-mixed comments on music reviews from a YouTube channel. After the data was collected, it was tokenized using the Natural Language Toolkit (NLTK), a Python library. Subsequently, we analyzed the comments for the presence of code-mixing. The collected Xitsonga-English code-mixed data would be suitable for building a sentiment analysis model.
Pre-training a Transformer-Based Generative Model Using a Small Sepedi Dataset
Journal Article, Faculty of Engineering, Multilingual Speech Technologies (MUST), Potchefstroom Campus
Due to the scarcity of data in low-resourced languages, the development of language models for these languages has been very slow. Currently, pre-trained language models have gained popularity in natural language processing, especially in developing domain-specific models for low-resourced languages. In this study, we experiment with the impact of using occlusion-based techniques when training a language model for a text generation task. We curate 2 new datasets, the Sepedi monolingual (SepMono) dataset from several South African resources and the Sepedi radio news (SepNews) dataset from the radio news domain. We use the SepMono dataset to pre-train transformer-based models using the occlusion and non-occlusion pre-training techniques and compare performance. The SepNews dataset is specifically used for fine-tuning. Our results show that the non-occlusion models perform better compared to the occlusion-based models when measuring validation loss and perplexity. However, analysis of the generated text using the BLEU score metric, which measures the quality of the generated text, shows a slightly higher BLEU score for the occlusion-based models compared to the non-occlusion models.
Acknowledgments
We would like to acknowledge the Telkom Centre of Excellence for Speech Technology at the University of Limpopo and the MUST deep learning research group at the North-West University (Potchefstroom) for their continued support. This work is based on research supported in part by the National Research Foundation of South Africa (Ref Number RA211019646111).
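For readers unfamiliar with the perplexity metric used in the comparison above, the sketch below shows the standard relationship between per-token log probabilities and perplexity. It assumes natural-log probabilities. The function name and the toy input are illustrative and not taken from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log probability per token).

    token_log_probs: natural-log probabilities the model assigned to each
    reference token of a held-out text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy check: a model that assigns probability 1/4 to every token is as
# uncertain as a uniform choice among four options.
uniform_ppl = perplexity([math.log(0.25)] * 8)
```

The mean negative log probability is the cross-entropy loss commonly reported as validation loss, so a lower validation loss implies a lower perplexity when both are computed over the same tokens.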
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
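The two counting schemes compared above can be sketched as follows. This is a simplified illustration, not the study's procedure: `cocitation_counts` is a hypothetical helper, the author names are placeholders, and the sketch assumes each author pair is counted at most once per citing paper, with co-authors of the same cited work treated as co-cited.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors=1):
    """Count how often pairs of authors are cited together.

    citing_papers: list of reference lists; each reference is an ordered
        list of the cited work's author names.
    max_authors: number of leading authors credited per cited work
        (1 = traditional first-author counting, 5 = first five authors).
    """
    counts = Counter()
    for references in citing_papers:
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Each unordered author pair is counted once per citing paper.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

# Two citing papers with placeholder author names.
papers = [
    [["A", "B"], ["C"]],
    [["A"], ["D", "B"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In this toy data, author "B" never appears under first-author counting but is co-cited with "A" twice once the first five authors are credited.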
