
    Implications of Sepedi/English code switching for ASR systems

    Code switching (the process of switching from one language to another during a conversation) is a common phenomenon in multilingual environments. Where a minority and dominant language coincide, code switching from the minority language to the dominant language can become particularly frequent. We analyse one such scenario: Sepedi spoken in South Africa, where English is the dominant language; and determine the frequency and mechanisms of code switching through the analysis of radio broadcasts. We also perform an initial acoustic analysis to determine the impact of such code switching on speech recognition performance. We find that the frequency of code switching is unexpectedly high, and that the continuum of code switching (from unmodified embedded words to loan words absorbed in the matrix language) makes this a particularly challenging task for speech recognition systems.
    http://www.prasa.org/index.php/2012-03-07-10-55-1

    Context-dependent modelling of English vowels in Sepedi code-switched speech

    When modelling code-switched speech (utterances that contain a mixture of languages), the embedded language often contains phones not found in the matrix language. These are typically dealt with by either extending the phone set or mapping each phone to a matrix language counterpart. We use acoustic log likelihoods to assist us in identifying the optimal mapping strategy at a context-dependent level (that is, at triphone, rather than monophone level) and obtain new insights in the way English/Sepedi code-switched vowels are produce

    Predicting vowel substitution in code-switched speech

    The accuracy of automatic speech recognition (ASR) systems typically degrades when encountering code-switched speech. Some of this degradation is due to the unexpected pronunciation effects introduced when languages are mixed. Embedded (foreign) phonemes typically show more variation than phonemes from the matrix language: either approximating the embedded language pronunciation fairly closely, or realised as any of a set of phonemic counterparts from the matrix language. In this paper we describe a technique for predicting the phoneme substitutions that are expected to occur during code-switching, using non-acoustic features only. As a case study we consider Sepedi/English code switching and analyse the different realisations of the English schwa. A code-switched speech corpus is used as input, and vowel substitutions are identified by auto-tagging this corpus based on acoustic characteristics. We first evaluate the accuracy of our auto-tagging process, before determining the predictability of our auto-tagged corpus, using non-acoustic features.
    This work was partially supported by the National Research Foundation. Any opinion, findings and conclusions or recommendations expressed in this material are those of the author(s) and therefore the NRF does not accept any liability in regard thereto.

    The Analysis of the Sepedi-English Code-switched Radio News Corpus

    Code-switching is a phenomenon that occurs mostly in multilingual countries where multilingual speakers often switch between languages in their conversations. The unavailability of large-scale code-switched corpora hampers the development and training of language models for the generation of code-switched text. In this study, we explore the initial phase of collecting and creating a Sepedi-English code-switched corpus for generating synthetic news. Radio news and the frequency of code-switching on read news were considered and analysed. We developed and trained a Transformer-based language model using the collected code-switched dataset. We observed that the frequency of code-switched data in the dataset was very low at 1.1%. We complemented our dataset with the news headlines dataset to create a new dataset. Although the frequency was still low, the model obtained the optimal loss rate of 2,361 with an accuracy of 66%.
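
    The 1.1% figure above is a frequency of code-switched material, but the abstract does not say whether it was counted per token or per sentence. The following is a minimal sketch assuming token-level counting against a word list for the embedded language; the function name, the word list and the placeholder tokens are all hypothetical and are not taken from the study.

```python
def code_switch_frequency(tokens, embedded_vocab):
    """Return the fraction of tokens that belong to the embedded language."""
    if not tokens:
        return 0.0
    switched = sum(1 for tok in tokens if tok.lower() in embedded_vocab)
    return switched / len(tokens)


# Hypothetical toy data: "w1".."w8" stand in for matrix-language tokens,
# and ENGLISH_WORDS is an assumed list of embedded-language words.
ENGLISH_WORDS = {"budget", "parliament"}
tokens = ["w1", "w2", "budget", "w3", "w4", "w5", "w6", "Parliament", "w7", "w8"]

freq = code_switch_frequency(tokens, ENGLISH_WORDS)
print(f"{freq:.1%}")
```

    A word-list lookup of this kind cannot separate unmodified embedded words from absorbed loan words, so any real count would depend on how that boundary is drawn.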

    The Development of a Sepedi Text Generation Model Using Transformers

    Text generation is one of the important sub-tasks of natural language generation (NLG), and aims to produce human-readable text given some input text. Deep learning approaches based on neural networks have been proposed to solve text generation tasks. Although these models can generate text, they do not necessarily capture long-term dependencies accurately, making it difficult to coherently generate longer sentences. Transformer-based models have shown significant improvement in text generation. However, these models are computationally expensive and data hungry. In this study, we develop a Sepedi text generation model using a Transformer-based approach and explore its performance. The developed model has one Transformer block with causal masking on the attention layers and two separate embedding layers. To train the model, we use the National Centre for Human Language Technology (NCHLT) Sepedi text corpus. Our experimental setup varied the model embedding size, batch size and the sequence length. The final model was able to reconstruct unseen test data with 75% accuracy: the highest accuracy achieved to date, using a Sepedi corpus.
    Southern Africa Telecommunication Networks and Applications Conference (SATNAC) 202
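
    Causal masking, as mentioned above, restricts each position in a sequence to attending only to itself and to earlier positions. The sketch below illustrates that general idea in plain Python; it is not the authors' implementation, and the helper names `causal_mask` and `masked_softmax` are assumptions made for illustration.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, ignoring masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        visible = [s for s, ok in zip(row, allowed) if ok]
        peak = max(visible)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform toy scores for a 3-token sequence: each row spreads its weight
# evenly over the positions it is allowed to see.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print(row)
```

    With uniform scores, the first position puts all of its weight on itself and later positions share their weight across a growing prefix, which is the behaviour a left-to-right text generator relies on.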

    Developing a Code-Mixed Sentiment Analysis Dataset of Xitsonga-English Music Reviews

    Sentiment analysis is the process of classifying the emotion expressed in a text as positive, negative or neutral. Code-mixed sentiment analysis refers to classifying the sentiment of text that contains two or more languages. Few sentiment analysis studies exist for South African code-mixed languages, owing to the absence of annotated datasets. The purpose of the study was to collect code-mixed text data for the Xitsonga-English language pair. The study collected Xitsonga-English code-mixed comments for music reviews from a YouTube channel. After the data was collected, tokenization was performed using a Python library, the Natural Language Toolkit (NLTK). Subsequently, we analyzed the comments for the presence of code-mixing. The collected Xitsonga-English code-mixed data would be suitable to build a sentiment analysis model.

    Pre-training a Transformer-Based Generative Model Using a Small Sepedi Dataset

    Journal Article, Faculty of Engineering, Multilingual Speech Technologies (MUST), Potchefstroom Campus
    Due to the scarcity of data in low-resourced languages, the development of language models for these languages has been very slow. Currently, pre-trained language models have gained popularity in natural language processing, especially in developing domain-specific models for low-resourced languages. In this study, we experiment with the impact of using occlusion-based techniques when training a language model for a text generation task. We curate 2 new datasets, the Sepedi monolingual (SepMono) dataset from several South African resources and the Sepedi radio news (SepNews) dataset from the radio news domain. We use the SepMono dataset to pre-train transformer-based models using the occlusion and non-occlusion pre-training techniques and compare performance. The SepNews dataset is specifically used for fine-tuning. Our results show that the non-occlusion models perform better compared to the occlusion-based models when measuring validation loss and perplexity. However, analysis of the generated text using the BLEU score metric, which measures the quality of the generated text, shows a slightly higher BLEU score for the occlusion-based models compared to the non-occlusion models.
    Acknowledgments: We would like to acknowledge the Telkom Centre of Excellence for Speech Technology at the University of Limpopo and the MUST deep learning research group at the Northwest University (Potchefstroom) for their continued support. This work is based on research supported in part by the National Research Foundation of South Africa (Ref Number RA211019646111)
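
    Validation loss and perplexity, the two measures compared above, are closely linked. Assuming the loss is the mean per-token cross-entropy in nats (the abstract does not state the definition used), perplexity is its exponential. A minimal sketch with a hypothetical helper and made-up probabilities:

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities a model assigned to each target token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)  # mean cross-entropy in nats
    return math.exp(mean_nll)


# Hypothetical model that assigns probability 0.25 to each of four target tokens;
# these numbers are illustrative and are not results from the study.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
print(round(ppl, 6))
```

    Because perplexity is a monotonic function of this loss, the two measures rank models identically, whereas BLEU scores the generated text against references and can therefore favour a different model.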

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
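
    The two counting schemes compared above differ only in how many authors of each cited work are considered. The sketch below is one possible reading, not the study's procedure: a pair of authors is counted once per citing paper when both appear among the counted authors of that paper's references, and co-authors of a single cited work are treated as co-cited (a simplifying assumption). The function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors):
    """Count author pairs that are co-cited by the same citing paper.

    citing_papers: one reference list per citing paper; each reference is a
    list of author names in byline order. Only the first `max_authors`
    authors of each cited work are considered.
    """
    counts = Counter()
    for references in citing_papers:
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Toy data: two citing papers, each citing two works.
papers = [
    [["A", "B"], ["C"]],
    [["A"], ["D", "C"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(dict(first_author))
print(dict(first_five))
```

    In this toy data, author C is a second author in the second paper's references, so the A-C link is seen twice under first-five counting and only once under first-author counting.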