SADiLaR Language Resource Repository
    536 research outputs found

    NCHLT Xitsonga GloVe embeddings

    Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations for Xitsonga text

    NCHLT isiXhosa fastText-Skipgram embeddings

    Static word and subword embeddings for the Skipgram flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for isiXhosa text

    NCHLT Tshivenḓa fastText-CBoW embeddings

    Static word and subword embeddings for the continuous bag of words (CBoW) flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for Tshivenḓa text

    Autshumato English-Xitsonga Parallel Corpora

    Aligned parallel corpora for the language pair English-Xitsonga. The data is given as two separate UTF-8 text files, with each aligned segment on a newline. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for
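The two-file layout described above (one aligned segment per line in each file) can be read with a few lines of Python. This is a minimal sketch, not part of the resource itself: the function names `read_parallel` and `drop_empty_pairs` are hypothetical, the file names in the usage note are placeholders, and dropping empty segments is only one example of the clean-up the description says may be needed.

```python
from typing import List, Tuple


def read_parallel(src_path: str, tgt_path: str) -> List[Tuple[str, str]]:
    """Pair line i of the source file with line i of the target file."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        src_lines = [line.rstrip("\n") for line in src]
        tgt_lines = [line.rstrip("\n") for line in tgt]
    # Alignment is positional, so differing line counts mean the files are out of step.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line counts differ: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    return list(zip(src_lines, tgt_lines))


def drop_empty_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Strip surrounding whitespace and drop pairs where either side is empty."""
    cleaned = [(s.strip(), t.strip()) for s, t in pairs]
    return [(s, t) for s, t in cleaned if s and t]
```

With placeholder file names, usage would look like `pairs = drop_empty_pairs(read_parallel("corpus.en.txt", "corpus.ts.txt"))`. Filtering is done on pairs rather than on each file separately so that removing a segment never shifts the alignment.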

    CTexT Afrikaans FLAIR String Embeddings

    The CTexT Afrikaans FLAIR String Embeddings are two Afrikaans embedding models based on the FLAIR architecture (Akbik et al. 2018, 2019) that provide real-valued vector representations for Afrikaans text. The embeddings were trained on a corpus of 230 million words

    Generic Multilingual Academic Wordlists with Definitions

    This multilingual generic academic wordlist has been developed to serve as a resource to students to assist with building a vocabulary and decoding academic texts. Examples for use include analyses of topic assignments, other academic tasks, and even questions in exam papers or tests. It is important to note that this resource should be used in a pedagogically well-planned and integrated manner and not only provided on the side with the hope that students will use it. The wordlist is available in all official written SA languages. It contains 2 427 terms with their part of speech indicated as well as definitions and usage examples. It may be used under the Creative Commons license, which allows for free distribution and use of the resource provided that ICELDA and SADiLaR are always acknowledged and that no commercial value may result from the use of this resource

    Autshumato Monolingual Siswati Corpus

    Monolingual corpus for SiSwati. The data is given as a single UTF-8 text file, with each segment on a newline. The dataset contains existing data sourced for the DSAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into SiSwati project. The data comprises a total of 138,651 segments with 1,536,356 SiSwati words
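Segment and word totals like those quoted above can be checked against a one-segment-per-line file with a short script. This is a sketch under assumptions: the function name `corpus_stats` is hypothetical, blank lines are skipped, and words are counted by splitting on whitespace. The tokenisation used for the published figures is not stated, so the totals may differ.

```python
from typing import Tuple


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (segments, words) for a UTF-8 file with one segment per line."""
    segments = 0
    words = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are not segments
            segments += 1
            words += len(line.split())  # assumption: whitespace-delimited words
    return segments, words
```

Reading line by line keeps memory use flat, which matters for a corpus of this size.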

    CTexT Afrikaans FLAIR Part of Speech tagger model

    The CTexT Afrikaans FLAIR Part of Speech tagger model is a neural part of speech tagger model based on the FLAIR framework (Akbik et al. 2019), and includes Afrikaans GloVe (Pennington et al., 2014) and FLAIR embeddings (Akbik et al. 2018) from the CTexT Afrikaans word and string embeddings. The model is trained on a collection of 100 000 part of speech annotated tokens, including the NCHLT Afrikaans annotated data

    N|uu language archive

    This data collection contains recordings and transcriptions of the N|uu language. It includes recordings of N|uu, South African Nama and a local variety of Afrikaans known by the speakers as "Onse Afrikaans" or "Our Afrikaans". All data were collected between 2001 and 2022 from mother tongue speakers of the target languages on site in Upington, Askham and Witdraai in the Northern Cape

    Proof of concept: Afrikaans English Venda E-dictionary

    This proof of concept is a result of an experiment to compile a trilingual e-dictionary for Afrikaans, Venda and English. It includes 613 items and is compatible with the Lexonomy online dictionary interface. A paper describing this first version of the dictionary and decisions made during compilation has been accepted for publication (details to follow on publication)

    8 full texts, 536 metadata records. Updated in last 30 days.