SADiLaR Language Resource Repository
    536 research outputs found

    Autshumato English-Afrikaans Parallel Corpora

    Aligned parallel corpora for the language pair English-Afrikaans. The data is given as two separate UTF-8 text files, with each aligned segment on its own line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task the data is reused for.
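Because the two files are aligned line by line, segment N of one file corresponds to segment N of the other. The following is a minimal sketch of reading such a pair, not the repository's own tooling. The file names `sample.en` and `sample.af` and the two demo sentences are made up for illustration and do not come from the corpus.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path


def read_parallel(src_path, tgt_path):
    """Yield (source, target) segment pairs from two line-aligned UTF-8 files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, so a length
        # mismatch (broken alignment) is detected instead of ignored.
        for n, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {n}")
            yield s.rstrip("\n"), t.rstrip("\n")


# Demo on two tiny made-up files (hypothetical names and content).
with tempfile.TemporaryDirectory() as tmp:
    en_file = Path(tmp) / "sample.en"
    af_file = Path(tmp) / "sample.af"
    en_file.write_text("Good morning.\nThank you.\n", encoding="utf-8")
    af_file.write_text("Goeie môre.\nDankie.\n", encoding="utf-8")
    pairs = list(read_parallel(en_file, af_file))

print(pairs)
```

Raising on a length mismatch is a design choice: a silently truncated pairing would shift every later segment out of alignment, which is hard to spot downstream.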

    Autshumato English-Sesotho Parallel Corpora

    Aligned parallel corpora for the language pair English-Sesotho. The data is given as two separate UTF-8 text files, with each aligned segment on its own line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task the data is reused for.

    Final year high school examination texts of South African home and first additional language subjects

    This data collection consists of reading comprehension and summary writing texts. The texts comprise the final year high school exam texts for Home Language (HL) and First Additional Language (FAL) subjects written in South Africa between 2008 and 2020. The collection contains texts from all eleven official South African language subjects: Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho, Setswana, Sepedi, Siswati, Tshivenda, and Xitsonga. PDF versions of the texts were downloaded from the online public access repository of South Africa's Department of Basic Education. Plain text was extracted using pdftotext (version 22.02.0). The texts were then tokenized using Ucto (version 0.21.1). The data collection contains a total of 429 exam text files comprising 1,314,551 tokens with 131,650 types (i.e., unique tokens). Of these, 223 are HL texts with 689,730 tokens and 88,009 types, whereas the 206 FAL texts contain 624,821 tokens with 73,451 types. In addition to the full exam texts, the reading comprehension and summary writing texts were extracted manually. The data is useful for studies investigating, e.g., linguistic properties, text readability, text properties, text difficulty, and linguistic complexity in any of the eleven languages. Furthermore, both intra-language and inter-language comparisons can be made.
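The distinction between tokens (running words) and types (unique tokens) can be illustrated with a short sketch. It uses plain whitespace splitting on a made-up sentence as a stand-in for a tokenizer, which is an assumption for illustration only; the collection itself was tokenized with Ucto. The last lines check that the reported HL and FAL figures add up to the reported totals.

```python
from collections import Counter


def token_type_counts(tokens):
    """Return (number of tokens, number of types) for a token sequence."""
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)


# Made-up sample sentence, split on whitespace (not Ucto output).
sample = "the cat saw the dog and the dog saw the cat".split()
n_tokens, n_types = token_type_counts(sample)
print(n_tokens, n_types)

# Consistency check on the figures reported in the description.
hl_files, fal_files = 223, 206
hl_tokens, fal_tokens = 689_730, 624_821
total_files = hl_files + fal_files
total_tokens = hl_tokens + fal_tokens
print(total_files, total_tokens)
```

Token counts add across subsets, but type counts do not: a word that occurs in both HL and FAL texts is counted once in the combined collection, which is consistent with 88,009 + 73,451 exceeding the reported 131,650 types.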

    WAT word slips collection

    Approximately 3.5 million index cards containing Afrikaans words (lemmas), meanings and expressions, as well as example sentences (quotations), that have been collected from or sent in by the Afrikaans speech community since 1926.

    Autshumato English-Sepedi Parallel Corpora

    Aligned parallel corpora for the language pair English-Sepedi. The data is given as two separate UTF-8 text files, with each aligned segment on its own line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task the data is reused for.

    Sesotho syllable wordlist

    This package contains a list of Sesotho words with their syllable information.

    Autshumato Monolingual Sesotho Corpus

    Monolingual corpus for Sesotho. The data is given as a single UTF-8 text file, with each segment on its own line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task the data is reused for.

    Autshumato Monolingual Sepedi Corpus

    Monolingual corpus for Sepedi. The data is given as a single UTF-8 text file, with each segment on its own line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task the data is reused for.

    CTexT Afrikaans GloVe Word Embeddings

    The CTexT Afrikaans GloVe Word Embeddings is a 300-dimensional Afrikaans embedding model based on the Global Vectors architecture (Pennington et al., 2014) that provides real-valued vector representations for Afrikaans text. The embedding model was trained on a corpus of 230 million words.
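The listing does not state how the vectors are distributed. As a rough sketch, assuming the common GloVe plain-text layout (one word per line followed by its space-separated vector components), word vectors could be loaded and compared as below. The three-dimensional toy vectors are invented for illustration and are not taken from the model, which has 300 dimensions.

```python
import io
import math


def load_glove_text(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy data in the assumed text layout (made-up values, 3 dimensions).
toy = io.StringIO("kat 1.0 0.0 0.0\nhond 1.0 1.0 0.0\nboom 0.0 0.0 1.0\n")
vecs = load_glove_text(toy)
sim_kat_hond = cosine(vecs["kat"], vecs["hond"])
sim_kat_boom = cosine(vecs["kat"], vecs["boom"])
print(round(sim_kat_hond, 4), sim_kat_boom)
```

If the actual download uses a different container (for example a binary or word2vec-style file with a header line), the parsing step would need to change accordingly.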

    Autshumato Monolingual isiZulu Corpus

    Monolingual corpus for isiZulu. The data is given as a single UTF-8 text file, with each segment on its own line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task the data is reused for.

    8 full texts, 536 metadata records. Updated in the last 30 days.