SADiLaR Language Resource Repository
536 research outputs found
Autshumato English-Afrikaans Parallel Corpora
Aligned parallel corpora for the language pair English-Afrikaans. The data is given as two separate UTF-8 text files, with each aligned segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.
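The two-file layout described above can be read by walking both files in step. The following is a minimal sketch, not the repository's own tooling: the function names `pair_segments` and `read_parallel` are hypothetical, and the length check assumes that line N of one file is meant to align with line N of the other, so a length mismatch is treated as an error.

```python
from itertools import zip_longest


def pair_segments(src_lines, tgt_lines):
    """Yield (source, target) pairs from two line-aligned iterables of text.

    Raises ValueError if one side runs out before the other, since that
    would mean the two files are no longer aligned line by line.
    """
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            raise ValueError(f"files differ in length at line {n}")
        # Strip only line terminators; keep any other whitespace as-is.
        yield src.rstrip("\r\n"), tgt.rstrip("\r\n")


def read_parallel(src_path, tgt_path):
    """Read two UTF-8 text files and yield their aligned segment pairs."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        yield from pair_segments(src, tgt)
```

Because both functions are generators, large corpora can be streamed without loading either file fully into memory. The file names passed to `read_parallel` depend on the downloaded package and are not specified here.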
Autshumato English-Sesotho Parallel Corpora
Aligned parallel corpora for the language pair English-Sesotho. The data is given as two separate UTF-8 text files, with each aligned segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.
Final year high school examination texts of South African home and first additional language subjects
This data collection consists of reading comprehension and summary writing texts. The texts comprise the final-year high school exam texts for Home Language (HL) and First Additional Language (FAL) subjects written in South Africa between 2008 and 2020. The text collection contains texts from all eleven official South African language subjects: Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho, Setswana, Sepedi, Siswati, Tshivenda, and Xitsonga. PDF versions of the texts were downloaded from the online public access repository of South Africa's Department of Basic Education. Plain text was extracted using pdftotext (version 22.02.0). The texts were then tokenized using Ucto (version 0.21.1). The data collection contains a total of 429 exam text files comprising 1,314,551 tokens with 131,650 types (i.e., unique tokens). Of these, 223 are HL texts with 689,730 tokens and 88,009 types, whereas the 206 FAL texts contain 624,821 tokens with 73,451 types. In addition to the full exam texts, the reading comprehension and summary writing texts were extracted manually. The data is useful for studies investigating, e.g., linguistic properties, text readability, text difficulty, and linguistic complexity in any of the eleven languages. Furthermore, both intra-language and inter-language comparisons can be made.
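The distinction between tokens and types used in these statistics can be illustrated with a short sketch. It assumes the text has already been tokenized and that tokens are separated by whitespace; the function name `token_type_counts` is hypothetical, and the comparison here is case-sensitive, which may or may not match how the published counts were produced.

```python
from collections import Counter


def token_type_counts(tokens):
    """Return (number of tokens, number of types) for a token sequence.

    Tokens are all occurrences; types are the distinct token strings.
    Comparison is case-sensitive (an assumption, see the note above).
    """
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)


# Example on an already-tokenized sentence, split on whitespace.
tokens = "the cat saw the dog .".split()
n_tokens, n_types = token_type_counts(tokens)
print(n_tokens, n_types)  # 6 tokens, 5 types ("the" occurs twice)
```

Note that type counts do not add up across subsets: a word occurring in both the HL and the FAL texts is one type in the combined collection, which is why 88,009 and 73,451 do not sum to 131,650.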
WAT word slips collection
Approximately 3.5 million index cards containing Afrikaans words (lemmas), meanings and expressions, as well as example sentences (quotations), that have been collected from or sent in by the Afrikaans speech community since 1926.
Autshumato English-Sepedi Parallel Corpora
Aligned parallel corpora for the language pair English-Sepedi. The data is given as two separate UTF-8 text files, with each aligned segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.
Sesotho syllable wordlist
This package contains a wordlist of Sesotho words together with their syllable information.
Autshumato Monolingual Sesotho Corpus
Monolingual corpus for Sesotho. The data is given as a single UTF-8 text file, with each segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.
Autshumato Monolingual Sepedi Corpus
Monolingual corpus for Sepedi. The data is given as a single UTF-8 text file, with each segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.
CTexT Afrikaans GloVe Word Embeddings
The CTexT Afrikaans GloVe Word Embeddings resource is a 300-dimensional Afrikaans embedding model based on the Global Vectors architecture (Pennington et al., 2014) that provides real-valued vector representations for Afrikaans text. The embedding model was trained on a corpus of 230 million words.
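As an illustration of how real-valued word vectors of this kind are typically used, the sketch below parses vectors and compares them with cosine similarity. It rests on an assumption the description does not confirm: that the vectors are available in the plain-text layout written by the reference GloVe implementation (one word per line, followed by its space-separated values). The function names and the three toy 2-dimensional vectors are made up for the example and are not taken from the actual model.

```python
import math


def parse_vectors(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict of word -> list of floats.

    Assumes the GloVe plain-text layout; the real distribution format of
    this resource may differ.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 2-dimensional vectors, invented purely for illustration.
toy_lines = ["kat 1.0 0.0", "hond 1.0 0.0", "boom 0.0 1.0"]
vectors = parse_vectors(toy_lines)
print(cosine(vectors["kat"], vectors["hond"]))  # identical directions
print(cosine(vectors["kat"], vectors["boom"]))  # orthogonal directions
```

With the real 300-dimensional model, the same `cosine` function applies unchanged; only the parsing step depends on the file format actually shipped.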
Autshumato Monolingual isiZulu Corpus
Monolingual corpus for isiZulu. The data is given as a single UTF-8 text file, with each segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.