SADiLaR Language Resource Repository

536 research outputs found

Autshumato English-Afrikaans Parallel Corpora

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Aligned parallel corpora for the language pair English-Afrikaans. The data is provided as two separate UTF-8 text files, with one aligned segment per line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.
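For readers reusing the data, the two-file layout can be read in lockstep. The following is a minimal Python sketch and is not part of the resource itself: the function names are hypothetical, and it assumes only what the description states, namely two UTF-8 text files in which line N of one file corresponds to line N of the other.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def pair_segments(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) segment pairs from two line-aligned sequences.

    Raises ValueError if one side runs out of lines before the other,
    since a length mismatch means the alignment can no longer be trusted.
    """
    pairs = zip_longest(src_lines, tgt_lines)
    for line_no, (src, tgt) in enumerate(pairs, start=1):
        if src is None or tgt is None:
            raise ValueError(
                f"files are not line-aligned: one side ends at line {line_no - 1}"
            )
        # Strip only the line terminator; other whitespace is left untouched.
        yield src.rstrip("\r\n"), tgt.rstrip("\r\n")


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Open two UTF-8 files (paths supplied by the caller) and pair their lines."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        yield from pair_segments(src, tgt)
```

Reading lazily keeps memory use flat for large corpora. Any further clean-up, such as dropping empty or duplicate segments, would be applied to the yielded pairs.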

Autshumato English-Sesotho Parallel Corpora

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Aligned parallel corpora for the language pair English-Sesotho. The data is provided as two separate UTF-8 text files, with one aligned segment per line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.

Final year high school examination texts of South African home and first additional language subjects

Authors: Sibeko Johannes; van Zaanen Menno
Publication venue: South African Centre for Digital Language Resources
Publication date: 16/11/2022

This data collection consists of reading comprehension and summary writing texts. The texts comprise the final-year high school exam texts for Home Language (HL) and First Additional Language (FAL) subjects written in South Africa between 2008 and 2020. The collection contains texts from all eleven official South African language subjects: Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho, Setswana, Sepedi, Siswati, Tshivenda, and Xitsonga. PDF versions of the texts were downloaded from the online public access repository of South Africa's Department of Basic Education. Plain text was extracted using pdftotext (version 22.02.0). The texts were then tokenized using Ucto (version 0.21.1). The data collection contains a total of 429 exam text files comprising 1,314,551 tokens with 131,650 types (i.e., unique tokens). Of these, 223 are HL texts with 689,730 tokens and 88,009 types, whereas the 206 FAL texts contain 624,821 tokens with 73,451 types. In addition to the full exam texts, the reading comprehension and summary writing texts have been extracted manually. The data is useful for studies investigating, e.g., linguistic properties, text readability, text properties, text difficulty, and linguistic complexity in any of the eleven languages. Furthermore, both intra-language and inter-language comparisons can be made.
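The distinction between tokens and types used in these figures can be illustrated with a short Python sketch. It is not the authors' counting procedure: it assumes the tokenized text separates tokens with whitespace and that types are counted case-sensitively, and the function names are hypothetical.

```python
from collections import Counter
from typing import Tuple


def count_tokens_and_types(tokenized_text: str) -> Tuple[int, int]:
    """Return (number of tokens, number of types) for whitespace-separated text.

    Assumption: the input has already been tokenized, so splitting on
    whitespace recovers the tokens. A type is a distinct token string.
    """
    counts = Counter(tokenized_text.split())
    return sum(counts.values()), len(counts)


def type_token_ratio(n_tokens: int, n_types: int) -> float:
    """Types divided by tokens; 0.0 for empty input to avoid division by zero."""
    return n_types / n_tokens if n_tokens else 0.0


n_tokens, n_types = count_tokens_and_types("the cat saw the dog .")
print(n_tokens, n_types)  # 6 5
```

The reported token counts are additive across the two subsets (689,730 + 624,821 = 1,314,551), whereas the type counts are not (88,009 + 73,451 is larger than 131,650), which is what one would expect if HL and FAL texts share many word forms.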

WAT word slips collection

Author: Bureau of the WAT
Publication venue: Bureau of the WAT
Publication date: 14/10/2022

Approximately 3.5 million index cards containing Afrikaans words (lemmas), meanings and expressions, as well as example sentences (quotations) that have been collected from or sent in by the Afrikaans speech community since 1926.

Autshumato English-Sepedi Parallel Corpora

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Aligned parallel corpora for the language pair English-Sepedi. The data is provided as two separate UTF-8 text files, with one aligned segment per line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.

Sesotho syllable wordlist

Authors: Sibeko Johannes; van Zaanen Menno
Publication venue: South African Centre for Digital Language Resources
Publication date: 03/02/2022

This package contains a list of Sesotho words together with their syllable information.

Autshumato Monolingual Sesotho Corpus

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Monolingual corpus for Sesotho. The data is provided as a single UTF-8 text file, with one segment per line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.

Autshumato Monolingual Sepedi Corpus

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Monolingual corpus for Sepedi. The data is provided as a single UTF-8 text file, with one segment per line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.

CTexT Afrikaans GloVe Word Embeddings

Author: Eiselen Roald
Publication venue: Centre for Text Technology (CTexT)
Publication date: 10/01/2022

The CTexT Afrikaans GloVe Word Embeddings is a 300-dimensional Afrikaans embedding model based on the Global Vectors architecture (Pennington et al., 2014) that provides real-valued vector representations for Afrikaans text. The embedding model was trained on a corpus of 230 million words.
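The listing does not state how the embeddings are packaged. As a hedged illustration only, the Python sketch below assumes the plain-text layout commonly used for GloVe vectors (one word per line, followed by its space-separated values); if the resource ships in another format, the parsing step would need to change. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse 'word v1 v2 ... vN' lines into a word -> vector mapping.

    Assumption: plain-text GloVe layout, with tokens that contain no spaces.
    """
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip("\r\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy input with made-up 2-dimensional vectors; the real model uses 300 dimensions.
toy = parse_glove_lines(["kat 1.0 0.0", "hond 1.0 0.0", "huis 0.0 1.0"])
print(cosine_similarity(toy["kat"], toy["hond"]))  # 1.0
```

For the real file, the same parser could be fed an open file handle, for example `parse_glove_lines(open(path, encoding="utf-8"))`, where `path` is wherever the downloaded vectors are stored.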

Autshumato Monolingual isiZulu Corpus

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Monolingual corpus for isiZulu. The data is provided as a single UTF-8 text file, with one segment per line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.

8 full texts, 536 metadata records. Updated in last 30 days.
