SADiLaR Language Resource Repository
536 research outputs found
Autshumato English-Setswana Parallel Corpora
Aligned parallel corpora for the language pair English-Setswana. The data is given as two separate UTF-8 text files, with each aligned segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.
Autshumato English-isiZulu Parallel Corpora
Aligned parallel corpora for the language pair English-isiZulu. The data is given as two separate UTF-8 text files, with each aligned segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.
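The two-file layout described for the parallel corpora above (one UTF-8 file per language, with segment N of one file aligned to segment N of the other) could be read along the following lines. This is a minimal sketch, not part of the repository: the function name `read_parallel`, the file names `corpus.src.txt` and `corpus.tgt.txt`, and the placeholder segments are all assumptions made for illustration.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path


def read_parallel(src_path, tgt_path):
    """Yield (source, target) pairs from two line-aligned UTF-8 text files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, so a length
        # mismatch (broken alignment) is detected instead of hidden.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


# Demo on two tiny stand-in files; the names and contents are hypothetical
# placeholders, not data taken from the Autshumato corpora.
with tempfile.TemporaryDirectory() as tmp:
    src_file = Path(tmp) / "corpus.src.txt"
    tgt_file = Path(tmp) / "corpus.tgt.txt"
    src_file.write_text("source segment 1\nsource segment 2\n", encoding="utf-8")
    tgt_file.write_text("target segment 1\ntarget segment 2\n", encoding="utf-8")
    pairs = list(read_parallel(src_file, tgt_file))

print(pairs)
```

Raising an error on a length mismatch is a design choice for this sketch: a single missing line would silently shift every later pair, which is hard to notice in machine translation training data.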
CTexT Afrikaans fastText CBoW String Embeddings
The CTexT Afrikaans fastText CBoW String Embeddings is a 300-dimensional Afrikaans embedding model based on the Continuous Bag of Words fastText architecture that provides real-valued vector representations for Afrikaans text. The embedding was trained on a corpus of 230 million words.
Autshumato Monolingual Xitsonga Corpus
Monolingual corpus for Xitsonga. The data is given as a single UTF-8 text file, with each segment on its own line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.
WAT quotation collection
Collection of short quotations/excerpts from a variety of books (fiction, non-fiction & academic)
N|uu language archive
This collection contains information that forms the basis of the N|uu dictionary, which contains a word list for N|uu with translations into Afrikaans, Nama, and English.
Multilingual spelling checker lexicons
Spelling checker lexicons for 10 South African languages. The lexicons were created by collecting data from various sources and were manually reviewed by language experts according to the standard written orthography.
For each language there are four different lexicon files:
abbreviations..txt: abbreviations and abbreviation compounds.
lowercase..txt: words that are correct when written in lower case.
offensive..txt: words that are potentially offensive, obscene, racist, or should not be suggested by a spelling checker for some other reason.
uppercase..txt: words that should only be written with one or more capitalised characters, such as person and place names.
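One way a spelling checker might combine the four lexicon types listed above is sketched below. This is a minimal illustration under stated assumptions, not the behaviour of any actual checker built on these lexicons: it assumes one entry per line, treats offensive entries as accepted but never suggested, and accepts a lower-case entry in any capitalisation. The helper names (`load_lexicon`, `is_correct`, `suggestions`) and the toy entries are hypothetical and are not taken from the lexicon files.

```python
def load_lexicon(lines):
    """Build a set of entries from an iterable of lines (assumed: one entry per line)."""
    return {line.strip() for line in lines if line.strip()}


# Toy stand-ins for the four lexicon files of one language (hypothetical entries).
lowercase = load_lexicon(["huis", "boom"])
uppercase = load_lexicon(["Pretoria"])
abbreviations = load_lexicon(["bv."])
offensive = load_lexicon(["badword"])


def is_correct(word):
    """Accept a word if any lexicon licenses the form in which it was typed."""
    # Upper-case-only entries, abbreviations and offensive entries must match exactly.
    if word in uppercase or word in abbreviations or word in offensive:
        return True
    # Assumption: lower-case entries are also valid when capitalised,
    # for example at the start of a sentence.
    return word.lower() in lowercase


def suggestions(candidates):
    """Keep only valid candidates, never offering an offensive entry."""
    return [c for c in candidates if is_correct(c) and c not in offensive]


print(is_correct("Pretoria"), is_correct("pretoria"))
print(suggestions(["huis", "badword", "xyz"]))
```

Keeping the offensive entries in a separate set lets the checker avoid flagging them as misspellings while still leaving them out of the suggestion list.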
Sesotho syllabification systems
This package contains two syllabification systems for Sesotho (rule-based and TeX-based)
Afrikaans morphological evaluative constructions dataset
A dataset of Afrikaans morphological evaluative constructions (MECs) and their word frequency classes. The MECs have been compiled using extracted constructions from the corpus collections accessible through the Virtual Institute for Afrikaans (VivA). The files are grouped into affixoids, compounds, affixes and other types of MECs. This dataset forms the basis of the description of Afrikaans MECs in a PhD thesis.
Multilingual Linguistic Terminology
Multilingual Linguistic Terminology Project
Termbanks of Linguistic terminology for South African languages
Version 1.0
https://linguisticterminology.wordpress.com/
Languages included: Setswana (tsn), isiZulu (zul), isiXhosa (xho), Sesotho sa Leboa (nso), Tshivenda (ven), Sesotho (sot), Xitsonga (tso), isiNdebele (nde) and Siswati (ssw)