SADiLaR Language Resource Repository
536 research outputs found
NCHLT Xitsonga GloVe embeddings
Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations for Xitsonga text
NCHLT isiXhosa fastText-Skipgram embeddings
Static word and subword embeddings for the Skipgram flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for isiXhosa text
NCHLT Tshivenḓa fastText-CBoW embeddings
Static word and subword embeddings for the continuous bag of words (CBoW) flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for Tshivenḓa text
Autshumato English-Xitsonga Parallel Corpora
Aligned parallel corpora for the language pair English-Xitsonga. The data is given as two separate UTF-8 text files, with each aligned segment on a newline. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for
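Because the two files are aligned line by line, segment pairs can be recovered by reading both files in step. The following is a minimal sketch, not part of the resource itself: the function names are hypothetical, and the empty-segment filter is only one example of the clean-up a given task might need.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Return a list of (source, target) segment pairs, one pair per line.

    Assumes both files are UTF-8 and that line N of one file is aligned
    with line N of the other, as the description states.
    """
    pairs = []
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which lets us
            # detect a length mismatch instead of silently truncating.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            pairs.append((s.rstrip("\n"), t.rstrip("\n")))
    return pairs


def drop_empty(pairs):
    """Illustrative clean-up step: discard pairs where either side is blank."""
    return [(s, t) for s, t in pairs if s.strip() and t.strip()]
```

Calling `drop_empty(read_parallel(english_file, xitsonga_file))`, where both arguments are placeholder paths to the two downloaded text files, would yield the aligned, non-empty pairs in a form most machine translation toolkits can consume after further tokenisation.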
CTexT Afrikaans FLAIR String Embeddings
The CTexT Afrikaans FLAIR String Embeddings are two Afrikaans embedding models based on the FLAIR architecture (Akbik et al. 2018, 2019) that provide real-valued vector representations for Afrikaans text. The embeddings were trained on a corpus of 230 million words
Generic Multilingual Academic Wordlists with Definitions
This multilingual generic academic wordlist has been developed to serve as a resource to students to assist with building a vocabulary and decoding academic texts. Examples for use include analyses of topic assignments, other academic tasks, and even questions in exam papers or tests. It is important to note that this resource should be used in a pedagogically well-planned and integrated manner and not only provided on the side with the hope that students will use it. The wordlist is available in all official written SA languages. It contains 2 427 terms with their part of speech indicated as well as definitions and usage examples. It may be used under the Creative Commons license, which allows for free distribution and use of the resource provided that ICELDA and SADiLaR are always acknowledged and that no commercial value may result from the use of this resource
Autshumato Monolingual Siswati Corpus
Monolingual corpus for SiSwati. The data is given as a single UTF-8 text file, with each segment on a newline. The dataset contains existing data sourced for the DSAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into SiSwati project. The data comprises a total of 138,651 segments with 1,536,356 SiSwati words
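With one segment per line, the segment and word totals can be checked with a short script. This is a rough sketch under stated assumptions: `corpus_stats` is a hypothetical helper, blank lines are skipped, and words are counted by splitting on whitespace. The tokenisation used for the published figures is not stated, so the result may differ from them.

```python
def corpus_stats(path):
    """Count non-empty segments and whitespace-separated words in a corpus file.

    Assumes a UTF-8 text file with one segment per line.
    """
    segments = 0
    words = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            if not text:
                continue  # ignore blank lines rather than count them as segments
            segments += 1
            words += len(text.split())
    return segments, words
```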
CTexT Afrikaans FLAIR Part of Speech tagger model
The CTexT Afrikaans FLAIR Part of Speech tagger model is a neural part-of-speech tagger based on the FLAIR framework (Akbik et al. 2019). It includes Afrikaans GloVe (Pennington et al., 2014) and FLAIR embeddings (Akbik et al. 2018) from the CTexT Afrikaans word and string embeddings. The model is trained on a collection of 100 000 part-of-speech annotated tokens, including the NCHLT Afrikaans annotated data
N|uu language archive
This data collection contains recordings and transcriptions of the N|uu language. It includes recordings of N|uu, South African Nama and a local variety of Afrikaans known by the speakers as "Onse Afrikaans" or "Our Afrikaans". The data, collected between 2001 and 2022, come from mother-tongue speakers of the target languages on site in Upington, Askham and Witdraai in the Northern Cape
Proof of concept: Afrikaans English Venda E-dictionary
This proof of concept is a result of an experiment to compile a trilingual e-dictionary for Afrikaans, Venda and English. It includes 613 items and is compatible with the Lexonomy online dictionary interface. A paper describing this first version of the dictionary and decisions made during compilation has been accepted for publication (details to follow on publication)