SADiLaR Language Resource Repository

536 research outputs found

NCHLT Xitsonga GloVe embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations for Xitsonga text

NCHLT isiXhosa fastText-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word and subword embeddings for the Skipgram flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for isiXhosa text

NCHLT Tshivenḓa fastText-CBoW embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word and subword embeddings for the continuous bag of words (CBoW) flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for Tshivenḓa text

Autshumato English-Xitsonga Parallel Corpora

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Aligned parallel corpora for the language pair English-Xitsonga. The data is given as two separate UTF-8 text files, with each aligned segment on a newline. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for
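The two-file layout described above (one segment per line, with line N of one file aligned to line N of the other) can be read with a short script. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, any file paths passed in are placeholders, and raising an error on a length mismatch is an assumption rather than documented behaviour of the corpus.

```python
from itertools import zip_longest


def pair_segments(src_lines, tgt_lines):
    """Yield (source, target) pairs from two line-aligned iterables of text."""
    for line_no, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            # One side ran out first, so the alignment cannot be trusted.
            raise ValueError(f"segment count mismatch at line {line_no}")
        yield src.rstrip("\r\n"), tgt.rstrip("\r\n")


def read_parallel(src_path, tgt_path):
    """Read two UTF-8 files and return their aligned segment pairs as a list."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        return list(pair_segments(src, tgt))
```

Calling `read_parallel` with the paths of the English and Xitsonga files would return a list of segment pairs. As the description notes, further clean-up may still be needed depending on the task.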

CTexT Afrikaans FLAIR String Embeddings

Author: Eiselen Roald
Publication venue: Centre for Text Technology (CTexT)
Publication date: 10/01/2022

The CTexT Afrikaans FLAIR String Embeddings are two Afrikaans embedding models based on the FLAIR architecture (Akbik et al. 2018, 2019) that provide real-valued vector representations for Afrikaans text. The embeddings were trained on a corpus of 230 million words

Generic Multilingual Academic Wordlists with Definitions

Author: Van Dyk Tobie
Publication venue: ICELDA
Publication date: 01/01/2022

This multilingual generic academic wordlist has been developed to serve as a resource to students to assist with building a vocabulary and decoding academic texts. Examples for use include analyses of topic assignments, other academic tasks, and even questions in exam papers or tests. It is important to note that this resource should be used in a pedagogically well-planned and integrated manner and not only provided on the side with the hope that students will use it. The wordlist is available in all official written SA languages. It contains 2 427 terms with their part of speech indicated as well as definitions and usage examples. It may be used under the Creative Commons license, which allows for free distribution and use of the resource provided that ICELDA and SADiLaR are always acknowledged and that no commercial value may result from the use of this resource

Autshumato Monolingual Siswati Corpus

Author: McKellar Cindy
Publication venue: North-West University - Centre for Text Technology (CTexT)
Publication date: 31/03/2022

Monolingual corpus for SiSwati. The data is given as a single UTF-8 text file, with each segment on a newline. The dataset contains existing data sourced for the DSAC funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into SiSwati project. The data comprises a total of 138,651 segments with 1,536,356 SiSwati words
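Segment and word totals like those quoted above can be approximated for a one-segment-per-line file. The sketch below is illustrative only: the function name is hypothetical, and both whitespace tokenisation and the skipping of empty lines are assumptions, so the result may differ from the published figures if the corpus was counted differently.

```python
def corpus_stats(lines):
    """Return (segment_count, word_count) for an iterable of text lines."""
    segments = 0
    words = 0
    for line in lines:
        tokens = line.split()  # assumption: words are whitespace-delimited
        if not tokens:
            continue  # assumption: empty lines are not counted as segments
        segments += 1
        words += len(tokens)
    return segments, words
```

Passing an open file handle (opened with `encoding="utf-8"`) as `lines` processes the corpus one line at a time without loading it all into memory.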

CTexT Afrikaans FLAIR Part of Speech tagger model

Author: Eiselen Roald
Publication venue: Centre for Text Technology (CTexT)
Publication date: 10/01/2022

The CTexT Afrikaans FLAIR Part of Speech tagger model is a neural part of speech tagger model based on the FLAIR framework (Akbik et al. 2019), and includes Afrikaans GloVe (Pennington et al., 2014) and FLAIR embeddings (Akbik et al. 2018) from the CTexT Afrikaans word and string embeddings. The model is trained on a collection of 100 000 part of speech annotated tokens, including the NCHLT Afrikaans annotated data

N|uu language archive

Author: Collins Christopher T; Exter Mats; Jones Kerry; Namaseb Levi; Sands Bonny; Witzlack-Makarevich Alena
Publication venue: South African Centre for Digital Language Resources
Publication date: 15/11/2022

This data collection contains recordings and transcriptions of the N|uu language. This includes N|uu recordings, South African Nama and a local variety of Afrikaans known by the speakers as "Onse Afrikaans" or "Our Afrikaans". All data were collected between 2001 and 2022 from mother tongue speakers of the target languages on site in Upington, Askham and Witdraai in the Northern Cape

Proof of concept: Afrikaans English Venda E-dictionary

Author: Bosch Sonja; Griesel Marissa; Taljaard Elsabé
Publication venue: Published as a Lexonomy dictionary (https://www.lexonomy.eu/POCVenEngAfr/)
Publication date: 04/03/2022

This proof of concept is a result of an experiment to compile a trilingual e-dictionary for Afrikaans, Venda and English. It includes 613 items and is compatible with the Lexonomy online dictionary interface. A paper describing this first version of the dictionary and decisions made during compilation has been accepted for publication (details to follow on publication)
