SADiLaR Language Resource Repository


536 research outputs found


Autshumato Monolingual Afrikaans Corpus

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Monolingual corpus for Afrikaans. The data is given as a single UTF-8 text file, with each segment on a new line. The data was specifically selected and formatted for use in training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.
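The one-segment-per-line layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the resource itself: the file name `corpus.af.txt`, the helper `read_segments`, and the sample sentences are all hypothetical stand-ins, and skipping blank lines is an assumption about the clean-up a reuser might want.

```python
from pathlib import Path
import tempfile


def read_segments(path):
    """Return the non-empty segments of a one-segment-per-line UTF-8 file."""
    segments = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            segment = line.strip()  # drop the newline and stray whitespace
            if segment:             # assumption: blank lines carry no data
                segments.append(segment)
    return segments


# Demo on a tiny stand-in file; the real corpus file name is not known here.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "corpus.af.txt"  # hypothetical file name
    sample.write_text("Goeie môre.\n\nDit is 'n toets.\n", encoding="utf-8")
    segments = read_segments(sample)

print(len(segments), "segments read")
```

Reading line by line rather than loading the whole file keeps memory use low for large corpora; for very large files the function could yield segments instead of building a list.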

CTexT fastText Skipgram String Embeddings

Author: Eiselen Roald
Publication venue: Centre for Text Technology (CTexT)
Publication date: 10/01/2022

The CTexT Afrikaans fastText Skipgram String Embeddings is a 300-dimensional Afrikaans embedding model based on the Skipgram fastText architecture that provides real-valued vector representations for Afrikaans text. The embedding was trained on a corpus of 230 million words.

Autshumato Monolingual Setswana Corpus

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Monolingual corpus for Setswana. The data is given as a single UTF-8 text file, with each segment on a new line. The data was specifically selected and formatted for use in training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.

CTexT Afrikaans FLAIR Named Entity Recognition model

Author: Eiselen Roald
Publication venue: Centre for Text Technology (CTexT)
Publication date: 10/01/2022

The CTexT Afrikaans FLAIR Named Entity Recognition model is a neural NER model based on the FLAIR framework (Akbik et al., 2019). It includes Afrikaans fastText (Bojanowski et al., 2017) and FLAIR embeddings (Akbik et al., 2018) from the CTexT Afrikaans word and string embeddings. The model is trained on the NCHLT Afrikaans Named Entity Annotated Corpus.

African Wordnet version 1.0

Author: Griesel Marissa
Publication venue: UNISA
Publication date: 20/09/2022

Developed using the expand model with Princeton WordNet 3.1 as basis. Please see https://africanwordnet.wordpress.com/ for all details on the project. This work builds on previously released data and is under active development. New releases will be made available at the end of every significant development phase.

Autshumato English-Siswati Parallel Corpora

Author: McKellar Cindy
Publication venue: North-West University - Centre for Text Technology (CTexT)
Publication date: 31/03/2022

Aligned parallel corpora for the following language pair: English-SiSwati. The data is given as four separate UTF-8 text files, with each segment on a new line. The dataset contains existing data sourced for the DSAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into SiSwati project. It contains the following types of bilingual data: translations from English to Siswati, and crawled parallel data for English-Siswati. The dataset comprises a total of 114,839 segments with 2,002,293 English words and 1,423,414 SiSwati words. (A new version was issued since the title was changed.)
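Because the corpora are aligned segment by segment, line N of an English file corresponds to line N of its Siswati counterpart. The sketch below pairs two such files and fails loudly if they differ in length. It rests on assumptions: the file names, the helper `read_parallel`, and the placeholder segments are hypothetical, and the listing does not say how the four files are divided between languages and data types.

```python
from pathlib import Path
import tempfile


def read_lines(path):
    """Read a UTF-8 file, keeping every line (even empty ones) so that
    line positions stay aligned with the other language."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(source_path, target_path):
    """Pair line N of the source file with line N of the target file."""
    source = read_lines(source_path)
    target = read_lines(target_path)
    if len(source) != len(target):
        raise ValueError(
            f"misaligned files: {len(source)} source vs {len(target)} target segments"
        )
    return list(zip(source, target))


# Demo on stand-in files; names and contents are placeholders only.
with tempfile.TemporaryDirectory() as tmp:
    en_file = Path(tmp) / "sample.en.txt"  # hypothetical file name
    ss_file = Path(tmp) / "sample.ss.txt"  # hypothetical file name
    en_file.write_text("Hello.\nThank you.\n", encoding="utf-8")
    ss_file.write_text("<Siswati segment 1>\n<Siswati segment 2>\n", encoding="utf-8")
    pairs = read_parallel(en_file, ss_file)

for english, siswati in pairs:
    print(english, "|||", siswati)
```

Empty lines are deliberately preserved here, since dropping them on one side only would shift every later pair out of alignment.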

COVID-19 Multilingual Terminology

Author: City of Tshwane; South African Centre for Digital Language Resources (SADiLaR); Department of Science and Innovation (DSI); Pan South African Language Board (PanSALB)
Publication venue: Pan South African Language Board (PanSALB)
Publication date: 2021

COVID-19 multilingual terminology list in all the South African languages. The development of this terminology list was initiated by the City of Tshwane and sponsored by the South African Centre for Digital Language Resources and the Department of Science and Innovation. PanSALB's national language boards assisted in the verification of the terminology list.

English-IsiNdebele Glossary of Medical Terms

Author: Malele Nomsebenzi
Publication venue: University of South Africa (UNISA)
Publication date: 01/09/2021

This English-isiNdebele glossary of medical terms was compiled by a PhD candidate as part of a PhD project.

South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT)

Author: Van Dyk Tobie
Publication venue: SADiLaR
Publication date: 01/01/2021

NOTE: THIS HAS BEEN SUPERSEDED. See https://hdl.handle.net/20.500.12185/585

The South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) is a multi-genre, multi-level learner corpus developed by the Inter-institutional Centre for Language Development and Assessment (ICELDA) in collaboration with the South African Centre for Digital Language Resources (SADiLaR). The corpus includes shorter and longer texts from an array of genres and fields of study, at all levels of study. The corpus was, and continues to be, contributed to by several institutions of higher education that are part of the ICELDA network. Ethical clearance has been granted at all partnering institutions to collect data; this includes informed consent by all students who contributed to SAMuLCAT.

The corpus is augmented by two sets of metadata. The first set includes mainly biographical detail about students (completed by the students themselves); the second set includes more information on the different task types and texts included in the corpus (completed by, e.g., lecturers and writing centre staff). Data can be filtered through the metadata filters available in the search functionality of the corpus.

The corpus is available under the Creative Commons 4.0 license and is open source. Use of the corpus for research purposes requires permission from SADiLaR, and applications should include evidence of ethical clearance from the research institutions with which staff and students are affiliated. More information about the design of the corpus and the metadata available in it can be found in the following article: Carstens, A. and Eiselen, R., 2019. Designing a South African multilingual learner corpus of academic texts (SAMuLCAT). Language Matters, 50(1), pp.64-83.

Annotation: Corpora for the indigenous South African languages are automatically annotated for lemmas and part of speech using the available NCHLT Text lemmatisers and part-of-speech taggers. Information on the accuracy and tag sets for these languages is available from the NCHLT Web Service. No quality control of the automatic annotations was performed. The English data is annotated using the open-source NLP4J library available here: https://emorynlp.github.io/nlp4j

Generic Bilingual Academic Wordlist with Definitions

Author: ICELDA; SADiLaR
Publication venue: SADiLaR
Publication date: 01/01/2021

The academic wordlist was developed as a resource to help students better understand the words used in the information they gather to complete their academic assignments, and so to enhance their academic careers. The bilingual wordlist contains 2427 terms, each with its part of speech indicated and a definition and usage example provided in both Afrikaans and English. A multilingual version of this wordlist is available here: https://repo.sadilar.org/handle/20.500.12185/66

8 full texts

536 metadata records

Updated in last 30 days.
