SADiLaR Language Resource Repository

536 research outputs found

Autshumato English-Setswana Parallel Corpora

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Aligned parallel corpora for the language pair English-Setswana. The data is given as two separate UTF-8 text files, with each aligned segment on a new line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.
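
The two-file, line-aligned layout described above can be read as segment pairs. The following is a minimal sketch, not the repository's own tooling: the function name `read_parallel` and the file paths passed to it are hypothetical, and it assumes only what the description states (UTF-8, one segment per line, line N of one file aligned with line N of the other).

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking that one file ended before the other


def read_parallel(src_path, tgt_path):
    """Yield (source, target) segment pairs from two line-aligned UTF-8 files.

    Raises ValueError if the files do not have the same number of lines,
    since the alignment would then no longer be trustworthy.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        pairs = zip_longest(src, tgt, fillvalue=_MISSING)
        for line_no, (s, t) in enumerate(pairs, start=1):
            if s is _MISSING or t is _MISSING:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\r\n"), t.rstrip("\r\n")
```

Checking the line counts before training is a cheap way to catch a truncated download. Any further clean-up, such as dropping empty segments or deduplicating, depends on the task and is left out here.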

Autshumato English-isiZulu Parallel Corpora

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Aligned parallel corpora for the language pair English-isiZulu. The data is given as two separate UTF-8 text files, with each aligned segment on a new line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.

CTexT Afrikaans fastText CBoW String Embeddings

Author: Eiselen Roald
Publication venue: Centre for Text Technology (CTexT)
Publication date: 10/01/2022

The CTexT Afrikaans fastText CBoW String Embeddings is a 300-dimensional Afrikaans embedding model based on the Continuous Bag of Words (CBoW) fastText architecture. It provides real-valued vector representations for Afrikaans text. The embedding was trained on a corpus of 230 million words.
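
As an illustration of how real-valued word vectors of this kind are typically compared, here is a minimal sketch. It assumes the vectors are available in the plain-text `.vec` layout that fastText can export (a header line with the entry count and dimension, then one word and its values per line); the record above does not say which file format is actually distributed, so that is an assumption, and the helper names `load_vec` and `cosine` are hypothetical.

```python
import math


def load_vec(lines):
    """Parse text-format vectors: a 'count dim' header, then 'word v1 ... vdim'.

    Assumption: the model has been exported to this plain-text layout.
    """
    it = iter(lines)
    _count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

For the model described here, `dim` would be expected to be 300. A pure-Python loop is slow on a large vocabulary, so a numerical library would be the practical choice beyond small experiments.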

Autshumato Monolingual Xitsonga Corpus

Author: McKellar Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/09/2022

Monolingual corpus for Xitsonga. The data is given as a single UTF-8 text file, with each segment on a new line. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.

WAT quotation collection

Author: Bureau of the WAT
Publication venue: Bureau of the WAT
Publication date: 14/10/2022

Collection of short quotations/excerpts from a variety of books (fiction, non-fiction & academic)

N|uu language archive

Authors: Collins Chris, Sands Bonny, Jones Kerry
Publication venue: Jones, Kerry
Publication date: 11/08/2022

This collection contains the information that forms the basis of the N|uu dictionary, which contains a word list for N|uu with translations into Afrikaans, Nama, and English.

Multilingual spelling checker lexicons

Author: Centre for Text Technology CTexT®
Publication venue: CTexT® (Centre for Text Technology)
Publication date: 30/06/2022

Spelling checker lexicons for 10 South African languages. The lexicons were created by collecting data from various sources and were manually reviewed by language experts according to the standard written orthography. For each language there are four lexicon files: abbreviations..txt (abbreviations and abbreviation compounds); lowercase..txt (words that are correct when written in lower case); offensive..txt (words that are potentially offensive, obscene, or racist, or that should not be suggested by a spelling checker for some other reason); and uppercase..txt (words that should only be written with one or more capitalised characters, such as person and place names).
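
A minimal sketch of how the four lexicon types could be combined in a spelling checker follows. It rests on assumptions not stated in the record: that each file holds one entry per line, and that offensive entries count as correctly spelled but are withheld from suggestions. The function names `load_lexicon` and `check_word` are hypothetical.

```python
def load_lexicon(lines):
    """Build a set of entries, assuming one entry per line (an assumption)."""
    return {line.strip() for line in lines if line.strip()}


def check_word(word, lowercase, uppercase, abbreviations, offensive):
    """Return (accepted, suggestible) for a single token.

    accepted: the token matches an entry in one of the four lexicons.
    suggestible: the token is accepted and not in the offensive lexicon.
    """
    accepted = (
        word in uppercase        # must match capitalisation exactly
        or word in abbreviations
        or word in offensive     # assumed valid, but never suggested
        or word in lowercase
        # Allow sentence-initial capitalisation of a lower-case entry.
        or (word[:1].isupper() and word[:1].lower() + word[1:] in lowercase)
    )
    suggestible = accepted and word not in offensive
    return accepted, suggestible
```

Keeping the upper-case lexicon case-sensitive means a place name typed without its capital is flagged, which matches the description of that file as words that should only be written with capitalised characters.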

Sesotho syllabification systems

Authors: Sibeko Johannes, van Zaanen Menno
Publication venue: South African Centre for Digital Language Resources
Publication date: 03/02/2022

This package contains two syllabification systems for Sesotho (rule-based and TeX-based)

Afrikaans morphological evaluative constructions dataset

Author: Trollip Benito
Publication venue: North-West University
Publication date: 04/11/2022

A dataset of Afrikaans morphological evaluative constructions (MECs) and their word frequency classes. The MECs were compiled from constructions extracted from the corpus collections accessible through the Virtual Institute for Afrikaans (VivA). The files are grouped into affixoids, compounds, affixes, and other types of MECs. This dataset forms the basis of the description of Afrikaans MECs in a PhD thesis.

Multilingual Linguistic Terminology

Author: Griesel Marissa
Publication venue: UNISA
Publication date: 20/09/2022

Multilingual Linguistic Terminology Project: termbanks of linguistic terminology for South African languages, version 1.0 (https://linguisticterminology.wordpress.com/). Languages included: Setswana (tsn), isiZulu (zul), isiXhosa (xho), Sesotho sa Leboa (nso), Tshivenda (ven), Sesotho (sot), Xitsonga (tso), isiNdebele (nde) and Siswati (ssw).

8 full texts; 536 metadata records. Updated in the last 30 days.
