SADiLaR Language Resource Repository

536 research outputs found

Autshumato English-isiXhosa Parallel corpus

Author: McKellar Cindy
Publication venue: North-West University - Centre for Text Technology (CTexT)
Publication date: 10/06/2025

Aligned parallel corpus for the language pair English-isiXhosa. The data is given as two separate UTF-8 text files, with one segment per line. The dataset contains existing data sourced for the DAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into isiXhosa project. NOTE: Version 2.0 has been processed in the same way as the other Autshumato resources. Content: 109,940 segments; 1,745,236 English words; 1,264,390 isiXhosa words.
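Since the two files are described as aligned segment by segment, they can be read in lockstep. The following is a minimal sketch only, not part of the resource: the function names are hypothetical, and it assumes that line N of the English file corresponds to line N of the isiXhosa file.

```python
from itertools import zip_longest


def pair_segments(src_lines, tgt_lines):
    """Yield (source, target) pairs from two line-aligned iterables of text.

    Raises ValueError if one side runs out before the other, since that
    would mean the two files are no longer aligned.
    """
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            raise ValueError(f"segment counts differ at line {n}")
        yield src.rstrip("\n"), tgt.rstrip("\n")


def read_parallel(src_path, tgt_path):
    """Read two UTF-8 files (paths supplied by the caller) into a list of pairs."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        return list(pair_segments(src, tgt))
```

Checking the number of returned pairs against the segment count stated in the record is a quick way to confirm that a download is complete.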

Autshumato Monolingual isiXhosa Monolingual corpus

Author: McKellar Cindy
Publication venue: North-West University - Centre for Text Technology (CTexT)
Publication date: 10/06/2025

Monolingual corpus for isiXhosa. The data is given as a single UTF-8 text file, with one segment per line. The dataset contains existing data sourced for the DAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into isiXhosa project. NOTE: Version 2.0 has been processed in the same way as the other Autshumato resources. Content: 341,330 segments; 4,328,245 isiXhosa words.

Annotated Short Mystery Novels Data Set

Authors: Heyns Nuette; van Zaanen Menno
Publication venue: Menno van Zaanen
Publication date: 20/08/2025

This data set consists of ten annotated short mystery novels (whodunits). The novels, written in English between 1891 and 1924, range from 2,000 to 10,000 words each. This length makes them long enough to capture the full narrative structure of a whodunit while remaining feasible for manual annotation. Unlike data sets such as those developed in the SANTA project, which annotated shorter texts (under 2,000 words) for narrative levels and scene segmentation (Reiter, 2019), this data set contains annotated full texts to uncover complete narrative structures.

Morphologically annotated corpus for isiNdebele

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in isiNdebele, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is given as txt files. Each line consists of a token and the corresponding morphological analysis, tab separated. The file for isiNdebele contains a total of approximately 42,335 tokens. All the data has been automatically converted, then manually checked and re-annotated where necessary by linguistic experts, and quality controlled. Please see the included protocol for more details on the morphological tags used.
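The one-token-per-line, tab-separated layout described above can be split with plain string operations. This is a hedged sketch: the function names are hypothetical, the analysis is kept as an opaque string because the tag format is defined in the included protocol, and skipping blank lines is an assumption about the files.

```python
def parse_annotated_line(line):
    """Split one 'token<TAB>analysis' line into a (token, analysis) tuple."""
    token, sep, analysis = line.rstrip("\n").partition("\t")
    if not sep:
        raise ValueError(f"no tab separator in line: {line!r}")
    return token, analysis


def parse_annotated_lines(lines):
    """Parse an iterable of lines, skipping blank ones (an assumption)."""
    return [parse_annotated_line(line) for line in lines if line.strip()]
```

The same sketch would apply to the other morphologically annotated corpora in this list, whose records describe the identical layout.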

IsiZulu Second Language Learner Speech Corpus

Authors: O'Neil Alexandra; Hjortnaes Nils; Nkosi Zinhle; Ndlovu Thulile; Mlondo Zanele; Pewa Ngami Phumzile; Tyers Francis
Publication venue: Indiana University
Publication date: 01/01/2024

This corpus is designed to help evaluate the performance of pronunciation feedback tools for second language learning. It comprises gold standard recordings from isiZulu teachers (2,493 recordings) and recordings from isiZulu L2 learners that have been annotated by isiZulu teachers for phonemic and tonal pronunciation errors (9,639 recordings). The accompanying database and TSV file include the teacher annotations and demographic information.

POS annotated corpus in 5 different genres for Sepedi

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

This corpus contains POS annotated data in 5 different genres for Sepedi. The text types included are:
- CAPS gr12 (Academic) - https://www.education.gov.za/Curriculum/NationalSeniorCertificate(NSC)Examinations.aspx
- PhD Theses (Academic) - for Sepedi https://repository.up.ac.za/
- Magazines (Non-Academic) - CTexT acquired data from Pula Imvula
- News (Non-Academic) - for Sepedi CTexT acquired data
- Novels (Fiction) - SADiLaR acquired data from OUP and Shuter and Shooter
For Sepedi, the data was tagged using the NCHLT webservices. The POS tags were then converted to the latest POS tag set (see protocol). The data is given as txt files where each line contains a token and the corresponding POS tag, tab separated. Each text type data file contains approximately 5,000 tokens, amounting to a total of 25,000 tokens per language. Please see the protocols for more details on the POS tags used. Contents: Sepedi CAPS gr12 - 6,634 tokens, Sepedi PhD Theses - 7,395 tokens, Sepedi Magazines - 5,547 tokens, Sepedi News - 8,782 tokens, Sepedi Novels - 6,924 tokens. Total Sepedi: 30,158 tokens.
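Because each text type is a separate file with one token and tag per line, per-file token counts and tag frequencies can be recomputed and compared with the figures in the record. A minimal sketch, assuming blank lines carry no token; the function name and the placeholder tags in the usage example are hypothetical and not taken from the protocol.

```python
from collections import Counter


def tag_counts(lines):
    """Count POS tags in an iterable of 'token<TAB>tag' lines.

    Blank lines are skipped. A line without a tab is counted under the
    empty-string tag so that malformed lines remain visible.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        _token, _sep, tag = line.partition("\t")
        counts[tag] += 1
    return counts


# Usage with placeholder data: the total token count is the sum of all tag counts.
sample = ["w1\tTAG_A\n", "w2\tTAG_B\n", "\n", "w3\tTAG_A\n"]
counts = tag_counts(sample)
total_tokens = sum(counts.values())
```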

Morphologically annotated corpus for isiZulu

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in isiZulu, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is given as txt files. Each line consists of a token and the corresponding morphological analysis, tab separated. The file for isiZulu contains a total of 45,933 tokens. All the data has been automatically converted, then manually checked and re-annotated where necessary by linguistic experts, and quality controlled. Please see the included protocol for more details on the morphological tags used.

Morphologically annotated corpus for Setswana

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in Setswana, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is given as txt files. Each line consists of a token and the corresponding morphological analysis, tab separated. The file for Setswana contains a total of 72,609 tokens. All the data has been automatically converted, then manually checked and re-annotated where necessary by linguistic experts, and quality controlled. Please see the included protocol for more details on the morphological tags used.

POS annotated corpus with 5 different text types for isiZulu

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

This is a POS annotated corpus with 5 different text types for isiZulu. The text types included are:
- CAPS gr12 (Academic) - https://www.education.gov.za/Curriculum/NationalSeniorCertificate(NSC)Examinations.aspx
- PhD Theses (Academic) - for isiZulu https://researchspace.ukzn.ac.za/, for Sepedi https://repository.up.ac.za/
- Magazines (Non-Academic) - CTexT acquired data from Pula Imvula
- News (Non-Academic) - for isiZulu Isolezwe content sourced from Leipzig corpus, for Sepedi CTexT acquired data
- Novels (Fiction) - SADiLaR acquired data from OUP and Shuter and Shooter
For isiZulu, the data was annotated with the Core Tech POS tagger developed during SADiLaR II. The data is given as txt files where each line contains a token and the corresponding POS tag, tab separated. Each text type data file contains approximately 5,000 tokens, amounting to a total of 25,000 tokens per language. Please see the protocol for more details on the POS tags used. Contents: isiZulu CAPS gr12 - 3,634 tokens, isiZulu PhD Theses - 5,716 tokens, isiZulu Magazines - 3,658 tokens, isiZulu News - 5,974 tokens, isiZulu Novels - 5,909 tokens. Total: 21,233 tokens.

Morphologically annotated corpus for Tshivenḓa

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in Tshivenḓa, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is given as txt files. Each line consists of a token and the corresponding morphological analysis, tab separated. The file for Tshivenḓa contains a total of 66,487 tokens. All the data has been automatically converted, then manually checked and re-annotated where necessary by linguistic experts, and quality controlled. Please see the included protocol for more details on the morphological tags used.

8 full texts; 536 metadata records. Updated in last 30 days.
