SADiLaR Language Resource Repository

536 research outputs found

Morphologically annotated corpus for Sepedi

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in Sepedi, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is provided as txt files. Each line consists of a token and its morphological analysis, separated by a tab. The Sepedi file contains a total of 73,031 tokens. All data was converted automatically, then manually checked and re-annotated where necessary by linguistic experts, as well as quality controlled. Please see the included protocol for more details on the morphological tags used.
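The line format described above (a token, a tab, then the morphological analysis) can be read with a few lines of Python. This is a minimal sketch, not part of the corpus distribution: the function name `read_annotated_tokens` is hypothetical, the sample rows are placeholders rather than real corpus data or real tags, and skipping blank lines is an assumption about the files.

```python
import io
from collections import Counter


def read_annotated_tokens(lines):
    """Yield (token, analysis) pairs from tab-separated lines.

    Assumes one token per line and exactly the layout described in the
    record: token, tab, morphological analysis. Blank lines are skipped
    (an assumption; the real files may not contain any).
    """
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        token, sep, analysis = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_number}: expected a tab separator")
        yield token, analysis


# Placeholder rows standing in for a corpus file; not real corpus content.
sample = io.StringIO(
    "token1\tANALYSIS_A\n"
    "token2\tANALYSIS_B\n"
    "\n"
    "token1\tANALYSIS_A\n"
)

pairs = list(read_annotated_tokens(sample))
token_counts = Counter(token for token, _ in pairs)
print(len(pairs), "annotated tokens;", len(token_counts), "distinct tokens")
```

For a downloaded file, the same function can be passed an open file handle instead of the `io.StringIO` sample. The file encoding is not stated in the record, so UTF-8 would be an assumption to verify against the included protocol.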

Morphologically annotated corpus for isiXhosa

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in isiXhosa, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is provided as txt files. Each line consists of a token and its morphological analysis, separated by a tab. The isiXhosa file contains a total of 46,465 tokens. All data was converted automatically, then manually checked and re-annotated where necessary by linguistic experts, as well as quality controlled. Please see the included protocol for more details on the morphological tags used.

Morphologically annotated corpus for Xitsonga

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in Xitsonga, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is provided as txt files. Each line consists of a token and its morphological analysis, separated by a tab. The Xitsonga file contains a total of 69,584 tokens. All data was converted automatically, then manually checked and re-annotated where necessary by linguistic experts, as well as quality controlled. Please see the included protocol for more details on the morphological tags used.

Morphologically annotated corpus for Sesotho

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in Sesotho, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is provided as txt files. Each line consists of a token and its morphological analysis, separated by a tab. The Sesotho file contains a total of 73,727 tokens. All data was converted automatically, then manually checked and re-annotated where necessary by linguistic experts, as well as quality controlled. Please see the included protocol for more details on the morphological tags used.

Morphologically annotated corpus for Siswati

Author: Gaustad Tanja
Publication venue: Centre for Text Technology (CTexT)
Publication date: 31/01/2024

NCHLT corpus of morphologically annotated tokens in Siswati, converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is provided as txt files. Each line consists of a token and its morphological analysis, separated by a tab. The Siswati file contains a total of 43,568 tokens. All data was converted automatically, then manually checked and re-annotated where necessary by linguistic experts, as well as quality controlled. Please see the included protocol for more details on the morphological tags used.

NCHLT isiZulu FLAIR-backward embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Contextual word/string embeddings for the backward flavour of the FLAIR architecture (Akbik et al., 2018). The embeddings provide real-valued vector representations for isiZulu text.

NCHLT Xitsonga fastText-CBoW embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word and subword embeddings for the continuous bag of words (CBoW) flavour of the fastText architecture (Bojanowski et al., 2017). The embeddings provide real-valued vector representations for Xitsonga text.

NCHLT Sesotho FLAIR-backward embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Contextual word/string embeddings for the backward flavour of the FLAIR architecture (Akbik et al., 2018). The embeddings provide real-valued vector representations for Sesotho text.

NCHLT Setswana word2vec-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embeddings for the Skipgram flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embeddings provide real-valued vector representations for Setswana text.

NCHLT isiNdebele FLAIR-forward embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Contextual word/string embeddings for the forward flavour of the FLAIR architecture (Akbik et al., 2018). The embeddings provide real-valued vector representations for isiNdebele text.

8 full texts

536 metadata records

Updated in last 30 days.
