SADiLaR Language Resource Repository

536 research outputs found

NCHLT Siswati GloVe embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations for Siswati text
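Static embeddings of this kind are typically compared with cosine similarity. The sketch below illustrates the idea on made-up three-dimensional vectors; the names `toy_vectors`, `word_a`, `word_b` and `word_c` are placeholders and are not taken from the released model, whose file format and dimensionality are not stated here.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat similarity as 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only (not real embedding values).
toy_vectors = {
    "word_a": [1.0, 0.0, 1.0],
    "word_b": [1.0, 0.0, 1.0],
    "word_c": [0.0, 1.0, 0.0],
}

same = cosine_similarity(toy_vectors["word_a"], toy_vectors["word_b"])
different = cosine_similarity(toy_vectors["word_a"], toy_vectors["word_c"])
```

With real embeddings, the vectors would be looked up by word form from the downloaded model rather than written by hand.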

NCHLT Setswana word2vec-CBOW embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embeddings for the continuous bag of words (CBoW) flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embedding provides real-valued vector representations for Setswana text

NCHLT Tshivenḓa fastText-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word and subword embeddings for the Skipgram flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for Tshivenḓa text
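The "subword" part of fastText refers to character n-grams: each word is wrapped in boundary markers `<` and `>` and broken into overlapping n-grams, whose vectors are combined with the vector for the whole word. The sketch below shows only the n-gram extraction step. The function name `char_ngrams` is hypothetical, and the 3 to 6 range is the default described by Bojanowski et al. (2017); the settings used for this particular model are not stated here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word, fastText style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from word-internal character sequences.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


# Trigram example from Bojanowski et al. (2017).
trigrams = char_ngrams("where", min_n=3, max_n=3)
```

Because unseen words still share n-grams with known words, a subword model can produce a vector for a word form that never occurred in the training data.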

NCHLT Sepedi fastText-CBoW embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word and subword embeddings for the continuous bag of words (CBoW) flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for Sepedi text

NCHLT Siswati RoBERTa language model

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and is not fine-tuned for any downstream process. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences for Siswati text

Autshumato Monolingual Tshivenḓa Corpus

Author: McKellar Cindy
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 12/12/2023

Monolingual corpus for Tshivenḓa. The data is given as a single UTF-8 text file, with each segment on a newline
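A file laid out this way can be streamed one segment at a time. The sketch below assumes only what the description states (UTF-8, one segment per line); the function name `read_segments` and any file path passed to it are placeholders, and skipping blank lines is an assumption rather than a documented property of the corpus.

```python
def read_segments(path):
    """Yield non-empty segments from a UTF-8 file with one segment per line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # strip() removes the trailing newline (and '\r' on Windows-style files).
            segment = line.strip()
            if segment:  # assumption: blank lines carry no segment
                yield segment


def count_segments(path):
    """Count segments without loading the whole corpus into memory."""
    return sum(1 for _ in read_segments(path))
```

Reading lazily with a generator keeps memory use flat regardless of corpus size.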

NCHLT isiNdebele fastText-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word and subword embeddings for the Skipgram flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued vector representations for isiNdebele text

NCHLT isiZulu GloVe embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations for isiZulu text

NCHLT isiNdebele word2vec-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embeddings for the Skipgram flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embedding provides real-valued vector representations for isiNdebele text

Child speech database for the South African context: Speech samples of typically developing Afrikaans and Sesotho sa Leboa-speaking children

Authors: Bornman Juan; De Wet Febe; Van Der Linde Jeannie
Publication venue: SADiLaR open call
Publication date: 30/06/2023

This dataset contains child speech samples from typically developing Afrikaans and Sesotho sa Leboa-speaking children in South Africa. The recordings, totaling at least 700 minutes per language, were collected in naturalistic interactions between children and trained speech-language therapists using standardized toys and books. Data were gathered in home and clinical settings to support linguistic analysis, speech transcription, and the development of automated tools for early identification and intervention in multilingual contexts. For more details, review the readme.txt for Afrikaans and Sesotho sa Lebo

8 full texts, 536 metadata records. Updated in last 30 days.