SADiLaR Language Resource Repository

536 research outputs found

NCHLT Tshivenḓa RoBERTa language model

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and is not fine-tuned for any downstream task. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences for Tshivenḓa text.

NCHLT isiXhosa FLAIR-backward embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Contextual word/string embeddings for the backward flavour of the FLAIR architecture (Akbik et al., 2018). The embeddings provide real-valued vector representations for isiXhosa text.

Autshumato Monolingual English Corpus

Author: McKeller Cindy
Publication venue: CTexT® (Centre for Text Technology, North-West University)
Publication date: 30/10/2023

Monolingual corpus for South African English. The data is given as a single UTF-8 text file, with one segment per line. The data was specifically selected and formatted for use in training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.
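
The file layout described above (one UTF-8 file, one segment per line) can be read with a few lines of standard-library Python. This is a minimal sketch, not part of the resource itself: the file name `autshumato_en.txt` and the helper `iter_segments` are hypothetical, and skipping blank lines is an assumption about what light clean-up a reuse might want.

```python
from pathlib import Path
from typing import Iterator


def iter_segments(path: Path) -> Iterator[str]:
    """Yield segments from a UTF-8 text file that stores one segment per line.

    Surrounding whitespace is stripped and blank lines are skipped
    (an assumed clean-up step, not something the corpus prescribes).
    """
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            segment = line.strip()
            if segment:
                yield segment


if __name__ == "__main__":
    # Hypothetical local file name; replace with the path of the downloaded corpus.
    corpus_path = Path("autshumato_en.txt")
    if corpus_path.exists():
        total = sum(1 for _ in iter_segments(corpus_path))
        print(f"{total} segments read from {corpus_path}")
```

Reading lazily through a generator keeps memory use flat, which helps if the corpus file is large; any task-specific filtering can be added inside the loop.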

NCHLT isiXhosa GloVe embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embedding model based on the Global Vectors (GloVe) architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations for isiXhosa text.

NCHLT Sepedi word2vec-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embeddings for the Skipgram flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embeddings provide real-valued vector representations for Sepedi text.

NCHLT Xitsonga word2vec-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embeddings for the Skipgram flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embeddings provide real-valued vector representations for Xitsonga text.

NCHLT Sesotho word2vec-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word embeddings for the Skipgram flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embeddings provide real-valued vector representations for Sesotho text.

NCHLT Sepedi FLAIR-forward embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Contextual word/string embeddings for the forward flavour of the FLAIR architecture (Akbik et al., 2018). The embeddings provide real-valued vector representations for Sepedi text.

South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) version 2023-03

Author: Van Dyk Tobie
Publication venue: SADiLaR
Publication date: 2023

The South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) is a multi-genre, multi-level learner corpus developed by the Inter-institutional Centre for Language Development and Assessment (ICELDA) in collaboration with the South African Centre for Digital Language Resources (SADiLaR). The corpus includes shorter and longer texts from an array of genres, different fields of study, and all levels of study. Several institutions of higher education in the ICELDA network have contributed to the corpus and continue to do so. Ethical clearance to collect data has been granted at all partnering institutions; this includes informed consent by all students who contributed to SAMuLCAT. The corpus is augmented by two sets of metadata. The first set consists mainly of biographical detail about students (completed by the students themselves); the second set gives more information on the task types and texts included in the corpus (completed by, e.g., lecturers and writing centre staff). Data can be filtered through the metadata filters available in the search functionality of the corpus. The corpus is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and is open source. More information about the design of the corpus and the metadata available in it can be found in the following article: Carstens, A. and Eiselen, R., 2019. Designing a South African multilingual learner corpus of academic texts (SAMuLCAT). Language Matters, 50(1), pp.64-83. The Afrikaans part of the corpus is automatically annotated for lemmas and parts of speech using the available NCHLT Text lemmatisers and part-of-speech taggers. Additional information is available here: https://hlt.nwu.ac.za/about No quality control of the automatic annotations was performed. The English data is annotated using the open-source NLP4J library available here: https://emorynlp.github.io/nlp4j/ DISCLAIMER: For a description of SADiLaR's privacy stance and practices, please see the privacy statement: https://sadilar.org/index.php/en/394-privacy-statemen

NCHLT Sepedi fastText-Skipgram embeddings

Author: Roald Eiselen
Publication venue: North-West University; Centre for Text Technology (CTexT)
Publication date: 01/05/2023

Static word and subword embeddings for the Skipgram flavour of the fastText architecture (Bojanowski et al., 2017). The embeddings provide real-valued vector representations for Sepedi text.

8 full texts; 536 metadata records; updated in the last 30 days.
