1,720,970 research outputs found

    CTexT Afrikaans GloVe Word Embeddings

    No full text
    The CTexT Afrikaans GloVe Word Embeddings is a 300-dimensional Afrikaans embedding model based on the Global Vectors architecture (Pennington et al., 2014) that provides real-valued vector representations for Afrikaans text. The embedding model was trained on a corpus of 230 million words.

    CTexT fastText Skipgram String Embeddings

    No full text
    The CTexT Afrikaans fastText Skipgram String Embeddings is a 300-dimensional Afrikaans embedding model based on the Skipgram fastText architecture that provides real-valued vector representations for Afrikaans text. The embedding was trained on a corpus of 230 million words.

    CTexT Afrikaans FLAIR Part of Speech tagger model

    No full text
    The CTexT Afrikaans FLAIR Part of Speech tagger model is a neural part-of-speech tagger model based on the FLAIR framework (Akbik et al. 2019), and includes Afrikaans GloVe (Pennington et al., 2014) and FLAIR embeddings (Akbik et al. 2018) from the CTexT Afrikaans word and string embeddings. The model is trained on a collection of 100 000 part-of-speech annotated tokens, including the NCHLT Afrikaans annotated data.

    CTexT Afrikaans FLAIR String Embeddings

    No full text
    The CTexT Afrikaans FLAIR String Embeddings are two Afrikaans embedding models based on the FLAIR architecture (Akbik et al. 2018, 2019) that provide real-valued vector representations for Afrikaans text. The embeddings were trained on a corpus of 230 million words.

    CTexT Afrikaans FLAIR Named Entity Recognition model

    No full text
    The CTexT Afrikaans FLAIR Named Entity Recognition model is a neural NER model based on the FLAIR framework (Akbik et al. 2019), and includes Afrikaans fastText (Bojanowski et al., 2017) and FLAIR embeddings (Akbik et al. 2018) from the CTexT Afrikaans word and string embeddings. The model is trained on the NCHLT Afrikaans Named Entity Annotated Corpus.

    CTexT Afrikaans fastText CBoW String Embeddings

    No full text
    The CTexT Afrikaans fastText CBoW String Embeddings is a 300-dimensional Afrikaans embedding model based on the Continuous Bag of Words fastText architecture that provides real-valued vector representations for Afrikaans text. The embedding was trained on a corpus of 230 million words.

    Designing a South African multilingual learner corpus of academic texts (SAMuLCAT)

    Full text link
    This article provides an overview of the process and initial outcomes of designing a multilingual corpus of academic texts produced by university students with different mother tongues in South Africa, with a view to making it available as an open resource for pedagogical applications and research. We first give an overview of the history of corpus development for pedagogical purposes worldwide, with particular emphasis on learner corpora, and highlight the absence of a South African corpus of academic learner texts. Thereafter, the objectives of the corpus project are outlined. The remainder of the article describes and justifies the design features of the corpus as well as the process of setting up the data management system to facilitate the collection of the learner texts and their integration with the metadata. We conclude with a summary of the current status of the project, including the limitations, and a preview of the way forward. This research was made possible with support from the South African Centre for Digital Language Resources (SADiLaR). https://www.tandfonline.com/loi/rlms20 2020-07-22 hj2019 Unit for Academic Literac

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, when the same number of top-ranked authors is selected and analyzed, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting. Reasons for these effects are discussed.
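
    The difference between the two counting schemes can be illustrated with a minimal sketch. It is not the study's implementation: the function name `cocitation_counts`, the `max_authors` parameter, and the sample data are hypothetical, and the sketch assumes that two authors are co-cited once for every citing paper whose reference list contains both of them (so co-authors of the same cited work also count as co-cited, a simplification).

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations over a set of citing papers.

    citing_papers: list of reference lists; each reference is a list of
        author names in byline order (hypothetical input format).
    max_authors: how many leading authors of each cited work to count
        (1 = first-author counting, 5 = first-five-authors counting).
    """
    counts = Counter()
    for references in citing_papers:
        # Collect every author this citing paper is considered to cite.
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Each unordered pair of cited authors is co-cited once per paper.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Invented sample data: two citing papers and their reference lists.
papers = [
    [["Smith", "Jones"], ["Lee", "Smith"]],
    [["Smith", "Jones"], ["Lee"]],
]

first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

    With first-author counting, only the pair (Lee, Smith) appears; counting the first five authors also brings Jones into the picture, which is the kind of additional author linkage the abstract refers to.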

    Exploring Afrikaans word embeddings with analogies and nearest neighbours

    Full text link
    This paper presents an exploration of word embeddings for Afrikaans using the analogy and nearest-neighbour methodologies. We compare the results on three types of embeddings (fastText, FLAIR and GloVe) on a novel analogy data set for Afrikaans, inspired by the Bigger Analogy Test Set: BATS (Gladkova et al. 2016). Our analysis shows that for Afrikaans, similar to English, the types of embeddings influence the quality of analogies found for different linguistic tasks. Our investigation also demonstrates, however, that these Afrikaans embeddings do not encode as clear a linguistic representation as English embeddings do. The exact reason for this is subject to future work, but the added morphological complexity and the lack of data most likely play a role.
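
    As a rough illustration of what analogy and nearest-neighbour queries over embeddings look like, the sketch below uses cosine similarity and the common vector-offset formulation of analogies. It is an assumption that the paper's evaluation resembles this; the three-dimensional `toy_vectors` are invented for the example and are not taken from the CTexT embeddings, and the function names are hypothetical.

```python
import numpy as np

# Invented toy vectors; real embeddings would have e.g. 300 dimensions.
toy_vectors = {
    "man": np.array([1.0, 0.0, 0.0]),
    "vrou": np.array([1.0, 1.0, 0.0]),
    "koning": np.array([1.0, 0.0, 1.0]),
    "koningin": np.array([1.0, 1.0, 1.0]),
    "appel": np.array([0.0, 0.2, 0.1]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest_neighbours(word, vectors, k=3):
    """Return the k words most similar to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the vector-offset method."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the three query words, as is customary for this method.
    candidates = {w: v for w, v in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


answer = solve_analogy("man", "vrou", "koning", toy_vectors)
neighbours = nearest_neighbours("koning", toy_vectors, k=2)
```

    On these toy vectors the analogy query returns "koningin", and the closest neighbour of "koning" is also "koningin". With real embeddings the quality of such answers depends on the embedding type and the linguistic relation being tested, as the abstract notes.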