    Decoding Byzantine book epigrams : an exploration of machine-assisted extraction of formulaic material

    This paper proposes a machine-assisted methodology for identifying and extracting formulaic sequences from a subset of the Database of Byzantine Book Epigrams (DBBE). The methodology involves conceptualising formulaicity within the DBBE corpus, pre-processing the textual data and extracting n-grams from it, and refining the output before interpreting the results. Systematic application of this methodology yields some initial insights into the characteristics of formulaic language within the Byzantine book epigram tradition. Representative findings illustrate the nature of recurring patterns, cases of creative elaboration, and their content. This initial exploration aims to facilitate a deeper understanding of the concept of formulaicity in Byzantine book epigrams; while computational analysis provides a quantitative perspective, linguistic and philological research is necessary for a more nuanced understanding. Future research directions include refining the methodology and expanding the scope of analysis beyond the current subset of the DBBE. Overall, this study lays the groundwork for further research on this rich book epigram tradition.
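The n-gram extraction step mentioned in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenisation and a simple frequency threshold; the function names, the threshold, and the toy English verses are hypothetical and are not taken from the paper or from the DBBE.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def recurring_ngrams(verses, n=2, min_count=2):
    """Count n-grams across verses and keep those seen at least min_count times.

    Assumption: lowercasing and whitespace splitting stand in for the
    pre-processing step; a real pipeline would normalise the Greek text.
    """
    counts = Counter()
    for verse in verses:
        tokens = verse.lower().split()
        counts.update(extract_ngrams(tokens, n))
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Toy placeholder verses (not DBBE data).
verses = ["the hand that wrote", "the hand that rots", "a hand that wrote"]
formulaic = recurring_ngrams(verses, n=2, min_count=2)
```

Candidates that survive the threshold would still need the refinement and philological interpretation the abstract describes; frequency alone does not establish formulaicity.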

    How relevant is part-of-speech information to compute similarity between Greek verses in a graph database?

    This paper presents the automatic linguistic analysis of the Database of Byzantine Book Epigrams (DBBE) on the one hand, and its representation and integration in a graph database on the other. Firstly, we provide a comprehensive description of the DBBE data for which we want to provide a complete morphological analysis. The presented methodology explores the possibilities of fine-tuning the DBBErt transformer-based language model, which was trained on pre-Modern and Modern Greek. Secondly, the automatically annotated epigrams are integrated in a graph database, a new way to represent the relatedness of this entangled corpus. With the graph database, we can compute similarity between words, verses and epigrams. Given the scope of this paper, we computed three measures between the verses: a complete orthographic similarity, a similarity based on the automatically assigned part-of-speech information, and a final similarity measure that combines both orthography and part-of-speech information. The results of these similarity measures provide scholars with new visual representations of relations between (parts of) texts, which is beneficial for new critical editions and commentaries.
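The three similarity measures named in this abstract can be sketched as follows. The abstract does not specify how the measures are computed, so this sketch assumes the ratio from Python's `difflib.SequenceMatcher` for both the character level and the tag level, and a weighted average for the combination. The function names, the `weight` parameter, and the tag labels are all hypothetical.

```python
from difflib import SequenceMatcher


def orthographic_similarity(verse_a, verse_b):
    """Character-level similarity in [0, 1] between two verse strings."""
    return SequenceMatcher(None, verse_a, verse_b).ratio()


def pos_similarity(tags_a, tags_b):
    """Similarity in [0, 1] between two part-of-speech tag sequences."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def combined_similarity(verse_a, verse_b, tags_a, tags_b, weight=0.5):
    """Weighted average of the orthographic and part-of-speech scores.

    Assumption: a linear combination; the paper may combine them differently.
    """
    ortho = orthographic_similarity(verse_a, verse_b)
    pos = pos_similarity(tags_a, tags_b)
    return weight * ortho + (1 - weight) * pos


# Identical spelling but entirely different tag sequences.
score = combined_similarity("abc", "abc", ["NOUN", "VERB"], ["ADJ", "ADV"])
```

In a graph database, scores like these would typically be stored as weights on edges between verse nodes, which is one way to obtain the visual representations the abstract mentions.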

    D-TERMINE : data-driven term extraction methodologies investigated

    Automatic term extraction is a task in the field of natural language processing that aims to automatically identify terminology in collections of specialised, domain-specific texts. Terminology is defined as domain-specific vocabulary and consists of both single-word terms (e.g., corpus in the field of linguistics, referring to a large collection of texts) and multi-word terms (e.g., automatic term extraction). Terminology is a crucial part of specialised communication since terms can concisely express very specific and essential information. Therefore, quickly and automatically identifying terms is useful in a wide range of contexts. Automatic term extraction can be used by language professionals to find which terms are used in a domain and how, based on a relevant corpus. It is also useful for other tasks in natural language processing, including machine translation. One of the main difficulties with term extraction, both manual and automatic, is the vague boundary between general language and terminology. When different people identify terms in the same text, they will invariably produce different results. Consequently, creating manually annotated datasets for term extraction is a costly, time- and effort-consuming task. This can hinder research on automatic term extraction, which requires gold standard data for evaluation, preferably even in multiple languages and domains, since terms are language- and domain-dependent. Moreover, supervised machine learning methodologies rely on annotated training data to automatically deduce the characteristics of terms, so this knowledge can be used to detect terms in other corpora as well. Consequently, the first part of this PhD project was dedicated to the construction and validation of a new dataset for automatic term extraction, called ACTER – Annotated Corpora for Term Extraction Research. Terms and Named Entities were manually identified with four different labels in twelve specialised corpora. The dataset contains corpora in three languages and four domains, leading to a total of more than 100k annotations, made over almost 600k tokens. It was made publicly available during a shared task we organised, in which five international teams competed to automatically extract terms from the same test data. This illustrated how ACTER can contribute towards advancing the state of the art. It also revealed that there is still a lot of room for improvement, with moderate scores even for the best teams. Therefore, the second part of this dissertation was devoted to researching how supervised machine learning techniques might contribute. The traditional, hybrid approach to automatic term extraction relies on a combination of linguistic and statistical clues to detect terms. An initial list of unique candidate terms is extracted based on linguistic information (e.g., part-of-speech patterns) and this list is filtered based on statistical metrics that use frequencies to measure whether a candidate term might be relevant. The result is a ranked list of candidate terms. HAMLET – Hybrid, Adaptable Machine Learning Approach to Extract Terminology – was developed based on this traditional approach and applies machine learning to efficiently combine more information than could be used with a rule-based approach. This makes HAMLET less susceptible to typical issues like low recall on rare terms. While domain and language have a large impact on results, robust performance was reached even without domain-specific training data, and HAMLET compared favourably to a state-of-the-art rule-based system. Building on these findings, the third and final part of the project was dedicated to investigating methodologies that are even further removed from the traditional approach. Instead of starting from an initial list of unique candidate terms, potential terms were labelled immediately in the running text, in their original context. Two sequential labelling approaches were developed, evaluated and compared: a feature-based conditional random fields classifier, and a recurrent neural network with word embeddings. The latter outperformed the feature-based approach and was compared to HAMLET as well, obtaining comparable and even better results. In conclusion, this research resulted in an extensive, reusable dataset and three distinct new methodologies for automatic term extraction. The elaborate evaluations went beyond reporting scores and revealed the strengths and weaknesses of the different approaches. This identified challenges for future research, since some terms, especially ambiguous ones, remain problematic for all systems. However, overall, results were promising and the approaches were complementary, revealing great potential for new methodologies that combine multiple strategies.
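The traditional hybrid pipeline described in this abstract (a linguistic filter followed by a statistical ranking) can be sketched roughly as below. This is not HAMLET or any system from the dissertation. The part-of-speech patterns, the smoothed relative-frequency ratio used as the score, and all names are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical part-of-speech patterns accepted as candidate terms.
PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}


def candidate_terms(tagged_tokens, max_len=2):
    """Linguistic step: collect word sequences whose tags match a pattern."""
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            if tuple(tag for _, tag in window) in PATTERNS:
                candidates.append(" ".join(word.lower() for word, _ in window))
    return candidates


def rank_candidates(domain_tagged, reference_counts, reference_size):
    """Statistical step: rank candidates by how much more frequent they are
    in the domain corpus than in a general reference corpus.

    Assumption: add-one smoothing on the reference frequency.
    """
    counts = Counter(candidate_terms(domain_tagged))
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        domain_rel = count / total
        reference_rel = (reference_counts.get(term, 0) + 1) / (reference_size + 1)
        scores[term] = domain_rel / reference_rel
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input: (word, tag) pairs and made-up reference counts.
domain = [("automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN"),
          ("is", "VERB"), ("a", "DET"), ("task", "NOUN")]
ranked = rank_candidates(domain, {"task": 50}, 1000)
```

The sequential labelling approaches in the third part of the project skip the candidate list entirely and tag each token in context, which is why they are described as further removed from this pipeline.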

    Koala: An Index for Quantifying Overlaps with Pre-training Corpora

    In recent years, more attention has been placed on probing the role of pre-training data in the downstream behaviour of Large Language Models (LLMs). Despite its importance, there is no public tool that supports such analysis of pre-training corpora at large scale. To help research in this space, we launch Koala, a searchable index over large pre-training corpora using lossless compressed suffix arrays with a highly efficient compression rate and search support. In its first release we index the public portion of the OPT 175B, GPT-3, GPT-Neo, LLaMA, BERT, ELECTRA, RoBERTa, and XLNet pre-training corpora. Koala provides a framework for forensic analysis of current and future benchmarks, as well as for assessing the degree of memorization in the output of LLMs. Koala is available for public use at https://koala-index.erc.monash.edu/
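The idea of searching a corpus through a suffix array can be shown with a small sketch. Koala uses compressed suffix arrays; the version below is a plain, uncompressed suffix array built naively, intended only to show how binary search over sorted suffixes counts occurrences of a query. The function names are hypothetical and this is not Koala's implementation or API.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction for illustration only; it is far too slow and
    memory-hungry for real pre-training corpora.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of pattern using two binary searches."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


text = "banana"
suffix_array = build_suffix_array(text)
hits = count_occurrences(text, suffix_array, "ana")
```

Counting how often a benchmark sentence or a model output appears in an indexed corpus is the kind of query that overlap and memorization analyses rely on.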