Charles University

Biblio at Institute of Formal and Applied Linguistics
    539 research outputs found

    Hierarchical Classification of Propaganda Techniques in Slavic Texts in Hyperbolic Space

    No full text
    Classification problems can often be tackled by modeling label hierarchies with broader categories in a graph and solving the task via node classification. While recent advances have shown that hyperbolic space is more suitable than Euclidean space for learning graph representations, this concept has yet to be applied to text classification, where node features first need to be extracted from text embeddings. This contribution to the Slavic NLP 2025 shared task on the multi-label classification of persuasion techniques in parliamentary debates and social media posts presents a prototype of such an architecture. We do not achieve state-of-the-art performance, but we outline the benefits of this hierarchical node classification approach and the advantages of hyperbolic graph embeddings.

    Mind the Gap: Diverse NMT Models for Resource-Constrained Environments

    No full text
    We present fast Neural Machine Translation models for 17 diverse languages, developed using Sequence-level Knowledge Distillation. Our selected languages span multiple language families and scripts, including low-resource languages. The distilled models achieve comparable performance while being 10 times faster than transformer-base and 35 times faster than transformer-big architectures. Our experiments reveal that the quality and capacity of the teacher model, as well as the language script, strongly influence the success of distillation. We also explore the effectiveness of multilingual students. We publicly release our code and models in our GitHub repository.

    ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation

    No full text
    We describe a reproduction of a human annotation experiment that was performed to evaluate the effectiveness of text style transfer systems (Reif et al., 2022). Despite our efforts to closely imitate the conditions of the original study, the results obtained differ significantly from those in the original study. We perform a statistical analysis of the results obtained, discuss the sources of these discrepancies in the study design, and quantify reproducibility. The reproduction followed the common approach adopted by the ReproHum project (Belz et al., 2025).

    How (un)faithful are explainable LLM-based NLG metrics?

    No full text
    Explainable NLG metrics are becoming a popular research topic; however, the faithfulness of the explanations they provide is typically not evaluated. In this work, we propose a testbed for assessing the faithfulness of span-based metrics by performing controlled perturbations of their explanations and observing changes in the final score. We show that several popular LLM evaluators do not consistently produce faithful explanations.

    OpusPocus: NMT Training Pipeline Manager

    No full text
    The aim of this tutorial is to present the functionality of the OpusPocus framework and demonstrate it on a set of practical examples.

    Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?

    No full text
    Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Using human evaluation is ideal but costly, as is common in other natural language processing (NLP) tasks; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine a set of existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of large language models (LLMs) as tools for TST evaluation. Our findings highlight that newly applied advanced NLP metrics and LLM-based evaluations provide better insights than existing TST metrics. Our oracle ensemble approaches show even more potential.

    SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

    No full text
    In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read, and the vocabulary reviewing and learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages (English, Chinese, and Polish), evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical and coherent, and provide better examples of word usage, than texts generated by the standard constrained beam search approach.

    Hotel Highlights: A Case Study in LLM Summarization and Evaluation

    No full text
    At trivago, we recently introduced Hotel Highlights: short AI-generated summaries of hotel descriptions and reviews that describe a hotel’s unique features. These highlights aim to help travelers choose the right hotel for their planned stay without the need to read multiple traveler reviews and accommodation descriptions. Therefore, ensuring the accuracy of these highlights is crucial. We conducted a human evaluation using a tool that allows annotators to mark the precise location of the error within the evaluated text. This evaluation revealed that 6% of summaries contained incorrect information, 21% contained uncheckable information, and 19% contained misleading information. We use the resulting error annotations to measure the agreement between automatic metrics and human judgment. One of the latest evaluation trends, using large language models as judges, proved impractical, as both GPT-4o and Gemini over-annotated and hallucinated errors that were not present in the text. We hypothesize this is because hotel descriptions and reviews are not represented in any existing summarization corpora, so the newly generated summaries are out of domain. We also investigated the limitations of current metrics and quality assessment approaches for detecting contradictions and hallucinations in AI-generated content. We examined various trained methods, including Natural Language Inference (NLI) and semantic similarity, to evaluate how well they align with human judgment. Both NLI and semantic similarity showed only a weak correlation with human judgment. Named entity recognition, evaluated using spaCy NER, had only a 5% correlation due to inconsistencies, false positives, and the fact that the majority of evaluated texts had no named entities. Finally, we also explored n-gram overlap between the hotel description and the summarized highlight. 1-gram overlap, i.e., overlap of independent words, showed a 63% correlation with human judgment. This makes it the simplest, yet the most effective evaluation method. For comparison, the harmonic mean of 1 to 4-gram overlap (BLEU without brevity penalty) and ROUGE-L Recall had correlations of 51% and 57%, respectively. Our findings highlight the challenges in using human evaluation and automatic metrics for quality assessment of AI-generated content. Despite various improvements, current trainable methods, including the use of state-of-the-art LLMs as judges, do not sufficiently align with human judgment. Surprisingly, the much simpler overlap-based metrics showed comparatively better performance.
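The 1-gram overlap mentioned in the Hotel Highlights abstract can be illustrated with a minimal sketch. It assumes lowercase word tokenization and clipped counts, measured as the fraction of summary words that also occur in the hotel description. The abstract does not state the exact tokenization or normalization used, so these choices, along with the function names `tokenize` and `unigram_overlap`, are illustrative assumptions rather than the authors' implementation.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, word characters only; the source does not
    # specify how the texts were tokenized.
    return re.findall(r"\w+", text.lower())


def unigram_overlap(summary, source):
    """Fraction of summary tokens that also occur in the source.

    Counts are clipped, so a word repeated in the summary is only
    credited as many times as it appears in the source.
    """
    summary_counts = Counter(tokenize(summary))
    source_counts = Counter(tokenize(source))
    total = sum(summary_counts.values())
    if total == 0:
        return 0.0
    matched = sum(
        min(count, source_counts[token])
        for token, count in summary_counts.items()
    )
    return matched / total


if __name__ == "__main__":
    description = "The hotel offers free wifi and a rooftop pool"
    print(unigram_overlap("free wifi and pool", description))
    print(unigram_overlap("free parking", "free wifi"))
```

A score near 1 means almost every word of the highlight is supported by the description; a low score flags words with no counterpart in the source, which is one plausible reason such a simple measure can track human judgments of hallucination.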

    Do My Eyes Deceive Me? A Survey of Human Evaluations of Hallucinations in NLG

    No full text
    Hallucinations are one of the most pressing challenges for large language models (LLMs). While numerous methods have been proposed to detect and mitigate them automatically, human evaluation continues to serve as the gold standard. However, these human evaluations of hallucinations show substantial variation in definitions, terminology, and evaluation practices. In this paper, we survey 64 studies involving human evaluation of hallucination published between 2019 and 2024, to investigate how hallucinations are currently defined and assessed. Our analysis reveals a lack of consistency in definitions and exposes several concerning methodological shortcomings. Crucial details, such as evaluation guidelines, user interface design, inter-annotator agreement metrics, and annotator demographics, are frequently under-reported or omitted altogether.

    OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

    No full text
    Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. We introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on individual error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains, and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

    58 full texts, 539 metadata records
    Updated in last 30 days.