Charles University

Biblio at Institute of Formal and Applied Linguistics

539 research outputs found

Hierarchical Classification of Propaganda Techniques in Slavic Texts in Hyperbolic Space

Authors: Pecina Pavel; Brückner Christopher
Publication date: 01/01/2025

Classification problems can often be tackled by modeling label hierarchies with broader categories in a graph and solving the task via node classification. While recent advances have shown that hyperbolic space is more suitable than Euclidean space for learning graph representations, this concept has yet to be applied to text classification, where node features first need to be extracted from text embeddings. Our contribution to the Slavic NLP 2025 shared task on the multi-label classification of persuasion techniques in parliamentary debates and social media posts is a prototype of such an architecture. We do not achieve state-of-the-art performance, but we outline the benefits of this hierarchical node classification approach and the advantages of hyperbolic graph embeddings.

Mind the Gap: Diverse NMT Models for Resource-Constrained Environments

Authors: O’Brien Dayyán; Variš Dušan; Tiedemann Jörg; De Gibert Bonet Ona
Publication date: 01/01/2025

We present fast Neural Machine Translation models for 17 diverse languages, developed using Sequence-level Knowledge Distillation. Our selected languages span multiple language families and scripts, including low-resource languages. The distilled models achieve comparable performance while being 10x faster than transformer-base and 35x faster than transformer-big architectures. Our experiments reveal that teacher model quality and capacity, as well as the language script, strongly influence distillation success. We also explore the effectiveness of multilingual students. We publicly release our code and models in our GitHub repository.

ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation

Authors: Schmidtová Patrícia; Onderková Kristýna; Dušek Ondřej; Lango Mateusz
Publication date: 01/01/2025

We describe a reproduction of a human annotation experiment that was performed to evaluate the effectiveness of text style transfer systems (Reif et al., 2022). Despite our efforts to closely imitate the conditions of the original study, our results differ significantly from the original ones. We perform a statistical analysis of our results, discuss the sources of these discrepancies in the study design, and quantify reproducibility. The reproduction followed the common approach to reproduction adopted by the ReproHum project (Belz et al., 2025).

How (un)faithful are explainable LLM-based NLG metrics?

Authors: Terentowicz Alex; Dušek Ondřej; Lango Mateusz
Publication date: 01/01/2025

Explainable NLG metrics are becoming a popular research topic; however, the faithfulness of the explanations they provide is typically not evaluated. In this work, we propose a testbed for assessing the faithfulness of span-based metrics by performing controlled perturbations of their explanations and observing changes in the final score. We show that several popular LLM evaluators do not consistently produce faithful explanations.

OpusPocus: NMT Training Pipeline Manager

Author: Variš Dušan
Publication date: 01/01/2025

The aim of this tutorial is to present the functionality of the OpusPocus framework and demonstrate it on a set of practical examples.

Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?

Authors: Mukherjee Sourabrata; McCrae John; Ojha Atul; Dušek Ondřej
Publication date: 01/01/2025

Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Using human evaluation is ideal but costly, as is common in other natural language processing (NLP) tasks; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine a set of both existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of large language models (LLMs) as tools for TST evaluation. Our findings highlight that newly applied advanced NLP metrics and LLM-based evaluations provide better insights than existing TST metrics. Our oracle ensemble approaches show even more potential.

SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

Authors: Kamzela Wiktor; Dušek Ondřej; Lango Mateusz
Publication date: 01/01/2025

In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read, and the vocabulary reviewing and learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages (English, Chinese, and Polish), evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical and coherent, and provide better examples of word usage, than texts generated by the standard constrained beam search approach.

Hotel Highlights: A Case Study in LLM Summarization and Evaluation

Authors: Schmidtová Patrícia; Dušek Ondřej; Mahamood Saad
Publication date: 01/01/2025

At trivago, we recently introduced Hotel Highlights – short AI-generated summaries of hotel descriptions and reviews that describe a hotel’s unique features. These highlights aim to help travelers choose the right hotel for their planned stay without the need to read multiple traveler reviews and accommodation descriptions. Therefore, ensuring the accuracy of these highlights is crucial. We conducted a human evaluation using a tool that allows annotators to mark the precise location of the error within the evaluated text. This evaluation revealed that 6% of summaries contained incorrect information, 21% contained uncheckable information, and 19% contained misleading information. We use the resulting error annotations to measure the agreement between automatic metrics and human judgment. One of the latest evaluation trends, using large language models as judges, proved impractical, as both GPT-4o and Gemini over-annotated and hallucinated errors that were not present in the text. We hypothesize this is because hotel descriptions and reviews are not represented in any existing summarization corpora, so the newly generated summaries are out of domain. We also investigated the limitations of current metrics and quality assessment approaches for detecting contradictions and hallucinations in AI-generated content. We examined various trained methods, including Natural Language Inference (NLI) and semantic similarity, to evaluate their effectiveness in aligning with human judgment. Both NLI and semantic similarity showed only a weak correlation with human judgment. Named entity recognition, evaluated using spaCy NER, had only a 5% correlation due to inconsistencies, false positives, and the fact that the majority of evaluated texts had no named entities. Finally, we also explored n-gram overlap between the hotel description and the summarized highlight. 1-gram overlap, i.e. independent words, showed a 63% correlation with human judgment. This makes it the simplest, yet the most effective evaluation method. For comparison, the harmonic mean of 1 to 4-gram overlap (BLEU without brevity penalty) and Rouge-L Recall had correlations of 51% and 57%, respectively. Our findings highlight the challenges in using human evaluation and automatic metrics for quality assessment of AI-generated content. Despite various improvements, current trainable methods, including the use of state-of-the-art LLMs as judges, do not sufficiently align with human judgment. Surprisingly, the much simpler overlap-based metrics showed comparatively better performance.

Do My Eyes Deceive Me? A Survey of Human Evaluations of Hallucinations in NLG

Authors: Gkatzia Dimitra; Same Fahime; Huidrom Rudali; Mahamood Saad; Calò Eduardo; Lango Mateusz; Schmidtová Patrícia; Zouhar Vilém; Dušek Ondřej; Balloccu Simone
Publication date: 01/01/2025

Hallucinations are one of the most pressing challenges for large language models (LLMs). While numerous methods have been proposed to detect and mitigate them automatically, human evaluation continues to serve as the gold standard. However, these human evaluations of hallucinations show substantial variation in definitions, terminology, and evaluation practices. In this paper, we survey 64 studies involving human evaluation of hallucination published between 2019 and 2024 to investigate how hallucinations are currently defined and assessed. Our analysis reveals a lack of consistency in definitions and exposes several concerning methodological shortcomings. Crucial details, such as evaluation guidelines, user interface design, inter-annotator agreement metrics, and annotator demographics, are frequently under-reported or omitted altogether.

OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

Authors: Dušek Ondřej; Lango Mateusz; Kartáč Ivan
Publication date: 01/01/2025

Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. We introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on individual error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains, and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

58 full texts, 539 metadata records. Updated in last 30 days.
