Charles University

Biblio at Institute of Formal and Applied Linguistics


539 research outputs found


Towards Semantic Tagging of Segmented Holocaust Narratives

Authors: Pecina Pavel; Brückner Christopher
Publication date: 01/01/2025

With the increasing loss of Holocaust witnesses, it is becoming more and more important to preserve their memories. Items of cultural heritage, including textual data such as diaries or transcripts of video interviews, are abundant. However, large amounts of this data are not annotated, which poses a significant obstacle for domain experts curating digitized information regarding the Holocaust. A solution for this problem is a natural language processing model that links text segments to a rich domain-specific ontology of subject terms to automatically tag documents for further processing. While we have not yet achieved a comprehensive solution, we show that even a simple model fine-tuned on a small dataset of spoken narratives is a promising first step and transfers its capabilities to written testimonies reasonably well.

Exploring ReAct Prompting for Task-Oriented Dialogue: Insights and Shortcomings

Authors: Rojas Barahona Lina; Dušek Ondřej; Couceiro Miguel; Elizabeth Michelle; Veyret Morgan
Publication date: 01/01/2025

Large language models (LLMs) have gained immense popularity due to their impressive capabilities in unstructured conversations. Empowering LLMs with advanced prompting strategies such as reasoning and acting (ReAct) (Yao et al., 2022) has shown promise in solving complex tasks traditionally requiring reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs performing task-oriented dialogue (TOD). We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation and with real users. While ReAct-LLMs severely underperform state-of-the-art approaches on success rate in simulation, this difference becomes less pronounced in human evaluation. Moreover, compared to the baseline, humans report higher subjective satisfaction with ReAct-LLM despite its lower success rate, most likely thanks to its natural and confidently phrased responses.

When Multilingual Models Compete with Monolingual Domain-Specific Models in Clinical Question Answering

Authors: Pecina Pavel; Lanz Vojtěch
Publication date: 01/01/2025

This paper explores the performance of multilingual models in the general domain on the clinical Question Answering (QA) task to observe their potential medical support for languages that do not benefit from the existence of clinically trained models. In order to improve the model’s performance, we exploit multilingual data augmentation by translating an English clinical QA dataset into six other languages. We propose a translation pipeline including projection of the evidences (answers) into the target languages and thoroughly evaluate several multilingual models fine-tuned on the augmented data, both in mono- and multilingual settings. We find that the translation itself and the subsequent QA experiments present a differently challenging problem for each of the languages. Finally, we compare the performance of multilingual models with pretrained medical domain-specific English models on the original clinical English test set. Contrary to expectations, we find that monolingual domain-specific pretraining is not always superior to general-domain multilingual pretraining. The source code is available at https://github.com/lanzv/Multilingual-emrQ

Large Language Models: How they work and what they are good for

Author: Dušek Ondřej
Publication date: 01/01/2025

A short introduction explaining how large language models work and the potential caveats of using them.

Constraining LLM Output

Author: Kasner Zdeněk
Publication date: 01/01/2025

This talk shows practical ways to make LLMs follow exact formats (from regex and JSON schemas to token-aware FSMs and CFGs) and explains how those constraints work during decoding. It surveys current tools and implementations, points out pitfalls like tokenization mismatches and unnatural formats, and gives an overview of best practices, focusing on MT use cases. A short demo shows constrained decoding in action and common failure modes to watch for.

Real-World Summarization: When Evaluation Reaches Its Limits

Authors: Schmidtová Patrícia; Dušek Ondřej; Mahamood Saad
Publication date: 01/01/2025

We examine evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (r=0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows that incorrect and non-checkable information poses the greatest risks. We also highlight challenges in crowdsourced evaluations.

Evaluating LLM Outputs with Humans and LLMs

Author: Dušek Ondřej
Publication date: 01/01/2025

How well do LLMs perform on text generation tasks, and how can we tell? We present approaches based on annotating individual errors, using human evaluators as well as LLMs. For humans, we introduce our efficient annotation framework and schema. For LLM-based evaluation, we show a metric that uses an ensemble of open-source LLMs and provides reasoning for each annotated error. The metric is evaluated on various generation tasks and evaluation aspects (such as accuracy or fluency) and shows high correlation with human annotators. Both approaches allow us to use benchmarks with recent data unseen by LLMs during training, bypassing the data leakage problem that artificially inflates LLMs' performance on commonly used benchmarks.

HPLT’s Second Data Release

We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.

Jak funguje dnešní AI a k čemu (ne)může být

Author: Dušek Ondřej
Publication date: 01/01/2025

Artificial intelligence (AI) has become ubiquitous in recent years and provides answers to any question, but the quality of those answers varies considerably. In this article, I first show why this is the case, that is, how large language models (LLMs), on which today's AI is based, work. I then focus on the question of what AI can be used for when working with text, and I show several examples of possible inputs.

Can Large Language Models Personalize Dialogues to Generational Styles?

Authors: Mazzei Alessandro; Dušek Ondřej; Anselma Luca; Balestruci Pier
Publication date: 01/01/2025

We investigate how large language models (LLMs) can produce personalized dialogue responses, specifically focusing on whether they reflect linguistic styles pertaining to different generations: Baby Boomers, Generation X, Generation Y, and Generation Z. We create P-MultiWoZ, a personalized, generation-specific version of MultiWOZ 2.2, by prompting LLMs, and validate its alignment with the original dataset through automatic and human evaluations. To validate the appropriateness of generational linguistic traits, we introduce GeMoSC, a corpus of generation-annotated movie dialogues. Linguistic analysis and a perplexity test suggest that P-MultiWoZ reflects patterns consistent with GeMoSC. Finally, a human evaluation reveals that annotators were mostly able to correctly identify the generation behind P-MultiWoZ dialogues, based only on a single query-reply pair.

58 full texts, 539 metadata records. Updated in last 30 days.
