Charles University

Biblio at Institute of Formal and Applied Linguistics
    539 research outputs found

    Insight discovery in structured data

    No full text
    My research focuses on improving textual inference in large language models (LLMs) for natural language generation (NLG). LLMs are increasingly used to generate reports, summaries, and insights, and to answer questions about data. However, their outputs are often factually inaccurate, they struggle with inference operations, and they produce shallow outputs, since most benchmarks (e.g. WebNLG) require neither much inference nor comprehensive content selection. This limits their usefulness in contexts such as communicating complex information and decision making, where people need meaningful and reliable outputs. I am particularly interested in data-to-text generation, which requires both faithfulness to the provided data and an intuition for what might be interesting and important. Such tasks are often under-specified, forcing both models and humans to make implicit presuppositions, which can cause errors when they differ. Beyond prompting LLMs, I want to integrate them with symbolic operations through code generation to make inferences over the data, ensuring that the outputs are more faithful and interpretable

    Large Language Models as Span Annotators

    No full text
    Span annotation is the task of localizing and classifying text spans according to custom guidelines. Annotated spans can be used to analyze and evaluate high-quality texts for which single-score metrics fail to provide actionable feedback. Until recently, span annotation was limited to human annotators or fine-tuned models. In this study, we show that large language models (LLMs) can serve as flexible and cost-effective span annotation backbones. To demonstrate their utility, we compare LLMs to skilled human annotators on three diverse span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We demonstrate that LLMs achieve inter-annotator agreement (IAA) comparable to human annotators at a fraction of the cost per output annotation. We also manually analyze model outputs, finding that LLMs make errors at a similar rate to human annotators. We release the dataset of more than 40k model and human annotations for further research

    Pretraining Language Models with LoRA and Artificial Languages

    No full text
    Large language models (LLMs) require a substantial amount of training data, which contrasts with the data-efficient learning observed in humans. In our submission to the BabyLM Challenge, we address this disparity by proposing a parameter-efficient pretraining approach for language acquisition from limited data. Our approach involves initializing the model with token embeddings trained by a shallow model, followed by tuning the non-embedding parameters with non-linguistic data to introduce structural biases. Then, we freeze the resulting model and pretrain it on the 10M-token BabyLM corpus using LoRA adapters. Experiments on small corpora demonstrate that our approach improves upon classic pretraining of the entire model

    OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

    No full text
    Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. We introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on individual error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate

    Generating Data Insights with LLMs by Querying Tables

    No full text
    Large Language Models (LLMs) have the potential to automate processes such as generating reports, aiding business intelligence, and uncovering actionable insights from datasets. Yet, their practical use is still hindered by their unreliability (making factual errors, i.e., hallucinations) and limited interpretability. Moreover, the resulting insights often lack diversity and depth. To address this, we present a novel agentic workflow that grounds the insights in data queried from input database tables. Here is how it works: First, we prompt the LLM to identify interesting and meaningful insight ideas from a provided database table. Next, we ask the model to translate the ideas into SQL queries, which are executed to retrieve the results. Finally, the LLM generates an engaging insight, grounded in the retrieved SQL results. To provide a robust evaluation, we developed a tool for automatic dataset generation from Wikidata and Wikipedia, expanding on the prominent table-to-text LogicNLG benchmark. By creating a fresh dataset that contains only data created after each model's release, we prevent model training data leakage, a common challenge in assessing LLM performance. This setup forces LLMs to truly generalize on the unseen data, giving a fair measure of their ability. We use the common evaluation protocol with automatic metrics for factuality and diversity. On top of that, we run a human evaluation study for factuality (to compensate for shortcomings of automatic metrics) and for the interestingness of the produced claims for the users. The explicit ideation is naturally interpretable to people and can also be filtered by an LLM to proceed to the next stage only with the most interesting ideas. We show that this increases output diversity and interestingness of the generated claims, compared to direct prompting. Moreover, the LLMs exhibit high levels of background knowledge about the diverse tables in our dataset (from sports and politics to culture). We show that they can make this knowledge explicit to help the user correctly interpret the generated insights, or to find mistakes. The SQL step benefits from an agentic approach where the code is iteratively fixed when an error is encountered. It can also be checked for correctness and matched against the original idea. Furthermore, this enables straightforward scaling to larger tables and databases, by generating insight ideas with just a sample of data and then retrieving the relevant data from the whole table. We compare generating insight ideas with the whole table shown to the LLM against generating them from just a sample, and point out what could be improved. In summary, we present a method that leverages the strengths of LLMs for ideation, code generation, and engaging formulation of the insights. To ensure real impact, we include a thorough evaluation and analysis of the method
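
    The three-stage workflow described in this abstract (ideas, then SQL, then a grounded insight, with the query repaired on error) can be sketched as below. This is only an illustrative sketch, not the authors' implementation: the function names `run_with_repair` and `generate_insights`, the `medals` table, and the `fake_*` callables standing in for LLM calls are all assumptions made for the example.

```python
import sqlite3
from typing import Callable, List, Tuple

Rows = List[Tuple]


def run_with_repair(conn: sqlite3.Connection, sql: str,
                    repair: Callable[[str, str], str],
                    max_attempts: int = 3) -> Tuple[str, Rows]:
    """Execute sql; on a database error, ask repair() for a fixed query and retry."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            last_error = str(exc)
            sql = repair(sql, last_error)  # hypothetical LLM call in a real system
    raise RuntimeError(f"query still failing after {max_attempts} attempts: {last_error}")


def generate_insights(conn: sqlite3.Connection, sample: Rows,
                      propose_ideas: Callable[[Rows], List[str]],
                      idea_to_sql: Callable[[str], str],
                      repair: Callable[[str, str], str],
                      write_insight: Callable[[str, Rows], str]) -> List[str]:
    """Ideas come from a data sample; queries run against the whole table."""
    insights = []
    for idea in propose_ideas(sample):
        _, rows = run_with_repair(conn, idea_to_sql(idea), repair)
        insights.append(write_insight(idea, rows))
    return insights


# Toy data and deterministic stand-ins for the LLM (all hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, gold INTEGER)")
conn.executemany("INSERT INTO medals VALUES (?, ?)", [("A", 3), ("B", 5), ("C", 1)])


def fake_propose_ideas(sample: Rows) -> List[str]:
    return ["Which country has the most gold medals?"]


def fake_idea_to_sql(idea: str) -> str:
    # Deliberately wrong table name, to exercise the repair loop.
    return "SELECT country, gold FROM medal ORDER BY gold DESC LIMIT 1"


def fake_repair(sql: str, error: str) -> str:
    return sql.replace("FROM medal ", "FROM medals ")


def fake_write_insight(idea: str, rows: Rows) -> str:
    return f"{rows[0][0]} leads with {rows[0][1]} gold medals."


sample = conn.execute("SELECT * FROM medals LIMIT 2").fetchall()
insights = generate_insights(conn, sample, fake_propose_ideas,
                             fake_idea_to_sql, fake_repair, fake_write_insight)
print(insights)
```

    Passing the LLM steps in as plain callables keeps the control flow testable without a model; in a real system each stand-in would be a prompt to the model, and the bounded retry count is an arbitrary choice for the sketch.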

    Sourcing Fresh Resources for Table-to-Text Generation Evaluation

    No full text
    Table-to-text generation is a challenging subtask of data-to-text generation, where a natural language generation (NLG) system generates insights from a given data table. Recent research has built on neural language models (LMs), including large language models (LLMs). However, LLMs were shown to memorize common benchmarks, which inflates their measured performance. Following prior work on dynamic dataset construction, we developed an approach for obtaining up-to-date benchmarks for table-to-text generation, dubbed FreshTab. This dataset family, based on Wikipedia tables, is not affected by the problems of LLM memorization and benchmark contamination, as the underlying tables are newer than the LLM's knowledge cutoff date. We also introduce basic domain labels for each table, allowing for domain-specific evaluation insights. In our experiments with tables from February-May 2025 collected using FreshTab, we show that all recent LLMs perform worse on average than on a comparable set of tables from the earlier LoTNLG/LogicNLG benchmark

    Findings of the WMT25 Multilingual Instruction Shared Task: Persistent Hurdles in Reasoning, Generation, and Evaluation

    No full text
    The WMT25 Multilingual Instruction Shared Task (MIST) introduces a benchmark for evaluating large language models (LLMs) across 30 languages. The benchmark covers five types of tasks: machine translation, linguistic reasoning, open-ended generation, cross-lingual summarization, and LLM-as-a-judge. We provide automatic evaluation and collect human annotations, which highlight the limits of automatic evaluation and enable further research on the meta-evaluation of metrics. We evaluate a wide range of open- and closed-weight LLMs on our benchmark, providing a comprehensive overview of their multilingual capabilities. The results show marked differences across individual subtasks and languages, revealing persistent challenges in reasoning, cross-lingual generation, and evaluation reliability. This work establishes a standardized framework for measuring future progress in the development of multilingual LLMs

    An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

    No full text
    Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual, monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value

    CUNI-a at ArchEHR-QA 2025: Do we need Giant LLMs for Clinical QA?

    No full text
    In this paper, we present our submission to the ArchEHR-QA 2025 shared task, which focuses on answering patient questions based on excerpts from electronic health record (EHR) discharge summaries. Our approach identifies essential sentences relevant to a patient’s question using a combination of few-shot inference with the Med42-8B model, cosine similarity over clinical term embeddings, and the MedCPT cross-encoder relevance model. Then, concise answers are generated on the basis of these selected sentences. Despite not relying on large language models (LLMs) with tens of billions of parameters, our method achieves competitive results, demonstrating the potential of resource-efficient solutions for clinical NLP applications

    58 full texts, 539 metadata records
    Updated in last 30 days.