Charles University

Biblio at Institute of Formal and Applied Linguistics

539 research outputs found

Barriers and enabling factors for error analysis in NLG research

Author: Gkatzia Dimitra
Mahamood Saad
Thomson Craig
Inglis Stephanie
Schoch Stephanie
van Miltenburg Emiel
Clinciu Miruna
Leppänen Leo
Dušek Ondřej
Wen Luou
Publication date: 01/01/2023

Earlier research has shown that few studies in Natural Language Generation (NLG) evaluate their system outputs using an error analysis, despite known limitations of automatic evaluation metrics and human ratings. This position paper takes the stance that error analyses should be encouraged, and discusses several ways to do so. The paper is based not only on our shared experience as authors but also on a survey that we distributed as a means of public consultation. We provide an overview of existing barriers to carrying out error analyses, and propose changes to improve error reporting in the NLG literature.

Polite Chatbot: A Text Style Transfer Application

Author: Mukherjee Sourabrata
Hudeček Vojtěch
Dušek Ondřej
Publication date: 01/01/2023

Generating polite responses is essential to build intelligent and engaging dialogue systems. However, this task is far from well-explored due to the difficulties of rendering a particular style in coherent responses, especially when parallel datasets for regular-to-polite pairs are usually unavailable. This paper proposes a polite chatbot that can produce responses that are polite and coherent with the given context. In this study, a politeness transfer model is first used to generate polite synthetic dialogue pairs of contexts and polite utterances. Then, these synthetic pairs are employed to train a dialogue model. Automatic and human evaluations demonstrate that our method outperforms baselines in producing polite dialogue responses while staying competitive in terms of coherence with the given context.

TabGenie: A Toolkit for Table-to-Text Generation

Author: Garanina Ekaterina
Kasner Zdeněk
Dušek Ondřej
Plátek Ondřej
Publication date: 01/01/2023

The heterogeneity of data-to-text generation datasets limits research on data-to-text generation systems. We present TabGenie, a toolkit which enables researchers to explore, preprocess, and analyze a variety of data-to-text generation datasets through the unified framework of table-to-text generation. In TabGenie, all inputs are represented as tables with associated metadata. The tables can be explored through a web interface, which also provides an interactive mode for debugging table-to-text generation, facilitates side-by-side comparison of generated system outputs, and allows easy exports for manual analysis. Furthermore, TabGenie is equipped with command line processing tools and Python bindings for unified dataset loading and processing. We release TabGenie as a PyPI package and provide its open-source code and a live demo at https://github.com/kasnerz/tabgenie

DocMarker

Author: Pecina Pavel
Mayer Jiří
Publication date: 01/01/2023

DocMarker is an annotation tool for creating training data for the text-to-form information retrieval NLP task. Say you have a free-form text (possibly rich text) containing information that should be filled into a structured form. This tool lets you record and annotate this form-filling process.

How Corpus Analysis Helps Operationalize Research Questions and Entices Literary Scholars to Learn Programming

Author: Janssen Maarten
Cvrček Václav
Křen Michal
Cinková Silvie
Publication date: 01/01/2023

We describe the preparation and implementation of the corpus linguistics summer school for DH. We assumed that students had no programming knowledge but wanted to familiarize themselves with what they should learn in order to build text corpora and search them.

Speaking Multiple Languages Affects the Moral Bias of Language Models

Author: Libovický Jindřich
Rothkopf Constantin
Schramowski Patrick
Kersting Kristian
Deiseroth Björn
Fraser Alexander
Hämmerl Katharina
Publication date: 01/01/2023

Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer. However, PMLMs are trained on varying amounts of data for each language. In practice this means their performance is often much better on English than on many other languages. We explore to what extent this also applies to moral norms. Do the models capture moral norms from English and impose them on other languages? Do the models exhibit random and thus potentially harmful beliefs in certain languages? Both these issues could negatively impact cross-lingual transfer and potentially lead to harmful outcomes. In this paper, we (1) apply the MORALDIRECTION framework to multilingual models, comparing results in German, Czech, Arabic, Chinese, and English, (2) analyse model behaviour on filtered parallel subtitles corpora, and (3) apply the models to a Moral Foundations Questionnaire, comparing with human responses from different countries. Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions. We release our code and models.

Low-Resource Text Style Transfer for Bangla: Data & Models

Author: Mukherjee Sourabrata
Bansal Akansha
Ojha Atul
Majumdar Pritha
Dušek Ondřej
Publication date: 01/01/2023

Text style transfer (TST) involves modifying the linguistic style of a given text while retaining its core content. This paper addresses the challenging task of text style transfer in the Bangla language, which is low-resourced in this area. We present a novel Bangla dataset that facilitates text sentiment transfer, a subtask of TST, enabling the transformation of positive sentiment sentences to negative and vice versa. To establish a high-quality base for further research, we refined and corrected an existing English dataset of 1,000 sentences for sentiment transfer based on Yelp reviews, and we introduce a new human-translated Bangla dataset that parallels its English counterpart. Furthermore, we offer multiple benchmark models that serve as a validation of the dataset and a baseline for further research.

Leveraging Low-resource Parallel Data for Text Style Transfer

Author: Mukherjee Sourabrata
Dušek Ondřej
Publication date: 01/01/2023

Text style transfer (TST) involves transforming a text into a desired style while approximately preserving its content. The biggest challenge in TST is the general lack of parallel data. Many existing approaches rely on complex models using substantial non-parallel data, with mixed results. In this paper, we leverage a pretrained BART language model with minimal parallel data and incorporate low-resource methods such as hyperparameter tuning, data augmentation, and self-training, which have not been explored in TST. We further include novel style-based rewards in the training loss. Through extensive experiments in sentiment transfer, a sub-task of TST, we demonstrate that our simple yet effective approaches achieve well-balanced results, surpassing non-parallel approaches and highlighting the usefulness of parallel data even in small amounts.

Are Large Language Models All You Need for Task-Oriented Dialogue?

Author: Hudeček Vojtěch
Dušek Ondřej
Publication date: 01/01/2023

Instruction-finetuned large language models (LLMs) have recently gained huge popularity, thanks to their ability to interact with users through conversation. In this work, we aim to evaluate their ability to complete multi-turn tasks and interact with external databases in the context of established task-oriented dialogue benchmarks. We show that in explicit belief state tracking, LLMs underperform compared to specialized task-specific models. Nevertheless, they show some ability to guide the dialogue to a successful ending through their generated responses if they are provided with correct slot values. Furthermore, this ability improves with few-shot in-domain examples.

Tackling Hallucinations in Neural Chart Summarization

Author: Dušek Ondřej
Obaid ul Islam Saad
Škrjanec Iza
Demberg Vera
Publication date: 01/01/2023

Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we tackle the problem of hallucinations in neural chart summarization. Our analysis shows that the target side of chart summarization training datasets often contains additional information, leading to hallucinations. We propose a natural language inference (NLI) based method to preprocess the training data and show through human evaluation that our method significantly reduces hallucinations. We also found that shortening long-distance dependencies in the input sequence and adding chart-related information like title and legends improve the overall performance.

58 full texts, 539 metadata records. Updated in the last 30 days.
