Charles University

Biblio at Institute of Formal and Applied Linguistics

539 research outputs found

n Walks in the Fictional Woods

Author: Rosa Rudolf
de Lima Edirlei Soares
Di Bartolomeo Sara
Schetinger Victor
Meinecke Christofer
Publication date: 01/01/2023

This paper presents a novel exploration of the interaction between generative AI models, visualization, and narrative generation processes, using OpenAI's GPT as a case study. We look at the question "Where Does Generativeness Come From?", which has a simple answer at the intersection of many domains. Drawing on Umberto Eco's "Six Walks in the Fictional Woods", we engender a speculative, transdisciplinary scientific narrative using ChatGPT in different roles: as an information repository, a ghost writer, a scientific coach, among others. The paper is written as a piling of plateaus where the titling of each (sub-)section, the "teaser" images, the headers, and a biblock of text are strata forming a narrative about narratives. To enrich our exposition, we present a visualization prototype to analyze storyboarded narratives, and extensive conversations with ChatGPT. Each link to a ChatGPT conversation is an experiment on writing where we try to use different plugins and techniques to investigate the topics that ultimately form the content of this portable document file. Our visualization uses a dataset of stories with scene descriptions, textual descriptions of scenes (both generated by ChatGPT), and images (generated by Stable Diffusion using scene descriptions as prompts). We employ a simple graph-node diagram to try to make a "forest of narratives" visible, an example of a vis4gen application that can be used to analyze the output of Large Language + Image Models.

Exploratory Analysis of the Applicability of Formalised Knowledge to Personal Experience Narration

Author: Pecina Pavel
Mireles Victor
Revenko Artem
Billib Stephanie
Uiterwaal Frank
Jänicke Stephan
Publication date: 01/01/2023

Some of the victims of Nazi persecution have recorded their personal experiences in the form of diaries of their internment in concentration camps. Such human-centric texts may contrast with the organisation of knowledge about such events that, for example, historians and archivists make. In this work, we analyse six such narrations with the use of Entity Extraction and Named Entity Recognition techniques, present the results of the corresponding exploration, and discuss the suitability of such tools on this corpus. We show that knowledge tools that have been successfully used to organise documents can be lacking when describing personal accounts, and we suggest ways to alleviate this.

With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector

Author: Dušek Ondřej
Plátek Ondřej
Lango Mateusz
Publication date: 01/01/2023

This work presents our efforts to reproduce the results of the human evaluation experiment presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic system detecting over- and undertranslations (translations containing more or less information than the original) in machine translation (MT) outputs. Despite the high quality of the documentation and code provided by the authors, we discuss some problems we found in reproducing the exact experimental setup and offer recommendations for improving reproducibility. Our replicated results generally confirm the conclusions of the original study, but in some cases statistically significant differences were observed, suggesting a high variability of human annotation.

Multi-Parallel Corpus of North Levantine Arabic

Author: Pecina Pavel
Sellat Hashem
Zemánek Petr
Pospíšil Adam
Saleh Shadi
Krubiński Mateusz
Publication date: 01/01/2023

Low-resource Machine Translation (MT) is characterized by the scarce availability of training data and/or standardized evaluation benchmarks. In the context of Dialectal Arabic, recent works introduced several evaluation benchmarks covering both Modern Standard Arabic (MSA) and dialects, mapping, however, mostly to a single Indo-European language: English. In this work, we introduce a multi-lingual corpus consisting of 120,600 multi-parallel sentences in English, French, German, Greek, Spanish, and MSA selected from the OpenSubtitles corpus, which were manually translated into North Levantine Arabic. By conducting a series of training and fine-tuning experiments, we explore how this novel resource can contribute to the research on Arabic MT. We make the dataset publicly available at http://hdl.handle.net/11234/1-5033 for research purposes.

CLS INFRA D8.1 Report of the tools for the basic Natural Language Processing (NLP) tasks in the CLS context

Author: Křen Michal
Cinková Silvie
Birkholz Julie
Pozo Alvaro
Heiden Serge
Börner Ingo
Janssen Maarten
Dejaeghere Tess
Publication date: 01/01/2023

This report lists and describes a selection of Natural Language Processing (NLP) tools which are considered to form a Corpus-Enrichment and NLP toolchain for common CLS research tasks. The tools were selected to be:
• safely positioned in their life cycle, i.e., state-of-the-art and mature as well as continuously maintained, or in development and promised as CLS Infra Deliverables by March 2025
• as multilingual as possible (beyond English and several major European languages)
• as interoperable as possible with other tools and texts in other languages

Robust Data-to-text Generation with Pretrained Language Models

Author: Dušek Ondřej
Publication date: 01/01/2023

The task of data-to-text generation amounts to describing structured data in fluent natural language sentences. The state-of-the-art approach in research systems today is finetuning pretrained neural language models (PLMs). This often leads to overfitting and hallucinations, i.e. situations where the PLM generates outputs that are not grounded in the input, replicating or amplifying training data noise. Rather than applying a PLM as a black box for the whole data-to-text task, we use PLMs for simple subtasks, aiming to achieve broad generalization and minimize hallucination. First, we use a pipeline approach where the PLMs only work as text “editors”, rather than generators, taking advantage of their high output fluency. The data is converted into text in an initial preprocessing step, where we use simple handcrafted templates recounting the individual input facts (i.e. relations between entities). The PLMs then order the facts and fuse them into fluent sentences. This helps us generate without in-domain training data and achieve good fluency and accuracy. We further examine the capability of PLMs to produce accurate descriptions of individual facts from the data, in order to remove the last handcrafted step. Using a specially collected dataset, we show that PLMs finetuned to describe a variety of relations are very robust in verbalizing novel, unseen relations. The key to PLMs’ usability here is providing clear relation names on the input.
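
The template-then-edit pipeline described in this abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the relation names, the templates, the toy facts, and the function names (`verbalize`, `identity_order`, `naive_fuse`, `generate`) are all hypothetical, and the ordering and fusion steps, which the abstract assigns to PLMs, are replaced here by trivial stand-ins so that the example runs without any model.

```python
from typing import Callable, Dict, List, Tuple

# A single input fact: (subject, relation, object).
Triple = Tuple[str, str, str]

# Hypothetical handcrafted templates, one per relation name.
TEMPLATES: Dict[str, str] = {
    "capital": "The capital of {subj} is {obj}.",
    "language": "{obj} is spoken in {subj}.",
}

# Fallback used when no template exists for a relation (an assumption).
DEFAULT_TEMPLATE = "The {rel} of {subj} is {obj}."


def verbalize(triple: Triple) -> str:
    """Turn one fact into a simple sentence using a handcrafted template."""
    subj, rel, obj = triple
    template = TEMPLATES.get(rel, DEFAULT_TEMPLATE)
    return template.format(subj=subj, rel=rel, obj=obj)


def identity_order(facts: List[str]) -> List[str]:
    """Stand-in for the PLM ordering step: keeps the input order."""
    return list(facts)


def naive_fuse(facts: List[str]) -> str:
    """Stand-in for the PLM fusion step: joins sentences with spaces."""
    return " ".join(facts)


def generate(
    triples: List[Triple],
    order: Callable[[List[str]], List[str]] = identity_order,
    fuse: Callable[[List[str]], str] = naive_fuse,
) -> str:
    """Run the pipeline: templates first, then ordering, then fusion."""
    facts = [verbalize(t) for t in triples]
    return fuse(order(facts))


if __name__ == "__main__":
    # Toy facts, for illustration only.
    print(generate([("Exampleland", "capital", "Sample City"),
                    ("Exampleland", "language", "Examplish")]))
```

In a real system, `order` and `fuse` would be replaced by calls to pretrained models; keeping them as plain callables makes that substitution a one-line change.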

Semantic Accuracy in Natural Language Generation: A Thesis Proposal

Author: Schmidtová Patrícia
Publication date: 01/01/2023

With the fast-growing popularity of current large pre-trained language models (LLMs), it is necessary to dedicate efforts to making them more reliable. In this thesis proposal, we aim to improve the reliability of natural language generation (NLG) systems by researching the semantic accuracy of their outputs. We look at this problem from the outside (evaluation) and from the inside (interpretability). We propose a novel method for evaluating semantic accuracy and discuss the importance of working towards a unified and objective benchmark for NLG metrics. We also review interpretability approaches which could help us pinpoint the sources of inaccuracies within the models and explore potential mitigation strategies.

MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module

Author: Dušek Ondřej
Plátek Ondřej
Publication date: 01/01/2023

We present MooseNet, a trainable speech metric that predicts the listeners’ Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-finetuned SSL model when trained only on 136 utterances (ca. one minute training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows PLDA training superiority over SSL model finetuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.
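
For readers unfamiliar with the target quantity, the following minimal Python sketch shows how a Mean Opinion Score is conventionally obtained: as the arithmetic mean of the listeners' ratings for each utterance. It does not reproduce MooseNet or PLDA; the function name, the utterance identifiers, and the ratings are hypothetical, and the 1 to 5 rating scale is an assumption based on common practice rather than something stated in the abstract.

```python
from statistics import mean
from typing import Dict, List


def mean_opinion_scores(ratings: Dict[str, List[int]]) -> Dict[str, float]:
    """Average the listener ratings of each utterance into a MOS value.

    `ratings` maps an utterance ID to the scores given by individual
    listeners (assumed here to be integers on a 1-5 scale).
    """
    return {utt: float(mean(scores)) for utt, scores in ratings.items()}


if __name__ == "__main__":
    # Made-up ratings, for illustration only.
    toy_ratings = {"utt1": [4, 5, 3], "utt2": [2, 2, 3, 1]}
    print(mean_opinion_scores(toy_ratings))
```

A trainable metric of the kind described in the abstract would be fitted to predict these per-utterance averages from the audio alone.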

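Macunaíma, papagaio IA, resolve crimes em Praga: Rumo à visualização de padrões em narrativas de modelos de IA generativos [Macunaíma, an AI Parrot, Solves Crimes in Prague: Towards Visualizing Patterns in Narratives of Generative AI Models]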

Author: Di Bartolomeo Sara
Rosa Rudolf
Meinecke Christofer
da Silva Dafne Reis Pedroso
Schetinger Victor
de Lima Edirlei Soares
Publication date: 01/01/2023

This paper introduces the development and analysis of results from the Macunaíma project, a use case of AI applied to the creation of scripts and storyboards, which combines narrative, audio, and automatically generated images. Technically, AI models were employed for text (ChatGPT) and image (Stable Diffusion) generation, along with a web interface allowing user interaction. The dimensions of Narrative Temporal Cohesion, Cinematographic Cohesion, and Graphic Cohesion were examined. Additionally, a data visualization prototype was developed to identify the generated narratives and suggest potential improvements for the tool. We highlight the pivotal role of data visualization in analyzing these intricate models, especially considering the vast amount of information involved.

Getting Past Chit-chat with ChatGPT: Large Language Models and Structured Outputs

Author: Dušek Ondřej
Publication date: 01/01/2023

A quick introduction to text generation using language models (including LLM problems) and a description of our recent experiments with task-oriented dialogue modeling using pre-trained language models and LLMs.

58 full texts, 539 metadata records

Updated in last 30 days.