Charles University

Biblio at Institute of Formal and Applied Linguistics

539 research outputs found

Sustaining the European Language Grid: Towards the ELG Legal Entity

Authors: Backfried Gerhard; Piperidis Stelios; Rehm Georg; Hajič Jan; Vasiljevs Andrejs; Choukri Khalid; Hegele Stefanie; Germann Ulrich; Marheinecke Katrin; Gómez-Pérez José Manuel; Prinz Katja; Bontcheva Kalina
Publication venue: Springer Nature Switzerland AG
Publication date: 01/01/2023

When preparing the European Language Grid EU project proposal and designing the overall concept of the platform, the need for drawing up a long-term sustainability plan was abundantly evident. Already in the phase of developing the proposal, the centrepiece of the sustainability plan was what we called the “ELG legal entity”, i.e., an independent organisation that would be able to take over operations, maintenance, extension and governance of the European Language Grid platform, as well as manage and help coordinate its community. This chapter describes our current state of planning with regard to this legal entity. It explains the different options discussed and presents the different products specified, which can be offered by the legal entity in the medium to long run. We also describe which legal form the organisation will take and how it will ensure the sustainability of ELG

Umělá inteligence a výuka jazyků (Artificial Intelligence and Language Teaching)

Authors: Rosa Rudolf; Dušek Ondřej; Poslušná Lucie
Publication date: 01/01/2023

Functioning principles of artificial intelligence and especially large language models like ChatGPT. What to look out for with AI tools. Legal issues of AI tools. Different ways to use ChatGPT or tips and tricks for prompting. Recommendations on the use of AI in language learning

Missing information, unresponsive authors, experimental flaws: The impossibility of assessing the reproducibility of previous human evaluations in NLP

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP

UFAL Parallel Corpus of North Levantine 1.0

Authors: Pecina Pavel; Sellat Hashem; Zemánek Petr; Pospíšil Adam; Saleh Shadi; Krubiński Mateusz
Publication date: 01/01/2023

This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into the North Levantine Arabic language. The corpus was created for the purpose of training machine translation for North Levantine and the other languages

Interoperable Metadata Bridges to the wider Language Technology Ecosystem

Authors: Labropoulou Penny; Piperidis Stelios; Voukoutis Leon; Hajič Jan; Deligiannis Miltos; Košarko Ondřej; Rehm Georg; Giagkou Maria
Publication venue: Springer Nature Switzerland AG
Publication date: 01/01/2023

One of the objectives of the European Language Grid is to help overcome the fragmentation of the European Language Technology community by bringing together language resources and technologies, information about them, Language Technology consumers, providers and the wider public. This chapter describes the mechanisms ELG has put in place to build interoperable bridges to related initiatives, infrastructures, platforms and repositories in the wider Language Technology landscape. We focus on the different approaches implemented for the exchange of metadata records about, in a generic sense, resources and exemplify them with the help of four use cases through which the ELG catalogue has been further populated. The chapter presents the protocols used for the population processes as well as the adaptations of the ELG metadata schema and platform policies that proved necessary to be able to ingest these new records. Last, we discuss the challenges emerging in large-scale metadata aggregation processes and propose a number of alternative options to address them

Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity

Authors: Libovický Jindřich; Fastowski Alina; Fraser Alexander; Hämmerl Katharina
Publication date: 01/01/2023

Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings, and typically display outlier dimensions. This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context. Why these outliers occur and how they affect the representations is still an active area of research. We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models. We focus on cross-lingual semantic similarity tasks, as these are natural tasks for evaluating multilingual representations. Specifically, we examine sentence representations. Sentence transformers which are fine-tuned on parallel resources (that are not always available) perform better on this task, and we show that their representations are more isotropic. However, we aim to improve multilingual representations in general. We investigate how much of the performance difference can be made up by only transforming the embedding space without fine-tuning, and visualise the resulting spaces. We test different operations: removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. We publish our code for reproducibility

HPLT High-Performance Language Technology: Building LLMs and TMs in European languages

Author: Hajič Jan
Publication date: 01/01/2023

Description of the HPLT project, which builds large language datasets (monolingual and parallel) and, on their basis, large language and translation models in 80 languages

UFAL-ULD at BLP-2023 Task 1: Violence Detection in Bangla Text

Authors: Mukherjee Sourabrata; Dušek Ondřej; Ojha Atul
Publication date: 01/01/2023

In this paper, we present the UFAL-ULD team's system, designed as part of the BLP Shared Task 1: Violence Inciting Text Detection (VITD). The task is to classify text into Direct, Indirect or Non-violence levels, with the particular challenge of identifying incitement to violence. We experimented with several pre-trained sequence classification models, including XLM-RoBERTa, BanglaBERT, Bangla BERT Base, and Multilingual BERT. Our best-performing model was based on the XLM-RoBERTa-base architecture, which outperformed the baseline models. Our system was ranked 20th among the 27 teams that participated in the task

VisuaLLM: Easy Web-based Visualization for Neural Language Generation

Authors: Trebuňa František; Dušek Ondřej
Publication date: 01/01/2023

VisuaLLM is a Python library that enables interactive visualization of common tasks in natural language generation with pretrained language models (using HuggingFace's model API), with tight integration of benchmark datasets and fine-grained generation control. The system runs as a local generation backend server and features a web-based frontend, allowing simple interface configuration by minimal Python code. The currently implemented views include data visualization, next-token prediction with probability distributions, and decoding parameter control, with simple extension to additional tasks

Macunaíma story generator

Authors: Rosa Rudolf; de Lima Edirlei Soares; Schetinger Victor
Publication date: 01/01/2023

Born from the HumaneAI network, an EU consortium for developing humane artificial intelligence, our project marries generative models with storytelling. Initiated by Rudolf Rosa, from Charles University Prague, and Victor Schetinger, from TU Wien, it has since expanded to include a diverse team of collaborators. We're exploring the generation of stories, audiovisual content, and the various humanistic and technical challenges that arise. With a prototype in place, we're gearing up for experiments and public testing. As we move forward, we stay true to our ethos: celebrating the ethics, aesthetics, and poetics of artificial intelligence in an era of rapid transformation

58 full texts, 539 metadata records. Updated in the last 30 days.
