539 research outputs found
Sustaining the European Language Grid: Towards the ELG Legal Entity
When preparing the European Language Grid EU project proposal and
designing the overall concept of the platform, the need for drawing up a long-term
sustainability plan was abundantly evident. Already in the phase of developing the
proposal, the centrepiece of the sustainability plan was what we called the “ELG
legal entity”, i.e., an independent organisation that would be able to take over operations,
maintenance, extension and governance of the European Language Grid platform
as well as managing and helping to coordinate its community. This chapter
describes our current state of planning with regard to this legal entity. It explains the
different options discussed and it presents the different products specified, which
can be offered by the legal entity in the medium to long run. We also describe which
legal form the organisation will take and how it will ensure the sustainability of ELG
Umělá inteligence a výuka jazyků (Artificial Intelligence and Language Teaching)
How artificial intelligence works, especially large language models such as ChatGPT.
What to look out for with AI tools.
Legal issues of AI tools.
Different ways to use ChatGPT or tips and tricks for prompting.
Recommendations on the use of AI in language learning
Missing information, unresponsive authors, experimental flaws: The impossibility of assessing the reproducibility of previous human evaluations in NLP
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding, that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP
UFAL Parallel Corpus of North Levantine 1.0
This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into the North Levantine Arabic language. The corpus was created for the purpose of training machine translation for North Levantine and the other languages
Interoperable Metadata Bridges to the wider Language Technology Ecosystem
One of the objectives of the European Language Grid is to help overcome
the fragmentation of the European Language Technology community by bringing
together language resources and technologies, information about them, Language
Technology consumers, providers and the wider public. This chapter describes the
mechanisms ELG has put in place to build interoperable bridges to related initiatives,
infrastructures, platforms and repositories in the wider Language Technology
landscape. We focus on the different approaches implemented for the exchange of
metadata records about, in a generic sense, resources and exemplify them with the
help of four use cases through which the ELG catalogue has been further populated.
The chapter presents the protocols used for the population processes as well as the
adaptations of the ELG metadata schema and platform policies that proved necessary
to be able to ingest these new records. Last, we discuss the challenges emerging
in large-scale metadata aggregation processes and propose a number of alternative
options to address them
Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity
Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings, and typically display outlier dimensions. This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context. Why these outliers occur and how they affect the representations is still an active area of research. We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models. We focus on cross-lingual semantic similarity tasks, as these are natural tasks for evaluating multilingual representations. Specifically, we examine sentence representations. Sentence transformers which are fine-tuned on parallel resources (that are not always available) perform better on this task, and we show that their representations are more isotropic. However, we aim to improve multilingual representations in general. We investigate how much of the performance difference can be made up by only transforming the embedding space without fine-tuning, and visualise the resulting spaces. We test different operations: removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. We publish our code for reproducibility
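As a rough illustration of the first operation named in the abstract above (removing individual outlier dimensions), the following minimal sketch estimates anisotropy as the mean pairwise cosine similarity of a set of vectors and then drops dimensions with unusually large average magnitude. The toy embeddings, the function names, and the threshold rule (mean absolute value above three times the median across dimensions) are assumptions made for this example; the abstract does not state the authors' exact criterion, and this is not their published code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine over all distinct pairs; values near 1 suggest high anisotropy."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            count += 1
    return total / count

def outlier_dimensions(vectors, threshold=3.0):
    """Hypothetical rule: flag dimensions whose mean |value| exceeds threshold x the median."""
    dim = len(vectors[0])
    mean_abs = [sum(abs(v[d]) for v in vectors) / len(vectors) for d in range(dim)]
    median = sorted(mean_abs)[dim // 2]
    return [d for d in range(dim) if mean_abs[d] > threshold * median]

def drop_dimensions(vectors, dims):
    """Return copies of the vectors with the given dimensions removed."""
    dims = set(dims)
    return [[x for d, x in enumerate(v) if d not in dims] for v in vectors]

# Toy "sentence embeddings" (made up): dimension 2 carries a large shared offset.
embeddings = [
    [0.5, -0.2, 10.0, 0.1],
    [-0.4, 0.3, 10.5, -0.2],
    [0.1, -0.5, 9.8, 0.4],
    [-0.3, 0.4, 10.2, -0.3],
]

before = mean_pairwise_cosine(embeddings)
outliers = outlier_dimensions(embeddings)
after = mean_pairwise_cosine(drop_dimensions(embeddings, outliers))
print(f"outlier dims: {outliers}, mean cosine before: {before:.3f}, after: {after:.3f}")
```

In this toy setting a single dominant dimension makes every pair of vectors look almost identical under cosine similarity; removing it lets the remaining dimensions determine similarity.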
HPLT High-Performance Language Technology: Building LLMs and TMs in European languages
A description of the HPLT project, which builds large monolingual and parallel language datasets and, on their basis, large language and translation models in 80 languages
UFAL-ULD at BLP-2023 Task 1: Violence Detection in Bangla Text
In this paper, we present the UFAL-ULD team's system, designed as a part of the BLP Shared Task 1: Violence Inciting Text Detection (VITD). This task aims to classify text, with a particular challenge of identifying incitement to violence, into Direct, Indirect or Non-violence levels. We experimented with several pre-trained sequence classification models, including XLM-RoBERTa, BanglaBERT, Bangla BERT Base, and Multilingual BERT. Our best-performing model was based on the XLM-RoBERTa-base architecture, which outperformed the baseline models.
Our system was ranked 20th among the 27 teams that participated in the task
VisuaLLM: Easy Web-based Visualization for Neural Language Generation
VisuaLLM is a Python library that enables interactive visualization of common tasks in natural language generation with pretrained language models (using HuggingFace's model API), with tight integration of benchmark datasets and fine-grained generation control. The system runs as a local generation backend server and features a web-based frontend, allowing simple interface configuration by minimal Python code. The currently implemented views include data visualization, next-token prediction with probability distributions, and decoding parameter control, with simple extension to additional tasks
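To make "next-token prediction with probability distributions" and "decoding parameter control" more concrete, here is a minimal, library-independent sketch of how temperature and top-k settings reshape a next-token distribution. It does not use or reproduce VisuaLLM's API; the function name `next_token_distribution` and the toy logits are assumptions made purely for illustration.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Turn raw token scores into probabilities, with optional top-k truncation.

    `logits` maps candidate tokens to unnormalised scores (hypothetical values).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank candidates by score and optionally keep only the k best.
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    # Temperature scaling: values below 1 sharpen, values above 1 flatten.
    scaled = [(tok, score / temperature) for tok, score in items]
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(s for _, s in scaled)
    exps = [(tok, math.exp(s - max_score)) for tok, s in scaled]
    total = sum(e for _, e in exps)
    return {tok: e / total for tok, e in exps}

# Made-up scores for four candidate next tokens.
toy_logits = {"cat": 2.0, "dog": 1.0, "car": 0.0, "sky": -1.0}

default = next_token_distribution(toy_logits)
sharp = next_token_distribution(toy_logits, temperature=0.5)
top2 = next_token_distribution(toy_logits, top_k=2)

for name, dist in [("default", default), ("temperature=0.5", sharp), ("top_k=2", top2)]:
    print(name, {tok: round(p, 3) for tok, p in dist.items()})
```

Lowering the temperature concentrates probability on the highest-scoring token, while top-k truncation removes low-scoring candidates and renormalises the rest.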
Macunaíma story generator
Born from the HumaneAI network—an EU consortium for developing humane artificial intelligence—our project marries generative models with storytelling. Initiated by Rudolf Rosa, from Charles University Prague, and Victor Schetinger, from TU Wien, it has since expanded to include a diverse team of collaborators.
We're exploring the generation of stories, audiovisual content, and the various humanistic and technical challenges that arise. With a prototype in place, we're gearing up for experiments and public testing. As we move forward, we stay true to our ethos: celebrating the ethics, aesthetics, and poetics of artificial intelligence in an era of rapid transformation