Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University


1998 research outputs found


Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Author: Vico Gianluca
Libovický Jindřich
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 08/02/2026

This dataset contains data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects (Adelani et al., EACL 2024)

Projekt_ZDH_transkripce

Author: Frančová Marie
Publication venue: Charles University, Faculty of Arts, Institute of Czech Language and Theory of Communication
Publication date: 06/02/2026

Text written in Kurrent script, transcribed automatically with Transkribus and then completed by hand

Coreference in Universal Dependencies 1.4 (CorefUD 1.4)

Author: Novák Michal
Popel Martin
Zeman Daniel
Žabokrtský Zdeněk
Nedoluzhko Anna
Acar Kutay
Bamman David
Bourgois Antoine
Bourgonje Peter
Cinková Silvie
Delfino Eleonora
Eckhoff Hanne
Cebiroğlu Eryiğit Gülşen
Hajič Jan
Han Sooyoun
Hardmeier Christian
Haug Dag
Jørgensen Tollef
Kåsen Andre
Krielke Pauline
Landragin Frédéric
Lapshinova-Koltunski Ekaterina
Leotta Roberta Grazia
Mæhlum Petter
Martí M. Antònia
Mélanie-Becquet Frédérique
Mikulová Marie
Milintsevich Kirill
Moretti Giovanni
Mujadia Vandan
Muzerelle Judith
Nam Sangha
Nøklestad Anders
Ogrodniczuk Maciej
Øvrelid Lilja
Pamay Arslan Tuğba
Passarotti Marco
Poibeau Thierry
Porada Ian
Recasens Marta
Seo Sumin
Solberg Per Erik
Stede Manfred
Štěpánek Jan
Štěpánková Barbora
Straka Milan
Swanson Daniel
Toldova Svetlana
Vadász Noémi
van Cranenburgh Andreas
Velldal Erik
Vincze Veronika
Zeldes Amir
Žitkus Voldemaras
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 18/02/2026

CorefUD is a collection of previously existing coreference-annotated datasets that have been converted to a unified annotation scheme. In its current version (1.4), CorefUD comprises 33 datasets covering 19 languages. The datasets are enriched with automatically assigned morphological and syntactic annotations, fully compliant with the standards of the Universal Dependencies project, in cases where manual morphosyntactic annotation is not available or cannot be reliably converted. The data are stored in the CoNLL-U format, with coreference- and bridging-specific information encoded as attribute–value pairs in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT-CLARIAH-CZ and contains 29 datasets for 19 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 3 for Czech, 1 for Dutch, 4 for English, 3 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Latin, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding test portions. The non-public edition is available internally to ÚFAL members and includes an additional 4 datasets for 2 languages (1 for Dutch and 3 for English) that cannot be redistributed due to licensing restrictions. It also contains the test portions for all datasets. When using any of the harmonized datasets, please review the respective license (available in the same directory as the data) and cite the original resource. Compared to version 1.3, version 1.4 introduces new languages and corpora: Czech-PDTSC, Latin-CorefLat, Dutch-OpenBoek, English-FantasyCoref, and French-LitBankFr. The last three consist of long literary documents. In addition, English-GUM, Czech-PCEDT, and Czech-PDT have been updated to newer releases. A detailed list of changes for each dataset is provided in the corresponding README file
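To illustrate the attribute–value encoding in the MISC column mentioned above, here is a minimal Python sketch that splits the tenth CoNLL-U column into a dictionary. The sample token line and its attribute values are invented for illustration and are not taken from any CorefUD file; the function names are hypothetical, and the README of each dataset remains the reference for the attributes actually used.

```python
def parse_misc(misc: str) -> dict[str, str]:
    """Split a CoNLL-U MISC value into attribute-value pairs."""
    if misc == "_":  # "_" marks an empty column in CoNLL-U
        return {}
    pairs = {}
    for item in misc.split("|"):  # pairs are separated by "|"
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def misc_of_token_line(line: str) -> dict[str, str]:
    """Return the parsed MISC column of one tab-separated token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError("a CoNLL-U token line has exactly 10 columns")
    return parse_misc(columns[9])


# Invented sample line; real files may use different attributes and values.
sample = "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\tEntity=(e1)|SpaceAfter=No"
attributes = misc_of_token_line(sample)
print(attributes)
```

Splitting each pair with `partition("=")` keeps any further `=` characters inside the value intact.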

Code and data accompanying the SynSemClass paper @ LREC 2026

Author: Straková Jana
Fučíková Eva
Urešová Zdeňka
Hajič Jan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 06/03/2026

Snapshot of code and data accompanying the paper accepted at LREC 2026: "Automatic Suggestions Help Extending Eventive Ontology: A Case Study on SynSemClass". The timestamp of the snapshot is March 6th, 2026. The original GitHub repository can be found at https://github.com/ufal/SynSemClassLREC2026

Verbs annotated for morphemic structure in Czech, English, German, Spanish v2

Author: Hledíková Hana
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 12/03/2026

A sample of verb lemmas in four languages: Czech (19,040 lemmas), English (9,969 lemmas), German (27,158 lemmas), and Spanish (11,768 lemmas). Each verb lemma is annotated for its morphemic structure (i.e., segmented into the prefix(es), root(s), suffix(es), and ending(s) that the lemma contains), the classification of its root morph to a root morpheme where needed (to facilitate grouping of verbs with the same root morpheme), and the frequency of the verb in a 100 M corpus. Two versions are available for each language: one with a more coarse-grained segmentation, which captures the morphemic structure that is synchronically available, and one with a more fine-grained segmentation, which also captures the word's etymology

CooccurrenceFieldSampler (CFS)

Author: Jan Oliver Rüdiger
Publication venue: Jan Oliver Rüdiger
Publication date: 01/01/2026

The CooccurrenceFieldSampler (CFS) was developed for sampling from corpora to facilitate lexicographical data analysis. It works with corpora from different sources, text types or years. In random sentence sampling (random/opportunistic sampling), it can be observed that corpora containing different text types and lengths (per source) cannot always be mixed optimally, as they usually do not have the same size and have different topic weightings, for example. The CFS was designed to solve this problem.

The CFS first calculates all co-occurrences for all tokens within sentences – separately for each source. These corpora are then combined in a 1:1 mixture and the co-occurrences for the entire data set are recalculated. The tool evaluates which co-occurrences disappear and which new ones are created, resulting in quotas that control the random mixing of the corpora sentence by sentence. The end result is a sentence-based corpus that (A) strives to retain the maximum number of co-occurrences from all sources (as accurately as possible) and (B) minimises the rejection of corpus data.

To use the CFS tool, follow these steps:
1. Unzip the ZIP file containing the necessary files.
2. For Windows, Linux, and macOS, you will find precompiled binaries that run exclusively on x64 processors.
3. If you are using a different processor type, such as ARM or ARM64, please use the Universal folder.
4. Windows users should run "cfs.exe" directly.
5. For Linux and macOS users, you must mark the cfs file as executable.
6. If using the Universal version, ensure .NET 10.0 is installed for compiling. You can then run the program with "dotnet cfs.dll".
7. To display help information, use the --help parameter.

Help/Parameter:
--from (Default: cec / recommended: cec) import file format (valid: cec, bnc, catma, clan, conll, cora, cwd, dewac, dta, folia, fln, korap, leipzig, xces, relannis, salt, json, sketch, speedy, tiger, tlv, treetagger, tsv, txm, weblicht)
--input (Default: input/) folder with input-files (mix per file)
--to (Default: cec / recommended: cec) export file format (valid: cec, catma, conll, cwd, csv, dta, folia, i5, korap, xces, plain, salt, json, sketch, speedy, tlv, tsv, treetagger, txm, weblicht)
--layer (Default: Wort) use this layer to calculate the co-occurrences
--output (Default: output.cec6) output file (every round and logfile)
--minFrequency (Default: 1 / recommended: 5) min. absolute frequency
--minSignificance (Default: 1.0 / recommended: 1.0) min. significance (poisson distribution)
--minChangeRate (Default: 0.1 / recommended: 0.1) min. significance (poisson distribution)
--maxRounds (Default: 10 / recommended: 5) min. absolute frequency
--help Display this help screen.
--version Display version information.

Supported corpus formats (input/output):
cec - CorpusExplorer Corpus (v6) - http://corpusexplorer.de
bnc - British National Corpus - http://www.natcorp.ox.ac.uk/
catma - CATMA (Computer assisted text markup and analysis) - https://catma.de/
clan - CLAN/CHILDES - https://talkbank.org/childes/
conll - CoNLL-U - https://universaldependencies.org/format.html
cora - CORA XML - https://cora.readthedocs.io/en/latest/coraxml/
cwd - IMS Open Corpus Workbench (CWB) - https://cwb.sourceforge.io/
dewac - https://wacky.sslmit.unibo.it/doku.php?id=corpora
dta - DTA TCF-XML - https://www.deutschestextarchiv.de/download
folia - FoLiA XML - https://proycon.github.io/folia/
fln - Folker/OrthoNormal - https://exmaralda.org/de/folker-de/
korap - KorAP - http://korap.ids-mannheim.de/
leipzig - Wortschatz Leipzig - https://wortschatz.uni-leipzig.de/en/download/
xces - XCes XML - http://www.xces.org/ / https://www.cs.vassar.edu/CES/
relannis - https://corpus-tools.org/annis/
salt - https://corpus-tools.org/archive-2015-2025/salt/
json - https://de.wikipedia.org/wiki/JSON
sketch - SketchEngine VERT - https://www.sketchengine.eu/glossary/vertical-file/
speedy - SPEEDy Annotation Editor - http://kups.ub.uni-koeln.de/id/eprint/55224
tiger - TiGER-XML - https://www.ims.uni-stuttgart.de/documents/ressourcen/werkzeuge/tigersearch/doc/html/TigerXML.html
tlv - TLV-XML
treetagger - TreeTagger - https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
tsv - Tab-separated values - https://en.wikipedia.org/wiki/Tab-separated_values
txm - TXM - https://txm.gitpages.huma-num.fr/textometrie/?lang=en
weblicht - Weblicht - https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/Main_Page.html
csv - Comma-separated values - https://en.wikipedia.org/wiki/Comma-separated_values
i5 - i5-XML - https://www.ids-mannheim.de/en/digspra/pb-s1/projects/corpus-development/ids-text-model/
plain - Plaintext - https://en.wikipedia.org/wiki/Plain_tex
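The comparison of co-occurrences before and after mixing, as described for the CFS, can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions, not the tool's implementation: it filters by raw absolute frequency (similar in spirit to --minFrequency) and ignores the Poisson significance, it takes the first sentences of each source instead of sampling randomly, and all function names and toy sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences, min_frequency=1):
    """Return the token pairs that co-occur in at least min_frequency sentences."""
    counts = Counter()
    for sentence in sentences:
        # Count each unordered pair of distinct tokens once per sentence.
        for pair in combinations(sorted(set(sentence)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_frequency}


def mix_one_to_one(source_a, source_b):
    """Simplified 1:1 mixture: the same number of sentences from each source."""
    size = min(len(source_a), len(source_b))
    return source_a[:size] + source_b[:size]


# Toy, already tokenised sources of different sizes.
source_a = [["red", "wine"], ["old", "car"], ["red", "wine"], ["fast", "car"]]
source_b = [["white", "wine"], ["old", "car"]]

# Co-occurrences computed separately per source, then on the mixture.
per_source = cooccurrences(source_a, 2) | cooccurrences(source_b, 2)
mixed = cooccurrences(mix_one_to_one(source_a, source_b), 2)

lost = per_source - mixed    # present in a single source, gone after mixing
gained = mixed - per_source  # only reach the threshold after mixing
print("lost:", lost)
print("gained:", gained)
```

In this toy setting the pair ("red", "wine") falls below the threshold after mixing, while ("car", "old") only reaches it once both sources are combined; differences of this kind are what the description says the tool turns into mixing quotas.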

CRAC 2026 Empty Nodes Baseline Model

Author: Straka Milan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 30/01/2026

The crac2026_empty_nodes_baseline is an XLM-RoBERTa-large–based multilingual model for the CRAC 2026 Empty Nodes Baseline system https://github.com/ufal/crac2026_empty_nodes_baseline for predicting empty nodes in input CoNLL-U files, trained on CorefUD 1.4 data. It was used to generate the baseline empty node predictions in the CRAC 2026 Shared Task on Multilingual Coreference Resolution https://ufal.mff.cuni.cz/corefud/crac26. The model is language agnostic, so in theory it can be used to predict coreference in any XLM-RoBERTa language. Compared to last year's CRAC 2025 Empty Nodes Baseline https://github.com/ufal/crac2025_empty_nodes_baseline, this year's baseline predicts all available information for the empty nodes, i.e., the forms, lemmas, UPOS, XPOS, and FEATS columns, in addition to the previously predicted word order and dependency relations of the empty nodes. Instructions for running prediction, training, and intrinsic evaluation are all available in the CRAC 2026 Empty Nodes Baseline repository https://github.com/ufal/crac2026_empty_nodes_baseline
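For readers unfamiliar with empty nodes: in CoNLL-U they appear as lines whose ID has a decimal form such as 8.1, placed after the regular token with the integer part as its ID. The following Python sketch shows how such lines can be recognised in a file. It is not part of the baseline; the function name and the sample lines are invented for illustration.

```python
def is_empty_node(line: str) -> bool:
    """Return True if a CoNLL-U line describes an empty node (ID like 8.1)."""
    if not line.strip() or line.startswith("#"):
        return False  # blank separator line or sentence-level comment
    token_id = line.split("\t", 1)[0]
    major, dot, minor = token_id.partition(".")
    return bool(dot) and major.isdigit() and minor.isdigit()


# Invented sample lines: a comment, a regular token, an empty node,
# and a multiword token range.
lines = [
    "# text = example",
    "1\tcame\tcome\tVERB\t_\t_\t0\troot\t_\t_",
    "1.1\t_\t_\t_\t_\t_\t_\t_\t_\t_",
    "2-3\tdon't\t_\t_\t_\t_\t_\t_\t_\t_",
]
flags = [is_empty_node(line) for line in lines]
print(flags)
```

Checking both parts of the ID with `isdigit()` keeps multiword token ranges such as 2-3 from being mistaken for empty nodes.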

Treebanks for Unified Taxonomy of Deep Syntactic Relations

Author: Droganova Kira
Zeman Daniel
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 2025

The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024. Four languages are included in this release. English PropBank is omitted due to its license terms. The updated version contains changes described in Dependency Parsing beyond Simple Trees (Kira Droganova 2025, PhD thesis, Chapter 4)

InCroMin 1.0: Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings

Author: Marko Čechovič, Natália Komorníková, Dominik Macháček, Ondřej Bojar
Publication venue: Charles University in Prague, UFAL
Publication date: 08/07/2025

This data package contains the published parts of InCroMin, a corpus of cross-lingual dialogues with minutes and detection of misunderstandings. InCroMin is described in the paper "Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings" by Marko Čechovič, Natália Komorníková, Dominik Macháček, and Ondřej Bojar, to be published in TSD 2025. The data were created by volunteer participants, with 2-5 people in each meeting. They were matched so that each meeting had at least two groups of people who did not understand each other's language. The meeting was facilitated by a simultaneous speech translation tool integrated in Minuteman. The meetings were held via a teleconferencing platform that recorded each speaker in a separate audio track. The participants consented to data processing and release. Their speech was then automatically transcribed in the original language and automatically translated into English. Human annotators then manually corrected the transcripts and translations, and deidentified the audio and texts by removing confidential information such as personal names. The annotators also created minutes. The InCroMin corpus is intended primarily for end-to-end evaluation, in realistic conditions, of automatic systems that aim to facilitate cross-lingual dialogues. It can be used to evaluate Automatic Speech Processing, Speech Translation, Simultaneous Speech Translation, Quality Estimation, and Automatic Minuting

Errant Extended Vocabulary

Author: Nevěřilová Zuzana
Publication venue: Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Publication date: 05/10/2025

The ontology provides a FAIR, interoperable vocabulary for grammatical error annotation and correction, integrating the English-focused ERRANT taxonomy with Czech-specific extensions from ERRANT-CZ and fine-grained categories derived from Czech proofreading and correction rules (Opravidlo). The ontology formalizes error types, subtypes, and correction operations in RDF, aligns linguistic properties with the LexInfo ontology, and supports multilingual grammatical error correction research, annotation interoperability, and data reuse
