Charles University
LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
1998 research outputs found
Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
This dataset contains data for testing machine translation and topic classification in Piedmontese.
It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects (Adelani et al., EACL 2024)
Projekt_ZDH_transkripce
Text written in Kurrent script, transcribed with Transkribus and then finished by hand
Coreference in Universal Dependencies 1.4 (CorefUD 1.4)
CorefUD is a collection of previously existing coreference-annotated datasets that have been converted to a unified annotation scheme. In its current version (1.4), CorefUD comprises 33 datasets covering 19 languages. The datasets are enriched with automatically assigned morphological and syntactic annotations, fully compliant with the standards of the Universal Dependencies project, in cases where manual morphosyntactic annotation is not available or cannot be reliably converted. The data are stored in the CoNLL-U format, with coreference- and bridging-specific information encoded as attribute–value pairs in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The public edition is distributed via LINDAT-CLARIAH-CZ and contains 29 datasets for 19 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 3 for Czech, 1 for Dutch, 4 for English, 3 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Latin, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding test portions. The non-public edition is available internally to ÚFAL members and includes an additional 4 datasets for 2 languages (1 for Dutch and 3 for English) that cannot be redistributed due to licensing restrictions. It also contains the test portions for all datasets. When using any of the harmonized datasets, please review the respective license (available in the same directory as the data) and cite the original resource. Compared to version 1.3, version 1.4 introduces new languages and corpora: Czech-PDTSC, Latin-CorefLat, Dutch-OpenBoek, English-FantasyCoref, and French-LitBankFr. The last three consist of long literary documents. In addition, English-GUM, Czech-PCEDT, and Czech-PDT have been updated to newer releases. A detailed list of changes for each dataset is provided in the corresponding README file
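As a rough illustration of the storage scheme described above, the following sketch reads the MISC column (the tenth tab-separated field of a CoNLL-U token line) and splits it into attribute-value pairs. The helper names and the sample line are hypothetical, and the `Entity` attribute with its value is an assumption made for illustration; the README of each dataset remains the authority on the actual attributes.

```python
def parse_misc(misc: str) -> dict[str, str]:
    """Split a CoNLL-U MISC field into attribute-value pairs."""
    if misc == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    pairs = {}
    for item in misc.split("|"):  # pairs are separated by "|"
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def misc_of_token_line(line: str) -> dict[str, str]:
    """Return the parsed MISC column (10th field) of one CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return parse_misc(columns[9])


# Hypothetical token line; the Entity value is illustrative only.
example = "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\tEntity=(e1)|SpaceAfter=No"
print(misc_of_token_line(example))
```

Keeping the parser independent of any particular attribute name means the same helper works for coreference, bridging, and unrelated MISC attributes alike.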
Code and data accompanying the SynSemClass paper @ LREC 2026
Snapshot of code and data accompanying the paper accepted at LREC 2026: "Automatic Suggestions Help Extending Eventive Ontology: A Case Study on SynSemClass". The timestamp of the snapshot is March 6th, 2026. The original GitHub repository can be found at https://github.com/ufal/SynSemClassLREC2026
Verbs annotated for morphemic structure in Czech, English, German, Spanish v2
A sample of verb lemmas in four languages: Czech (19,040 lemmas), English (9,969 lemmas), German (27,158 lemmas), Spanish (11,768 lemmas). Each verb lemma is annotated for its morphemic structure (i.e., segmented into the prefix(es), root(s), suffix(es) and ending(s) that the given lemma contains), the classification of its root morph to a root morpheme where needed (to facilitate grouping of verbs with the same root morpheme), and the frequency of the verb in a 100 M corpus. Two versions are available for each language: one with a more coarse-grained segmentation, which captures the morphemic structure that is synchronically available, and one with a more fine-grained segmentation, which also captures the word's etymology
CooccurrenceFieldSampler (CFS)
The CooccurrenceFieldSampler (CFS) was developed for sampling from corpora to facilitate lexicographical data analysis. It works with corpora from different sources, text types or years. In random sentence sampling (random/opportunistic sampling), it can be observed that corpora containing different text types and lengths (per source) cannot always be mixed optimally, as they usually do not have the same size and have different topic weightings, for example. The CFS was designed to solve this problem.
The CFS first calculates all co-occurrences for all tokens within sentences – separately for each source. These corpora are then combined in a 1:1 mixture and the co-occurrences for the entire data set are recalculated. The tool evaluates which co-occurrences disappear and which new ones are created, resulting in quotas that control the random mixing of the corpora sentence by sentence.
The end result is a sentence-based corpus that (A) strives to retain the maximum number of co-occurrences from all sources (as accurately as possible) and (B) minimises the rejection of corpus data.
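The comparison step described above can be sketched in a few lines. This is a simplified illustration and not the tool's actual algorithm: it keeps the top `k` co-occurrences by raw count as a stand-in for the Poisson-based significance filter, mixes two equally sized toy sources by plain concatenation, and does not derive any sampling quotas. All function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(sentences):
    """Count unordered token pairs that occur within the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # A set ensures each pair is counted at most once per sentence.
        counts.update(combinations(sorted(set(sentence)), 2))
    return counts


def top_pairs(sentences, k):
    """Keep the k most frequent pairs (stand-in for a significance filter)."""
    return {pair for pair, _ in count_pairs(sentences).most_common(k)}


def compare_mixture(sources, k):
    """Report pairs that vanish or newly appear once the sources are mixed."""
    separate = set()
    for sentences in sources.values():
        separate |= top_pairs(sentences, k)
    mixed = [s for sentences in sources.values() for s in sentences]
    combined = top_pairs(mixed, k)
    return {"lost": separate - combined, "new": combined - separate}


# Two toy sources of equal size (five sentences each), i.e. a 1:1 mixture.
sources = {
    "finance": [["interest", "rate"]] * 3 + [["new", "york"]] * 2,
    "sport": [["penalty", "kick"]] * 3 + [["new", "york"]] * 2,
}
print(compare_mixture(sources, k=1))
```

In this toy case the source-specific pairs drop out of the top ranking after mixing, while a pair shared by both sources rises into it; shifts of this kind are what the quotas described above are meant to control.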
---
To use the CFS tool, follow these steps:
1. Unzip the ZIP file containing the necessary files.
2. For Windows, Linux, and macOS, you will find precompiled binaries that run exclusively on x64 processors.
3. If you are using a different processor type, such as ARM or ARM64, please use the Universal folder.
4. Windows users should run "cfs.exe" directly.
5. For Linux and macOS users, you must mark the cfs file as executable.
6. If using the Universal version, ensure .NET 10.0 is installed for compiling. You can then run the program with "dotnet cfs.dll".
7. To display help information, use the --help parameter.
Help/Parameter:
--from (Default: cec / recommended: cec) import file format (valid: cec, bnc, catma, clan, conll, cora, cwd, dewac, dta, folia, fln, korap, leipzig, xces,
relannis, salt, json, sketch, speedy, tiger, tlv, treetagger, tsv, txm, weblicht)
--input (Default: input/) folder with input-files (mix per file)
--to (Default: cec / recommended: cec) export file format (valid: cec, catma, conll, cwd, csv, dta, folia, i5, korap, xces, plain, salt, json, sketch,
speedy, tlv, tsv, treetagger, txm, weblicht)
--layer (Default: Wort) use this layer to calculate the co-occurrences
--output (Default: output.cec6) output file (every round and logfile)
--minFrequency (Default: 1 / recommended: 5) min. absolute frequency
--minSignificance (Default: 1.0 / recommended: 1.0) min. significance (poisson distribution)
--minChangeRate (Default: 0.1 / recommended: 0.1) min. change rate
--maxRounds (Default: 10 / recommended: 5) max. number of rounds
--help Display this help screen.
--version Display version information.
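For scripted runs, a call can be assembled from the parameters listed above. The sketch below only builds the argument list and does not execute anything. The helper name is hypothetical, and passing each option as a separate `--flag value` pair is an assumption about the command-line syntax; `--help` shows the authoritative form.

```python
import shlex


def build_cfs_command(input_dir="input/", output_file="output.cec6",
                      source_format="cec", target_format="cec",
                      layer="Wort", min_frequency=5, max_rounds=5,
                      universal=True):
    """Assemble a CFS call from the documented parameters (hypothetical helper)."""
    # Universal build: "dotnet cfs.dll"; otherwise the precompiled binary
    # (on Windows this would be "cfs.exe" instead of "./cfs").
    command = ["dotnet", "cfs.dll"] if universal else ["./cfs"]
    options = {
        "--from": source_format,
        "--input": input_dir,
        "--to": target_format,
        "--layer": layer,
        "--output": output_file,
        "--minFrequency": min_frequency,  # recommended value per help text
        "--maxRounds": max_rounds,        # recommended value per help text
    }
    for flag, value in options.items():
        command += [flag, str(value)]
    return command


# Print a shell-ready command line for CoNLL-U input.
print(shlex.join(build_cfs_command(source_format="conll")))
```

The resulting list can be passed to `subprocess.run` once the binary or the .NET runtime is available on the machine.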
Supported corpus formats (input/output):
cec - CorpusExplorer Corpus (v6) - http://corpusexplorer.de
bnc - British National Corpus - http://www.natcorp.ox.ac.uk/
catma - CATMA (Computer assisted text markup and analysis) - https://catma.de/
clan - CLAN/CHILDES - https://talkbank.org/childes/
conll - CoNLL-U - https://universaldependencies.org/format.html
cora - CORA XML - https://cora.readthedocs.io/en/latest/coraxml/
cwd - IMS Open Corpus Workbench (CWB) - https://cwb.sourceforge.io/
dewac - https://wacky.sslmit.unibo.it/doku.php?id=corpora
dta - DTA TCF-XML - https://www.deutschestextarchiv.de/download
folia - FoLiA XML - https://proycon.github.io/folia/
fln - Folker/OrthoNormal - https://exmaralda.org/de/folker-de/
korap - KorAP - http://korap.ids-mannheim.de/
leipzig - Wortschatz Leipzig - https://wortschatz.uni-leipzig.de/en/download/
xces - XCes XML - http://www.xces.org/ / https://www.cs.vassar.edu/CES/
relannis - https://corpus-tools.org/annis/
salt - https://corpus-tools.org/archive-2015-2025/salt/
json - https://de.wikipedia.org/wiki/JSON
sketch - SketchEngine VERT - https://www.sketchengine.eu/glossary/vertical-file/
speedy - SPEEDy Annotation Editor - http://kups.ub.uni-koeln.de/id/eprint/55224
tiger - TiGER-XML - https://www.ims.uni-stuttgart.de/documents/ressourcen/werkzeuge/tigersearch/doc/html/TigerXML.html
tlv - TLV-XML
treetagger - TreeTagger - https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
tsv - Tab-separated values - https://en.wikipedia.org/wiki/Tab-separated_values
txm - TXM - https://txm.gitpages.huma-num.fr/textometrie/?lang=en
weblicht - Weblicht - https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/Main_Page.html
csv - Comma-separated values - https://en.wikipedia.org/wiki/Comma-separated_values
i5 - i5-XML - https://www.ids-mannheim.de/en/digspra/pb-s1/projects/corpus-development/ids-text-model/
plain - Plaintext - https://en.wikipedia.org/wiki/Plain_tex
CRAC 2026 Empty Nodes Baseline Model
The crac2026_empty_nodes_baseline is an XLM-RoBERTa-large–based multilingual model for the CRAC 2026 Empty Nodes Baseline system https://github.com/ufal/crac2026_empty_nodes_baseline for predicting empty nodes in the input CoNLL-U files, trained on CorefUD 1.4 data. It was used to generate the baseline empty node predictions in the CRAC 2026 Shared Task on Multilingual Coreference Resolution https://ufal.mff.cuni.cz/corefud/crac26.
The model is language agnostic, so in theory it can be used to predict empty nodes in any language supported by XLM-RoBERTa.
Compared to last year's CRAC 2025 Empty Nodes Baseline https://github.com/ufal/crac2025_empty_nodes_baseline, this year's baseline predicts all available information for the empty nodes, i.e., the FORM, LEMMA, UPOS, XPOS, and FEATS columns, in addition to the previously predicted word order and dependency relations of the empty nodes.
Instructions for running prediction, training, and intrinsic evaluation are all available in the repository CRAC 2026 Empty Nodes Baseline https://github.com/ufal/crac2026_empty_nodes_baseline
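For readers unfamiliar with empty nodes, the sketch below shows how they can be picked out of a CoNLL-U file: their ID contains a decimal point (for example 0.1 or 8.1), while multiword token ranges contain a hyphen. This is not the baseline's code; the function names and the sample sentence are hypothetical and purely illustrative.

```python
def classify_id(token_id: str) -> str:
    """Classify a CoNLL-U ID as 'multiword', 'empty', or 'word'."""
    if "-" in token_id:   # e.g. "1-2": multiword token range
        return "multiword"
    if "." in token_id:   # e.g. "8.1": empty node
        return "empty"
    return "word"


def empty_nodes(conllu_text: str) -> list[dict[str, str]]:
    """Collect the columns the description mentions for each empty node."""
    nodes = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        columns = line.split("\t")
        if classify_id(columns[0]) == "empty":
            nodes.append({
                "id": columns[0], "form": columns[1], "lemma": columns[2],
                "upos": columns[3], "xpos": columns[4], "feats": columns[5],
            })
    return nodes


# Hypothetical sentence with a dropped subject represented as empty node 0.1.
sample = "\n".join([
    "# text = Vino .",
    "0.1\t_\t_\tPRON\t_\t_\t_\t_\t1:nsubj\t_",
    "1\tVino\tvenir\tVERB\t_\t_\t0\troot\t0:root\t_",
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t1:punct\t_",
])
print(empty_nodes(sample))
```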
Treebanks for Unified Taxonomy of Deep Syntactic Relations
The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024. Four languages are included in this release. English PropBank is omitted due to its license terms.
The updated version contains changes described in Dependency Parsing beyond Simple Trees (Kira Droganova 2025, PhD thesis, Chapter 4)
InCroMin 1.0: Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings
This data package contains published parts of InCroMin, a corpus of
cross-lingual dialogues with minutes and detection of misunderstandings.
InCroMin is described in the paper **Corpus of Cross-lingual Dialogues with Minutes
and Detection of Misunderstandings,** by Marko Čechovič, Natália Komorníková,
Dominik Macháček, and Ondřej Bojar. To be published in TSD 2025.
The data were created by volunteer participants, 2-5 people in each
meeting. They were matched so that each meeting included at least two groups of
people who did not understand each other's language. Their meeting was facilitated by
a simultaneous speech translation tool integrated in Minuteman. The meetings were
held via a teleconferencing platform that recorded each speaker in a separate
audio track. The participants consented to data processing and release.
Then, their speech was automatically transcribed in their original language, and
automatically translated into English. Then, human annotators manually corrected
transcripts and translations, and deidentified audio and texts by removing
confidential information such as person names. The annotators also created minutes.
The InCroMin corpus is intended primarily for evaluating
automatic systems that aim to facilitate cross-lingual dialogues in realistic
conditions and end-to-end. It can be used to evaluate Automatic Speech Processing, Speech
Translation, Simultaneous Speech Translation, Quality Estimation, and Automatic
Minuting
Errant Extended Vocabulary
The ontology provides a FAIR, interoperable vocabulary for grammatical error annotation and correction, integrating the English-focused ERRANT taxonomy with Czech-specific extensions from ERRANT-CZ and fine-grained categories derived from Czech proofreading and correction rules (Opravidlo). The ontology formalizes error types, subtypes, and correction operations in RDF, aligns linguistic properties with the LexInfo ontology, and supports multilingual grammatical error correction research, annotation interoperability, and data reuse