Charles University
LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
1998 research outputs found
AI Koditex v1
AI Koditex is a corpus of Czech texts generated with large language models (LLMs). Its main purpose is to provide a resource for linguistic comparison of human-written and LLM-generated texts. The corpus is multi-genre, rich in topics, authors, and text types, and comparable with existing human-created corpora. It replicates the reference human-written Koditex corpus, which follows the Brown Corpus tradition. The texts were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., the texts are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (768k tokens per model on average, 21.5M tokens altogether). The raw data and plain texts are freely available for download under the CC BY 4.0 license; the UD-annotated data are under the CC BY-NC-SA 4.0 license. The corpus is also accessible through the KonText search interface of the Czech National Corpus (https://www.korpus.cz/kontext/query?corpname=ai_koditex_v1).
CEC6-Converter (2025-05-29)
This software converts *.cec6 files into 24 formats that are commonly used in corpus linguistics / NLProc. It runs on all modern operating systems (Windows, Linux, MacOS). The binaries have been compiled for the x64 architecture. If you are using a processor (CPU) with an x86 or ARM architecture, please follow the instructions for "other operating systems or x86 / ARM / ARM64".
SynSemClass 5.5
The SynSemClass event-type ontology evolved from the original idea to create a bilingual, later multilingual, synonym lexicon. This version (5.5) builds on previous versions and substantially enriches them with new synonymous classes (the number has risen from 1546 to 1993). In addition, version 5.5 has been extended by two items: Czech deverbal nouns (a small sample) and hierarchical relations. The hierarchical structure captures specialization and generalization relations between classes that are formally and technically unrelated in the original ontology, and it is now integrated with the main files constituting the lexicon (symsemclass55.zip). As a lexical-semantic resource, this version continues to link to similar resources, such as PDT-Vallex, EngVallex, CzEngVallex, NomVallex, FrameNet, VerbNet, PropBank, Ontonotes, Woxikon, E-VALBU, GUP, German FrameNet, ADESSE, SenSem, AnCora, and Spanish WordNet and FrameNet. Examples of sentences in which multilingual synonyms have been used are also included (example_sentences.zip). The original class compositions, as automatically pre-suggested but later removed during manual correction and further annotation, are included for completeness and historical reasons (removed_cms.zip).
The individual languages are linked as follows (referenced resources not included but all are available online):
The Spanish entries are linked to ADESSE (http://adesse.uvigo.es/), Spanish SenSem (https://grial-research.github.io/en/index.html), Spanish WordNet (https://adimen.ehu.eus/cgi-bin/wei/public/wei.consult.perl), AnCora (https://clic.ub.edu/corpus/en/ancoraverb_es), and Spanish FrameNet (http://sfn.spanishfn.org/SFNreports.php).
The English entries are linked to EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/), VerbNet (https://uvi.colorado.edu/ and http://verbs.colorado.edu/verbnet/index.html), PropBank (http://propbank.github.io/), Ontonotes (http://clear.colorado.edu/compsem/index.php?page=lexicalresources&sub=ontonotes), and the Open English Wordnet (https://en-word.net/).
The Czech verbal entries are linked to PDT-Vallex 4.5 (http://hdl.handle.net/11234/1-5814), Vallex (http://hdl.handle.net/11234/1-4756), and CzEngVallex (http://hdl.handle.net/11234/1-1512). The Czech deverbal nouns are linked to NomVallex (https://ufal.mff.cuni.cz/nomvallex/2.5).
The German entries are linked to Woxikon (https://synonyme.woxikon.de), E-VALBU (https://grammis.ids-mannheim.de/verbvalenz), and GUP (https://github.com/UniversalDependencies/UD_German-GSD).
ParCzech4Speech 1.0
We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus targeted at speech modeling tasks. Its largest variant contains 2,695 hours of aligned speech from 587 speakers. We combined the sound recordings of Czech parliamentary speeches with the official transcripts. The recordings were processed with WhisperX and Wav2Vec 2.0 to obtain an automatic audio-text alignment.
The dataset is offered in three flexible variants:
(1) sentence-segmented for automatic speech recognition and speech synthesis tasks with clean boundaries,
(2) unsegmented preserving original utterance flow across sentences, and
(3) raw alignment for further custom refinement for other possible tasks.
Note: This release contains alignment data and text segments (official and recognized transcripts). The source audio must be obtained separately from the AudioPSP 24.01 corpus, using the 'filePath' column to locate the corresponding audio file and the 'start'/'end' timestamps to extract specific segments.
The official transcripts are available in ParCzech 4.0 corpus (http://hdl.handle.net/11234/1-5360).
The original audio files are available in AudioPSP 24.01 corpus (http://hdl.handle.net/11234/1-5404).
Note: All three variants are provided in both .tsv (tab-separated values) and .parquet (columnar binary) formats. The data content is identical across formats.
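The lookup described in the first note (locate the file via 'filePath', then cut by 'start'/'end') could be sketched roughly as follows. This is a minimal illustration and not part of the release: only the column names 'filePath', 'start', and 'end' come from the description; the assumption that the timestamps are in seconds, the 16 kHz sample rate, and the sample rows and file names are all made up for the example.

```python
import csv
import io

SAMPLE_RATE = 16_000  # assumption: audio sample rate in Hz


def segment_frames(tsv_text, sample_rate=SAMPLE_RATE):
    """Yield (filePath, first_frame, frame_count) for each alignment row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        start = float(row["start"])  # assumption: timestamps are in seconds
        end = float(row["end"])
        first = round(start * sample_rate)
        count = round(end * sample_rate) - first
        yield row["filePath"], first, count


# Hypothetical rows; the real .tsv files contain more columns.
example = (
    "filePath\tstart\tend\n"
    "audio/a.mp3\t1.5\t4.0\n"
    "audio/a.mp3\t4.0\t6.25\n"
)
segments = list(segment_frames(example))
```

The frame offsets can then be passed to whatever audio library is used to read the AudioPSP 24.01 files; the .parquet variant would need a columnar reader instead of the csv module.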
EdUKate translation models 2025
This package includes three models adapted for sentence-level machine translation in the educational domain: Czech-to-Ukrainian, Czech-to-English, and Czech-to-German. The models are provided as LoRA adapters on top of the EuroLLM-9B-Instruct LLM and can be used in the Charles Translator service (https://translator.cuni.cz) and in the web portal Škola s nadhledem (https://skolasnadhledem.cz/). The models were developed within the EdUKate project, which aims to reduce the language barriers that non-Czech-speaking children in the Czech Republic face in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.
DeriNet 2.3
DeriNet is a lexical network modeling derivational and compositional relations in Czech. The nodes of the network represent Czech lexemes, while the edges capture word-formational relations between derived words and their base word(s). The current version, DeriNet 2.3, introduces several key improvements over version 2.2:
(a) the set of 1,040,126 lexemes is aligned with the latest version of MorfFlex CZ (version 2.1),
(b) 5,781 derivational trees containing loanwords are enriched with etymological information specifying their origins, adopted from the Czech Etymological Lexicon,
(c) 8,867 new derivational and 1,262 new compound relations have been identified, resulting in a total of 791,771 derivational and 7,598 compound relations, and
(d) the morphological segmentation and classification of morphs have been significantly enhanced.
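The network structure described above (lexemes as nodes, edges from a derived word to its base word or words) can be illustrated with a small sketch. This is not the DeriNet file format or API; the class, the function names, and the handful of example lexemes are hypothetical and chosen only to show one derivational chain and one compound.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison; nodes reference each other
class Lexeme:
    lemma: str
    parents: list = field(default_factory=list)   # base lexeme(s)
    children: list = field(default_factory=list)  # lexemes formed from this one


def link(child, *bases):
    """Record that `child` is formed from one base (derivation) or several (compound)."""
    for base in bases:
        child.parents.append(base)
        base.children.append(child)


def relation_type(lexeme):
    """Classify a lexeme by the number of its base words."""
    if not lexeme.parents:
        return "root"
    return "derivation" if len(lexeme.parents) == 1 else "compound"


def root_of(lexeme):
    """Follow the first base upwards until a lexeme without a base is reached."""
    while lexeme.parents:
        lexeme = lexeme.parents[0]
    return lexeme


# Derivational chain: učit -> učitel -> učitelka
ucit, ucitel, ucitelka = Lexeme("učit"), Lexeme("učitel"), Lexeme("učitelka")
link(ucitel, ucit)
link(ucitelka, ucitel)

# Compound with two bases: černý + bílý -> černobílý
cerny, bily, cernobily = Lexeme("černý"), Lexeme("bílý"), Lexeme("černobílý")
link(cernobily, cerny, bily)
```

Here `root_of(ucitelka)` walks up to the lexeme for "učit", and `relation_type(cernobily)` reports a compound because the lexeme has two bases.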
Universal Dependencies 2.16
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008)
Universal Dependencies 2.17 models for UDPipe 2 (2025-11-25)
Tokenizer, POS tagger, lemmatizer, and parser models for 169 treebanks of 93 languages of the Universal Dependencies 2.17 treebanks, created solely using UD 2.17 data (http://hdl.handle.net/11234/1-6036). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_217_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Antoninus Liberalis, Μεταμορφώσεων συναγωγή (Transformationum congeries)
Μεταμορφώσεων συναγωγή (Latin "Transformationum congeries", English "Collection of Metamorphoses") is a Greek mythographic work in prose attributed to Antoninus Liberalis, an otherwise unknown author, and dated most likely to the 1st or 2nd century CE.
Odia Visual Genome
The Odia Visual Genome is a multimodal dataset comprising aligned textual and visual data, designed to support research in English-Odia multimodal machine translation as well as broader studies in multimodal language processing. The dataset is derived from the Visual Genome corpus, which provides short English image captions paired with corresponding images. For the Odia Visual Genome, we selected a subset of these captions and automatically translated them into Odia, followed by careful manual post-editing. In the post-editing stage, annotators explicitly considered the associated visual context to ensure semantic adequacy and naturalness of the Odia translations.
The corpus is partitioned into four subsets. The training set contains approximately 29,000 segments, while the development set and the test set contain approximately 1,000 and 1,600 segments, respectively. The development and test sets were taken from Hindi Visual Genome 1.1, where they were created via random sampling from the Visual Genome corpus. In addition, a challenge test set of 1,400 segments was prepared for the WAT 2019 Multimodal Translation Task. The challenge test set was constructed to specifically target lexical ambiguity in English captions. Candidate items were identified based on embedding similarity, and ambiguous instances were manually selected where visual information plays a crucial role in disambiguation. Although in many cases the surrounding textual context also provides sufficient cues, the inclusion of the image makes disambiguation more robust.
Odia Visual Genome was used in the WAT 2025 Multimodal Translation Task (https://ufal.mff.cuni.cz/wat2025english-indicmultimodaltranslation).
Dataset Formats
The dataset contains both textual and visual components.
Textual Data. The training, development, and test partitions are distributed as tab-delimited plain-text files. Each file consists of seven columns:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Odia Text
The bounding-box coordinates (X, Y, Width, Height) specify the rectangular region of the image referenced by the caption.
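Reading one row of the seven-column format listed above might look like the sketch below. It is only an illustration: the field names of the record, the helper names, and the sample row are hypothetical, the Odia column is filled with a placeholder rather than real text, and treating (X, Y) as the top-left corner of the box is an assumption that follows the usual Visual Genome convention.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    odia: str


def parse_line(line):
    """Split one tab-delimited row into the seven documented columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    image_id, x, y, width, height, english, odia = fields
    return Segment(image_id, int(x), int(y), int(width), int(height), english, odia)


def crop_box(seg):
    """Return (left, upper, right, lower), assuming (X, Y) is the top-left corner."""
    return (seg.x, seg.y, seg.x + seg.width, seg.y + seg.height)


# Hypothetical row; ODIA_TEXT stands in for the actual Odia caption.
sample = "12345\t10\t20\t100\t50\ta red ball\tODIA_TEXT\n"
segment = parse_line(sample)
box = crop_box(segment)
```

The (left, upper, right, lower) tuple is the form many image libraries expect for cropping, so the caption can be paired with just the region it describes.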
Visual Data.
The image collection contains full-resolution images, each identified by the corresponding image_id. The bounding-box metadata enables linking of captions to specific regions within the images.
Corpus Statistics
Parallel corpus statistics for Odia Visual Genome.
Dataset           Segments   English Words   Odia Words
----------------  ---------  --------------  -----------
Train                 28930          143134       141652
Dev                     998            4922         4912
Test                   1595            7854         7734
Challenge Test         1400            8186         8100
----------------  ---------  --------------  -----------
Total                 32923          164096       162398
The word counts are approximate and were taken prior to tokenization.