Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

1998 research outputs found

MIXPAR Database: Version 1.0 (September 2025)

Author: Štichauer Pavel
Ripamonti Fabio
Publication venue: Filozofická fakulta, Univerzita Karlova
Publication date: 01/10/2025

MIXPAR: A Database of Mixed Perfective Auxiliation in Italo-Romance (v1.0). This is the first public release (v1.0) of the MIXPAR database, a large-scale dataset documenting auxiliary selection patterns in Italo-Romance dialects. The project focuses on mixed perfective paradigms, that is, systems that alternate between essere (‘be’) and avere (‘have’) within the same TAM paradigm. The data is based on fieldwork and dialectal atlases and includes both structural pattern annotations and GPS-based geographic metadata, enabling advanced statistical and spatial modeling. Three files are included:
(1) MIXPAR-database-long-format_withGPS_September-2025.xlsx: full version of the database in long format with detailed linguistic and geographical annotations.
(2) mixpar_for_R_final.csv: cleaned and pre-processed version optimized for statistical modeling in R.
(3) patterns_with_gps.csv: summary of auxiliary patterns by location and TAM, enriched with geographic coordinates.

Czech Proofreading Rules

Author: Hlaváčková Dana
Machura Jakub
Žižková Hana
Kovář Vojtěch
Nevěřilová Zuzana
Publication venue: Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Publication date: 19/10/2025

The collection describes proofreading errors in Czech covered by Opravidlo 1.0. It consists of:
- the grammar rules applicable via the SET Czech syntactic parser
- a description of the grammar rules with relation to ERRANT codes
- an extended ERRANT ontology, created from the original ERRANT [Bryant et al., 2017] and its Czech extension [Náplava et al., 2022]
- a Python script that demonstrates how to apply the SET rules to proofreading
The dataset contains 6649 SET rules in the main categories: agreement, capitals, commas, dependent clauses, non-grammatical structures, pronouns, spelling complex, and others. The error categories form a taxonomy with Czech and English descriptions, examples, and links to ERRANT codes, 175 classes in total.

CantusCorpus v1.0

Author: Anna Dvořáková
Debra Lacoste
Hajič jr. Jan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 19/11/2025

CantusCorpus 1.0 is a large dataset of Gregorian chant intended for computational research. The dataset consists of all chants that are accessible through the Cantus Index federated search interface, combining data from 10 individual chant databases. Primarily these are catalogue records: which chants appear in which manuscripts. What allows us to identify multiple instances of a chant across different manuscripts is the Cantus ID mechanism, established over the long history of the Cantus Database. Thus, CantusCorpus 1.0 has two components: chant records (chants.csv) and source records (sources.csv), the latter overwhelmingly describing manuscripts. CantusCorpus lies inherently downstream of the Cantus Database and the whole Cantus Index network of compatible chant databases: we do not revisit anyone's editorial decisions. However, the value of this dataset is that the sum of all the editorial decisions made over the databases' decades of existence is being made available as a dataset for computational research. The PyCantus library (https://github.com/dact-chant/PyCantus) then makes handling this dataset (almost) easy. The accompanying source code (CantusCorpus-1.0.zip) contains a subdirectory with code and documentation for this particular version of CantusCorpus (v1.0). We expect to re-collect the dataset annually, as the Cantus network grows by tens of thousands of chant records each year.

[RPK] - Radiopredigtenkorpus (german radio sermons): 1933-1939; 1950-1960; 2010-2024

Author: Anna-Maria Balbach
Jan Oliver Rüdiger
Publication venue: Universität Münster
Publication date: 01/09/2025

The first corpus of German radio sermons comprises over 29,000 fully digitized and annotated manuscripts of modern radio sermons from the years 2010–2024, 267 manuscripts of historical radio sermons from the Nazi era (1933–1939), and 96 texts from GDR radio (1950–1960). The modern radio sermons are the final versions of the manuscripts that were recorded in the studio and then broadcast on various German stations. The radio sermons come from RBB, HR, WDR, SWR, BR, and SR, covering the entire German broadcasting area. Texts from an average of ten years are available from each station. The historical Catholic radio sermons from the Nazi era were collected from the diocesan archives in Münster. They were broadcast between 1933 and 1938 on Reichssender Köln. The historical Protestant radio sermons come from the State Church Archive of the Evangelical Lutheran Church in Bavaria (LAELKB) in Nuremberg and were broadcast between 1933 and 1939 on Reichssender München. The radio sermons from the GDR regime cover the years 1950–1960 and were broadcast on GDR radio. The Catholic texts were taken from the publication Pfeiffer, Ernst (1962): Frohbotschaft in Rundfunkansprachen. Leipzig: St. Benno-Verlag GmbH, while the Protestant radio sermons were taken from the publication Wagner, Heinz (1973): Annahme einer Nachricht. Predigten im Rundfunk. Berlin: Evangelische Verlagsanstalt. The corpus thus comprises texts from three exemplary periods in the 100-year history of radio sermons (1924–2024), which were created under three different political systems. The radio sermons from the Nazi era and the GDR period were subject to strict censorship before they were broadcast. This allows for both diachronic linguistic analyses and synchronic analyses of radio sermons from different stations. The addition of metadata (station, denomination, author, date, title) enables, for example, linguistic analyses from regional or denominational perspectives.
It is also possible to conduct studies of radio sermons by specific authors or from specific times or periods (e.g., during the pandemic). The corpus can be used free of charge. Please note that the files are saved as *.cec6. These are binary files from CorpusExplorer (available free of charge as open source at http://corpusexplorer.de). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6 Converter’ (available for Linux, macOS, and Windows - see: http://hdl.handle.net/11372/LRT-5913) to convert the data. The data can be exported in the following formats: CATMA v6; CoNLL; CSV; CSV (metadata only); DTA TCF-XML; DWDS TEI-XML; HTML; IDS I5-XML; IDS KorAP XML; IMS Open Corpus Workbench; JSON; OPUS Corpus Collection XCES; Plaintext; SaltXML; SlashA XML; SketchEngine VERT; SPEEDy/CODEX (JSON); TLV-XML; TreeTagger; TXM; WebLicht; and simple XML.

Coreference in Universal Dependencies 1.3 (CorefUD 1.3)

Author: Novák Michal
Popel Martin
Zeman Daniel
Žabokrtský Zdeněk
Nedoluzhko Anna
Acar Kutay
Bamman David
Bourgonje Peter
Cinková Silvie
Eckhoff Hanne
Cebiroğlu Eryiğit Gülşen
Hajič Jan
Hardmeier Christian
Haug Dag
Jørgensen Tollef
Kåsen Andre
Krielke Pauline
Landragin Frédéric
Lapshinova-Koltunski Ekaterina
Mæhlum Petter
Martí M. Antònia
Mikulová Marie
Milintsevich Kirill
Mujadia Vandan
Muzerelle Judith
Nam Sangha
Nøklestad Anders
Ogrodniczuk Maciej
Øvrelid Lilja
Pamay Arslan Tuğba
Porada Ian
Recasens Marta
Solberg Per Erik
Stede Manfred
Straka Milan
Swanson Daniel
Toldova Svetlana
Vadász Noémi
Velldal Erik
Vincze Veronika
Zeldes Amir
Žitkus Voldemaras
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 17/04/2025

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.3 consists of 28 datasets for 18 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 24 datasets for 17 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 2 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Compared to the previous version 1.2, version 1.3 adds new languages and corpora, namely French-ANCOR, Hindi-HDTB, and Korean-ECMT. In addition, English-GUM and Czech-PDT have been updated to newer versions, and the conversion of zeros in Hungarian-KorKor has been improved (a list of all changes in each dataset can be found in the corresponding README file).
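Because the coreference information sits in the MISC column as attribute-value pairs, a small reader is enough to get at it. The following is a minimal sketch, not the CorefUD tooling: it relies only on the general CoNLL-U conventions (ten tab-separated columns, MISC last, pairs separated by `|`, `_` for an empty field). The sample token line and the `Entity` value in it are made up for illustration, and the helper names `parse_misc` and `misc_of_token_line` are hypothetical.

```python
from typing import Dict


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a CoNLL-U MISC field into a dict of attribute -> value."""
    if misc == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    pairs: Dict[str, str] = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        # Keep attributes that carry no "=value" part with an empty value.
        pairs[key] = value if sep else ""
    return pairs


def misc_of_token_line(line: str) -> Dict[str, str]:
    """Return the parsed MISC attributes of one CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return parse_misc(columns[9])  # MISC is the tenth column


# Hypothetical token line; the Entity value is illustrative only.
sample = "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\tEntity=(e1-person-1)|SpaceAfter=No"
attrs = misc_of_token_line(sample)
print(attrs.get("Entity"))
```

Comment lines (starting with `#`) and blank sentence separators would need to be skipped before calling `misc_of_token_line`; the README of each dataset remains the reference for which attributes actually occur.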

Diadem Speech-Cognitive Dataset (DSCD-CZ)

Author: Šmídl Luboš
Krejčová Marie
Zapletalová Michaela
Polák Filip
Zajícová Lucie
Švec Jan
Víta Martin
Bartoš Aleš
Publication venue: Institute of Physics, Czech Academy of Sciences
Publication date: 29/05/2025

The dataset was created to investigate the speech and cognitive performance of people with varying degrees of cognitive impairment, primarily dementia. The dataset contains a comprehensive set of data including the results of standardized neuropsychological tests (RBANS, ALBA, POBAV, MASTCZ), speech tasks focused on comprehension, memory, naming, and repetition, and demographic data (age, gender, education). Participants were divided into four groups based on clinical assessment: healthy individuals, healthy individuals with possible mild cognitive impairment, patients with mild cognitive impairment, and patients with dementia. All recordings and examinations were managed as part of routine clinical practice in the neurological outpatient clinic – Memory Disorders Advisory Unit, at the Neurological Clinic of the Faculty Hospital Královské Vinohrady. The dataset containing 268 examinations was divided into a training and test part using stratification by clinical group, age, gender, and level of education to ensure an even distribution of these key characteristics in both parts of the data. The aim of the dataset is to support the development of methods for automated detection of cognitive disorders based on speech analysis and cognitive performance. The data are suitable for research in the areas of clinical neuropsychology, computational linguistics, and machine learning. The dataset is intended for non-commercial research purposes

EduPo: Analysis and Generation of Czech Poetry, v0.5

Author: Rosa Rudolf
Musil Tomáš
Mareček David
Chudoba Michal
Landsperský Jakub
Plecháč Petr
Dosoudil Jiří
Publication venue: Institute of Czech Literature, Czech Academy of Sciences
Publication date: 17/03/2025

A suite of tools for analysis and generation of Czech poetry. This is a snapshot of the public GitHub repository at https://github.com/ufal/edupo, the beta version of the tool suite, released together with a scientific paper at the NLP4DH 2025 conference.

ORATOR v3: corpus of spoken Czech monologues (transcriptions)

Author: Kopřivová Marie
Laubeová Zuzana
Lukeš David
Poukarová Petra
Horký Václav
Jelínek Tomáš
Křivan Jan
Publication venue: Charles University, Faculty of Arts, Department of Linguistics
Publication date: 28/05/2025

The ORATOR v3 corpus contains monologues by native Czech speakers. Typical situations include lectures, instructions, guided tours, welcome addresses, and sermons. The corpus is composed of 489 recordings from 2005–2019 and contains 1 212 729 orthographic words (i.e. a total of 1 542 133 tokens including punctuation); a total of 468 different speakers appear in the samples. The transcription was done manually and is linked to the corresponding audio track. ORATOR v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The (anonymized) corpus is provided in a (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond exactly to the corpus available to registered users of the CNC via KonText at https://www.korpus.cz/kontext/query?corpname=orator_v3. Please note: this item includes only the transcriptions; the audio (and the transcripts in their original format) is available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-593
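To illustrate what reading a vertical file involves, here is a minimal sketch. It assumes only the general shape of the format: one token per line with tab-separated attributes, interleaved with XML-like structure tags on lines of their own. The column order (word form, lemma, tag), the tag names, and the sample lines below are assumptions for illustration, not the actual ORATOR v3 layout, and `iter_tokens` is a hypothetical helper. Counting tokens whose form contains a letter or digit is likewise only a rough stand-in for the distinction between orthographic words and all tokens.

```python
from typing import Iterable, Iterator, List


def iter_tokens(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the attribute columns of each token line, skipping structure tags."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # Heuristic: a line that looks like <...> is treated as a structure tag.
        if line.startswith("<") and line.endswith(">"):
            continue
        yield line.split("\t")


# Hypothetical sample; tag names and columns are illustrative only.
sample = [
    '<doc id="hypothetical">',
    '<sp num="1">',
    "Dobrý\tdobrý\tAA",
    "den\tden\tNN",
    ",\t,\tZ:",
    "</sp>",
    "</doc>",
]

tokens = list(iter_tokens(sample))
# Rough proxy for orthographic words: forms with at least one letter or digit.
words = sum(1 for cols in tokens if any(ch.isalnum() for ch in cols[0]))
print(len(tokens), words)
```

For a real file, the same function can be fed an open file handle (for example `open(path, encoding="utf-8")`); the documentation shipped with the corpus should be consulted for the actual attribute order.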

Saxophone Trills Dataset

Author: Šimon Libřický
Jan Hajič jr.
Publication venue: Charles University in Prague, UFAL
Publication date: 26/06/2025

This is the audio data of saxophone trills, used for difficulty estimation in the paper "Modeling the difficulty of saxophone music" by Šimon Libřický and Jan Hajič jr., ISMIR 2025. The dataset consists of recordings of saxophone trills played at the maximum attainable speed under some quality conditions (most importantly: stable intonation) split into individual sessions. Five different saxophone players (all approximately at the level of conservatory students) recorded trills for all intervals available on the tenor saxophone. The trill speeds are then used as inputs into a difficulty estimation model, with the assumption that intervals with lower attainable trill speeds are harder to play (e.g., larger leaps, or those that require re-using the same finger for a different key). There is a total of 817 recorded trills. Six "anchor" intervals were recorded in each of the 13 sessions, so that the players can be compared, making the total number of distinct recorded trills 745. If you use this dataset, please cite: Libřický, Šimon and Hajič jr., Jan (2025). "Modeling the difficulty of saxophone music." In: Proceedings of the 26th International Society for Music Information Retrieval Conference, Daejeon, Republic of Korea, Sep 2025

LatinISE corpus (version 5)

Author: McGillivray Barbara
Publication venue: Lexical Computing
Publication date: 25/03/2025

The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre, title, century or specific date. This Latin corpus was built by Barbara McGillivray. In version 5 of the corpus, the author names and datings of texts before 600 CE have been manually corrected and duplicate texts have been removed. Thanks to Valentina Lunardi for this data curation.
