Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
    1998 research outputs found

    MIXPAR Database: Version 1.0 (September 2025)

    MIXPAR: A Database of Mixed Perfective Auxiliation in Italo-Romance (v1.0). This is the first public release (v1.0) of the MIXPAR database, a large-scale dataset documenting auxiliary selection patterns in Italo-Romance dialects. The project focuses on mixed perfective paradigms—systems that alternate between essere (‘be’) and avere (‘have’) within the same TAM paradigm. The data is based on fieldwork and dialectal atlases and includes both structural pattern annotations and GPS-based geographic metadata, enabling advanced statistical and spatial modeling. Three files are included: (1) MIXPAR-database-long-format_withGPS_September-2025.xlsx: full version of the database in long format with detailed linguistic and geographical annotations. (2) mixpar_for_R_final.csv: cleaned and pre-processed version optimized for statistical modeling in R. (3) patterns_with_gps.csv: summary of auxiliary patterns by location and TAM, enriched with geographic coordinates

    Czech Proofreading Rules

    The collection describes proofreading errors in Czech covered by Opravidlo 1.0. It consists of:
    - the grammar rules applicable via the SET Czech syntactic parser
    - a description of the grammar rules in relation to ERRANT codes
    - an extended ERRANT ontology, created from the original ERRANT [Bryant et al., 2017] and its Czech extension [Náplava et al., 2022]
    - a Python script that demonstrates how to apply the SET rules to proofreading
    The dataset contains 6649 SET rules in the main categories: agreement, capitals, commas, dependent clauses, non-grammatical structures, pronouns, spelling complex, and others. The error categories form a taxonomy with Czech and English descriptions, examples, and links to ERRANT codes, 175 classes in total.

    CantusCorpus v1.0

    CantusCorpus 1.0 is a large dataset of Gregorian chant intended for computational research. The dataset consists of all chants that are accessible through the Cantus Index federated search interface, combining data from 10 individual chant databases. Primarily these are catalogue records: which chants appear in which manuscripts. What allows us to identify multiple instances of a chant across different manuscripts is the Cantus ID mechanism, established from the long history of the Cantus Database. Thus, CantusCorpus 1.0 has two components: chant records (chants.csv), and source - overwhelmingly manuscript - records (sources.csv). CantusCorpus lies inherently downstream of the Cantus Database and the whole Cantus Index network of compatible chant databases: we do not revisit anyone's editorial decisions. However, the value of this dataset is that the sum of all the editorial decisions made over the databases' decades of existence is being made available as a dataset for computational research. The PyCantus library (https://github.com/dact-chant/PyCantus) then makes handling this dataset (almost) easy. The accompanying source code (CantusCorpus-1.0.zip) contains a subdirectory with code and documentation for this particular version of CantusCorpus (v1.0). We expect to re-collect the dataset annually, as the Cantus network grows by tens of thousands of chant records each year.
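To illustrate how a shared Cantus ID links instances of one chant across manuscripts, here is a minimal sketch using only the Python standard library. The column names (`cantus_id`, `source_id`, `incipit`) and the sample rows are assumptions made for illustration; the real chants.csv schema may differ, and the PyCantus library is the intended way to work with the data.

```python
import csv
import io
from collections import defaultdict

# Invented miniature stand-in for chants.csv.
# Column names and values are hypothetical placeholders, not real records.
SAMPLE_CHANTS = """cantus_id,source_id,incipit
001234,src_A,placeholder incipit one
001234,src_B,placeholder incipit one
005678,src_A,placeholder incipit two
"""


def sources_by_cantus_id(csv_text):
    """Map each Cantus ID to the sorted list of sources in which it appears."""
    grouped = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["cantus_id"]].add(row["source_id"])
    return {cid: sorted(sources) for cid, sources in grouped.items()}


index = sources_by_cantus_id(SAMPLE_CHANTS)
for cantus_id, sources in index.items():
    print(cantus_id, sources)
```

Reading the values as strings keeps identifiers with leading zeros intact, which matters if IDs are compared across files.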

    [RPK] - Radiopredigtenkorpus (German radio sermons): 1933-1939; 1950-1960; 2010-2024

    The first corpus of German radio sermons comprises over 29,000 fully digitized and annotated manuscripts of modern radio sermons from the years 2010–2024, 267 manuscripts of historical radio sermons from the Nazi era (1933–1939), and 96 texts from GDR radio (1950–1960). The modern radio sermons are the final versions of the manuscripts that were recorded in the studio and then broadcast on various German stations. The radio sermons come from RBB, HR, WDR, SWR, BR, and SR, covering the entire German broadcasting area. Texts from an average of ten years are available from each station. The historical Catholic radio sermons from the Nazi era were collected from the diocesan archives in Münster. They were broadcast between 1933 and 1938 on Reichssender Köln. The historical Protestant radio sermons come from the State Church Archive of the Evangelical Lutheran Church in Bavaria (LAELKB) in Nuremberg and were broadcast between 1933 and 1939 on Reichssender München. The radio sermons from the GDR regime cover the years 1950–1960 and were broadcast on GDR radio. The Catholic texts were taken from the publication Pfeiffer, Ernst (1962): Frohbotschaft in Rundfunkansprachen. Leipzig: St. Benno-Verlag GmbH, while the Protestant radio sermons were taken from the publication Wagner, Heinz (1973): Annahme einer Nachricht. Predigten im Rundfunk. Berlin: Evangelische Verlagsanstalt. The corpus thus comprises texts from three exemplary periods in the 100-year history of radio sermons (1924–2024), which were created under three different political systems. The radio sermons from the Nazi era and the GDR period were subject to strict censorship before they were broadcast. This allows for both diachronic linguistic analyses and synchronic analyses of radio sermons from different stations. The addition of metadata (station, denomination, author, date, title) enables linguistic analyses from regional or denominational perspectives, for example. It is also possible to conduct studies of radio sermons by specific authors or from specific times or periods (e.g., during the pandemic). The corpus can be used free of charge. Please note that the files are saved as *.cec6. These are binary files from CorpusExplorer (available free of charge as open source at http://corpusexplorer.de). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6 Converter’ (available for Linux, MacOS, and Windows - see: http://hdl.handle.net/11372/LRT-5913) to convert the data. The data can be exported in the following formats: CATMA v6; CoNLL; CSV; CSV (metadata only); DTA TCF-XML; DWDS TEI-XML; HTML; IDS I5-XML; IDS KorAP XML; IMS Open Corpus Workbench; JSON; OPUS Corpus Collection XCES; Plaintext; SaltXML; SlashA XML; SketchEngine VERT; SPEEDy/CODEX (JSON); TLV-XML; TreeTagger; TXM; WebLicht; and simple XML.

    Coreference in Universal Dependencies 1.3 (CorefUD 1.3)

    CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.3 consists of 28 datasets for 18 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 24 datasets for 17 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 2 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too. Compared to the previous version 1.2, version 1.3 adds new languages and corpora, namely French-ANCOR, Hindi-HDTB, and Korean-ECMT. In addition, English-GUM and Czech-PDT have been updated to newer versions and the conversion of zeros in Hungarian-KorKor has been improved (a list of all changes in each dataset can be found in the corresponding README file).
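Since the coreference information sits in attribute-value pairs in the MISC column, the following sketch shows one way to read that column from a single CoNLL-U token line. It relies only on the general CoNLL-U convention that MISC is the tenth tab-separated column and that its items are separated by `|`. The sample line is invented, and the `Entity` value shown is a simplified placeholder rather than a faithful CorefUD annotation; the README of each dataset describes the actual value syntax.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC field into a dict of attribute-value pairs."""
    if misc == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        # Items without "=" are kept as bare flags with value None.
        pairs[key] = value if sep else None
    return pairs


# Invented token line with the 10 tab-separated CoNLL-U columns.
# The Entity value is a simplified placeholder (assumption).
SAMPLE_LINE = "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\tEntity=(e1)|SpaceAfter=No"

columns = SAMPLE_LINE.split("\t")
misc = parse_misc(columns[9])
print(misc)
```

Using `partition` rather than `split("=")` keeps any further `=` characters inside the value intact.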

    Diadem Speech-Cognitive Dataset (DSCD-CZ)

    The dataset was created to investigate the speech and cognitive performance of people with varying degrees of cognitive impairment, primarily dementia. The dataset contains a comprehensive set of data including the results of standardized neuropsychological tests (RBANS, ALBA, POBAV, MASTCZ), speech tasks focused on comprehension, memory, naming, and repetition, and demographic data (age, gender, education). Participants were divided into four groups based on clinical assessment: healthy individuals, healthy individuals with possible mild cognitive impairment, patients with mild cognitive impairment, and patients with dementia. All recordings and examinations were managed as part of routine clinical practice in the neurological outpatient clinic – Memory Disorders Advisory Unit, at the Neurological Clinic of the Faculty Hospital Královské Vinohrady. The dataset containing 268 examinations was divided into a training and test part using stratification by clinical group, age, gender, and level of education to ensure an even distribution of these key characteristics in both parts of the data. The aim of the dataset is to support the development of methods for automated detection of cognitive disorders based on speech analysis and cognitive performance. The data are suitable for research in the areas of clinical neuropsychology, computational linguistics, and machine learning. The dataset is intended for non-commercial research purposes
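The stratified train/test division mentioned above can be sketched as follows. This is a generic illustration and not the procedure the dataset authors used: the 20% test fraction, the record fields, the group labels, and the synthetic records are all assumptions, and the sketch stratifies on two attributes only, whereas the description also names age and level of education.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Split records so every stratum contributes the same share to the test set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Synthetic examinations: 4 hypothetical group labels x 2 genders x 5 people each.
GROUPS = ("healthy", "possible_mci", "mci", "dementia")
combos = [(group, gender) for group in GROUPS for gender in ("F", "M")] * 5
records = [
    {"id": i, "group": group, "gender": gender}
    for i, (group, gender) in enumerate(combos)
]

train, test = stratified_split(records, key=lambda r: (r["group"], r["gender"]))
print(len(train), len(test))
```

Continuous attributes such as age would need to be binned into ranges before they can serve as part of the stratum key.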

    EduPo: Analysis and Generation of Czech Poetry, v0.5

    A suite of tools for analysis and generation of Czech poetry. This is a snapshot of the public GitHub repository at https://github.com/ufal/edupo -- the beta version of the tool suite, released together with a scientific paper at the NLP4DH 2025 conference.

    ORATOR v3: corpus of spoken Czech monologues (transcriptions)

    The ORATOR v3 corpus contains monologues by native Czech speakers. Typical situations include a lecture, instruction, guided tour, welcome address, sermon, etc. The corpus is composed of 489 recordings from 2005–2019 and contains 1 212 729 orthographic words (i.e. a total of 1 542 133 tokens including punctuation); a total of 468 different speakers appear in the recordings. The transcription was done manually and is linked to the corresponding audio track. ORATOR v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The (anonymized) corpus is provided in a (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond exactly to the corpus available to registered users of the CNC via KonText (https://www.korpus.cz/kontext/query?corpname=orator_v3). Please note: this item includes only the transcriptions; the audio (and the transcripts in their original format) are available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-593

    Saxophone Trills Dataset

    This is the audio data of saxophone trills, used for difficulty estimation in the paper "Modeling the difficulty of saxophone music" by Šimon Libřický and Jan Hajič jr., ISMIR 2025. The dataset consists of recordings of saxophone trills played at the maximum attainable speed under some quality conditions (most importantly: stable intonation) split into individual sessions. Five different saxophone players (all approximately at the level of conservatory students) recorded trills for all intervals available on the tenor saxophone. The trill speeds are then used as inputs into a difficulty estimation model, with the assumption that intervals with lower attainable trill speeds are harder to play (e.g., larger leaps, or those that require re-using the same finger for a different key). There is a total of 817 recorded trills. Six "anchor" intervals were recorded in each of the 13 sessions, so that the players can be compared, making the total number of distinct recorded trills 745. If you use this dataset, please cite: Libřický, Šimon and Hajič jr., Jan (2025). "Modeling the difficulty of saxophone music." In: Proceedings of the 26th International Society for Music Information Retrieval Conference, Daejeon, Republic of Korea, Sep 2025
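The relationship between the 817 recorded trills and the 745 distinct trills can be checked with a short calculation. It assumes that the six anchor intervals are the only trills recorded more than once, and that each anchor is counted once among the distinct trills.

```python
TOTAL_RECORDED = 817   # all recorded trills, from the description
ANCHOR_INTERVALS = 6   # anchor intervals recorded in every session
SESSIONS = 13          # number of recording sessions

# Each anchor interval is recorded once per session, but counts once as distinct,
# so every anchor contributes (SESSIONS - 1) repeated recordings.
repeated_recordings = ANCHOR_INTERVALS * (SESSIONS - 1)
distinct_trills = TOTAL_RECORDED - repeated_recordings

print(repeated_recordings, distinct_trills)
```

Under these assumptions the result agrees with the 745 distinct trills stated in the description.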

    LatinISE corpus (version 5)

    The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre, title, and century or specific date. This Latin corpus was built by Barbara McGillivray. In version 5 of the corpus, the author names and datings of texts before 600 CE have been manually corrected and duplicate texts have been removed. Thanks to Valentina Lunardi for this data curation.
