Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University


1998 research outputs found


Lexical Dataset of Czech nene- Constructions (2026-02-28)

Author: Hartmanová Lucie
Publication venue: Charles University in Prague, ÚČJTK
Publication date: 01/01/2025

Database of attested words with the double negative prefix nene- in Czech, compiled primarily for a bachelor's thesis at ÚČJTK FFUK (Dvojitá negace nene- (typ nenedostal, nenepatrný) v češtině; Lucie Hartmanová, 2024). Abstract of the original bachelor's thesis: The thesis deals with double negation in Czech that is expressed by repeating the negative prefix ne- twice in immediate succession before the negated word. The research aimed to collect as many attestations of words with the double negative prefix as possible; the material was gathered from selected dictionaries of Czech, written language corpora, the Lexical Database of Humanistic and Baroque Czech, and the Kramerius digital library (version 5). On the basis of the excerpted material, we carried out a quantitative evaluation by part of speech, then focused on the semantics of the double negative prefix, and also examined competing means of expressing double negation in immediate succession. Special attention is paid to the development of this type of negation in Bible translation, in the places where its occurrences were detected in Old Czech or early modern translations. Finally, we summarized the results of the research and proposed possible starting points for further investigation.

Debiasing Algorithm through Model Adaptation

Author: Limisiewicz Tomasz
Mareček David
Musil Tomáš
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 31/01/2025

Debiasing Algorithm through Model Adaptation (DAMA) is based on guarding stereotypical gender signals and model editing. DAMA is performed on specific modules prone to convey gender bias, as shown by causal tracing. Our novel method effectively reduces gender bias in LLaMA models in three diagnostic tests: generation, coreference (WinoBias), and stereotypical sentence likelihood (StereoSet). The method does not change the model’s architecture, parameter count, or inference cost. We have also shown that the model’s performance in language modeling and a diverse set of downstream tasks is almost unaffected. This package contains both the source code and the English, English-to-Czech, and English-to-German datasets.

Uniform Meaning Representation 2.0

Author: Bonn Julia
Bonial Claire
Buchholz Matt
Cheng Hsiao-Jung
Chen Alvin
Chen Ching-wen
Cowell Andrew
Croft William
Denk Lukas
Elsayed Ahmed
Fučíková Eva
Gamba Federica
Gomez Carlos
Hajič Jan
Hajičová Eva
Havelka Jiří
Havenmeier Loden
Kilgore Ath
Kolářová Veronika
Kučová Lucie
Lai Kenneth
Li Bin
Li Jingyi
Lopatková Markéta
MacGregor Marie
Mikulová Marie
Mírovský Jiří
Nedoluzhko Anna
Myers Skatje
Novák Michal
O’Gorman Tim
Pajas Petr
Palmer Alexis
Palmer Martha
Panevová Jarmila
Post Benét
Pustejovsky James
Sgall Petr
Song Jialin
Song Li
Ševčíková Magda
Štěpánek Jan
Urešová Zdeňka
Sun Haibo
Sun Yao
Vallejos Yopán Rosa
Van Gysel Jens
Vigus Meagan
Wright‑Bettner Kristin
Wu Jiawei
Xue Nianwen
Xing Dan
Xu Keer
Xu Zhixing
Yue Liulu
Zeman Daniel
Zhao Jin
Zikánová Šárka
Žabokrtský Zdeněk
Publication venue: UMR Consortium
Publication date: 17/05/2025

The goal of the Uniform Meaning Representation (UMR) project is to design a meaning representation that can be used to annotate the semantic content of a text. UMR is primarily based on Abstract Meaning Representation (AMR), an annotation framework initially designed for English, but also draws from other meaning representations. UMR extends AMR to other languages, particularly morphologically complex, low-resource languages. UMR also adds features to AMR that are critical to semantic interpretation and enhances AMR by proposing a companion document-level representation that captures linguistic phenomena such as coreference as well as temporal and modal dependencies that potentially go beyond sentence boundaries. UMR is intended to be scalable, learnable, and cross-linguistically plausible. It is designed to support both lexical and logical inference

ACoRD - Aligned Continuo Realization Dataset

Author: Štefunko Adam
Chiruthapudi Suhit
Cancino-Chacón Carlos Eduardo
Hajič jr. Jan
Publication venue: Johannes Kepler University
Publication date: 22/04/2025

This is a dataset of 175 MIDI recordings of basso continuo performances and manual performance-to-score alignments of some of them, used in the paper "Basso Continuo Goes Digital: Collecting and Aligning a Symbolic Dataset of Continuo Performance", AIMC 2025, Brussels. The dataset consists of recordings made by 7 harpsichordists (4 professionals, 3 students) with a mean of 9 years of experience in basso continuo (range 1 to 35 years). Each harpsichordist played 5 pieces (the score MusicXML files are also included in the dataset) and played each piece 5 times. For 15 pieces, manual ground-truth alignments were made at two levels of detail: correct identification of the performance bass line and its alignment to the score bass line, and correct alignment of the whole basso continuo realization. If you use this dataset, please cite the relevant paper, which can be found at the project URL.
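
As a quick consistency check on the figures in this description, the stated total of 175 recordings can be reproduced from the number of players, pieces per player, and repetitions per piece. This is only an arithmetic sketch; the function and constant names are illustrative and are not part of the dataset.

```python
# Figures copied from the dataset description; the names are illustrative only.
HARPSICHORDISTS = 7
PIECES_PER_PLAYER = 5
TAKES_PER_PIECE = 5
STATED_TOTAL = 175


def expected_recordings(players: int, pieces: int, takes: int) -> int:
    """Number of recordings if every player records every assigned piece `takes` times."""
    return players * pieces * takes


total = expected_recordings(HARPSICHORDISTS, PIECES_PER_PLAYER, TAKES_PER_PIECE)
print(total == STATED_TOTAL)  # True: 7 * 5 * 5 = 175
```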

Swedish-speaking population of Finland - statistics

Author: Pšeničková Andrea
Publication venue: Charles University, Faculty of Arts
Publication date: 17/06/2025

Data about the Swedish-speaking minority in Finland from Finstat, used for a report in PowerB

Czech PDT-C 2.0 Model for UDPipe 2 (2025-10-25)

Author: Straka Milan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 25/10/2025

Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 2.0 treebank (http://hdl.handle.net/11234/1-5813). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc2.0_model . To use these models, you need UDPipe version 2.1, which you can download from https://ufal.mff.cuni.cz/udpipe/2
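
UDPipe writes its analyses in the CoNLL-U format (ten tab-separated columns per token, with a blank line between sentences). The following is a minimal sketch of reading such output in Python. The sample sentence is a hand-written illustration and not actual output of this model, and `Token` and `parse_conllu` are hypothetical helper names, not part of UDPipe.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    id: str
    form: str
    lemma: str
    upos: str
    head: str
    deprel: str


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        current.append(Token(cols[0], cols[1], cols[2], cols[3], cols[6], cols[7]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustrative input, NOT real model output.
SAMPLE = (
    "# text = Pes spí.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspí\tspát\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for token in parse_conllu(SAMPLE)[0]:
    print(token.form, token.lemma, token.upos, token.head, token.deprel)
```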

The YouTube Corpus of Singapore English Podcasts

Author: Coats Steven
Basile Carmelo Alessandro
Morin Cameron
Fuchs Robert
Publication venue: University of Oulu
Publication date: 01/01/2025

The YouTube Corpus of Singapore English Podcasts (YCSEP) contains transcripts from 620 hours of over 1,300 podcast episodes by Singapore-based content creators. The dataset, diarized into individual speaker turns, contains over 757,000 individual turns and 8.38 million word tokens. Created using a pipeline comprising yt-dlp, WhisperX, and pyannote.audio, it is intended to advance the study of the linguistic and discourse properties of Singapore English

Russian Media Corpus on the Harris–Trump Debate (RMC_HTD)

Author: Shorokhova Elena
Publication venue: Universidad Rey Juan Carlos
Publication date: 13/11/2025

Russian Media Corpus on the Harris–Trump Debate contains metadata from Russian-language news articles reporting on the presidential debate between Kamala Harris and Donald Trump, which took place on 10 September 2024 and was broadcast by ABC News. The corpus includes articles published by four Russian-language media outlets: Kommersant, Argumenty i Fakty, Meduza, and BBC News Russian. All articles were published on 11 September 2024. The corpus consists of 19 articles written in Russian. The primary purpose of the corpus is to support research in Critical Discourse Analysis and studies on media representation of the event in the Russian-speaking press

AI Brown v1

Author: Milička Jiří
Marklová Anna
Cvrček Václav
Publication venue: Charles University, Faculty of Arts, Department of Linguistics
Publication date: 27/09/2025

AI Brown is a corpus of English texts generated with large language models (LLMs). Its main purpose is to provide a resource for linguistic comparison of human-written and LLM-generated texts. The corpus is multi-genre, rich in topics, authors, and text types, and comparable with existing human-created corpora. It replicates a reference human-written corpus, BE21 by Paul Baker, which is a modern version of the original Brown Corpus. The new corpus was generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and is tagged according to the Universal Dependencies standard (i.e., the texts are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (864k tokens per model on average, 27.7M tokens altogether). The raw data and plain texts are freely available for download under the CC BY 4.0 license; the UD-annotated data are under the CC BY-NC-SA 4.0 license. The corpus is also accessible through the KonText search interface of the Czech National Corpus (https://www.korpus.cz/kontext/query?corpname=ai_brown_v1)

QuickAnnotator

Author: Jan Oliver Rüdiger
Publication venue: Leibniz-Institut für Deutsche Sprache
Publication date: 01/07/2025

IDS.QuickAnnotator is a comprehensive, modular system for the efficient, transparent, and reproducible annotation of text corpora. The aim of the project is to support and automate the entire workflow, from the selection and preparation of texts, through the annotation itself, to the evaluation and conversion of the results. The system consists of several specialized components, each covering a clearly defined area of responsibility:

IDS.QuickAnnotator.API: The central server component provides a REST-based web API that manages all annotations, annotation jobs, and user interactions. It ensures data consistency and enables the integration of external tools and clients.
IDS.QuickAnnotator.Client: The main interface for annotators offers intuitive user guidance and supports individual editing and management of annotations. Each user works with their own annotations, ensuring clear separation and traceability.
IDS.QuickAnnotator.Client.Selector: This tool supports assistant supervisors in the preselection of texts. With the help of statistical sampling, relevant text excerpts can be compiled for annotation in order to ensure a balanced and representative sample.
IDS.QuickAnnotator.CorpusPreSampler: The presampling module automates the statistical preselection and cleaning of texts. It prepares the data for the IDS.QuickAnnotator.Client.Selector and ensures that the texts to be annotated meet the desired criteria.
IDS.QuickAnnotator.Processor: This module converts various corpus formats (e.g., KorAP) into a uniform format that the API can process, so that differently structured source data can be integrated and processed further without difficulty.
IDS.QuickAnnotator.QafSampler: The QafSampler enables quota-based selection of texts in order to reflect specific criteria or distributions within the corpus and to control the composition of the sample.
IDS.QuickAnnotator.Tool4.AnnotatedBy: This analysis tool allows you to track which texts and text passages have been annotated by which individuals. It supports quality assurance, the evaluation of annotations, and the documentation of work processes.
IDS.QuickAnnotator.Tool4.ApplyAnnotatorFixes: This tool is used to make subsequent corrections and adjustments to existing annotations, for example to fix errors or improve data quality.
IDS.QuickAnnotator.Tool4.CalcDiff: The reporting tool generates evaluations of completed annotations, including interannotator agreement, DIFF views in HTML format, and analysis diagrams for visualizing the results. This allows differences and similarities between annotators to be systematically recorded.
IDS.QuickAnnotator.Tool4.ConvertToCorpus: Once the annotations are complete, this tool can be used to convert the corpora into various target formats (e.g., KorAP) in order to make them available for further analysis or external applications.
IDS.QuickAnnotator.Tool4.ConvertToJournal: This module converts the annotated corpora into an internal journal format that is used for specific workflows and documentation purposes within the project.
IDS.QuickAnnotator.Tool4.FindMatchSentences: This tool can be used to find and compare matching sentences in different annotated corpora, which facilitates consistency checking and quality assurance.
IDS.QuickAnnotator.Tool4.OnlyAnnotatedBy: This analysis tool identifies annotations that were created exclusively by a specific annotator, thereby supporting the targeted evaluation of individual contributions and the review of annotation depth.
IDS.QuickAnnotator.Tool4.RemoveAnnotator: Enables annotations to be removed retrospectively, for example if an annotator is unavailable or data needs to be cleaned up.
IDS.QuickAnnotator.Web: The web version of the client is currently in beta and offers a modern, browser-based interface for annotation. It enables location-independent working and easy integration into existing workflows.

All components are organized in separate subfolders and interact via clearly defined interfaces. The modular architecture allows for flexible expansion and adaptation to different requirements and corpus formats. This creates a scalable infrastructure that covers the entire process from text selection and conversion to annotation analysis and evaluation, ensuring high data quality and traceability.
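
The description mentions interannotator agreement among the CalcDiff reports but does not say which measure is computed. As an illustration only, the sketch below computes Cohen's kappa, one common agreement measure for two annotators who label the same items; the function name and the labels are hypothetical and are not taken from IDS.QuickAnnotator.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: share of items that received identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: derived from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators for four text passages.
annotator_1 = ["POS", "POS", "NEG", "NEG"]
annotator_2 = ["POS", "NEG", "NEG", "NEG"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

Kappa corrects the raw agreement rate (here 3 of 4 items) for the agreement expected by chance (here 0.5), which is why the result is lower than 0.75.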
