Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

1998 research outputs found

Bibliography of scholarly works on artificial consciousness

Authors: Marvan Tomas; Mihálik Jakub
Publication venue: Institute of Philosophy of the Czech Academy of Sciences
Publication date: 14/04/2025

A comprehensive bibliography of scholarly works on artificial consciousness. The bibliography focuses on English-language works published in recent decades in academic journals, books and other scholarly outlets. In preparing it, we searched existing scientific bibliographies, including OpenAlex (openalex.org) and PhilPapers (philpapers.org), and strove to exclude irrelevant search results. The resulting database currently has over six hundred entries. Our goal is to offer researchers working in the interdisciplinary field of machine consciousness, AI and related topics a useful, freely available bibliographical tool with advanced search functionality. The live version of the database, which we try to update regularly, is available at https://artcon.flu.cas.cz/bibliography/ and can be added to or downloaded in Zotero from https://www.zotero.org/groups/5900684/artcon

EdUKate translation software 2

Authors: Popel Martin; Novák Michal; Balhar Jiří; Košarko Ondřej; Mayer Jiří; Poláková Lucie; Kloudová Věra; Anisimova Mariia
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 01/01/2025

This software package includes three tools: a web frontend (charles-translator-web-frontend) for machine translation, featuring a phonetic transcription of Ukrainian suitable for Czech speakers; an API server (lindat-translation); and a tool for translating documents with markup, including html, docx, odt, pptx and odp (document-translations). These tools are used in the Charles Translator service (https://translator.cuni.cz). The software was developed within the EdUKate project, which aims to reduce the language barriers that non-Czech-speaking children in the Czech Republic face in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.

ORATOR v3: corpus of spoken Czech monologues (transcriptions & audio)

Authors: Kopřivová Marie; Laubeová Zuzana; Lukeš David; Poukarová Petra; Horký Václav; Jelínek Tomáš; Křivan Jan
Publication venue: Charles University, Faculty of Arts, Department of Linguistics
Publication date: 28/05/2025

The ORATOR v3 corpus contains monologues by native Czech speakers. Typical situations include a lecture, instruction, guided tour, welcome address, sermon, etc. The corpus is composed of 489 recordings from 2005–2019 and contains 1 212 729 orthographic words (i.e. a total of 1 542 133 tokens including punctuation); a total of 468 different speakers appear in the recordings. The transcription was done manually and is linked to the corresponding audio track. ORATOR v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The (anonymized) transcriptions are provided in the XML-based ELAN Annotation Format; the audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-593
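The audio format stated above (16-bit PCM WAV, mono, 16 kHz) can be checked before further processing. The following is a minimal sketch using Python's standard `wave` module; the function name `is_16k_mono_pcm16` and the generated demo file are illustrative assumptions, not part of the corpus distribution.

```python
import os
import tempfile
import wave


def is_16k_mono_pcm16(path):
    """Return True if the WAV file is mono, 16-bit, sampled at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1        # mono
            and wav.getsampwidth() == 2    # 2 bytes per sample = 16 bit
            and wav.getframerate() == 16000
        )


# Demo on a generated one-second silent file (not a corpus recording).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    demo_ok = is_16k_mono_pcm16(demo_path)
```

The `wave` module reads only uncompressed PCM data, so a file in another encoding would raise `wave.Error` rather than return False.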

Evaldio-residency | Automatic Assessment of Spoken Czech as a Foreign Language: Permanent Residency in the Czech Republic

Authors: Novák Michal; Polák Peter; Rysová Kateřina; Rysová Magdaléna; Bojar Ondřej
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 31/10/2025

Evaldio for Permanent Residency Permit is a service/tool that automatically assesses speech in the oral part of the Czech language exam at the A2 level. Passing the exam is required to obtain a permanent residency permit in Czechia. The service/tool takes a recording of the exam as input and outputs the predicted relative score and the probability of passing the exam at the A2 level. It also presents the user with the automatic transcription, diarization, and additional statistics.

NameTag 3 Multilingual Model 250203

Authors: Straková Jana; Straka Milan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 03/02/2025

This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/). NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model was trained jointly on 21 flat NE corpora of 17 languages: Arabic, Chinese, Croatian, Czech, Danish, Dutch, English, German, Maghrebi Arabic French, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Serbian, Slovak, Spanish, Swedish, and Ukrainian. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#multilingual

Stereotypes and Discourse Connectors in Czech

Author: Gvoždiak Vít
Publication venue: Institute of Philosophy of the Czech Academy of Sciences
Publication date: 26/08/2025

The purpose of the dataset is to test three variables: (i) the effect of argument order in Ale-constructions (But-constructions) “A, ale B” (“A, but B”): positive A, but negative (or stereotypical category) B vs. negative (or stereotypical category) A, but positive B; (ii) the effect of the discourse connector that introduces the conclusion following from the Ale-construction (“takže” (so/therefore) vs. “nicméně” (however/nevertheless)); (iii) the effect of propositional content (stereotypical vs. neutral) on inference of the conclusion. At the most general level, the dataset is divided into two groups according to the connective introducing the conclusion: 24 “takže” scenarios and 24 “nicméně” scenarios. Each of these two categories is further divided according to (non-)stereotypicality of content, i.e., 12 neutral (non-stereotypical) scenarios and 12 stereotypical scenarios (categories: age, gender, and nationality/ethnicity), and according to the order of arguments. In neutral scenarios: 6 scenarios with the structure “positive A [positive = argument for performing R], but negative B [negative = argument for performing non-R]”. See the README file for more information regarding the structure and use of the data

SYN v14: large corpus of written Czech

Authors: Křen Michal; Čapka Tomáš; Hnátková Milena; Jelínek Tomáš; Křivan Jan; Petkevič Vladimír; Skoumalová Hana; Vondřička Pavel
Publication venue: Charles University, Faculty of Arts, Department of Linguistics
Publication date: 01/01/2025

Corpus of contemporary written (printed) Czech of almost 5.5 GW (i.e. 6.6 billion tokens). It mostly covers the 1990–2024 period and features rich metadata, including detailed bibliographical information, text-type classification, etc. SYN v14 contains a wide variety of text types (fiction, non-fiction, newspapers), but newspapers noticeably prevail. The corpus is lemmatized and morphologically tagged with the unified CNC tagset, and also features an annotation of multiword expressions. The data provided here correspond exactly to those available via the KonText query interface to registered users of the CNC, with one important exception: they are shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries) whose order is randomized within the given document. SYN v14 is provided in a semi-XML / CoNLL-U-like vertical format used as input to the Manatee query engine. The vertical format is a sequence of lines. Each line is either a structure (starting with '<') or a token (with a fixed set of tab-separated columns). The columns of the SYN v14 token lines are as follows: word / sword [syntactic word] / lemma / sublemma / tag / pos / case / verbtag [verbal tag] / mwe_lemma [multiword lemma] / mwe_tag [multiword tag
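As a rough illustration of the vertical format described above, the sketch below separates structure lines from token lines. It assumes that structure lines begin with '<' and that token columns appear in the order listed in the description (the published list may contain further columns). The sample lines and their values are invented placeholders, not corpus data.

```python
# Column names in the order given in the description; the real files
# may define additional columns beyond these.
COLUMNS = [
    "word", "sword", "lemma", "sublemma", "tag",
    "pos", "case", "verbtag", "mwe_lemma", "mwe_tag",
]


def parse_vertical(lines):
    """Yield ("structure", text) or ("token", dict) for each non-empty line."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<"):
            # Structure line, e.g. an opening or closing XML-like tag.
            yield ("structure", line)
        else:
            # Token line: tab-separated values mapped onto column names.
            yield ("token", dict(zip(COLUMNS, line.split("\t"))))


# Invented sample input; attribute and tag values are placeholders.
sample = [
    '<doc id="example">',
    "<s>",
    "\t".join(["Slova", "Slova", "slovo", "slovo", "TAG",
               "POS", "CASE", "VERBTAG", "MWE_LEMMA", "MWE_TAG"]),
    "</s>",
    "</doc>",
]
records = list(parse_vertical(sample))
```

A token whose word form itself begins with '<' would be misread as a structure by this simple check, so real data may need extra handling for that case.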

Czech Etymological Lexicon 1.0

Authors: Rejzek Jiří; Papáček Aleš; Brezinová Viktória; Žabokrtský Zdeněk
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 28/01/2025

The Czech Etymological Lexicon, version 1.0, contains 10,502 Czech words, each annotated with a sequence of ISO 639-3 language codes representing its etymological origin. The dataset is provided in a simple tab-separated format: the first column contains the lemma, the second lists the language codes separated by commas, and the third indicates whether the word is a loanword (marked as "loan") or a native word (marked as "native"). Note that "native" refers to inherited words as opposed to loanwords. Example entry: architekt deu,lat,ell loan (the word architekt originated in Greek and came to Czech through Latin and German). The language sequences were extracted from the printed dictionary REJZEK, Jiří. Český etymologický slovník [Czech etymological dictionary]. LEDA, 2015. The extraction of language sequences from the dictionary entries was fully automated and may therefore contain imperfections. Please refer to the original dictionary for more precise information.
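A minimal sketch of reading one entry in the three-column tab-separated layout described above. The function name `parse_entry` and the dictionary keys are illustrative choices; only the example entry from the description is used as input.

```python
def parse_entry(line):
    """Split one lexicon line into lemma, language codes, and loan status."""
    lemma, codes, status = line.rstrip("\n").split("\t")
    return {
        "lemma": lemma,
        # ISO 639-3 codes, kept in the order given in the file.
        "languages": codes.split(","),
        "is_loan": status == "loan",
    }


entry = parse_entry("architekt\tdeu,lat,ell\tloan")
```

In this example the code sequence runs from the most recent source language (deu, German) back to the earliest (ell, Greek), matching the explanation that the word came from Greek through Latin and German.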

Slavic UD Treebanks with Periphrastic Verb Forms

Authors: Krippnerová Lenka; Zeman Daniel
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 16/06/2025

This dataset is based on Universal Dependencies v2.16 (http://hdl.handle.net/11234/1-5901). It contains treebanks for 15 Slavic languages, enriched with periphrastic verb form annotations. While UD encodes morphological features at the token level, our annotation extends this by marking periphrastic verb phrases that span multiple tokens (possibly discontinuously) to capture more complex verbal constructions. This annotation is added to the last column of the CoNLL-U format (MISC), where it is encoded in Phrase* attributes. In certain cases, the FEATS and DEPREL annotation was also modified to make it more uniform across the languages. For more details, see the paper: Lenka Krippnerová and Daniel Zeman. 2025. Periphrastic Verb Forms in Universal Dependencies. In: Proceedings of SyntaxFest / Depling 2025, Ljubljana, Slovenia.
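To show where the added annotation sits, the sketch below pulls attributes whose names start with "Phrase" out of the MISC column of a CoNLL-U token line. The attribute names `PhraseExampleA` and `PhraseExampleB` in the sample are hypothetical placeholders; the actual attribute inventory is documented in the dataset and the accompanying paper.

```python
def phrase_attributes(conllu_line):
    """Return the Phrase* key/value pairs from the MISC column of a token line."""
    columns = conllu_line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    misc = columns[9]  # MISC is the tenth and last column
    if misc == "_":
        return {}
    attrs = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        if sep and key.startswith("Phrase"):
            attrs[key] = value
    return attrs


# Invented token line; the Phrase* names and values are placeholders.
sample_line = "\t".join([
    "1", "bude", "být", "AUX", "_", "_", "2", "aux", "_",
    "PhraseExampleA=value1|SpaceAfter=No|PhraseExampleB=value2",
])
found = phrase_attributes(sample_line)
```

Other MISC attributes, such as `SpaceAfter=No`, are left out of the result, so the function can run over a full treebank without disturbing the existing UD annotation.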

Content-based annotation of page images from the (archaeological) historical archive

Authors: Lutsai Kateryna; Křivánková Dana
Publication venue: Charles University in Prague, UFAL
Publication date: 10/10/2025

This dataset employs a comprehensive 11-label classification scheme to categorize scanned images of document pages. The types are based on content and presentation format. The scheme distinguishes between visual content (drawings, maps, paintings, schematics, and photographs), textual content (handwritten, printed, or machine-typed), and hybrid formats that combine multiple elements. Special attention is given to layout characteristics, with separate labels for content presented in tabular or form-like structures versus paragraph or block formats. For instance, we differentiate between standard drawings (DRAW) and drawings with table-based legends (DRAW_L), as well as between regular photographs (PHOTO) and those embedded within tabular layouts (PHOTO_L). The textual categories distinguish three input methods (handwritten, printed, and machine-typed) and are further subdivided by structural organization: text can appear in either tabular/form-like arrangements (LINE_HW, LINE_P, LINE_T) or in traditional paragraph/block formats (TEXT_HW, TEXT_P, TEXT_T). An additional TEXT category accommodates mixed documents that combine multiple text types or include minor graphical elements, providing flexibility for complex real-world documents. The dataset is organized using a 5-fold cross-validation structure, with each fold maintaining an 80-10-10 split for training, development, and test sets respectively. This partitioning information is documented in an accompanying CSV file, enabling robust model evaluation and ensemble approaches in which models trained on different folds are averaged together to create a more robust combined model, provided they share the same base architecture.
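One possible reading of "averaged together" is parameter averaging across the fold models, which is why a shared base architecture would matter. The sketch below illustrates that idea on plain Python dictionaries of float lists; the function name, parameter names, and values are hypothetical, and a real setup would operate on the tensors of whichever framework the models use.

```python
def average_weights(fold_weights):
    """Element-wise mean of parameter dicts that share names and sizes."""
    if not fold_weights:
        raise ValueError("need at least one model")
    reference = fold_weights[0]
    for weights in fold_weights[1:]:
        # Same architecture means the same parameter names and sizes.
        if weights.keys() != reference.keys():
            raise ValueError("models do not share the same parameter names")
        for name in reference:
            if len(weights[name]) != len(reference[name]):
                raise ValueError(f"size mismatch for parameter {name!r}")
    n = len(fold_weights)
    return {
        name: [
            sum(w[name][i] for w in fold_weights) / n
            for i in range(len(reference[name]))
        ]
        for name in reference
    }


# Toy parameters standing in for two models trained on different folds.
fold_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
fold_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
combined = average_weights([fold_a, fold_b])
```

Averaging predicted class probabilities instead of weights is a common alternative that does not require identical architectures; the description does not say which variant is intended.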

0 full texts, 1,998 metadata records

Updated in last 30 days.
