Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
    1998 research outputs found

    Bibliography of scholarly works on artificial consciousness

    No full text
    A comprehensive bibliography of scholarly works on artificial consciousness. The bibliography focuses on English-language works published in recent decades in academic journals, books and other scholarly outlets. In preparing it, we searched existing scientific bibliographies, including OpenAlex (openalex.org) and PhilPapers (philpapers.org), and strove to exclude irrelevant search results. The resulting database currently has over six hundred entries. Our goal is to offer a useful, freely available bibliographical tool with advanced search functionality to researchers working in the interdisciplinary field of machine consciousness, AI and related topics. The live version of the database, which we try to update regularly, is available at https://artcon.flu.cas.cz/bibliography/ and can be added to or downloaded from Zotero at https://www.zotero.org/groups/5900684/artcon

    EdUKate translation software 2

    No full text
    This software package includes three tools: web frontend (charles-translator-web-frontend) for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, API server (lindat-translation) and a tool for translation of documents with markup including html, docx, odt, pptx and odp (document-translations). These tools are used in the Charles Translator service (https://translator.cuni.cz). This software was developed within the EdUKate project, which aims to help mitigate language barriers between non-Czech-speaking children in the Czech Republic and the education in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools

    ORATOR v3: corpus of spoken Czech monologues (transcriptions & audio)

    No full text
    The ORATOR v3 corpus contains monologues by native Czech speakers. Typical situations include a lecture, instruction, guided tour, welcome address, sermon etc. The corpus is composed of 489 recordings from 2005–2019 and contains 1 212 729 orthographic words (i.e. a total of 1 542 133 tokens including punctuation); a total of 468 different speakers appear in the recordings. The transcription was done manually and is linked to the corresponding audio track. ORATOR v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The (anonymized) transcriptions are provided in the XML ELAN Annotation format; the audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-593

    Evaldio-residency | Automatic Assessment of Spoken Czech as a Foreign Language: Permanent Residency in the Czech Republic

    No full text
    Evaldio for Permanent Residency Permit is a service/tool that provides an automatic speech assessment of the oral part of the Czech language exam at the A2 level. Passing the exam is mandatory for obtaining a permanent residency permit in Czechia. The service/tool takes a recording of the exam as input and outputs the predicted relative score and the probability of passing the exam at the A2 level. In addition, it presents the user with the automatic transcription, diarization, and additional statistics

    NameTag 3 Multilingual Model 250203

    No full text
    This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/). NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model was trained jointly on 21 flat NE corpora of 17 languages: Arabic, Chinese, Croatian, Czech, Danish, Dutch, English, German, Maghrebi Arabic French, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Serbian, Slovak, Spanish, Swedish, and Ukrainian. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#multilingual

    Stereotypes and Discourse Connectors in Czech

    No full text
    The purpose of the dataset is to test three variables: (i) the effect of argument order in Ale-constructions (But-constructions) “A, ale B” (“A, but B”): positive A, but negative (or stereotypical category) B vs. negative (or stereotypical category) A, but positive B; (ii) the effect of the discourse connector that introduces the conclusion following from the Ale-construction (“takže” (so/therefore) vs. “nicméně” (however/nevertheless)); (iii) the effect of propositional content (stereotypical vs. neutral) on inference of the conclusion. At the most general level, the dataset is divided into two groups according to the connective introducing the conclusion: 24 “takže” scenarios and 24 “nicméně” scenarios. Each of these two categories is further divided according to (non-)stereotypicality of content, i.e., 12 neutral (non-stereotypical) scenarios and 12 stereotypical scenarios (categories: age, gender, and nationality/ethnicity), and according to the order of arguments. In neutral scenarios: 6 scenarios with the structure “positive A [positive = argument for performing R], but negative B [negative = argument for performing non-R]”. See the README file for more information regarding the structure and use of the data

    SYN v14: large corpus of written Czech

    No full text
    Corpus of contemporary written (printed) Czech sized almost 5.5 GW (i.e. 6.6 billion tokens). It covers mostly the 1990–2024 period and features rich metadata including detailed bibliographical information, text-type classification etc. SYN v14 contains a wide variety of text types (fiction, non-fiction, newspapers), but newspapers noticeably prevail. The corpus is lemmatized and morphologically tagged with the unified CNC tagset, and also features an annotation of multiword expressions. The data provided here correspond exactly to those available via the KonText query interface to registered users of the CNC, with one important exception: they are shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries) whose order is randomized within the given document. SYN v14 is provided in a semi-XML / CoNLL-U-like vertical format used as an input to the Manatee query engine. The vertical format is a sequence of lines. Each line is either a structure (starting with '<') or a token (with a fixed set of tab-separated columns). The columns of the SYN v14 token lines are as follows: word / sword [syntactic word] / lemma / sublemma / tag / pos / case / verbtag [verbal tag] / mwe_lemma [multiword lemma] / mwe_tag [multiword tag]
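    The line-by-line structure of the vertical format described above can be read with a few lines of Python. This is a minimal sketch, not the CNC's own tooling: the function name parse_vertical is hypothetical, the column names are taken from the description (the list there may be incomplete), and the sample lines, including the tag values, are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple, Union

# Column names as listed in the record description (assumed complete here).
COLUMNS = ["word", "sword", "lemma", "sublemma", "tag", "pos",
           "case", "verbtag", "mwe_lemma", "mwe_tag"]

Parsed = Tuple[str, Union[str, Dict[str, str]]]


def parse_vertical(lines: Iterable[str]) -> Iterator[Parsed]:
    """Yield ("structure", raw_line) or ("token", column_dict) per line."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("<"):
            # Structure line (semi-XML tag); kept as raw text.
            yield ("structure", line)
        else:
            # Token line: tab-separated values mapped onto the column names.
            yield ("token", dict(zip(COLUMNS, line.split("\t"))))


# Made-up sample; the attribute values are placeholders, not real CNC tags.
sample = [
    '<doc id="example">\n',
    "<s>\n",
    "Praha\tPraha\tPraha\tPraha\tTAG\tN\t1\t-\t-\t-\n",
    "</s>\n",
    "</doc>\n",
]

parsed = list(parse_vertical(sample))
for kind, value in parsed:
    print(kind, value)
```

    Because the data are shuffled in blocks, such a reader is suitable for token-level statistics but not for reconstructing the original running text.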

    Czech Etymological Lexicon 1.0

    No full text
    The Czech Etymological Lexicon, version 1.0, contains 10,502 Czech words, each annotated with a sequence of ISO 639-3 language codes representing its etymological origin. The dataset is provided in a simple tab-separated format: the first column contains the lemma, the second lists the language codes separated by commas, and the third indicates whether the word is a loanword (marked as "loan") or a native word (marked as "native"). Note that "native" refers to inherited words as opposed to loanwords. Example entry (columns separated by tabs): architekt deu,lat,ell loan. The word architekt originated in Greek and came to Czech through Latin and German. The language sequences were extracted from the printed dictionary REJZEK, Jiří. Český etymologický slovník [Czech etymological dictionary]. LEDA, 2015. The extraction of language sequences from the entries in the original dictionary was fully automated and may therefore contain imperfections. Please refer to the original dictionary for precise information
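    The three-column layout described above is straightforward to parse. The following is a minimal sketch: the names Entry and parse_entry are hypothetical, and the reading of the code order (language closest to Czech first, ultimate source last) is inferred from the single example in the description rather than from the dataset's documentation.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    lemma: str
    languages: List[str]  # ISO 639-3 codes, in the order given in the file
    is_loan: bool


def parse_entry(line: str) -> Entry:
    """Parse one tab-separated line: lemma, comma-separated codes, loan/native."""
    lemma, codes, status = line.rstrip("\n").split("\t")
    return Entry(lemma, codes.split(","), status == "loan")


entry = parse_entry("architekt\tdeu,lat,ell\tloan")
print(entry.lemma, entry.languages, entry.is_loan)
```

    For the example entry, the last code (ell) is the Greek source and the first (deu) is German, the language through which the word most recently passed on its way into Czech.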

    Slavic UD Treebanks with Periphrastic Verb Forms

    No full text
    This dataset is based on Universal Dependencies v2.16 (http://hdl.handle.net/11234/1-5901). It contains treebanks for 15 Slavic languages, enriched with periphrastic verb form annotations. While UD encodes morphological features at the token level, our annotation extends this by marking periphrastic verb phrases that span multiple tokens — possibly discontinuously — to capture more complex verbal constructions. This kind of annotation is added to the last column of the CoNLL-U format (MISC). The added annotation is encoded in Phrase* attributes in MISC. In certain cases, the annotation of FEATS and DEPREL was modified, too, to provide more uniform annotation across the languages. For more details, see the paper: Lenka Krippnerová and Daniel Zeman. 2025. Periphrastic Verb Forms in Universal Dependencies. In: Proceedings of SyntaxFest / Depling 2025, Ljubljana, Slovenia
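    Since the added annotation lives in Phrase* attributes of the MISC column, it can be pulled out of a CoNLL-U token line with standard string handling. This is a minimal sketch: the function name phrase_attributes is hypothetical, and the attribute PhraseTense=Fut in the sample line is an illustrative placeholder, not a name or value confirmed by the record description.

```python
from typing import Dict


def phrase_attributes(conllu_line: str) -> Dict[str, str]:
    """Return the Phrase* key/value pairs from the MISC column of a token line."""
    fields = conllu_line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("a CoNLL-U token line has exactly 10 columns")
    misc = fields[9]  # MISC is the last (10th) column
    if misc == "_":
        return {}  # underscore marks an empty column
    attrs = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        if key.startswith("Phrase"):
            attrs[key] = value
    return attrs


# Illustrative token line; "PhraseTense" is a hypothetical attribute name.
line = "2\tpsát\tpsát\tVERB\t_\tAspect=Imp|VerbForm=Inf\t0\troot\t_\tPhraseTense=Fut|SpaceAfter=No"
attrs = phrase_attributes(line)
print(attrs)
```

    Other MISC attributes, such as SpaceAfter, are ignored by the prefix check, so the function can be applied to every token line of a treebank file.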

    Content-based annotation of page images from the (archaeological) historical archive

    No full text
    This dataset employs a comprehensive 11-label classification scheme to categorize scanned images of document pages. The types are based on their content and presentation format. The scheme distinguishes between visual content (drawings, maps, paintings, schematics, and photographs), textual content (handwritten, printed, or machine-typed), and hybrid formats that combine multiple elements. Special attention is given to layout characteristics, with separate labels designated for content presented in tabular or form-like structures versus paragraph or block formats. For instance, we differentiate between standard drawings (DRAW) and drawings with table-based legends (DRAW_L), as well as between regular photographs (PHOTO) and those embedded within tabular layouts (PHOTO_L). The textual categories are particularly nuanced, distinguishing between three input methods (handwritten, printed, and machine-typed) and further subdividing these based on structural organization. Text can appear in either tabular/form-like arrangements (LINE_HW, LINE_P, LINE_T) or in traditional paragraph/block formats (TEXT_HW, TEXT_P, TEXT_T). An additional TEXT category accommodates mixed documents that combine multiple text types or include minor graphical elements, providing flexibility for complex real-world documents. The dataset is organized using a 5-fold cross-validation structure, with each fold maintaining an 80-10-10 split for training, development, and test sets respectively. This partitioning information is documented in an accompanying CSV file, enabling robust model evaluation and the potential for ensemble approaches where models trained on different folds can be averaged together to create a more robust combined model, provided they share the same base architecture
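    One way to obtain five folds that each keep an 80-10-10 split is to cut the data into ten chunks and rotate which two serve as development and test sets. The sketch below illustrates that idea only; it is an assumption about how such a partition could be built, not the procedure used for this dataset, whose actual assignment is recorded in the accompanying CSV file. The function name five_fold_splits is hypothetical.

```python
import random
from typing import Dict, Iterable, List


def five_fold_splits(items: Iterable, seed: int = 0) -> List[Dict[str, list]]:
    """Build 5 folds; each fold uses 8 chunks for train, 1 for dev, 1 for test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for a given seed
    # Ten roughly equal chunks, each about 10% of the data.
    chunks = [shuffled[i::10] for i in range(10)]
    folds = []
    for k in range(5):
        dev_idx, test_idx = 2 * k, 2 * k + 1
        train = [x for i, chunk in enumerate(chunks)
                 if i not in (dev_idx, test_idx) for x in chunk]
        folds.append({"train": train,
                      "dev": chunks[dev_idx],
                      "test": chunks[test_idx]})
    return folds


folds = five_fold_splits(range(100))
for k, fold in enumerate(folds):
    print(k, len(fold["train"]), len(fold["dev"]), len(fold["test"]))
```

    With this rotation every item appears in a test set in at most one fold, so predictions from the five fold-specific models can be compared or combined without any model being evaluated on its own training data.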

    0 full texts
    1,998 metadata records
    Updated in last 30 days.