Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University


1998 research outputs found


CorPipe 24 Multilingual CorefUD 1.2 Model (corpipe24-corefud1.2-240906)

Author: Straka Milan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 06/09/2024

`corpipe24-corefud1.2-240906` is an `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 24 (https://github.com/ufal/crac2024-corpipe). It is released under the CC BY-NC-SA 4.0 license. The model is language agnostic (no corpus id on input), so in theory it can be used to predict coreference in any `mT5` language. The model also jointly predicts the empty nodes needed for zero coreference. The paper introducing this model also presents an alternative two-stage approach that first predicts empty nodes (via https://www.kaggle.com/models/ufal-mff/crac2024_zero_nodes_baseline/) and then performs coreference resolution (via http://hdl.handle.net/11234/1-5673); that approach is roughly twice as slow but slightly better.

Dataset used in the paper Uncovering Relationships using Bayesian Networks: A Case Study on Conspiracy Theories

Author: Vomlel Jiří
Kuběna Aleš
Šmíd Martin
Weinerová Josefína
Publication venue: Proceedings of Machine Learning Research
Publication date: 11/09/2024

Dataset from a Czech university entrance exam. It includes a test of active, open-minded thinking designed by Jonathan Baron, as well as a test of students’ attitudes toward various conspiracies. The data were analyzed in: J. Vomlel, A. Kuběna, M. Šmíd, J. Weinerová. Uncovering Relationships using Bayesian Networks: A Case Study on Conspiracy Theories. Proceedings of Machine Learning Research, Volume 246: 12th International Conference on Probabilistic Graphical Models (Nijmegen, NL, 11 September 2024), pp. 470-485. https://raw.githubusercontent.com/mlresearch/v246/main/assets/vomlel24a/vomlel24a.pd

KUKY1.0

Author: Cinková Silvie
Kuk Michal
Šamánková Jana
Kubíková Barbora
Pospíšil Přemysl
Mírovský Jiří
Hladká Barbora
Novotná Tereza
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 31/12/2024

KUKY is a curated selection of 224 Czech administrative and legal documents for readability research, stored in two JSON files. The documents come partly from public databases (Office of the Ombudsman, courts) and partly from private sources (letters, public local administration announcements). Some documents come in documented draft-revision pairs. They are manually enriched with a two-level annotation: "Relevance Stoplight" and "Speech Acts". This annotation mimics the way a plain-language expert scrutinizes a document before redesigning it for better readability. In the first step ("Relevance Stoplight"), the expert closely reads the entire document and detects problematic passages, classifying each passage as incomprehensible, superfluous, or relevant. In the second step ("Speech Acts"), the expert works with the relevant text according to a genre-specific template. At the metadata level, the documents are graded with respect to their readability, as perceived by experienced teachers of plain legal writing.

PDT-Vallex: Czech Valency lexicon linked to treebanks 4.5 (PDT-Vallex 4.5)

Author: Urešová Zdeňka
Bémová Alevtina
Fučíková Eva
Hajič Jan
Kolářová Veronika
Mikulová Marie
Pajas Petr
Panevová Jarmila
Štěpánek Jan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 30/12/2024

The valency lexicon PDT-Vallex 4.5 is a part of the PDT-C 2.0 release https://hdl.handle.net/11234/1-5813. It is a slightly modified version of PDT-Vallex 4.0 from 2020 (a part of the PDT-C 1.0 corpus), adapted for full compatibility with the PDT-C 2.0 annotation, including completely reworked reference IDs for the word and frame entries. PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT; the spoken language corpus, PDTSC; and the corpus of user-generated texts in the project Faust). It contains over 14500 valency frames for almost 8500 verbs which occurred in the PDT, PCEDT, PDTSC and Faust corpora. In addition, there are nouns, adjectives and adverbs, linked from the PDT part only, increasing the total to over 20000 valency frames for almost 13000 words. All the corpora were published in 2024 as the PDT-C 2.0 corpus with the PDT-Vallex 4.5 dictionary included; this item is a copy of the dictionary published separately for those not interested in the corpora themselves. It is available in an electronically processable format (XML), and also in a more human-readable form including corpus examples (see the project and web browser links below, and the links to its main publications elsewhere in this metadata). The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame, with additional (generalized) information about its usage and surface morphosyntactic form alternatives.

Quality of Working Life 2024

Author: Vinopal Jiří
Štěpánek Martin
Publication venue: Occupational Safety Research Institute, v.v.i.
Publication date: 2024

A regular survey conducted as part of the long-term monitoring of the quality of working life in the Czech Republic, carried out using the research tool SQWLi (https://www.pracovnipohoda.cz/o-projektu-kpz/o-projektu/indikator-sqwli/). Monitoring has been conducted since 2011, usually once a year, allowing the data to be linked into time series

ParCzech 4.0

Author: Kopp Matyáš
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 31/01/2024

The ParCzech 4.0 corpus consists of stenographic protocols that record the Chamber of Deputies' meetings in the 7th term (2013-2017), the 8th term (2017-2021) and the current 9th term (2021-Jul 2023). The protocols are provided in their original HTML format and in the Parla-CLARIN TEI format. The corpus is automatically enriched with morphological, syntactic, and named-entity annotations using UDPipe 2 and NameTag 2. The audio files are aligned with the texts in the annotated TEI files; the audio itself is available in the AudioPSP 24.01 corpus (http://hdl.handle.net/11234/1-5404). This corpus covers the same period as the ParlaMint-CZ corpus v4.0 (http://hdl.handle.net/11356/1860). The ParCzech corpus follows and extends the ParlaMint schema. Both the annotated and non-annotated versions include hypertext references to votes and parliamentary prints. Beyond what ParlaMint recommends, the annotated version contains source audio alignment, PDT xtag, and a more detailed CNEC2.0 named entity categorization.

EdUKate translation software 1

Author: Popel Martin
Novák Michal
Balhar Jiří
Košarko Ondřej
Mayer Jiří
Poláková Lucie
Kloudová Věra
Anisimova Mariia
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 28/06/2024

This software package includes three tools: a web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, an API server, and a tool for translating documents with markup (html, docx, odt, pptx, odp,...). These tools are used in the Charles Translator service (https://translator.cuni.cz). The software was developed within the EdUKate project, which aims to help reduce the language barrier that non-Czech-speaking children in the Czech Republic face in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.

DeriNet 2.2

Author: Svoboda Emil
Vidra Jonáš
Ševčíková Magda
Žabokrtský Zdeněk
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 25/06/2024

DeriNet is a lexical network which models derivational and compositional relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent word-formational relations between a derived word and its base word or words. The present version, DeriNet 2.2, contains 1,040,127 lexemes (sampled from the MorfFlex CZ 2.0 dictionary), connected by:
- 782,904 derivational,
- 50,511 orthographic variant,
- 6,336 compounding,
- 288 univerbation, and
- 135 conversion relations.
Compared to the previous version, version 2.2 contains an overhaul of the compounding annotation scheme, 4384 extra compounds, 83 more affixoid lexemes serving as bases for compounding, more parts of speech serving as bases for compounding (adverbs, pronouns, numerals), and several minor corrections of derivational relations.
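The relation counts quoted above can be tallied to see how strongly derivation dominates the network. The following is a minimal sketch that uses only the figures from this record; the names `RELATION_COUNTS` and `relation_shares` are illustrative and not part of any DeriNet tooling.

```python
# Relation counts as quoted in the DeriNet 2.2 description.
RELATION_COUNTS = {
    "derivational": 782_904,
    "orthographic variant": 50_511,
    "compounding": 6_336,
    "univerbation": 288,
    "conversion": 135,
}


def relation_shares(counts):
    """Return each relation type's share of all relations (values sum to 1)."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    total = sum(RELATION_COUNTS.values())
    print(f"total relations: {total:,}")  # total relations: 840,174
    for name, share in relation_shares(RELATION_COUNTS).items():
        print(f"{name:<22}{share:7.2%}")
```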

EvaldioData 1.0

Author: Rysová Kateřina
Novák Michal
Rysová Magdaléna
Polák Peter
Bojar Ondřej
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 31/10/2024

EvaldioData 1.0 is a language corpus of spoken performances by non-native speakers of Czech. It includes recordings capturing the oral part of the Czech Language Certificate Exam. The recordings consist of dialogues between the examiner (a native speaker) and the candidate (a non-native speaker). In addition to the recordings, the corpus contains their transcriptions, which carry rich linguistic annotation. Some recordings are accompanied by multiple transcriptions from different annotators, which makes it possible to compare transcripts of the same recording and to evaluate how consistently spoken language is converted into written text. The current version focuses on the A2 level (according to the CEFR), which is required for the granting of permanent residency in the Czech Republic.

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1)

Author: Rüdiger Jan Oliver
Publication venue: Rüdiger, Jan Oliver
Publication date: 21/11/2024

The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens as of DeReKo Release 2024-I). The corpus is separated by year (here: 2022) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes. The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. In the second step, the language of the individual documents (URLs) was estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat), using its CORE14 profile. Only documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering by manual selectors and removing 1:1 duplicates (within one year). The filtering and subsequent processing were carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and the author's own (supplementary) scripts, and TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation.

The corpus was processed on the HELIX HPC cluster. The author would like to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

Data content:
- Tokens and sentence boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - Unique identifier of the document
  - YEAR - Year of capture (please use this information for data slices)
  - Url - Full URL
  - Tld - Top-Level Domain
  - Domain - Domain without TLD (but with sub-domains if applicable)
  - DomainFull - Complete domain (incl. TLD)
  - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
  - Hash - (System Information): SHA1 hash of the CommonCrawl
  - Pfad - (System Information): Path of the cluster (raw data) - is supplied by the system

Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above), which ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:
- CATMA v6
- CoNLL
- CSV
- CSV (only meta-data)
- DTA TCF-XML
- DWDS TEI-XML
- HTML
- IDS I5-XML
- IDS KorAP XML
- IMS Open Corpus Workbench
- JSON
- OPUS Corpus Collection XCES
- Plaintext
- SaltXML
- SlashA XML
- SketchEngine VERT
- SPEEDy/CODEX (JSON)
- TLV-XML
- TreeTagger
- TXM
- WebLicht
- XML

Please note that an export considerably increases the storage space requirement. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

Legal information: The data was downloaded on 01.11.2024. The use, processing and distribution is subject to §60d UrhG (German copyright law), which authorises the use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked on a random basis, to the best of our knowledge and belief. Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author ([email protected]) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify and remove the content and to re-upload the modified data as a new version within two weeks. If you have any further questions, please contact CLARIN.
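The TLD filter and the 1:1 duplicate filter described in this record can be illustrated with a short sketch. It is not the author's pipeline (that was built with CorpusExplorer and supplementary scripts); the function names, the `(url, text)` record shape, and the choice to detect duplicates by hashing the document text are assumptions made for illustration. The language-identification step (NTextCat) is omitted.

```python
import hashlib
from urllib.parse import urlsplit

# TLDs listed in the corpus description (without the leading dot).
GERMAN_TLDS = frozenset({
    "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
    "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
})


def split_domain(url):
    """Return (tld, domain, domain_full) for a URL.

    Mirrors the Tld, Domain and DomainFull metadata fields: the domain keeps
    sub-domains but drops the TLD. Returns empty strings if no usable host.
    """
    host = (urlsplit(url).hostname or "").rstrip(".")
    if "." not in host:
        return "", "", ""
    domain, tld = host.rsplit(".", 1)
    return tld, domain, host


def keep_url(url):
    """True if the URL's host ends in one of the listed German-language TLDs."""
    return split_domain(url)[0] in GERMAN_TLDS


def drop_exact_duplicates(documents):
    """Yield (url, text) pairs, skipping texts that were already seen.

    Assumption: a 1:1 duplicate means byte-identical text; the SHA-1 digest
    is used only as a compact key for the set of seen texts.
    """
    seen = set()
    for url, text in documents:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield url, text
```

To apply the duplicate filter within one year, as the record describes, the function would be called once per yearly slice, for example on the documents that share the same YEAR value.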

