Charles University
LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Lexical Dataset of Czech nene- Constructions
A database of attested words with the double negative prefix nene- in Czech, compiled primarily for a bachelor's thesis at ÚČJTK FFUK (Dvojitá negace nene- (typ nenedostal, nenepatrný) v češtině; Lucie Hartmanová, 2024).
Abstract of the original bachelor's thesis:
The bachelor's thesis deals with double negation in Czech expressed by repeating the negative prefix ne- twice in immediate succession before the negated word.
In our research we sought to collect as many attestations of words with the double negative prefix as possible. The material was gathered from selected dictionaries of Czech, written language corpora, the Lexical Database of Humanistic and Baroque Czech, and the Kramerius digital library (version 5).
On the basis of this excerption we carried out a quantitative evaluation by part of speech, examined the semantics of the double negative prefix, and also considered competing means of expressing double negation in immediate succession.
Special attention is paid to the development of this type of negation in Bible translation, in those places where its occurrences were detected in Old Czech or early modern translations.
Finally, we summarised the results of the research and offered possible starting points for further study
SYN2025: representative corpus of written Czech
Representative corpus of contemporary written (printed) Czech, 100 million words in size. It was created as a representation of printed language from 2020–2024 and contains a wide range of text types (fiction, professional literature, newspapers etc.). The corpus is lemmatized and morphologically and syntactically annotated by a combination of methods. It is provided in a (semi-XML) vertical format used as input to the Manatee query engine. The vertical format is a sequence of lines, each of which is either a structure (a line that starts with '<') or a token (a line with a fixed set of tab-separated columns). The columns of the SYN2025 token lines are described in more detail at https://wiki.korpus.cz/doku.php/en:seznamy:syn2025_attributes
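The line-based layout described above can be read with very little code. The following is a minimal sketch, assuming that structure lines begin with `<` (as XML-like tags do) and that every other non-empty line is a token with tab-separated columns. The function name `parse_vertical`, the sample document and the placeholder tags `TAG1`/`TAG2` are hypothetical and not part of the SYN2025 distribution. A token whose word form is itself `<` would be misclassified by this simplification.

```python
def parse_vertical(lines):
    """Yield ("structure", raw_line) or ("token", [columns]) for each line."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("<"):
            # Assumption: anything starting with "<" is a structure tag.
            yield ("structure", line)
        else:
            # Token line: a fixed set of tab-separated columns.
            yield ("token", line.split("\t"))


# Hypothetical three-column sample (word form, lemma, tag).
sample = [
    '<doc id="example">\n',
    "<s>\n",
    "Ahoj\tahoj\tTAG1\n",
    "světe\tsvět\tTAG2\n",
    "</s>\n",
    "</doc>\n",
]
parsed = list(parse_vertical(sample))
tokens = [cols for kind, cols in parsed if kind == "token"]
```

The real column inventory is the one documented at the wiki page linked above; the three columns here only illustrate the splitting.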
The data provided here exactly correspond to those available via the KonText query interface to registered users of the CNC with one important exception: they are shuffled, i.e. divided into blocks sized max. 100 words (respecting the sentence boundaries) with ordering randomized within the given document
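To make the shuffling concrete, here is a minimal sketch of one way such blocks could be formed. It is not the CNC's actual procedure. It assumes a greedy packing of whole sentences into blocks of at most 100 words, lets a single sentence longer than the limit form its own block, and then randomizes block order within one document. The function name `shuffle_document` and its parameters are hypothetical.

```python
import random


def shuffle_document(sentences, max_words=100, seed=None):
    """Pack whole sentences into blocks of at most max_words, then shuffle the blocks."""
    blocks, current, size = [], [], 0
    for sent in sentences:
        # Close the current block if adding this sentence would exceed the limit.
        if current and size + len(sent) > max_words:
            blocks.append(current)
            current, size = [], 0
        current.append(sent)
        size += len(sent)
    if current:
        blocks.append(current)
    # Randomize the order of blocks within this document only.
    random.Random(seed).shuffle(blocks)
    return blocks


# Hypothetical document: five sentences of 40, 50, 30, 70 and 10 words.
doc = [["w"] * 40, ["w"] * 50, ["w"] * 30, ["w"] * 70, ["w"] * 10]
blocks = shuffle_document(doc, seed=0)
block_sizes = sorted(sum(len(s) for s in block) for block in blocks)
```

A practical consequence of this kind of shuffling is that analyses spanning more than a block (for example, discourse structure across paragraphs) cannot rely on the original text order.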
HeCz: Large Scale Self-Paced Reading Corpus Newspaper Headlines in Czech
The HeCz corpus comprises self-paced reading data for 1919 newspaper headlines (23,634 words) in Czech, with each headline being accompanied by a yes–no comprehension question, resulting in a rich dataset of reading times for each individual word and comprehension accuracy. The corpus is novel in terms of the sheer scale of data collection, with 1872 native Czech speakers, each reading approximately 120 headlines, with 1162 of those participants also completing the experiment again in a re-testing session using the same stimuli approximately 1 month later. There is participant level meta-data also available relating to basic demographic information, reading habits and a profile of their mood state prior to completing the experiment. Beyond the behavioral and demographic data, we also include a range of linguistic annotations for several variables, e.g., frequency, surprisal, morphological tagging
MockConf: Student Interpretation Dataset
This repository contains a dataset centered on Czech, comprising simultaneous interpreting data with human-annotated transcriptions at both the span and word levels. The dataset contains interpretings collected from Mock Conferences run as part of the student interpreters' curriculum. These data were then manually aligned and annotated at the word and span levels using InterAlign, a dedicated tool designed to facilitate such annotation. The dataset is described and used in the paper MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines
Word embeddings based on a large corpus of written Czech
This package comprises six models of Czech word embeddings: two sets with dimensions 100, 200 and 300, one for lemmas and one for word forms. They were trained by fastText (P. Bojanowski, E. Grave, A. Joulin, T. Mikolov (2016): Enriching Word Vectors with Subword Information, https://fasttext.cc/) on the SYN v13 corpus of contemporary written Czech (Křen et al. 2024, https://wiki.korpus.cz/doku.php/en:cnk:syn:verze13) based on its lemmatisation and tagging. The skipgram algorithm was used for the training, with -minn 2 and -maxn 5 for subwords
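For readers unfamiliar with the subword settings, the sketch below enumerates the character n-grams that lengths 2 to 5 produce for a single word. It follows the scheme described in the cited fastText paper, where a word is wrapped in the boundary symbols `<` and `>` before n-grams are extracted. It is an illustration only, not the fastText implementation, which additionally hashes the n-grams into buckets and keeps a separate vector for the whole word. The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, minn=2, maxn=5):
    """List all character n-grams of length minn..maxn of the boundary-wrapped word."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# Example with the Czech word "pes" (dog).
grams = char_ngrams("pes")
```

Short subwords such as these are what allow the models to produce vectors for inflected or unseen Czech word forms.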
Uniform Meaning Representation 2.1 (Czech and Latin)
Czech and Latin UMR data, both manually annotated and programmatically converted from manually annotated tectogrammatical data
Little Big Translation Literature – Czech and German Translations of Yiddish Literature as a Reflection of Changing Politics and Society
In order to make the process of preparing analyses for a planned monograph about Czech and German translations of Yiddish texts transparent, five source texts were transcribed from Yiddish and Czech and German target texts were annotated. The intention of the annotations is to clarify the individual steps and aspects of the text translation analysis. The focus lies primarily on semantic and culturally conditioned shifts in the partial analyses. Shifts at the grammatical level and shifts conditioned by the different language structure are taken into consideration only if they are relevant to the analysis
DigiDiaDem Speech-Cognitive Dataset (DSCD-CZ-2)
An updated and expanded version of the dataset was created to investigate the speech and cognitive performance of people with varying degrees of cognitive impairment, primarily dementia. The dataset contains a comprehensive set of data including the results of standardized neuropsychological tests (RBANS, ALBA, POBAV, MASTCZ), speech tasks focused on comprehension, memory, naming, and repetition, and demographic data (age, gender, education).
Participants were divided into four groups based on clinical assessment: healthy individuals, healthy individuals with possible mild cognitive impairment, patients with mild cognitive impairment, and patients with dementia. All recordings and examinations were managed as part of routine clinical practice in the neurological outpatient clinic – Memory Clinic at the Department of Neurology at the Faculty Hospital Královské Vinohrady. The dataset containing 371 examinations was divided into a training and test part using stratification by clinical group, age, gender, and level of education to ensure an even distribution of these key characteristics in both parts of the data.
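The stratified split mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used by the dataset authors: the 20% test fraction, the field names and the function `stratified_split` are assumptions, and a continuous variable such as age would first have to be binned into categories before it can serve as a stratification key.

```python
import random
from collections import defaultdict


def stratified_split(records, keys, test_fraction=0.2, seed=0):
    """Split records so each combination of key values is represented proportionally."""
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)
    rng = random.Random(seed)
    train, test = [], []
    for group in strata.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical records: 10 examinations per (group, gender) combination.
records = [
    {"id": i, "group": g, "gender": s}
    for i, (g, s) in enumerate(
        (g, s) for g in ("healthy", "dementia") for s in ("F", "M") for _ in range(10)
    )
]
train, test = stratified_split(records, keys=("group", "gender"))
```

With many stratification variables some strata become very small, so a real split has to decide how to handle combinations with only one or two examinations.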
Additionally, Manually Engineered Features and Scores were added to the previous version of the dataset.
The aim of the dataset is to support the development of methods for automated detection of cognitive disorders based on speech analysis and cognitive performance. The data are suitable for research in the areas of clinical neuropsychology, computational linguistics, and machine learning. The dataset is intended for non-commercial research purposes
Testimonies of Roma and Sinti
The key idea of our project is to convey to the widest possible readership detailed abstracts of the testimonies of Roma and Sinti and thus their personal and irreplaceable experience of the Second World War. We hope that the Testimonies of Roma and Sinti project will contribute to greater awareness of their genocide and will be an irreplaceable source of information for researchers, relatives of the victims, or anyone else interested in this important topic.
First of all, we defined the project geographically: we focused on the testimonies of Roma and Sinti from the Bohemian lands (today's Czech Republic) and Slovakia. The second definition is that we are only processing printed testimonies into the database. A valuable, and extremely demanding, part of the database is the detailed abstracts of these testimonies prepared by Romani studies experts in cooperation with historians and linguistic stylists. These abstracts are important not only for Czech and Slovak readers, as many publications with testimonies are not easily accessible, but especially for users from abroad - whether researchers, members of Romani communities or any other interested parties - as the vast majority of the hundreds of published testimonies exist only in Czech, Slovak or Romani, and are thus inaccessible to most people from abroad.
Within the database, the testimonies are analyzed according to several criteria, which allow detailed searches and their classification, for example, according to the type of war experience (internment, participation in armed struggle, hiding, etc.). In the analysis, we then focused mainly on geographical data. Therefore, projections of collected data on maps are an integral part of the database, which allow us to show the war trajectory of individuals and groups, to show, for example, the locations of mass murders or guerrilla fighting, or to search for testimonies related to a place
Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2015 – VERSION 1)
The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, while aiming for comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo size: 57 billion tokens as of DeReKo Release 2024-I). The corpus is separated by year (here: 2015) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).
The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.
The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).
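The first and third filtering steps can be illustrated with a short sketch. It is not the CorpusExplorer pipeline: it assumes that a page's TLD is the last label of the URL's host name, interprets "1:1 duplicates" as documents with byte-identical text, and uses a SHA-1 digest to detect them. The language identification step (NTextCat) and the manual selectors are omitted. The names `tld_of` and `filter_documents` and the sample URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

# TLD list as given in the corpus description.
GERMAN_TLDS = {
    "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
    "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
}


def tld_of(url):
    """Return the last label of the URL's host name, lower-cased."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def filter_documents(docs):
    """Keep (url, text) pairs with a listed TLD, dropping exact duplicate texts."""
    seen = set()
    kept = []
    for url, text in docs:
        if tld_of(url) not in GERMAN_TLDS:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical text already kept (within this batch)
        seen.add(digest)
        kept.append((url, text))
    return kept


# Hypothetical sample batch standing in for one year of crawl data.
docs = [
    ("https://example.de/a", "Hallo Welt"),
    ("https://example.com/b", "Hello"),
    ("https://shop.example.at/c", "Hallo Welt"),
    ("https://example.wien/", "Servus"),
]
kept = filter_documents(docs)
```

Because the source describes deduplication as happening within one year, `filter_documents` would be called once per yearly batch rather than across the whole corpus.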
The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.
Data content:
- Tokens and sentence boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - Unique identifier of the document
  - YEAR - Year of capture (please use this information for data slices)
  - Url - Full URL
  - Tld - Top-Level Domain
  - Domain - Domain without TLD (but with sub-domains if applicable)
  - DomainFull - Complete domain (incl. TLD)
  - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document).
  - Hash - (System Information): SHA1 hash of the CommonCrawl
  - Pfad - (System Information): Path of the cluster (raw data) - is supplied by the system.
Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:
- CATMA v6
- CoNLL
- CSV
- CSV (only meta-data)
- DTA TCF-XML
- DWDS TEI-XML
- HTML
- IDS I5-XML
- IDS KorAP XML
- IMS Open Corpus Workbench
- JSON
- OPUS Corpus Collection XCES
- Plaintext
- SaltXML
- SlashA XML
- SketchEngine VERT
- SPEEDy/CODEX (JSON)
- TLV-XML
- TreeTagger
- TXM
- WebLicht
- XML
Please note that an export considerably increases the storage space required. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.
Legal information
The data was downloaded on 01.11.2024. The use, processing and distribution is subject to §60d UrhG (German copyright law), which authorises the use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author ([email protected]) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify and remove the content and to re-upload the (modified) data within two weeks (new version). If you have any further questions, please contact CLARIN.