Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Universal Dependencies 2.15

Author: Zeman Daniel
Nivre Joakim
Abrams Mitchell
Ackermann Elia
Aepli Noëmi
Aghaei Hamid
Agić Željko
Ahmadi Amir
Ahrenberg Lars
Ajede Chika Kennedy
Akhundjanova Arofat
Akkurt Furkan
Aleksandravičiūtė Gabrielė
Alfina Ika
Algom Avner
Alnajjar Khalid
Alzetta Chiara
Andersen Erik
Andrews Matthew
Antonsen Lene
Aoyama Tatsuya
Aplonova Katya
Aquino Angelina
Aragon Carolina
Aranes Glyd
Aranzabe Maria Jesus
Arıcan Bilge Nas
Arnardóttir Þórunn
Arutie Gashaw
Arwidarasti Jessica Naraiswari
Asahara Masayuki
Ásgeirsdóttir Katla
Aslan Deniz Baran
Asmazoğlu Cengiz
Ateyah Luma
Atmaca Furkan
Attia Mohammed
Atutxa Aitziber
Augustinus Liesbeth
Avelãs Mariana
Badmaeva Elena
Balasubramani Keerthana
Ballesteros Miguel
Banerjee Esha
Bank Sebastian
Barbosa Bryan Khelven da Silva
Barbu Mititelu Verginica
Barkarson Starkaður
Basile Rodolfo
Basmov Victoria
Batchelor Colin
Bauer John
Bedir Seyyit Talha
Behzad Shabnam
Belieni Juan
Bengoetxea Kepa
Benli İbrahim
Ben Moshe Yifat
Berg Ansu
Berk Gözde
Bhat Riyaz Ahmad
Biagetti Erica
Bick Eckhard
Bielinskienė Agnė
Bilgin Taşdemir Esma Fatıma
Bjarnadóttir Kristín
Blaschke Verena
Blokland Rogier
Böbel Nina
Bobicev Victoria
Boizou Loïc
Bonilla Johnatan
Borges Völker Emanuel
Börstell Carl
Bosco Cristina
Bouma Gosse
Bowman Sam
Boyd Adriane
Braggaar Anouck
Branco António
Brokaitė Kristina
Burchardt Aljoscha
Cabeza Carmen
Cáceres Arandia Natalia
Campos Marisa
Candito Marie
Caron Bernard
Caron Gauthier
Carvalheiro Catarina
Carvalho Rita
Cassidy Lauren
Castro Maria Clara
Castro Sérgio
Cavalcanti Tatiana
Cebiroğlu Eryiğit Gülşen
Cecchini Flavio Massimiliano
Celano Giuseppe G. A.
Çepani Anila
Čéplö Slavomír
Cesur Neslihan
Cetin Savas
Çetinoğlu Özlem
Chalub Fabricio
Chamila Liyanage
Chamoreau Claudine
Chauhan Shweta
Chen Yifei
Chi Ethan
Chika Taishi
Cho Yongseok
Choi Jinho
Chontaeva Bermet
Chun Jayeol
Chung Juyeon
Cignarella Alessandra T.
Cinková Silvie
Collomb Aurélie
Çöltekin Çağrı
Connor Miriam
Corbetta Claudia
Corbetta Daniela
Costa Francisco
Courtin Marine
Crabbé Benoît
Cristescu Mihaela
Cvetkoski Vladimir
Dahan Netanel
Dale Ingerid Løyning
Daniel Philemon
Davidson Elizabeth
de Alencar Leonel Figueiredo
Dehouck Mathieu
de Laurentiis Martina
de Marneffe Marie-Catherine
de Paiva Valeria
Derin Mehmet Oguz
de Souza Elvis
Diaz de Ilarraza Arantza
Díaz Hernández Roberto Antonio
Dickerson Carly
Di Felippo Ariani
Dinakaramani Arawinda
Di Nuovo Elisa
Dione Bamba
Dirix Peter
Do Hoa
Dobrovoljc Kaja
Döhmer Caroline
Doyle Adrian
Dozat Timothy
Droganova Kira
Duran Magali Sanches
Dwivedi Puneet
Ebert Christian
Eckhoff Hanne
Eguchi Masaki
Eiche Sandra
Eiselen Roald
Eli Marhaba
Elkahky Ali
Ephrem Binyam
Erina Olga
Erjavec Tomaž
Eslami Soudabeh
Essaidi Farah
Etienne Aline
Evelyn Wograine
Facundes Sidney
Farkas Richárd
Faryad Ján
Favero Federica
Ferdaousi Jannatul
Fernanda Marília
Fernandez Alcalde Hector
Fethi Amal
Foster Jennifer
Fransen Theodorus
Freitas Cláudia
Fujita Kazunori
Gajdošová Katarína
Galbraith Daniel
Galy Edith
Gamba Federica
Garcia Marcos
García-Miguel José María
Gärdenfors Moa
Gaustad Tanja
Genç Efe Eren
Gerardi Fabrício Ferraz
Gerdes Kim
Gessler Luke
Ginter Filip
Godoy Gustavo
Goenaga Iakes
Gojenola Koldo
Gökırmak Memduh
Goldberg Yoav
Goldin Gili
Gómez Guinovart Xavier
González Saavedra Berta
Griciūtė Bernadeta
Grioni Matias
Grobol Loïc
Grūzītis Normunds
Guillaume Bruno
Guiller Kirian
Guillot-Barbance Céline
Güngör Tunga
Gurevich Vladimir
Habash Nizar
Hafsteinsson Hinrik
Hajič Jan
Hajič jr. Jan
Hämäläinen Mika
Hà Mỹ Linh
Han Na-Rae
Hanifmuti Muhammad Yudistira
Harada Takahiro
Hardwick Sam
Harris Kim
Hassert Naïma
Haug Dag
Heinecke Johannes
Hellwig Oliver
Hennig Felix
Hladká Barbora
Hlaváčová Jaroslava
Hociung Florinel
Hoefels Diana
Hohle Petter
Howell Nick
Huang Yidi
Huerta Mendez Marivel
Hwang Jena
Ikeda Takumi
Iliadou Inessa
Ingason Anton Karl
Ion Radu
Irimia Elena
Ishola Ọlájídé
Islamaj Artan
Ito Kaoru
Iurescia Federica
Jagodzińska Sandra
Jannat Siratun
Jelínek Tomáš
Jha Apoorva
Jiang Katharine
Jobanputra Mayank
Johannsen Anders
Jónsdóttir Hildur
Jørgensen Fredrik
Juutinen Markus
Kaşıkara Hüner
Kabaeva Nadezhda
Kahane Sylvain
Kanayama Hiroshi
Kanerva Jenna
Kara Neslihan
Karahóǧa Ritván
Kåsen Andre
Kayadelen Tolga
Kengatharaiyer Sarveswaran
Kettnerová Václava
Kharatyan Lilit
Kirchner Jesse
Klementieva Elena
Klyachko Elena
Kocharov Petr
Köhn Arne
Köksal Abdullatif
Kopacewicz Kamil
Korkiakangas Timo
Köse Mehmet
Koshevoy Alexey
Kote Nelda
Kotsyba Natalia
Kovačić Barbara
Kovalevskaitė Jolanta
Kowner Emmanuelle
Krek Simon
Krishnamurthy Parameswari
Kübler Sandra
Kuqi Adrian
Kuyrukçu Oğuzhan
Kuzgun Aslı
Kwak Sookyoung
Kyle Kris
Laan Käbi
Laippala Veronika
Lambertino Lorenzo
Landau Israel
Lando Tatiana
Larasati Septina Dian
Lavrentiev Alexei
Lee John
Lê Hồng Phương
Lenci Alessandro
Lertpradit Saran
Leung Herman
Levina Maria
Levine Lauren
Li Cheuk Ying
Li Josie
Li Keying
Li Yixuan
Li Yuan
Lim KyungTae
Lima Padovani Bruna
Lin Yi-Ju Jessica
Lindén Krister
Liu Yang Janet
Ljubešić Nikola
Lobzhanidze Irina
Loginova Olga
Lopes Lucelene
Luftiu Edita
Lukashevskyi Arsenii
Lusito Stefano
Lutgen Anne-Marie
Luthfi Andry
Luukko Mikko
Lyashevskaya Olga
Lynn Teresa
Macketanz Vivien
Mahamdi Menel
Maillard Jean
Makarchuk Ilya
Makazhanov Aibek
Mambrini Francesco
Mandl Michael
Manning Christopher
Manurung Ruli
Marşan Büşra
Mărănduc Cătălina
Mareček David
Marheinecke Katrin
Markantonatou Stella
Martínez Alonso Héctor
Martín Rodríguez Lorena
Martins André
Martins Cláudia
Mašek Jan
Matsuda Hiroshi
Matsumoto Yuji
Mazzei Alessandro
McDonald Ryan
McGuinness Sarah
Mehta Maitrey
Ménard Pierre André
Mendonça Gustavo
Merhav Hilla
Merzhevich Tatiana
Meurer Paul
Miekka Niko
Milano Emilia
Miller Aaron
Minerbi Yael
Mischenkova Karina
Missilä Anna
Mititelu Cătălin
Mitrofan Maria
Miyao Yusuke
Mojiri Foroushani AmirHossein
Molnár Judit
Moloodi Amirsaeid
Montemagni Simonetta
More Amir
Moreno Romero Laura
Moretti Giovanni
Mori Shinsuke
Morioka Tomohiko
Moro Shigeki
Mortensen Bjartur
Moskalevskyi Bohdan
Muischnek Kadri
Munro Robert
Murawaki Yugo
Müürisep Kaili
Nainwani Pinkey
Nakhlé Mariam
Navarro Horñiacek Juan Ignacio
Nedoluzhko Anna
Nešpore-Bērzkalne Gunta
Nevaci Manuela
Nguyễn Thị Lương
Nguyễn Thị Minh Huyền
Nikaido Yoshihiro
Nikolaev Vitaly
Nitisaroj Rattima
Norrman Victor
Nourian Alireza
Nunes Maria das Graças Volpe
Nurmi Hanna
Ojala Stina
Ojha Atul Kr.
Óladóttir Hulda
Olúòkun Adédayọ̀
Omura Mai
Onwuegbuzia Emeka
Ordan Noam
Osenova Petya
Östling Robert
Ott Annika
Øvrelid Lilja
Özateş Şaziye Betül
Özçelik Merve
Özgür Arzucan
Öztürk Başaran Balkız
Paccosi Teresa
Palmero Aprosio Alessio
Panova Anastasia
Pardo Thiago Alexandre Salgueiro
Park Hyunji Hayley
Partanen Niko
Pascual Elena
Passarotti Marco
Patejuk Agnieszka
Paulino-Passos Guilherme
Pedonese Giulia
Peeters Oggi
Peljak-Łapińska Angelika
Peng Siyao
Peng Siyao Logan
Pereira Rita
Pereira Sílvia
Perez Cenel-Augusto
Perkova Natalia
Perrier Guy
Petrov Slav
Petrova Daria
Peverelli Andrea
Phelan Jason
Pierre-Louis Claudel
Piitulainen Jussi
Pinter Yuval
Pinto Clara
Pintucci Rodrigo
Pirinen Tommi A
Pitler Emily
Plamada Magdalena
Plank Barbara
Plum Alistair
Poibeau Thierry
Ponomareva Larisa
Popel Martin
Pretkalniņa Lauma
Pretorius Rigardt
Prévost Sophie
Prokopidis Prokopis
Przepiórkowski Adam
Pugh Robert
Puolakainen Tiina
Purschke Christoph
Pyysalo Sampo
Qi Peng
Querido Andreia
Rääbis Andriela
Rabinovich Ella
Rademaker Alexandre
Rahoman Mizanur
Rama Taraka
Ramasamy Loganathan
Ramisch Carlos
Ramos Joana
Rashel Fam
Rasooli Mohammad Sadegh
Ravishankar Vinit
Real Livy
Rebeja Petru
Reddy Siva
Regnault Mathilde
Rehm Georg
Riabi Arij
Riabov Ivan
Rießler Michael
Rimkutė Erika
Rinaldi Larissa
Rituma Laura
Rizqiyah Putri
Rocha Luisa
Rögnvaldsson Eiríkur
Roksandic Ivan
Roman Norton Trevisan
Romanenko Mykhailo
Rosa Rudolf
Roșca Valentin
Roulon Paulette
Rovati Davide
Rozonoyer Ben
Rudina Olga
Rueter Jack
Ruffolo Paolo
Rúnarsson Kristján
Rushiti Rozana
Sadde Shoval
Safari Pegah
Sahala Aleksi
Saleh Shadi
Salomoni Alessio
Publication venue: Universal Dependencies Consortium
Publication date: 15/11/2024

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).

YouTube-ASL Clip Keypoint Dataset

Author: Zelezny Tomas
Hruz Marek
Straka Jakub
Gueuwou Shester
Publication venue: University of Western Bohemia, Pilsen
Publication date: 01/07/2024

The YouTube-ASL Clip Keypoint Dataset is a curated collection of sentence-level American Sign Language (ASL) keypoint sequences derived from publicly available YouTube videos. Rather than providing raw video files, the dataset consists solely of JSON files containing frame-by-frame 2D keypoints extracted from segmented clips of individual signed sentences. Each frame has been processed using MediaPipe, which generates 208 2D keypoints representing body, face, hands, and pose landmarks. These keypoint sequences provide a compact, privacy-preserving representation of ASL visual-linguistic content, enabling research in sign language recognition, gesture analysis, and multimodal communication. The dataset consists of 390,547 JSON files, packed into 10 separate zip files for easier handling. Besides the keypoint files, we also provide the annotation JSON files.

SFU Opinion and Comments Corpus (SOCC) for NoSketch Engine

Author: Marek Hába
Publication venue: Masaryk University, NLP Centre
Publication date: 01/01/2024

The SFU Opinion and Comments Corpus (SOCC) is a corpus for the analysis of online news comments. It contains opinion articles and the comments posted on them. It was tagged using TreeTagger and prepared for the NoSketch Engine corpus manager. The 7z archive already contains the prepared registry ("sfu_opinion_and_comments"), subcdef files, scripts, and the vertical file, which is itself archived in 7z format. To complete the setup, simply configure the paths in the registry and compile the corpus.

Human Label Variation in Attribution and Discourse (Hlava AD)

Author: Zikánová Šárka
Mírovský Jiří
Nedoluzhko Anna
Hajičová Eva
Dohnalová Šárka
Kmječová Anna
Nodlová Eliška
Teska Dominik
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 13/12/2024

Human Label Variation in Attribution and Discourse (Hlava AD) is a collection of commented multiple annotations (5 annotators) of inter-sentential explicit discourse relations between complex sentences containing verbs of attribution (saying, thinking) and the sentences that follow them in Czech. The main aim of the annotation is to capture how often the following sentence is seen as a follow-up to the direct/reported speech or to the author's speech. The dataset also contains fillers (complex sentences with other types of verbs). Please visit https://ufal.mff.cuni.cz/hvar/hlava-ad for detailed and updated information about the corpus.

AudioPSP 24.01: Audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic

Author: Kopp Matyáš
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 01/01/2024

This record contains audio recordings of proceedings of the Chamber of Deputies of the Parliament of the Czech Republic. The recordings have been provided by the official websites of the Chamber of Deputies, and the set contains them in their original format with no further processing. Recordings cover all available audio files from 2013-11-25 to 2023-07-26. Audio files are packed by year (2013-2023) and quarter (Q1-Q4) in tar archives named audioPSP-YYYY-QN.tar. Furthermore, there are two TSV files: audioPSP-meta.quarterArchive.tsv contains metadata about archives, and audioPSP-meta.audioFile.tsv contains metadata about individual audio files.
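
The archive naming scheme can be enumerated programmatically. The following is a minimal sketch, assuming that every quarter touched by the stated date range has exactly one archive; the helper name `quarter_archives` is hypothetical, and the authoritative list of archives is the one in audioPSP-meta.quarterArchive.tsv.

```python
from datetime import date


def quarter_archives(start: date, end: date) -> list[str]:
    """List names of the form audioPSP-YYYY-QN.tar for each quarter from start to end."""
    names = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    end_key = (end.year, (end.month - 1) // 3 + 1)
    # Walk quarter by quarter until the quarter containing the end date.
    while (year, quarter) <= end_key:
        names.append(f"audioPSP-{year}-Q{quarter}.tar")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return names


# Date range taken from the record description.
archives = quarter_archives(date(2013, 11, 25), date(2023, 7, 26))
print(archives[0], archives[-1], len(archives))
```

Whether an archive actually exists for each generated name is not guaranteed by the description, so the result should be checked against the metadata TSV before downloading.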

Smashcima (2025-03-28)

Author: Mayer Jiří
Pecina Pavel
Hajič jr. Jan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 30/12/2024

Smashcima is a library and framework for synthesizing images of handwritten music, used to create synthetic training data for OMR models. It is primarily intended to be used as part of optical music recognition workflows, especially with domain adaptation in mind. The target user is therefore a machine-learning, document processing, library sciences, or computational musicology researcher with minimal skills in Python programming.

Smashcima is the only tool that simultaneously:
- synthesizes handwritten music notation,
- produces not only raster images but also segmentation masks, classification labels, bounding boxes, and more,
- synthesizes entire pages as well as individual symbols,
- synthesizes background paper textures,
- also synthesizes polyphonic and pianoform music images,
- accepts just MusicXML as input,
- is written in Python, which simplifies its adoption and extensibility.

Therefore, Smashcima brings a unique new capability for optical music recognition (OMR): synthesizing a near-realistic image of handwritten sheet music from just a MusicXML file. As opposed to notation editors, which work with a fixed set of fonts and a set of layout rules, it can adapt handwriting styles from existing OMR datasets to arbitrary music (beyond the music encoded in existing OMR datasets), and randomize layout to simulate the imprecisions of handwriting, while guaranteeing the semantic correctness of the output rendering. Crucially, the rendered image is also provided with the positions of all the visual elements of music notation, so that both object detection-based and sequence-to-sequence OMR pipelines can use Smashcima as a synthesizer of training data. (In combination with the LMX canonical linearization of MusicXML, one can imagine the endless possibilities of running Smashcima on inputs from a MusicXML generator.)

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2016 – VERSION 1)

Author: Rüdiger Jan Oliver
Publication venue: Rüdiger, Jan Oliver
Publication date: 12/11/2024

The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens as of DeReKo Release 2024-I). The corpus is separated by year (here: 2016) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible internet free of charge for research purposes. The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat), using its CORE14 profile. Only those documents/URLs for which German was the most likely language were processed further, in order to exclude foreign-language material such as individual subpages as far as possible. The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year). The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and the author's own (supplementary) scripts, and TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation.

The corpus was processed on the HELIX HPC cluster. The author would like to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster (funding code of the HPC cluster: INST 35/1597-1 FUGG).

Data content:
- Tokens and sentence boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - Unique identifier of the document
  - YEAR - Year of capture (please use this information for data slices)
  - Url - Full URL
  - Tld - Top-level domain
  - Domain - Domain without TLD (but with sub-domains if applicable)
  - DomainFull - Complete domain (incl. TLD)
  - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl, not date of creation/modification of the document)
  - Hash - (System Information): SHA1 hash of the CommonCrawl
  - Pfad - (System Information): Path of the cluster (raw data); supplied by the system

Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above), which ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows, see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:
- CATMA v6
- CoNLL
- CSV
- CSV (only meta-data)
- DTA TCF-XML
- DWDS TEI-XML
- HTML
- IDS I5-XML
- IDS KorAP XML
- IMS Open Corpus Workbench
- JSON
- OPUS Corpus Collection XCES
- Plaintext
- SaltXML
- SlashA XML
- SketchEngine VERT
- SPEEDy/CODEX (JSON)
- TLV-XML
- TreeTagger
- TXM
- WebLicht
- XML

Please note that an export increases the storage space requirement considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console, available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

Legal information: The data was downloaded on 01.11.2024. The use, processing and distribution is subject to §60d UrhG (German copyright law), which authorises the use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified, e.g. file name, URL or domain. The author will endeavour to identify and remove the content and to re-upload the modified data as a new version within two weeks. If you have any further questions, please contact CLARIN.
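
The TLD filter and the exact-duplicate filter described in this record can be illustrated in a few lines. This is only a sketch under stated assumptions, not the author's pipeline (which used CorpusExplorer and supplementary scripts): the function names `has_german_tld` and `drop_exact_duplicates` are hypothetical, URLs are assumed to carry a scheme, and duplicates are assumed to mean byte-identical document text.

```python
import hashlib
from urllib.parse import urlsplit

# TLD list copied from the record description.
GERMAN_TLDS = frozenset({
    "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
    "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
})


def has_german_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the listed TLDs."""
    host = urlsplit(url).hostname  # None when the URL has no scheme/netloc
    if not host:
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in GERMAN_TLDS


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, comparing by SHA1 digest."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


urls = ["https://www.example.de/seite", "https://example.com/", "http://shop.example.wien:8080/x"]
print([u for u in urls if has_german_tld(u)])
```

The language-identification step is left out on purpose, since it depends on NTextCat and its CORE14 profile.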

Treebanks for Unified Taxonomy of Deep Syntactic Relations

Author: Droganova Kira
Zeman Daniel
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 01/01/2024

This record contains the datasets described in: Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024. Four languages are included in this release. English PropBank is omitted due to its license terms.

The Use of Machine Translation by Ukrainian War Refugees in Czechia

Author: Agapova Anna
Špačková Stanislava
Publication venue: Oxford University Press
Publication date: 01/01/2024

Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15, exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The spreadsheet contains minimally processed data exported from two questionnaires that were created in Google Forms, one in Ukrainian and one in Russian. The links to these questionnaires were distributed in three ways: by direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution); and on social networks, by posting links to the survey in groups for the Ukrainian community across Czech regions and towns. Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously or who had received temporary protection in other countries. For this reason, the textual answers to question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for the Czech Republic, 2 for other countries), so that we could separate out the data of respondents not located in the Czech Republic, which were irrelevant for our survey. Also, in this version of the dataset, the textual answers to question 1.6 "How many months have you been to this country?" were replaced by numbers, so that we could separate the data of respondents who arrived in the Czech Republic in February 2022 or later from the other data (0 for those staying in Czechia before February 2022, 1 for those staying in Czechia since February 2022 or later, 2 for those staying in other countries).
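
The numeric codes for questions 1.5 and 1.6 make it straightforward to select the survey's target group. The following is a minimal sketch, assuming the spreadsheet has been loaded into a list of dictionaries; the column names `q1_5_country` and `q1_6_stay` and the sample rows are hypothetical, while the code values are the ones given in the description.

```python
# Codes as documented in the record description.
COUNTRY_CZECH_REPUBLIC = 1
COUNTRY_OTHER = 2
STAY_BEFORE_FEB_2022 = 0
STAY_SINCE_FEB_2022 = 1
STAY_OTHER_COUNTRY = 2


def target_respondents(rows: list[dict]) -> list[dict]:
    """Keep respondents located in the Czech Republic since February 2022 or later."""
    return [
        row for row in rows
        if row["q1_5_country"] == COUNTRY_CZECH_REPUBLIC
        and row["q1_6_stay"] == STAY_SINCE_FEB_2022
    ]


# Hypothetical sample rows, for illustration only.
rows = [
    {"id": 1, "q1_5_country": 1, "q1_6_stay": 1},
    {"id": 2, "q1_5_country": 1, "q1_6_stay": 0},
    {"id": 3, "q1_5_country": 2, "q1_6_stay": 2},
]
selected = target_respondents(rows)
print([row["id"] for row in selected])
```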

Czech Natural Language Inference Dataset with Explanations

Author: Víta Martin
Nevěřilová Zuzana
Publication venue: Masaryk University, NLP Centre
Publication date: 01/01/2024

The dataset contains two parts: the original Stanford Natural Language Inference (SNLI) dataset automatically translated into Czech, and, for some SNLI items, annotations of the Czech content together with explanations.

The Czech SNLI data contain both the Czech and the English premise-hypothesis pairs. The SNLI split into train/test/dev is preserved:
- CZtrainSNLI.csv: 550152 pairs
- CZtestSNLI.csv: 10000 pairs
- CZdevSNLI.csv: 10000 pairs

The explanation dataset contains batches of premise-hypothesis pairs. Each batch contains 1499 pairs. Each pair contains:
- reference to the original SNLI example
- English premise and English hypothesis
- English gold label (one of Entailment, Contradiction, Neutral)
- premise and hypothesis automatically translated to Czech
- Czech gold label (one of entailment, contradiction, neutral, bad translation)
- explanations for the Czech label

Example record:
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-hypothesis: Matka
Explanation-premise: Osoba
Explanation-relation: generalization

Size of the explanations dataset:
- train: 159650
- dev: 2860
- test: 2880

Inter-Annotator Agreement (IAA): Packages 1 and 12 annotate the same data. The IAA measured by the kappa score is 0.67 (substantial agreement).

The translation was performed via the LINDAT translation service. Next, the translated pairs were manually checked (without access to the original English gold label), with the option of checking the original pair.

Explanations were annotated as follows:
- if there is a part of the premise or hypothesis that is relevant for the annotator's decision, it is marked
- if there are two such parts and there exists a relation between them, the relation is marked

Possible relation types:
- generalization: white long skirt - skirt
- specification: dog - bulldog
- similar: couch - sofa
- independence: they have no instruments - they belong to the group
- exclusion: man - woman

Original SNLI dataset: https://nlp.stanford.edu/projects/snli/
LINDAT Translation Service: https://lindat.mff.cuni.cz/services/translation
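
The record reports a kappa score for the doubly annotated packages without saying which kappa variant was used. As an illustration only, the sketch below computes Cohen's kappa for two annotators; that choice, the function name `cohen_kappa`, and the toy labels are assumptions and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration.
annotator_1 = ["entailment", "neutral", "contradiction", "neutral"]
annotator_2 = ["entailment", "neutral", "neutral", "neutral"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

On a common reading of the scale, values between 0.61 and 0.80 count as substantial agreement, which matches the wording used for the reported 0.67.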

0 full texts, 1,998 metadata records. Updated in last 30 days.
