Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

1998 research outputs found

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1)

Author: Rüdiger Jan Oliver
Publication venue: Rüdiger, Jan Oliver
Publication date: 03/12/2024

The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens as of DeReKo Release 2024-I). The corpus is separated by year (here: 2024) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible internet free of charge for research purposes. The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of each document (URL) was then estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat), using its CORE14 profile. Only documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering by manual selectors and removing 1:1 duplicates (within one year). The filtering and subsequent processing were carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and the author's own (supplementary) scripts; TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation.

The corpus was processed on the HELIX HPC cluster. The author thanks the state of Baden-Württemberg and the German Research Foundation (DFG) for the opportunity to use the bwHPC/HELIX HPC cluster (funding code of the HPC cluster: INST 35/1597-1 FUGG).

Data content:
- Tokens and sentence boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID: unique identifier of the document
  - YEAR: year of capture (please use this information for data slices)
  - Url: full URL
  - Tld: top-level domain
  - Domain: domain without TLD (but with sub-domains if applicable)
  - DomainFull: complete domain (incl. TLD)
  - Datum (system information): date of the CorpusExplorer (date of capture by CommonCrawl, not the date of creation/modification of the document)
  - Hash (system information): SHA1 hash of the CommonCrawl
  - Pfad (system information): path of the cluster (raw data); supplied by the system

Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above), which ensure efficient archiving. You can use either CorpusExplorer or the ‘CEC6-Converter’ (available for Linux, MacOS and Windows, see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats: CATMA v6; CoNLL; CSV; CSV (metadata only); DTA TCF-XML; DWDS TEI-XML; HTML; IDS I5-XML; IDS KorAP XML; IMS Open Corpus Workbench; JSON; OPUS Corpus Collection XCES; Plaintext; SaltXML; SlashA XML; SketchEngine VERT; SPEEDy/CODEX (JSON); TLV-XML; TreeTagger; TXM; WebLicht; XML. Please note that an export increases the storage space requirement considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console, available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

Legal information

The data was downloaded on 01.11.2024. The use, processing and distribution is subject to §60d UrhG (German copyright law), which authorises the use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author ([email protected]) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified, e.g. file name, URL or domain. The author will endeavour to identify the content, remove it and re-upload the (modified) data within two weeks (new version). If you have any further questions, please contact CLARIN.
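
The TLD filter described in this record can be illustrated with a minimal Python sketch. It is not the author's implementation (the record states that CorpusExplorer and supplementary scripts were used); the function names `tld_of` and `keep_url` are hypothetical, and only the TLD whitelist itself is taken from the description.

```python
from urllib.parse import urlsplit

# TLD whitelist copied from the corpus description.
GERMAN_TLDS = frozenset({
    "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
    "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
})


def tld_of(url: str) -> str:
    """Return the lower-cased last label of the URL's host name ('' if none)."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    if "." not in host:
        return ""
    return host.rsplit(".", 1)[-1].lower()


def keep_url(url: str) -> bool:
    """Hypothetical filter: keep a page only if its TLD is on the whitelist."""
    return tld_of(url) in GERMAN_TLDS
```

A URL such as `https://www.example.de/seite` would pass this check, while `https://example.com/` would not. The language-identification step with NTextCat would follow as a separate filter and is not sketched here.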

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

Author: Hajič Jan
Bejček Eduard
Bémová Alevtina
Buráňová Eva
Fučíková Eva
Hajičová Eva
Havelka Jiří
Hlaváčová Jaroslava
Homola Petr
Ircing Pavel
Kárník Jiří
Kettnerová Václava
Klyueva Natalia
Kolářová Veronika
Kučová Lucie
Lopatková Markéta
Mareček David
Mikulová Marie
Mírovský Jiří
Nedoluzhko Anna
Novák Michal
Pajas Petr
Panevová Jarmila
Peterek Nino
Poláková Lucie
Popel Martin
Popelka Jan
Romportl Jan
Rysová Magdaléna
Semecký Jiří
Sgall Petr
Spoustová Johanka
Straka Milan
Straňák Pavel
Synková Pavlína
Ševčíková Magda
Šindlerová Jana
Štěpánek Jan
Štěpánková Barbora
Toman Josef
Urešová Zdeňka
Vidová Hladká Barbora
Zeman Daniel
Zikánová Šárka
Žabokrtský Zdeněk
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 30/12/2024

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank – Consolidated 2.0 (PDT-C 2.0) is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme. The PDT corpora included in PDT-C are: the Prague Dependency Treebank (written newspaper and journal texts from three genres); the Czech part of the Prague Czech-English Dependency Treebank (financial texts translated from English); the Prague Dependency Treebank of Spoken Czech (spoken data, including audio, transcripts and multiple speech reconstruction annotation); and PDT-Faust (user-generated texts). The treebanks, originally published separately, are released in one package to allow easier data handling across all the datasets, and they are enhanced with further manual linguistic annotation. In the previous version, PDT-C 1.0, the data was enhanced with manual linguistic annotation at the morphological layer. For PDT-C 2.0, manual annotation at the analytical layer was performed in those parts of the corpus that were previously annotated only by automatic tools. A further goal of the annotation work was to consolidate the manual annotation across all layers, which resulted in many modifications and corrections to the original annotation. Manual annotation of discourse relations is now also provided for all PDT-C 2.0 data. The PDT-C 2.0 release thus contains manual annotation at all annotation layers (morphological, surface syntactic (analytical) and deep syntactic (tectogrammatical)) in all four datasets. Additional semantic features in the PDT dataset are also manually annotated. A new version of the morphological dictionary and a common valency lexicon for all four original parts are enclosed. The documentation provides two browsing and editing desktop tools (TrEd and MEd), and the corpus is also available online for searching using PML-TQ.

NomVallex 2.5

Author: Kolářová Veronika
Kettnerová Václava
Klímová Jana
Mírovský Jiří
Vernerová Anna
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 31/12/2024

NomVallex is a manually annotated valency lexicon of Czech nouns and adjectives, based on the theoretical framework of Functional Generative Description. In total, NomVallex 2.5 comprises 1337 lexical units contained in 730 lexemes. As for derivational categories, it covers deverbal, deadjectival or denominal nouns, and deverbal, denominal, deadjectival or primary adjectives. Valency properties of a lexical unit are captured in a valency frame (modeled as a sequence of valency slots, each supplemented with a list of morphemic forms) and documented by corpus examples (extracted from the SYN series of corpora from the Czech National Corpus or from the Araneum Bohemicum Maximum corpus). To enable analysis of the relationship between the valency behavior of base words and their derivatives, lexical units of nouns and adjectives in NomVallex are linked to their respective base lexical units (contained either in NomVallex itself or, in the case of verbs, in the VALLEX lexicon), linking together up to three parts of speech (i.e., noun–verb, e.g., vnímání ‘perception’ – vnímat ‘perceive’; adjective–verb, e.g., vnímatelný ‘perceivable’ – vnímat ‘perceive’; noun–adjective, e.g., vnímavost ‘perceptiveness’ – vnímavý ‘perceptive’; and noun–adjective–verb, e.g., vnímavost ‘perceptiveness’ – vnímavý ‘perceptive’ – vnímat ‘perceive’). NomVallex 2.5 is an enhanced edition of NomVallex 2.0; new developments in version 2.5 include an increase in the number of noun and adjectival lexemes covered, treatment of negation (i.e., negative forms of nouns and adjectives), and annotation of reciprocity or reflexivity. Annotators: Veronika Kolářová, Václava Kettnerová, Jana Klímová and Jakub Sláma. Software and technical support: Jiří Mírovský and Anna Vernerová.

Universal Dependencies 2.14

Author: Zeman Daniel
Nivre Joakim
Abrams Mitchell
Ackermann Elia
Aepli Noëmi
Aghaei Hamid
Agić Željko
Ahmadi Amir
Ahrenberg Lars
Ajede Chika Kennedy
Akkurt Salih Furkan
Aleksandravičiūtė Gabrielė
Alfina Ika
Algom Avner
Alnajjar Khalid
Alzetta Chiara
Andersen Erik
Antonsen Lene
Aoyama Tatsuya
Aplonova Katya
Aquino Angelina
Aragon Carolina
Aranes Glyd
Aranzabe Maria Jesus
Arıcan Bilge Nas
Arnardóttir Þórunn
Arutie Gashaw
Arwidarasti Jessica Naraiswari
Asahara Masayuki
Ásgeirsdóttir Katla
Aslan Deniz Baran
Asmazoğlu Cengiz
Ateyah Luma
Atmaca Furkan
Attia Mohammed
Atutxa Aitziber
Augustinus Liesbeth
Avelãs Mariana
Badmaeva Elena
Balasubramani Keerthana
Ballesteros Miguel
Banerjee Esha
Bank Sebastian
Barbu Mititelu Verginica
Barkarson Starkaður
Basile Rodolfo
Basmov Victoria
Batchelor Colin
Bauer John
Bedir Seyyit Talha
Behzad Shabnam
Belieni Juan
Bengoetxea Kepa
Benli İbrahim
Ben Moshe Yifat
Berg Ansu
Berk Gözde
Bhat Riyaz Ahmad
Biagetti Erica
Bick Eckhard
Bielinskienė Agnė
Bilgin Taşdemir Esma Fatıma
Bjarnadóttir Kristín
Blaschke Verena
Blokland Rogier
Bobicev Victoria
Boizou Loïc
Bonilla Johnatan
Borges Völker Emanuel
Börstell Carl
Bosco Cristina
Bouma Gosse
Bowman Sam
Boyd Adriane
Braggaar Anouck
Branco António
Brokaitė Kristina
Burchardt Aljoscha
Campos Marisa
Candito Marie
Caron Bernard
Caron Gauthier
Carvalheiro Catarina
Carvalho Rita
Cassidy Lauren
Castro Maria Clara
Castro Sérgio
Cavalcanti Tatiana
Cebiroğlu Eryiğit Gülşen
Cecchini Flavio Massimiliano
Celano Giuseppe G. A.
Čéplö Slavomír
Cesur Neslihan
Cetin Savas
Çetinoğlu Özlem
Chalub Fabricio
Chamila Liyanage
Chauhan Shweta
Chen Yifei
Chi Ethan
Chika Taishi
Cho Yongseok
Choi Jinho
Chontaeva Bermet
Chun Jayeol
Chung Juyeon
Cignarella Alessandra T.
Cinková Silvie
Collomb Aurélie
Çöltekin Çağrı
Connor Miriam
Corbetta Claudia
Corbetta Daniela
Costa Francisco
Courtin Marine
Crabbé Benoît
Cristescu Mihaela
Cvetkoski Vladimir
Dale Ingerid Løyning
Daniel Philemon
Davidson Elizabeth
de Alencar Leonel Figueiredo
Dehouck Mathieu
de Laurentiis Martina
de Marneffe Marie-Catherine
de Paiva Valeria
Derin Mehmet Oguz
de Souza Elvis
Diaz de Ilarraza Arantza
Díaz Hernández Roberto Antonio
Dickerson Carly
Dinakaramani Arawinda
Di Nuovo Elisa
Dione Bamba
Dirix Peter
Do Hoa
Dobrovoljc Kaja
Döhmer Caroline
Doyle Adrian
Dozat Timothy
Droganova Kira
Duran Magali Sanches
Dwivedi Puneet
Ebert Christian
Eckhoff Hanne
Eguchi Masaki
Eiche Sandra
Eiselen Roald
Eli Marhaba
Elkahky Ali
Ephrem Binyam
Erina Olga
Erjavec Tomaž
Eslami Soudabeh
Essaidi Farah
Etienne Aline
Evelyn Wograine
Facundes Sidney
Farkas Richárd
Favero Federica
Ferdaousi Jannatul
Fernanda Marília
Fernandez Alcalde Hector
Fethi Amal
Foster Jennifer
Fransen Theodorus
Freitas Cláudia
Fujita Kazunori
Gajdošová Katarína
Galbraith Daniel
Galy Edith
Gamba Federica
Garcia Marcos
Gärdenfors Moa
Gaustad Tanja
Genç Efe Eren
Gerardi Fabrício Ferraz
Gerdes Kim
Gessler Luke
Ginter Filip
Godoy Gustavo
Goenaga Iakes
Gojenola Koldo
Gökırmak Memduh
Goldberg Yoav
Gómez Guinovart Xavier
González Saavedra Berta
Griciūtė Bernadeta
Grioni Matias
Grobol Loïc
Grūzītis Normunds
Guillaume Bruno
Guiller Kirian
Guillot-Barbance Céline
Güngör Tunga
Habash Nizar
Hafsteinsson Hinrik
Hajič Jan
Hajič jr. Jan
Hämäläinen Mika
Hà Mỹ Linh
Han Na-Rae
Hanifmuti Muhammad Yudistira
Harada Takahiro
Hardwick Sam
Harris Kim
Hassert Naïma
Haug Dag
Heinecke Johannes
Hellwig Oliver
Hennig Felix
Hladká Barbora
Hlaváčová Jaroslava
Hociung Florinel
Hoefels Diana
Hohle Petter
Huang Yidi
Huerta Mendez Marivel
Hwang Jena
Ikeda Takumi
Iliadou Inessa
Ingason Anton Karl
Ion Radu
Irimia Elena
Ishola Ọlájídé
Islamaj Artan
Ito Kaoru
Iurescia Federica
Jagodzińska Sandra
Jannat Siratun
Jelínek Tomáš
Jha Apoorva
Jiang Katharine
Jobanputra Mayank
Johannsen Anders
Jónsdóttir Hildur
Jørgensen Fredrik
Juutinen Markus
Kaşıkara Hüner
Kabaeva Nadezhda
Kahane Sylvain
Kanayama Hiroshi
Kanerva Jenna
Kara Neslihan
Karahóǧa Ritván
Kåsen Andre
Kayadelen Tolga
Kengatharaiyer Sarveswaran
Kettnerová Václava
Kharatyan Lilit
Kirchner Jesse
Klementieva Elena
Klyachko Elena
Kocharov Petr
Köhn Arne
Köksal Abdullatif
Kopacewicz Kamil
Korkiakangas Timo
Köse Mehmet
Koshevoy Alexey
Kotsyba Natalia
Kovačić Barbara
Kovalevskaitė Jolanta
Krek Simon
Krishnamurthy Parameswari
Kübler Sandra
Kuqi Adrian
Kuyrukçu Oğuzhan
Kuzgun Aslı
Kwak Sookyoung
Kyle Kris
Laan Käbi
Laippala Veronika
Lambertino Lorenzo
Lando Tatiana
Larasati Septina Dian
Lavrentiev Alexei
Lee John
Lê Hồng Phương
Lenci Alessandro
Lertpradit Saran
Leung Herman
Levina Maria
Levine Lauren
Li Cheuk Ying
Li Josie
Li Keying
Li Yixuan
Li Yuan
Lim KyungTae
Lima Padovani Bruna
Lin Yi-Ju Jessica
Lindén Krister
Liu Yang Janet
Ljubešić Nikola
Lobzhanidze Irina
Loginova Olga
Lopes Lucelene
Lusito Stefano
Lutgen Anne-Marie
Luthfi Andry
Luukko Mikko
Lyashevskaya Olga
Lynn Teresa
Macketanz Vivien
Mahamdi Menel
Maillard Jean
Makarchuk Ilya
Makazhanov Aibek
Mambrini Francesco
Mandl Michael
Manning Christopher
Manurung Ruli
Marşan Büşra
Mărănduc Cătălina
Mareček David
Marheinecke Katrin
Markantonatou Stella
Martínez Alonso Héctor
Martín Rodríguez Lorena
Martins André
Martins Cláudia
Mašek Jan
Matsuda Hiroshi
Matsumoto Yuji
Mazzei Alessandro
McDonald Ryan
McGuinness Sarah
Mehta Maitrey
Ménard Pierre André
Mendonça Gustavo
Merzhevich Tatiana
Meurer Paul
Miekka Niko
Milano Emilia
Miller Aaron
Mischenkova Karina
Missilä Anna
Mititelu Cătălin
Mitrofan Maria
Miyao Yusuke
Mojiri Foroushani AmirHossein
Molnár Judit
Moloodi Amirsaeid
Montemagni Simonetta
More Amir
Moreno Romero Laura
Moretti Giovanni
Mori Shinsuke
Morioka Tomohiko
Moro Shigeki
Mortensen Bjartur
Moskalevskyi Bohdan
Muischnek Kadri
Munro Robert
Murawaki Yugo
Müürisep Kaili
Nainwani Pinkey
Nakhlé Mariam
Navarro Horñiacek Juan Ignacio
Nedoluzhko Anna
Nešpore-Bērzkalne Gunta
Nevaci Manuela
Nguyễn Thị Lương
Nguyễn Thị Minh Huyền
Nikaido Yoshihiro
Nikolaev Vitaly
Nitisaroj Rattima
Norrman Victor
Nourian Alireza
Nunes Maria das Graças Volpe
Nurmi Hanna
Ojala Stina
Ojha Atul Kr.
Óladóttir Hulda
Olúòkun Adédayọ̀
Omura Mai
Onwuegbuzia Emeka
Ordan Noam
Osenova Petya
Östling Robert
Ott Annika
Øvrelid Lilja
Özateş Şaziye Betül
Özçelik Merve
Özgür Arzucan
Öztürk Başaran Balkız
Paccosi Teresa
Palmero Aprosio Alessio
Panova Anastasia
Pardo Thiago Alexandre Salgueiro
Park Hyunji Hayley
Partanen Niko
Pascual Elena
Passarotti Marco
Patejuk Agnieszka
Paulino-Passos Guilherme
Pedonese Giulia
Peljak-Łapińska Angelika
Peng Siyao
Peng Siyao Logan
Pereira Rita
Pereira Sílvia
Perez Cenel-Augusto
Perkova Natalia
Perrier Guy
Petrov Slav
Petrova Daria
Peverelli Andrea
Phelan Jason
Pierre-Louis Claudel
Piitulainen Jussi
Pinter Yuval
Pinto Clara
Pintucci Rodrigo
Pirinen Tommi A
Pitler Emily
Plamada Magdalena
Plank Barbara
Plum Alistair
Poibeau Thierry
Ponomareva Larisa
Popel Martin
Pretkalniņa Lauma
Pretorius Rigardt
Prévost Sophie
Prokopidis Prokopis
Przepiórkowski Adam
Pugh Robert
Puolakainen Tiina
Purschke Christoph
Pyysalo Sampo
Qi Peng
Querido Andreia
Rääbis Andriela
Rademaker Alexandre
Rahoman Mizanur
Rama Taraka
Ramasamy Loganathan
Ramisch Carlos
Ramos Joana
Rashel Fam
Rasooli Mohammad Sadegh
Ravishankar Vinit
Real Livy
Rebeja Petru
Reddy Siva
Regnault Mathilde
Rehm Georg
Riabi Arij
Riabov Ivan
Rießler Michael
Rimkutė Erika
Rinaldi Larissa
Rituma Laura
Rizqiyah Putri
Rocha Luisa
Rögnvaldsson Eiríkur
Roksandic Ivan
Romanenko Mykhailo
Rosa Rudolf
Roșca Valentin
Rovati Davide
Rozonoyer Ben
Rudina Olga
Rueter Jack
Ruffolo Paolo
Rúnarsson Kristján
Sadde Shoval
Safari Pegah
Sahala Aleksi
Saleh Shadi
Salomoni Alessio
Samardžić Tanja
Samson Stephanie
Sánchez-Rodríguez Xulia
Sanguinetti Manuela
Sanıyar Ezgi
Särg Dage
Sartor Marta
Sarymsakova Albina
Sasaki Mitsuya
Saulīte Baiba
Savary Agata
Sawanakunanon Yanin
Saxena Shefali
Scannell Kevin
Scarlata Salvatore
Schang Emmanuel
Schneider Nathan
Schuster Sebastian
Schwartz Lane
Seddah Djamé
Seeker Wolfgang
Sellmer Sven
Seraji Mojgan
Shahzadi Syeda
Shen Mo
Shimada Atsuko
Shirasu Hiroyuki
Publication venue: Universal Dependencies Consortium
Publication date: 15/05/2024

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008)

Parallel Global Voices, Czech-English NER+NEL

Author: Nevěřilová Zuzana
Žižková Hana
Publication venue: Masaryk University, Brno
Publication date: 15/06/2024

Named-entity annotation added to the existing Parallel Global Voices resource, ces-eng language pair. The named entity annotations distinguish four classes: Person, Organization, Location, Misc. The annotation follows the IOB schema (one tag per token, marking the beginning and the inside of a multi-word annotation). The NEL annotation contains Wikidata Qnames.
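
As an illustration of the IOB schema mentioned in this record, the following minimal Python sketch groups per-token tags into entity spans. The tag strings (`B-PER`, `I-PER`, `B-LOC`, `O`) and the function name `iob_to_spans` are assumptions made for the example; the record does not state the exact label names or file layout used in the dataset.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB tags into (start, end_exclusive, label) spans.

    Assumes tags look like 'B-XXX', 'I-XXX' or 'O' (hypothetical label names).
    An 'I-' tag that does not continue the current span starts a new one.
    """
    spans: List[Tuple[int, int, str]] = []
    start = None
    label = None
    # A trailing 'O' acts as a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Made-up example sentence, not taken from the dataset.
tokens = ["Václav", "Havel", "visited", "Prague"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
entities = [(" ".join(tokens[s:e]), lab) for s, e, lab in iob_to_spans(tags)]
```

Here `entities` pairs each multi-token mention with its class; linking a mention to a Wikidata identifier would be a separate annotation column and is not modelled in this sketch.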

Prague Discourse Treebank 4.0

Author: Synková Pavlína
Mírovský Jiří
Paclíková Marie
Poláková Lucie
Rysová Magdaléna
Scheller Veronika
Zdeňková Jana
Zikánová Šárka
Hajičová Eva
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 18/12/2024

The Prague Discourse Treebank 4.0 (PDiT 4.0; Synková et al., 2024) is an annotation of discourse relations marked by primary and secondary discourse connectives in the whole data of the Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0; Hajič et al., 2024). With respect to the previous versions of PDiT, annotating discourse relations in the whole PDT-C 2.0 means a significant increase in the size of the annotated data

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2018 – VERSION 1)

Author: Rüdiger Jan Oliver
Publication venue: Rüdiger, Jan Oliver
Publication date: 12/11/2024

The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens as of DeReKo Release 2024-I). The corpus is separated by year (here: 2018) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible internet free of charge for research purposes. The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of each document (URL) was then estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat), using its CORE14 profile. Only documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering by manual selectors and removing 1:1 duplicates (within one year). The filtering and subsequent processing were carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and the author's own (supplementary) scripts; TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation.

The corpus was processed on the HELIX HPC cluster. The author thanks the state of Baden-Württemberg and the German Research Foundation (DFG) for the opportunity to use the bwHPC/HELIX HPC cluster (funding code of the HPC cluster: INST 35/1597-1 FUGG).

Data content:
- Tokens and sentence boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID: unique identifier of the document
  - YEAR: year of capture (please use this information for data slices)
  - Url: full URL
  - Tld: top-level domain
  - Domain: domain without TLD (but with sub-domains if applicable)
  - DomainFull: complete domain (incl. TLD)
  - Datum (system information): date of the CorpusExplorer (date of capture by CommonCrawl, not the date of creation/modification of the document)
  - Hash (system information): SHA1 hash of the CommonCrawl
  - Pfad (system information): path of the cluster (raw data); supplied by the system

Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above), which ensure efficient archiving. You can use either CorpusExplorer or the ‘CEC6-Converter’ (available for Linux, MacOS and Windows, see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats: CATMA v6; CoNLL; CSV; CSV (metadata only); DTA TCF-XML; DWDS TEI-XML; HTML; IDS I5-XML; IDS KorAP XML; IMS Open Corpus Workbench; JSON; OPUS Corpus Collection XCES; Plaintext; SaltXML; SlashA XML; SketchEngine VERT; SPEEDy/CODEX (JSON); TLV-XML; TreeTagger; TXM; WebLicht; XML. Please note that an export increases the storage space requirement considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console, available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

Legal information

The data was downloaded on 01.11.2024. The use, processing and distribution is subject to §60d UrhG (German copyright law), which authorises the use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author ([email protected]) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified, e.g. file name, URL or domain. The author will endeavour to identify the content, remove it and re-upload the (modified) data within two weeks (new version). If you have any further questions, please contact CLARIN.

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2020 – VERSION 1)

Author: Rüdiger Jan Oliver
Publication venue: Rüdiger, Jan Oliver
Publication date: 14/11/2024

The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo size: 57 billion tokens as of DeReKo Release 2024-I). The corpus is separated by year (here: 2020) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/), a non-profit organisation that provides copies of the visible internet free of charge for research purposes. The CommonCrawl WET raw data was processed in three steps. First, it was filtered by TLD (top-level domain): only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. Second, the language of each document (URL) was estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat) using its CORE14 profile; only documents/URLs for which German was the most likely language were processed further (e.g. to exclude, as far as possible, foreign-language material such as individual subpage sections). Third, the data was filtered by manual selectors and for 1:1 duplicates (within one year). The filtering and subsequent processing were carried out with CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own supplementary scripts; TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation.
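The TLD filter and the 1:1 duplicate filter described above can be illustrated with a minimal Python sketch. This is not the author's CorpusExplorer pipeline: the function names, the (url, text) input shape and the use of a SHA-256 digest to detect identical texts are assumptions made for illustration only. The language identification step and the manual selectors are left out.

```python
import hashlib
from urllib.parse import urlsplit

# TLD whitelist copied from the corpus description.
GERMAN_TLDS = {
    "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
    "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
}


def has_german_tld(url: str) -> bool:
    """Return True if the URL's host name ends in one of the listed TLDs."""
    # hostname is lower-cased by urlsplit; it is None for URLs without a scheme.
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1] in GERMAN_TLDS


def drop_exact_duplicates(documents):
    """Yield (url, text) pairs, skipping texts identical to an earlier one.

    Assumption: a "1:1 duplicate" is a document whose full text matches a
    previously seen document; the digest only keeps the 'seen' set small.
    """
    seen = set()
    for url, text in documents:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield url, text


def filter_year(documents):
    """Apply the TLD filter, then exact deduplication, to one year's documents."""
    kept = ((url, text) for url, text in documents if has_german_tld(url))
    return list(drop_exact_duplicates(kept))
```

Calling filter_year once per crawl year mirrors the "within one year" scope of the duplicate filter: identical texts captured in different years are both kept.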
The corpus was processed on the HELIX HPC cluster. The author thanks the state of Baden-Württemberg and the German Research Foundation (DFG) for the opportunity to use the bwHPC/HELIX HPC cluster (funding code of the HPC cluster: INST 35/1597-1 FUGG).

Data content:
- Tokens and sentence boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID: unique identifier of the document
  - YEAR: year of capture (please use this field for data slices)
  - Url: full URL
  - Tld: top-level domain
  - Domain: domain without TLD (but with sub-domains, if applicable)
  - DomainFull: complete domain (incl. TLD)
  - Datum (system information): date recorded by CorpusExplorer (the day of capture by CommonCrawl, not the day the document was created or modified)
  - Hash (system information): SHA1 hash of the CommonCrawl
  - Pfad (system information): path on the cluster (raw data); supplied by the system

Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above), which ensure efficient archiving. You can use either CorpusExplorer or the ‘CEC6-Converter’ (available for Linux, macOS and Windows; see https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:
- CATMA v6
- CoNLL
- CSV
- CSV (metadata only)
- DTA TCF-XML
- DWDS TEI-XML
- HTML
- IDS I5-XML
- IDS KorAP XML
- IMS Open Corpus Workbench
- JSON
- OPUS Corpus Collection XCES
- Plaintext
- SaltXML
- SlashA XML
- SketchEngine VERT
- SPEEDy/CODEX (JSON)
- TLV-XML
- TreeTagger
- TXM
- WebLicht
- XML

Please note that an export increases the storage space requirement considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console; available for Linux, macOS and Windows) also offers a simple way to process and analyse the data.
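To make the relationship between the Url, Tld, Domain and DomainFull metadata fields concrete, the following minimal Python sketch derives the three domain fields from a URL according to the definitions given above. The function name is hypothetical, and whether the corpus stores the TLD with or without a leading dot is not stated in the description; the sketch assumes it is stored without one.

```python
from urllib.parse import urlsplit


def domain_fields(url: str) -> dict:
    """Split a URL into the domain-related metadata fields.

    Tld        - top-level domain (assumed here without a leading dot)
    Domain     - host name without the TLD, sub-domains included
    DomainFull - complete host name including the TLD
    """
    host = urlsplit(url).hostname or ""
    # rpartition splits at the last dot: "www.example.de" -> ("www.example", ".", "de")
    domain, _, tld = host.rpartition(".")
    return {"Url": url, "Tld": tld, "Domain": domain, "DomainFull": host}
```

For example, domain_fields("https://www.example.de/pfad?x=1") gives Tld "de", Domain "www.example" and DomainFull "www.example.de".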
If you have any questions, please contact the author.

Legal information
The data was downloaded on 01.11.2024. Use, processing and distribution are subject to §60d UrhG (German copyright law), which permits use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been spot-checked to the best of our knowledge and belief. Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights), please e-mail the author ([email protected]) with the following information: 1) why the content is undesirable (a brief outline is sufficient) and 2) how the content can be identified, e.g. file name, URL or domain. The author will endeavour to identify and remove the content and to re-upload the modified data as a new version within two weeks. If you have any further questions, please contact CLARIN.

AlbNews Albanian Topic Modeling

Author: Çano Erion
Publication venue: University of Vienna
Publication date: 07/02/2024

AlbNews is a topic modeling corpus of news headlines in Albanian, consisting of 600 labeled samples and 2600 unlabeled samples. Each labeled sample consists of a headline text retrieved from Albanian online news portals and one of four labels: 'pol' for politics, 'cul' for culture, 'eco' for economy, and 'spo' for sport. Each unlabeled sample contains a headline text only. The corpus is released under the CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/). If using the data, please cite the following paper: Çano Erion, Lamaj Dario. AlbNews: A Corpus of Headlines for Topic Modeling in Albanian. CoRR, abs/2402.04028, 2024. URL: https://arxiv.org/abs/2402.04028

GrandStaff-LMX: Linearized MusicXML Encoding of the GrandStaff Dataset

Author: Mayer Jiří; Straka Milan; Hajič jr. Jan; Pecina Pavel
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 12/02/2024

The GrandStaff-LMX dataset is based on the GrandStaff dataset described in the paper "End-to-end optical music recognition for pianoform sheet music" by Antonio Ríos-Vila et al., 2023, https://doi.org/10.1007/s10032-023-00432-z. The GrandStaff-LMX dataset contains MusicXML and Linearized MusicXML encodings of all systems from the original dataset, suitable for evaluation with the TEDn metric. It also contains the official GrandStaff train/dev/test split.
