Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

1998 research outputs found

Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

Author: Straka Milan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 21/11/2024

Tokenizer, POS tagger, lemmatizer and parser models for 147 treebanks of 78 languages of the Universal Dependencies 2.15 treebanks, created solely using UD 2.15 data (https://hdl.handle.net/11234/1-5787). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_215_models. To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2

SynSemClass 5.1

Author: Urešová Zdeňka
Alcaina Cristina Fernández
Bourgonje Peter
Fučíková Eva
Hajič Jan
Hajičová Eva
Rehm Georg
Rysová Kateřina
Zaczynska Karolina
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 11/12/2024

The SynSemClass synonym verb lexicon version 5.1 is a multilingual resource that enriches previous editions of this event-type ontology with a new language, Spanish. The existing languages, English, Czech and German, are further substantially extended by a larger number of classes. SSC 5.1 data also contain lists (in a separate removed_cms.zip file) of class members that were originally (pre-)proposed but later rejected. All languages are organized into classes and have links to other lexical sources. In addition to the existing links, links to Spanish sources have been added. The major change from v5.0 is that the links to the English Princeton WordNet and to the German GUP point to their new versions and to the new websites that host them. The English WordNet links now point to the Open English Wordnet, a fork of the Princeton WordNet developed under an open source methodology and released through the Open English Wordnet website (https://en-word.net/). German Universal PropBank (GUP) is now part of the Universal Propbanks and can be viewed at https://github.com/UniversalDependencies/UD_German-GSD.

The individual languages are thus now linked as follows: The Spanish entries are linked to ADESSE (http://adesse.uvigo.es/), Spanish SenSem (http://grial.edu.es/sensem/lexico?idioma=en), Spanish WordNet (https://adimen.ehu.eus/cgi-bin/wei/public/wei.consult.perl), AnCora (https://clic.ub.edu/corpus/en/ancoraverb_es), and Spanish FrameNet (http://sfn.spanishfn.org/SFNreports.php). The English entries are linked to EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/), VerbNet (https://uvi.colorado.edu/ and http://verbs.colorado.edu/verbnet/index.html), PropBank (http://propbank.github.io/), Ontonotes (http://clear.colorado.edu/compsem/index.php?page=lexicalresources&sub=ontonotes), and the Open English Wordnet (https://en-word.net/). The Czech entries are linked to PDT-Vallex (http://hdl.handle.net/11858/00-097C-0000-0023-4338-F), Vallex (http://hdl.handle.net/11234/1-3524), and CzEngVallex (http://hdl.handle.net/11234/1-1512). The German entries are linked to Woxikon (https://synonyme.woxikon.de), E-VALBU (https://grammis.ids-mannheim.de/verbvalenz), and GUP (https://github.com/UniversalDependencies/UD_German-GSD)

The Model latinpipe-evalatin24-240520 for LatinPipe 2024

Author: Straka Milan
Straková Jana
Gamba Federica
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 20/05/2024

The latinpipe-evalatin24-240520 is a PhilBerta-based model for LatinPipe 2024, performing tagging, lemmatization, and dependency parsing of Latin, based on the winning entry to the EvaLatin 2024 shared task. It is released under the CC BY-NC-SA 4.0 license

CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)

Author: Straka Milan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 06/09/2024

The `corpipe23-corefud1.2-240906` is an `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23. It is released under the CC BY-NC-SA 4.0 license. The model is language agnostic (no corpus id on input), so it can in theory be used to predict coreference in any `mT5` language. However, the model expects empty nodes to be already present on input, predicted by the zero-nodes baseline (https://www.kaggle.com/models/ufal-mff/crac2024_zero_nodes_baseline/). This model was presented in the CorPipe 24 paper as an alternative to the single-stage approach, in which the empty nodes are predicted jointly with coreference resolution (via http://hdl.handle.net/11234/1-5672); the single-stage approach is roughly twice as fast but of slightly worse quality

NameTag 3 Czech CNEC 2.0 Model

Author: Straková Jana
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 30/08/2024

This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained on the Czech Named Entity Corpus 2.0 (https://ufal.mff.cuni.cz/cnec/cnec2.0). NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#czech-cnec2

NameTag 3 Multilingual CoNLL Model

Author: Straková Jana
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 30/08/2024

This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained jointly on several NE corpora: English CoNLL-2003, German CoNLL-2003, Dutch CoNLL-2002, Spanish CoNLL-2002, Ukrainian Lang-uk, and Czech CNEC 2.0, all harmonized to flat NEs with 4 labels PER, ORG, LOC, and MISC. NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#multilingual-conll

ESIC 1.1 -- Europarl Simultaneous Interpreting Corpus (2024-02-05)

Author: Macháček Dominik
Žilinec Matúš
Bojar Ondřej
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 05/02/2024

ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations. The corpus contains the source English video and audio. The interpreters' voices are not published within the corpus, but there is a tool that downloads them from the website of the European Parliament, where they are publicly available. The transcripts are equipped with metadata (disfluencies, mixing voices and languages, read or spontaneous speech, etc.), punctuated, and provided with word-level timestamps. The speeches in the corpus come from the European Parliament plenary sessions from 2008 to 2011. Most of the speakers are MEPs, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous). ESIC has validation and evaluation parts. The current version is ESIC v1.1; it extends v1.0 with manual sentence alignment of the tri-parallel texts, and with bi-parallel sentence alignment of English original transcripts and German interpreting

English-Czech parallel song lyrics

Author: Štěpánková Barbora
Rosa Rudolf
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 20/03/2024

English–Czech parallel corpus of song lyrics, aligned section by section. The songs are sourced from musical films. The dataset is provided in JSON format with the following structure: { "language": { "song_id": { "section_id": [list of lines in the section] } } }
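
As a rough sketch of how the section-level alignment could be consumed, the following Python snippet pairs up sections that exist in both languages. The language keys `en` and `cs`, the song and section ids, and the lyric lines are illustrative assumptions and are not taken from the dataset; only the nesting (language, song, section, list of lines) follows the description above.

```python
import json

# Hypothetical sample that mirrors the described nesting.
# The keys "en"/"cs", the ids, and the lines are made up for illustration.
sample = json.loads("""
{
  "en": {"song1": {"verse1": ["Line one", "Line two"], "chorus": ["La la"]}},
  "cs": {"song1": {"verse1": ["Řádek jedna", "Řádek dva"], "chorus": ["La la"]}}
}
""")


def aligned_sections(data, src, tgt):
    """Yield (song_id, section_id, src_lines, tgt_lines) for sections found in both languages."""
    for song_id, sections in data[src].items():
        tgt_sections = data[tgt].get(song_id, {})
        for section_id, src_lines in sections.items():
            if section_id in tgt_sections:
                yield song_id, section_id, src_lines, tgt_sections[section_id]


pairs = list(aligned_sections(sample, "en", "cs"))
for song_id, section_id, en_lines, cs_lines in pairs:
    print(song_id, section_id, len(en_lines), len(cs_lines))
```

Sections missing on either side are skipped rather than raising an error, which may or may not be the desired behaviour for a given use case.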

Coreference in Universal Dependencies 1.2 (CorefUD 1.2)

Author: Popel Martin
Novák Michal
Žabokrtský Zdeněk
Zeman Daniel
Nedoluzhko Anna
Acar Kutay
Bamman David
Bourgonje Peter
Cinková Silvie
Eckhoff Hanne
Cebiroğlu Eryiğit Gülşen
Hajič Jan
Hardmeier Christian
Haug Dag
Jørgensen Tollef
Kåsen Andre
Krielke Pauline
Landragin Frédéric
Lapshinova-Koltunski Ekaterina
Mæhlum Petter
Martí M. Antònia
Mikulová Marie
Nøklestad Anders
Ogrodniczuk Maciej
Øvrelid Lilja
Pamay Arslan Tuğba
Recasens Marta
Solberg Per Erik
Stede Manfred
Straka Milan
Swanson Daniel
Toldova Svetlana
Vadász Noémi
Velldal Erik
Vincze Veronika
Zeldes Amir
Žitkus Voldemaras
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 28/03/2024

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.2 consists of 25 datasets for 16 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 21 datasets for 15 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource, too. Compared to the previous version 1.1, version 1.2 comprises new languages and corpora, namely Ancient_Greek-PROIEL, Ancient_Hebrew-PTNK, English-LitBank, and Old_Church_Slavonic-PROIEL. In addition, English-GUM and Turkish-ITCC have been updated to newer versions, conversion of zeros in Polish-PCC has been improved, and the conversion pipelines for multiple other datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file)
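
To illustrate what "attribute-value pairs located in the MISC column" means in practice, here is a minimal Python sketch that splits the tenth CoNLL-U column into a dictionary. The sample sentence and the attribute values in it are made up for illustration and are not copied from any CorefUD dataset; the helper names `parse_misc` and `misc_of_tokens` are likewise hypothetical.

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC field into a dict of attribute-value pairs."""
    if misc == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    pairs = {}
    for item in misc.split("|"):  # attributes are separated by "|"
        key, sep, value = item.partition("=")  # split on the first "=" only
        pairs[key] = value if sep else None  # keep bare flags without a value
    return pairs


def misc_of_tokens(conllu_text):
    """Yield (token_id, form, misc_dict) for every token line in CoNLL-U text."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # a CoNLL-U token line has exactly 10 columns
            continue
        yield cols[0], cols[1], parse_misc(cols[9])


# Illustrative input only; the MISC values here are assumptions.
sample = "\n".join([
    "# sent_id = demo-1",
    "\t".join(["1", "Mary", "Mary", "PROPN", "_", "_", "2", "nsubj", "_", "Entity=(e1)"]),
    "\t".join(["2", "smiled", "smile", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

tokens = list(misc_of_tokens(sample))
for token_id, form, misc in tokens:
    print(token_id, form, misc)
```

This sketch only separates the attributes; interpreting the values of the coreference and bridging attributes is left to the documentation shipped with each dataset.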

Multilingual static embeddings for Verbal Multiword Expressions trained on PARSEME raw corpora

Author: Estève Louis Clément
Savary Agata
Lavergne Thomas
Publication venue: Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique
Publication date: 07/06/2024

This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi, Italian, Polish, Brazilian Portuguese, Romanian, Swedish, Turkish, Chinese). They were trained with the Word2Vec algorithm, in its skip-gram version, on PARSEME raw corpora automatically annotated for morpho-syntax (http://hdl.handle.net/11234/1-3367). These corpora were annotated by Seen2Seen, a rule-based VMWE identifier, one of the leading tools of the PARSEME shared task version 1.2. VMWE tokens were merged into single tokens. The format of the vector space files is that of the original Word2Vec implementation by Mikolov et al. (2013), i.e. a binary format. For compression, bzip2 was used
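
For readers who want to inspect the vectors without a dedicated library, the following Python sketch reads a bzip2-compressed file in the original Word2Vec binary layout (a text header with vocabulary size and dimensionality, then each word followed by a space and its raw 32-bit floats). It assumes little-endian floats, as written by the reference implementation on common hardware. The function names, the file path argument, and the demo token `take_off` are hypothetical; in particular, the way the resource actually joins VMWE components into one token is not specified above.

```python
import bz2
import io
import struct


def read_word2vec_binary(stream):
    """Read {word: vector} from a binary stream in the original word2vec layout."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        # Each word is a byte string terminated by a single space.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that may follow the previous vector
                word.extend(ch)
        raw = stream.read(4 * dim)  # dim float32 values, assumed little-endian
        vectors[word.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors


def load_compressed(path):
    """Open a .bz2 file (hypothetical path) and read its vectors."""
    with bz2.open(path, "rb") as f:
        return read_word2vec_binary(f)


# In-memory demo with made-up entries, round-tripped through bzip2.
payload = (
    b"2 3\n"
    + b"cat " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"take_off " + struct.pack("<3f", 0.5, 0.25, 0.0) + b"\n"
)
demo = read_word2vec_binary(io.BytesIO(bz2.decompress(bz2.compress(payload))))
print(sorted(demo))
```

A pure-Python reader like this is slow for large vocabularies; it is meant only to make the file layout concrete.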

0 full texts, 1,998 metadata records. Updated in last 30 days.
