Common Language Resources and Technology Infrastructure - Slovenia

840 research outputs found

Dataset of Authentic and Synthetic Slovene Language Errors DASSLE 1.0

Author: Arhar Holdt Špela
Antloga Špela
Gantar Polona
Munda Tina
Robida Nejc
Zgonc Matjaž
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 30/09/2025

DASSLE 1.0 (Dataset of Authentic and Synthetic Slovene Language Errors) comprises 7,385 manually prepared entries, each consisting of a Slovene sentence containing a single, specific language problem, its corrected version, and metadata including both coarse- and fine-grained correction classifications, as well as the source of the example. Language problems are divided into five top-level categories: spelling, orthography, morphology, vocabulary, and syntax. These are further specified using 128 fine-grained error types, aligned with the typology developed for the Šolar 3.0 corpus. The typology is explained at https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/introduction-to-tags and in more detail in the annotation guidelines at https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/annotation-guidelines. The examples in DASSLE 1.0 were sourced from four distinct origins, combining both authentic and synthetic data creation. From Šolar 3.0, the corpus of student writing with teacher-provided corrections, sentences were manually reviewed and edited to contain only one clearly defined language problem. In Gigafida 2.0, the reference corpus of standard written Slovene, examples were either manually corrected or deliberately corrupted to introduce typical deviations from the current norm. Synthetic examples were generated using GPT-4o, which was prompted with authentic sentence pairs; outputs were manually reviewed to select only those most closely resembling natural language use. A small number of examples were collected from Jezikovna svetovalnica, based on real language queries submitted by speakers. The dataset is primarily intended for the development and evaluation of natural language processing tools for automatic error detection and correction for Slovene. It is available in TSV format, accompanied by a README document that describes its contents in more detail
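
As a rough illustration of working with a TSV release of this kind, the sketch below counts entries per top-level category. The column names (`original`, `corrected`, `category`, `error_type`, `source`) and the sample rows are placeholders invented for this example, not the actual DASSLE 1.0 header or data; the README that accompanies the dataset should be consulted for the real layout.

```python
import csv
import io
from collections import Counter

# Placeholder data: column names and rows are assumptions for illustration,
# not the real DASSLE 1.0 header or content.
SAMPLE_TSV = "\n".join([
    "\t".join(["original", "corrected", "category", "error_type", "source"]),
    "\t".join(["<sentence 1>", "<correction 1>", "spelling", "<type A>", "Šolar 3.0"]),
    "\t".join(["<sentence 2>", "<correction 2>", "syntax", "<type B>", "Gigafida 2.0"]),
    "\t".join(["<sentence 3>", "<correction 3>", "spelling", "<type C>", "GPT-4o"]),
    "\t".join(["<sentence 4>", "<correction 4>", "vocabulary", "<type D>", "Jezikovna svetovalnica"]),
])


def count_by_column(tsv_text, column="category"):
    """Count rows per value of one column in tab-separated text."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(row[column] for row in reader)


counts = count_by_column(SAMPLE_TSV)
for category, n in counts.most_common():
    print(f"{category}\t{n}")
```

For a file on disk, the same function can be fed the text read with `open(path, encoding="utf-8", newline="")`.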

Monitor corpus of Slovene Trendi 2024-12

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 08/01/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-12 covers the period from January 2019 to December 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from December 2024.

Monitor corpus of Slovene Trendi 2025-01

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 03/02/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 77 publishers. Trendi 2025-01 covers the period from January 2019 to January 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from January 2025.

The CLASSLA-Stanza model for named entity recognition of standard Slovenian 2.2

Author: Terčon Luka
Dobrovoljc Kaja
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 07/02/2025

This model for named entity recognition of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl 2.0 word embeddings (http://hdl.handle.net/11356/1791). Compared with the previous version, this model was trained on the SUK training corpus and uses new embeddings.

The CLASSLA-Stanza model for UD dependency parsing of spoken Slovenian 2.2

Author: Terčon Luka
Dobrovoljc Kaja
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 07/02/2025

This model for UD dependency parsing of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~81.91.
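
For readers unfamiliar with the metric, LAS (labeled attachment score) is the percentage of tokens for which the parser predicts both the correct head and the correct dependency relation. The sketch below is a simplified illustration that assumes gold and predicted analyses share identical tokenization; it is not the evaluation script used for this model, and the toy annotations are invented for the example.

```python
def labeled_attachment_score(gold, predicted):
    """Return LAS as a percentage.

    Each argument is a list of (head_index, deprel) pairs, one per token,
    and both lists are assumed to be aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute LAS on an empty sequence")
    # A token counts as correct only if head AND relation label both match.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)


# Toy four-token sentence (invented for illustration).
gold_tokens = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred_tokens = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "nmod")]  # last label is wrong

las_value = labeled_attachment_score(gold_tokens, pred_tokens)
print(f"LAS = {las_value:.2f}")
```

Dropping the label comparison (matching on the head index only) would give the unlabeled attachment score instead.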

Monitor corpus of Slovene Trendi 2025-02

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 04/03/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 77 publishers. Trendi 2025-02 covers the period from January 2019 to February 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from February 2025.

Parallel sense-annotated corpus ELEXIS-WSD 1.2

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.2 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2.
List of sense inventories:
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene
The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt.
Updates in version 1.2:
- Several tokenization errors with multiword tokens were fixed in all subcorpora (e.g. the order of subtokens was incorrect in many cases; the issue has now been resolved).
- XPOS, FEATS, HEAD, and DEPREL columns were added automatically with UDPipe (except for elexis-wsd-sl and elexis-wsd-et; for Slovene, all columns were manually validated; for Estonian, HEAD and DEPREL were manually validated; all other languages contain automatic tags in these columns – for more information on the models used and their performance, see 00README.txt).
- The entry now includes lists of potential errors in automatically assigned XPOS and FEATS values. In previous versions, only UPOS tags were manually annotated, while the XPOS and FEATS columns were left empty. XPOS and FEATS have now been added automatically through UDPipe. The list of potential errors contains the list of lines in the corpus in which the XPOS and FEATS columns are potentially incorrect because the manually validated UPOS tag differs from the automatically assigned UPOS tag, which indicates that the automatically assigned XPOS and FEATS columns are probably incorrect. This is meant as a reference for future validation efforts.
- For Slovene, named entity annotations were added based on the annotations from the SUK 1.1 Training Corpus of Slovene (http://hdl.handle.net/11356/1959).
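
The ten-column layout described in this entry can be read with a few lines of code. The following is a minimal sketch that splits one CoNLL-U token line into named fields and breaks the MISC column into key-value pairs. `SpaceAfter=No` is the standard CoNLL-U attribute mentioned above; the key `SenseID` and its value are hypothetical stand-ins, since the actual attribute names used in ELEXIS-WSD are documented in 00README.txt rather than here.

```python
CONLLU_FIELDS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def parse_misc(misc):
    """Split a MISC value such as 'A=1|B=2' into a dict; '_' means empty."""
    if misc == "_":
        return {}
    attributes = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        attributes[key] = value if sep else ""
    return attributes


def parse_token_line(line):
    """Parse one tab-separated CoNLL-U token line into a dict of fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    token = dict(zip(CONLLU_FIELDS, columns))
    token["MISC"] = parse_misc(token["MISC"])
    return token


# Invented example line; 'SenseID' is a hypothetical key, not the corpus's own.
example_line = "3\tbank\tbank\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No|SenseID=example-sense-001"
token = parse_token_line(example_line)
print(token["FORM"], token["UPOS"], token["MISC"])
```

Comment lines (starting with `#`) and blank sentence separators would need to be skipped before calling `parse_token_line` on a real file.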

Slovene Conformer CTC BPE E2E Automated Speech Recognition model PROTOVERB-ASR-E2E 1.0

Author: Lebar Bajec Iztok
Bajec Marko
Publication venue: Faculty of Computer and Information Science, University of Ljubljana
Publication date: 17/04/2025

This Conformer CTC BPE E2E Automated Speech Recognition model was trained following the NVIDIA NeMo Conformer-CTC fine-tuning recipe (for details see the official NVIDIA NeMo ASR documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html, and the NVIDIA NeMo GitHub repository, https://github.com/NVIDIA/NeMo). It provides functionality for transcribing Slovene speech to text. The starting point was the Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0, which was fine-tuned on the Protoverb closed dataset. The model was fine-tuned for 20 epochs, which improved performance by 9.8% relative WER on the Protoverb test dataset and by 3.3% relative WER on the Slobench dataset.
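
To make the "relative WER" figures concrete, the sketch below computes word error rate as word-level edit distance divided by reference length, and a relative improvement as the reduction in WER divided by the baseline WER. This reading of "relative WER" is an assumption based on common usage, and the numbers are invented for illustration; they are not the absolute WER values of this model, which the entry does not report.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance over reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref_words)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative improvement in percent, measured against the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Invented values, for illustration only.
sample_wer = word_error_rate("a b c d", "a x c d")
sample_gain = relative_wer_reduction(baseline_wer=20.0, new_wer=18.0)
print(f"WER = {sample_wer:.1f}%, relative reduction = {sample_gain:.1f}%")
```

Note that a relative reduction is not the same as a drop in percentage points: going from 20% to 18% WER is a 2-point absolute drop but a 10% relative reduction.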

Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 5.0

Author: Kuzman Pungeršek Taja
Ljubešić Nikola
Erjavec Tomaž
Kopp Matyáš
Ogrodniczuk Maciej
Osenova Petya
Rayson Paul
Vidler John
Agerri Rodrigo
Agirrezabal Manex
Agnoloni Tommaso
Aires José
Albini Monica
Alkorta Jon
Antiba-Cartazo Iván
Arrieta Ekain
Barcala Mario
Bardanca Daniel
Barkarson Starkaður
Bartolini Roberto
Battistoni Roberto
Bel Nuria
Bonet Ramos Maria del Mar
Calzada Pérez María
Cardoso Aida
Çöltekin Çağrı
Coole Matthew
Darģis Roberts
de Does Jesse
de Libano Ruben
Depoorter Griet
Depuydt Katrien
Diwersy Sascha
Dodé Réka
Fernandez Kike
Fernández Rei Elisa
Frontini Francesca
Garcia Marcos
García Díaz Noelia
García Louzao Pedro
Gavriilidou Maria
Gkoumas Dimitris
Grigorov Ilko
Grigorova Vladislava
Haltrup Hansen Dorte
Iruskieta Mikel
Jarlbrink Johan
Jelencsik-Mátyus Kinga
Jongejan Bart
Kahusk Neeme
Kirnbauer Martin
Kryvenko Anna
Ligeti-Nagy Noémi
Luxardo Giancarlo
Magariños Carmen
Magnusson Måns
Marchetti Carlo
Marx Maarten
Meden Katja
Mendes Amália
Mochtak Michal
Mölder Martin
Montemagni Simonetta
Navarretta Costanza
Nitoń Bartłomiej
Norén Fredrik Mohammadi
Nwadukwe Amanda
Ojsteršek Mihael
Pančur Andrej
Papavassiliou Vassilis
Pereira Rui
Pérez Lago María
Piperidis Stelios
Pirker Hannes
Pisani Marilina
Pol Henk van der
Prokopidis Prokopis
Quochi Valeria
Regueira Xosé Luís
Rii Andriana
Rudolf Michał
Ruisi Manuela
Rupnik Peter
Schopper Daniel
Simov Kiril
Sinikallio Laura
Skubic Jure
Tamper Minna
Tungland Lars Magne
Tuominen Jouni
van Heusden Ruben
Varga Zsófia
Vázquez Abuín Marta
Venturi Giulia
Vidal Miguéns Adrián
Vider Kadri
Vivel Couso Ainhoa
Vladu Adina Ioana
Wissik Tanja
Yrjänäinen Väinö
Zevallos Rodolfo
Fišer Darja
Publication venue: CLARIN ERIC
Publication date: 08/07/2025

ParlaMint-en.ana 5.0 is the English machine translation of the ParlaMint.ana 5.0 (http://hdl.handle.net/11356/2005) set of corpora of parliamentary debates across Europe. The translation keeps the structure and metadata of the original corpora and is linguistically annotated similarly to the original language corpora (but without UD syntax), and with the addition of USAS semantic tags (https://ucrel.lancs.ac.uk/usas/). Because of the addition of semantic tags the UK corpus (ParlaMint-GB) is also included, even though it has, of course, not been machine translated. The translation to English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done on the sentence level over both speeches and transcriber notes, including headings. Note that corpus metadata is mostly available both in the source language and in English. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation was done with Stanza (https://stanfordnlp.github.io/stanza/) using the conll03 model (4 classes). The annotation of MWEs (phrases) and tokens with USAS tags was done with pyMusas (https://github.com/ucrel/pymusas). Note that the English in the corpora contains typical NMT errors, including factual errors even when high fluency is achieved, and any use of this corpus should take the machine translation limitations into account. The files associated with this entry include the machine translated and linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. The CoNLL-U files include pyMusas USAS tags. 
Also included is the 5.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the (open) issues at the GitHub repository of the project. Compared with the previous version 4.1, this version adds information on the topic of each speech and the sentence-level sentiment for all corpora, changes the IDs of the categories in corpus-specific taxonomies to prevent ID clashes, and corrects some other minor errors.

Collection of Slovenian riddles Uganke 1.0

Author: Babič Saša
Erjavec Tomaž
Farič Ana
Peče Miha
Publication venue: ZRC SAZU
Publication date: 25/04/2025

The Uganke corpus collects 2,790 Slovenian riddles from the folklore collection of the Institute of Slovenian Ethnology. The riddles come from 171 sources: fieldwork, newspapers, journals, manuscripts and printed riddle collections from the 19th and 20th centuries. The material is categorised into eight types, depending on the content, semantics, length and presumed context of the riddle: true riddle, narrative true riddle, joking question, wisdom question, joking wisdom question, logical riddle, neck riddle, sexual riddle. Each riddle is split into the question and answer part, and each is given in the diplomatic transcription, mirroring the riddle in the source document, and the critical transcription, which is brought closer to the contemporary Slovenian standard orthography. The critical transcriptions have been automatically annotated with lemmas, MULTEXT-East morphosyntactic descriptions (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html) and Universal dependencies (https://universaldependencies.org/) with the CLASSLA toolchain (https://github.com/clarinsi/classla). The canonical encoding of the corpus is TEI, but the corpus is also distributed in two derived encodings. One is the riddles and the bibliography as two TSV files, and the other the vertical file with the linguistically annotated riddles, as used by CQP-type concordancers, such as Sketch Engine
