Common Language Resources and Technology Infrastructure - Slovenia


840 research outputs found


The CLASSLA-Stanza model for lemmatisation of spoken Slovenian 2.2

Author: Terčon Luka
Dobrovoljc Kaja
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 07/02/2025

This model for lemmatisation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla). It was trained on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959), using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.23.

Spoken corpora of parliamentary debates ParlaSpeech 3.0

Author: Ljubešić Nikola
Rupnik Peter
Porupski Ivan
Kuzman Pungeršek Taja
Koržinek Danijel
Kopp Matyáš
Publication venue: Jožef Stefan Institute
Publication date: 12/06/2025

The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of the Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859), and the parliamentary recordings available from the parliaments' YouTube channels. An instance is a transcript sentence with the corresponding metadata and the aligned audio. This version of the ParlaSpeech corpora does not release the audio files, as it covers the same data as the preceding versions, i.e. version 2.0 for HR (http://hdl.handle.net/11356/1914) and version 1.0 for RS (http://hdl.handle.net/11356/1834), PL (http://hdl.handle.net/11356/1686), and CZ (http://hdl.handle.net/11356/1785).

The main extension in this version consists of five enrichment layers:
* ParlaSpeech-Pause: automatic annotations of filled pauses ("eerm")
* ParlaSpeech-Align: precise word- and grapheme-level alignment (HR, RS only)
* ParlaSpeech-Stress: labelled primary stress in multisyllabic words (HR, RS only)
* ParlaSpeech-Ling: Universal Dependencies (UD) formatted linguistic annotations (lemma, part-of-speech, syntax, etc.)
* ParlaSpeech-Senti: sentiment estimation based on the transcript

Data size per parliament is the following:
* Croatia (HR): 923k sentences, 3k hours, 324k filled pauses, 11M word stresses
* Serbia (RS): 291k sentences, 900 hours, 74k filled pauses, 2M word stresses
* Czechia (CZ): 718k sentences, 1.2k hours, 200k filled pauses, no word stresses
* Poland (PL): 535k sentences, 1k hours, 200k filled pauses, no word stresses

The data are available in the following formats:
* JSONL: master format, containing all the data. Distributed as newline-delimited JSON, where each line is a valid JSON serialization. Mostly intended for computerized processing.
* VERT: vertical format intended for concordancers, with text, links to audio, linguistic annotations, sentiment, filled pauses, and primary word stress (where available).
* TextGrid (HR and RS only): word- and grapheme-level alignment, primary word stress, and filled pauses. This format is intended for use with the Praat software (https://www.fon.hum.uva.nl/praat/) for research and applications in phonetics and other speech-focused disciplines.

For a detailed dataset schema description and examples, please see our dedicated website: https://clarinsi.github.io/parlaspeech/
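
A newline-delimited JSON file of the kind described for the JSONL master format can be read one record per line. The following is only a minimal sketch using the Python standard library: the helper `iter_jsonl` and the keys `id`, `text` and `filled_pauses` are invented for illustration and are not the actual ParlaSpeech schema, which is documented on the website above.

```python
import io
import json


def iter_jsonl(handle):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for a real file; the keys are
# placeholders and NOT the real ParlaSpeech field names.
sample = io.StringIO(
    '{"id": "s1", "text": "Hvala lijepa.", "filled_pauses": 0}\n'
    '{"id": "s2", "text": "Dakle, eee, nastavljamo.", "filled_pauses": 1}\n'
)

records = list(iter_jsonl(sample))
total_pauses = sum(record["filled_pauses"] for record in records)
```

With a real file, the same generator would be applied to `open(path, encoding="utf-8")` instead of the in-memory sample, which keeps memory use flat even for corpora with several hundred thousand sentences.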

Ontology of topics for Slovenian as a second and foreign language ONTEM 1.0

Author: Pori Eva
Knez Mihaela
Klemen Matej
Jerman Tanja
Publication venue: Centre for Slovene as a Second and Foreign Language, University of Ljubljana
Publication date: 15/11/2025

ONTEM 1.0 comprises 1,019 manually prepared entries. Each entry gives the lemma, the part-of-speech (following the MULTEXT-East tagset for Slovenian, https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), the CEFR level (based on the Core vocabulary for Slovenian as L2, organized by levels A1, A2, and B1; http://hdl.handle.net/11356/1697), and a confirmation of the CEFR level (based on expert validation). The metadata additionally cover the semantic categorization, with detailed descriptions of each semantic category (metatopic, topic, and subtopic), and the source of the word. The words are classified into up to three levels of hierarchically organised semantic categories: 12 top-level categories, i.e. metatopics, and 23 topics, the latter further divided into 29 subtopics. All categories are described in more detail in the provided README file.

The words in ONTEM 1.0 were sourced from the KUUS corpus (http://hdl.handle.net/11356/1696), which comprises 17 textbooks for Slovenian as a Second and Foreign Language and contains 520,796 words. From this corpus, 1,019 semantically and thematically diverse words were manually selected to represent different parts-of-speech and CEFR levels, with a primary focus on A1 and A2 textbook vocabulary, while also including higher-level words to build a robust hierarchically structured system with potential for future expansion.

The ontology will be integrated into the Dictionary for Speakers of Slovene as a Second and Foreign Language – SLOGOST (https://lexonomy.cjvt.si/slovar-za-govorce-slovenscine-kot-drugega-in-tujega-jezika/). The dataset is available in CSV format, accompanied by a README document that describes its contents in more detail.
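
Since the entries are classified into up to three hierarchical levels, a CSV of this kind can be folded into a nested metatopic, topic, subtopic structure. The sketch below is only illustrative: the column names, the category labels and the three rows are invented, and the real layout should be taken from the README that accompanies the dataset.

```python
import csv
import io


def build_hierarchy(rows):
    """Nest lemmas under metatopic -> topic -> subtopic.

    Empty topic or subtopic cells are grouped under "(none)", since
    entries may use fewer than three levels.
    """
    tree = {}
    for row in rows:
        topics = tree.setdefault(row["metatopic"], {})
        subtopics = topics.setdefault(row["topic"] or "(none)", {})
        subtopics.setdefault(row["subtopic"] or "(none)", []).append(row["lemma"])
    return tree


# Invented rows; column names and category labels are hypothetical.
sample_csv = io.StringIO(
    "lemma,pos,cefr,metatopic,topic,subtopic\n"
    "kruh,noun,A1,food,groceries,\n"
    "mleko,noun,A1,food,groceries,\n"
    "avtobus,noun,A1,travel,transport,public transport\n"
)

rows = list(csv.DictReader(sample_csv))
tree = build_hierarchy(rows)
```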

Monitor corpus of Slovene Trendi 2025-11

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 03/12/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 59 publishers. Trendi 2025-11 covers the period from January 2019 to November 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, that have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from November 2025.

Frequency list of collocations from the Šolar 3.0 corpus

Author: Munda Tina
Arhar Holdt Špela
Rozman Tadeja
Stritar Kučuk Mojca
Krek Simon
Krapš Vodopivec Irena
Stabej Marko
Pori Eva
Goli Teja
Lavrič Polona
Laskowski Cyprian
Kocjančič Polonca
Klemenc Bojan
Krsnik Luka
Kosem Iztok
Publication venue: Faculty of Arts, University of Ljubljana
Publication date: 31/01/2025

The frequency list of collocations from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), specifically from the original, uncorrected student texts ("solar-orig.conllu"), was extracted with the CORDEX library (https://github.com/clarinsi/cordex/). The extraction is based on 82 predefined syntactic structures (cf. Krek et al., 2021) using the MULTEXT-East morphosyntactic (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and JOS-SYN dependency parsing (https://wiki.cjvt.si/books/06-jos-syn-syntax) annotations, where the latter serves as a syntactic complement to the former. The formal description of syntactic structures is included in the CORDEX library (see "structures_JOS.xml").

There are 3 output files:
- "solar-orig3.0_kolokacije.csv" contains the original output of collocations with absolute frequency 1 and above, corresponding to 81 (out of 82) predefined syntactic structures. The list is sorted by absolute frequency of collocations (Joint_representative_form) and includes frequency and POS information for each lemma of the collocation. The file also provides additional statistical measures (Delta_p12, Delta_p21, LogDice_core, LogDice_all) and shows the number of distinct forms in which the lemmas appear in the corpus for each collocation.
- "solar-orig3.0_kolokacije_collocation_sentence_mapper.csv" complements the file above by showing all occurrences of the extracted collocations in the corpus. Each row lists a collocation ID (matching the first file), identifies the sentence in which the collocation appears, and provides the exact tokens that form the collocation.
- "solar-orig3.0_kolokacije_collocation_sentence_mapper_metadata.csv" is an extension of the "solar-orig3.0_kolokacije_collocation_sentence_mapper.csv" file that includes school-text metadata.

The dataset can be used for analyses of school writing in Slovene in (Slovene) schools, especially in combination with comparable data (http://hdl.handle.net/11356/2012) from the Slovene textbook corpus Učbeniki 1.0, which presents the expected or desired scope of reception, to identify core student vocabulary.

The data was prepared in the following manner: In the preprocessing phase, the MULTEXT-East morphosyntactic tags (MSD tags) in the CoNLL-U input corpus were converted from Slovene to their English equivalents because the library then in use did not support Slovene MSD tags. Next, collocation data were extracted using the CORDEX library. Any collocations containing punctuation were excluded from the output. The lookup lexicon (https://www.clarin.si/repository/xmlui/handle/11356/1854) was used to improve collocation representations (applicable only when using the JOS system). In the postprocessing phase, the MSD tags in the output were translated back into their original Slovene MSD tags. For more details, see "00README.txt".

KREK, Simon, GANTAR, Polona, KOSEM, Iztok, DOBROVOLJC, Kaja. Opis modela za pridobivanje in strukturiranje kolokacijskih podatkov iz korpusa. V: ARHAR HOLDT, Špela (ur.). Nova slovnica sodobne standardne slovenščine : viri in metode. 1. izd. Ljubljana: Znanstvena založba Filozofske fakultete, 2021. Str. 160-194, ilustr. Zbirka Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/325/477/732
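
The description names the association measures in the collocation file (Delta_p12, Delta_p21, LogDice_core, LogDice_all) without defining them. As a rough orientation only, the sketch below uses the commonly cited textbook definitions of logDice and Delta P. It is an assumption that CORDEX follows these definitions; the function names, the argument `n` (total number of candidate pairs) and the direction convention are hypothetical, and the difference between the "core" and "all" variants is not covered here.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy is the co-occurrence frequency, f_x and f_y the frequencies
    of the two lemmas. The theoretical maximum is 14.
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def delta_p(f_xy, f_x, f_y, n):
    """Delta P(y | x) = P(y | x) - P(y | not x).

    n is the total number of observations (an assumed denominator).
    Swapping f_x and f_y gives the opposite direction.
    """
    return f_xy / f_x - (f_y - f_xy) / (n - f_x)


# Toy counts, not taken from the Šolar data.
strength = log_dice(f_xy=10, f_x=10, f_y=10)
forward = delta_p(f_xy=10, f_x=20, f_y=30, n=120)
backward = delta_p(f_xy=10, f_x=30, f_y=20, n=120)
```

Delta P is directional, which is why two values per collocation are plausible: in the toy counts above the forward and backward values differ even though the co-occurrence frequency is the same.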

Monitor corpus of Slovene Trendi 2025-06

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 03/07/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-06 covers the period from January 2019 to June 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, that have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from June 2025, plus texts from two sources missing from May 2025.

Monitor corpus of Slovene Trendi 2025-07

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 13/08/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-07 covers the period from January 2019 to July 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, that have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from July 2025.

Slovenian Equity Evaluation Corpus EEC-SL 1.0

Author: Vintar Špela
Publication venue: Jožef Stefan Institute
Publication date: 18/09/2025

The EEC-SL dataset is a localised and adapted version of the Equity Evaluation Corpus (EEC, Kiritchenko and Mohammad, 2018, https://aclanthology.org/S18-2005/). It consists of 8,640 sentences which were automatically generated to evaluate social bias in sentiment analysis systems. The sentences are created from 22 templates, with each template containing a reference to a person variable, where the slot can be filled either by a name (female and male, Slovenian and non-Slovenian), or by a generic noun phrase (e.g., moja sestra [my sister], ta moški [this man], moj oče [my dad]). The second and third variables, present in 7 out of the 11 original English templates, are the emotional state word and the emotional situation word, which can be filled by words expressing four basic emotional states: Anger, Fear, Joy and Sadness. Template example: Zaradi te situacije se počuti .

The selection of names was conceptualised to represent the current social reality in Slovenia: the foreign names were carefully selected to match the demographic situation in the country and, at the same time, to be perceived as non-Slovenian. Hence, we selected 10 female and 10 male Slovenian names, 6 female and 6 male names from former Yugoslavia, 2 female and 2 male names from EU countries, and 2 female and 2 male names from non-EU countries. All the names were selected from the registry of names available at the Statistical Office of Slovenia. The emotional state and emotional situation words were selected to represent various intensities of the basic emotions. Their emotional valence was taken from SloEmoLex (http://hdl.handle.net/11356/1875).

The templates, names, generic forms and adjectives have been linguistically adapted to Slovenian, which is a highly inflected language with agreement in number, gender and case. Thus, instead of the original 11 templates in English, Slovenian uses 22 templates, as each English example was translated into a female and a male version, depending on the gender of the variable. Along similar lines, each variable can appear in different cases and numbers, which is reflected in the sentence templates. More details are given in the README file.

The dataset was originally designed to tease out bias in sentiment analysis systems, because it allows for testing the hypothesis that a system should equally rate the intensity of the emotion expressed by two sentences that differ only in the gender/nationality of the person mentioned (e.g., "Anja je jezna." vs. "Snježana je jezna.").
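
The template-based generation and the pairwise comparison described above can be sketched in a few lines. This is only an illustration, not the procedure used to build EEC-SL: the `{person}` / `{emotion}` placeholder syntax, the single template and the word "vesela" are assumptions, and real templates additionally need gender, number and case agreement, which this toy version sidesteps by using feminine forms only.

```python
from itertools import combinations, product

# One hypothetical feminine template; the two names come from the
# example pair quoted in the dataset description.
templates = ["{person} je {emotion}."]
persons = ["Anja", "Snježana"]
emotions = ["jezna", "vesela"]  # feminine adjective forms


def expand(templates, persons, emotions):
    """Fill every template with every person/emotion combination."""
    return [
        template.format(person=person, emotion=emotion)
        for template, person, emotion in product(templates, persons, emotions)
    ]


def name_pairs(template, emotion, persons):
    """Sentence pairs that differ only in the person mentioned."""
    return [
        (
            template.format(person=first, emotion=emotion),
            template.format(person=second, emotion=emotion),
        )
        for first, second in combinations(persons, 2)
    ]


sentences = expand(templates, persons, emotions)
pairs = name_pairs(templates[0], "jezna", persons)
```

A sentiment system under test would then score both members of each pair; under the hypothesis stated above, the two scores should be equal, so any systematic gap between them points to a bias tied to the name.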

Monitor corpus of Slovene Trendi 2025-09

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 03/10/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-09 covers the period from January 2019 to September 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, that have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from September 2025.

Monitor corpus of Slovene Trendi 2025-10

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 03/11/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-10 covers the period from January 2019 to October 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, that have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from October 2025.
