Common Language Resources and Technology Infrastructure - Slovenia

840 research outputs found

Monitor corpus of Slovene Trendi 2025-03

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 10/04/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 56 publishers. Trendi 2025-03 covers the period from January 2019 to March 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from March 2025 and improves the list of publishers by correcting the source-to-publisher conversion for previous months (esp. 2023-02).

Dataset for primary stress identification in Croatian and related languages and dialects

Author: Ljubešić Nikola
Rupnik Peter
Porupski Ivan
Robida Nejc
Potočnjak Mirna
Publication venue: Jožef Stefan Institute
Publication date: 30/05/2025

The dataset contains recordings and offset annotations of a sample of the Croatian parliamentary recordings from the corpus ParlaSpeech-HR. It contains training and testing data for primary stress identification from the speech signal on the level of a single word. Additional test datasets are available in three languages / dialects: Slovenian, Chakavian dialect of Croatian, and Serbian.

The data is split into four sections based on their provenance:
* ParlaStress-HR.jsonl - Croatian train and test datasets, sampled from the ParlaSpeech-HR 2.0 (http://hdl.handle.net/11356/1914)
* ParlaStress-SR.jsonl - Serbian test dataset, sampled from the ParlaSpeech-RS (http://hdl.handle.net/11356/1834)
* MićiPrinc-CKM.jsonl - Chakavian test dataset, sampled from the Mići Princ dataset (http://hdl.handle.net/11356/1765)
* Artur-SL.jsonl - Slovenian test dataset, sampled from the Artur dataset (http://hdl.handle.net/11356/1776)

All JSONL files have the following attributes:
* id: string
* audio_wav: string, path to the audio file
* audio_start, audio_end: float, seconds of the start and end times in the original audio file, useful for calculating sample duration, as well as reference to original audio
* multisyllabic_words: a list of dictionaries, each entry corresponding to one multisyllabic word with stress information, with keys:
  - word: string, word in question
  - time_s: float, start of word in seconds from the start of the recording
  - time_e: float, end of word in seconds from the start of the recording
  - syllable_count: int, number of syllables in the word
  - stress: a list with a single dictionary (for consistency with unstressed) describing the stressed vowel, with keys:
    - vowel: string, character of the word that is stressed
    - time_s: float, vowel start in seconds from the start of the word
    - time_e: float, vowel end in seconds from the start of the word
    - char_idx: int, index of stressed character in the word
  - unstress: same as stress, but for unstressed vowels
* graphalign_intervals: a list of dictionaries describing time alignment of individual graphemes / phonemes, with keys:
  - label: string, character that is being aligned
  - time_s: float, character start in seconds from the start of the word
  - time_e: float, character end in seconds from the start of the word

In addition, ParlaStress-HR.jsonl also has the attribute "split_speaker" that assigns individual instances into "train" or "test" splits. These splits ensure that different speakers are found in the training and the testing section.
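
As an illustration of the attribute layout described above, the following minimal Python sketch walks over JSONL records and lists each stressed vowel with its duration. The function name and the sample record are invented for illustration; only the key names (id, multisyllabic_words, word, stress, vowel, char_idx, time_s, time_e) are taken from the description, and real files may contain details that this sketch does not handle.

```python
import json


def iter_stressed_words(jsonl_lines):
    """Yield (record id, word, stressed vowel, char_idx, vowel duration in seconds)."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        record = json.loads(line)
        for word in record["multisyllabic_words"]:
            # "stress" is described as a list holding a single dictionary.
            for stress in word["stress"]:
                duration = stress["time_e"] - stress["time_s"]
                yield (record["id"], word["word"], stress["vowel"],
                       stress["char_idx"], duration)


# Invented sample record that mimics the documented keys (not real corpus data).
sample_line = json.dumps({
    "id": "demo-1",
    "multisyllabic_words": [{
        "word": "voda",
        "time_s": 1.20,
        "time_e": 1.55,
        "syllable_count": 2,
        "stress": [{"vowel": "o", "time_s": 0.10, "time_e": 0.18, "char_idx": 1}],
        "unstress": [{"vowel": "a", "time_s": 0.27, "time_e": 0.35, "char_idx": 3}],
    }],
})

rows = list(iter_stressed_words([sample_line]))
for row in rows:
    print(row)
```

To process an actual file, the same function could be given an open file handle, for example `open("ParlaStress-HR.jsonl", encoding="utf-8")`, since iterating over a handle yields one line at a time.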

Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0

Author: Martinc Matej
Publication venue: Jožef Stefan Institute
Publication date: 18/09/2025

This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main .json files, which together provide a rich and diverse set of examples for training and fine-tuning models to understand and process both visual and textual information in Slovenian.

1. llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json
This file contains a machine-translated version of the popular Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed using the proprietary Gemini 1.5 Pro model.

2. wiki_14_march_2024_latest.json
This file consists of conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was utilized for the data curation process, transforming the articles into an instruction-tuning format.

3. rtv.json
This file consists of conversational examples generated on the basis of images from the news portal https://www.rtvslo.si. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

4. siol.json
This file consists of conversational examples generated on the basis of images from the news portal https://siol.net. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

5. 24ur.json
This file consists of conversational examples generated on the basis of images from the news portal https://www.24ur.com. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

The combined dataset includes a total of 1,128,228 examples, categorized as follows:
* 21,838 textvqa examples: Instructions for vision question answering based on specific Optical Character Recognition (OCR) tokens.
* 349,369 coco examples: A mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset. These include tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions.
* 81,309 vg examples: Instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates.
* 66,227 gqa examples: Instructions requiring a one-word or one-phrase response to a question about the corresponding image.
* 78,976 ocr_vqa examples: Instructions focused on performing OCR to extract text from an image.
* 139,433 wiki examples: Instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025.
* 100,000 rtv examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025.
* 100,000 siol examples: Instruction-tuning examples generated on the basis of images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025.
* 100,000 24ur examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025.

Accessing the Corresponding Images

News portal Images
The images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the appropriate news portal. Each example in the json file contains an 'image' key with a URL of the corresponding image.

Wiki Images
The images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe

Llava_v1_5_mix665k Images
To facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the necessary Python script get_llava_images.py and its dependency overwatch.py.
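
For the news portal examples, the image URLs first have to be gathered from the 'image' key before anything can be downloaded. The sketch below shows one way this might be done in Python. It assumes that each .json file holds a top-level list of example objects; the description only states that each example has an 'image' key, so the list layout, the function name, and the sample data are assumptions made for illustration.

```python
import json


def collect_image_urls(examples):
    """Return the unique image URLs found under the 'image' key, in first-seen order."""
    seen = set()
    urls = []
    for example in examples:
        url = example.get("image")
        # Keep only string values that look like web URLs and were not seen before.
        if isinstance(url, str) and url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Invented stand-in for the parsed content of a file such as rtv.json.
sample_examples = json.loads(
    '[{"image": "https://example.org/a.jpg"},'
    ' {"image": "https://example.org/a.jpg"},'
    ' {"image": "https://example.org/b.jpg"},'
    ' {"id": "entry-without-image"}]'
)

image_urls = collect_image_urls(sample_examples)
print(image_urls)
```

Deduplicating before download avoids fetching the same picture several times when more than one example refers to it.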

Syntactic Tree Inventories from Slovenian UD Corpora (v2.15)

Author: Dobrovoljc Kaja
Publication venue: Jožef Stefan Institute
Publication date: 23/05/2025

This dataset contains lists of delexicalized dependency trees and subtrees extracted from the Slovenian UD corpora SSJ (written) and SST (spoken), version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic set of syntactic structures in Slovenian, useful for data-based investigations of syntactic patterns in Slovenian and their variation across the two modalities. Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN). Structures were extracted from three versions of each corpus: (1) The full version (2) A version excluding punctuation (i.e., branches labeled as punct) (3) A version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse) The extracted structures are provided in tabular TSV format. Each row contains: * The delexicalized tree/subtree (e.g., ADJ <amod NOUN) * Its absolute and relative frequency in the target corpus (e.g., spoken SST) * An example (e.g., samostojna <amod država) * Frequency in the corresponding reference corpus (e.g., written SSJ) * Keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF) The STARK configuration file used in the extraction process is included

The "Mobile languages" corpus MoJezik 1.0 (audio)

Author: Bitenc Maja
Publication venue: Faculty of Arts, University of Ljubljana
Publication date: 23/07/2025

The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno dialect, Rovte group) and Ribnica (Dolenjska dialect, Dolenjska group), who study or work in the Slovenian capital, Ljubljana, and thus navigate daily between dialectal and standard language use. Interview topics include narratives of personal (linguistic) history, reflections on past and present language practices, attitudes towards their own dialects and other Slovene varieties, experiences of dialect perception in the Ljubljana context and of standard-like speech in local environments, linguistic identity, stereotypes and prejudices, intergenerational language use (especially with children), and language behaviour in educational settings. The corpus includes: – Idrija group: 4 speakers (2 women, 2 men; 2 adults, 2 secondary-school students), recorded between 2009 and 2013; total interview length: 5 hours, 37 minutes, 9 seconds. – Ribnica group: 6 speakers (2 primary informants and 4 close contacts, including family members, friends, and colleagues), recorded between 2020 and 2022; total interview length: 4 hours, 37 minutes, 15 seconds. The interviews were conducted within the framework of broader sociolinguistic research, which also encompassed informants’ self-recordings of spontaneous speech in diverse everyday situations and a quantitative variationist analysis of five phonological variables (dialect-specific) across various communicative contexts. The interview data enable comparisons between speakers’ metalinguistic commentary and their actual language use as documented in the recordings. The findings of the Cerkno and Ribnica studies are comprehensively presented in two scientific publications: * Bitenc, Maja (2016): Z jezikom na poti med Idrijskim in Ljubljano [With Language on the Move Between Idrija and Ljubljana]. Ljubljana: Znanstvena založba Filozofske fakultete. 
https://www.ff.uni-lj.si/publikacije/z-jezikom-na-poti-med-idrijskim-ljubljano * Bitenc, Maja (2025): Govor v gibanju med Ribnico in Ljubljano [Speech in Motion Between Ribnica and Ljubljana]. Ljubljana: Znanstvena založba Filozofske fakultete. https://doi.org/10.4312/9789612976316 This entry contains only audio recordings, and only for speakers who have consented to the publication of their recordings. The transcriptions are available in a separate entry: The "Mobile Languages" corpus MoJezik 1.0 (transcription), http://hdl.handle.net/11356/2037

Monitor corpus of Slovene Trendi 2025-04

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 06/05/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 56 publishers. Trendi 2025-04 covers the period from January 2019 to April 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from April 2025.

Comparable corpus of parliamentary debates ParlaMint-ES-CN 1.0

Author: Cano González Pedro
Carreras Riudavets Francisco Javier
Hernández Marrero Mar
Hernández Figueroa Zenón José
Publication venue: Instituto Universitario de Análisis y Aplicaciones Textuales
Publication date: 20/11/2025

The ParlaMint-ES-CN corpus is the contribution of the Parliament of the Canary Islands (Parlamento de Canarias) to the ParlaMint collection of comparable parliamentary corpora (https://www.clarin.eu/parlamint). It contains transcriptions of parliamentary debates produced between 1991 and 2021, covering thirty years of legislative activity in the autonomous community of the Canary Islands. The corpus is encoded following the official ParlaMint TEI/XML guidelines and schema specifications. The transcriptions are organised by day and include detailed metadata on each parliamentary sitting, such as the legislative term, session and meeting. Speeches are marked with information about the speaker and their role in the debate. The corpus also includes transcriber-supplied comments, such as interruptions, lapses, or procedural notes. The dataset provides extensive metadata on speakers, including their full names and institutional roles within the Parliament of the Canary Islands. Additional metadata on parties, parliamentary groups and political affiliations are included where available. As in other ParlaMint corpora, the texts are divided into subcorpora according to the ParlaMint scheme. The corpus is distributed in two variants: - The canonical TEI/XML version, containing the full transcriptions and metadata. - The linguistically annotated version (.ana), which includes tokenization, lemmatisation, Universal Dependencies part-of-speech tags, morphological features, syntactic dependencies, and named entities. The ParlaMint-ES-CN corpus offers a comprehensive diachronic resource for the study of political discourse in the Canary Islands, enabling comparative research with other European and regional parliaments within the ParlaMint framework

Frequency lists of syntactic structures from the Šolar 3.0 corpus

Author: Munda Tina
Arhar Holdt Špela
Dobrovoljc Kaja
Rozman Tadeja
Stritar Kučuk Mojca
Krek Simon
Krapš Vodopivec Irena
Stabej Marko
Pori Eva
Goli Teja
Lavrič Polona
Laskowski Cyprian
Kocjančič Polonca
Klemenc Bojan
Krsnik Luka
Kosem Iztok
Publication venue: Faculty of Arts, University of Ljubljana
Publication date: 30/01/2025

The frequency lists of syntactic structures from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), specifically from the original, uncorrected student texts ("solar-orig.conllu"), were extracted with the STARK v3 tool (http://hdl.handle.net/11356/1958). The extracted data is available at two levels: at the phrase level (see folder "besednozvezne") and at the sentence level (see folder "medstavcne").

At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts: noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek), numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), abbreviation (okrajšava) (no results were returned for interjection (medmet) and residual (neuvrščeno)). These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former.

At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are: parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD: clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisniki), and adnominal clause modifier (prilastkov odvisnik). These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies).

The dataset can be used for syntactic analyses of school writing in Slovene in (Slovene) schools, also in combination with comparable data (http://hdl.handle.net/11356/2010) from the Slovene textbook corpus Učbeniki 1.0, which presents the expected or desired scope of reception.

For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files:
- "solar-orig_*_default.tsv" - the original output, containing extracted unique syntactic structures of varying lengths, ranging from 2 to 10 tokens, arranged by frequency, followed by additional data on syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "solar-orig_*_all-examples.tsv" - the original output, containing all matched structures found in the input corpus (i.e. all occurrences of the extracted structures in every sentence).
- "solar-orig_*_default_tree-description.tsv" - an extension of the "solar-orig_*_default.tsv" file that includes a verbal description of syntactic structures (trees).
- "solar-orig_*_all-examples_metadata_tree-description.tsv" - an extension of the "solar-orig_*_all-examples.tsv" file that includes school text metadata and a verbal description of syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.)

The data was prepared in the following manner: First, the corpus was linguistically annotated with the CLASSLA v2.1 pipeline (https://github.com/clarinsi/classla/) at the levels of UD part-of-speech and syntactic relations annotations to enable the extraction of sentence-level structures. Furthermore, the original corpus containing MULTEXT-East tags (MSD tags) was preprocessed to reduce the tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features). This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, yet still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.) Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema. The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data. Another step was to enhance all output files with verbal descriptions of the extracted structures and to enrich all "solar-orig_*_all-examples.tsv" files with school text metadata by assigning metadata from "solar-meta.tsv" (see "Solar.CoNLL-U.zip" in http://hdl.handle.net/11356/1589) to each structure based on matching text IDs (both with Python). Lastly, the extended versions of the two original output files ("solar-orig_*_default_tree-description.tsv", "solar-orig_*_all-examples_metadata_tree-description.tsv") were converted into Excel spreadsheets.

The package also includes a configuration file for each level: "config_solar_besednozvezne.ini" for phrase-level structures, and "config_solar_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK. For more details, see "00README.txt".
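
The tag reduction mentioned in the preparation steps (e.g., Somei → S) can be pictured with the short Python sketch below. It is not the script used to build the dataset. It assumes the MSD tag sits in the XPOS column (the fifth column) of a CoNLL-U token line, and the function names and the sample line are invented for illustration. How the reduced and the full tag were actually stored side by side in the preprocessed corpus is not stated in the description, so the sketch simply returns both.

```python
def reduce_msd(tag: str) -> str:
    """Return the first letter of an MSD tag, which denotes the part of speech."""
    return tag[:1]


def pos_and_msd(conllu_token_line: str):
    """Return (reduced tag, full MSD tag) for one tab-separated CoNLL-U token line.

    Assumes the MSD tag is in the XPOS column (index 4).
    """
    columns = conllu_token_line.rstrip("\n").split("\t")
    full_tag = columns[4]
    return reduce_msd(full_tag), full_tag


# Invented CoNLL-U token line with a Slovene MSD tag in the XPOS column.
sample_token = "1\tmiza\tmiza\tNOUN\tSomei\t_\t0\troot\t_\t_"

print(reduce_msd("Somei"))
print(pos_and_msd(sample_token))
```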

Frequency list of collocations from the Učbeniki 1.0 corpus

Author: Munda Tina
Arhar Holdt Špela
Kosem Iztok
Pori Eva
Krek Simon
Publication venue: Faculty of Arts, University of Ljubljana
Publication date: 31/01/2025

The frequency list of collocations from the Slovene textbook corpus Učbeniki 1.0 was extracted with the CORDEX library (https://github.com/clarinsi/cordex/). The extraction is based on 82 predefined syntactic structures (cf. Krek et al., 2021) using the MULTEXT-East morphosyntactic (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and JOS-SYN dependency parsing (https://wiki.cjvt.si/books/06-jos-syn-syntax) annotations, where the latter serves as a syntactic complement to the former. The formal description of syntactic structures is included in the CORDEX library (see "structures_JOS.xml"). There are 2 output files: - "ucbeniki1.0_kolokacije.csv" contains the original output of collocations with absolute frequency 1 and above, corresponding to 82 predefined syntactic structures. The list is sorted by absolute frequency of collocations (Joint_representative_form) and includes frequency and POS information for each lemma of the collocation. The file also provides additional statistical measures (Delta_p12, Delta_p21, LogDice_core, LogDice_all) and shows the number of distinct forms in which the lemmas appear in the corpus for each collocation. - "ucbeniki1.0_kolokacije_collocation_sentence_mapper.csv" complements the file above by showing all occurrences of the extracted collocations in the corpus. Each row lists a collocation ID (matching the first file), identifies the sentence in which the collocation appears, and provides the exact tokens that form the collocation. The dataset can be used for analyses, especially in combination with comparable data (http://hdl.handle.net/11356/2011) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589) to identify core student vocabulary. The data was prepared in the following manner: In the preprocessing phase, all individual Slovene school textbooks were merged into a single CoNLL-U file. 
Because the library then in use did not support Slovene MULTEXT-East morphosyntactic tags (MSD tags), these tags were converted into their English equivalents. Next, collocation data were extracted using the CORDEX library. Any collocations containing punctuation were excluded from the output. The lookup lexicon (https://www.clarin.si/repository/xmlui/handle/11356/1854) was used to improve collocation representations (applicable only when using the JOS system). In the postprocessing phase, the MSD tags in the output were translated back into Slovene MSD tags. For more details, see "00README.txt". --- KREK, Simon, GANTAR, Polona, KOSEM, Iztok, DOBROVOLJC, Kaja. Opis modela za pridobivanje in strukturiranje kolokacijskih podatkov iz korpusa. V: ARHAR HOLDT, Špela (ur.). Nova slovnica sodobne standardne slovenščine : viri in metode. 1. izd. Ljubljana: Znanstvena založba Filozofske fakultete, 2021. Str. 160-194, ilustr. Zbirka Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/325/477/732

The CLASSLA-Stanza model for morphosyntactic annotation of spoken Slovenian 2.2

Author: Terčon Luka
Dobrovoljc Kaja
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 07/02/2025

This model for morphosyntactic annotation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.76.
