Common Language Resources and Technology Infrastructure - Slovenia

840 research outputs found

Comparable corpus of parliamentary debates ParlaMint-IL 1.0

Author: Goldin Gili
Howell Nick
Ordan Noam
Rabinovich Ella
Wintner Shuly
Publication venue: University of Haifa
Publication date: 07/06/2025

The ParlaMint-IL corpus is the Israeli contribution to the ParlaMint collection of comparable parliamentary corpora (https://www.clarin.eu/parlamint), which contain transcriptions of parliamentary debates of European countries and autonomous regions. The Knesset Corpus follows the ParlaMint encoding guidelines and is fully aligned with version 4.1 of the ParlaMint corpora (cf. http://hdl.handle.net/11356/1912 and http://hdl.handle.net/11356/1911). The corpus comprises transcriptions of all plenary and committee protocols of the Israeli parliament (the Knesset), spanning from 1994 to 2024. It includes more than 12 million speeches and over 400 million words, making it the largest corpus in the ParlaMint collection. All transcriptions are provided in Hebrew, the primary language of Knesset proceedings. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription. The corpus includes extensive metadata, most importantly on speakers (name, gender, year of birth, MP and minister status, party affiliation), and on their political parties and parliamentary groups (name, coalition/opposition status, and Wikipedia-sourced left-to-right political orientation). The transcriptions are also marked with the subcorpora they belong to, i.e. "reference" (until 2020-01-30), "covid" (from 2020-01-31), and "war" (from 2022-02-24). The corpus TEI/XML schemas are included in the distribution. The corpus is available in two variants, the "plain-text" version (ParlaMint-IL.tgz, corresponding to http://hdl.handle.net/11356/1912) and the linguistically annotated version (ParlaMint-IL.ana.tgz, corresponding to http://hdl.handle.net/11356/1911). 
The ParlaMint-IL.ana linguistic annotation includes tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities. Morphological and syntactic annotations were produced by a model based on Trankit (https://github.com/nlp-uoregon/trankit) and fine-tuned on Knesset data. Named Entity Recognition was performed using dicta-bert (https://huggingface.co/dicta-il/dictabert), a Hebrew NER model. The "plain-text" version (ParlaMint-IL.tgz) contains the canonical TEI/XML files; derived plain-text files; and derived TSV metadata files for the speeches. The linguistically annotated version (ParlaMint-IL.ana.tgz) contains the canonical TEI/XML files with linguistic annotations; derived CoNLL-U files along with TSV metadata of the speeches; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. The ParlaMint-IL corpus is based on data and annotations described in: Goldin, Gili; Wintner, Shuly; and Rabinovich, Ella. The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings. Language Resources and Evaluation (2025). https://doi.org/10.1007/s10579-025-09833-
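The derived CoNLL-U files mentioned above use the standard ten-column, tab-separated CoNLL-U layout. The following is only a minimal sketch of reading such a file with the Python standard library; the helper name read_conllu is hypothetical and the two-token sample is invented for illustration, not taken from ParlaMint-IL.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries.

    Hypothetical helper: comment lines (starting with '#') are skipped,
    blank lines end a sentence, and multiword-token or empty-node rows
    are kept as ordinary rows without special handling.
    """
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip malformed lines in this sketch
        sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:
        yield sentence


# Invented sample, NOT taken from the corpus.
SAMPLE = (
    "# sent_id = demo.1\n"
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\tNumber=Sing\t1\tvocative\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
```

For real files, the text would be read from one of the .conllu files in the ParlaMint-IL.ana.tgz archive; a dedicated CoNLL-U library may be preferable for anything beyond quick inspection.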

Slovenian Day of Resistance X & news corpus

Author: Koražija Jure
Horvat Marjan
Babnik Jan
Škvorc Tadej
Robnik-Šikonja Marko
Darovec Darko
Oman Žiga
Publication venue: Institute IRRIS for Research, Development and Strategies of Society, Culture and Environment
Publication date: 04/08/2025

The dataset contains social media posts from X and traditional media articles from online news sources related to the Slovenian commemorations of the Day of Resistance. We used two types of data. For the social media analysis, we collected X posts covering the period from April 2023 to April 2024. This dataset was gathered by Sciences Po under the SoMe4Dem project. The collection focused on commemorative discussions in Slovenian, comprising 753 posts. The dataset includes the full text of tweets, quotes, and retweets, as well as metadata such as language, timestamps, and external links. User IDs and screen names were removed to anonymize the data. The X dataset was compiled based on the following query terms: "Dan upora proti okupatorju" and "Dan upora". These keywords were selected to capture discussions related to the Day of Resistance and its broader commemorative context in Slovenia. The collection included special-character normalization, ensuring the retrieval of all relevant posts. To analyze traditional media, we collected relevant news articles using Media Cloud (https://www.mediacloud.org/), an open-source platform developed by the Berkman Klein Center for Internet & Society at Harvard University, which compiles and organizes online news content to facilitate research on attention, representation, influence, and language in global media ecosystems. The Slovenian database was queried using the following 14 case-sensitive keywords: »dan upora«, »dnevu upora«, »dan OF«, »dneva OF«, »proti okupatorju«, »državna proslava«, »državne proslave«, »državni proslavi«, »dan spomina«, »dnevu spomina«, »osvobodilna fronta«, »osvobodilne fronte«, »protiimperialistična fronta« and »protiimperialistične fronte«. Additional news material was collected through links found in the X dataset and manually retrieved from three Slovenian weekly publications: Delo, Demokracija, and Mladina. 
We included all relevant news articles published on this topic for three consecutive years, from 2022 to 2024. After collecting traditional media news articles from Media Cloud and X links, 144 irrelevant or duplicated articles were identified, thus reducing the media part of our dataset from 308 to 164 articles.

Slovene learner corpus KOST 2.1

Author: Stritar Kučuk Mojca
Šter Helena
Pisek Staša
Petric Lasnik Ivana
Kete Matičič Jana
Pirih Svetina Nataša
Preglau Daniela
Arhar Holdt Špela
Krsnik Luka
Erjavec Tomaž
Pegan Jasmina
Huber Damjan
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 18/11/2025

The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 10,590 texts (almost 1.4 million words) written by adult speakers for whom Slovene is not their first language. This corpus offers insights into the Slovene language as produced by those who are still learning it as a second or foreign language, and in particular into the most common errors that occur in this process. KOST is therefore aimed at all those working with Slovene as a second or foreign language. The texts were mainly written at lectorates and in Slovene L2/FL courses. Most of the authors of these texts speak Serbian, Bosnian and Macedonian as their first language, but texts by speakers of other languages are also included. The authors are at different proficiency levels in Slovene, from beginners to advanced. For each contributor, information is available on gender, year of birth, country, first language and other languages they speak, employment status and education, and prior experience of learning Slovene. For each text, there is also information on the time and circumstances of creation (exam or homework), the programme in which it was produced, input type (digital or hand-written), language level and the grade. Part of the corpus also has texts available in a corrected version. The tokens of the original and corrected texts are linked (one group of links per paragraph) and the links are categorised into 23 error types. The corpus is available in two formats: (1) TEI encoding of the complete corpus (texts, links), including contributor and text metadata in the TEI header, and (2) the corpus in the original and corrected variants as vertical and corresponding registry files, suitable for mounting on CQP-type concordancers. Note that the vertical format does not retain the connection between the original and corrected tokens.

Slovene instruction-following dataset for large language models GaMS-Instruct-MED 2.0

Author: Tovornik Robert
Pavlović Anđela
Radnić Vuk
Plesnik Emil
Fabjan Borut
Publication venue: Faculty of Computer and Information Science, University of Ljubljana
Publication date: 25/08/2025

GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of units of prompts, instructions and responses from the field of medicine, particularly those pertaining to the use of pharmaceutical drugs and medications. The dataset was generated in several steps (for a more detailed description, please refer to 00README.txt). After consulting with experts from the medical field, a series of prompts was manually compiled containing questions of interest in the context of drug and medication use. For each medication in the PoVeJMo-VeMo-Med 1.0 dataset (http://hdl.handle.net/11356/1983), approximately 10-15 questions were automatically generated using prompt tuning. In version 2.0, the dataset was extended with several other similar datasets for English that were translated into Slovene: MedQuAD, MeQSum, Medication QA, and LiveQA (references are available in 00README.txt). All translations were made automatically using GPT-4.1. The manual validation was carried out in two phases. In the preparation-evaluation phase, the quality of machine translations was validated on a sample using different machine translation applications (DeepL, OpenAI) to determine the solution with optimal performance. In the second phase, a random sample of 20-40 examples from each translated subset was manually validated (a total of 240 examples). The manual validations were made by two experts from the field of medicine and an expert in dataset compilation. Unlike version 1.0, where the dataset consisted of prompt-response pairs, version 2.0 contains units consisting of three elements (instruction-input-output). The conversion was made using OpenAI GPT-4.1. All final instructions were manually validated by an expert in dataset compilation. 
Two experts from the field of medicine participated in the design of clinically relevant categories of instructions, the compilation of examples of prompt-response pairs, and the manual validation of test results of the conversion process. Please note that the current version of the dataset (containing 25,046 instruction-input-output units) does not guarantee full clinical accuracy and may contain errors as a consequence of LLM hallucinations.

The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.2

Author: Terčon Luka
Dobrovoljc Kaja
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 07/02/2025

This model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~90.42. Unlike the previous version, this model was trained on the improved SUK 1.1 version of the training corpus.
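LAS (labeled attachment score) is conventionally the percentage of tokens for which both the predicted head and the predicted dependency label match the gold standard. The sketch below illustrates that definition on invented data; it is not the evaluation script behind the figure reported for this model, and the function name is hypothetical.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token


def labeled_attachment_score(gold: List[Arc], predicted: List[Arc]) -> float:
    """Return the percentage of tokens with correct head AND label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)


# Invented four-token example: the third token has the right head
# but the wrong label, so it does not count as correct.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "punct")]

las = labeled_attachment_score(gold, pred)
```

Here three of the four tokens match on both head and label, so the score for this toy example is 75.0.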

Multilingual comparable corpora of parliamentary debates ParlaMint 5.0

Author: Erjavec Tomaž
Kopp Matyáš
Kuzman Pungeršek Taja
Ljubešić Nikola
Ogrodniczuk Maciej
Osenova Petya
Agirrezabal Manex
Agnoloni Tommaso
Aires José
Albini Monica
Alkorta Jon
Antiba-Cartazo Iván
Arrieta Ekain
Barcala Mario
Bardanca Daniel
Barkarson Starkaður
Bartolini Roberto
Battistoni Roberto
Bel Nuria
Bonet Ramos Maria del Mar
Calzada Pérez María
Cardoso Aida
Çöltekin Çağrı
Coole Matthew
Darģis Roberts
de Libano Ruben
Depoorter Griet
Diwersy Sascha
Dodé Réka
Fernandez Kike
Fernández Rei Elisa
Frontini Francesca
Garcia Marcos
García Díaz Noelia
García Louzao Pedro
Gavriilidou Maria
Gkoumas Dimitris
Grigorov Ilko
Grigorova Vladislava
Haltrup Hansen Dorte
Iruskieta Mikel
Jarlbrink Johan
Jelencsik-Mátyus Kinga
Jongejan Bart
Kahusk Neeme
Kirnbauer Martin
Kryvenko Anna
Ligeti-Nagy Noémi
Luxardo Giancarlo
Magariños Carmen
Magnusson Måns
Marchetti Carlo
Marx Maarten
Meden Katja
Mendes Amália
Mochtak Michal
Mölder Martin
Montemagni Simonetta
Navarretta Costanza
Nitoń Bartłomiej
Norén Fredrik Mohammadi
Nwadukwe Amanda
Ojsteršek Mihael
Pančur Andrej
Papavassiliou Vassilis
Pereira Rui
Pérez Lago María
Piperidis Stelios
Pirker Hannes
Pisani Marilina
Pol Henk van der
Prokopidis Prokopis
Quochi Valeria
Rayson Paul
Regueira Xosé Luís
Rii Andriana
Rudolf Michał
Ruisi Manuela
Rupnik Peter
Schopper Daniel
Simov Kiril
Sinikallio Laura
Skubic Jure
Tungland Lars Magne
Tuominen Jouni
van Heusden Ruben
Varga Zsófia
Vázquez Abuín Marta
Venturi Giulia
Vidal Miguéns Adrián
Vider Kadri
Vivel Couso Ainhoa
Vladu Adina Ioana
Wissik Tanja
Yrjänäinen Väinö
Zevallos Rodolfo
Fišer Darja
Publication venue: CLARIN ERIC
Publication date: 08/07/2025

ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker) as well as by their automatically assigned CAP (Comparative Agendas Project) top level topic. The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for the release 5.0 at https://github.com/clarin-eric/ParlaMint/tree/v5.0/Build/Metadata. The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution). This entry contains the ParlaMint TEI-encoded corpora and their derived plain text versions along with TSV metadata of the speeches. 
Also included is the 5.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint. Note that there also exists the linguistically marked-up version of the 5.0 ParlaMint corpus (http://hdl.handle.net/11356/2005) as well as a version machine translated to English (http://hdl.handle.net/11356/2006). Both are linked with CLARIN.SI concordancers for on-line analysis. Compared with the previous version 4.1, this version adds information on the topic of each speech for all corpora, changes the IDs of the categories in corpus-specific taxonomies to prevent ID clashes, and corrects some other minor errors.
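The subcorpus labels described for ParlaMint are defined purely by date, so they can be approximated from a sitting date alone. The sketch below is an illustration only: the function name is hypothetical, and because the description gives only start dates for "covid" and "war", it assumes both are open-ended, so later dates receive both labels. The labels actually recorded in the distributed TEI and TSV metadata should be treated as authoritative.

```python
from datetime import date
from typing import List

# Boundary dates as stated in the corpus description.
REFERENCE_END = date(2020, 1, 30)
COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpora_for(day: date) -> List[str]:
    """Return the subcorpus labels whose stated date range covers `day`.

    Assumption: "covid" and "war" have no end date, so they can overlap.
    """
    labels: List[str] = []
    if day <= REFERENCE_END:
        labels.append("reference")
    if day >= COVID_START:
        labels.append("covid")
    if day >= WAR_START:
        labels.append("war")
    return labels


before = subcorpora_for(date(2019, 5, 1))
boundary = subcorpora_for(date(2020, 1, 31))
later = subcorpora_for(date(2022, 2, 24))
```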

Monitor corpus of Slovene Trendi 2025-05

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 05/06/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-05 covers the period from January 2019 to May 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from May 2025.

Corpus of Slovenian historical legal texts SI-IUS 1.0

Author: Škrubej Katja
Jemec Tomazin Mateja
Pančur Andrej
Erjavec Tomaž
Publication venue: Faculty of Law, University of Ljubljana
Publication date: 01/05/2025

The SI-IUS collection of older law texts is meant to be used both as a digital library and as a language corpus. For the former, each text has been carefully annotated in TEI preserving e.g. different types of divisions and other structural encoding, page breaks, highlighted text, etc. For the latter, the structure has been simplified and the texts annotated on the levels of Universal Dependencies morphosyntax and lemmas with the CLASSLA annotation pipeline (https://github.com/clarinsi/classla). The collection consists of seven texts:
- CPZ1906: Zbirka avstrijskih zakonov v slovenskem jeziku. IV. ZVEZEK. Civilnopravdni zakoni (1906). Društvo Pravnik.
- ODZ1928: Občni državljanski zakonik z dne 1. junija 1811 (1928). Tiskovna zadruga v Ljubljani.
- SlP1917: Slovenski Pravnik (1917). Društvo Pravnik.
- SlP1920: Slovenski Pravnik (1920). Društvo Pravnik.
- ZKP1890: Zbirka avstrijskih zakonov v slovenskem jeziku. II. zvezek. Kazenskopravdni red (1890). Društvo Pravnik.
- UsV1910: Zbirka avstrijskih zakonov v slovenskem jeziku VII. zvezek. Državni osnovni zakoni (1910). Društvo Pravnik.
- ZKP1929: Zakonik o sodnem kazenskem postopanju za kraljevino Srbov, Hrvatov in Slovencev. Zakon o izvrševanju kazni na prostosti. (1929). Tiskovna zadruga v Ljubljani.

Monitor corpus of Slovene Trendi 2025-08

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 03/09/2025

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 58 publishers. Trendi 2025-08 covers the period from January 2019 to August 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from August 2025.

French and Slovene offensive language metaphor and metonymy annotated dataset FRENK-MRW 1.0

Author: Pahor de Maiti Tekavčič Kristina
Publication venue: CY Cergy Paris University
Publication date: 09/05/2025

The FRENK-MRW dataset contains French and Slovene socially unacceptable Facebook comments that are manually annotated for metaphor and metonymy based on the observed incongruity between the basic and contextual meaning. The comments were posted between 2015 and 2017 under Facebook posts produced by major news media outlets on the topics of LGBTQIA+/homophobia and migration/islamophobia. This entry includes the dataset divided into four files in CSV format, two with French comments (metadata: meta_fr, metaphor/metonymy annotations: mrw_fr) and two with Slovene comments (metadata: meta_sl, metaphor/metonymy annotations: mrw_sl). Also attached are annotation guidelines and a README file explaining the file structure, both formatted as TXT files. The dataset uses a selection of Slovene socially unacceptable comments from FRENK 1.1 (http://hdl.handle.net/11356/1462) and French socially unacceptable comments from FRENK-fr 1.0 (http://hdl.handle.net/11356/1947). French data from FRENK-fr 1.0 was linguistically annotated with the FreeLing tagger (https://aclanthology.org/L12-1224/), while Slovene data from FRENK 1.1 was processed using the CLASSLA tagger (http://hdl.handle.net/11356/1337). Manual annotations were performed in a WebAnno deployment (webanno.github.io/webanno) hosted at CLARIN.SI. FRENK-MRW represents a set of comments, 2,000 in total, that is based on a selection of news items (POST_CONTENT (NEWS) column) which were chosen according to two criteria: (1) for ease of annotation and interpretation, the entire thread of comments needed to be included (excluding acceptable comments from the annotation), and (2) the total amount of available comments linked to these news posts had to reach 2,000 comments equally distributed between the two languages (French, Slovene) and the two topics (migrants, LGBT). The French part of the dataset includes posts from Le Figaro and 20 minutes, with LGBT-related news coming only from the latter. 
In the Slovene part, the posts on both topics (migrants and LGBT) come from Nova24TV, Siol.net and 24ur. There are 2,000 comments in the dataset with 84,738 tokens. Not all comments contain metaphors. In the French part, 541 comments contain at least one metaphorically used token, while in the Slovene part of the dataset this number amounts to 571 comments. In total, there are 1,192 metaphorically used tokens in the French part of the dataset, and 1,270 in the Slovene part.

5 full texts, 840 metadata records. Updated in the last 30 days.
