Common Language Resources and Technology Infrastructure - Slovenia
Corpus of conversational humor Krohot 1.0
The KROHOT corpus consists of 10 audio recordings of private, spontaneous conversations between two or three speakers, with a total duration of 232 minutes. Most recordings were made between May and September 2025. The conversations include recollections about past events that triggered spontaneous humorous reactions among participants (conversational humour).
Segments containing humour were manually annotated using a tagging scheme developed exclusively for this corpus. The scheme comprises five primary categories: vocabulary (lexical choice, including figurative use), relation (relationship between speakers), content (topical focus), attitude (speaker’s opinion toward the topic), and manner (purposefully humorous way of speaking). These categories are not mutually exclusive and can be combined.
The corpus allows for the analysis of linguistic and communicative phenomena, including markers of humour and strategies used to achieve humorous effects (teasing, mocking, irony, or metaphorical language) in informal private spoken conversations.
The corpus is available as WAV audio recordings, while the (aligned) transcriptions are provided in the formats of the EXMARaLDA (https://exmaralda.org/en/) and Transcriber (https://trans.sourceforge.net/) tools, as well as in plain text.
Word-sense disambiguation corpus SloDicWSD 1.0
SloDicWSD is a Slovene word-sense disambiguation (WSD) corpus generated from data contained in SSKJ (Slovar slovenskega knjižnega jezika, the largest dictionary of standard Slovene). The corpus is an automatically constructed WSD dataset based on the sense inventory from the SSKJ dictionary and consists of SSKJ dictionary use-case examples converted to complete sentences using GPT-3.5 Turbo (https://platform.openai.com/docs/models#gpt-3-5-turbo).
We limited the corpus to the top 758 lemmas present in the Slovene part of the Elexis-WSD dataset (http://hdl.handle.net/11356/1842). For each lemma, we extracted every usage example from the SSKJ dictionary and labeled it with the matching sense. As these usage examples are likely too short to be useful for the WSD task, we extended them using GPT-3.5. We automatically filtered out sentences that contained either of two errors:
1. The original dictionary lemma was not present in the full sentence. While we prompted GPT-3.5 to generate complete sentences by extending existing examples, GPT-3.5 sometimes omitted the original lemma.
2. The generated sentence was identical to one of the already generated sentences. The sentences generated by GPT-3.5 are not guaranteed to be unique; therefore, we discarded duplicates.
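The two filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the Generated record, its field names, and the idea that each sentence arrives with a precomputed set of lemmas (from any lemmatizer) are assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple


class Generated(NamedTuple):
    """One generated sentence (hypothetical record layout)."""
    lemma: str                  # headword of the dictionary entry
    sense_id: str               # sense label carried over from the usage example
    sentence: str               # sentence produced by the language model
    sentence_lemmas: frozenset  # lemmas found in the sentence (assumed precomputed)


def filter_generated(items: Iterable[Generated]) -> List[Generated]:
    """Drop sentences that lack the target lemma or repeat an earlier sentence."""
    seen = set()
    kept = []
    for item in items:
        if item.lemma not in item.sentence_lemmas:
            continue  # rule 1: the original lemma was omitted
        if item.sentence in seen:
            continue  # rule 2: identical to an already generated sentence
        seen.add(item.sentence)
        kept.append(item)
    return kept
```

Duplicates are compared across the whole input here; whether the original filtering compared sentences globally or per lemma is not stated.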
Dataset of annotated headword-synonym-distractor triplets SYNDIST
The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer/alternative to the synonym that can be similar to the synonym in meaning and/or form. Headwords and their synonyms were obtained from the Thesaurus of Modern Slovene (http://hdl.handle.net/11356/1916), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). The criteria for selecting the headwords (nouns, adjectives, verbs, and adverbs) were that they had to be frequent and had to have several synonyms, preferably more than five.
The distractors were obtained with the Gemini-2.0-flash (https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) model, using the following prompt:
"You are given headword and a synonym. Create a distractor — a word that looks similar to the synonym but has a different meaning.
The distractor must be the same part of speech as the synonym (e.g., if the synonyms are verbs in their base form, the distractor must also be a verb in its base form).
The distractor must not include sensitive vocabulary (e.g., words related to minorities, religion, sexual content, violence, etc.).
The distractor must be a frequent word in the Slovene language.
The distractor must look similar to the synonym but have a different meaning.
Write the distractor in the same line as the headword and synonym, following this format: živahen - vesel - resen. These are the headword and synonym: {word} - {synonym}
The distractor cannot be one of these words: {synonym_set}."
The manual evaluation of all the distractors (with the exception of the distractors that were identified as existing synonyms in the Thesaurus) was conducted by two lexicographers. Each of them evaluated their own part, and the second one subsequently also inspected the evaluations of the first one. An estimated 30-35% of the data was evaluated by both lexicographers. Five decisions were used: good distractor, bad distractor, problematic (i.e. difficult to decide due to certain characteristics, such as being too similar to the synonym or being too archaic or informal), same as synonym, and synonym candidate (likely a legitimate (new) synonym of the headword).
The dataset also includes information on the frequency of the synonyms and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). The frequency information is provided for single-word lemmas only (and not for multiword items or non-lemma single-word forms such as plural forms of nouns or comparatives of adjectives). In addition, information on the similarity between the headwords and synonyms, and between the synonyms and distractors, is provided. Similarity is calculated using Gestalt pattern matching.
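For readers who want to reproduce a comparable similarity score, the following is a minimal sketch using Python's difflib, whose SequenceMatcher is documented as being based on Ratcliff/Obershelp "gestalt pattern matching". Which implementation was actually used for the dataset is not stated, so the function name and the choice of difflib are assumptions.

```python
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1]: 2 * matched characters / total characters."""
    # The first argument (None) disables the optional "junk" filter.
    return SequenceMatcher(None, a, b).ratio()


# Example with the synonym and distractor from the prompt's format line:
# "vesel" and "resen" share the block "ese" (3 characters), so 2 * 3 / 10.
print(gestalt_similarity("vesel", "resen"))  # 0.6
```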
Service for querying dependency treebanks Drevesnik 1.2
Drevesnik (https://orodja.cjvt.si/drevesnik/) is an online service for querying Slovenian corpora parsed with the Universal Dependencies annotation scheme. It features an easy-to-use query language on the one hand and user-friendly graph visualizations on the other.
It is based on the open-source dep_search tool (https://github.com/TurkuNLP/dep_search), which was localized and modified so as to also support querying by JOS morphosyntactic tags, random distribution of results, and filtering by sentence length.
The source code and the documentation for the search backend and the web user interface are publicly available in the CLARIN.SI GitHub repository at https://github.com/clarinsi/drevesnik. Compared to the previous version (1.1), release 1.2 introduces a new front-end design and some improved interface features.
Pilot corpus of student academic texts KOŠ 1.0
The Pilot corpus of student academic texts KOŠ 1.0 consists of authentic texts written by undergraduate students (approx. age 19–23 years) as part of their coursework at two faculties of the University of Ljubljana. The information on the study programme, field of study, year of study, academic year of submission, number of authors (single or multiple) and type of text is provided for each text. The corpus predominantly contains article reviews, essays, answers to questions, and seminar papers. It also includes reports, summaries and presentations of articles and lectures, lesson plans, and other academic materials.
Linguistic annotations were applied using the CLASSLA pipeline (https://github.com/clarinsi/classla/) across various levels, including tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD-tags, JOS-SYN dependency syntax, Universal Dependencies, and named entities (more about specific annotation layers: https://wiki.cjvt.si/shelves/linguistic-annotation-of-slovene-corpora). For better accessibility and wider usability, we provide versions with JOS-SYN as well as Universal Dependencies, and English as well as Slovene tags.
Frequency lists of syntactic structures from the Učbeniki 1.0 corpus
The frequency lists of syntactic structures from the Slovene textbook corpus Učbeniki 1.0 were extracted with the STARK v3 tool (http://hdl.handle.net/11356/1958).
The extracted data is available at two levels: at the phrase level (see folder "besednozvezne") and at the sentence level (see folder "medstavcne").
At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts:
noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek),
numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), abbreviation (okrajšava) (no results were returned for interjection (medmet) and residual (neuvrščeno)).
These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former.
At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are:
parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD:
clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisniki), and adnominal clause modifier (prilastkov odvisnik).
These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies).
The dataset can be used for syntactic analyses in combination with comparable data (http://hdl.handle.net/11356/2009) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), with the present data representing the expected or desired scope of reception.
For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files:
- "ucbeniki_*_default.tsv" - the original output, containing extracted unique syntactic structures of varying lengths, ranging from 2 to 10 tokens, arranged by frequency, followed by additional data on syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "ucbeniki_*_all-examples.tsv" - the original output, containing all matched structures found in the input corpus (i.e. all occurrences of the extracted structures in every sentence).
- "ucbeniki_*_default_tree-description.tsv" - an extension of the "ucbeniki_*_default.tsv" file that includes a verbal description of syntactic structures (trees).
- "ucbeniki_*_all-examples_tree-description.tsv" - an extension of the "ucbeniki_*_all-examples.tsv" file that includes a verbal description of syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.)
The data was prepared in the following manner:
The individual files of Slovene school textbooks were merged into a single CONLLU file. The corpus was already linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of the MULTEXT-East v6 morphosyntax, JOS-SYN dependency syntax, and UD part-of-speech and syntactic relations annotations.
Furthermore, the original corpus was preprocessed to reduce the MSD tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features). This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, yet still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.)
Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema.
The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data.
Another step was to enhance all output files with verbal descriptions of the extracted structures.
Lastly, the extended versions of the two original output files ("ucbeniki_*_default_tree-description.tsv", "ucbeniki_*_all-examples_tree-description.tsv") were converted into Excel spreadsheets.
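The MSD preprocessing step described above (e.g., Somei → S) can be illustrated with a minimal sketch. It assumes that the MSD tag is stored in the XPOS column (the fifth column) of the CoNLL-U file; the function name is hypothetical, and the sketch does not cover how the full MSD tags were kept for display in the extracted structures.

```python
def reduce_msd_to_pos(conllu_text: str, column: int = 4) -> str:
    """Shorten the tag in the given column (0-based) to its first letter."""
    out = []
    for line in conllu_text.splitlines():
        # Blank lines separate sentences; lines starting with "#" are comments.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        # Only touch well-formed token lines that carry a real tag.
        if len(cols) == 10 and cols[column] != "_":
            cols[column] = cols[column][0]
        out.append("\t".join(cols))
    return "\n".join(out)
```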
The package also includes a configuration file for each level: "config_ucbeniki_besednozvezne.ini" for phrase-level structures, and "config_ucbeniki_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK.
For more details, see "00README.txt".
Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 5.0
ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words.
The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker) as well as by their automatically assigned CAP (Comparative Agendas Project) top level topic.
The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24).
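The subcorpus boundaries given above can be expressed as a small date lookup. This is only a sketch: the description gives start dates for "covid" and "war" but no end date for "covid", so the code assumes that "covid" stays open-ended and that later dates carry both labels. The function name is hypothetical.

```python
from datetime import date

COVID_START = date(2020, 1, 31)  # "reference" covers everything up to 2020-01-30
WAR_START = date(2022, 2, 24)


def subcorpora(day: date) -> list:
    """Return the subcorpus labels whose stated date range includes the day."""
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    # Assumption: "covid" and "war" overlap from 2022-02-24 onwards.
    if day >= WAR_START:
        labels.append("war")
    return labels
```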
An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for the release 5.0 at https://github.com/clarin-eric/ParlaMint/tree/v5.0/Build/Metadata.
The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution).
The ParlaMint.ana linguistic annotation includes tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; the 4-class CoNLL-2003 named entities; and per-sentence sentiment score and class. Some corpora also have further linguistic annotations, in particular PoS tagging according to a language-specific scheme, with their corpus TEI headers giving further details on the annotation vocabularies and tools used.
This entry contains the ParlaMint.ana TEI-encoded linguistically annotated corpora; the derived CoNLL-U files along with TSV metadata of the speeches and TSV with per-sentence sentiment score, 6- and 3-categories class; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText.
Also included is the 5.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the open issues at the GitHub repository of the project.
This entry contains the linguistically marked-up version of the corpus, while the text version, i.e. without the linguistic annotation, is also available at http://hdl.handle.net/11356/2004. Another related resource, namely the ParlaMint corpora machine translated to English, ParlaMint-en.ana 5.0, can be found at http://hdl.handle.net/11356/2006.
As opposed to the previous version 4.1, this version adds information on the topic of each speech and the sentence-level sentiment for all corpora, adds some previously missing speeches to the TR corpus, changes the IDs of the categories in corpus-specific taxonomies to prevent ID clashes, and corrects some other minor errors.
Syntactic Tree Inventories from English GUM UD Corpus (v2.15)
This dataset contains lists of delexicalized dependency trees and subtrees extracted from the English UD GUM corpus, version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic inventory of syntactic structures in English, supporting data-driven investigations into syntactic patterns and their variation across modalities.
The GUM corpus was divided into spoken and written subsets based on the original genre classifications. The spoken subset includes interviews, conversations, podcasts, vlogs, courtroom transcripts, and speeches, while the written subset includes news articles, academic texts, fiction, how-to guides, biographies, essays, letters, textbooks, and travel guides.
Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN). For each of the two subcorpora (spoken and written), structures were extracted in three versions:
(1) The full version
(2) A version excluding punctuation (i.e., branches labeled as punct)
(3) A version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse)
The extracted structures are provided in tabular TSV format. Each row contains:
* The delexicalized tree/subtree (e.g., ADJ <amod NOUN)
* Its absolute and relative frequency in the target corpus (e.g., GUM-spoken)
* An example (e.g., nice <amod example)
* Frequency in the corresponding reference corpus (e.g., GUM-written)
* Keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF)
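To show how the keyness columns relate to the frequency columns, here is a minimal sketch using the commonly cited definitions of log-likelihood (LL), odds ratio, and %DIFF. The exact formulas and corpus-size denominators used to produce the dataset are not given in this description, so the function, its parameters, and the formulas chosen are assumptions.

```python
import math


def keyness(freq_target: int, size_target: int, freq_ref: int, size_ref: int) -> dict:
    """Compare one structure's frequency in a target and a reference corpus.

    Assumes each frequency is smaller than the size of its corpus.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the structure were spread evenly over both corpora.
    exp_target = size_target * total_freq / total_size
    exp_ref = size_ref * total_freq / total_size

    def term(observed: float, expected: float) -> float:
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    ll = 2 * (term(freq_target, exp_target) + term(freq_ref, exp_ref))

    rel_target = freq_target / size_target
    rel_ref = freq_ref / size_ref
    pct_diff = (rel_target - rel_ref) / rel_ref * 100 if rel_ref > 0 else math.inf

    odds_target = freq_target / (size_target - freq_target)
    odds_ref = freq_ref / (size_ref - freq_ref)
    odds_ratio = odds_target / odds_ref if odds_ref > 0 else math.inf

    return {"LL": ll, "odds_ratio": odds_ratio, "pct_diff": pct_diff}
```

A structure with the same relative frequency in both subcorpora yields LL = 0, an odds ratio of 1, and a %DIFF of 0.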
The "Mobile languages" corpus MoJezik 1.0 (transcription)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno dialect, Rovte dialect group) and Ribnica (Lower Carniola dialect, Lower Carniola dialect group), who study or work in the Slovenian capital, Ljubljana, and thus navigate daily between dialectal and standard language use. Interview topics include narratives of personal (linguistic) history, reflections on past and present language practices, attitudes towards their own dialects and other Slovene varieties, experiences of dialect perception in the Ljubljana context and of standard-like speech in local environments, linguistic identity, stereotypes and prejudices, intergenerational language use (especially with children), and language behaviour in educational settings.
The corpus includes:
– Idrija group: 5 speakers (3 women, 2 men; 3 adults, 2 secondary-school students), recorded between 2009 and 2013; 1,112 transcribed utterances, 31,506 transcribed words.
– Ribnica group: 11 speakers (3 primary informants and 8 close contacts, including family members, friends, and colleagues), recorded between 2020 and 2022; 2,889 transcribed utterances, 47,364 transcribed words.
The transcriptions are orthographic, with selected non-standard features preserved using special symbols to capture salient dialectal elements (e.g., the fricative [γ] and the bilabial glide [w] in the Cerkno variety). Speaker names have been anonymised. While transcription prioritised content and was performed by multiple transcribers, consistency in the phonetic rendering of dialectal features was not systematically verified. Users should be aware that detailed phonological analysis may require additional checking.
The interviews were conducted within the framework of broader sociolinguistic research, which also encompassed informants’ self-recordings of spontaneous speech in diverse everyday situations and a quantitative variationist analysis of five phonological variables (dialect-specific) across various communicative contexts. The interview data enable comparisons between speakers’ metalinguistic commentary and their actual language use as documented in the recordings.
The findings of the Cerkno and Ribnica studies are comprehensively presented in two scientific publications:
* Bitenc, Maja (2016): Z jezikom na poti med Idrijskim in Ljubljano [With Language on the Move Between Idrija and Ljubljana]. Ljubljana: Znanstvena založba Filozofske fakultete. https://www.ff.uni-lj.si/publikacije/z-jezikom-na-poti-med-idrijskim-ljubljano
* Bitenc, Maja (2025): Govor v gibanju med Ribnico in Ljubljano [Speech in Motion Between Ribnica and Ljubljana]. Ljubljana: Znanstvena založba Filozofske fakultete. https://doi.org/10.4312/9789612976316
The corpus speech files for speakers who have consented to the publication of their recordings are available as a separate entry: The "Mobile languages" corpus MoJezik 1.0 (audio), http://hdl.handle.net/11356/2042
Parallel sense-annotated corpus ELEXIS-WSD 1.3
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.3 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.
The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2.
List of sense inventories
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene
The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl and elexis-wsd-en).
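The column layout described above can be read with a few lines of code. The following is a minimal sketch that splits one token line into the ten CoNLL-U columns and unpacks the MISC column into key-value pairs. Only SpaceAfter=No is taken from the description; the keys used for sense IDs, multiword expressions, and named entities are not listed here, so the sketch treats MISC generically and the function name is hypothetical.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line: str) -> dict:
    """Split one CoNLL-U token line into named columns and a MISC dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    token = dict(zip(FIELDS, cols))
    misc = {}
    if token["misc"] != "_":
        # MISC entries are separated by "|" and usually have the form key=value.
        for part in token["misc"].split("|"):
            key, sep, value = part.partition("=")
            misc[key] = value if sep else True
    token["misc"] = misc
    return token
```

Comment lines (starting with "#") and blank sentence separators should be skipped before calling the function.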
Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.
For more information, please refer to 00README.txt.
Updates in version 1.3:
- A handful of token ID issues were corrected in ELEXIS-WSD-sl. In addition, lemmas were corrected according to the version of ELEXIS-WSD-sl included in the SUK 1.1 Training Corpus of Slovene (http://hdl.handle.net/11356/1959).
- Named entity annotations and named entity core concept annotations were added to ELEXIS-WSD-en.
- For all languages, missing UPOS tags were added for non-content words.