Common Language Resources and Technology Infrastructure - Slovenia
Slovene-Japanese Learner's Dictionary sloJa 1.1
The Slovenian-Japanese online dictionary for Slovenian-speaking learners of Japanese was compiled by extracting and converting the Japanese-Slovenian dictionary jaSlo 3.1 (http://hdl.handle.net/11356/1050) into a preliminary Slovene-Japanese dictionary, removing duplicates and inappropriate entries first automatically and then manually, and labelling Slovene headwords with MULTEXT-East part-of-speech tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html#msd.msds-sl) and with difficulty levels according to the CEFR scale, as available in the Core Vocabulary of Slovene (http://hdl.handle.net/11356/1697). The entries were manually edited via Lexonomy (https://www.lexonomy.eu/).
For this version, the dictionary was augmented with entries for words from the Core Vocabulary of Slovene that had not been included in version 1.0, and other existing entries were manually edited.
Senses of polysemous words and corresponding translation equivalents were manually glossed with semantic hints, in part also with examples, extracted from the Japanese-Slovene parallel corpus jaSlo (https://nl.ijs.si/jaslo/index-en.html#parallel) and manually adapted for the learner's dictionary. Japanese translational equivalents from different registers were tagged according to their level of politeness and with notes on usage restrictions aimed at dictionary users who are learning Japanese as a foreign language.
The sloJa dictionary is available in TEI Lex0 encoding (https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html) and in the XML encoding used by Lexonomy, which was derived from the TEI version.
Corpus of spoken Slovenian ROG-Dialog 1.0
The corpus of spoken Slovenian ROG-Dialog consists of volunteered audio recorded by students, who asked their relatives or acquaintances to talk on record in their homes. The speakers were directed to use various styles of dialogue, including instructions, interviews, discussions, storytelling, and chatting. Dialogue themes were freely chosen; the most prevalent themes include travelling, health, childhood memories, work, technology, food, and entertainment.
Recordings and metadata were uploaded to the Govorjena Slovenščina web portal (https://govorjena-slovenscina.um.si/), manually segmented and transcribed in both colloquial and standardized orthographic transcriptions, and annotated with dialogue acts and sentiment.
The 25 speakers in this corpus cover all statistical regions of Slovenia with their ages ranging from 21 to 82 years. The corpus includes speakers from both rural and urban areas. Reflecting this geographic and social diversity, speech samples range from standard colloquial registers to local dialects, with some speakers employing distinct regional varieties.
ROG-Dialog is distributed as:
- EXMARaLDA format (.EXB files) for viewing with Partitur Editor (https://www.exmaralda.org/)
- .EXS files and Rog-Art.coma file for searching through the annotated corpus in the EXMARaLDA EXAKT concordancer (https://www.exmaralda.org/)
- .TRS files for viewing the transcriptions without annotations with Transcriber (https://trans.sourceforge.net/en/presentation.php)
- .TXT plain-text files
ROG-Dialog data were compiled to complement the ROG-Artur subcorpus of the ROG 1.0 training corpus of spoken Slovenian (http://hdl.handle.net/11356/1992). However, the two corpora differ in their annotation levels, and harmonising these remains a task for a future merge.
Trankit model for linguistic processing of spoken Slovenian
This is a retrained Slovenian spoken language model for the Trankit v1.1.1 library (https://pypi.org/project/trankit/). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/).
The model was trained using a combination of two datasets published by Universal Dependencies in release 2.12, the spoken SST treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.12) and the written SSJ treebank (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12). Its evaluation on the spoken SST test set yields an F1 score of 97.78 for lemmas, 97.19 for UPOS, 95.05 for XPOS and 81.26 for LAS, a significantly better performance than that of the counterpart model trained on written SSJ data only (http://hdl.handle.net/11356/1870).
To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
Monitor corpus of Slovene Trendi 2024-11
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-11 covers the period from January 2019 to November 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
The corpus is currently not available as a downloadable dataset due to copyright restrictions but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]).
This version adds texts from November 2024.
Parliamentary spoken corpus of Serbian ParlaSpeech-RS 1.0
The ParlaSpeech-RS dataset is built from the transcripts of parliamentary proceedings available in the Serbian part of the ParlaMint (ParlaMint-RS) corpus, and the parliamentary recordings available from the Serbian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, so that long sentences can easily be segmented further into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
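The further segmentation mentioned above can be sketched as a greedy grouping of aligned words by duration. This is only an illustration: the `Word` class, its field names, the sample words and the timings are hypothetical and do not reflect the actual ParlaSpeech-RS data layout.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical representation of one aligned word (times in seconds).
    text: str
    start: float
    end: float


def split_by_duration(words, max_seconds):
    """Greedily group aligned words into segments of at most max_seconds.

    A single word longer than max_seconds still forms its own segment.
    """
    segments = []
    current = []
    for word in words:
        # Close the current segment if adding this word would exceed the limit.
        if current and word.end - current[0].start > max_seconds:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


def describe(segment):
    """Return (start time, end time, text) for a segment."""
    return (segment[0].start, segment[-1].end, " ".join(w.text for w in segment))


# Invented sample data, for illustration only.
words = [
    Word("poštovani", 0.0, 0.6),
    Word("narodni", 0.7, 1.2),
    Word("poslanici", 1.3, 2.0),
    Word("nastavljamo", 2.4, 3.1),
    Word("rad", 3.2, 3.5),
]
segments = [describe(s) for s in split_by_duration(words, max_seconds=2.0)]
```

With the sample timings, the five words fall into two segments, each of which can be cut from the audio using its start and end time.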
Slovenian Emotion Dimension and Emotion Association Lexicon SloEmoLex 1.0
SloEmoLex is a lexicon of emotion, valence, arousal and dominance for 19,998 Slovenian entries.
It includes and extends the Slovenian part of the LiLaH lexicon (Ljubešić et al., 2020; http://hdl.handle.net/11356/1318), in which words are annotated with binary values for association to one of the 8 basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and binary values for association with positive/negative sentiment.
SloEmoLex extends the LiLaH emotion lexicon with VAD scores from NRC VAD v1 (http://saifmohammad.com/WebPages/nrc-vad.html), and emotion intensity scores from NRC Emotion Intensity lexicon v1 (http://saifmohammad.com/WebPages/AffectIntensity.htm). Apart from the approx. 14,000 words present in LiLaH, the lexicon includes 5,931 additional entries from the NRC VAD lexicon, some of which were translated with the use of sloWNet 3.1 (http://hdl.handle.net/11356/1026), and some entries (3,273) retained the machine translation provided in the Slovenian part of the NRC VAD lexicon.
If you use this work, please cite our paper:
Caporusso, Jaya, Hoogland, Damar, Brglez, Mojca, Kolosko, Boshko, Purver, Matthew, and Pollak, Senja (2024). A Computational Analysis of the Dehumanisation of Migrants from Syria and Ukraine in Slovene News Media. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 20-25 May 2024, Torino, Italy.
Bosnian web corpus CLASSLA-web.bs 1.0
The Bosnian web corpus CLASSLA-web.bs 1.0 is based on the MaCoCu-bs 1.0 web corpus crawl (http://hdl.handle.net/11356/1808), which was additionally cleaned and enriched with linguistic and genre information. The CLASSLA-web.bs corpus is a part of the South Slavic CLASSLA-web corpus collection, which is the first collection of comparable corpora that encompasses the entire South Slavic language group.
The MaCoCu-bs 1.0 crawl was built by crawling the ".ba" internet top-level domain in 2021 and 2022, as well as extending the crawl dynamically to other domains. During the development of CLASSLA-web corpora, the MaCoCu web crawls were cleaned by removing paragraphs that are not in the target language, and by removing very short texts (less than 75 words or consisting only of paragraphs shorter than 70 characters). The corpus was also linguistically annotated with the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). The linguistic processing involved tokenization, morphosyntactic annotation, and lemmatization. Additionally, the corpus was automatically annotated with genres using the Transformer-based X-GENRE classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). The following genre categories are used: News, Information/Explanation, Promotion, Opinion/Argumentation, Instruction, Legal, Prose/Lyrical, Forum, Other and Mix.
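The length-based part of the cleaning described above can be sketched as follows. This is a minimal illustration, not the CLASSLA-web implementation: the function name is hypothetical, and counting words by whitespace splitting is an assumption (the source does not say how words were counted).

```python
def is_too_short(paragraphs, min_words=75, min_paragraph_chars=70):
    """Return True if a text should be dropped as very short.

    A text is dropped if it has fewer than min_words words in total, or if
    every one of its paragraphs is shorter than min_paragraph_chars characters.
    Words are counted by whitespace splitting (an assumption of this sketch).
    """
    total_words = sum(len(p.split()) for p in paragraphs)
    only_short_paragraphs = all(len(p) < min_paragraph_chars for p in paragraphs)
    return total_words < min_words or only_short_paragraphs


# Invented examples: one long paragraph, a menu-like page, and a short note.
long_text = ["riječ " * 80]
menu_like = ["Naslovna", "Kontakt", "O nama"] * 30
short_note = ["Kratak tekst."]

results = [is_too_short(t) for t in (long_text, menu_like, short_note)]
```

The menu-like page has more than 75 words in total but is still dropped, because none of its paragraphs reaches 70 characters.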
The corpus is available in vertical format, as used by Sketch Engine and CWB concordancers. Information is provided on the text, paragraph, sentence and token levels. Each text is accompanied by the following metadata: text id, title, url, domain, top-level domain (tld, e.g., "com"), and predicted genre category. Each text is divided into paragraphs that are accompanied by the following metadata: paragraph id, the automatically identified language of the text in the paragraph, and paragraph quality. For quality, labels such as "short" or "good" are assigned based on paragraph length, URL and stopword density via the jusText tool (https://corpus.tools/wiki/Justext). Paragraphs are further divided into sentences, which carry their sentence id as metadata. Inside sentences, tokens are provided in tabular format with their linguistic annotation. Details about the structural and positional attributes are also given in the accompanying registry file, which was used to install the corpus on the CLARIN.SI concordancers.
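A vertical file of this kind can be read with a few lines of standard-library Python. The sketch below is hedged: the attribute names (`id`, `genre`, `quality`), the column order (form, lemma, tag) and the sample tokens are assumptions made for illustration; the registry file distributed with the corpus is the authority on the actual attributes.

```python
import re

# Structural lines look like <text ...>, </s> or <g/>; everything else is a token.
STRUCT_RE = re.compile(r'^<(/?)(\w+)((?:\s[^>]*)?)/?>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_vertical(lines):
    """Yield (text_attributes, sentence_tokens) for every sentence.

    Each token is a list of its tab-separated columns.
    """
    text_attrs = {}
    sentence = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = STRUCT_RE.match(line)
        if match:
            closing, name, attrs = match.groups()
            if name == "text" and not closing:
                text_attrs = dict(ATTR_RE.findall(attrs))
            elif name == "s" and not closing:
                sentence = []
            elif name == "s" and closing:
                yield text_attrs, sentence
                sentence = None
            # Other structures (e.g. <p>, <g/>) are ignored in this sketch.
        elif sentence is not None and line:
            sentence.append(line.split("\t"))


# Invented sample with hypothetical attribute names and column order.
sample = """<text id="example-1" genre="News">
<p id="example-1.1" quality="good">
<s id="example-1.1.1">
Ovo\tovaj\tPd-nsn
je\tbiti\tVar3s
primjer\tprimjer\tNcmsn
<g/>
.\t.\tZ
</s>
</p>
</text>
""".splitlines()

sentences = list(read_vertical(sample))
```

Each yielded pair combines the metadata of the enclosing text with the token rows of one sentence, which is usually enough for simple filtering, for example by genre.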
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
A JSONL version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The MaCoCu-Genre version comprises texts and metadata at the text level, including genre information, and is not linguistically annotated.
Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.1
ParlaMint-en.ana 4.1 is the English machine translation of the ParlaMint.ana 4.1 (http://hdl.handle.net/11356/1911) set of corpora of parliamentary debates across Europe. The translation is linguistically annotated similarly to the original language corpora (but without UD syntax), and with the addition of USAS semantic tags (https://ucrel.lancs.ac.uk/usas/). Because of the addition of semantic tags the UK corpus (ParlaMint-GB) is also included.
The translation to English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done on the sentence level, and includes both speeches and transcriber notes, including headings. Note that corpus metadata is mostly available both in the source language and in English. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation was done with Stanza (https://stanfordnlp.github.io/stanza/) using the conll03 model (4 classes). The annotation of MWEs (phrases) and tokens with USAS tags was done with pyMusas (https://github.com/ucrel/pymusas).
Note that the English in the corpora contains typical NMT errors, including factual errors even when high fluency is achieved, and any use of this corpus should take the machine translation limitations into account.
The files associated with this entry include the machine translated and linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. The CoNLL-U files include pyMusas USAS tags. Also included is the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the (open) issues at the GitHub repository of the project.
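As an illustration of working with the CoNLL-U files, the following sketch reads the ten standard CoNLL-U columns and the sentence-level comments using only the standard library. The sample sentence is invented, and the sketch deliberately does not assume which column or MISC key carries the USAS tags, since the description above does not specify it.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_pairs(value):
    """Turn 'A=b|C=d' into a dict; '_' stands for an empty value."""
    if value == "_":
        return {}
    return dict(item.split("=", 1) for item in value.split("|") if "=" in item)


def read_conllu(lines):
    """Yield (comments, tokens) for every sentence in a CoNLL-U stream."""
    comments, tokens = {}, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = {}, []
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            comments[key.strip()] = value.strip()
        else:
            token = dict(zip(FIELDS, line.split("\t")))
            token["feats"] = parse_pairs(token["feats"])
            token["misc"] = parse_pairs(token["misc"])
            tokens.append(token)
    if tokens:
        yield comments, tokens


# Invented sample; HEAD and DEPREL are empty, as the corpus has no UD syntax.
sample = [
    "# sent_id = example.1",
    "# text = The session is open.",
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t_\t_\t_\t_",
    "2\tsession\tsession\tNOUN\tNN\tNumber=Sing\t_\t_\t_\t_",
    "3\tis\tbe\tAUX\tVBZ\tMood=Ind|Number=Sing\t_\t_\t_\t_",
    "4\topen\topen\tADJ\tJJ\tDegree=Pos\t_\t_\t_\tSpaceAfter=No",
    "5\t.\t.\tPUNCT\t.\t_\t_\t_\t_\t_",
    "",
]

parsed = list(read_conllu(sample))
```

Any additional annotation stored as key=value pairs in the last column ends up in the `misc` dictionary of each token.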
Compared to the previous version 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. Speeches in the DK corpus are now also marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11; the UA corpus also has improved language marking (uk vs. ru) on segments.
Multilingual text genre classification model X-GENRE
The X-GENRE classifier is a text classification model that can be used for automatic genre identification. The model classifies texts to one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the provided README file for the details on the labels). The model was shown to provide high classification performance on Albanian, Catalan, Croatian, Greek, English, Icelandic, Macedonian, Slovenian, Turkish and Ukrainian, and the zero-shot cross-lingual experiments indicate that it will likely provide comparable performance on all other languages that are supported by the XLM-RoBERTa model (see Appendix in the following paper for the list of covered languages: https://arxiv.org/abs/1911.02116).
The model is based on the base-sized XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base). It was fine-tuned on the training split of an English-Slovenian X-GENRE dataset (http://hdl.handle.net/11356/1960), comprising around 1,800 instances of Slovenian and English texts. Fine-tuning was performed with the simpletransformers library (https://simpletransformers.ai/) and the following hyperparameters were used:
Train batch size: 8
Learning rate: 1e-5
Max. sequence length: 512
Number of epochs: 15
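For reference, the hyperparameters listed above could be collected in a plain dictionary, as in the sketch below. The key names are chosen to match the argument names commonly used with simpletransformers; this mapping is an assumption and should be checked against the library documentation and the provided README file.

```python
# Hyperparameters from the list above; key names are an assumed mapping
# to simpletransformers-style arguments, not taken from the source.
finetuning_args = {
    "train_batch_size": 8,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "num_train_epochs": 15,
}
```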
For optimum performance, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words), predictions of the label "Other" should be disregarded, and only predictions made with a confidence higher than 0.8 should be used. With these post-processing steps, the model was shown to reach macro-F1 scores of 0.92 and 0.94 on English and Slovenian test sets respectively (cross-dataset scenario), macro-F1 scores between 0.88 and 0.95 on Croatian, Macedonian, Turkish and Ukrainian, and macro-F1 scores between 0.80 and 0.85 on Albanian, Catalan, Greek, and Icelandic (zero-shot cross-lingual scenario). Refer to the provided README file for instructions with code examples on how to use the model.
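The three post-processing steps can be sketched as a simple filter over already computed predictions. The function name, the dictionary keys and the sample records are hypothetical, and word counting by whitespace splitting is an assumption; the README file remains the reference for running the model itself.

```python
def keep_prediction(text, label, confidence, min_words=75, min_confidence=0.8):
    """Apply the recommended post-processing to one prediction.

    Keeps a prediction only if the document has at least min_words words
    (whitespace-split, an assumption), the label is not "Other", and the
    confidence is strictly higher than min_confidence.
    """
    if len(text.split()) < min_words:
        return False
    if label == "Other":
        return False
    return confidence > min_confidence


# Invented predictions, for illustration only.
predictions = [
    {"text": "word " * 100, "label": "News", "confidence": 0.97},
    {"text": "word " * 100, "label": "Other", "confidence": 0.99},
    {"text": "word " * 100, "label": "Forum", "confidence": 0.55},
    {"text": "word " * 20, "label": "Legal", "confidence": 0.95},
]

kept = [
    p["label"]
    for p in predictions
    if keep_prediction(p["text"], p["label"], p["confidence"])
]
```

Only the first prediction passes all three checks; the others fail on the label, the confidence and the document length respectively.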
The Trankit model for linguistic processing of spoken and written Slovenian 1.1
This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings).
It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/).
In comparison to its counterpart models trained on SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields a significantly better performance on spoken transcripts and an almost identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type.
To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/dev), thus producing significantly better results for spoken data.