Common Language Resources and Technology Infrastructure - Slovenia

840 research outputs found

Slovenian social assistance rights text data collection SSAR 1.0

Author: Podpečan Vid
Pollak Senja
Arhar Holdt Špela
Rot Špela
Mišič Luka
Štajnpihler Božič Tilen
Urankar Nejc
Strban Grega
Publication venue: Faculty of Law, University of Ljubljana
Publication date: 06/11/2024

The Slovenian Social Assistance Rights Text Data Collection (SSAR 1.0) consists of 13 documents: 8 legally binding texts and 5 non-legally binding texts. In total, the collection contains 6,936 sentences.

The following sources were used for data collection:
- Legally binding documents: https://pisrs.si/
- Non-legally binding documents: https://www.csd-slovenije.si/, https://www.gov.si/, https://e-uprava.gov.si/

Each document in the collection is provided in both raw text and CoNLL-U format, generated using CLASSLA v2.1.1. The collection is organized as follows:

Legally Binding Documents
1. Zakon o usklajevanju transferjev posameznikom in gospodinjstvom
2. Zakon o uveljavljanju pravic iz javnih sredstev
3. Sklep o usklajenih višinah transferjev
4. Pravilnik o načinu ugotavljanja premoženja in njegove vrednosti pri dodeljevanju pravic iz javnih sredstev
5. Pravilnik o standardih in normativih socialnovarstvenih storitev
6. Zakon o socialnem varstvu
7. Pravilnik o načinu upoštevanja dohodkov pri ugotavljanju upravičenosti do pravic iz javnih sredstev
8. Zakon o socialno varstvenih prejemkih

Non-Legally Binding Documents
1. Delovna področja centrov za socialno delo (CSD Slovenije)
2. Denarna socialna pomoč (eUprava)
3. Denarna socialna pomoč (GOV.SI)
4. Vodnik po socialnih pravicah 2022
5. Pogosta vprašanja in odgovori (CSD Ljubljana)

The primary goal of the data collection is to support research into the linguistic features and accessibility of Slovene legal language in the area of social rights, and the development of tools that enhance the understanding and usability of Slovene legal language.
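Since the documents are distributed in CoNLL-U format, the following minimal sketch shows one way such a file could be read using only the Python standard library. It assumes the standard ten tab-separated CoNLL-U columns; the function name `read_conllu` and the sample sentence are illustrative and are not taken from the collection.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # skip lines that do not have ten columns
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Invented sample, only to show the expected shape of the input.
SAMPLE = (
    "# sent_id = demo.1\n"
    "# text = To je primer.\n"
    "1\tTo\tta\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tprimer\tprimer\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
lemmas = [tok["lemma"] for tok in sentences[0]]
```

Counting the sentences yielded by such a reader across all files would be one way to check the sentence total reported for the collection.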

Monitor corpus of Slovene Trendi 2024-08

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 05/09/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 107 media websites, published by 77 publishers. Trendi 2024-08 covers the period from January 2019 to August 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from August 2024.

Corpus of texts by Hijacint Repič in "Cvetje z vertov sv. Frančiška" CVET 1.0

Author: Košir Diana
Erjavec Tomaž
Publication venue: Science and Research Centre Koper
Publication date: 07/05/2024

The CVET corpus contains 230 texts (around 175 thousand words) of varying length, published in the religious journal "Cvetje z vertov sv. Frančiška" between 1887 and 1916, when the magazine was edited by the linguist Fr. Stanislav Škrabec. The articles are signed with the initials P. H. R. (padre Hijacint Repič) and are original texts, translations or adaptations. The majority are devotional and religious articles and hagiography. The corpus is encoded in two variants: one contains the corpus encoded in TEI, while the other contains automatic linguistic annotations that include word modernization, lemmatisation, MULTEXT-East morphosyntactic annotations, and morphological and syntactic annotations according to the Universal Dependencies formalism for Slovenian. In addition to the two TEI-encoded versions, the corpus is also available in two derived formats. The first is plain text in several variants (original, normalised, lemmas; either tokenised or not, in original case or lower case), and the second is the vertical format used by CQP-compatible concordancers, such as noSketchEngine.

Slovenian web corpus CLASSLA-web.sl 1.0

Author: Ljubešić Nikola
Rupnik Peter
Kuzman Taja
Publication venue: Jožef Stefan Institute
Publication date: 22/03/2024

The Slovenian web corpus CLASSLA-web.sl 1.0 is based on the Slovenian MaCoCu-sl 2.0 web corpus crawl (http://hdl.handle.net/11356/1795), which was additionally cleaned and enriched with linguistic and genre information. The CLASSLA-web.sl corpus is a part of the South Slavic CLASSLA-web corpus collection, which is the first collection of comparable corpora that encompasses the entire South Slavic language group. The MaCoCu-sl 2.0 crawl was built by crawling the ".si" internet top-level domain in 2021 and 2022, as well as extending the crawl dynamically to other domains. During the development of CLASSLA-web corpora, the MaCoCu web crawls were cleaned by removing paragraphs that are not in the target language, and by removing very short texts (less than 75 words or consisting only of paragraphs shorter than 70 characters). The corpus was also linguistically annotated with the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). The linguistic processing involved tokenization, morphosyntactic annotation, and lemmatization. Additionally, the corpus was automatically annotated with genres using the Transformer-based X-GENRE classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). Ten genre categories are used: News, Information/Explanation, Promotion, Opinion/Argumentation, Instruction, Legal, Prose/Lyrical, Forum, Other and Mix. The corpus is available in vertical format, as used by Sketch Engine and CWB concordancers. Information is provided on the text-, paragraph-, sentence- and token-level. Each text is accompanied by the following metadata: text id, title, url, domain, top-level domain (tld, e.g., "com"), and predicted genre category. Each text is divided into paragraphs that are accompanied by the following metadata: paragraph id, the automatically identified language of the text in the paragraph, and paragraph quality. 
Quality labels, such as "short" or "good", are assigned based on paragraph length, URL and stopword density via the jusText tool (https://corpus.tools/wiki/Justext). Paragraphs are further divided into sentences, which carry their sentence id as metadata. Inside sentences, tokens are provided in tabular format with their linguistic annotation. Details about the structural and positional attributes are also given in the accompanying registry file, which was used to install the corpus on the CLARIN.SI concordancers.

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
(1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
(2) Clearly identify the copyrighted work claimed to be infringed.
(3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material.
(4) Please write to the contact person for this resource whose email is available in the full item record.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

A JSONL version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The MaCoCu-Genre version comprises texts and metadata at the text level, including genre information, and is not linguistically annotated.
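The short-text rule described above (texts under 75 words, or texts consisting only of paragraphs shorter than 70 characters) can be sketched as a simple predicate. This is an illustration under stated assumptions, not the authors' cleaning pipeline: words are approximated by whitespace splitting, paragraph length is counted in Python characters, and the name `is_too_short` is hypothetical.

```python
from typing import List

MIN_WORDS = 75        # word threshold stated in the corpus description
MIN_PARA_CHARS = 70   # paragraph-length threshold stated in the description


def is_too_short(paragraphs: List[str]) -> bool:
    """Return True if a text would be dropped under the described rule.

    Assumption: words are whitespace-separated tokens; the real pipeline
    may count words differently.
    """
    total_words = sum(len(p.split()) for p in paragraphs)
    if total_words < MIN_WORDS:
        return True
    # Also dropped: texts made up only of short paragraphs.
    return all(len(p) < MIN_PARA_CHARS for p in paragraphs)


# Invented inputs that exercise both conditions.
short_text = ["kratko besedilo"]
long_text = ["beseda " * 80]
fragmented_text = ["ena dva tri"] * 40   # 120 words, but every paragraph is short
```

Keeping the two thresholds as named constants makes it easy to experiment with stricter or looser cleaning on a sample before applying it to a whole crawl.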

Slovenian Semantic Lexicon sloWNet-USAS 1.0

Author: Brglez Mojca
Pahor de Maiti Tekavčič Kristina
Publication venue: Institute of Contemporary History
Publication date: 10/03/2024

This entry is an extension of the Slovenian semantic lexicon sloWNet 3.1 (http://hdl.handle.net/11356/1026), enriched with semantic tags following the USAS ontology. The USAS ontology (Piao et al., 2005) is part of the UCREL semantic analysis system and is used for general language semantic description (https://ucrel.lancs.ac.uk/usas/). It consists of 21 major semantic fields (e.g., PHYSICAL ATTRIBUTES [O4]) and more than 400 semantic subcategories (e.g., Temperature [O4.6], Temperature : Cold [O4.6-]) that group together words belonging to the same mental concepts. The semantic tags were translated into Slovene and then automatically mapped onto the sloWNet entries from the USAS semantic lexicon following the algorithmic steps described in the README file. This procedure assigned semantic tags to 41,135 unique entries. The semantic tags were also given concreteness scores, calculated according to the procedure described in the README file. The file USAS_sl_conc.tsv contains the complete USAS tagset, including the concreteness scores of semantic domains, and their Slovenian descriptions. The file sloWNet_USAS_1.0.tsv contains, in a tabular format, lexemes from sloWNet 3.1 paired with the semantic tag of their most literal, basic sense as identified by the algorithm, together with all the semantic tag candidates from which the closest tag was sourced. The resource was originally used to facilitate metaphor analysis, but can also be helpful for other tasks such as text classification and sentiment analysis.
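The tag examples quoted above ([O4], [O4.6], [O4.6-]) suggest a hierarchical code: a letter, dot-separated numbers, and optional plus or minus marks. The sketch below decomposes such a code. It is an assumption-laden illustration: the regular expression covers only this basic shape, any further suffix is kept unparsed in `rest`, and the names `UsasTag`, `parse_usas_tag` and `category_path` are hypothetical rather than part of the resource.

```python
import re
from typing import List, NamedTuple, Optional


class UsasTag(NamedTuple):
    letter: str        # leading letter, e.g. "O"
    levels: List[int]  # numeric hierarchy, e.g. [4, 6]
    polarity: str      # trailing "+" or "-" marks, possibly empty
    rest: str          # anything after that, left unparsed


# Assumed basic shape: letter, dot-separated numbers, optional +/- marks.
TAG_RE = re.compile(r"^([A-Z])(\d+(?:\.\d+)*)([+-]*)(.*)$")


def parse_usas_tag(tag: str) -> Optional[UsasTag]:
    """Split a tag such as 'O4.6-' into its parts, or return None."""
    match = TAG_RE.match(tag.strip())
    if match is None:
        return None
    letter, numbers, polarity, rest = match.groups()
    levels = [int(n) for n in numbers.split(".")]
    return UsasTag(letter, levels, polarity, rest)


def category_path(tag: UsasTag) -> List[str]:
    """List the tag's categories from most general to most specific."""
    return [
        tag.letter + ".".join(str(n) for n in tag.levels[:i])
        for i in range(1, len(tag.levels) + 1)
    ]


cold = parse_usas_tag("O4.6-")
path = category_path(cold)
```

A decomposition like this can help when aggregating lexemes at a coarser level, for example grouping every O4.6 variant under O4.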

Montenegrin web corpus CLASSLA-web.cnr 1.0

Author: Ljubešić Nikola
Rupnik Peter
Kuzman Taja
Publication venue: Jožef Stefan Institute
Publication date: 26/03/2024

The Montenegrin web corpus CLASSLA-web.cnr 1.0 is based on the MaCoCu-cnr 1.0 web corpus crawl (http://hdl.handle.net/11356/1809), which was additionally cleaned and enriched with linguistic and genre information. The CLASSLA-web.cnr corpus is a part of the South Slavic CLASSLA-web corpus collection, which is the first collection of comparable corpora that encompasses the entire South Slavic language group. The MaCoCu-cnr 1.0 crawl was built by crawling the ".me" internet top-level domain in 2021 and 2022, as well as extending the crawl dynamically to other domains. During the development of CLASSLA-web corpora, the MaCoCu web crawls were cleaned by removing paragraphs that are not in the target language, and by removing very short texts (less than 75 words or consisting only of paragraphs shorter than 70 characters). The corpus was also linguistically annotated with the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). The linguistic processing involved tokenization, morphosyntactic annotation, and lemmatization. Additionally, the corpus was automatically annotated with genres using the Transformer-based X-GENRE classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). The following genre categories are used: News, Information/Explanation, Promotion, Opinion/Argumentation, Instruction, Legal, Prose/Lyrical, Forum, Other and Mix. The corpus is available in vertical format, as used by Sketch Engine and CWB concordancers. Information is provided on the text-, paragraph-, sentence- and token-level. Each text is accompanied by the following metadata: text id, title, url, domain, top-level domain (tld, e.g., "com"), and predicted genre category. Each text is divided into paragraphs that are accompanied by the following metadata: paragraph id, the automatically identified language of the text in the paragraph, and paragraph quality. 
Quality labels, such as "short" or "good", are assigned based on paragraph length, URL and stopword density via the jusText tool (https://corpus.tools/wiki/Justext). Paragraphs are further divided into sentences, which carry their sentence id as metadata. Inside sentences, tokens are provided in tabular format with their linguistic annotation. Details about the structural and positional attributes are also given in the accompanying registry file, which was used to install the corpus on the CLARIN.SI concordancers.

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
(1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
(2) Clearly identify the copyrighted work claimed to be infringed.
(3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material.
(4) Please write to the contact person for this resource whose email is available in the full item record.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

A JSONL version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The MaCoCu-Genre version comprises texts and metadata at the text level, including genre information, and is not linguistically annotated.
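The vertical format mentioned above stores one token per line with tab-separated annotations, while structures such as texts, paragraphs and sentences appear as XML-like tag lines. The following sketch reads such input sentence by sentence. The tag names `text`, `p` and `s`, the attribute names in the sample, and the column layout are assumptions for illustration; the registry file accompanying the corpus is the authoritative description.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

TAG_RE = re.compile(r"^<(/?)(\w+)([^>]*)>$")     # a structural tag line
ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')     # key="value" pairs


def read_vertical(
    lines: Iterable[str],
) -> Iterator[Tuple[Dict[str, str], List[List[str]]]]:
    """Yield (text attributes, sentence tokens) for every closed <s>."""
    text_attrs: Dict[str, str] = {}
    sentence: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Token lines contain tabs; structural lines are assumed not to.
        match = TAG_RE.match(line) if "\t" not in line else None
        if match is None:
            if line:
                sentence.append(line.split("\t"))
            continue
        closing, name, attrs = match.groups()
        if name == "text" and not closing:
            text_attrs = dict(ATTR_RE.findall(attrs))
        elif name == "s" and closing:
            yield text_attrs, sentence
            sentence = []
        # Other tags (e.g. <p>, glue markers) are ignored in this sketch.


# Invented sample; attribute names and columns are hypothetical.
SAMPLE = """<text id="demo-1" genre="News">
<p id="demo-1.1" quality="good">
<s id="demo-1.1.1">
To\tta\tPRON
je\tbiti\tAUX
primer\tprimer\tNOUN
<g/>
.\t.\tPUNCT
</s>
</p>
</text>
"""

parsed = list(read_vertical(SAMPLE.splitlines()))
```

Because the reader is a generator, it can be pointed at an open file handle and will process a large corpus without loading it into memory.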

Monitor corpus of Slovene Trendi 2024-03

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 04/04/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2024-03 covers the period from January 2019 to March 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from March 2024.

Multilingual IPTC Media Topic dataset EMMediaTopic 1.0

Author: Kuzman Taja
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 02/12/2024

The multilingual IPTC Media Topic dataset EMMediaTopic 1.0 is a collection of news articles in Catalan, Croatian, Greek, and Slovenian, automatically annotated with the 17 top-level topic labels from the IPTC NewsCodes Media Topic hierarchical schema. The texts were annotated by the GPT-4o large language model, accessed via the OpenAI API (https://openai.com/index/hello-gpt-4o/). Evaluation against a manually-annotated test set showed that the model consistently achieves high performance, with an average macro-F1 score of 0.731 and a micro-F1 score of 0.722. Additionally, assessments of inter-annotator agreement on the test set revealed that the reliability of the GPT model used as a data annotator is comparable to that of human annotators. The EMMediaTopic dataset consists of 21,000 texts, divided into a training (20,000 instances) and a development set (1,000 instances), both of which have an identical distribution of labels. The dataset comprises news articles from the Catalan (ca), Croatian (hr), Greek (el), and Slovenian (sl) MaCoCu-Genre corpora (http://hdl.handle.net/11356/1969). For each language, a random sample of 5,250 texts classified under the "News" genre was extracted from the web corpus. Due to the limitations of the XLM-RoBERTa model fine-tuned on this dataset, the texts were truncated to the first 512 words. The dataset employs the following 17 top-level IPTC NewsCodes Media Topic (https://cv.iptc.org/newscodes/mediatopic) labels: 'arts, culture, entertainment and media', 'conflict, war and peace', 'crime, law and justice', 'disaster, accident and emergency incident', 'economy, business and finance', 'education', 'environment', 'health', 'human interest', 'labour', 'lifestyle and leisure', 'politics', 'religion', 'science and technology', 'society', 'sport', and 'weather'. 
The EMMediaTopic dataset is provided in JSONL format, where each text is accompanied by the following metadata: document_id (document ID from the MaCoCu-Genre corpus), lang (language code: ca, el, hr, or sl), GPT-IPTC-label (GPT-assigned IPTC topic label), and split (train or dev). This dataset was used for the development of the Multilingual IPTC news topic classifier (https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier), a fine-tuned Transformer-based XLM-RoBERTa model that can be applied to any of the languages included in the XLM-RoBERTa pretraining dataset
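As a minimal sketch of working with the JSONL layout described above, the snippet below tallies labels per language within one split. It relies only on the metadata keys named in the description (`document_id`, `lang`, `GPT-IPTC-label`, `split`); the function name `label_distribution` and the three sample records are invented for illustration.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def label_distribution(jsonl_lines: Iterable[str],
                       split: str) -> "Counter[Tuple[str, str]]":
    """Count (language, label) pairs for records in the given split."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record["split"] == split:
            counts[(record["lang"], record["GPT-IPTC-label"])] += 1
    return counts


# Invented records that follow the documented metadata keys.
SAMPLE = [
    '{"document_id": "demo-1", "lang": "sl", "GPT-IPTC-label": "sport", "split": "train"}',
    '{"document_id": "demo-2", "lang": "sl", "GPT-IPTC-label": "sport", "split": "train"}',
    '{"document_id": "demo-3", "lang": "hr", "GPT-IPTC-label": "weather", "split": "dev"}',
]

train_counts = label_distribution(SAMPLE, "train")
dev_counts = label_distribution(SAMPLE, "dev")
```

Comparing the normalised counts of the two splits is one way to confirm the statement that training and development sets share the same label distribution.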

Spoken corpus Berta

Author: Krajnc Ivič Mira
Verdonik Darinka
Antloga Špela
Brčić Petek Tanja
Voršič Ines
Dugonik Bogdan
Donaj Gregor
Publication venue: Faculty of Arts, University of Maribor
Publication date: 08/10/2024

The Berta Spoken Corpus contains six hours of recorded speech across a variety of interactional settings. These settings include 57 different speech events, with some captured on video and others, such as telephone or private conversations, recorded as audio. The interactional settings featured in the collection include public speaking, public appearances, public lectures, advertisements, cooking shows, casual conversations, advice sessions and interviews. This corpus was developed as part of the Slovene in the Palm of Your Hand (Slovenščina na dlani) project, designed to provide teachers with an additional tool for working with texts in primary and secondary schools.

All recordings are accompanied by manual transcriptions in two formats:
- Pronunciation-based (literal) transcription: This format provides a phoneme string generated from the orthographic form using letter-to-sound rules.
- Standardized (expanded) orthographic transcription: This format follows standard Slovene spelling to represent the spoken words, with additional rules and word lists applied for non-standard vocabulary.

The entry includes audio files (WAV 44.1 kHz, PCM, 16-bit), video files where available (MP4), and transcription files in TRS format (original Transcriber 1.5.1) as well as text files.
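TRS files produced by Transcriber are XML documents, so the transcriptions can be read with a standard XML parser. The sketch below assumes the usual Transcriber layout, in which `Turn` elements carry `speaker` and `startTime` attributes and `Sync` elements mark time points inside a turn. The helper names and the sample are illustrative, and the actual Berta files may contain additional elements or attributes.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Segment = Tuple[str, float, str]  # (speaker id, start time, text)


def _flush(segments: List[Segment], speaker: str,
           start: float, buffer: List[str]) -> None:
    """Append the buffered text as one segment if it is not empty."""
    text = " ".join("".join(buffer).split())
    if text:
        segments.append((speaker, start, text))


def read_trs(xml_text: str) -> List[Segment]:
    """Extract time-stamped segments from Transcriber-style XML."""
    root = ET.fromstring(xml_text)
    segments: List[Segment] = []
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker", "")
        current_time = float(turn.get("startTime", "0"))
        buffer = [turn.text or ""]
        for child in turn:
            if child.tag == "Sync":
                # A Sync point closes the previous segment.
                _flush(segments, speaker, current_time, buffer)
                current_time = float(child.get("time", current_time))
                buffer = []
            # Text following any child element belongs to the open segment.
            buffer.append(child.tail or "")
        _flush(segments, speaker, current_time, buffer)
    return segments


# Invented sample in the assumed layout.
SAMPLE = """<Trans><Episode><Section type="report" startTime="0" endTime="4.0">
<Turn speaker="spk1" startTime="0" endTime="4.0">
<Sync time="0"/>
dober dan
<Sync time="1.5"/>
kako ste
</Turn></Section></Episode></Trans>"""

segments = read_trs(SAMPLE)
```

For files on disk, `ET.parse(path).getroot()` would replace `ET.fromstring`, which also lets the parser honour the encoding declared in the file.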

Parliamentary spoken corpus of Croatian ParlaSpeech-HR 2.0

Author: Ljubešić Nikola
Koržinek Danijel
Rupnik Peter
Publication venue: Jožef Stefan Institute
Publication date: 25/01/2024

The ParlaSpeech-HR dataset is built from the transcripts of parliamentary proceedings available in the Croatian part of the ParlaMint corpus, and the parliamentary recordings available from the Croatian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.

The main differences from version 1.0 of the dataset are:
- larger size (ParlaMint 4.0 is used here, while previously ParlaMint 2.1 was used)
- improved matching pipeline
- segments based on linguistically sound sentences from the ParlaMint transcripts, while previously segments surrounded with silence were used
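The word-level alignments make it possible to cut long sentences into shorter pieces. The sketch below shows one simple greedy strategy under stated assumptions: each word is represented as a hypothetical `(word, start, end)` tuple with times in seconds, and a new segment starts whenever adding the next word would exceed a chosen maximum duration. The dataset's actual alignment schema is not reproduced here.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (word, start seconds, end seconds)


def split_by_duration(words: List[Word],
                      max_seconds: float) -> List[List[Word]]:
    """Greedily group aligned words into segments of bounded duration.

    A single word longer than max_seconds still forms its own segment.
    """
    segments: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Duration if this word were added: its end minus the segment start.
        if current and word[2] - current[0][1] > max_seconds:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


# Invented alignment for illustration.
aligned = [("a", 0.0, 1.0), ("b", 1.0, 2.0), ("c", 2.0, 3.0), ("d", 3.0, 4.0)]
chunks = split_by_duration(aligned, max_seconds=2.0)
```

A more careful splitter might prefer boundaries at pauses or punctuation, but the greedy version is enough to bound memory use for ASR training batches.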
