Common Language Resources and Technology Infrastructure - Slovenia


840 research outputs found


The Trankit model for linguistic processing of standard written Slovenian 1.1

Author: Krsnik Luka
Dobrovoljc Kaja
Terčon Luka
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 29/08/2024

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the reference SSJ UD treebank featuring fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. It predicts sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained on the dataset published by Universal Dependencies in release 2.14 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14). To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. This version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14) than the previous version of the model and produces similar results.

Genre-enriched web corpora MaCoCu-Genre

Author: Kuzman Taja
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 07/10/2024

The genre-enriched MaCoCu-Genre corpus collection comprises web corpora that have been automatically annotated with genre labels. The corpora are useful for genre-based creation of subcorpora that can be used for linguistic analyses or various end tasks in the field of natural language processing. The MaCoCu-Genre corpora comprise 67 million texts and 28.5 billion words in 13 European languages: Albanian, Bosnian, Bulgarian, Catalan, Croatian, Greek, Icelandic, Macedonian, Montenegrin, Serbian, Slovenian, Turkish, and Ukrainian (see the README file for sizes of individual corpora). The MaCoCu-Genre corpora are based on the MaCoCu web corpora for Albanian (http://hdl.handle.net/11356/1804), Catalan (http://hdl.handle.net/11356/1837), Greek (http://hdl.handle.net/11356/1839), Icelandic (http://hdl.handle.net/11356/1805), Turkish (http://hdl.handle.net/11356/1802) and Ukrainian (http://hdl.handle.net/11356/1838), and the CLASSLA-web corpora for Bosnian (http://hdl.handle.net/11356/1927), Bulgarian (http://hdl.handle.net/11356/1928), Croatian (http://hdl.handle.net/11356/1929), Macedonian (http://hdl.handle.net/11356/1932), Montenegrin (http://hdl.handle.net/11356/1930), Serbian (http://hdl.handle.net/11356/1931), and Slovenian (http://hdl.handle.net/11356/1882). The CLASSLA-web corpora are a cleaned-up subset of the MaCoCu web corpora. During the creation of the MaCoCu-Genre corpora, the CLASSLA-web post-processing was also applied to the other MaCoCu corpora: removal of paragraphs in a non-target language and removal of short texts (fewer than 75 words). The X-GENRE classifier (http://hdl.handle.net/11356/1961) was used for automatic annotation with genre labels. The model classifies texts into one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion, and Other. 
Texts classified with a prediction confidence below 0.8 were assigned the label Mix (refer to the provided README file for details on the labels). The classifier is based on the multilingual XLM-RoBERTa Transformer-based model (https://huggingface.co/FacebookAI/xlm-roberta-base), and was shown to provide high classification performance when evaluated on 9 of the languages included in the MaCoCu-Genre corpora (macro-F1 scores between 0.80 and 0.95). High prediction accuracy is also expected for the remaining four languages (Bosnian, Bulgarian, Montenegrin, and Serbian), as they are closely related to Croatian and Macedonian, for which the model has demonstrated strong performance. The MaCoCu-Genre corpora are available in the JSONL format, where each text is accompanied by the following metadata: id (document id from the original web corpus), title, url, domain, tld (top-level domain, e.g., "com"), and genre. Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
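
As an illustration of genre-based subcorpus creation from the JSONL files, the following is a minimal sketch rather than the authors' tooling. The metadata keys (id, title, url, domain, tld, genre) come from the description above; the "text" key and the two sample records are assumptions made up for the example, so check the README for the actual field names.

```python
import io
import json
from collections import Counter


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_genre(records, genre):
    """Keep only records whose 'genre' field equals the requested label."""
    return [r for r in records if r.get("genre") == genre]


# Two made-up records standing in for a corpus file; the "text" key is an
# assumption, the other keys are the metadata fields named in the description.
sample = io.StringIO(
    '{"id": "doc-1", "title": "A", "url": "https://example.com/a", '
    '"domain": "example.com", "tld": "com", "genre": "News", "text": "..."}\n'
    '{"id": "doc-2", "title": "B", "url": "https://example.org/b", '
    '"domain": "example.org", "tld": "org", "genre": "Mix", "text": "..."}\n'
)

records = list(iter_records(sample))
news = filter_by_genre(records, "News")
genre_counts = Counter(r["genre"] for r in records)
```

For a real corpus file, the same functions could be fed an open file handle instead of the in-memory sample, which avoids loading the whole file at once if the list() call is dropped.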

Slovenian manuscript sermons by Ignacij Holzapfel 1.0

Author: Holzapfel Ignacij
Kunavar Marko
Ogrin Matija
Publication venue: ZRC SAZU
Publication date: 25/11/2024

This corpus consists of editions of three volumes of sermons written by Ignatius Holzapfel (1799-1866) when he was active as parish priest in Črnomelj and Ribnica. The bulk of Holzapfel's manuscript legacy has remained unpublished. Holzapfel carefully prepared his Sunday and feast day homilies as written drafts in numbered volumes. From his legacy we have diplomatically transcribed volumes 88, 90 and 91. In total, more than 220 pages of manuscript have been transcribed and published in three electronic editions. The main text is written entirely in Slovenian. The titles of the sermons are in Latin and partly in German. The passages in these foreign languages are tagged accordingly via @xml:lang. The text of approximately 200 pages of Holzapfel's manuscript was recognised using the Transkribus web service (https://www.transkribus.org/) and then manually corrected and annotated. Each volume is stored in a separate XML document, marked up according to the Text Encoding Initiative (TEI) Guidelines.
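
Because foreign-language passages carry an @xml:lang attribute, they can be selected with a standard XML parser. The sketch below uses Python's built-in ElementTree; the TEI snippet, its element choices and the Latin title are invented placeholders, not an excerpt from the edition.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up TEI fragment for illustration only; the real documents will differ.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text xml:lang="sl">
    <body>
      <div>
        <head xml:lang="la">Dominica prima Adventus</head>
        <p>Slovenian sermon text.</p>
      </div>
    </body>
  </text>
</TEI>"""


def passages_by_language(xml_string, lang):
    """Return the text of every element explicitly tagged with the given xml:lang."""
    root = ET.fromstring(xml_string)
    return [
        "".join(el.itertext()).strip()
        for el in root.iter()
        if el.get(XML_LANG) == lang
    ]


latin = passages_by_language(sample, "la")
```

Note that this matches only elements that carry the attribute themselves; it does not resolve the language that child elements inherit from an ancestor.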

Overview of inflectional paradigms in Slovenian

Author: Štarkl Ema
Mišmaš Petra
Simonović Marko
Publication venue: University of Graz
Publication date: 18/04/2024

The overview provides a comprehensive account of the inflectional features associated with specific endings. Each ending has a dedicated row in the table and is exemplified by a word in the relevant form, while additional columns detail the features that the ending expresses. Each feature has its own column, with a value of 1 indicating the presence of the feature and 0 indicating its absence.
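
A table with one row per ending and binary feature columns can be turned into a mapping from endings to feature sets. The sketch below assumes a CSV export; the column names and the two sample rows are hypothetical and do not reproduce the published table.

```python
import csv
import io

# Hypothetical layout: two descriptive columns followed by 1/0 feature columns.
sample = io.StringIO(
    "ending,example,nominative,accusative,singular,feminine\n"
    "-a,lipa,1,0,1,1\n"
    "-o,lipo,0,1,1,1\n"
)


def features_by_ending(handle, meta_columns=("ending", "example")):
    """Map each ending to the set of feature columns whose value is 1."""
    result = {}
    for row in csv.DictReader(handle):
        result[row["ending"]] = {
            name
            for name, value in row.items()
            if name not in meta_columns and value == "1"
        }
    return result


features = features_by_ending(sample)
```

The mapping keeps one entry per ending label, in line with the one-row-per-ending layout described above.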

Monitor corpus of Slovene Trendi 2024-04

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 07/05/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 73 publishers. Trendi 2024-04 covers the period from January 2019 to April 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from April 2024.

Monitor corpus of Slovene Trendi 2024-02

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 06/03/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2024-02 covers the period from January 2019 to February 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from February 2024.

Monitor corpus of Slovene Trendi 2024-05

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 07/05/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 73 publishers. Trendi 2024-05 covers the period from January 2019 to May 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from May 2024.

Dataset of Slovene medical texts PoVeJMo-VeMo-Med 1.0

Author: Malenšek Miha
Bajec Marko
Publication venue: VITASIS, d.o.o.
Publication date: 25/09/2024

PoVeJMo-VeMo-Med is a dataset containing Slovene medical texts. The bulk of it consists of instructions for use for various prescription drugs. The texts were extracted from the Slovene Central Drug Database (Centralna baza zdravil; http://www.cbz.si/), with a minority of documents from the National Institute of Public Health (Nacionalni inštitut za javno zdravje; https://nijz.si/). The documents were converted from PDF files to text format. The dataset can be used to fine-tune large language models for the medical domain. Version 1.0 contains two subversions of the corpus: the original (with 17,701 texts) and the deduplicated version (with 5,841 texts), in which duplicate texts have been removed. Please note that this dataset was also the basis for the automatic generation of the Slovene instruction-following dataset for large language models GaMS-Instruct-MED 1.0 (http://hdl.handle.net/11356/1982). For more information on how the two are related, please consult the entry for GaMS-Instruct-MED 1.0.
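
The description does not say how duplicates were identified. As one possible approach, and purely as an assumption, the sketch below removes exact duplicates after whitespace normalisation by hashing each text; the sample strings are invented.

```python
import hashlib


def deduplicate(texts):
    """Return texts with later exact duplicates removed, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        # Normalise whitespace so trivially different copies collapse together.
        normalised = " ".join(text.split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Invented sample: the second item differs from the first only in spacing.
docs = [
    "Navodilo za uporabo A",
    "Navodilo  za uporabo A",
    "Navodilo za uporabo B",
]
unique_docs = deduplicate(docs)
```

Near-duplicate detection (for example, documents differing in a dosage figure) would need a different technique and is outside this sketch.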

Documents on Magdalena Gornik in mid-19th century manuscripts 1.0

Author: Gornik Magdalena
Vernik Tobija
Plaper Janez
Janež Jurij
Žagar Jožef
Lenarčič Barbara
Ogrin Matija
Publication venue: ZRC SAZU
Publication date: 25/11/2024

The document contains a diplomatic transcription of over 285 pages of manuscript documents about the Slovenian mystic Magdalena Gornik (1835-1896) from the village of Gora near Sodražica. The vast majority of the documents were written or transcribed by the Franciscan Tobija Vernik (1801-1886) contemporaneously with the events of 1849-1860. The full text is published as an electronic edition together with digital facsimiles of the manuscript. The transcription has been consistently checked against the manuscript. The annotations of personal names, place names and dates have not been fully checked for consistency. The XML file is marked up according to the Text Encoding Initiative (TEI) Guidelines.

Dataset of Slovene word formation trees ArboSloleks 1.0

Author: Čibej Jaka
Publication venue: Faculty of Computer and Information Science, University of Ljubljana
Publication date: 30/11/2024

ArboSloleks is a dataset containing Slovene word formation trees that have been automatically constructed from word relations (http://hdl.handle.net/11356/1986) extracted from Sloleks 2.0 (http://hdl.handle.net/11356/1230). Each word formation tree begins with a root lexeme from Sloleks (e.g. abolicionizem); morphologically related lexemes are then listed in pairs (original lexeme, related lexeme) along with the levels of word formation (e.g. abolicionizem – abolicionist (Level 1); abolicionist – abolicionistka (Level 2)). Version 1.0 includes 14,918 word formation trees constructed from 66,360 lexeme pairs. It is available in an ad hoc .txt format. For information on the structure and how to parse the data, please consult 00README.txt.
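
The relationship between lexeme pairs and word formation levels can be sketched with a breadth-first walk from the root lexeme. This is an illustration on in-memory pairs taken from the example above, not a parser for the dataset's .txt format (see 00README.txt for that); the function name is hypothetical.

```python
from collections import defaultdict, deque

# (original lexeme, related lexeme) pairs from the example in the description.
pairs = [
    ("abolicionizem", "abolicionist"),
    ("abolicionist", "abolicionistka"),
]


def build_levels(root, pairs):
    """Assign each lexeme its distance from the root (root = level 0)."""
    children = defaultdict(list)
    for original, related in pairs:
        children[original].append(related)

    levels = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in levels:  # guard against repeated pairs
                levels[child] = levels[current] + 1
                queue.append(child)
    return levels


levels = build_levels("abolicionizem", pairs)
```

With these pairs, abolicionist lands on level 1 and abolicionistka on level 2, matching the levels given in the description.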

5 full texts and 840 metadata records. Updated in the last 30 days.
