Common Language Resources and Technology Infrastructure - Slovenia
    840 research outputs found

    The Trankit model for linguistic processing of standard written Slovenian 1.1

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the reference SSJ UD treebank featuring fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. It predicts sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained using a dataset published by Universal Dependencies in release 2.14 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14). To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. This version was trained on a newer, slightly improved release of the SSJ UD treebank (UD v2.14) than the previous version of the model, and it produces similar results.

    Genre-enriched web corpora MaCoCu-Genre

    The genre-enriched MaCoCu-Genre corpus collection comprises web corpora that have been automatically annotated with genre labels. The corpora are useful for genre-based creation of subcorpora for linguistic analyses or for various end tasks in natural language processing. The MaCoCu-Genre corpora comprise 67 million texts and 28.5 billion words in 13 European languages: Albanian, Bosnian, Bulgarian, Catalan, Croatian, Greek, Icelandic, Macedonian, Montenegrin, Serbian, Slovenian, Turkish, and Ukrainian (see the README file for sizes of individual corpora). The MaCoCu-Genre corpora are based on the MaCoCu web corpora for Albanian (http://hdl.handle.net/11356/1804), Catalan (http://hdl.handle.net/11356/1837), Greek (http://hdl.handle.net/11356/1839), Icelandic (http://hdl.handle.net/11356/1805), Turkish (http://hdl.handle.net/11356/1802) and Ukrainian (http://hdl.handle.net/11356/1838), and the CLASSLA-web corpora for Bosnian (http://hdl.handle.net/11356/1927), Bulgarian (http://hdl.handle.net/11356/1928), Croatian (http://hdl.handle.net/11356/1929), Macedonian (http://hdl.handle.net/11356/1932), Montenegrin (http://hdl.handle.net/11356/1930), Serbian (http://hdl.handle.net/11356/1931), and Slovenian (http://hdl.handle.net/11356/1882). The CLASSLA-web corpora are a cleaned-up subset of the MaCoCu web corpora. During the creation of the MaCoCu-Genre corpora, the CLASSLA-web post-processing was applied to the other MaCoCu corpora as well: removal of paragraphs in a non-target language and removal of short texts (fewer than 75 words). The X-GENRE classifier (http://hdl.handle.net/11356/1961) was used for automatic annotation with genre labels. The model classifies texts into one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion, and Other. Texts classified with a prediction confidence below 0.8 were assigned the label Mix (refer to the provided README file for details on the labels). The classifier is based on the multilingual XLM-RoBERTa Transformer-based model (https://huggingface.co/FacebookAI/xlm-roberta-base), and was shown to provide high classification performance when evaluated on 9 languages included in the MaCoCu-Genre corpora (macro-F1 scores between 0.80 and 0.95). High prediction accuracy is also expected for the remaining four languages (Bosnian, Bulgarian, Montenegrin, and Serbian), as they are closely related to Croatian and Macedonian, for which the model has demonstrated strong performance. The MaCoCu-Genre corpora are available in the JSONL format, where each text is accompanied by the following metadata: id (document id from the original web corpus), title, url, domain, tld (top-level domain, e.g., "com"), and genre. Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
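As an illustration of the genre-based subcorpus creation mentioned in this record, the following is a minimal sketch of filtering JSONL records by their genre label. The metadata keys (id, title, url, domain, tld, genre) follow the description above; the "text" key, the sample records, and the helper names are assumptions made for this example, so the README should be consulted for the actual layout.

```python
import json
from collections import Counter

# Hypothetical sample records. The "text" key is an assumption; the other
# keys follow the metadata fields listed in the record description.
SAMPLE_JSONL = """\
{"id": "doc-1", "title": "A", "url": "https://example.com/a", "domain": "example.com", "tld": "com", "genre": "News", "text": "First text."}
{"id": "doc-2", "title": "B", "url": "https://example.org/b", "domain": "example.org", "tld": "org", "genre": "Mix", "text": "Second text."}
{"id": "doc-3", "title": "C", "url": "https://example.com/c", "domain": "example.com", "tld": "com", "genre": "News", "text": "Third text."}
"""


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def select_genre(records, genre):
    """Keep only the records whose genre label equals the requested one."""
    return [record for record in records if record.get("genre") == genre]


records = list(iter_records(SAMPLE_JSONL.splitlines()))
news = select_genre(records, "News")
genre_counts = Counter(record["genre"] for record in records)
```

For a real corpus file, the same generator could be fed an open file handle instead of the in-memory sample, which avoids loading the whole corpus at once.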

    Slovenian manuscript sermons by Ignacij Holzapfel 1.0

    This corpus consists of editions of three volumes of sermons written by Ignatius Holzapfel (1799-1866) while he was active as a parish priest in Črnomelj and Ribnica. The bulk of Holzapfel's manuscript legacy remained unpublished. Holzapfel carefully prepared his Sunday and feast day homilies as written drafts in numbered volumes. From his legacy we have diplomatically transcribed volumes 88, 90 and 91. In total, more than 220 pages of manuscript have been transcribed and published in three electronic editions. The main text is written entirely in Slovenian. The titles of the sermons are in Latin and partly in German. The passages in these foreign languages are tagged accordingly via @xml:lang. The text of approximately 200 pages of Holzapfel's manuscript was recognised using the Transkribus web service (https://www.transkribus.org/) and then manually corrected and annotated. Each volume is stored in a separate XML document, marked up according to the Text Encoding Initiative (TEI) Guidelines.
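The @xml:lang tagging described in this record can be queried with a standard XML parser. The sketch below uses Python's built-in ElementTree on a made-up TEI fragment; the element structure, the language codes ("sl", "la"), and the sample heading are assumptions for illustration only and are not taken from the actual editions.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up TEI fragment; the real documents may be structured differently.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text xml:lang="sl">
    <body>
      <div>
        <head xml:lang="la">Dominica prima Adventus</head>
        <p>Placeholder for a Slovenian sermon paragraph.</p>
      </div>
    </body>
  </text>
</TEI>"""


def passages_by_lang(root, lang):
    """Return the text of every element explicitly tagged with the given xml:lang."""
    return [
        "".join(element.itertext()).strip()
        for element in root.iter()
        if element.get(XML_LANG) == lang
    ]


root = ET.fromstring(SAMPLE_TEI)
latin_passages = passages_by_lang(root, "la")
```

Note that this only matches elements carrying the attribute themselves; it does not resolve the language inherited from an ancestor element.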

    Overview of inflectional paradigms in Slovenian

    The overview gives a comprehensive account of the inflectional features associated with specific endings. Each ending has a dedicated row in the table and is exemplified by a word in the relevant form. Each feature has its own column, with a value of 1 indicating that the ending expresses the feature and 0 indicating that it does not.
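A table with one 0/1 column per feature can be decoded into a list of features per ending. The following is a minimal sketch under several assumptions: the CSV layout, the column names (written here in a Universal Dependencies style), and the two sample rows are hypothetical and do not reproduce the actual table.

```python
import csv
import io

# Hypothetical excerpt: one row per ending, an example word form,
# and one 0/1 column per feature. Column names are assumptions.
SAMPLE_TABLE = """ending,example,Case=Gen,Number=Sing,Gender=Fem
-e,lipe,1,1,1
-a,mesta,1,1,0
"""


def active_features(row, meta_columns=("ending", "example")):
    """List the feature columns whose value is 1 for this ending."""
    return [
        name
        for name, value in row.items()
        if name not in meta_columns and value == "1"
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_TABLE)))
decoded = {row["ending"]: active_features(row) for row in rows}
```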

    Monitor corpus of Slovene Trendi 2024-04

    The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 73 publishers. Trendi 2024-04 covers the period from January 2019 to April 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from April 2024.

    Monitor corpus of Slovene Trendi 2024-02

    The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2024-02 covers the period from January 2019 to February 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from February 2024.

    Monitor corpus of Slovene Trendi 2024-05

    The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 73 publishers. Trendi 2024-05 covers the period from January 2019 to May 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from May 2024.

    Dataset of Slovene medical texts PoVeJMo-VeMo-Med 1.0

    PoVeJMo-VeMo-Med is a dataset containing Slovene medical texts. The bulk of it consists of instructions for use for various prescription drugs. The texts were extracted from the Slovene Central Drug Database (Centralna baza zdravil; http://www.cbz.si/), with a minority of documents from the National Institute of Public Health (Nacionalni inštitut za javno zdravje; https://nijz.si/). The documents were converted from PDF files to text format. The dataset can be used to fine-tune large language models for the medical domain. Version 1.0 contains two subversions of the corpus: the original (with 17,701 texts) and the deduplicated version (with 5,841 texts), in which duplicate texts have been removed. Please note that this dataset was also the basis for the automatic generation of the Slovene instruction-following dataset for large language models GaMS-Instruct-MED 1.0 (http://hdl.handle.net/11356/1982). For more information on how the two are related, please consult the entry for GaMS-Instruct-MED 1.0.
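The record does not say how the duplicate texts were detected. As one possible approach, the sketch below removes exact duplicates after whitespace normalization by hashing each text; this is an assumption for illustration, not a description of the procedure actually used for the dataset, and the sample strings are made up.

```python
import hashlib


def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def deduplicate(texts):
    """Keep the first occurrence of each text, comparing by normalized hash."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# Made-up sample: the second item differs from the first only in whitespace.
sample = [
    "Navodilo za uporabo A.",
    "Navodilo  za uporabo A.\n",
    "Navodilo za uporabo B.",
]
unique_texts = deduplicate(sample)
```

Near-duplicates that differ in wording would not be caught by this exact-match approach and would need a similarity-based method instead.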

    Documents on Magdalena Gornik in mid-19th century manuscripts 1.0

    The document contains a diplomatic transcription of more than 285 pages of manuscript documents about the Slovenian mystic Magdalena Gornik (1835-1896) from the village of Gora near Sodražica. The vast majority of the documents were written or transcribed by the Franciscan Tobija Vernik (1801-1886) contemporaneously with the events of 1849-1860. The full text is published as an electronic edition together with digital facsimiles of the manuscript. The transcription has been consistently checked against the manuscript. The annotations of personal names, place names and dates have not been fully checked for consistency. The XML file is marked up according to the Text Encoding Initiative (TEI) Guidelines.

    Dataset of Slovene word formation trees ArboSloleks 1.0

    ArboSloleks is a dataset containing Slovene word formation trees that have been automatically constructed from word relations (http://hdl.handle.net/11356/1986) extracted from Sloleks 2.0 (http://hdl.handle.net/11356/1230). Each word formation tree begins with a root lexeme from Sloleks (e.g. abolicionizem); morphologically related lexemes are then listed in pairs (original lexeme, related lexeme) along with the levels of word formation (e.g. abolicionizem – abolicionist (Level 1); abolicionist – abolicionistka (Level 2)). Version 1.0 includes 14,918 word formation trees constructed from 66,360 lexeme pairs. It is available in an ad-hoc .txt format; for information on the structure and how to parse the data, please consult 00README.txt.
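The relationship between lexeme pairs and word formation levels can be sketched as a breadth-first walk from the root lexeme. The example below uses the pairs quoted in the record; the in-memory pair list, the function names, and the convention of placing the root at level 0 are assumptions, and the code does not parse the actual .txt format (see 00README.txt for that).

```python
from collections import defaultdict, deque

# Pairs quoted in the record description: (original lexeme, related lexeme).
PAIRS = [
    ("abolicionizem", "abolicionist"),
    ("abolicionist", "abolicionistka"),
]


def build_tree(pairs):
    """Map each original lexeme to the lexemes formed from it."""
    children = defaultdict(list)
    for original, related in pairs:
        children[original].append(related)
    return children


def levels_from_root(root, children):
    """Assign each lexeme its distance from the root (root = 0, an assumption)."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in levels:
                levels[child] = levels[current] + 1
                queue.append(child)
    return levels


tree = build_tree(PAIRS)
levels = levels_from_root("abolicionizem", tree)
```

With this convention, the level of a related lexeme matches the level attached to the pair that introduces it in the description (Level 1 for abolicionist, Level 2 for abolicionistka).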
