Common Language Resources and Technology Infrastructure - Slovenia

840 research outputs found

Slovene instruction-following dataset for large language models GaMS-Instruct-MED 1.0

Author: Tovornik Robert
Pavlović Anđela
Plesnik Emil
Fabjan Borut
Publication venue: Faculty of Computer and Information Science, University of Ljubljana
Publication date: 25/09/2024

GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of prompt-response pairs from the field of medicine, particularly those pertaining to the use of pharmaceutical drugs and medications. The dataset was generated in several steps. After consultation with medical experts, a series of prompts was manually compiled containing questions of interest in the context of drug and medication use. For each medication in the PoVeJMo-VeMo-Med 1.0 dataset (http://hdl.handle.net/11356/1983), approximately 10-15 questions were automatically generated using prompt tuning. The questions followed the context of the instructions for use of the medication in question. Inadequate questions were manually excluded, while the responses were generated entirely automatically using a specialized RAG system. Please note that the current version of the dataset (containing 18,897 prompt-response pairs) does not guarantee clinical accuracy and may contain errors as a consequence of LLM hallucinations.

List of word relations from the Sloleks 2.0 lexicon 1.1

Author: Čibej Jaka
Arhar Holdt Špela
Krek Simon
Publication venue: Jožef Stefan Institute
Publication date: 07/11/2024

This entry consists of a TSV file containing a list of 66,347 Slovene word pairs from the Sloleks Morphological Lexicon of Slovene (v2.0; http://hdl.handle.net/11356/1230) that have been automatically identified as morphologically related according to a number of manually designed morphological relation rules (e.g. "dež" -> "deževen", "pisati" -> "pisatelj", "prijatelj" -> "prijateljica").

Each line in the list contains the following columns:
- original lemma (e.g. "pisati"),
- related lemma (e.g. "pisatelj"),
- original lemma, automatically deconstructed into individual word parts (e.g. "pis_ati"),
- related lemma, automatically deconstructed into individual word parts (e.g. "pis_at_elj"),
- MTE-6 lexical features of the original lemma (e.g. "G"),*
- MTE-6 lexical features of the related lemma (e.g. "Som"),*
- ID of the original lemma from Sloleks 2.0,
- ID of the related lemma from Sloleks 2.0,
- the overlapping or central part (common to both the original and the related lemmas; e.g. "pis"),
- the ID of the morphological relation rule used to identify the relation (e.g. "G.Som.5.2.1"),
- the morphological relation rule (e.g. "[G]_ati -> [G]_at_elj").

* MTE-6 refers to MULTEXT-East Version 6 morphosyntactic specifications for Slovenian, available at http://nl.ijs.si/ME/V6/

Each rule constitutes a pattern to form a morphological relation. For instance, "[G]_ati -> [G]_at_elj" indicates that a verb (G) ending with the word part "ati" is related to the lemma formed by replacing "_ati" with "_at_elj".

Note that the list contains no proper nouns. It also contains no relations for 38 morphological rules that are included in the hierarchy of rules (listed in the accompanying file nssss_sloleks_word_relation_rules.tsv) but that depend on additional rules not yet implemented in the current version of the extraction process (such as irregular conversions in overlapping word parts: "gri_sti" - "griz_enj_e", "sneg" - "snež_ak").

Version 1.1 also contains manual evaluation scores for approximately 5,000 pairs, which were sampled in a stratified manner (by rule). The pairs were reviewed by a linguist and assigned one of three scores (0 - inadequate; 1 - acceptable; 2 - adequate).
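The column layout and rule notation described in this entry can be illustrated with a minimal Python sketch. It is not the authors' extraction code: the field names, the placeholder lemma IDs ("ID_A", "ID_B"), and the assumption that a rule is a plain suffix replacement of the form "[POS]_x -> [POS]_y" are all illustrative, and the sketch ignores the evaluation scores added in version 1.1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordRelation:
    # Field order mirrors the 11 columns listed in the description;
    # the field names themselves are assumptions made for this sketch.
    original_lemma: str
    related_lemma: str
    original_parts: str
    related_parts: str
    original_msd: str
    related_msd: str
    original_id: str
    related_id: str
    overlap: str
    rule_id: str
    rule: str


def parse_relation(line: str) -> WordRelation:
    """Split one tab-separated line into its 11 described columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    return WordRelation(*fields)


def apply_rule(rule: str, parts: str) -> Optional[str]:
    """Apply a suffix rule such as "[G]_ati -> [G]_at_elj" to "pis_ati".

    Returns None when the decomposed lemma does not end with the
    rule's source suffix.
    """
    source, target = (side.strip() for side in rule.split("->"))
    # Drop the bracketed part-of-speech prefix, e.g. "[G]".
    source_suffix = source.split("]", 1)[1]
    target_suffix = target.split("]", 1)[1]
    if not parts.endswith(source_suffix):
        return None
    return parts[: len(parts) - len(source_suffix)] + target_suffix


# Toy line built from the examples in the description; the two IDs are
# placeholders, not real Sloleks identifiers.
SAMPLE_LINE = "\t".join([
    "pisati", "pisatelj", "pis_ati", "pis_at_elj", "G", "Som",
    "ID_A", "ID_B", "pis", "G.Som.5.2.1", "[G]_ati -> [G]_at_elj",
])

relation = parse_relation(SAMPLE_LINE)
print(apply_rule(relation.rule, relation.original_parts))  # pis_at_elj
```

Working on the decomposed form ("pis_ati") rather than the bare lemma keeps the word-part boundaries visible, which is what the rule notation refers to.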

Trankit model for SST 2.15 1.1

Author: Krsnik Luka
Dobrovoljc Kaja
Terčon Luka
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 06/12/2024

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), which features transcriptions of spontaneous speech in various everyday settings. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological feature prediction, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). Please note that this model has been published for archiving purposes only. For production use, we recommend the state-of-the-art Trankit model available here: http://hdl.handle.net/11356/1965 (v1.2 or newer). The latter was trained on both spoken (SST) and written (SSJ) data, and demonstrates significantly higher performance than the model featured in this submission. In comparison with version 1.0, this model was trained on a new train-dev-test split of the SST treebank introduced in release UD v2.15.

Monitor corpus of Slovene Trendi 2023-12

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 10/01/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2023-12 covers the period from January 2019 to December 2023, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. This version adds texts from November to December 2023.

English translation of the Slovene Natural Language Inference Dataset SI-NLI-en 1.0

Author: Klemen Matej
Žagar Aleš
Čibej Jaka
Robnik-Šikonja Marko
Publication venue: Faculty of Computer and Information Science, University of Ljubljana
Publication date: 19/03/2024

SI-NLI-en is an English translation of the SI-NLI Slovene Natural Language Inference Dataset (http://hdl.handle.net/11356/1707). The English version was compiled by first using machine translation (DeepL) to translate all the premises and hypotheses from SI-NLI into English. The machine translations were then manually checked and corrected by a group of 7 students of translation at the University of Ljubljana. Each translator was given the Slovene premise and all its hypotheses together with their translations, so the translations were not checked in isolation, but as units to ensure maximum semantic coherence. Just like SI-NLI, SI-NLI-en contains 5,937 sentence pairs (premise and hypothesis) that are manually labeled with the labels "entailment", "contradiction", and "neutral". The dataset is split into train, validation, and test sets of 4,392, 547, and 998 pairs, respectively. The dataset is released in a tabular TSV format. The 00README.txt file contains a description of the attributes. Only the hypothesis and premise are provided in the test set (with no annotations) since SI-NLI-en is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.

Database of the Western South Slavic Verb HyperVerb 2.0 -- WeSoSlav

Author: Arsenijević Boban
Gomboc Čeh Katarina
Marušič Franc Lanko
Milosavljević Stefan
Mišmaš Petra
Simić Jelena
Simonović Marko
Žaucer Rok
Publication venue: University of Nova Gorica
Publication date: 10/12/2024

The Western South Slavic verbal database (WeSoSlaV) contains the 3,000 most frequent Slovenian and the 5,300 most frequent BCMS verbs, all coded for a number of properties ranging from their phonology and morphology to their semantic and syntactic properties. The database is a table in which each verb has a row of its own and the coded properties are organized in columns. This database contains updated annotations from Marušič et al. (2022), Milosavljević et al. (2023), and Arsenijević et al. (2024), plus an extra coded property: “Imperfective aspect”. The description of the database in the PDF file (WeSoSlaV-description.pdf) is a chapter draft from Arsenijević et al. (in preparation).

References
- Arsenijević, Boban, Franc Lanko Marušič, Stefan Milosavljević, Petra Mišmaš, Marko Simonović & Rok Žaucer. in preparation. Hyperspacing the Verb: The interplay between prosody, morphology, syntax and semantics in the Western South Slavic verbal domain. Ms. University of Graz, University of Nova Gorica.
- Arsenijević, Boban, Franc Lanko Marušič, Stefan Milosavljević, Petra Mišmaš, Marko Simonović & Rok Žaucer. 2024. Database of the Western South Slavic Verb HyperVerb (WeSoSlaV) – Deverbal Nominalizations. Dataset. Zenodo. DOI:10.5281/zenodo.14230589.
- Marušič, Franc Lanko, Rok Žaucer, Petra Mišmaš, Boban Arsenijević, Marko Simonović, Stefan Milosavljević, Katarina Gomboc Čeh & Jelena Simić. 2022. Database of the Western South Slavic Verb HyperVerb 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1683.
- Milosavljević, Stefan, Petra Mišmaš, Marko Simonović, Boban Arsenijević, Katarina Gomboc Čeh, Franc Lanko Marušič, Jelena Simić & Rok Žaucer. 2023. Database of the Western South Slavic Verb HyperVerb – derivation. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1855.

Offensive language dataset of French comments FRENK-fr 1.0

Author: Pahor de Maiti Tekavčič Kristina
Ljubešić Nikola
Fišer Darja
Publication venue: Institute of Contemporary History
Publication date: 27/05/2024

The FRENK-fr dataset contains French socially unacceptable and acceptable comments posted in response to news articles on the topics of LGBT and migrants, which were published on Facebook by prominent French media outlets (20 minutes, Le Figaro and Le Monde). The original thread order of comments, based on the time of publishing, is preserved in the dataset. The comments were manually annotated for the type and target of socially unacceptable discourse. The creation process, including data collection, filtering, annotation schema and annotation procedure, was adopted from the FRENK 1.1 dataset (http://hdl.handle.net/11356/1462), which makes FRENK-fr fully comparable to the datasets of Croatian, English and Slovenian comments included in FRENK 1.1. Apart from the manual annotation of the type and target of socially unacceptable discourse, the comments are accompanied by metadata, namely the topic of the news item (LGBT or migrants) that triggered the comment, the news item itself and the media outlet authoring it, an anonymised user ID, and information about the reply level in the thread. The dataset consists of 10,239 Facebook comments posted under 66 news items. It includes 3,071 comments that were labelled as socially unacceptable, and 7,168 that were labelled as socially acceptable.

Comprehensive Slovenian-Hungarian Dictionary 2.0

Author: Kosem Iztok
Bálint Čeh Júlia
Ponikvar Primož
Zaranšek Petra
Kamenšek Urška
Koša Peter
Gróf Annamária
Böröcz Nándor
Harmat Császár Jolanda
Szíjártó Imre
Šantak Borut
Gantar Polona
Krek Simon
Roblek Rebeka
Zgaga Karolina
Logar Urban
Pori Eva
Arhar Holdt Špela
Gorjanc Vojko
Šešet Jure
Potoczky Klára
Laskowski Cyprian
Bombek Miha
Dragar Luka
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 04/04/2024

The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University of Ljubljana (CJVT UL). Version 2.0 contains 15,362 headwords, 61,190 translations, 28,748 collocations and other word combinations, and 7,741 examples. The file also contains links between synonymous entries or entry senses, and links between single-word headwords and compounds/phrases. The dictionary is a growing dictionary, which means that new headwords will be added at regular intervals. It is based on a concept (Kosem et al. 2018) that was prepared in the targeted research project KOMASS (the Concept of Hungarian-Slovenian dictionary: from a language resource to its user), funded by the Slovenian Research Agency and the Ministry of Education, Science and Sport of the Republic of Slovenia. The dictionary concept follows state-of-the-art international lexicographic practice, e.g. that of bilingual dictionaries compiled at established international publishers and institutes. In the second version, nearly 5,000 entries have been added, and some corrections have been made to existing ones. Moreover, additional metadata has been included, e.g. lemma and tags for headwords and collocations, and statistical and syntactic structure information on collocations. The contact person for dictionary-related questions is Iztok Kosem ([email protected]).

Albanian Spoken Corpus in Kosovo 1.0

Author: Wasserscheidt Philipp
Rugova Bardh
Baftiu Adelajda
Publication venue: University of Prishtina "Hasan Prishtina"
Publication date: 08/07/2024

This is the third version of a spoken corpus of Albanian in Kosovo. The corpus data are based on short life stories of 212 informants out of a sample of 1,800 speakers balanced across all regions of Kosovo and the categories of gender, age and education. In addition, metadata such as place of birth, place of residence, L1, L2, age group and occupation were collected. The audio data were recorded in 2019 by students from the University of Prishtina. The speech files can be made available on request from one of the authors and will be made publicly available after the finalisation of the transcription in the next version. The transcription was carried out partly at Humboldt-Universität zu Berlin and partly at the University of Prishtina. The transcription is diplomatic (using the standard alphabet but transcribing relevant phonological realisations). It partly follows typical rendering of Gheg dialectal words and uses the HIAT system. The data were annotated using Timofey Arkhangelsky's Uniparser-albanian-grammar (https://bitbucket.org/timarkh/uniparser-albanian-grammar), keeping only non-ambiguous values. A list of tags used in the parser can be found here: http://albanian.web-corpora.net. The data are in CoNLL-U format. This version of the corpus contains the data of 212 speakers aged between 11 and 80, mainly from the regions of Ferizaj, Gjilan, Kaçanik, Mitrovicë, Podujevë, Rahovec and Shtërpcë. Compared with the previous version, this version corrects several errors in the metadata.

Dependency tree extraction tool STARK 3.0

Author: Krsnik Luka
Dobrovoljc Kaja
Robnik-Šikonja Marko
Publication venue: CLARIN.SI
Publication date: 26/07/2024

STARK is a highly customizable tool designed for extracting different types of syntactic structures (trees) from parsed corpora (treebanks), aimed at corpus-driven linguistic investigations of syntactic and lexical phenomena of various kinds. It takes a treebank in the CoNLL-U format as input and returns a list of all relevant dependency trees with frequency information and other useful statistics, such as the strength of association between the nodes of a tree, or its significance in comparison to another treebank. For installation, execution and the description of various user-defined parameter settings, see the official project page at: https://github.com/clarinsi/STARK. An online demo version of the tool is available at: https://orodja.cjvt.si/stark/. In comparison to v2, this version introduces several new features and improvements, such as the ability to extract very long trees, ignore irrelevant relations, process multi-root treebanks, or handle special operators when querying.
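To give a rough idea of the kind of output described here, the following Python sketch counts the simplest possible trees, single head-dependent pairs, in CoNLL-U input. It does not use STARK or reproduce its algorithm or options; the function names and the two toy sentences are invented for illustration, and only the standard 10-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) is assumed.

```python
from collections import Counter


def read_sentences(conllu_text):
    """Yield each sentence as a list of 10-column token rows."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment lines
        cols = line.split("\t")
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        sentence.append(cols)
    if sentence:
        yield sentence


def count_head_dependent_pairs(conllu_text):
    """Count (head UPOS, relation, dependent UPOS) triples."""
    counts = Counter()
    for sentence in read_sentences(conllu_text):
        upos_by_id = {cols[0]: cols[3] for cols in sentence}
        for cols in sentence:
            head, deprel = cols[6], cols[7]
            if head == "0":
                continue  # the root token has no head inside the sentence
            counts[(upos_by_id[head], deprel, cols[3])] += 1
    return counts


def row(token_id, form, lemma, upos, head, deprel):
    """Build one CoNLL-U token line, leaving unused columns as "_"."""
    return "\t".join(
        [token_id, form, lemma, upos, "_", "_", head, deprel, "_", "_"]
    )


# Toy two-sentence treebank, invented for this sketch.
SAMPLE = "\n".join([
    "# text = Pes laja",
    row("1", "Pes", "pes", "NOUN", "2", "nsubj"),
    row("2", "laja", "lajati", "VERB", "0", "root"),
    "",
    "# text = Mačka spi",
    row("1", "Mačka", "mačka", "NOUN", "2", "nsubj"),
    row("2", "spi", "spati", "VERB", "0", "root"),
    "",
])

for tree, frequency in count_head_dependent_pairs(SAMPLE).most_common():
    print(tree, frequency)
```

Larger trees, association scores, and comparison against a second treebank go well beyond this sketch; for those, the project page linked in the description is the place to look.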

5 full texts, 840 metadata records. Updated in last 30 days.