Common Language Resources and Technology Infrastructure - Slovenia


840 research outputs found


The Sarajevo Corpus of SMS Messages in Bosnian 1.0

Author: Wasserscheidt Philipp
Bulić Halid
Durmišević Elma
Hodžić-Čavkić Azra
Bajraktarević Enisa
Ahmetspahić-Peljto Azra
Šabić Belmin
Publication venue: University of Sarajevo – Faculty of Philosophy
Publication date: 17/04/2024

This corpus is specialized, static (i.e., no future growth is planned) and diachronic, covering the period from 2002 to 2022. The SMS messages included in this corpus were obtained from voluntary donors (informants). Both senders and recipients of the messages are Bosnian speakers, who are diverse in age, education, occupation, place of origin and countries of long-term residence. The Sarajevo Corpus of SMS Messages in Bosnian was originally published by the University of Sarajevo – Faculty of Philosophy as an electronic book. The second phase of the work involved compiling the SMS messages into a corpus and annotating them linguistically. The annotation was done with the CLASSLA package (https://github.com/clarinsi/classla), version 2.1, with language = Serbian and type = nonstandard, for tokenization, lemmatization and morpho-syntactic tagging (both MULTEXT-East and Universal Dependencies).

Monitor corpus of Slovene Trendi 2024-01

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 06/02/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2024-01 covers the period from January 2019 to January 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies (https://universaldependencies.org/sl/) and Named Entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. This version adds texts from January 2024.

Bulgarian web corpus CLASSLA-web.bg 1.0

Author: Ljubešić Nikola
Rupnik Peter
Kuzman Taja
Publication venue: Jožef Stefan Institute
Publication date: 26/03/2024

The Bulgarian web corpus CLASSLA-web.bg 1.0 is based on the MaCoCu-bg 2.0 web corpus crawl (http://hdl.handle.net/11356/1800), which was additionally cleaned and enriched with linguistic and genre information. The CLASSLA-web.bg corpus is a part of the South Slavic CLASSLA-web corpus collection, which is the first collection of comparable corpora that encompasses the entire South Slavic language group. The MaCoCu-bg 2.0 crawl was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, as well as extending the crawl dynamically to other domains. During the development of CLASSLA-web corpora, the MaCoCu web crawls were cleaned by removing paragraphs that are not in the target language, and by removing very short texts (less than 75 words or consisting only of paragraphs shorter than 70 characters). The corpus was also linguistically annotated with the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). The linguistic processing involved tokenization, morphosyntactic annotation, and lemmatization. Additionally, the corpus was automatically annotated with genres using the Transformer-based X-GENRE classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). The following genre categories are used: News, Information/Explanation, Promotion, Opinion/Argumentation, Instruction, Legal, Prose/Lyrical, Forum, Other and Mix. The corpus is available in vertical format, as used by Sketch Engine and CWB concordancers. Information is provided on the text-, paragraph-, sentence- and token-level. Each text is accompanied by the following metadata: text id, title, url, domain, top-level domain (tld, e.g., "com"), and predicted genre category. Each text is divided into paragraphs that are accompanied by the following metadata: paragraph id, the automatically identified language of the text in the paragraph, and paragraph quality. 
Quality labels, such as "short" or "good", are assigned based on paragraph length, URL and stopword density via the jusText tool (https://corpus.tools/wiki/Justext). Paragraphs are further divided into sentences, whose only metadata is the sentence id. Inside sentences, tokens are provided in tabular format with their linguistic annotation. Details about the structural and positional attributes are also given in the accompanying registry file, which was used to install the corpus on the CLARIN.SI concordancers. Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material. (4) Write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. A JSONL version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The MaCoCu-Genre version comprises texts and metadata at the text level, including genre information, and is not linguistically annotated.
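
The length filter described in this record (dropping texts with fewer than 75 words, or texts consisting only of paragraphs shorter than 70 characters) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `keep_text`, the whitespace-based word count, and the representation of a text as a list of paragraph strings are all assumptions made for illustration.

```python
def keep_text(paragraphs, min_words=75, min_par_chars=70):
    """Return True if a text passes the length filter sketched here.

    `paragraphs` is a list of paragraph strings (an assumed representation).
    Words are counted by splitting on whitespace, which is an assumption;
    the record does not say how words were counted.
    """
    total_words = sum(len(p.split()) for p in paragraphs)
    if total_words < min_words:
        return False  # very short text
    if all(len(p) < min_par_chars for p in paragraphs):
        return False  # only short paragraphs, e.g. menus or link lists
    return True


long_text = ["word " * 80]                  # 80 words in one long paragraph
fragments = ["a b c d e f g h i j"] * 10    # 100 words, but every paragraph is short
print(keep_text(long_text), keep_text(fragments))
```

The two thresholds are kept as parameters so that the effect of other cut-offs on a sample can be checked quickly.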

Heritage Bosnian, Croatian, and Serbian spoken by Second Generation Speakers in Germany He-BCS-Ge

Author: Romić Daniel
Publication venue: University of Regensburg; University of Zurich
Publication date: 10/11/2024

The corpus documents the spoken language skills of second-generation Bosnian/Croatian/Serbian (BCS) speakers in Germany. It includes 15 covertly recorded interviews conducted between 2010 and 2013. The goal was to capture language use in as close to a 'natural' setting as possible, focusing on authentic communication within the 'BCS'-speaking community in Germany. All test subjects have the following characteristics: 1. Born in Germany or immigrated to Germany before the age of three; 2. Both parents monolingual 'BCS' speakers at the time of immigration; 3. No stay in Bosnia, Croatia or Serbia of longer than three months in the last five years; 4. (Almost) simultaneous acquisition of 'BCS' and German; 5. Formal instruction in German; 6. 'BCS' (Standard Štokavian) speaker. Description of the interview setting: 1. The search for respondents was designed to be individual rather than public, avoiding self-selection. 2. The choice of place and time was left to the interviewees in order to place them in an informal and familiar environment, simulating a natural situation. 3. The subjects were only informed about the actual purpose of the interview after the recording, thereby avoiding a focus on language. 4. The topics oscillated between binary yes/no answers and small talk sequences. The questions were supplemented by the presentation of four photos. 5. When meeting the subject, the interviewer immediately switched to 'BCS', and care was taken not to use code-switching stimuli throughout the interview.

Trankit model for SST 2.15

Author: Krsnik Luka
Dobrovoljc Kaja
Terčon Luka
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 04/09/2024

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/dev), which features transcriptions of spontaneous speech in various everyday settings. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). Please note that this model has been published for archiving purposes only. For production use, we recommend the state-of-the-art Trankit model available here: http://hdl.handle.net/11356/1965. The latter was trained on both spoken (SST) and written (SSJ) data, and demonstrates significantly higher performance than the model featured in this submission.

Knowledge-Enhanced Winograd Schema Challenge KE-WSC 1.0

Author: Žagar Aleš
Dobrovoljc Kaja
Munda Tina
Brglez Mojca
Košmrlj Lea
Podolski Tamara
Šardi Matic
Robnik-Šikonja Marko
Publication venue: Faculty of Computer and Information Science, University of Ljubljana
Publication date: 15/11/2024

Knowledge-Enhanced Winograd Schema Challenge KE-WSC is an upgraded version of the original WSC dataset. It includes the following extensions:
- Annotation of semantically or syntactically solvable examples: Some samples from the original dataset can be solved without deeper semantic processing due to the morphological richness of Slovene. For example, the sentence “Riba je pojedla črva. Bila je lačna.” requires only knowledge of gender, and no deep semantic processing, to infer that the fish was hungry and not the worm. To have a representative set of syntactic samples, we created 197 new examples by modifying the existing ones.
- Two-level knowledge ontology: We developed a hierarchical scheme to categorize the knowledge required to successfully solve a problem. In our analysis, we detected 9 high-level knowledge categories (social knowledge, psychological knowledge, etc.) and 37 lower-level, more nuanced knowledge categories (physical laws/the laws of nature, social roles, causal relationships, etc.).
- Semi-automatic explanation generation: Textual explanations were generated using GPT-4, followed by verification and correction by human annotators to ensure accuracy and clarity. For instance, a textual explanation for the sentence “Pokal ne gre v rjav kovček, ker je prevelik.” is “Če je nekaj preveliko, se ne prilega v manjši prostor.”.
- Translation to English: The finalized explanations were translated into English by a trained translator, enabling broader applicability.
- SPO triplet generation: Subject-Predicate-Object triplets were extracted using GPT-4 to highlight key semantic relationships within each example.
The dataset can be used to study knowledge explanation in models and enables knowledge-enhanced machine learning. It can be used to train classification or generative models. It comprises 601 training samples, 200 validation samples, and 200 test samples, and is released in a tabular TSV format. The README.txt file contains a description of the attributes. The test set labels are private, as the dataset is integrated into the SloBENCH evaluation framework (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others. References: Levesque, H., Davis, E., & Morgenstern, L. (2012, May). The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Parliamentary spoken corpus of Polish ParlaSpeech-PL 1.0

Author: Koržinek Danijel
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 01/02/2024

The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings available in the Polish part of the ParlaMint corpus, and the parliamentary recordings available from the Polish Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, which makes it simple to split long sentences further into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
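
The further segmentation of long sentences that word-level alignments make possible could look roughly like the sketch below. The `AlignedWord` structure, its field names, and the 10-second limit are hypothetical; the record does not specify how alignments are stored, so the dataset's own keys would need to be mapped onto this shape first.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    # Hypothetical structure: one word with its start and end time in seconds.
    text: str
    start: float
    end: float


def split_by_duration(words, max_seconds=10.0):
    """Greedily group aligned words into segments no longer than max_seconds.

    A single word longer than max_seconds still forms its own segment.
    """
    segments, current = [], []
    for word in words:
        # Close the current segment if adding this word would exceed the limit.
        if current and word.end - current[0].start > max_seconds:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


# Synthetic alignment: 25 one-second word slots.
words = [AlignedWord(f"w{i}", float(i), i + 0.9) for i in range(25)]
for segment in split_by_duration(words):
    print(segment[0].start, segment[-1].end, len(segment))
```

Cutting only at word boundaries keeps each audio slice aligned with a whole number of transcript words, which is the property an ASR training pipeline typically needs.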

"Choice of plausible alternatives" datasets in South Slavic dialects DIALECT-COPA

Author: Ljubešić Nikola
Kuzman Taja
Rupnik Peter
Milosavljević Stefan
Galant Nada
Benčina Sonja
Čibej Jaka
Publication venue: Jožef Stefan Institute
Publication date: 26/04/2024

The DIALECT-COPA datasets comprise Choice of Plausible Alternatives (COPA) datasets for three South Slavic dialects: (1) COPA-SL-CER for the Cerkno dialect of Slovenian, spoken in the Slovenian Littoral region, specifically from the town of Idrija; (2) COPA-HR-CKM for the Chakavian dialect of Croatian from the northern Adriatic, specifically from the town of Žminj; (3) COPA-SR-TOR for the Torlak dialect from southeastern Serbia, specifically from the town of Lebane. The datasets were translated from the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by native dialect speakers, following the XCOPA dataset translation methodology (https://arxiv.org/abs/2005.00333). A novelty in the DIALECT-COPA translation approach is that both English and the corresponding standard South Slavic language were available to the translator during the translation process. Each instance consists of a premise (My body cast a shadow over the grass), a question (What is the cause? / What happened as a result?), and two choices (The sun was rising; The grass was cut), with a label encoding which of the choices is more plausible according to the annotator or translator (The sun was rising). The datasets follow the same format as the Croatian COPA-HR dataset (http://hdl.handle.net/11356/1404), the Macedonian COPA-MK dataset (http://hdl.handle.net/11356/1687) and the Serbian COPA-SR dataset (http://hdl.handle.net/11356/1708). Each dataset is split into training (400 instances) and validation (100 instances) JSONL files. The test split (500 instances), which is usually a part of the COPA datasets, has been withheld and can be shared upon request. The reason for this is to prevent the inclusion of the test instances in the training data of future large language models, which would invalidate the benchmark measurements. The DIALECT-COPA datasets are published as part of the DIALECT-COPA shared task at the VarDial 2024 workshop, where they were used as gold data for evaluating the performance of large language models on South Slavic dialects (https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa).
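
As a rough illustration of working with the JSONL splits, the sketch below loads instances and scores a predictor. The field names (`premise`, `choice1`, `choice2`, `question`, `label`) and the 0-based label are assumptions based on the commonly distributed JSONL version of COPA; the record itself only states that the format matches COPA-HR, so the actual keys should be checked against the files.

```python
import json


def load_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]


def accuracy(instances, predict):
    """Share of instances for which predict(instance) equals the gold label."""
    correct = sum(1 for inst in instances if predict(inst) == inst["label"])
    return correct / len(instances)


# One instance built from the English example in the description above.
# The key names and the 0-based label are assumptions, not confirmed by the record.
sample = [
    '{"premise": "My body cast a shadow over the grass.", '
    '"choice1": "The sun was rising.", "choice2": "The grass was cut.", '
    '"question": "cause", "label": 0}',
]
instances = load_jsonl(sample)


def always_first(instance):
    """Trivial baseline that always picks the first choice."""
    return 0


print(accuracy(instances, always_first))
```

With a real file, `load_jsonl(open(path, encoding="utf-8"))` would take the place of the inline sample, and `predict` would wrap a model call.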

Macedonian web corpus CLASSLA-web.mk 1.0

Author: Ljubešić Nikola
Rupnik Peter
Kuzman Taja
Publication venue: Jožef Stefan Institute
Publication date: 25/03/2024

The Macedonian web corpus CLASSLA-web.mk 1.0 is based on the MaCoCu-mk 2.0 web corpus crawl (http://hdl.handle.net/11356/1801), which was additionally cleaned and enriched with linguistic and genre information. The CLASSLA-web.mk corpus is a part of the South Slavic CLASSLA-web corpus collection, which is the first collection of comparable corpora that encompasses the entire South Slavic language group. The MaCoCu-mk 2.0 crawl was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, as well as extending the crawl dynamically to other domains. During the development of CLASSLA-web corpora, the MaCoCu web crawls were cleaned by removing paragraphs that are not in the target language, and by removing very short texts (less than 75 words or consisting only of paragraphs shorter than 70 characters). The corpus was also linguistically annotated with the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). The linguistic processing involved tokenization, morphosyntactic annotation, and lemmatization. Additionally, the corpus was automatically annotated with genres using the Transformer-based X-GENRE classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). The following genre categories are used: News, Information/Explanation, Promotion, Opinion/Argumentation, Instruction, Legal, Prose/Lyrical, Forum, Other and Mix. The corpus is available in vertical format, as used by Sketch Engine and CWB concordancers. Information is provided on the text-, paragraph-, sentence- and token-level. Each text is accompanied by the following metadata: text id, title, url, domain, top-level domain (tld, e.g., "com"), and predicted genre category. Each text is divided into paragraphs that are accompanied by the following metadata: paragraph id, the automatically identified language of the text in the paragraph, and paragraph quality. 
Quality labels, such as "short" or "good", are assigned based on paragraph length, URL and stopword density via the jusText tool (https://corpus.tools/wiki/Justext). Paragraphs are further divided into sentences, whose only metadata is the sentence id. Inside sentences, tokens are provided in tabular format with their linguistic annotation. Details about the structural and positional attributes are also given in the accompanying registry file, which was used to install the corpus on the CLARIN.SI concordancers. Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material. (4) Write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. A JSONL version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The MaCoCu-Genre version comprises texts and metadata at the text level, including genre information, and is not linguistically annotated.
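
The vertical format described in this record (structural tags for texts, paragraphs and sentences, with one tab-separated token per line) can be read with a few lines of code. The sketch below is only an approximation: the tag names `text`, `p`, `s` and the glue tag `g`, the attribute names in the sample, and the three token columns are assumptions; the authoritative list of structural and positional attributes is the registry file mentioned above.

```python
import re

# Structural lines such as <text ...>, </s> or <g/>; tag names are assumed.
TAG_RE = re.compile(r'^<(/?)(text|p|s|g)\b(.*)>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_vertical(lines):
    """Yield (text_attributes, sentence_tokens) for every sentence.

    Each token is returned as a list of its tab-separated columns.
    """
    text_attrs, sentence = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        match = TAG_RE.match(line)
        if match:
            closing, name, rest = match.groups()
            if name == "text" and not closing:
                text_attrs = dict(ATTR_RE.findall(rest))
            elif name == "s" and not closing:
                sentence = []
            elif name == "s" and closing:
                yield text_attrs, sentence
                sentence = None
        elif line and sentence is not None:
            sentence.append(line.split("\t"))


# Invented sample; attribute names and token columns are placeholders.
sample = [
    '<text id="doc1" title="Example" url="https://example.mk/a" domain="example.mk" tld="mk" genre="News">',
    '<p id="doc1.1" lang="mk" quality="good">',
    '<s id="doc1.1.1">',
    'word1\tlemma1\tTAG1',
    '<g/>',
    'word2\tlemma2\tTAG2',
    '</s>',
    '</p>',
    '</text>',
]
for attrs, tokens in read_vertical(sample):
    print(attrs["genre"], [tok[0] for tok in tokens])
```

A generator is used so that a multi-gigabyte corpus file can be streamed sentence by sentence instead of being loaded into memory.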

Slovenian parliamentary corpus (1990-2022) siParl 4.0

Author: Pančur Andrej
Meden Katja
Erjavec Tomaž
Ojsteršek Mihael
Šorn Mojca
Blaj Hribar Neja
Publication venue: Institute of Contemporary History
Publication date: 05/06/2024

The siParl 4.0 corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of Slovenia from the 1st to the 8th legislative period 1992-2022, minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 8th legislative period 1996-2022, and minutes of the Council of the President of the National Assembly from the 2nd to the 8th legislative period 1996-2022. The corpus comprises over 13 thousand sessions, one million speeches and 230 million words. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate is in one directory, and each session in one file. Compared to the previous version 3.0, this version adds new data (minutes of the National Assembly of the Republic of Slovenia of the 8th legislative period) and corrects many errors. This item comprises the following datasets: 1. source DARAH-SI Parla-CLARIN encoded corpus in TEI format; 2. linguistically annotated Parla-CLARIN encoded corpus: tokenisation, MSD tagging, lemmatisation, Universal Dependencies features and syntactic parses, named entities; 3. automatically derived corpus in plain text with metadata on speeches; 4. automatically derived linguistically annotated corpus in CoNLL-U (Universal Dependencies) format with metadata on speeches; 5. automatically derived linguistically annotated corpus in vertical format used by CWB and Sketch Engine concordancers, together with the registry file as used on the CLARIN.SI concordancers.
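
For the CoNLL-U dataset listed above, a minimal reader might look like the sketch below. It relies only on the general CoNLL-U conventions (ten tab-separated columns, `#` comment lines, a blank line after each sentence); the sample sentence and its `sent_id` are invented for illustration, and how siParl encodes its speech metadata is not stated in the record, so that part is not modelled.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield (comments, tokens) for each sentence in CoNLL-U formatted lines.

    comments maps keys from '# key = value' lines to their values;
    tokens is a list of dicts keyed by the ten standard column names.
    """
    comments, tokens = {}, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:                      # a blank line closes a sentence
            if tokens:
                yield comments, tokens
            comments, tokens = {}, []
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            comments[key.strip()] = value.strip()
        else:
            tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if tokens:                            # file without a trailing blank line
        yield comments, tokens


# Invented example sentence, not taken from the corpus.
sample = [
    "# sent_id = demo.1",
    "# text = Hvala lepa.",
    "1\tHvala\thvala\tNOUN\t_\t_\t0\troot\t_\t_",
    "2\tlepa\tlep\tADJ\t_\t_\t1\tamod\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
]
for comments, tokens in read_conllu(sample):
    print(comments["sent_id"], [t["lemma"] for t in tokens])
```

Column values are kept as strings; converting `id` and `head` to integers is left out because multiword-token ranges such as `1-2` would need separate handling.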

5

full texts

840

metadata records

Updated in last 30 days.
