Common Language Resources and Technology Infrastructure - Slovenia

840 research outputs found

The "Mići Princ" text and speech dataset of Chakavian micro-dialects

Author: Ljubešić Nikola
Rupnik Peter
Perinčić Tea
Publication venue: Jožef Stefan Institute
Publication date: 05/03/2024

The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&Poke museum (http://skupnikatalog.nsk.hr/Record/nsk.NSK01001103632), both in the form of a printed book and an audio book. The printed book is a translation of Antoine de Saint-Exupéry's "Le Petit Prince". The translation was performed by Tea Perinčić and the following additional translators (almost every character in the book uses a different micro-dialect): Davina Ivašić, Annamaria Grus, Maria Luisa Ivašić, Marin Miš, Josip Fafanđel, Glorija Fabijanić Jelović, Vlasta Juretić, Anica Pritchard, Tea Rosić, Dino Marković, Ilinka Babić, Jadranka Ajvaz, Vlado Simičić Vava, Irena Grdinić, and Ivana Marinčić. The audio book was read by Zoran Prodanović Prlja, Davina Ivašić, Josip Fafanđel, Melita Nilović, Glorija Fabijanić Jelović, Albert Sirotich, Tea Rosić, Tea Perinčić, Dino Marković, Iva Močibob, Dražen Turina Šajeta, Vlado Simčić Vava, Ilinka Babić, Melita and Svetozar Nilović, and Ivana Marinčić. The master encoding of this "text and speech" dataset is available in the form of JSON files (MP_13.json for the thirteenth chapter of the book), where the text, the turn-level alignment, and the word-level alignment to the audio are available. This master encoding is available from the MP.json.tgz archive for the text and alignment part, with the audio part of the master encoding located in the MP.wav.tgz archive. Besides this master encoding, an encoding focused on automatic speech recognition (ASR) testing and adaptation is available as well. Chapters 13 and 15 have been selected as testing data, and the text and audio reference files MP_13.asr.json and MP_15.asr.json contain segments split by speaker turns. The remainder of the dataset has been prepared in segments of up to 20 seconds, a length well suited to training or fine-tuning current ASR systems.
The text and audio reference data are available in the MP.asr.json.tgz archive, while the audio data are available in the form of MP3 files in the MP.mp3.tgz archive. The dataset also includes an encoding for the Exmaralda speech editor (https://exmaralda.org), one file per chapter (MP_13.exb for the thirteenth chapter), available from the MP.exb.tgz archive. The wav files from the MP.wav.tgz archive are required if speech data are to be available inside Exmaralda. Speaker information is available in the speakers.json file, each speaker having a textual and wikidata reference to the location of the micro-dialect, as well as the name of the translator in the printed book and the reader in the audio book. Fine-tuning the current (March 2024) state-of-the-art automatic speech recognition model for standard Croatian, whisper-v3-large, on this dataset (https://huggingface.co/classla/whisper-large-v3-mici-princ) reduces the word error rate from 35.43% to 16.83% and the character error rate from 11.54% to 3.95% (in-dataset test data, two seen speakers / micro-dialects, two unseen).
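For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The following is a minimal sketch under that assumption; the function names and the toy strings are illustrative and are not taken from the dataset or from the authors' evaluation scripts, which may apply additional text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Error reduction expressed as a fraction of the starting error."""
    return (before - after) / before
```

Applied to the figures reported above, `relative_reduction(35.43, 16.83)` corresponds to a relative word error rate reduction of a little over one half.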

Corpus of daily jokes from the 24ur.com portal Šale24 1.0

Author: Dobranić Filip
Publication venue: Institute of Contemporary History
Publication date: 03/10/2024

This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from their archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations. Readers of the news portal vote on the jokes themselves with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and presents a list of voting results for every instance of its publication. The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus. Based on the name ("Šala dneva" i.e. "Joke of the day") and the observed frequency of posting during September 2024, we assume each entry corresponds to a day, starting from the day of data collection and counting backwards. Each voting event has an associated estimated publication date calculated with the above algorithm. The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian.
The JSONL file contains entries representing individual jokes, each containing:
- a hash of the original joke text used for duplicate identification (key: hash)
- original scraped text (key: original_text)
- normalised text (key: normalised_text)
- linguistically annotated normalised text in CoNLL-U format (key: processed_text)
- a list of vote objects containing joke vote metadata (key: votes)
  - votes for (key: votes.for)
  - votes against (key: votes.against)
  - estimated dates of joke publication and voting (key: estimated_date)

The corpus contains 16658 sentences, 129063 tokens, and 662 recognised named entities.
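The normalisation, duplicate identification, and date estimation described for this corpus can be sketched as follows. This is an illustration only: the choice of SHA-256 is an assumption (the description does not name the hash function), and `estimated_date` assumes exactly one archive entry per day with position 0 being the entry for the collection date.

```python
import hashlib
from datetime import date, timedelta

COLLECTION_DATE = date(2024, 9, 18)  # scraping date stated in the description


def normalise_text(text: str) -> str:
    """Replace two consecutive apostrophes (U+0027) with one quotation mark (U+0022)."""
    return text.replace("''", '"')


def text_hash(text: str) -> str:
    """Hypothetical duplicate key; SHA-256 is an assumption, not the documented hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def estimated_date(position: int) -> date:
    """Estimated publication date of the archive entry at `position`,
    counting backwards one day per entry from the collection date."""
    return COLLECTION_DATE - timedelta(days=position)
```

Because jokes are matched on exact text, two entries receive the same key only if their scraped text is identical character for character.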

Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0

Author: Kopp Matyáš
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 24/07/2024

The ParlaSpeech-CZ dataset is built from the transcripts of parliamentary proceedings available in the Czech part of the ParlaMint corpus, and the parliamentary recordings available from the AudioPSP dataset (http://hdl.handle.net/11234/1-5404). The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key. Unlike other ParlaSpeech datasets, each instance in this dataset has an additional "sentence_id" key referring to the ParlaMint sentence ID, and an additional "id" key in the description of each word referring to the ParlaMint word ID. The original ParlaMint sentence and word segmentation was kept in this dataset because of a different, centralised processing approach. Additionally, the "audio_source" key is available, pointing at the original audio recording from the AudioPSP dataset.
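The further segmentation of long sentences mentioned above can be done greedily over the word-level alignments. The sketch below assumes each word is available as a (token, start, end) tuple with times in seconds; this layout and the function name are hypothetical and do not reflect the dataset's actual keys.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (token, start in s, end in s); hypothetical layout


def split_by_duration(words: List[Word], max_seconds: float) -> List[List[Word]]:
    """Greedily group consecutive words so each group spans at most max_seconds.

    A single word longer than max_seconds still forms a group of its own.
    """
    segments: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Close the current group if adding this word would exceed the limit.
        if current and word[2] - current[0][1] > max_seconds:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments
```

A greedy split keeps the code short but may cut at linguistically awkward points; splitting at the longest pause within a window is a common alternative.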

Multilingual comparable corpora of parliamentary debates ParlaMint 4.1

Author: Erjavec Tomaž
Kopp Matyáš
Ogrodniczuk Maciej
Osenova Petya
Agirrezabal Manex
Agnoloni Tommaso
Aires José
Albini Monica
Alkorta Jon
Antiba-Cartazo Iván
Arrieta Ekain
Barcala Mario
Bardanca Daniel
Barkarson Starkaður
Bartolini Roberto
Battistoni Roberto
Bel Nuria
Bonet Ramos Maria del Mar
Calzada Pérez María
Cardoso Aida
Çöltekin Çağrı
Coole Matthew
Darģis Roberts
de Libano Ruben
Depoorter Griet
Diwersy Sascha
Dodé Réka
Fernandez Kike
Fernández Rei Elisa
Frontini Francesca
Garcia Marcos
García Díaz Noelia
García Louzao Pedro
Gavriilidou Maria
Gkoumas Dimitris
Grigorov Ilko
Grigorova Vladislava
Haltrup Hansen Dorte
Iruskieta Mikel
Jarlbrink Johan
Jelencsik-Mátyus Kinga
Jongejan Bart
Kahusk Neeme
Kirnbauer Martin
Kryvenko Anna
Ligeti-Nagy Noémi
Ljubešić Nikola
Luxardo Giancarlo
Magariños Carmen
Magnusson Måns
Marchetti Carlo
Marx Maarten
Meden Katja
Mendes Amália
Mochtak Michal
Mölder Martin
Montemagni Simonetta
Navarretta Costanza
Nitoń Bartłomiej
Norén Fredrik Mohammadi
Nwadukwe Amanda
Ojsteršek Mihael
Pančur Andrej
Papavassiliou Vassilis
Pereira Rui
Pérez Lago María
Piperidis Stelios
Pirker Hannes
Pisani Marilina
Pol Henk van der
Prokopidis Prokopis
Quochi Valeria
Rayson Paul
Regueira Xosé Luís
Rii Andriana
Rudolf Michał
Ruisi Manuela
Rupnik Peter
Schopper Daniel
Simov Kiril
Sinikallio Laura
Skubic Jure
Tungland Lars Magne
Tuominen Jouni
van Heusden Ruben
Varga Zsófia
Vázquez Abuín Marta
Venturi Giulia
Vidal Miguéns Adrián
Vider Kadri
Vivel Couso Ainhoa
Vladu Adina Ioana
Wissik Tanja
Yrjänäinen Väinö
Zevallos Rodolfo
Fišer Darja
Publication venue: CLARIN ERIC
Publication date: 03/06/2024

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for the release 4.1 at https://github.com/clarin-eric/ParlaMint/tree/v4.1/Build/Metadata. The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution). This entry contains the ParlaMint TEI-encoded corpora and their derived plain text versions along with TSV metadata of the speeches. Also included is the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint.
Note that there also exists the linguistically marked-up version of the 4.1 ParlaMint corpus (http://hdl.handle.net/11356/1911) as well as a version machine translated to English (http://hdl.handle.net/11356/1910). Both are linked with CLARIN.SI concordancers for on-line analysis. As opposed to the previous version 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. The DK corpus now also has speeches marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11, where UA also has improved language marking (uk vs. ru) on segments.
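The subcorpus date boundaries given in the description ("reference" until 2020-01-30, "covid" from 2020-01-31, "war" from 2022-02-24) can be expressed as a small date check. The description does not say whether a sitting from 2022-02-24 onwards carries both the "covid" and "war" labels or only "war"; this sketch applies each stated condition literally and therefore returns both, which is an assumption rather than the documented ParlaMint behaviour.

```python
from datetime import date
from typing import List

COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpora(day: date) -> List[str]:
    """Return every subcorpus label whose stated date condition holds for `day`."""
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    if day >= WAR_START:
        labels.append("war")
    return labels
```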

Monitor corpus of Slovene Trendi 2024-09

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 04/10/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-09 covers the period from January 2019 to September 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies (https://universaldependencies.org/sl/) and Named Entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from September 2024.

News sentiment analysis datasets for Serbian, Bosnian, Macedonian, Albanian and Estonian SADEmma 1.0

Author: Ivačič Nikola
Pelicon Andraž
Koloski Boshko
Pollak Senja
Purver Matthew
Publication venue: Jožef Stefan Institute
Publication date: 13/11/2024

We provide annotated datasets on a three-point sentiment scale (positive, neutral and negative) for Serbian, Bosnian, Macedonian, Albanian, and Estonian. For all languages except Estonian, we include pairs of source URL (where corresponding text can be found) and sentiment label. For Estonian, we randomly sampled 100 articles from "Ekspress news article archive (in Estonian and Russian) 1.0" (http://hdl.handle.net/11356/1408). The data is organized in Tab-Separated Values (TSV) format. For Serbian, Bosnian, Macedonian, and Albanian, the dataset contains two columns: sourceURL and sentiment. For Estonian, the dataset consists of three columns: text ID (from the CLARIN.SI reference above), body text, and sentiment label
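Reading such a TSV file and counting the sentiment labels needs only the standard library. The sketch below assumes the files carry a header row with the column names given in the description (sourceURL, sentiment); the example.org URLs and the label spellings in the sample are invented placeholders, not dataset content.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_labels(handle: TextIO, label_column: str = "sentiment") -> Counter:
    """Count label values in a tab-separated file that has a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[label_column] for row in reader)


# Invented sample standing in for one of the two-column files.
sample = io.StringIO(
    "sourceURL\tsentiment\n"
    "https://example.org/a\tpositive\n"
    "https://example.org/b\tneutral\n"
    "https://example.org/c\tpositive\n"
)
label_counts = count_labels(sample)
```

For the Estonian file, which also holds body text, quoting behaviour may need adjusting (for instance `quoting=csv.QUOTE_NONE`) if the text contains unescaped quotation marks.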

CorefUD conversion of Slovene corpus for aspect-based sentiment analysis SentiCoref

Author: Klemen Matej
Žitnik Slavko
Publication venue: Faculty of Computer and Information Science, University of Ljubljana
Publication date: 17/11/2024

This corpus is the CorefUD conversion of the SentiCoref corpus for coreference resolution in Slovene contained within the SUK 1.1 collection of corpora (http://hdl.handle.net/11356/1959). SentiCoref contains 756 documents annotated with coreference information. Coreference in Universal Dependencies (CorefUD) is an initiative to collect coreference corpora in various languages and harmonize them to the same scheme and data format (CoNLL-U). The coreference information is stored in the MISC column. More concretely, the start and end of each coreference mention is marked with the "Entity=" attribute. For example, "Entity=(e0" marks the start of the entity e0 at the current token, while "Entity=e0)" marks the end of the entity e0 at the current token. For full details on the format, please see http://hdl.handle.net/11234/1-5478. To ensure compliance with the CoNLL-U format, the corpus was automatically annotated with trankit v1.1.2 to obtain universal part of speech tags (UPOS) and dependencies (head, dependency relation), while the remainder of annotations (lemmas, XPOS - MULTEXT-East V6, features) were copied from the SUK 1.1 resource. To enable implementation into the SloBENCH evaluation framework (https://slobench.cjvt.si/), we release the labeled SentiCoref corpus (training set) and an unlabeled test set. To prevent accidental data leaks, the test set labels are not publicly released, and are only indirectly accessible via the SloBENCH evaluation framework. In comparison to the original SentiCoref corpus, this corpus contains the same texts and coreference information in a different (more universal) format. Additionally, it contains 81 unlabeled private test set texts.
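The bracket notation described above can be turned into mention spans with a short scan over the MISC column. This sketch handles only the simplified forms quoted in the description ("(e0" to open, "e0)" to close, "(e0)" for a single-token mention); the full CorefUD format allows further fields inside the brackets, which this code does not parse. The function name is illustrative.

```python
import re
from typing import Dict, List, Tuple

_OPEN = re.compile(r"\((e\d+)")   # e.g. "(e0"
_CLOSE = re.compile(r"(e\d+)\)")  # e.g. "e0)"


def mention_spans(misc_values: List[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each entity ID to (start, end) token indices, one MISC string per token."""
    open_starts: Dict[str, List[int]] = {}
    spans: Dict[str, List[Tuple[int, int]]] = {}
    for index, misc in enumerate(misc_values):
        for attribute in misc.split("|"):
            if not attribute.startswith("Entity="):
                continue
            value = attribute[len("Entity="):]
            # Register openings first so "(e1)" yields a one-token span.
            for entity in _OPEN.findall(value):
                open_starts.setdefault(entity, []).append(index)
            for entity in _CLOSE.findall(value):
                starts = open_starts.get(entity)
                if starts:  # ignore a close that has no matching open
                    spans.setdefault(entity, []).append((starts.pop(), index))
    return spans
```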

Corpus extraction tool LIST 1.3

Author: Krsnik Luka
Arhar Holdt Špela
Čibej Jaka
Dobrovoljc Kaja
Ključevšek Aleksander
Krek Simon
Robnik-Šikonja Marko
Publication venue: Jožef Stefan Institute
Publication date: 28/08/2024

The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI P5 XML formats and outputs .CSV files that can be imported into Microsoft Excel or similar statistical processing software. Version 1.3 adds support for the KOST 2.0 Slovene Learner Corpus (http://hdl.handle.net/11356/1887) in XML format. It also allows program execution using the command line (see 00README.txt for details), and uses a later version of Java (tested using JDK 21). In addition, Windows users no longer need to have Java installed on their computers to run the program

Word association norms for Slovenian SWOW-SL 1.0

Author: Brglez Mojca
Vintar Špela
De Deyne Simon
Publication venue: Faculty of Arts, University of Ljubljana
Publication date: 05/11/2024

The word association norms for Slovenian SWOW-SL 1.0 contain words and their associations collected in the scope of the project "Mali svet besed", a Slovenian replication of the experiment "Small World of Words" (De Deyne et al. 2019, https://doi.org/10.3758/s13428-018-1115-7). The SWOW project (https://smallworldofwords.org/en/project) is a large-scale scientific study that aims to build a mental dictionary or lexicon by collecting free word associations to linguistic cues (words) from human online participants. SWOW-SL 1.0 contains free word associations for 1,000 different cues in Slovenian collected up to November 5, 2024. It includes all 19,898 responses collected online from more than 1,100 native Slovenian speakers, each providing up to 3 associations per given cue. The word association norms - the associative frequency and associative strength - comprise more than 37,000 unique cue-association pairs. The file SWOW-SL1.0_responses.tsv contains all collected responses, which are provided both in their original form as well as in two normalized forms (word-lemmatized, normalized). SWOW-SL1.0_participants.tsv contains participant metadata collected in the experiment, such as age and education. The file SWOW-SL1.0_statistics_normalized.tsv provides the aggregated word association norms, i.e. frequency statistics of all cue-association pairs on normalized responses, while SWOW-SL1.0_statistics_raw.tsv is based on raw, unprocessed responses. Additional information about the data and processing is provided in README.txt.
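As an illustration of how associative frequency and strength can be aggregated from individual responses, the sketch below counts each cue-association pair and divides by the total number of responses to that cue. Treating associative strength as this relative frequency is an assumption; the exact procedure used for SWOW-SL is documented in the dataset's README.txt. The function name and the input layout of (cue, response) pairs are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (cue, response)


def association_norms(pairs: Iterable[Pair]) -> Dict[Pair, Tuple[int, float]]:
    """Return {(cue, response): (frequency, strength)}.

    Strength is the pair frequency divided by all responses given to that cue.
    """
    frequency = Counter(pairs)
    per_cue: Dict[str, int] = defaultdict(int)
    for (cue, _response), count in frequency.items():
        per_cue[cue] += count
    return {
        pair: (count, count / per_cue[pair[0]])
        for pair, count in frequency.items()
    }
```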

Monitor corpus of Slovene Trendi 2024-10

Author: Kosem Iztok
Čibej Jaka
Dobrovoljc Kaja
Erjavec Tomaž
Ljubešić Nikola
Ponikvar Primož
Šinkec Mihael
Krek Simon
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 06/11/2024

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-10 covers the period from January 2019 to October 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies (https://universaldependencies.org/sl/) and Named Entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from October 2024.
