Common Language Resources and Technology Infrastructure - Slovenia
The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (research)
The OG2021 corpus contains multilingual news articles that are reporting on the events happening during the 2021 Tokyo Olympics. The data set was created to evaluate the clustering algorithm. The articles were initially acquired via the EventRegistry service, clustered using an online news clustering algorithm, and finally manually inspected and annotated by a single evaluator using translation services to understand the meaning of the articles' content.
The corpus consists of a single file called og2021.csv, which contains data on 10,940 news articles grouped into 1,350 clusters. Each article has the following attributes:
- id: The ID of the news article.
- title: The title of the article.
- body: The body of the article.
- lang: The language in which the article is written. Can be one of nine values.
- source: The news publisher's name.
- published_at: The date and time when the article was published. The published dates range between 2021-07-01 and 2021-08-14.
- URL: The URL location of the news article.
- cluster_id: The ID of the cluster the article is a member of.
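As a minimal sketch, the attributes listed above could be read with Python's standard csv module and grouped by cluster. It assumes that og2021.csv has a header row whose column names match the attribute names above, which the description does not confirm. The helper group_by_cluster and the sample rows are hypothetical and are not real corpus data.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(csv_file):
    """Map each cluster_id to the list of article rows that belong to it."""
    clusters = defaultdict(list)
    # Assumption: the file has a header row that includes a "cluster_id" column.
    for row in csv.DictReader(csv_file):
        clusters[row["cluster_id"]].append(row)
    return dict(clusters)


# Tiny made-up sample standing in for og2021.csv (not real corpus data).
SAMPLE = io.StringIO(
    "id,title,lang,cluster_id\n"
    "1,Opening ceremony,eng,c1\n"
    "2,Eröffnungsfeier,deu,c1\n"
    "3,Judo final,eng,c2\n"
)
clusters = group_by_cluster(SAMPLE)
# Languages that occur in each cluster, useful for inspecting multilingual clusters.
languages_per_cluster = {
    cluster_id: sorted({row["lang"] for row in rows})
    for cluster_id, rows in clusters.items()
}
```

For the real file, the same function could be called on a handle opened with open("og2021.csv", newline="", encoding="utf-8"); the encoding is likewise an assumption.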
Serbian web corpus CLASSLA-web.sr 1.0
The Serbian web corpus CLASSLA-web.sr 1.0 is based on the MaCoCu-sr 1.0 web corpus crawl (http://hdl.handle.net/11356/1807), which was additionally cleaned and enriched with linguistic and genre information. The CLASSLA-web.sr corpus is a part of the South Slavic CLASSLA-web corpus collection, which is the first collection of comparable corpora that encompasses the entire South Slavic language group.
The MaCoCu-sr 1.0 crawl was built by crawling the ".rs" and ".срб" internet top-level domains in 2021 and 2022, as well as extending the crawl dynamically to other domains. During the development of CLASSLA-web corpora, the MaCoCu web crawls were cleaned by removing paragraphs that are not in the target language, and by removing very short texts (less than 75 words or consisting only of paragraphs shorter than 70 characters). The corpus was also linguistically annotated with the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). The linguistic processing involved tokenization, morphosyntactic annotation, and lemmatization. Additionally, the corpus was automatically annotated with genres using the Transformer-based X-GENRE classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). The following genre categories are used: News, Information/Explanation, Promotion, Opinion/Argumentation, Instruction, Legal, Prose/Lyrical, Forum, Other and Mix.
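The length-based cleaning rule described above can be sketched as follows. This is an illustration rather than the actual CLASSLA-web cleaning code: it assumes that words are counted by splitting on whitespace and that paragraph length is measured in Python characters, and the function name is_too_short is hypothetical.

```python
def is_too_short(paragraphs, min_words=75, min_paragraph_chars=70):
    """Return True if a text would be dropped under the two length rules.

    A text is dropped if it has fewer than min_words words in total, or if
    every one of its paragraphs is shorter than min_paragraph_chars.
    """
    # Assumption: words are whitespace-separated tokens.
    total_words = sum(len(paragraph.split()) for paragraph in paragraphs)
    only_short_paragraphs = all(
        len(paragraph) < min_paragraph_chars for paragraph in paragraphs
    )
    return total_words < min_words or only_short_paragraphs


long_text = ["word " * 80]                  # 80 words in one long paragraph
headline_only = ["short one"]               # far fewer than 75 words
many_fragments = ["a b c d e f g h i j"] * 10  # 100 words, but every paragraph is short
```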
The corpus is available in vertical format, as used by Sketch Engine and CWB concordancers. Information is provided on the text, paragraph, sentence and token levels. Each text is accompanied by the following metadata: text id, title, url, domain, top-level domain (tld, e.g., "com"), and predicted genre category. Each text is divided into paragraphs that are accompanied by the following metadata: paragraph id, the automatically identified language of the text in the paragraph, and paragraph quality. Quality labels, such as "short" or "good", are assigned based on paragraph length, URL and stopword density via the jusText tool (https://corpus.tools/wiki/Justext). Paragraphs are further divided into sentences, whose only metadata is the sentence id. Inside sentences, tokens are provided in tabular format with their linguistic annotation. Details about the structural and positional attributes are also given in the accompanying registry file, which was used to install the corpus on the CLARIN.SI concordancers.
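A rough sketch of reading such a vertical file is given below. It relies only on the general shape of the vertical format (XML-like structural tags on their own lines, one tab-separated token per line). The attribute names and the three token columns in the sample are invented for illustration; the real ones are defined in the registry file. read_vertical is a hypothetical helper.

```python
import re

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_vertical(lines):
    """Yield (text_attributes, sentence_tokens) pairs from vertical-format lines."""
    text_attrs, tokens = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<text ") or line == "<text>":
            text_attrs = dict(ATTR_RE.findall(line))
        elif line.startswith("<s ") or line == "<s>":
            tokens = []  # a new sentence starts
        elif line == "</s>":
            if tokens is not None:
                yield text_attrs, tokens
            tokens = None
        elif line.startswith("<") and line.endswith(">"):
            continue  # other structural tags such as <p ...>, </p>, </text>, <g/>
        elif tokens is not None and line:
            tokens.append(line.split("\t"))


# Made-up sample; attribute names and token columns are assumptions.
SAMPLE = (
    '<text id="doc1" genre="News">\n'
    '<p id="doc1.1" quality="good">\n'
    '<s id="doc1.1.1">\n'
    "Dobar\tdobar\tA\n"
    "dan\tdan\tN\n"
    "<g/>\n"
    ".\t.\tZ\n"
    "</s>\n"
    "</p>\n"
    "</text>\n"
)
sentences = list(read_vertical(SAMPLE.splitlines()))
```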
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
A JSONL version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The MaCoCu-Genre version comprises texts and metadata at the text level, including genre information, and is not linguistically annotated.
Post-OCR correction training dataset sPeriodika-postOCR
The post-OCR correction dataset consists of paragraphs of text, at least 100 characters in length, extracted from documents randomly sampled from the sPeriodika dataset (http://hdl.handle.net/11356/1881) of Slovenian historical periodicals. From each document five paragraphs were randomly sampled. If the paragraph was longer than 500 characters, it was trimmed to that length. The correction was performed by one human annotator having access to the scan of the original document. Out of the original collection of 450 paragraphs, 41 were discarded due to non-running text or very bad quality of the OCR.
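The sampling procedure above can be illustrated with a short sketch. It is not the script used to build the dataset. It assumes that the 100-character minimum is applied before sampling and that trimming keeps the first 500 characters. The function name and the fixed seed are hypothetical.

```python
import random


def sample_paragraphs(paragraphs, k=5, min_len=100, max_len=500, seed=0):
    """Pick up to k paragraphs of at least min_len characters, trimmed to max_len."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Assumption: the length filter is applied before the random draw.
    eligible = [p for p in paragraphs if len(p) >= min_len]
    chosen = rng.sample(eligible, min(k, len(eligible)))
    return [p[:max_len] for p in chosen]


# Made-up document with paragraphs of 50 to 1000 characters.
document = ["x" * n for n in (50, 120, 600, 150, 200, 300, 1000)]
sampled = sample_paragraphs(document)
```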
The metadata in the CSV dataset are the following:
- URN of the document
- link to the original PDF in dLib
- name of the periodical
- publisher of the periodical
- publication date
- original text
- corrected text
- line offset (zero-indexed)
- character length of the paragraph (trimmed to max. 500 characters)
Albanian Spoken Corpus in Kosovo 0.2
This is the second version of a spoken corpus of Albanian in Kosovo.
The data of the corpus is based on short life stories of 212 informants out of a sample of 1,800 speakers balanced across all regions of Kosovo and the categories of gender, age and education. In addition, metadata such as place of birth, place of residence, L1, L2, age group and occupation were collected.
The audio data was recorded in 2019 by students from the University of Prishtina. The speech files can be made available on request from one of the authors. The speech files will be made publicly available after the finalisation of the transcription in the next version of the publication.
The transcription was carried out partly at Humboldt-Universität zu Berlin and partly at the University of Prishtina. The transcription is diplomatic (using the standard alphabet but transcribing relevant phonological realisation). It partly follows typical rendering of Gheg dialectal words and uses the HIAT system.
The data was annotated using Timofey Arkhangelsky's Uniparser-albanian-grammar (https://bitbucket.org/timarkh/uniparser-albanian-grammar), keeping only non-ambiguous values. A list of tags used in the parser can be found here: http://albanian.web-corpora.net. The data are in CoNLL-U format.
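For readers unfamiliar with CoNLL-U, the sketch below reads the ten standard tab-separated columns into dictionaries, skipping comment lines and splitting sentences at blank lines. The sample sentence and its tags are made up and are not taken from the corpus; underscores stand for unspecified values.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield sentences as lists of dicts, one dict per token line."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


# Made-up example, not corpus data.
SAMPLE = (
    "# text = Unë jam\n"
    "1\tUnë\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tjam\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
parsed = list(read_conllu(SAMPLE.splitlines()))
```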
This version of the corpus contains the data of 212 speakers aged between 11 and 80, mainly from the regions of Ferizaj, Gjilan, Kaçanik, Mitrovicë, Podujevë, Rahovec and Shtërpcë.
The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (public)
The OG2021 corpus contains multilingual news articles that are reporting on the events happening during the 2021 Tokyo Olympics. The data set was created to evaluate the clustering algorithm. The articles were initially acquired via the EventRegistry service, clustered using an online news clustering algorithm, and finally manually inspected and annotated by a single evaluator using translation services to understand the meaning of the articles' content.
The corpus consists of a single file called og2021.csv, which contains data on 10,940 news articles grouped into 1,350 clusters. Each article has the following attributes:
- id: The ID of the news article.
- title: The title of the article.
- lang: The language in which the article is written. Can be one of nine values.
- source: The news publisher's name.
- published_at: The date and time when the article was published. The published dates range between 2021-07-01 and 2021-08-14.
- URL: The URL location of the news article.
- cluster_id: The ID of the cluster the article is a member of.
The dataset is also published with the body attribute but under a more restrictive licence. It can be found at http://hdl.handle.net/11356/1921.
Service for querying dependency treebanks Drevesnik 1.1
Drevesnik (https://orodja.cjvt.si/drevesnik/) is an online service for querying Slovenian corpora parsed with the Universal Dependencies annotation scheme. It features an easy-to-use query language on the one hand and user-friendly graph visualizations on the other. It is based on the open-source dep_search tool (https://github.com/TurkuNLP/dep_search), which was localized and modified so as to also support querying by JOS morphosyntactic tags, random distribution of results, and filtering by sentence length.
The source code and the documentation for the search backend and the web user interface are publicly available in the CLARIN.SI GitHub repository https://github.com/clarinsi/drevesnik. This submission corresponds to release 1.1 (https://github.com/clarinsi/drevesnik/releases/tag/1.1), which brings improved architecture, documentation and branding in comparison to release 1.0.
Training corpus of spoken Slovenian ROG 1.0
The training corpus of spoken Slovenian ROG 1.0 is the main resource for the Slovenian language to train and evaluate technologies aimed at processing speech or speech transcripts, such as part-of-speech taggers, parsers, prosodic unit segmenters, disfluency identifiers, dialogue act classifiers, etc. It is also suitable for speech-related research. It consists of two parts:
1. ROG-SST, which includes selected Gos 2.1 (http://hdl.handle.net/11356/1863) transcriptions with:
- manually assigned lemmas and morphosyntactic tags according to the MULTEXT-East annotation scheme (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html),
- manual annotations according to the Universal Dependencies annotation scheme (i.e. part-of-speech categories, morphological features and syntactic dependencies)
In total, ROG-SST spans 76,341 words and 6,108 sentences. ROG-SST is distributed in CoNLL-U format (2014-2024) (.conllu files). Project website: https://spot.ff.uni-lj.si/en/.
2. ROG-Art, which includes:
- all the annotation layers from the ROG-SST
- prosodic units annotations
- disfluencies annotation
- dialogue acts annotation
ROG-Art is distributed as:
- EXMARaLDA format (.EXB files) for viewing with Partitur Editor (https://www.exmaralda.org/)
- .EXS files and Rog-Art.coma file for searching through the annotated corpus in the EXMARaLDA EXAKT concordancer (https://www.exmaralda.org/)
- .TRS files for viewing the transcriptions without annotations with Transcriber (https://trans.sourceforge.net/en/presentation.php)
- .TextGrid files with additional prosodic annotations for viewing with Praat (TeG folder, www.praat.org)
ROG-Art consists of 39,001 words in 1,969 sentences. WAV files are only available for the ROG-Art part. They must be copied to the WAV folder of the ROG-Art folder structure to enable automatic opening of WAV files in the EXMARaLDA or Transcriber tools. The WAV recordings are single-channel, sampled at 44,100 Hz with 16-bit precision.
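The stated audio parameters can be checked with Python's standard wave module, as in the sketch below. The function name matches_rog_art_format is hypothetical, and the example builds a short silent recording in memory instead of opening a real corpus file.

```python
import io
import wave


def matches_rog_art_format(wav_file):
    """Check for single-channel, 44100 Hz, 16-bit (2 bytes per sample) audio."""
    with wave.open(wav_file, "rb") as reader:
        actual = (reader.getnchannels(), reader.getframerate(), reader.getsampwidth())
    return actual == (1, 44100, 2)


def make_silent_wav(channels, rate=44100, sample_width=2, frames=441):
    """Build a short silent WAV in memory, used only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(rate)
        writer.writeframes(b"\x00" * (sample_width * channels * frames))
    buffer.seek(0)
    return buffer


mono_ok = matches_rog_art_format(make_silent_wav(channels=1))
stereo_ok = matches_rog_art_format(make_silent_wav(channels=2))
```

A path to a file in the WAV folder can be passed instead of the in-memory buffer.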
Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0
ILS is a dataset containing Slovene word forms containing a single lC bigram, i.e. an "l" grapheme preceding a consonant grapheme (a bigram of "l"+C(onsonant) = lC bigram). This combination is one of the less predictable pronunciation ambiguities in Slovene, as the "l" grapheme is sometimes pronounced as /l/ (e.g. "alge") and sometimes as /u̯/ (e.g. "polža"). In some cases, both variants are acceptable (e.g. "morilka"), but there is disagreement within the linguistic community on which pronunciations are acceptable in standard Slovene.
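The notion of an lC bigram can be made concrete with a small sketch. The consonant inventory below is an assumption (the consonant graphemes of the Slovene alphabet other than "l"); the dataset's own definition, for instance whether "lj" counts, may differ. The function names are hypothetical.

```python
import re

# Assumption: Slovene consonant graphemes, excluding "l" itself.
CONSONANTS = "bcčdfghjkmnprsštvzž"
LC_BIGRAM = re.compile(f"l[{CONSONANTS}]")


def lc_bigrams(word):
    """Return all 'l' + consonant bigrams found in the word."""
    return LC_BIGRAM.findall(word.lower())


def has_single_lc_bigram(word):
    """True if the word contains exactly one lC bigram."""
    return len(lc_bigrams(word)) == 1


examples = {word: lc_bigrams(word) for word in ("alge", "polža", "morilka", "mama")}
```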
The word forms containing an lC bigram were extracted from the manually validated lexemes of Sloleks 3.0 (http://hdl.handle.net/11356/1745). Approximately 6,600 lexemes were exported along with their inflected forms. The inflected forms were then annotated by 5 linguists in PyBossa (https://docs.pybossa.com/). Each set of forms within a lexeme was annotated by two linguists in terms of the standard Slovene pronunciation of the lC bigram (L, U, or both). The dataset enables additional linguistic analyses of the pronunciation of L in pre-consonant position in Slovene words and can be used as a starting point to identify the most problematic points of disagreement in pronunciation, which can be included in future studies.
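One simple way to locate points of disagreement is to compare the two annotations of each form. The sketch below assumes that each form carries a pair of labels drawn from "L", "U" and "both". The label spellings and the pair layout are hypothetical, since the actual file structure is described only in 00README.txt.

```python
from collections import Counter


def agreement_summary(annotation_pairs):
    """Return the share of forms with identical labels and counts per label pair."""
    # Sorting each pair makes ("U", "L") and ("L", "U") count as the same combination.
    counts = Counter(tuple(sorted(pair)) for pair in annotation_pairs)
    if not annotation_pairs:
        return 0.0, counts
    agreed = sum(n for (first, second), n in counts.items() if first == second)
    return agreed / len(annotation_pairs), counts


# Made-up annotations, not taken from the dataset.
pairs = [("L", "L"), ("U", "L"), ("L", "U"), ("both", "U")]
share_agreed, pair_counts = agreement_summary(pairs)
```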
Version 1.0 includes 173,419 annotated word forms with 2 annotations each. Forms containing multiple lC bigrams were excluded from this version as they account for only approximately 5% of all lC bigram forms; these will be included in future versions. For a more detailed description of the file structure, please see 00README.txt.
Lists of Slovene accentuated units SNES 1.0
SNES (Stalno naglašene enote iz Sloleksa; Constantly accentuated units from Sloleks) is a dataset containing Slovene final accentuated word parts (i.e., the ending part of an accentuated word from its last grapheme with an accentuation diacritic to the end of the word; for instance, -álnik for "računálnik", -úlja for "hodúlja") that have been automatically extracted from the accentuated forms of the approximately 100,800 manually validated lexemes of Sloleks 3.0 (http://hdl.handle.net/11356/1745). The extracted parts were then manually categorized to compile a manually validated machine-readable list of final accentuated word parts that are always or almost always accentuated in Slovene (e.g. -álnik, -ílnik). Only accentuated word parts that are accentuated in at least 80% of examples were included in the manual list. The list can be used as a resource in post-processing to correct some of the errors in the output of Slovene accentuation models.
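The extraction of a final accentuated word part can be sketched as below. This is an illustration, not the extraction code used for SNES. It assumes that accentuation is marked with an acute, grave or circumflex diacritic, and that the caron on č, š and ž is not an accentuation mark. The function name is hypothetical.

```python
import unicodedata

# Assumption: combining acute, grave and circumflex mark accentuation.
ACCENT_MARKS = {"\u0301", "\u0300", "\u0302"}


def final_accentuated_part(word):
    """Return '-' plus the word from its last accentuated grapheme, or None."""
    decomposed = unicodedata.normalize("NFD", word)
    positions = [i for i, ch in enumerate(decomposed) if ch in ACCENT_MARKS]
    if not positions:
        return None
    start = positions[-1]
    # Step back from the combining mark to the base letter it attaches to.
    while start > 0 and unicodedata.combining(decomposed[start]):
        start -= 1
    return "-" + unicodedata.normalize("NFC", decomposed[start:])


part_a = final_accentuated_part("računálnik")
part_b = final_accentuated_part("hodúlja")
part_c = final_accentuated_part("čaj")  # only a caron, so no accentuated part
```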
Version 1.0 includes 24,188 automatically extracted final accentuated word parts, 1,013 of which have been manually validated, categorized, and included in a separate manual list of Slovene final word parts that are always or very frequently accentuated. For more details on the structure of the files, please consult 00README.txt.
Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.1
ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words.
The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for the release 4.1 at https://github.com/clarin-eric/ParlaMint/tree/v4.1/Build/Metadata.
The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution).
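The subcorpus date boundaries given in the metadata description can be expressed as a small function. The sketch reads the boundaries literally: "covid" and "war" have start dates but no stated end dates, so a date on or after 2022-02-24 receives both labels. Whether the corpora mark such days with both labels is an assumption, and the function name is hypothetical.

```python
from datetime import date

COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpora(day):
    """Return the subcorpus labels for a sitting date, read literally from the boundaries."""
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    # Assumption: "covid" has no end date, so "war" is added rather than substituted.
    if day >= WAR_START:
        labels.append("war")
    return labels


before = subcorpora(date(2020, 1, 30))
first_covid_day = subcorpora(date(2020, 1, 31))
first_war_day = subcorpora(date(2022, 2, 24))
```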
The ParlaMint.ana linguistic annotation includes tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, in particular PoS tagging according to a language-specific scheme, with their corpus TEI headers giving further details on the annotation vocabularies and tools used.
This entry contains the ParlaMint.ana TEI-encoded linguistically annotated corpora; the derived CoNLL-U files along with TSV metadata of the speeches; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included is the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the open issues at the GitHub repository of the project.
This entry contains the linguistically marked-up version of the corpus, while the text version, i.e. without the linguistic annotation, is also available at http://hdl.handle.net/11356/1912. Another related resource, namely the ParlaMint corpora machine-translated to English (ParlaMint-en.ana 4.1), can be found at http://hdl.handle.net/11356/1910.
As opposed to the previous version 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. The DK corpus has been linguistically re-annotated to remove bugs, while its speeches are now also marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11; the UA corpus also has improved language marking (uk vs. ru) on segments.