Core vocabulary for Slovenian as L2 1.0
The Core vocabulary for Slovenian as L2 is based on an analysis of the vocabulary appearing in the KUUS corpus (http://hdl.handle.net/11356/1696), which includes textbooks for Slovenian as a second and foreign language. By exporting lemmas, comparing them with the Reference list of Slovene frequent common words (Pollak et al. 2020, http://hdl.handle.net/11356/1346) and manual review, a list of 5273 words was compiled. The lemmas were classified into the first three CEFR levels. The list includes 350 words with the assigned label A1-core, 864 words with the label A1-larger, 1451 words with the label A2 and 2608 words at level B1. The file is in a tab-separated format, containing the lemma, its part of speech (following the MULTEXT-East tagset for Slovenian), whether the lemma appears in the Reference List of Slovene Frequent Common Words, and the relative average frequency.
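A minimal sketch of reading such a tab-separated file in Python. The column names, the absence of a header row, and the sample rows (including the tag and flag values) are assumptions made for illustration; the actual file may differ.

```python
import csv
import io

# Hypothetical column names, in the order the description lists the fields.
COLUMNS = ["lemma", "pos", "in_reference_list", "rel_avg_frequency"]

def read_core_vocabulary(handle):
    """Read tab-separated rows into dicts (assumes no header row)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        entry = dict(zip(COLUMNS, row))
        entry["rel_avg_frequency"] = float(entry["rel_avg_frequency"])
        entries.append(entry)
    return entries

# Made-up sample rows standing in for the real file.
sample = io.StringIO("hiša\tNcf\tyes\t152.4\nbrati\tVm\tyes\t87.0\n")
entries = read_core_vocabulary(sample)
```

For the real resource, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"`.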
The word lists are presented in more detail in: KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, HUBER, Damjan, LUTAR, Mateja, 2022: Korpus učbenikov za učenje slovenščine kot drugega in tujega jezika. Nataša Pirih Svetina, Ina Ferbežar (eds.): Na stičišču svetov: slovenščina kot drugi in tuji jezik. Obdobja 41. Ljubljana: Založba Univerze v Ljubljani. 165–174. DOI: https://doi.org/10.4312/Obdobja.41.2784-7152
Corpus of textbooks for learning Slovenian as L2 KUUS 1.0
The KUUS corpus comprises 17 textbooks for Slovenian as a second and foreign language published between 2002 and 2022 at the Centre for Slovene as a Second and Foreign Language (Faculty of Arts, University of Ljubljana). These textbooks were widely used in the teaching of Slovenian as a second and foreign language to children, adolescents and adults in Slovenia and abroad at the time of the creation of the corpus. The KUUS corpus consists of 520,796 words. It was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD-tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The metadata for each textbook includes the title, subtitle, authors, year of publication, publisher, CEFR level, target audience, and the estimated number of lessons.
The corpus is presented in more detail in: KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, HUBER, Damjan, LUTAR, Mateja, 2022: Korpus učbenikov za učenje slovenščine kot drugega in tujega jezika. Nataša Pirih Svetina, Ina Ferbežar (eds.): Na stičišču svetov: slovenščina kot drugi in tuji jezik. Obdobja 41. Ljubljana: Založba Univerze v Ljubljani. 165–174. DOI: https://doi.org/10.4312/Obdobja.41.2784-715
Slovenian Emotion Dimension and Emotion Association Lexicon SloEmoLex 1.0
SloEmoLex is a lexicon of emotion, valence, arousal and dominance for 19,998 Slovenian entries.
It includes and extends the Slovenian part of the LiLaH lexicon (Ljubešić et al., 2020; http://hdl.handle.net/11356/1318), in which words are annotated with binary values for association to one of the 8 basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and binary values for association with positive/negative sentiment.
SloEmoLex extends the LiLaH emotion lexicon with VAD scores from NRC VAD v1 (http://saifmohammad.com/WebPages/nrc-vad.html), and emotion intensity scores from the NRC Emotion Intensity lexicon v1 (http://saifmohammad.com/WebPages/AffectIntensity.htm). Apart from the approx. 14,000 words present in LiLaH, the lexicon includes 5,931 additional entries from the NRC VAD lexicon, some of which were translated with the use of sloWNet 3.1 (http://hdl.handle.net/11356/1026), and some entries (3,273) retained the machine translation provided in the Slovenian part of the NRC VAD lexicon.
If you use this work, please cite our paper:
Caporusso, Jaya, Hoogland, Damar, Brglez, Mojca, Kolosko, Boshko, Purver, Matthew, and Pollak, Senja (2024). A Computational Analysis of the Dehumanisation of Migrants from Syria and Ukraine in Slovene News Media. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 20-25 May 2024, Torino, Italy.
Slovenian Definition Extraction training dataset DF_NDF_wiki_slo 1.0
The Slovenian definition extraction training dataset DF_NDF_wiki_slo contains 38613 sentences extracted from the Slovenian Wikipedia. The first sentence of a term's description on Wikipedia is considered a definition, and all other sentences are considered non-definitions.
The corpus consists of the following files, each containing one definition or non-definition sentence per line:
1. Definitions: df_ndf_wiki_slo_Y.txt with 3251 definition sentences.
2. Non-definitions: df_ndf_wiki_slo_N.txt with 14678 non-definition sentences which do not contain the term at the beginning of the sentence.
3. Non-definitions: df_ndf_wiki_slo_N1.txt with 20684 non-definition sentences which may also contain the term at the beginning of the sentence.
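Since each file holds one sentence per line, the files can be combined into a labelled training set along these lines. This is only a sketch: UTF-8 encoding and the 1/0 label convention are assumptions, and the demonstration writes tiny placeholder files to a temporary directory instead of using the real data.

```python
from pathlib import Path
import tempfile

def load_labelled_sentences(directory, negatives="df_ndf_wiki_slo_N.txt"):
    """Return (sentence, label) pairs: 1 = definition, 0 = non-definition."""
    directory = Path(directory)
    files = [("df_ndf_wiki_slo_Y.txt", 1), (negatives, 0)]
    data = []
    for name, label in files:
        # UTF-8 is an assumption about the encoding of the files.
        with open(directory / name, encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                if sentence:
                    data.append((sentence, label))
    return data

# Placeholder files with invented sentences, for demonstration only.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "df_ndf_wiki_slo_Y.txt").write_text("Definicija ena.\n", encoding="utf-8")
    Path(tmp, "df_ndf_wiki_slo_N.txt").write_text("Stavek ena.\nStavek dva.\n", encoding="utf-8")
    data = load_labelled_sentences(tmp)
```

Passing `negatives="df_ndf_wiki_slo_N1.txt"` would select the alternative set of non-definitions, which may also contain the term at the beginning of the sentence.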
The dataset is described in more detail in Fišer et al. 2010. If you use this resource, please cite:
Fišer, D., Pollak, S., Vintar, Š. (2010). Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). https://aclanthology.org/L10-1089/
Reference to training Transformer-based definition extraction models using this dataset:
Tran, T.H.H., Podpečan, V., Jemec Tomazin, M., Pollak, Senja (2023). Definition Extraction for Slovene: Patterns, Transformer Classifiers and ChatGPT. Proceedings of the ELEX 2023: Electronic lexicography in the 21st century. Invisible lexicography: everywhere lexical data is used without users realizing they make use of a “dictionary”.
Related resources:
Jemec Tomazin, M. et al. (2023). Slovenian Definition Extraction evaluation datasets RSDO-def 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/184
A deep learning approach to self-expansion of abbreviations based on morphology and context distance
Abbreviations and acronyms are shortened forms of words or phrases that are commonly used in technical writing. In this study we focus specifically on abbreviations and introduce a corpus-based method for their expansion. The method divides the processing into three key stages: abbreviation identification, full form candidate extraction, and abbreviation disambiguation. First, potential abbreviations are identified by combining pattern matching and named entity recognition. Both acronyms and abbreviations exhibit similar orthographic properties, thus additional processing is required to distinguish between them. To this end, we implement a character-based recurrent neural network (RNN) that analyses the morphology of a given token in order to classify it as an acronym or an abbreviation. A siamese RNN that learns the morphological process of word abbreviation is then used to select a set of full form candidates. Having considerably constrained the search space, we take advantage of the Word Mover’s Distance (WMD) to assess semantic compatibility between an abbreviation and each full form candidate based on their contextual similarity. This step does not require any corpus-based training, thus making the approach highly adaptable to different domains. Unlike the vast majority of existing approaches, our method does not rely on external lexical resources for disambiguation, but with a macro F-measure of 96.27% is comparable to the state of the art.
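The disambiguation step can be illustrated with a toy sketch that ranks full form candidates by the distance between their contexts and the context of the abbreviation. This is not the authors' implementation: it uses the relaxed lower bound of the Word Mover's Distance (each word travels to its nearest counterpart) instead of solving the full transport problem, and the two-dimensional word vectors and contexts are invented for illustration.

```python
import math

def relaxed_wmd(context_a, context_b, vectors):
    """Relaxed WMD: a lower bound on WMD with uniform word weights."""
    def dist(u, v):
        return math.dist(vectors[u], vectors[v])

    def one_way(src, dst):
        # Every source word moves all of its mass to the nearest target word.
        return sum(min(dist(w, x) for x in dst) for w in src) / len(src)

    return max(one_way(context_a, context_b), one_way(context_b, context_a))

# Invented toy embeddings; real ones would come from a trained model.
vectors = {
    "fever": (0.0, 0.0), "patient": (1.0, 0.0), "nurse": (1.0, 1.0),
    "contract": (5.0, 5.0), "staff": (5.0, 4.0),
}
abbreviation_context = ["fever", "patient"]   # context of the abbreviation "temp"
candidate_contexts = {
    "temperature": ["fever", "nurse"],
    "temporary": ["contract", "staff"],
}
best = min(
    candidate_contexts,
    key=lambda c: relaxed_wmd(abbreviation_context, candidate_contexts[c], vectors),
)
```

Because the distance is computed directly from pretrained word vectors, no training on the target corpus is needed at this stage, which matches the adaptability claim in the abstract.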
List of single-word male and female occupations in Slovenian
The list of single-word occupations in Slovene is based on the Slovene Standard Classification of Occupations (https://www.uradni-list.si/glasilo-uradni-list-rs/vsebina?urlid=199728&stevilka=1641).
The list includes 234 occupation pairs. For each occupation, it contains its masculine word form (e.g. fotograf), its possible synonym, its feminine equivalent (e.g. fotografka) and the corresponding synonym of the feminine form (e.g. fotografinja). The cases where no synonyms were added for a specific occupation are denoted with the label 0 (note that only synonyms with the same root are considered).
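A short sketch of how one entry could be parsed, turning the label 0 into a missing value. The tab separator, the column order, and the dictionary keys are assumptions; the example row only reuses the forms mentioned above.

```python
def parse_occupation_row(line):
    """Split one row into its four forms; the label '0' means 'no synonym'."""
    masc, masc_syn, fem, fem_syn = line.rstrip("\n").split("\t")

    def none_if_zero(value):
        return None if value == "0" else value

    return {
        "masculine": masc,
        "masculine_synonym": none_if_zero(masc_syn),
        "feminine": fem,
        "feminine_synonym": none_if_zero(fem_syn),
    }

# Hypothetical row layout: masculine, its synonym, feminine, its synonym.
row = parse_occupation_row("fotograf\t0\tfotografka\tfotografinja\n")
```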
Several conditions for including an occupation in the list or excluding it were applied:
- Our list contains only single-word occupation pairs, while the majority of the occupations in the aforementioned classification are multi-word expressions.
- An occupation has to exist in both feminine and masculine grammatical gender (gender-neutral words such as pismonoša [en. postman] are not included in the list).
- At least one of the variants of an occupation (masculine or feminine) occurs at least 500 times in the Corpus of Written Standard Slovene Gigafida 2.0.
- The occupations that are also proper names in Slovene, e.g. kovač [en. blacksmith], were filtered out if in the Slovene Morphological Lexicon Sloleks 2.0 (Dobrovoljc et al., 2019) the proper name form exists.
- Occupations that could be easily associated with a context unrelated to occupations (e.g. čarovnik/čarovnica [en. wizard/witch]) or where a male or female variant is a homograph of a common noun (e.g. detektivka [en. detective] also denotes a detective novel) were excluded from the final set of occupations.
When a more established version of an occupation exists, we manually add a synonym with the same root (e.g. in the case of fotografka, an arguably more established fotografinja was added [en. photographer]).
If the standard classification does not include the female (e.g. dramatik [en. playwright]) or the male version (e.g. prostitutka [en. prostitute]) of an occupation, the missing version is manually added if it exists and appears in the Gigafida corpus. Where no established word exists, nothing is added (e.g. there are no established words for the female and male versions of postrešček [en. porter] and hostesa [en. hostess], respectively).
The list of occupations can be used for different natural language processing tasks including evaluation of word embeddings models through analogies, which can point to bias in language use.
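An analogy test of this kind can be sketched with the common vector-offset formulation (masculine A is to feminine A as masculine B is to ?). The two-dimensional vectors below are invented, and the second occupation pair is only illustrative and not necessarily part of the list; a real evaluation would load a trained embedding model.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Invented toy embeddings for illustration only.
vectors = {
    "fotograf": (1.0, 0.0), "fotografka": (1.0, 1.0),
    "učitelj": (3.0, 0.0), "učiteljica": (3.0, 1.0),
    "hiša": (0.5, -2.0),
}
answer = solve_analogy("fotograf", "fotografka", "učitelj", vectors)
```

If a model systematically returns an unrelated or stereotyped word instead of the expected feminine form, that can point to bias in the underlying language use.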
If you use the dataset, please cite the following paper: SUPEJ, Anka, ULČAR, Matej, ROBNIK ŠIKONJA, Marko, POLLAK, Senja (2020). Primerjava slovenskih besednih vektorskih vložitev z vidika spola na analogijah poklicev. Zbornik konference Jezikovne tehnologije in digitalna humanistika / Proc. of the Conference on Language Technologies and Digital Humanities, p. 93-100
SimLex-999 Slovenian translation SimLex-999-sl 1.0
The resource contains the English SimLex-999 word pairs (Hill et al. 2015) and their Slovene translations. In the translation process, the word pairs were first translated by two translators independently; where the translations differed, the final translations were chosen in a consensus meeting.
The translators also had access to the Croatian SimLex-999 translations (Mrkšić et al. 2017) and received translation guidelines (see next sheet) inspired by the guidelines of Multi-SimLex (Vulić et al. 2020). The resource was used for building the CoSimLex resource (Armendariz et al. 2020).
The list contains the original English word pair (Word1 and Word2), its part of speech, and the Slovene translations (Trans1 and Trans2). The last column, Comment, marks special cases:
- "multiword_translation" -> translators were asked to opt for single-word equivalents, but in some cases the only appropriate translation was a multi-word expression (for example, "birthday" -> "rojstni dan").
- "no_translation" -> pairs without a proper translation, i.e. translation pair contains two identical words. Although the translators were asked to find two different translations for the words, in a few examples that was not possible. For example, for the English pair "taxi" and "cab", only "taksi" was considered a good Slovene equivalent.
- "duplicated_translation" -> in cases where a pair of words is repeated for two different English original pairs, both occurrences are marked as duplicate translations.
- "duplicated_original" -> in one case, the original word pair was a duplicate, which is also marked.
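The Comment labels make it easy to filter the pairs before an evaluation. A minimal sketch, assuming the rows have already been loaded into dictionaries keyed by the column names above; the key `POS`, the empty string for "no comment", and the second sample row are assumptions for illustration.

```python
from collections import Counter

# Illustrative rows; the first mirrors the "taxi"/"cab" case described above.
rows = [
    {"Word1": "taxi", "Word2": "cab", "POS": "N",
     "Trans1": "taksi", "Trans2": "taksi", "Comment": "no_translation"},
    {"Word1": "old", "Word2": "new", "POS": "A",
     "Trans1": "star", "Trans2": "nov", "Comment": ""},
]

def usable_pairs(rows, excluded=("no_translation",)):
    """Keep only pairs whose Comment label is not in the excluded set."""
    return [row for row in rows if row["Comment"] not in excluded]

kept = usable_pairs(rows)
comment_counts = Counter(row["Comment"] or "none" for row in rows)
```

Which labels to exclude depends on the task; pairs with identical translations, for instance, carry no information for a similarity ranking.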
Cite: If you use the dataset, please cite the Clarin handle and the following paper: Armendariz, Carlos Santos, Purver, Matthew, Ulčar, Matej, Pollak, Senja, Ljubešić, Nikola, Granroth-Wilding, Mark, and Vaik, Kristiina (2020). CoSimLex: A Resource for Evaluating Graded Word Similarity in Context. In Proceedings of the 12th Language Resources and Evaluation Conference, p. 5878--5886. https://www.aclweb.org/anthology/2020.lrec-1.720/
References:
Armendariz, Carlos Santos, Purver, Matthew, Ulčar, Matej, Pollak, Senja, Ljubešić, Nikola, Granroth-Wilding, Mark, and Vaik, Kristiina (2020). CoSimLex: A Resource for Evaluating Graded Word Similarity in Context. In Proceedings of the 12th Language Resources and Evaluation Conference, p. 5878--5886. https://www.aclweb.org/anthology/2020.lrec-1.720/
Hill, F., Reichart, R., and Korhonen, A. (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695. https://www.aclweb.org/anthology/J15-4004/
Mrkšić, Nikola, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. (2017). Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the ACL, 5:309–324. https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00063
Vulić, Ivan, Baker, Simon, Ponti, Edoardo Maria, Petti, Ulla, Leviant, Ira, Wing, Kelly, Majewska, Olga, Bar, Eden, Malone, Matt, Poibeau, Thierry, Reichart, Roi and Anna Korhonen (2020). Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity. Computational Linguistics. https://doi.org/10.1162/coli_a_0039
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
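The difference between the two counting schemes can be made concrete with a small sketch. It is not the study's procedure: it assumes that each pair of authors is counted once per citing paper and that co-authors of the same cited work also count as co-cited, and the reference lists are invented.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists, max_authors=1):
    """Count author pairs cited together, using the first `max_authors`
    authors of every cited work (1 = traditional first-author counting)."""
    counts = Counter()
    for cited_works in reference_lists:
        authors = set()
        for work_authors in cited_works:
            authors.update(work_authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts

# Two invented citing papers; each cited work is a list of author names.
papers = [
    [["Smith", "Jones"], ["Lee"]],
    [["Smith"], ["Lee", "Kim"], ["Jones", "Smith"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In this toy data, Kim never appears under first-author counting but is linked to three other authors once the first five authors are considered, which shows how the non-traditional scheme produces denser author groups.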
Linguistic and orthographical classic Portuguese variants. Challenges for NLP
In recent years, a great investment has been made in transferring ancient Portuguese texts from physical to digital media. This transfer not only gives the general public access to the texts, but also makes it possible for the texts to be read and processed by machines. NLP tools mainly address contemporary Portuguese, and applying NLP to classic texts poses several difficulties. Building large lexical corpora of forms that predate modern Portuguese is an opportunity for a multidisciplinary field of study: it broadens linguistic research and makes it possible to obtain, through NLP, validated corpora, collections and ontologies that can serve as input to NLP tools for ancient Portuguese texts. In this work we briefly present the problem of lexical variation of forms in processing classic Portuguese texts, the challenges that emerge from it and perspectives for future work.
