CLARIN-PL

KPWr chunks 2021

Author: Oleksy Marcin
Wieczorek Jan
Walentynowicz Wiktor
Domogała Aleksandra
Wajda Anna
Dominiak Daria
Wróż Anita
Kwiatkowska Agnieszka
Pochanke Anna
Gałkowska Marzena
Publication venue: Wrocław University of Science and Technology
Publication date: 23/04/2021

357 documents from the KPWr corpus, manually annotated at the syntactic level (chunks). Please cite as: Oleksy, M., Walentynowicz, W., & Wieczorek, J. (2021). New approach to the chunk recognition in Polish. Procedia Computer Science, 192, 1001-1010.

PoLitBert_v32k_linear_125k - Polish RoBERTa model

Author: Sopyła Krzysztof
Sawaniewski Łukasz
Publication venue: Ermlab
Publication date: 01/01/2021

Polish RoBERTa model trained on Polish Wikipedia, Polish literature, and the OSCAR corpus.

PoLitBert_v32k_linear_50k - Polish RoBERTa model

Author: Sopyła Krzysztof
Sawaniewski Łukasz
Publication venue: Ermlab
Publication date: 01/01/2021

Polish RoBERTa model trained on Polish Wikipedia, Polish literature, and the OSCAR corpus.

Speech tools plugin for Annotation Pro

Author: Klessa Katarzyna
Korzinek Danijel
Publication venue: Adam Mickiewicz University
Publication date: 14/10/2021

This resource describes an Annotation Pro plugin containing tools for automatic processing of speech data. The initial version provides only a speech aligner, but more tools are planned.

AspectEmo 1.0: Multi-Domain Corpus of Consumer Reviews for Aspect-Based Sentiment Analysis

Author: Kocoń Jan
Radom Jarema
Kaczmarz-Wawryk Ewa
Wabnic Kamil
Zajączkowska Ada
Zaśko-Zielińska Monika
Publication venue: Wrocław University of Science and Technology
Publication date: 01/10/2021

The AspectEmo 1.0 corpus is an extended version of the publicly available PolEmo 2.0 corpus of Polish customer reviews, which has been used in many projects applying different sentiment analysis methods. The AspectEmo corpus consists of four subcorpora, each containing online customer reviews from one of the following domains: school, medicine, hotels, and products. All documents are annotated at the aspect level with 6 sentiment categories, including strong negative (minus_m), weak negative (minus_s), neutral (zero), weak positive (plus_s), and strong positive (plus_m).

Lexicalisation of Polish and English word combinations: two samples manually annotated (with collocation strength corpus statistics)

Author: Dziob Agnieszka
Grabowski Łukasz
Kanclerz Kamil
Kompa Karolina
Maziarz Marek
Piasecki Maciej
Piotrowski Tadeusz
Rudnicka Ewa
Publication venue: Wrocław University of Science and Technology
Publication date: 21/12/2021

We analysed over 350 Polish and English word combinations (multi-word expressions, MWEs). Half of the sample was drawn from traditional dictionaries, while the other half was created by hand to represent free word combinations (i.e., MWEs not found in dictionaries; this information is given in the "Status" column). Syntactically, these were noun phrases (NPs) consisting of either an adjective and a noun (A+N) or two nouns (N+N), called 'bigrams'. We operationalised semantic compositionality by testing two custom-designed criteria, i.e., Intuition and Paraphrase, and by using statistical methods (selected measures of collocational strength, i.e., log-likelihood, PMI and Jaccard) to check word order fixedness and word combination specificity. We also measured the length (in letters) of the syntactic nucleus and of its complement (columns "AWL" and "HWL"); this measure is highly correlated with word frequency, a relationship known as Zipf’s law. The last column ("LCA") gives the classification results obtained from Latent Class Analysis.
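Two of the collocational strength measures named in this record, PMI and Jaccard, can be computed from raw corpus counts. The following is a minimal sketch under assumptions: the function names, parameter names, and counts are invented for illustration and do not come from the dataset, and log-likelihood is left out for brevity.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) of a bigram from raw counts."""
    p_pair = pair_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return math.log2(p_pair / (p_w1 * p_w2))


def jaccard(pair_count, w1_count, w2_count):
    """Jaccard coefficient: co-occurrences over occurrences of either word."""
    return pair_count / (w1_count + w2_count - pair_count)


# Toy counts, invented for illustration only (not taken from the samples).
example_pmi = pmi(pair_count=8, w1_count=16, w2_count=32, total=1024)
example_jaccard = jaccard(pair_count=8, w1_count=16, w2_count=32)
```

Higher values of either measure indicate that the two words co-occur more often than their individual frequencies would suggest, which is one signal of a fixed, lexicalised combination.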

Testing agreement between lexicographers: A case of homonymy and polysemy

Author: Maziarz Marek
Bond Francis
Rudnicka Ewa
Publication venue: Global Wordnet Association
Publication date: 01/01/2021

In this paper we compare the Oxford Lexico and Merriam-Webster dictionaries with Princeton WordNet with respect to the description of semantic (dis)similarity between polysemous and homonymous senses that can be inferred from them. WordNet lacks any explicit description of polysemy or homonymy, but as a network of linked senses it may be used to compute semantic distances between word senses. To compare WordNet with the dictionaries, we transformed sample entry microstructures of the latter into graphs and cross-linked them with the equivalent senses of the former. We found that the dictionaries are in high agreement with each other if one considers polysemy and homonymy together, and in moderate concordance if one focuses on polysemy descriptions alone. Measuring the shortest path lengths on WordNet gave results comparable to those on the dictionaries in predicting semantic dissimilarity between polysemous senses, but was less successful at recognising homonymy.
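The idea of measuring shortest path lengths between senses in a graph can be sketched with a plain breadth-first search. This is an illustration under assumptions, not the authors' implementation: the toy graph, its node names, and the function name are hypothetical.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy graph of senses (adjacency lists, undirected links).
toy_graph = {
    "bank#1": ["institution"],
    "institution": ["bank#1", "organization"],
    "organization": ["institution"],
    "bank#2": ["slope"],
    "slope": ["bank#2"],
}

related = shortest_path_length(toy_graph, "bank#1", "organization")
unrelated = shortest_path_length(toy_graph, "bank#1", "bank#2")
```

In this toy setting, a short path suggests related senses, while a missing path (None) mimics the situation of unrelated, homonymous senses.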

A (Non)-Perfect Match: Mapping plWordNet onto Princeton WordNet

Author: Rudnicka Ewa
Witkowski Wojciech
Piasecki Maciej
Publication venue: Global Wordnet Association
Publication date: 01/01/2021

The paper reports on the methodology and final results of a large-scale synset mapping between plWordNet and Princeton WordNet. Dedicated manual and semi-automatic mapping procedures as well as interlingual relation types for nouns, verbs, adjectives and adverbs are described. The statistics of all types of interlingual relations are also provided

Big data language model with part of speech tags stemmed in RAW format

Author: Wołk Krzysztof
Publication venue: Polish-Japanese Academy of Information Technology
Publication date: 31/03/2021

Big data language model with part of speech tags stemmed in RAW format

Metaphors and arguments annotations in Polish political – pre-election debates from 2019 (TVP 2019-10-01 and TVN 2019-10-08)

Author: Juszczyk Konrad
Konat Barbara
Fabiszak Małgorzata
Publication venue: Adam Mickiewicz University
Publication date: 21/07/2021

The data published here supplement a paper to be published in Metaphor and the Social World (under revision). Two debates organised and published by TVP and TVN were transcribed and annotated with the Metaphor Identification Method. We used the eMargin software (a collaborative textual annotation tool; Kehoe and Gee 2013) and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its “contextual meaning was related to the more basic meaning by some form of similarity” (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRWs, lexemes that create a metaphorical expression together with an MRW were tagged as metaphor expression words (MEW). At least two words are needed to identify the actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphor expression word, because the basic meaning of the MRW is extended to a more figurative meaning only when it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: “Służba zdrowia jest w zapaści” (“Health service suffers from a collapse.”), where the word “zapaść” (“collapse”) is an example of an MRW and the words “służba zdrowia” (“health service”) are labelled as MEW. The English translation of this expression needs a different verb: instead of “jest w zapaści” (“is in collapse”), the unmarked English collocation is “suffers from a collapse”; therefore the words “suffers from a collapse” are labelled as MRW.
The “collapse” could be caused by heart failure, such as cardiac arrest, or any other life-threatening medical condition, and “health service” is portrayed as if it could literally suffer from such a condition, a collapse. The data are in CSV tables exported from XML files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, each assigned to one annotator. MRW words are marked as MLN, MEW words as MLP, and function words within a metaphorical expression as MLI; all other words are marked noana, which means no annotation needed.
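A first look at such a table could be a simple count of the four labels. The sketch below rests on assumptions: the column names "word" and "tag", the layout of the CSV, and the sample rows are hypothetical, and the tags given to the function words are illustrative guesses rather than values from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical export: the real column names and layout may differ.
sample_csv = """word,tag
Służba,MLP
zdrowia,MLP
jest,noana
w,MLI
zapaści,MLN
"""


def count_tags(csv_text, tag_column="tag"):
    """Count how often each annotation label occurs in a CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[tag_column] for row in reader)


tag_counts = count_tags(sample_csv)
```

For the real files, the same function could be applied to the text of each CSV table once the actual name of the label column is known.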

40 full texts, 504 metadata records. Updated in last 30 days.
