504 research outputs found

    KPWr chunks 2021

    No full text
    357 documents from the KPWr corpus, manually annotated at the syntactic level (chunks). Please cite as: Oleksy, M., Walentynowicz, W., & Wieczorek, J. (2021). New approach to the chunk recognition in Polish. Procedia Computer Science, 192, 1001-1010.

    PoLitBert_v32k_linear_125k - Polish RoBERTa model

    No full text
    Polish RoBERTa model trained on Polish Wikipedia, Polish literature and Oscar

    PoLitBert_v32k_linear_50k - Polish RoBERTa model

    No full text
    Polish RoBERTa model trained on Polish Wikipedia, Polish literature and Oscar

    Speech tools plugin for Annotation Pro

    No full text
    This resource describes the Annotation Pro plugin, which contains various tools for automatic processing of speech data. The initial version provides only a speech aligner, but more tools are planned.

    AspectEmo 1.0: Multi-Domain Corpus of Consumer Reviews for Aspect-Based Sentiment Analysis

    No full text
    AspectEmo 1.0 Corpus is an extended version of the publicly available PolEmo 2.0 corpus of Polish customer reviews, which has been used in many projects applying different methods of sentiment analysis. The AspectEmo corpus consists of four subcorpora, each containing online customer reviews from one of the following domains: school, medicine, hotels, and products. All documents are annotated at the aspect level with 6 sentiment categories: strong negative (minus_m), weak negative (minus_s), neutral (zero), weak positive (plus_s), strong positive (plus_m), and ambivalent (amb).

    Lexicalisation of Polish and English word combinations: two samples manually annotated (with collocation strength corpus statistics)

    No full text
    We analysed over 350 Polish and English word combinations (multi-word expressions, MWEs). Half of the sample was drawn from traditional dictionaries, while the other half was created by hand to represent free word combinations, i.e., MWEs not found in dictionaries (this information is given in the column "Status"). Syntactically, these were noun phrases (NPs) consisting of either an adjective and a noun (A+N) or two nouns (N+N), called 'bigrams'. We operationalised semantic compositionality by testing two custom-designed criteria, Intuition and Paraphrase, and by using statistical methods (selected measures of collocational strength, i.e., log-likelihood, PMI and Jaccard) to check word order fixedness and word combination specificity. We also measured the length (in letters) of the syntactic nucleus and of its complement (columns "AWL" and "HWL"); this measure is highly correlated with word frequency, a relationship known as Zipf’s law. In the last column ("LCA") we give classification results obtained from Latent Class Analysis.
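As a rough illustration of the collocational strength measures named above, the following sketch computes PMI, Jaccard and log-likelihood for a single bigram from raw corpus counts. It uses common textbook definitions of these measures; the exact variants, corpora and counts behind the dataset are not stated in the record, so the function names and the toy counts are assumptions made for illustration only.

```python
import math


def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information (log base 2) of a bigram.

    c_xy: co-occurrence count of the two words, c_x and c_y: counts of
    each word, n: total number of bigram positions in the corpus.
    """
    return math.log2((c_xy * n) / (c_x * c_y))


def jaccard(c_xy, c_x, c_y):
    """Jaccard coefficient: joint count over the union of occurrences."""
    return c_xy / (c_x + c_y - c_xy)


def log_likelihood(c_xy, c_x, c_y, n):
    """Log-likelihood ratio (G2) over the 2x2 contingency table."""
    observed = [c_xy, c_x - c_xy, c_y - c_xy, n - c_x - c_y + c_xy]
    rows = [c_x, n - c_x]
    cols = [c_y, n - c_y]
    expected = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


# Toy counts (hypothetical, not taken from the dataset).
print(pmi(30, 100, 60, 10_000))
print(jaccard(30, 100, 60))
print(log_likelihood(30, 100, 60, 10_000))
```

Higher values of each measure indicate a stronger association between the two words; for two words that co-occur exactly as often as chance predicts, PMI and log-likelihood are both zero.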

    Testing agreement between lexicographers: A case of homonymy and polysemy

    Full text link
    In this paper we compare Oxford Lexico and Merriam Webster dictionaries with Princeton WordNet with respect to the description of semantic (dis)similarity between polysemous and homonymous senses that could be inferred from them. WordNet lacks any explicit description of polysemy or homonymy, but as a network of linked senses it may be used to compute semantic distances between word senses. To compare WordNet with the dictionaries, we transformed sample entry microstructures of the latter into graphs and crosslinked them with the equivalent senses of the former. We found that dictionaries are in high agreement with each other, if one considers polysemy and homonymy altogether, and in moderate concordance, if one focuses merely on polysemy descriptions. Measuring the shortest path lengths on WordNet gave results comparable to those on the dictionaries in predicting semantic dissimilarity between polysemous senses, but was less felicitous while recognising homonymy
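The idea of using shortest path length in a sense network as a measure of semantic dissimilarity can be sketched with a breadth-first search over a small undirected graph. This is a minimal illustration only: the sense labels and edges below are invented and do not come from Princeton WordNet or from the paper, and the paper's actual procedure may differ.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, source, target):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy network of senses and more general concepts.
GRAPH = build_graph([
    ("mouth.organ", "body_part"),
    ("mouth.opening", "opening"),
    ("body_part", "opening"),
    ("bank.institution", "organization"),
    ("bank.slope", "landform"),
])

print(shortest_path_length(GRAPH, "mouth.organ", "mouth.opening"))
print(shortest_path_length(GRAPH, "bank.institution", "bank.slope"))
```

In this toy network the two senses of "mouth" are connected by a short path, while the two senses of "bank" are not connected at all, so the function returns None for them.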

    A (Non)-Perfect Match: Mapping plWordNet onto Princeton WordNet

    Full text link
    The paper reports on the methodology and final results of a large-scale synset mapping between plWordNet and Princeton WordNet. Dedicated manual and semi-automatic mapping procedures as well as interlingual relation types for nouns, verbs, adjectives and adverbs are described. The statistics of all types of interlingual relations are also provided

    Big data language model with part of speech tags stemmed in RAW format

    No full text
    Big data language model with part of speech tags stemmed in RAW format

    Metaphors and arguments annotations in Polish political – pre-election debates from 2019 (TVP 2019-10-01 and TVN 2019-10-08)

    No full text
    The data published here supplement a paper to be published in Metaphor and the Social World (under revision). Two debates organised and published by TVP and TVN were transcribed and annotated using the Metaphor Identification Method. We used the eMargin software, a collaborative textual annotation tool (Kehoe and Gee 2013), and a slightly modified version of MIP (Pragglejaz 2007). Each lexical unit in the transcript was labelled as a metaphor related word (MRW) if its “contextual meaning was related to the more basic meaning by some form of similarity” (Steen 2007). The meanings were established with the Wielki Słownik Języka Polskiego (Great Dictionary of Polish, ed. Żmigrodzki 2019). In addition to MRWs, lexemes which create a metaphorical expression together with an MRW were tagged as metaphor expression words (MEW). At least two words are needed to identify the actual metaphorical expression, since an MRW cannot appear without an MEW. The grammatical construction of the metaphor (Sullivan 2009) is asymmetrical: one word is conceptually autonomous and the other is conceptually dependent on the first. In construction grammar terms (Langacker 2008), the metaphor related word is elaborated by the metaphorical expression word, because the basic meaning of the MRW is extended to a more figurative meaning only if it is used jointly with the MEW. Moreover, the meaning of the MEW is rather basic and concrete, as it remains unchanged in connection with the MRW. This can be clearly seen in an expression often used in our data: “Służba zdrowia jest w zapaści” (“Health service suffers from a collapse.”), where the word “zapaść” (“collapse”) is an example of an MRW and the words “służba zdrowia” (“health service”) are labelled as MEW. The English translation of this expression needs a different verb: instead of “jest w zapaści” (“is in collapse”), the unmarked English collocation is “suffers from a collapse”, so the words “suffers from a collapse” are labelled as MRW.
    The “collapse” could be caused by heart failure, such as cardiac arrest, or any other life-threatening medical condition, and “health service” is portrayed as if it could literally suffer from such a condition, a collapse. The data are in CSV tables exported from XML files downloaded from the eMargin site. Prior to annotation, the transcripts were divided into 40 parts, each assigned to one annotator. MRWs are marked as MLN, MEWs are marked as MLP, and function words within a metaphorical expression are marked as MLI; all other words are marked noana, which means no annotation needed.

    CLARIN-PL: 40 full texts, 504 metadata records. Updated in last 30 days.