CLARIN-PL
504 research outputs found
Corpus of Russian Local Press of the Millennium Period (1996-2006)
Corpus of Russian Local Press of the Millennium Period (1996-2006): selected archives (spanning 1995/1996 to 2006) of two hundred and eighty (280) local newspapers from eighty-six (86) subjects of the Russian Federation (2005-2006): Oblasts (provinces), Republics, Krais (territories), Autonomous Okrugs (with a substantial ethnic minority), Federal cities, and Autonomous Oblasts.
MultiCo
The MultiCo multimodal corpus is one of the outcomes of the project "Digital Research Infrastructure for the Humanities and Arts Studies DARIAH-PL." The project was funded by POIR 4.2 of the European Regional Development Fund from 2021 to 2023 and was carried out by a consortium of academic institutions across Poland, including Adam Mickiewicz University, Poznań. The corpus was developed at the Faculty of Modern Languages of Adam Mickiewicz University in Poznań.
The motivation behind creating the corpus stems from contemporary research on interpersonal communication. These studies confirm that, in order to understand and model the multifaceted process of communication, it is essential to study and describe not only speech but also other components of communication, such as gestures, facial expressions, and body posture. The MultiCo corpus was designed to support and facilitate this type of research. The corpus contains over 15 hours of recordings and consists of three sections:
- Monologs representing persuasion in parliamentary speeches and motivational talks (TEDx),
- Dialogs based on task-oriented activities recorded in a lab setting,
- Multilogs illustrating discussions with multiple participants, exemplified by conversations on current sports events (TVP Sport 4-4-2).
The monolog and multilog sections are based on materials available in public media or archives, while the dialog section consists of task-oriented dialogs designed and recorded specifically for this resource.
Word Sense Disambiguation Based on Iterative Activation Spreading with Contextual Embeddings for Sense Matching
Many knowledge-based solutions have been proposed to solve the Word Sense Disambiguation (WSD) problem with limited annotated resources. Such WSD algorithms are able to cover very large sense repositories, but are still outperformed by supervised ones on benchmark data. In this paper, we start with an analysis identifying key properties and issues in the application of spreading activation algorithms in knowledge-based WSD, e.g. the influence of local network structures, interaction with context information, and sense frequency. Taking our observations as a point of departure, we introduce a novel solution with new context-to-sense matching using BERT embeddings, an iterative parallel spreading activation function, and selective sense alignment using contextual BERT embeddings. The proposed solution obtains performance beyond the state of the art for contemporary knowledge-based WSD approaches on both English and Polish data.
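For readers unfamiliar with the general idea, the following is a minimal sketch of generic spreading activation over a toy sense graph. It is not the algorithm from the paper: the graph, the node names, the `decay` factor, and the functions `spread_activation` and `pick_sense` are all hypothetical, and the BERT-based context-to-sense matching is omitted entirely.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, iterations=3):
    """Generic iterative spreading activation (illustrative only).

    graph: dict mapping node -> list of (neighbor, weight) pairs
    seeds: dict mapping node -> initial activation (e.g. context words)
    """
    activation = defaultdict(float, seeds)
    for _ in range(iterations):
        # Parallel update: collect all increments from the current state
        # before applying any of them.
        incoming = defaultdict(float)
        for node, value in activation.items():
            for neighbor, weight in graph.get(node, []):
                incoming[neighbor] += value * weight * decay
        for node, value in incoming.items():
            activation[node] += value
    return dict(activation)


def pick_sense(candidates, activation):
    """Choose the candidate sense with the highest activation."""
    return max(candidates, key=lambda sense: activation.get(sense, 0.0))


# Hypothetical toy network: context words linked to senses of "bank".
graph = {
    "money": [("bank.finance", 1.0)],
    "deposit": [("bank.finance", 1.0)],
    "river": [("bank.shore", 1.0)],
}
seeds = {"money": 1.0, "deposit": 1.0}  # words observed in the context

activation = spread_activation(graph, seeds)
best = pick_sense(["bank.finance", "bank.shore"], activation)
print(best)
```

In this toy setting the financial sense accumulates activation from both context words, while the shore sense receives none.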
Polish WSD Datasets
Data and code for the paper published at ICCS 2022: "A Unified Sense Inventory for Word Sense Disambiguation in Polish". The code is available at https://gitlab.clarin-pl.eu/team-semantics/wsd-researc
DiaBiz
The DiaBiz corpus is a dialog corpus comprising recordings and annotated transcriptions of phone-based customer-agent interactions in several key business domains.
StudEmo - corpus of consumer reviews annotated with emotions
Human emotional perception is subjective by nature: each individual may express different emotions regarding the same textual content. Existing datasets for emotion analysis commonly depend on a single ground truth per data sample, derived from majority voting or averaging the opinions of all annotators. We introduce a new non-aggregated dataset, StudEmo, that contains 5,182 customer reviews, each annotated by 25 people with intensities of eight emotions from Plutchik's model, extended with valence and arousal. We also propose three personalized models that use not only textual content but also the individual human perspective, providing the model with different approaches to learning human representations. The experiments were carried out as multitask classification on two datasets: our StudEmo dataset and the GoEmotions dataset, which contains 28 emotional categories. The proposed personalized methods significantly improve prediction results, especially for emotions that have low inter-annotator agreement.
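The difference between aggregated and non-aggregated annotations can be illustrated with a small sketch. The data below are invented, the 0-3 intensity scale and the record layout are assumptions, and nothing here reflects the actual StudEmo file format.

```python
from collections import Counter
from statistics import mean

# Hypothetical annotations: review id -> {annotator id: intensity of "joy"}.
# The 0-3 scale is an assumption made for this example.
annotations = {
    "review_1": {"ann_a": 3, "ann_b": 3, "ann_c": 0},
    "review_2": {"ann_a": 1, "ann_b": 2, "ann_c": 2},
}


def majority_vote(labels):
    """Return the most frequent label among the annotators."""
    return Counter(labels).most_common(1)[0][0]


# Aggregated view: one "ground truth" value per review.
aggregated = {
    review_id: {
        "majority": majority_vote(list(scores.values())),
        "mean": mean(list(scores.values())),
    }
    for review_id, scores in annotations.items()
}

# Non-aggregated view: one record per (review, annotator) pair, so a
# personalized model can see who gave which rating.
non_aggregated = [
    (review_id, annotator, score)
    for review_id, scores in annotations.items()
    for annotator, score in scores.items()
]

print(aggregated["review_1"]["majority"])  # ann_c's dissenting 0 is lost
print(len(non_aggregated))                 # every individual rating is kept
```

The aggregated view hides the dissenting rating from `ann_c`, which is the kind of individual perspective a non-aggregated dataset preserves.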
War with striped beetle in main Polish communist party newspaper "Trybuna Ludu" 1950-1965
Articles from the main Polish communist party newspaper "Trybuna Ludu" concerning the battle against the potato beetle allegedly dropped by the US Government on Poland and other socialist countries.