CLARIN-PL


504 research outputs found


Addenda to the inventory of female names in Słowosieć: The case of biskupka ‘bishop-FEM’ and other female names in religious life and institutions

Author: Alberski Bartłomiej
Jankowski Hubert
Publication venue: Poradnik Językowy
Publication date: 01/01/2025

Due to the dynamic social discussion and the observed increase in the use of feminatives, we deemed it appropriate to modify the current way of describing these units in plWordNet. We present the newly introduced masculinity relation describing male forms derived from female ones, such as alternatywek ‘alternates’, where the base form is the female variant alternatywka. We also describe the method of densifying relations, which is driven by the need to improve the quality of NLP for texts in which feminatives appear more frequently and for which plWordNet is used. Using the religious semantic field as an example, we propose a description of parallel hyponym–hypernym trees connected by femininity and masculinity relations at the unit level. We also provide a concise corpus-based analysis using the unit biskupka ‘bishop-FEM’ as an example, justifying the necessity of constructing relational structures analogous, to the extent possible, to those of the biskup ‘bishop’ unit.

The lexicographic description of feminine forms in plWordNet: the current state and future perspectives

Author: Alberski Bartłomiej
Jakowski Hubert
Publication venue: Prace Językoznawcze
Publication date: 01/01/2025

The aim of the study is to present a method of describing feminine forms (nouns referring to humans with female gender) in plWordNet and to indicate possible directions of its development. One of them could be increasing the quality of natural language processing (NLP), for which the resource is used. plWordNet, as a relational dictionary, is based on clustering lexical units into sets of synonymous lexical units (synsets) connected by a network of morphosemantic relations. One of the relations at the unit level is the femininity relation, which is responsible for connecting feminine derivatives with their derivational base. The study presents various types of feminine forms along with their derivational mechanisms and possible interpretations. The authors describe the current methodological solutions in plWordNet. They also propose possible further developments, including expanding the list of relations, such as introducing a masculinity relation describing an increasingly common type of derivation: przedszkolanka => przedszkolanek ‘female kindergarten teacher => male kindergarten teacher,’ niania => nianiek ‘female nanny => male nanny.’ Currently, malarka ‘female painter’ is presented as being derived from the noun malarz ‘male painter,’ but another interpretation may suggest malować ‘to paint’ as the base for malarka. This approach makes it possible to change the way derivational relations are described, from the current systematic recognition of feminine forms as derivatives of their masculine counterparts to analogous systems where the base may be a verb that forms two mutational derivatives: feminine and masculine. Furthermore, it makes it possible to describe neutral forms, which are currently becoming more widespread, e.g. malować ‘to paint’ => malarx ‘non-binary painter.’
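The unit-level relations named in this abstract can be pictured as directed links from a derivational base to its derivative. The following is only a minimal sketch under that assumption: the class, field and function names are hypothetical and do not reflect plWordNet's actual data model or API; the word pairs are the examples quoted in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DerivationalRelation:
    """A directed link between two lexical units (hypothetical model)."""
    kind: str        # "femininity" or "masculinity", as named in the abstract
    base: str        # derivational base
    derivative: str  # derived form


# Example pairs taken from the abstract above.
RELATIONS: List[DerivationalRelation] = [
    DerivationalRelation("femininity", "malarz", "malarka"),
    DerivationalRelation("masculinity", "przedszkolanka", "przedszkolanek"),
    DerivationalRelation("masculinity", "niania", "nianiek"),
]


def derivatives_of(base: str) -> List[str]:
    """Return all forms derived from the given base, sorted alphabetically."""
    return sorted(r.derivative for r in RELATIONS if r.base == base)


def base_of(derivative: str) -> Optional[Tuple[str, str]]:
    """Return (base, relation kind) for a derived form, or None if unknown."""
    for r in RELATIONS:
        if r.derivative == derivative:
            return (r.base, r.kind)
    return None
```

Under the alternative interpretation mentioned in the abstract, the base of malarka would be the verb malować rather than malarz; in this sketch that would amount to changing the `base` field and introducing a different relation kind.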

Corpus of Nineteenth-Century French Texts on Palingenesis

Author: Sukiennicka Marta
Publication venue: Marta Sukiennicka
Publication date: 10/04/2025

A tagged corpus of nineteenth-century French texts related to the concept of palingenesis, constructed from the digital collections of the French National Library "Gallica". The corpus is annotated for discourse type (scientific, philosophical, literary), discipline (medicine, astronomy, politics, literary criticism, etc.), textual genre (treatise, essay, study, novel, poem, etc.), function (scientific concept, religious concept, metaphor, etc.), and tone of the concept's usage (serious, ironic, polemical, etc.). Each entry includes a concise interpretive summary of the term’s meaning within the given work, as well as a selection of representative or significant excerpts

MultiCo-Hub: a corpus of multimodal enrichments with motion-trajectory annotation

Author: Klessa Katarzyna
Karpiński Maciej
Jarmołowicz-Nowikow Ewa
Sawicka-Stępińska Brygida
Klessa Wojciech
Publication venue: Adam Mickiewicz University, Poznan
Publication date: 01/01/2025

MultiCo-Hub is a multimodal dataset including 11 zipped subsets (henceforth: sessions) of time-aligned audio, video and motion-capture–derived BVH data, together with multi-layered Annotation Pro files (ANTx) extended with automatically extracted motion-trajectory layers. The dataset includes a dedicated training session demonstrating body movement (TESM_001). The video and audio files included in the remaining 10 sessions are derived from the MultiCo corpus (http://hdl.handle.net/11321/942). The original MultiCo sessions were enriched by means of:
- full synchronization of the audio, video and BVH streams to support precise multimodal analysis;
- normalization and conversion of the motion-capture (BVH) data and its integration directly into the annotation files as layers describing trajectories of selected body parts (positions, speeds, gesture-space coordinates).
Furthermore, for each session, the corpus provides a composite multi-view video file showing all four camera angles simultaneously. This makes the dataset easier to inspect and substantially more accessible for users working on standard-performance computers. MultiCo-Hub offers a compact, ready-to-use resource for research and education in the areas of speech–gesture coordination, gesture space, temporal properties of movement, communicative alignment of interlocutors, and multimodal interaction. Export to common formats (TextGrid, EAF, CSV, etc.) is supported via Annotation Pro, facilitating downstream statistical analysis, visualization and interoperability. The MultiCo-Hub set also served as input for developing a set of R and C# applications and scripts that support the analysis and visualization of gesture space, temporal movement properties, and communicative alignment in dialogue.

DiPSS - longitudinal corpus of drift in Polish students of Spanish

Author: Sawicka-Stępińska Brygida
Sypiańska Jolanta
Publication venue: Adam Mickiewicz University, Poznań
Publication date: 30/11/2025

The DiPSS corpus (part 1) is a longitudinal speech resource documenting the phonetic productions of L1 Polish students learning L2 English and L3 Spanish. It includes recordings from first-year Spanish philology students across five testing points over two academic years, capturing word-initial stops (lenis and fortis), vowels (e, o, u, a), rhotics ({rr}) and approximants ([β, ð, ɣ]). The corpus integrates rich metadata including L2/L3 proficiency, language aptitude (LLAMA, Meara & Rogers, 2019), and age of onset for foreign languages, allowing for longitudinal and cross-linguistic analyses. DiPSS is designed as an open-access resource suitable for research in L1 drift, cross-linguistic influence, speech production and multilingual acquisition. Its detailed annotation, metadata and longitudinal structure make it a valuable tool for both linguistic research and computational modeling.
The task consisted of reading words presented on auto-advancing slides in Polish, Spanish, and English. Instructions for the entire task were delivered in Polish. Prior to the Spanish and English sets of target words, participants received a written instruction along with a brief audio prompt in the respective language to establish the appropriate language mode. Audio was captured using an AKG C4000 microphone connected to a computer via a Focusrite Scarlett 2i2 audio interface and recorded using Audacity software, version 3.4.2. Data were collected from 28 speakers across testing times 1–4, and 22 speakers across testing times 1–5. The testing times correspond to:
T1: October, year 1, during the opening week of the program
T2: November, year 1, after approximately five full weeks of instruction
T3: February, year 1, at the end of the first semester
T4: June, year 1, at the end of the first academic year
T5: June-September, year 2, at the end of the second year of studies
Metadata corresponding to the speakers include the following information:
A: Sociodemographic data: speaker ID, gender, age
B: Language background: self-reported L1, L2 and L3, level of Spanish (A - absolute beginners, B - false beginners, C - advanced learners)
C and D: L2 and L3 profile (self-reported proficiency, age of onset of formal education, age of exposure to naturalistic speech, stay in Spanish/English speaking countries for longer than a month, weekly exposure to naturalistic speech)
E: Proficiency and language aptitude test results
The DiPSS corpus consists of five packages (T1-T5) of recordings with forced-aligned three-tier annotation in TextGrid, performed using WebMAUS Basic (Kisler et al. 2017). Each package corresponds to one testing time and contains three sets of data: Polish, Spanish, and English. Packages T1-T4 each include 28 recordings per language, with corresponding TextGrid files. Package T5 includes 22 recordings per language, also with their corresponding TextGrid files. In total, the corpus comprises 402 pairs of WAV and TextGrid files from 28 speakers. The total recording time is approximately 20 hours, and the complete corpus size is 2.5 GB. The recordings in the released DiPSS corpus part 1 cover data collected in the mid-2020s. The labels of the recordings adhere to a structured format: SPEAKER ID_TESTING TIME_LANGUAGE, wherein: SPEAKER ID corresponds to a unique speaker ID consisting of 6 characters, TESTING TIME corresponds to one of the five recording sessions (T1, T2, T3, T4, T5), LANGUAGE corresponds to the language in which the task was recorded (PL – Polish, ES – Spanish, EN – English). The data were processed using the server infrastructure developed within "Digital Research Infrastructure for the Arts and the Humanities" (POIR.04.02.00-00-D006/20).
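The label format and the file count stated above can be checked with a short script. This is a sketch under stated assumptions, not part of the corpus tooling: the function and class names are hypothetical, the speaker ID is assumed to be any six characters other than an underscore (the description does not specify the character set), and the sample label `ABC123_T2_ES` is invented for illustration.

```python
import re
from typing import NamedTuple

# SPEAKER ID (6 characters) _ TESTING TIME (T1..T5) _ LANGUAGE (PL, ES, EN).
# Assumption: the speaker ID never contains an underscore.
LABEL_RE = re.compile(r"(?P<speaker>[^_]{6})_(?P<time>T[1-5])_(?P<lang>PL|ES|EN)")


class DipssLabel(NamedTuple):
    speaker_id: str
    testing_time: str
    language: str


def parse_dipss_label(label: str) -> DipssLabel:
    """Split a recording label into its three documented components."""
    match = LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"not a valid DiPSS label: {label!r}")
    return DipssLabel(match["speaker"], match["time"], match["lang"])


def expected_file_pairs() -> int:
    """Recompute the number of WAV/TextGrid pairs from the package description."""
    languages = 3                      # Polish, Spanish, English
    t1_to_t4 = 4 * 28 * languages      # four packages, 28 recordings per language
    t5 = 22 * languages                # one package, 22 recordings per language
    return t1_to_t4 + t5
```

The recomputed total, 4 × 28 × 3 + 22 × 3 = 402, agrees with the number of file pairs given in the description.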

plWordNet 5.0 – challenges of a life-long wordnet development process

Author: Rudnicka Ewa
Alberski Bartłomiej
Piasecki Maciej
Publication venue: GWC
Publication date: 01/01/2025

The construction of plWordNet began in 2005 and has been continued since then. In this paper we present the latest 5.0 version and describe the challenges connected with a life-long wordnet development process. These involve changes in the procedures and lexicographers' teams, the necessity to extend the lexical description, and the need to link to external resources (Princeton WordNet, sense-tagged corpora, and valence dictionary). We describe different strategies and diagnostics implemented to improve the quality of the resource

HANOI corpus and tool for analysis of note-taking of conference interpreters

Author: Jelec Anna
Publication venue: Adam Mickiewicz University, Poznań
Publication date: 24/11/2025

HANOI is a resource for understanding the process of consecutive interpreting through the analysis of the note-taking process. Each data package is a record of an interpretation performed by a professional interpreter and includes: a recording of the interpretation, the interpreter's notes, a recording of the note-taking process, transcripts of the original speech and the translated text, and a reference to the source speech recording from the MultiCo corpus.

Register of multi-word expressions deleted from plWordNet after verification

Author: Maziarz Marek
Rudnicka Ewa
Dziob Agnieszka
Wieczorek Justyna
Publication venue: Wrocław University of Science and Technology
Publication date: 29/08/2024

A dataset of multi-word expressions deleted from plWordNet after manual verification of their lexicality status

DN XXI 213 (trial corpus)

Author: Jaworska Julia
Publication venue: SWPS University
Publication date: 23/06/2024

This is a trial corpus

The LnNor Corpus: A spoken multilingual corpus of non-native and native Norwegian, English and Polish (Part 1)

Author: Wrembel Magdalena
Hwaszcz Krzysztof
Pludra Agnieszka
Skałba Anna
Weckwerth Jarosław
Walczak Angelika
Sypiańska Jolanta
Żychliński Sylwiusz
Malarski Kamil
Kędzierska Hanna
Kaźmierski Kamil
Gruszecka Justyna
Dziubalska-Kolaczyk Katarzyna
Czarnecki-Verner Tristan
Cal Zuzanna
Balas Anna
Publication venue: Adam Mickiewicz University
Publication date: 31/01/2024

The LnNor corpus was created as part of the data collection in two projects: CLIMAD (Cross-linguistic influence in multilingualism across domains: phonology and syntax) and ADIM (Across-domain Investigations in Multilingualism: Modeling L3 Acquisition in Diverse Settings), led by Prof. Magdalena Wrembel at Adam Mickiewicz University in Poznań, Poland and by Prof. Marit Westergaard at the Arctic University of Norway, from December 2021 to April 2024 with funding from the National Science Centre (NCN) in Poland and Norway Grants. The CLIMAD and ADIM projects explored cross-linguistic influence (CLI) in the acquisition, processing, and use of a third language (L3/Ln) across various language domains and focused on different settings and stages of acquisition from a multilingual perspective. A range of methodologies, such as perception and production tests, grammaticality judgement tasks and online brain imaging techniques like EEG, were used to study multilingual processing. By capturing real-time insights into the interplay of cross-linguistic influences, the projects contributed to the understanding of L3/Ln acquisition and advanced theoretical frameworks in this field. Corpus data collection covered a broad range of speech elicitation tasks. The recordings consist of word, sentence and text reading, picture story description, video story retelling, spontaneous speech and socio-phonetic interviews in Polish, English and Norwegian. The corpus contains metadata based on the Language History Questionnaire (Li et al. 2020) such as age, gender, native languages, proficiency level, length of language exposure, age of onset.
Data was collected from different groups of speakers:
• L1 Polish learners of Norwegian as L3/Ln, attending Scandinavian studies at Poznań College of Modern Languages and the University of Szczecin (instructed learners)
• L1 Polish learners of Norwegian as L3/Ln, living in Norway (naturalistic learners)
• L1 English natives as controls
• L1 Norwegian natives as controls
• speakers of L2/L3/Ln English and L2/L3/Ln Norwegian with various L1 backgrounds
Six types of speech tasks were recorded in Norwegian, English and Polish:
• word reading
• sentence reading
• text reading (“The North Wind and the Sun”)
• picture description
• picture story telling
• video story telling
Metadata corresponding to the recordings include the following information:
• speaker ID, age, gender, education, current residence, speaker status (instructed/naturalistic/native), native language, additional languages spoken
• recording ID
• language: PL (Polish), EN (English), NO (Norwegian)
• status: L1, L2, L3/Ln
• speech task: WR (word reading), SR1/2/... (sentence reading), TR1/2/... (text reading), PD (picture description), ST (story telling), VT (video story telling)
• recording date, recording place, iteration, recording environment, recording device, type of microphone, noise level, etc.
The labels of the recordings adhere to a structured format: PROJECT_SPEAKER ID_LANGUAGE STATUS_TASK, wherein:
• PROJECT corresponds to the project within which the data were collected (A for ADIM, C for CLIMAD)
• SPEAKER ID corresponds to a unique speaker ID consisting of 8 characters
• LANGUAGE STATUS represents the language in which the task was recorded and its status for the speaker (e.g., L1PL, L2EN, L3NO)
• TASK corresponds to the type of speech task recorded (e.g., TR, SR, WR, etc.)
The LnNor corpus has been created to represent multilingual speech with a focus on L3/Ln Norwegian learners as well as native controls of Norwegian, English and Polish.
The corpus is designed to study linguistic variation in learners acquiring Norwegian as a foreign language in instructed and naturalistic settings. Additionally, a subcorpus of native speech patterns is provided to serve as a benchmark against which the learners' productions can be compared. Furthermore, parts of the corpus contain word alignment with orthographic transcriptions of speech to facilitate subsequent analyses across various linguistic domains. All speech samples were recorded with Shure SM-35 unidirectional cardioid head-worn condenser microphones, using portable Marantz PMD620 solid state recorders with the signal digitized at 48 kHz, 16-bit. This set-up was selected to minimize ambient noise and provide clear and focused recordings. The LnNor corpus part 1 consists of 1073 annotated files from 78 speakers. The speakers included 53 L1 Polish, 16 L1 Norwegian and 9 L1 speakers of other European languages. The total recording time is approximately 35 hours and the full size is 18 GB. The recordings in the released LnNor corpus part 1 cover data collected between 2021 and 2022.
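The recording label format described for LnNor (PROJECT_SPEAKER ID_LANGUAGE STATUS_TASK) can be split into its fields with a regular expression. The sketch below is illustrative only and rests on several assumptions not confirmed by the description: that speaker IDs contain no underscore, that the Ln status is written literally as `Ln` in labels, and that sentence- and text-reading tasks may carry a numeric suffix. The names and the sample label `C_SPKR0001_L3NO_TR1` are hypothetical.

```python
import re
from dataclasses import dataclass

PROJECTS = {"A": "ADIM", "C": "CLIMAD"}

# PROJECT _ SPEAKER ID (8 characters) _ STATUS+LANGUAGE _ TASK
LABEL_RE = re.compile(
    r"(?P<project>[AC])_"
    r"(?P<speaker>[^_]{8})_"                       # assumption: no underscore in IDs
    r"(?P<status>L(?:[0-9]|n))(?P<language>PL|EN|NO)_"
    r"(?P<task>WR|PD|ST|VT|(?:SR|TR)[0-9]*)"       # SR/TR may be numbered (SR1, TR2, ...)
)


@dataclass(frozen=True)
class LnNorLabel:
    project: str    # full project name, "ADIM" or "CLIMAD"
    speaker_id: str
    status: str     # e.g. "L1", "L2", "L3"
    language: str   # "PL", "EN" or "NO"
    task: str       # e.g. "WR", "SR1", "TR1"


def parse_lnnor_label(label: str) -> LnNorLabel:
    """Split a recording label into the fields named in the corpus description."""
    match = LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"not a valid LnNor label: {label!r}")
    return LnNorLabel(
        project=PROJECTS[match["project"]],
        speaker_id=match["speaker"],
        status=match["status"],
        language=match["language"],
        task=match["task"],
    )
```

For example, `parse_lnnor_label("C_SPKR0001_L3NO_TR1")` would yield a CLIMAD recording of speaker `SPKR0001` reading a text in L3 Norwegian, provided the assumptions above hold for the released files.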

40 full texts

504 metadata records

Updated in last 30 days.
