CLARIN-PL
504 research outputs found
The LnNor Corpus: A spoken multilingual corpus of non-native and native Norwegian, English and Polish (Part 2)
The LnNor corpus was created as part of the data collection in two projects: CLIMAD
(Crosslinguistic influence in multilingualism across domains: phonology and syntax) and
ADIM (Across-domain Investigations in Multilingualism: Modeling L3 Acquisition in Diverse
Settings), led by Prof. Magdalena Wrembel at Adam Mickiewicz University in Poznań, Poland
and by Prof. Marit Westergaard at the Arctic University of Norway, from December 2021 to
April 2024 with funding from the National Science Centre (NCN) in Poland and Norway
Grants.
The CLIMAD and ADIM projects explored cross-linguistic influence (CLI) in the
acquisition, processing, and use of a third language (L3/Ln) across various language domains
and focused on different settings and stages of acquisition from a multilingual perspective. A
range of methods, including perception and production tests, grammaticality judgement tasks
and online brain imaging techniques such as EEG, was used to investigate multilingual
processing. By capturing cross-linguistic influence in real time, the projects contributed to the
understanding of L3/Ln acquisition and advanced theoretical frameworks in this field.
Corpus data collection covered a broad range of speech elicitation tasks. The recordings
consist of word, sentence and text reading, picture story description, video story retelling,
spontaneous speech and socio-phonetic interviews in Polish, English and Norwegian. The
corpus contains metadata based on the Language History Questionnaire (Li et al. 2020), such as
age, gender, native languages, proficiency level, length of language exposure and age of onset.
Data was collected from different groups of speakers:
• L1 Polish learners of Norwegian as L3/Ln, attending Scandinavian studies at Poznań College
of Modern Languages and the University of Szczecin (instructed learners);
• L1 Polish learners of Norwegian as L3/Ln, living in Norway (naturalistic learners);
• L1 English native speakers as controls;
• L1 Norwegian native speakers as controls.
The following types of speech tasks were recorded in Norwegian, English and Polish:
• word reading
• sentence reading
• text reading (“The North Wind and the Sun”)
• story telling (spontaneous)
• picture description
• picture story telling
• video story telling
• translation from Polish/English to Norwegian
Metadata corresponding to the recordings include the following information:
• speaker ID, age, gender, education, current residence, speaker status
(instructed/naturalistic/native), native language, additional languages spoken
• recording ID
• language: PL (Polish), EN (English), NO (Norwegian)
• status: L1, L2, L3/Ln
• speech task: WR (word reading), SR1/2/... (sentence reading), TR1/2/... (text reading), PD
(picture description), ST (story telling), VT (video story telling)
• recording date, recording place, iteration, recording environment, recording device, type of
microphone, noise level, etc.
The labels of the recordings adhere to a structured format: PROJECT_SPEAKER
ID_LANGUAGE STATUS_TASK, wherein:
• PROJECT corresponds to the project within which the data were collected (A for ADIM, C
for CLIMAD)
• SPEAKER ID corresponds to a unique speaker ID consisting of 8 characters
• LANGUAGE STATUS represents the language in which the task was recorded and its status
for the speaker (e.g., L1PL, L2EN, L3NO)
• TASK corresponds to the type of speech task recorded (e.g., TR, SR, WR, etc.)
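The label format described above can be illustrated with a minimal parsing sketch. It rests on several assumptions that the description does not confirm: that the fields are separated by underscores, that the 8-character speaker ID is alphanumeric, that the status prefix is one of L1, L2, L3 or Ln, and that a task code is a run of capital letters optionally followed by an iteration number. The label used in the example is hypothetical, and `parse_label` and `RecordingLabel` are illustrative names, not part of the corpus tooling.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Project and language codes as listed in the corpus description.
PROJECTS = {"A": "ADIM", "C": "CLIMAD"}
LANGUAGES = {"PL": "Polish", "EN": "English", "NO": "Norwegian"}

# Assumed layout: PROJECT_SPEAKERID_STATUS+LANGUAGE_TASK[iteration]
LABEL_RE = re.compile(
    r"^(?P<project>[AC])_"
    r"(?P<speaker>[A-Za-z0-9]{8})_"          # assumption: alphanumeric ID
    r"(?P<status>L[123n])(?P<lang>PL|EN|NO)_"
    r"(?P<task>[A-Z]+)(?P<iteration>\d*)$"   # e.g. WR, SR1, TR2
)


@dataclass(frozen=True)
class RecordingLabel:
    project: str
    speaker_id: str
    status: str
    language: str
    task: str
    iteration: Optional[int]


def parse_label(label: str) -> RecordingLabel:
    """Split a recording label into its fields, or raise ValueError."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"label does not match the assumed format: {label!r}")
    iteration = match.group("iteration")
    return RecordingLabel(
        project=PROJECTS[match.group("project")],
        speaker_id=match.group("speaker"),
        status=match.group("status"),
        language=LANGUAGES[match.group("lang")],
        task=match.group("task"),
        iteration=int(iteration) if iteration else None,
    )


# Hypothetical label, not taken from the corpus.
print(parse_label("C_AB12CD34_L3NO_TR1"))
```

If the released files use a different separator or ID alphabet, only the regular expression needs to change.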
The LnNor corpus has been created to represent multilingual speech with a focus on L3/Ln
Norwegian learners as well as native controls of Norwegian, English and Polish. The corpus is
designed to study linguistic variation in learners acquiring Norwegian as a foreign language in
instructed and naturalistic settings. Additionally, a subcorpus of native speech patterns is
provided as a benchmark against which the learners' productions can be compared.
Furthermore, part 2 of the corpus contains word alignment with orthographic transcriptions of
speech to facilitate subsequent analyses across various linguistic domains.
All speech samples were recorded with the use of Shure SM-35 unidirectional cardioid
head-worn condenser microphones, using portable Marantz PMD620 solid state recorders with
signal digitized at 48 kHz, 16-bit. This set-up was selected to minimize ambient noise and
provide clear and focused recordings.
The LnNor corpus part 2 consists of 1671 annotated files from 164 speakers. The
speakers included 113 L1 Polish, 33 L1 Norwegian and 18 L1 English speakers. The total
recording time is approximately 59 hours and the full size is 26 GB. The recordings in the
released LnNor corpus part 2 cover data collected between 2023 and 2024.
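As a rough cross-check of the recording parameters, the uncompressed PCM size implied by a given duration can be estimated as duration × sample rate × bytes per sample × channels. The sketch below uses the stated 48 kHz, 16-bit and approximately 59 hours. The channel count is not given in the description, so mono is an assumption and is exposed as a parameter. The estimate is not expected to match the stated 26 GB exactly, since the channel count and the exact contents of the release are not specified here.

```python
def pcm_size_bytes(duration_s: float,
                   sample_rate_hz: int = 48_000,
                   bit_depth: int = 16,
                   channels: int = 1) -> int:
    """Size of uncompressed PCM audio, ignoring file headers."""
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


hours = 59  # approximate total recording time from the description
size = pcm_size_bytes(hours * 3600)  # channels=1 is an assumption
print(f"{size / 1e9:.1f} GB")
```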
Polish multi-word lexical unit recognition
A dataset of Polish multi-word expressions manually annotated with respect to their lexicality status. We show annotators' decisions with respect to two criteria: 'terminology' (that is, whether a given word combination can be classified as a term) and 'paraphrase' (that is, whether a given word combination can be easily paraphrased). In the last column, we present lexicographers' decisions with respect to lexicality status: "tak" ('yes') means a given word combination is a multi-word lexical unit, and "nie" ('no') means it is not.
Corpus of Barack Obama's pre-election speeches (Korpus przemówień przedwyborczych Baracka Obamy)
A text corpus of Barack Obama's speeches from 2006-2015.
WordNet-based Data Augmentation for Hybrid WSD Models
Recent advances in Word Sense Disambiguation suggest that neural language models can be successfully improved by incorporating knowledge base structure. Such models are called hybrid solutions. We propose a method of improving hybrid WSD models by harnessing data augmentation techniques and bilingual training. The data augmentation consists of structure augmentation using interlingual connections between wordnets and text data augmentation based on multilingual glosses and usage examples. We utilise a language-agnostic neural model trained both with SemCor and Princeton WordNet gloss and example corpora, as well as with Polish WordNet glosses and usage examples. This augmentation technique makes a well-known hybrid WSD architecture competitive with current state-of-the-art models, even more complex ones.
Wordnet-oriented Recognition of Derivational Relations
Derivational relations are an important element in defining meanings, as they help to explore word-formation schemes and predict senses of derivates (derived words). In this work, we analyse different methods of representing derivational forms obtained from WordNet – from quantitative vectors to contextual learned embedding methods – and compare ways of classifying the derivational relations occurring between them. Our research focuses on the explainability of the obtained representations and results. The data source for our research is plWordNet, which is the wordnet of the Polish language and includes a rich set of derivation examples.
Lexicalised and Non-lexicalized Multi-word Expressions in WordNet: a Cross-encoder Approach
Focusing on recognition of multi-word expressions (MWEs), we address the problem of recording MWEs in WordNet. In fact, not all MWEs recorded in that lexical database can be considered lexicalised beyond doubt (e.g. elements of wordnet taxonomy, quantifier phrases, certain collocations). In this paper, we use a cross-encoder approach to improve our earlier method of distinguishing between lexicalised and non-lexicalised MWEs found in WordNet using custom-designed rule-based and statistical approaches. We achieve an F1-measure close to 80% for the class of lexicalised word combinations, easily beating two baselines (a random one and a majority-class one). The language model also proves to be better than a feature-based logistic regression model.
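To make the evaluation terms above concrete, the sketch below computes the F1-measure for a single class (the harmonic mean of precision and recall) and builds a majority-class baseline. The labels `"lex"` and `"non"` and the toy predictions are invented for illustration only and are not the paper's data or results. The function names are likewise hypothetical.

```python
from collections import Counter


def f1_for_class(gold, pred, positive):
    """F1 of one class: harmonic mean of its precision and recall."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_baseline(gold):
    """Predict the most frequent gold label for every item."""
    most_common_label = Counter(gold).most_common(1)[0][0]
    return [most_common_label] * len(gold)


# Toy data, invented for illustration only.
gold = ["lex", "lex", "lex", "non", "non", "non", "non"]
pred = ["lex", "lex", "non", "lex", "non", "non", "non"]

print(f1_for_class(gold, pred, "lex"))
print(f1_for_class(gold, majority_baseline(gold), "lex"))
```

When the class of interest is the minority class, as in this toy example, the majority-class baseline never predicts it and its F1 for that class is zero, which is why per-class F1 is more informative than accuracy in such settings.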
Wordnet for Definition Augmentation with Encoder-Decoder Architecture
Data augmentation is a difficult task in Natural Language Processing. Simple methods that can be applied relatively easily in other domains, such as insertion, deletion or substitution, mostly change the sentence meaning significantly and yield incorrect examples. Wordnets are potentially a perfect source of rich, high-quality data that, when integrated with the powerful capacity of generative models, can help to solve this complex task. In this work, we use plWordNet, a wordnet of the Polish language, to explore the capability of encoder-decoder architectures in data augmentation of sense glosses. We discuss the limitations of generative methods and perform a qualitative review of generated data samples.