Eurac Research CLARIN Centre Repository
43 research outputs found
MERLIN Written Learner Corpus for Czech, German, Italian 1.2
The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels
“One school, many languages”: A Teacher Questionnaire for Research on Plurilingual Education
This questionnaire was used in the spring of 2021 by the research team of the project “One school, many languages / A lezione con più lingue / Sprachenvielfalt macht Schule” (SMS 2.0) of Eurac Research in the context of a cross-sectional, explorative study on plurilingual education. Plurilingual education is understood here as any approach in which two or more languages are strategically used for teaching and learning with the aim of encouraging students to gain increased awareness of and appreciation for linguistic diversity, and to leverage the resources of their repertoires to enhance their overall learning (Guarda 2023).
The study in which this questionnaire was developed involved a selected sample of teachers of all subjects working at all levels of education (from primary to upper secondary school) in the Italian province of South Tyrol, a historically multilingual area where three official languages (German, Italian and Ladin) now coexist with the new forms of multilingualism brought by recent migration flows. The aim of the study was to explore whether and how plurilingual education was implemented by the questionnaire respondents, as well as to identify their formative needs with regard to its implementation.
Based on this, the main research questions that informed the study are as follows:
• RQ1. With which frequency, if any, did the questionnaire respondents implement plurilingual didactic activities (PDAs) before the outbreak of the Covid-19 pandemic?
• RQ2. In case the respondents did implement PDAs, what kind of activities did they conduct in their classes?
• RQ3. In case the respondents did implement PDAs, which languages and/or varieties did they involve?
• RQ4. In case the respondents did not implement any PDAs in their classes, what reasons did they provide?
• RQ5. What were the respondents’ formative needs with regard to plurilingual education and its implementation?
Since the study took place in a time when schools in Italy were still dealing with the Covid-19 pandemic, it was hypothesised that the frequency or conditions of PDA implementation had been affected by the emergency situation. This aspect was taken into account while designing the questionnaire, and this is also why the research questions reported above make reference to experiences in times before the outbreak of the pandemic.
The questionnaire includes 55 items distributed across four sections, which aimed to collect information about the school the respondents were working in, their experience (if any) with plurilingual education, their formative needs and their biodata.
The questionnaire can be adapted to inform the design and administration of future questionnaires aimed at a deeper understanding of plurilingual education through the experiences and perspectives of schoolteachers, both in South Tyrol and in other increasingly multilingual contexts. Users acknowledge and agree that the survey is provided “as is,” without warranty of any kind, and that users assume all risks and liabilities arising from or relating to its and recipient subsidiaries’ use of and reliance upon the survey. Eurac Research makes no representations or warranties of any kind whatsoever, express or implied, at law or in equity, in connection with or with respect to the survey, including any representations or warranties in regard to quality, performance, or noninfringement.
If interested, researchers can read more about the questionnaire, as well as about the findings of the study in which the questionnaire was administered, in the following publications:
Guarda, M., Colombo, S. & Flarer, H. (2022). Plurilinguismo: uno studio esplorativo sulla didattica plurilingue. Bolzano: Eurac Research. https://sms-project.eurac.edu/report-didattica-plurilingue/?lang=it ISBN: 978-88-98857-77-7
Guarda, M., Colombo, S. & Flarer, H. (2022). Mehrsprachigkeit: Eine explorative Studie zur Mehrsprachigkeitsdidaktik. https://sms-project.eurac.edu/bericht-mehrsprachigkeitsdidaktik/?lang=de ISBN: 978-88-98857-76-0.
Guarda, M. (2023). Plurilingual education through the teachers’ eyes: insights from South Tyrol. In: Fusco, F., Marcato, C. & Oniga, R. (eds.) Proceedings of the Third International Colloquium on Plurilingualism, 252-269. Udine: FORUM.
HELLO CAMPANIA! Ukraina Collection
The Ukrainian collection contains data for 26 first-generation (G1) speakers, 19 females and 6 males.
The collection contains three folders for each group: the sociolinguistic interview and a language portrait
LegISTyr test set
LegISTyr is a machine translation test set for evaluating the quality of legal terminology translation from Italian to South Tyrolean German, a minor standard variety of German. It covers specific legal subdomains or legal translation issues: 1) standardised terminology, 2) occupational health and safety, 3) subsidised housing, 4) family law, 5) criminal and criminal procedure law, 6) homonyms, 7) abbreviated forms, 8) gender-inclusive writing strategies. Each subset contains at least 250 examples, i.e. five examples for each term or twenty examples for each inclusive writing strategy. The total number of examples is 2067.
The example sentences in the test set showcase single-word and multi-word terms from the Italian legal system, together with their correct, standardised or non-standardised South Tyrolean German target hypothesis. The test set also lists other (less) acceptable variants used in South Tyrol and, where available, equivalent terms from other German-speaking legal systems (mainly Austria, Germany, Switzerland). The legal subdomain is specified for each example in every subset, except for the last subset on gender-inclusive writing. This subset contains examples for different strategies used in Italian but no target hypotheses, as there may be several acceptable ones.
LegISTyr can be used, for example, to assess the success of terminology enforcement strategies when machine translating legal and administrative texts from Italian into German as well as the influence of major varieties of legal German on translations into a minor standard variety
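As a rough illustration of how such an assessment might be scripted, the following minimal Python sketch scores machine translations by checking whether the expected target term appears in each output. The `TermExample` class, its field names and the placeholder strings are all assumptions made for this example; they do not reflect the actual file layout of LegISTyr, and plain substring matching does not handle inflected forms.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class TermExample:
    """One hypothetical test-set entry; the field names are assumptions."""
    source_it: str                         # Italian source sentence
    target_term: str                       # expected South Tyrolean German term
    other_variants: Tuple[str, ...] = ()   # e.g. terms from other legal systems


def _check_aligned(examples: Sequence[TermExample], translations: Sequence[str]) -> None:
    if len(examples) != len(translations):
        raise ValueError("examples and translations must align one-to-one")


def term_accuracy(examples: Sequence[TermExample], translations: Sequence[str]) -> float:
    """Share of translations containing the expected term (case-insensitive)."""
    _check_aligned(examples, translations)
    if not examples:
        return 0.0
    hits = sum(
        ex.target_term.casefold() in mt.casefold()
        for ex, mt in zip(examples, translations)
    )
    return hits / len(examples)


def variant_rate(examples: Sequence[TermExample], translations: Sequence[str]) -> float:
    """Share of translations that miss the expected term but use a listed variant."""
    _check_aligned(examples, translations)
    if not examples:
        return 0.0
    hits = 0
    for ex, mt in zip(examples, translations):
        text = mt.casefold()
        if ex.target_term.casefold() in text:
            continue
        if any(variant.casefold() in text for variant in ex.other_variants):
            hits += 1
    return hits / len(examples)


# Placeholder data, invented purely for illustration.
demo_examples = [
    TermExample("frase uno", "Zielterm", ("Variante",)),
    TermExample("frase due", "Zielterm", ("Variante",)),
]
demo_translations = ["Der Zielterm gilt hier.", "Die Variante gilt hier."]

print(term_accuracy(demo_examples, demo_translations))
print(variant_rate(demo_examples, demo_translations))
```

Reporting the variant rate separately from the accuracy is one way to make the influence of major varieties of legal German visible, since it distinguishes outputs that use a term from another legal system from outputs that miss the term altogether.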
Core Metadata Schema for Learner Corpora (LC-meta) v2
This document contains a list of metadata fields that can be used to describe learner corpus data. The core metadata scheme is structured around eight metadata types:
- Administrative metadata
- Corpus design metadata
- Learner
- Text (language sample)
- Situational and task characteristics
- Annotation
- Annotator
- Transcriber
German Summary Corpus (GerSumCo) v1.0.0
The GerSumCo (German Summary Corpus) is a learner corpus comprising syntheses written by L2 German writers (CEFR B2/C1) and writers of L1 German. The corpus has been created with the objective of conducting a comparative analysis of the academic writing of L1 German and L2 German students.
The two subcorpora (L1 and L2) contain a total of 286 texts (178 L1 and 108 L2), written by 286 students at 14 universities and language schools in Germany (Bamberg, Bochum, Dresden, Hamburg, Hildesheim, Kiel, Leipzig, Magdeburg, Osnabrück, Potsdam, Trier, Wuppertal), Poland (Gdansk) and China (Hangzhou). The texts were collected between 2022 and 2024 as part of a PhD research project about a contrastive interlanguage analysis using GerSumCo and Beldeko to identify L1-dependent features in cohesion in L2/L1 German.
The metadata files (Meta_GerSumCo_L1 & Meta_GerSumCo_L2) contain the following information:
- Up to three L1s of the writers
- Up to three L2s of the writers
- Collection date
- Topic
- Whether the text was written as homework or in class
- Group of students the texts belonged to
The file names contain the following information:
- Whether the text is part of the L1 or L2 subcorpus
- Topic
The summaries, on average, consist of 230 words. The texts were either produced in class on computers or as homework, within a 60-minute time frame. Students were permitted to use online dictionaries, but no AI-based auxiliary means. They were required to summarise two texts on one of four topics related to language variation in German: Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch and Varianten-Wörterbuch des Deutschen.
This version contains the TXT files of the texts and the CSV files containing the manual annotations of the texts with token ID, sentence ID, source text form, target form, automatically annotated lemma, POS (STTS) and simple UPOS part-of-speech tag
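A minimal Python sketch of how annotation files of this kind might be read is given below. The column names, the comma delimiter and the sample rows are assumptions invented for illustration; the actual headers in the GerSumCo CSV files may differ and should be checked before adapting the code.

```python
import csv
import io
from collections import Counter

# Hypothetical header and rows; the real files may name the columns differently.
SAMPLE_CSV = """token_id,sentence_id,source_form,target_form,lemma,stts,upos
1,1,Die,Die,die,ART,DET
2,1,Texte,Text,Text,NN,NOUN
3,1,beschreibt,beschreibt,beschreiben,VVFIN,VERB
4,2,Sprache,Sprache,Sprache,NN,NOUN
"""


def load_tokens(handle):
    """Read one annotation file into a list of dicts, one per token."""
    return list(csv.DictReader(handle))


def corrected_tokens(rows):
    """Tokens whose source form differs from the annotated target form."""
    return [row for row in rows if row["source_form"] != row["target_form"]]


def upos_counts(rows):
    """Frequency of each UPOS tag in the file."""
    return Counter(row["upos"] for row in rows)


rows = load_tokens(io.StringIO(SAMPLE_CSV))
print(len(rows), "tokens")
print([row["source_form"] for row in corrected_tokens(rows)])
print(upos_counts(rows).most_common())
```

To read a real file, the `io.StringIO` object can be replaced with `open(path, newline="", encoding="utf-8")`, where the path and the encoding are again assumptions to be verified against the deposit.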
KONTATTO v1.0
Kontatto is a corpus of transcribed and annotated spoken data collected by Silvia Dal Negro at the Free University of Bozen/Bolzano. It consists of almost 150,000 orthographic words divided into 55 recordings involving 97 different speakers for a total of 18 hours of speech. The corpus is multilingual and contains a variety of spontaneously occurring code-mixing patterns. However, language distribution is not even: 80.4% of the corpus is made of Tyrolean words, 11.5% of Italian, 2.6% of the words were classified as Trentino, another 0.8% involved other languages (e.g. Ladin, English, etc.) and, finally, 4.7% of the words are not confidently attributable to any language in particular (e.g. proper names, widespread loanwords, some interjections, etc.).
This repository contains the Kontatto-MT corpus subset. The data was collected using a collaborative Map Task, during which two speakers and an interviewer interacted to navigate a physical map in order to reach a given destination. This subcorpus documents a variety of languages and dialects in the Dolomite region, including (some) Tyrolean and Trentino dialects, Italian, Cimbrian and Ladin, usually combined in the same dialogue. At present it consists of 35,453 tokens, 73% of which are classified as local German dialect.
Kontatto was created within the scope of two projects financed by the Autonomous Province of Bozen-Bolzano: “Italiano-tedesco: aree storiche di contatto in Sudtirolo e Trentino” (2011-2014) and “Germanico-Romanzo: discorsi e strutture in contatto nell’area dolomitica” (2016-2019). Over the years, many research assistants and students have contributed to the annotation of the data: Katrin Tartarotti, Mara Leonardi, Marta Ghilardi, Nicole Giaier, Adriana Rasa, Lucia Rossaro, Luigi Parisi and Jay Hevelone. The CLARIN deposit was prepared by Greta Franzini and Luca Ducceschi of Eurac Research
KoKo German L1 Learner Corpus 4
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim of investigating and describing the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus consists of 1503 argumentative essays which contain manually performed transcription annotations and linguistic error annotations. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections. All texts are error annotated on the orthographic level (including punctuation errors) and a selection contains error annotations on the grammatical level (i.e. ANNIS sub-corpus KoKo_4_gram, n=597) and on the lexical level (i.e. ANNIS sub-corpus KoKo_4_lex, n=980).
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas: North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
Person-related metadata provides information about:
- writer’s L1
- writer’s gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatization
HELLO CAMPANIA! Philippines Collection
The Philippines collection contains data for 66 speakers: 32 first generation (G1), 28 second generation (G2), 6 homeland (G0). The collection contains three folders for each group: the sociolinguistic interview (G1 and G2 did it in Italian, G0 did it in Tagalog), a language portrait, and two linguistic tasks (the Frog story, a description of 21 video clips, all done in Tagalog)
Core Metadata Schema for Learner Corpora (LC-meta) v1
The Core Metadata Schema for Learner Corpora is an extensive revision of Granger & Paquot's (2017) Core Metadata [Schema] for Learner Corpora Draft 1.0 in the field of learner corpus research. The original proposal was presented in the form of a draft at the CLARIN workshop on Interoperability of Second Language Resources and Tools (University of Gothenburg, Sweden, 6-7 December 2017, https://sweclarin.se/swe/workshop-interoperability-l2-resources-and-tools).
This document contains version 1 of the Core Metadata Schema for Learner Corpora as shared with the community in 2023-2024 to collect feedback