Eurac Research CLARIN Centre Repository

43 research outputs found

HELLO CAMPANIA! Ghana Collection

Author: Di Salvo Margherita
Cataldo Violetta
Maffia Marta
Asienda Hannaora Marlene
Publication venue: Di Salvo Margherita
Publication date: 03/12/2024

The HELLO CAMPANIA! Ghana collection contains 12 sociolinguistic interviews conducted with 4 first-generation and 8 second-generation migrants living in Naples. It also contains 9 language portraits.

ITACA Corpus - Coherence in Italian Argumentative Essays v1.0

Author: Bienati Arianna
Frey Jennifer-Carmen
Zanasi Lorenzo
Stemle Egon
Brasolin Paolo
Vettori Chiara
Publication venue: Eurac Research
Publication date: 28/02/2024

The ITACA Corpus is a corpus of argumentative essays written in Italian by upper secondary school students from South Tyrol. It was created to investigate and describe the students' textual competences, with a special focus on text coherence. The corpus consists of 635 texts collected during the 2021/2022 school year in schools with Italian as the language of instruction. The whole corpus has been automatically tokenized, lemmatized, and annotated for part-of-speech and dependency relations. A subset of 388 texts additionally contains annotations of textual features such as punctuation, connectives, agreement, anaphora, argumentative structure, off-topic passages and contradictions. The corpus also provides metadata on the students' age, gender, language background, reading and writing habits, and performance in a standardized language test, as well as holistic and analytic coherence evaluations for each text.

HELLO CAMPANIA! Sri Lanka Collection

Author: Di Salvo Margherita
Cataldo Violetta
Maffia Marta
Noschese Maria Paola
Publication venue: University Federico II, Naples
Publication date: 27/11/2024

The corpus consists of 48 audio files totalling 20:38 of recordings (public) and their corresponding transcriptions in ELAN (available upon request). The collection also includes 15 language portraits. It is organized in four bundles:
- 1G_audio: all the audio files collected with 1st generation migrants (30 files)
- 1G_portrait: the language portraits collected from 1st generation migrants (13 files)
- 2G_audio: all the audio files collected with 2nd generation migrants (18 files)
- 2G_portrait: the language portraits collected from 2nd generation migrants (2 files)
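The bundle sizes listed for this collection can be cross-checked against the stated totals. The following is a minimal sketch in Python; the dictionary layout and the variable names are illustrative assumptions, and only the counts come from the record itself.

```python
# File counts per bundle, as stated in the collection description.
# The dict structure is an illustrative assumption, not part of the repository.
bundles = {
    "1G_audio": 30,
    "1G_portrait": 13,
    "2G_audio": 18,
    "2G_portrait": 2,
}

# Sum audio bundles and portrait bundles separately.
audio_total = sum(n for name, n in bundles.items() if name.endswith("_audio"))
portrait_total = sum(n for name, n in bundles.items() if name.endswith("_portrait"))

print(audio_total, portrait_total)
```

The two sums agree with the 48 audio files and 15 language portraits reported in the description.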

HELLO CAMPANIA! Bangladesh Collection

Author: Di Salvo Margherita
Cataldo Violetta
Noschese Maria Paola
Publication venue: Margherita Di Salvo
Publication date: 28/11/2024

The collection contains 11 interviews with 1st generation Bangladeshi migrants in Naples. It also contains language portraits of the migrants.

VinKo (Varieties in Contact) Corpus v1.2

Author: Rabanus Stefan
Kruijt Anne
Tagliani Marta
Tomaselli Alessandra
Padovan Andrea
Alber Birgit
Cordin Patrizia
Zamparelli Roberto
Vogt Barbara Maria
Publication venue: University of Verona
Publication date: 01/01/2023

VinKo is a spoken corpus based on crowd-sourced audio recordings, designed to provide relevant linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. The corpus contains audio recordings of local languages and varieties spoken in the regions of Trentino-Alto Adige/Südtirol, Veneto, and Friuli-Venezia Giulia, with a particular focus on the so-called 'language contact' between Germanic varieties (Cimbrian, Mòcheno, Tyrolean, Saurano, and Sappadino) and Romance varieties (Ladin, Trentino and Veneto dialects). The data collection took place from June 2017 to May 2023.

MT@BZ translation corpus v1.0

Author: De Camillis Flavia
Chiocchetti Elena
Stemle Egon W.
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 13/06/2023

The MT@BZ is a translation corpus that consists of 52 decrees published by the Autonomous Province of Bolzano (South Tyrol) aligned with their machine-translated versions. More precisely, it consists of 26 decrees in German and the same 26 in Italian in their official versions, machine translated by the project team into Italian and into German, respectively. Ten of them are COVID-19-related decrees, while 16 are miscellaneous. Overall, they consist of around 130,000 words. Their machine translation was carried out with a customized version of ModernMT. The corpus was then uploaded to the annotation platform WebAnno and later transferred to INCEpTION. Four annotators annotated the translation errors made by the machine according to an ad hoc error taxonomy for quality assessment. Finally, the annotations were curated to create a gold standard corpus.

Kolipsi-1 Corpus v1.1

Author: Glaznieks Aivars
Frey Jennifer-Carmen
Abel Andrea
Vettori Chiara
Nicolas Lionel
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 15/02/2023

The Kolipsi-1 L2 is a written learner corpus of German and Italian L2 speakers from South Tyrol (Italy). It was developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus. The data collection took place in autumn 2007 and is based on two standardized tests of written production. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre). Both tasks had a time limit of 30 minutes, and no additional reference material was allowed. CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, lexis, grammar and sociolinguistic appropriateness. Person-related metadata provides information about:
- the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation
- the writer's age, gender and socio-economic status
- the writer's district of residence and whether they live in an urban or rural environment
- the language, location and type of school the writer attended
- whether the writer passed the local bilinguality exam or not
- an anonymous identifier for the writer's school class and L2 teacher to account for class effects
All texts have been transcribed manually, with transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level. In addition, all texts were automatically annotated with tokenisation, sentence splitting, POS-tagging and lemmatization, using an orthographically corrected target version of the corpus.
Kolipsi-1 L2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results by commenting on crucial sociolinguistic and psychosocial aspects that influence them. The results of the follow-up study should be compared to the results of the original KOLIPSI project.

Beldeko Summary Corpus v1.1.0

Author: Strobl Carola
Wedig Helena
Publication venue: University of Antwerp
Publication date: 01/03/2023

The Beldeko (Belgisches Deutschkorpus) Summary Corpus is a learner corpus that consists of summaries written by advanced L2 German learners (CEF level B2-C1) with L1 Dutch. It was created to investigate the academic writing skills in L2 German of third-year students of two bachelor programmes, in Applied Linguistics and in Linguistics and Literature, respectively. The corpus consists of 301 summaries (70774 tokens) written by 115 students of three intact classes (convenience sampling). The texts were collected at Ghent University (in 2013 and in 2014) and University College of Ghent (in 2013) as pre- and posttests of an intervention study on collaborative writing carried out by Carola Strobl in the context of her PhD research (Strobl, C. (2015). Affordances of online technologies for academic writing instruction in a foreign language. Ghent University, unpublished doctoral dissertation). 82 students produced three summaries each (pretest, posttest immediately after the three-week intervention, delayed posttest six weeks after the intervention) and 33 students produced two summaries each (pretest and posttest). In both groups, missing data are indicated as n.a. in the metadata file. The metadata file (Beldeko_Summary_1.1.0_metadata.xlsx) provides information about:
• Institution of data collection (HG = University College of Ghent, UG = Ghent University)
• Year of data collection (2013, 2014)
• Participants' gender (f, m)
• Number of texts written and number of tokens in each text (T1, T2, T3)
The individual file names of the corpus reveal institution, year, unique ID of participant (per institution per year), and text number, in that order. The summaries contain between 37 and 330 words each, with a mean of 230 words (the targeted word count was between 220 and 250 words).
Outliers regarding text length were unfinished texts produced by students who struggled with the time restriction of 60 minutes. The texts were written in class, on computers. Students were allowed to use online aids such as dictionaries. Each task consisted of summarizing two texts (fragments of newspaper articles, interviews or websites) about a topic related to language variation in German (Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch, Varianten-Wörterbuch des Deutschen; see also the Word files provided in the metadata). More specifically, the topics were distributed as follows:
• Kiezdeutsch: HG_2013_T1, UG_2013_T1, UG_2014_T1
• Mundartdebatte in der Schweiz: HG_2013_T2, UG_2013_T2, UG_2014_T2
• Viadrinisch: HG_2013_T3
• Varianten-Wörterbuch des Deutschen: UG_2014_T3
The new version of the corpus (Beldeko 1.1.0) contains the manual annotations of the texts with token id, sentence id, source text form, target form, POS (STTS) and a simple UPOS part-of-speech tag.
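The file naming scheme and the topic distribution described for the Beldeko corpus lend themselves to a small lookup. The following is a minimal Python sketch under stated assumptions: the record gives only the order of the name components (institution, year, participant ID, text number), so the underscore separator, the alphanumeric participant ID and the example name `HG_2013_07_T3` are hypothetical; the topic table is taken from the description above.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: components are joined by underscores, e.g. "HG_2013_07_T3".
# The record only states the component order, not the exact file name format.
FILENAME_RE = re.compile(r"^(HG|UG)_(2013|2014)_([A-Za-z0-9]+)_(T[123])$")

# Topic per (institution, year, text number), as listed in the description.
TOPICS = {
    ("HG", 2013, "T1"): "Kiezdeutsch",
    ("UG", 2013, "T1"): "Kiezdeutsch",
    ("UG", 2014, "T1"): "Kiezdeutsch",
    ("HG", 2013, "T2"): "Mundartdebatte in der Schweiz",
    ("UG", 2013, "T2"): "Mundartdebatte in der Schweiz",
    ("UG", 2014, "T2"): "Mundartdebatte in der Schweiz",
    ("HG", 2013, "T3"): "Viadrinisch",
    ("UG", 2014, "T3"): "Varianten-Wörterbuch des Deutschen",
}


class SummaryFile(NamedTuple):
    institution: str     # "HG" or "UG"
    year: int            # 2013 or 2014
    participant_id: str  # unique per institution per year
    text_number: str     # "T1", "T2" or "T3"


def parse_name(stem: str) -> SummaryFile:
    """Split a file name (without extension) into its four components."""
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {stem!r}")
    institution, year, participant_id, text_number = match.groups()
    return SummaryFile(institution, int(year), participant_id, text_number)


def topic_for(f: SummaryFile) -> Optional[str]:
    """Return the summary topic, or None if the combination is not listed."""
    return TOPICS.get((f.institution, f.year, f.text_number))


example = parse_name("HG_2013_07_T3")  # hypothetical file name
print(example, topic_for(example))
```

If the actual file names use a different separator or ID format, only `FILENAME_RE` would need adjusting; the topic table is independent of that choice.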

VinKo (Varieties in Contact) Corpus v1.1

Author: Rabanus Stefan
Kruijt Anne
Tagliani Marta
Tomaselli Alessandra
Padovan Andrea
Alber Birgit
Cordin Patrizia
Zamparelli Roberto
Vogt Barbara Maria
Publication venue: University of Verona
Publication date: 01/01/2022

Code preference in OLL of accommodation in Palma

Author: Bruyèl-Olmedo Antonio
Publication venue: Escuela Universitaria de Turismo 'Felipe Moreno' (appointed to Universitat de les Illes Balears)
Publication date: 12/01/2022

The file consists of a database in .SAV format (SPSS) of language choice and preference as reflected in the websites of accommodation establishments in the city of Palma de Mallorca (Spain). The database comprises identifying data for all 245 establishments as well as multilingualism information on code choice and preference. The main variables considered are: Post code, Accommodation type, Ownership, Name, Rating, presence of Catalan, L1, L2, L3, L4, L5, L6, Ln, type of multiwriting and Type of Multilingualism. Code preference includes positions from L1 through L6.

Full texts: 1
Metadata records: 43

Updated in last 30 days.
