Eurac Research CLARIN Centre Repository
    43 research outputs found

    HELLO CAMPANIA! Ghana Collection

    No full text
    The HELLO CAMPANIA! Ghana collection contains 12 sociolinguistic interviews conducted with 4 first-generation and 8 second-generation migrants living in Naples. It also contains 9 language portraits.

    ITACA Corpus - Coherence in Italian Argumentative Essays v1.0

    No full text
    The ITACA Corpus is a corpus of argumentative essays written in Italian by upper secondary school students from South Tyrol. It was created to investigate and describe the students’ textual competences, with a special focus on text coherence. The ITACA corpus consists of 635 texts collected during the school year 2021/2022 in schools with Italian as a language of instruction. The whole corpus has been automatically tokenized, lemmatized, and annotated for part-of-speech and dependency relations. A subset of 388 texts additionally contains annotations of textual features, such as punctuation, connectives, agreement, anaphora, argumentative structure, off-topic passages and contradictions. The corpus also provides metadata on the students’ age, gender, language background, reading and writing habits, and performance in a standardized language test, as well as holistic and analytic coherence evaluations for each text.

    HELLO CAMPANIA! Sri Lanka Collection

    No full text
    The corpus consists of 48 audio files for a total of 20:38 of recordings (public) and their corresponding transcriptions in ELAN (available upon request). This collection includes 15 language portraits. The collection is organized in four bundles:
    - 1G_audio: contains all the audio files collected with 1st generation migrants (30 files)
    - 1G_portrait: contains the language portraits collected with 1st generation migrants (13 files)
    - 2G_audio: contains all the audio files collected with 2nd generation migrants (18 files)
    - 2G_portrait: contains the language portraits collected with 2nd generation migrants (2 files)

    HELLO CAMPANIA! Bangladesh Collection

    No full text
    The collection contains 11 interviews with 1st generation Bangladeshi migrants in Naples. It also contains language portraits of the migrants.

    VinKo (Varieties in Contact) Corpus v1.2

    No full text
    VinKo is a spoken corpus based on crowd-sourced audio recordings, designed to provide relevant linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. The corpus contains audio recordings of local languages and varieties spoken in the regions Trentino-Alto Adige/Südtirol, Veneto, and Friuli-Venezia Giulia, with particular focus on the so-called 'language contact' between Germanic varieties (Cimbrian, Mòcheno, Tyrolean, Saurano, and Sappadino) and Romance varieties (Ladin, Trentino and Veneto dialects). The data collection took place from June 2017 to May 2023.

    MT@BZ translation corpus v1.0

    No full text
    MT@BZ is a translation corpus that consists of 52 decrees published by the Autonomous Province of Bolzano (South Tyrol), aligned with their machine-translated versions. More precisely, it consists of 26 decrees in German and the same 26 in Italian in their official versions, each machine-translated by the project team into the other language (German into Italian, Italian into German). Of the 26 decrees, 10 are COVID-19-related, while 16 are miscellaneous. Overall, they consist of around 130,000 words. The machine translation was carried out with a customized version of ModernMT. The corpus was then uploaded to the annotation platform WebAnno and later transferred to INCEpTION. Four annotators annotated the translation errors made by the machine according to an ad hoc error taxonomy for quality assessment. Finally, the annotations were curated to create a gold standard corpus.

    Kolipsi-1 Corpus v1.1

    No full text
    The Kolipsi-1 L2 is a written learner corpus of German and Italian L2 speakers originating from South Tyrol (Italy). It has been developed as a by-product of the KOLIPSI project “South-Tyrolean pupils and the second language: a linguistic and socio-psychological investigation”. In addition, data from L1 pupils were collected exclusively for the creation of a native speaker reference corpus. The data collection took place in autumn 2007 and is based on two standardized tests for written productions. The two tasks consisted of (1) writing an e-mail to a friend retelling a given event at the supermarket based on a picture story (narrative text genre) and (2) writing a letter to a friend discussing holiday plans (argumentative text genre). For both tasks a time limit of 30 minutes was fixed and no additional reference material was allowed. CEFR levels have been assigned to all L2 learner texts, providing a holistic score as well as evaluations of coherence, lexis, grammar and sociolinguistic appropriateness. Person-related metadata provides information about:
    - the writer's language background, including L1(s), the L1(s) of mother and father, and a self-declared language group affiliation
    - the writer's age, gender and socio-economic status
    - the writer's district of residence and whether they live in an urban or rural environment
    - the language, location and type of school the writer attended
    - whether the writer passed the local bilinguality exam or not
    - an anonymous identifier for the writer's school class and L2 teacher to account for class effects
    All texts have been transcribed manually, adding transcription annotations that reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level. In addition, all texts were automatically annotated with tokenisation, sentence splitting, POS-tagging and lemmatization, using an orthographically corrected target version of the corpus.
    Kolipsi-1 L2 belongs to the Kolipsi Corpus Family, a series of related learner corpora collected in South Tyrolean upper secondary schools. The corpora of the Kolipsi Corpus Family contain Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of both corpus studies was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (aged 16 to 18), and to contextualize the results by commenting on the crucial sociolinguistic and psychosocial aspects that influence them. The results of the follow-up study should be compared to the results of the original KOLIPSI project.

    Beldeko Summary Corpus v1.1.0

    No full text
    The Beldeko (Belgisches Deutschkorpus) Summary Corpus is a learner corpus that consists of summaries written by advanced L2 German learners (CEF level B2-C1) with L1 Dutch. It has been created with the aim of investigating the academic writing skills in L2 German of third-year students of two bachelor programmes in Applied Linguistics and Linguistics and Literature, respectively. The corpus consists of 301 summaries (70,774 tokens) written by 115 students of three intact classes (convenience sampling). The texts were collected at Ghent University (in 2013 and in 2014) and University College of Ghent (in 2013) as pre- and posttests of an intervention study on collaborative writing carried out by Carola Strobl in the context of her PhD research (Strobl, C. (2015). Affordances of online technologies for academic writing instruction in a foreign language. Ghent University, unpublished doctoral dissertation). 82 students produced three summaries each (pretest, posttest immediately after the three-week intervention, delayed posttest six weeks after the intervention) and 33 students produced two summaries each (pretest and posttest). In both groups, missing data are indicated as n.a. in the metadata file. The metadata file (Beldeko_Summary_1.1.0_metadata.xlsx) provides information about:
    • Institution of data collection (HG = University College of Ghent, UG = Ghent University)
    • Year of data collection (2013, 2014)
    • Participants' gender (f, m)
    • Number of texts written and number of tokens in each text (T1, T2, T3)
    The individual file names of the corpus reveal institution, year, unique ID of participant (per institution per year) and text number, in the given order. The summaries contain between 37 and 330 words each, with a mean of 230 words (the targeted word count was between 220 and 250 words). Outliers regarding text length were unfinished texts produced by students who struggled with the time restriction of 60 minutes. The texts were written in class, on computers. Students were allowed to use online aids such as dictionaries. Each task consisted of summarizing two texts (fragments of newspaper articles, interviews or websites) about a topic related to language variation in German (Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch, Varianten-Wörterbuch des Deutschen; see also the Word files provided in the metadata). More specifically, the topics were distributed as follows:
    • Kiezdeutsch: HG_2013_T1, UG_2013_T1, UG_2014_T1
    • Mundartdebatte in der Schweiz: HG_2013_T2, UG_2013_T2, UG_2014_T2
    • Viadrinisch: HG_2013_T3
    • Varianten-Wörterbuch des Deutschen: UG_2014_T3
    The new version of the corpus (Beldeko 1.1.0) contains the manual annotations of the texts with token id, sentence id, source text form, target form, POS (STTS) and simple UPOS part-of-speech tag.
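    The file-name convention and topic distribution described for the Beldeko corpus could be used to look up the topic of a given summary. The following is only a minimal sketch: the underscore-separated layout (e.g. `UG_2014_012_T3.txt`), the numeric participant ID, and the names `parse_summary_filename` and `topic_for` are assumptions made for illustration, not taken from the corpus documentation. Only the institution codes, years, text numbers and topic assignments come from the description.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: file names look like <institution>_<year>_<participant id>_<text number>,
# optionally followed by an extension. The real corpus may use another separator.
FILENAME_PATTERN = re.compile(
    r"^(?P<institution>HG|UG)_(?P<year>2013|2014)_(?P<participant>\d+)_(?P<text>T[123])(?:\.\w+)?$"
)


class SummaryFile(NamedTuple):
    institution: str  # HG = University College of Ghent, UG = Ghent University
    year: int         # year of data collection
    participant: str  # ID unique per institution per year
    text: str         # T1, T2 or T3


# Topic distribution as listed in the corpus description.
TOPICS = {
    ("HG", 2013, "T1"): "Kiezdeutsch",
    ("UG", 2013, "T1"): "Kiezdeutsch",
    ("UG", 2014, "T1"): "Kiezdeutsch",
    ("HG", 2013, "T2"): "Mundartdebatte in der Schweiz",
    ("UG", 2013, "T2"): "Mundartdebatte in der Schweiz",
    ("UG", 2014, "T2"): "Mundartdebatte in der Schweiz",
    ("HG", 2013, "T3"): "Viadrinisch",
    ("UG", 2014, "T3"): "Varianten-Wörterbuch des Deutschen",
}


def parse_summary_filename(name: str) -> Optional[SummaryFile]:
    """Split a file name into its parts; return None if it does not fit the assumed layout."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return SummaryFile(
        match["institution"], int(match["year"]), match["participant"], match["text"]
    )


def topic_for(summary: SummaryFile) -> Optional[str]:
    """Return the topic for a summary, or None if the description lists none."""
    return TOPICS.get((summary.institution, summary.year, summary.text))
```

    Returning `None` for names that do not match, or for combinations with no listed topic, keeps the sketch from guessing at anything the description leaves open.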

    VinKo (Varieties in Contact) Corpus v1.1

    No full text
    VinKo is a spoken corpus based on crowd-sourced audio recordings, designed to provide relevant linguistic information about the minority languages and dialects spoken in the area between Innsbruck and the Po Valley. The corpus contains audio recordings of local languages and varieties spoken in the regions Trentino-Alto Adige/Südtirol, Veneto, and Friuli-Venezia Giulia, with particular focus on the so-called 'language contact' between Germanic varieties (Cimbrian, Mòcheno, Tyrolean, Saurano, and Sappadino) and Romance varieties (Ladin, Trentino and Veneto dialects). The data collection took place from June 2017 to December 2021.

    Code preference in OLL of accommodation in Palma

    No full text
    The file consists of a database in .SAV format (SPSS) of language choice and preference as reflected in the websites of accommodation establishments in the city of Palma de Mallorca (Spain). The database comprises identifying data for all 245 establishments as well as multilingualism information on code choice and preference. The main variables considered are: Post code, Accommodation type, Ownership, Name, Rating, presence of Catalan, L1, L2, L3, L4, L5, L6, Ln, type of multiwriting and Type of Multilingualism. Code preference includes positions from L1 through L6.

    1 full text
    43 metadata records
    Updated in last 30 days.