Eurac Research CLARIN Centre Repository


43 research outputs found


ACTER (Annotated Corpora for Term Extraction Research) v1.4

Author: Rigouts Terryn Ayla
Publication venue: LT3 Language and Translation Technology Team
Publication date: 15/07/2020

ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised comparable corpora covering 3 languages (English, French, and Dutch) and 4 domains (corruption, dressage, heart failure, and wind energy).

ACTER (Annotated Corpora for Term Extraction Research) v1.3

Author: Rigouts Terryn Ayla
Publication venue: LT3 Language and Translation Technology Team
Publication date: 17/12/2019

DIDI - The DiDi Corpus of South Tyrolean CMC 1.0.0

Author: Frey Jennifer-Carmen
Glaznieks Aivars
Stemle Egon W.
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 07/03/2019

The DiDi corpus has an overall size of around 600,000 tokens gathered from 136 South Tyrolean Facebook users who participated in the DiDi project. It consists of 11,102 Facebook wall posts, 6,507 wall comments and 22,218 private messages. All messages were written by the participants during 2013. Please read the full description of the corpus for further details, as well as the description of the data collection method and the full description of the DiDi project and its research questions. Because every participant could offer their private messages, their wall texts, or both, the corpus comprises wall posts and wall comments from 130 profiles and private messages from 56 profiles; 50 participants granted access to both types of data. The wall posts and comments are freely accessible. Due to privacy concerns, access to the private messages is restricted and can be granted for scientific research only, after signing a non-disclosure agreement. If you are interested in the data for scientific purposes, please contact the research team. All texts were anonymised to guarantee that the participants' identity cannot be inferred from the texts. The anonymisation covered person names, group names, geographical names and adjectival references, institution names, hyperlinks, mail addresses, phone numbers, bank account numbers, servers, postal codes and other private information. Please read the anonymisation document for the anonymisation keys. The corpus offers a wide range of research opportunities for linguists who are interested in CMC in general, and more specifically in multilingual language use, the use of regional varieties, and code switching, code shifting and code mixing phenomena, among others. Access to the DiDi corpus: https://commul.eurac.edu/annis/did

MERLIN Written Learner Corpus for Czech, German, Italian 1.1

Author: Wisniewski Katrin
Abel Andrea
Vodičková Kateřina
Plassmann Sybille
Meurers Detmar
Woldt Claudia
Schöne Karin
Blaschitz Verena
Lyding Verena
Nicolas Lionel
Vettori Chiara
Pečený Pavel
Hana Jirka
Čurdová Veronika
Štindlová Barbora
Klein Gudrun
Lauppe Louise
Boyd Adriane
Bykh Serhiy
Krivanek Julia
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 24/08/2018

The MERLIN corpus is a written learner corpus for Czech, German, and Italian, designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.

Core Metadata [Schema] for Learner Corpora Draft 1.0

Author: Granger Sylviane
Paquot Magali
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 15/12/2017

First proposal towards a "Core Metadata [Schema] for Learner Corpora", presented at the "CLARIN workshop on Interoperability of Second Language Resources and Tools", Gothenburg, Sweden, 06-08/12/2017. It was circulated as part of the invited talk "Towards standardization of metadata for L2 corpora", which took stock of a range of metadata sets and made suggestions for minimal and maximal design principles, but it was never published, either on its own or as part of a publication.

MERLIN Written Learner Corpus for Czech, German, Italian 1.0

Author: Wisniewski Katrin
Abel Andrea
Vodičková Kateřina
Plassmann Sybille
Meurers Detmar
Woldt Claudia
Schöne Karin
Blaschitz Verena
Lyding Verena
Nicolas Lionel
Vettori Chiara
Pečený Pavel
Hana Jirka
Čurdová Veronika
Štindlová Barbora
Klein Gudrun
Lauppe Louise
Boyd Adriane
Bykh Serhiy
Krivanek Julia
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 2014

KoKo German L1 Learner Corpus v3

Author: Abel Andrea
Glaznieks Aivars
Culy Chris
Nicolas Lionel
Stemle Egon W.
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 2014

The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It was created to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms. The corpus consists of 1503 argumentative essays with manually performed transcription annotations and linguistic error annotations. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections. Error annotations relate to the orthographic level (including punctuation errors), and a selection of the texts (n=597) also contains error annotations on the grammatical level. The corpus building process was guided by two goals: 1. describe writing skills at the transition from secondary school to university, 2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.). The pupils were selected from three different German-speaking areas: North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany). Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%). Person-related metadata provides information about the writer's L1, the writer's gender, the type of school the essay comes from, the location of the school the essay comes from, and the grade attended at data collection. In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatization.
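
The stratified class sampling described above (city size crossed with school type) can be illustrated with a short Python sketch. This is not the KoKo project's procedure: the field names `city_size` and `school_type`, the fixed number of classes drawn per stratum, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(classes, per_stratum, seed=0):
    """Draw up to `per_stratum` classes at random from each stratum.

    A stratum is the combination of city size and school type.
    `classes` is a list of dicts with the hypothetical keys
    "city_size" and "school_type".
    """
    rng = random.Random(seed)

    # Group the classes by stratum.
    strata = defaultdict(list)
    for school_class in classes:
        key = (school_class["city_size"], school_class["school_type"])
        strata[key].append(school_class)

    # Sample independently inside every stratum, in a fixed order
    # so that the result is reproducible for a given seed.
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Sampling inside each stratum separately keeps every combination of city size and school type represented, which simple random sampling over all classes would not guarantee.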

PAISÀ Corpus of Italian Web Text

Author: Lyding Verena
Stemle Egon
Borghetti Claudia
Brunello Marco
Castagnoli Sara
Dell’Orletta Felice
Dittmann Henrik
Lenci Alessandro
Pirrelli Vito
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 2013

The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It was created in the context of the project PAISÀ. Documents were selected in two different ways. A part of the corpus was constructed using a method inspired by the WaCky project. We created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list, and used the pairs as queries to the Yahoo! search engine in order to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of type: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, and CC-Attribution-Non-commercial. Pages that were wrongly tagged as CC-licensed were eliminated using a black-list that was populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system. The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia Foundation dumps were used, extracting text with Wikipedia Extractor. Once all materials were downloaded, the collection was filtered by discarding empty documents and documents containing fewer than 150 words. The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, and approximately 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
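
Two of the steps described for the web-crawled part, building random word-pair queries and dropping documents below 150 words, can be sketched as follows. This is a minimal Python illustration, not the PAISÀ project's code: the function names, the seed, the de-duplication of pairs, and counting words by whitespace splitting are all assumptions made for the example.

```python
import random


def make_query_pairs(vocabulary, n_pairs, seed=0):
    """Build `n_pairs` distinct two-word queries from a vocabulary list.

    Assumption: pairs are unordered and must not repeat; the source
    only says terms were combined randomly.
    """
    words = list(dict.fromkeys(vocabulary))  # drop duplicate entries
    max_pairs = len(words) * (len(words) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("vocabulary too small for the requested number of pairs")

    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        first, second = rng.sample(words, 2)  # two different words
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


def keep_document(text, min_words=150):
    """Return True for documents with at least `min_words` words.

    Empty documents and documents below the threshold are discarded.
    Words are counted by whitespace splitting (an assumption).
    """
    return len(text.split()) >= min_words
```

With an Italian basic vocabulary list loaded into `vocabulary`, calling `make_query_pairs(vocabulary, 50_000)` would yield the query set, and `keep_document` could be applied to each cleaned page before it enters the collection.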

KoKo German L1 Learner Corpus v2

Author: Abel Andrea
Glaznieks Aivars
Culy Chris
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 2012

The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It was created to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms. The corpus consists of 1503 argumentative essays with manually performed transcription annotations and linguistic error annotations. Error annotation relates to the orthographic level only. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections. The corpus building process was guided by two goals: 1. describe writing skills at the transition from secondary school to university, 2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.). The pupils were selected from three different German-speaking areas: North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany). Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%). Person-related metadata provides information about the writer's L1, the writer's gender, the type of school the essay comes from, the location of the school the essay comes from, and the grade attended at data collection. In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatization.

KoKo German L1 Learner Corpus v1

Author: Abel Andrea
Glaznieks Aivars
Culy Chris
Publication venue: Institute for Applied Linguistics, Eurac Research
Publication date: 2012

The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It was created to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms. The corpus building process was guided by two goals: 1. describe writing skills at the transition from secondary school to university, 2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.). The pupils were selected from three different German-speaking areas: North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany). Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%). Person-related metadata provides information about the writer's L1, the writer's gender, the type of school the essay comes from, the location of the school the essay comes from, and the grade attended at data collection.

