Eurac Research CLARIN Centre Repository
43 research outputs found
ACTER (Annotated Corpora for Term Extraction Research) v1.4
ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised comparable corpora covering 3 languages (English, French, and Dutch) and 4 domains (corruption, dressage, heart failure, and wind energy).
ACTER (Annotated Corpora for Term Extraction Research) v1.3
ACTER (Annotated Corpora for Term Extraction Research) is an annotated dataset for term extraction. Terms and Named Entities have been manually annotated in specialised comparable corpora covering 3 languages (English, French, and Dutch) and 4 domains (corruption, dressage, heart failure, and wind energy).
DIDI - The DiDi Corpus of South Tyrolean CMC 1.0.0
The DiDi corpus has an overall size of around 600,000 tokens gathered from 136 South Tyrolean Facebook users who participated in the DiDi project. It consists of 11,102 Facebook wall posts, 6,507 wall comments and 22,218 private messages. All messages were written by the participants during the year 2013. Please read the full description of the corpus for further details. Please also consider the description of the data collection method and the full description of the DiDi project and its research questions.
As every participant could provide private messages, wall texts, or both, the corpus comprises wall posts and wall comments from 130 profiles and private messages from 56 profiles; 50 participants granted access to both types of data. The wall posts and comments are freely accessible. For privacy reasons, access to the private messages is restricted and can be granted for scientific research only, after signing a non-disclosure agreement. If you are interested in the data for scientific purposes, please contact the research team.
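The profile counts given above are consistent with the 136 participants stated for the corpus, as a quick inclusion-exclusion check shows. This is only a sanity check on the published figures; the variable names are illustrative and not part of the corpus documentation.

```python
# Counts as stated in the corpus description.
wall_profiles = 130      # profiles contributing wall posts and wall comments
message_profiles = 56    # profiles contributing private messages
both = 50                # participants who granted access to both data types

# Inclusion-exclusion: participants who contributed at least one data type.
total_participants = wall_profiles + message_profiles - both
print(total_participants)  # 136
```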
All texts were anonymised in order to guarantee that the participants' identity cannot be inferred from the texts. The anonymisation covered person names, group names, geographical names and adjectival references, institution names, hyperlinks, mail addresses, phone numbers, bank account numbers, servers, postal codes and other private information. Please read the anonymisation document for the anonymisation keys.
The corpus offers a wide range of research opportunities for linguists who are interested in CMC in general, and more specifically in multilingual language use, the use of regional varieties, code switching, code shifting and code mixing phenomena, etc.
Access to the DiDi corpus: https://commul.eurac.edu/annis/did
MERLIN Written Learner Corpus for Czech, German, Italian 1.1
The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.
Core Metadata [Schema] for Learner Corpora Draft 1.0
First proposal towards a "Core Metadata [Schema] for Learner Corpora", presented at the "CLARIN workshop on Interoperability of Second Language Resources and Tools", Gothenburg, Sweden, 06-08/12/2017. It was circulated as part of the invited talk "Towards standardization of metadata for L2 corpora", which took stock of a range of metadata sets and made suggestions for minimal and maximal design principles, but it was never published, either on its own or as part of a publication.
MERLIN Written Learner Corpus for Czech, German, Italian 1.0
The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels.
KoKo German L1 Learner Corpus v3
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus consists of 1503 argumentative essays which contain manually performed transcription annotations and linguistic error annotations. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections. Error annotations relate to the orthographic level (including punctuation errors), and a selection of the texts (n=597) also contain error annotations on the grammatical level.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
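The stratified sampling described above can be sketched as follows. This is a minimal illustration, not the project's actual procedure: the field names `city_size` and `school_type`, the function `stratified_sample`, and the equal allocation of `per_stratum` classes to each stratum are all assumptions, since the description does not say how many classes were drawn per stratum.

```python
import random
from collections import defaultdict

def stratified_sample(classes, per_stratum, seed=0):
    """Draw up to `per_stratum` classes at random from each stratum.

    A stratum is a (city_size, school_type) combination. Equal allocation
    per stratum is an assumption made for this sketch.
    """
    rng = random.Random(seed)

    # Group the classes by stratum.
    strata = defaultdict(list)
    for school_class in classes:
        key = (school_class["city_size"], school_class["school_type"])
        strata[key].append(school_class)

    # Sample within each stratum independently.
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

With three city sizes and two school types there are six strata, so `per_stratum=2` yields at most twelve classes.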
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatisation.
PAISÀ Corpus of Italian Web Text
The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It has been created in the context of the project PAISÀ.
Documents were selected in two different ways. One part of the corpus was constructed using a method inspired by the WaCky project. We created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list, and used the pairs as queries to the Yahoo! search engine in order to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of type: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, and CC-Attribution-Non-commercial. Pages that were wrongly tagged as CC-licensed were eliminated using a blacklist that was populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system.
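The query-generation step can be illustrated with a short sketch. It is not the PAISÀ project's code: the function name `make_query_pairs` is hypothetical, and treating pairs as distinct and unordered is an assumption, since the description only says that terms were combined randomly.

```python
import random

def make_query_pairs(vocabulary, n_pairs, seed=0):
    """Return `n_pairs` distinct unordered word pairs drawn at random."""
    rng = random.Random(seed)
    words = sorted(set(vocabulary))  # drop duplicates, fix the order

    max_pairs = len(words) * (len(words) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("vocabulary too small for the requested pair count")

    pairs = set()
    while len(pairs) < n_pairs:
        first, second = rng.sample(words, 2)  # two different words
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

# Each pair would then be sent to a search engine as a two-word query.
queries = [" ".join(pair) for pair in
           make_query_pairs(["casa", "libro", "strada", "tempo"], 3)]
```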
The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia Foundation dumps were used, extracting text with Wikipedia Extractor.
Once all materials were downloaded, the collection was filtered by discarding empty documents and documents containing fewer than 150 words.
The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia and approximately 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
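A length filter of this kind could look like the sketch below. Counting words by splitting on whitespace is an assumption; the description does not say how words were counted, and the names `MIN_WORDS`, `keep_document` and `filter_collection` are illustrative.

```python
MIN_WORDS = 150  # threshold stated in the corpus description

def keep_document(text, min_words=MIN_WORDS):
    """Keep a document only if it has at least `min_words` words.

    Words are counted by whitespace splitting (an assumption), so an
    empty document has zero words and is discarded as well.
    """
    return len(text.split()) >= min_words

def filter_collection(documents):
    """Return the documents that pass the length filter, in order."""
    return [doc for doc in documents if keep_document(doc)]
```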
KoKo German L1 Learner Corpus v2
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus consists of 1503 argumentative essays which contain manually performed transcription annotations and linguistic error annotations. Error annotation relates to the orthographic level only. Transcription annotations reflect surface features of the text, such as the graphical arrangement and self-corrections.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection
In addition, the corpus is automatically annotated, including tokenisation, sentence splitting, POS-tagging and lemmatisation.
KoKo German L1 Learner Corpus v1
The KoKo Corpus is an error-annotated learner corpus of L1 German speakers. It has been created with the aim to investigate and describe the writing skills of German-speaking secondary-school pupils at the end of their school career by analysing authentic texts produced in classrooms.
The corpus building process was guided by two goals:
1. describe writing skills at the transition from secondary school to university,
2. determine external factors that may influence the distribution of writing skills, such as the region, sociolinguistic (gender, age), socio-economic, and language-related biographical factors (L1, preferred variety of German, reading and writing habits, etc.).
The pupils were selected from three different German-speaking areas:
- North Tyrol (Austria), South Tyrol (Italy), and Thuringia (Germany).
Classes were sampled randomly, using the size of the cities in which the schools were located (small vs. medium vs. big) and the type of school (providing general education vs. education specific to a particular profession) as strata for the sampling. Since data were collected during regular courses, the typical formation of secondary-school classes in the three regions is represented in the whole corpus. Most of the participants are German native speakers (n=1319, 82.7%).
Person-related metadata provides information about:
- writer's L1
- writer's gender
- type of school the essay comes from
- location of the school the essay comes from
- grade attended at data collection