Institute for Computational Linguistics “A. Zampolli”
ILC4CLARIN: Linguistic Data and NLP Tool
955 research outputs found
CompL-it
CompL-it is a computational lexicon for Italian derived from LexicO (https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-977) and integrated with the following resources:
- M-GLF (https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-1002), a list of lemmatized forms generated by the morphological analyzer MAGIC (Battista and Pirrelli 1999; Pirrelli and Battista 2000);
- a set of treebanks for Italian (contained in https://lindat.cz/repository/xmlui/handle/11234/1-4611):
- ISDT;
- VIT;
- ParTUT;
- ParlaMint-it.
The resource contains a morphological layer (lemmas, inflected forms, and morphological features) and a semantic layer (senses and the relations between them). Entries are encoded according to the OntoLex-Lemon model and made available as a semantic repository.
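As an illustration of how the two layers map onto OntoLex-Lemon, here is a minimal sketch in Python that builds the triples for one lexical entry. The class and property names (LexicalEntry, Form, LexicalSense, canonicalForm, otherForm, writtenRep, sense) come from the OntoLex-Lemon core vocabulary; the base IRI, the identifier scheme, and the sample entry are hypothetical and do not reflect the actual IRIs or structure of CompL-it. Morphological features and sense relations are omitted, since they would require further vocabularies not named in the description.

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # hypothetical base IRI, not CompL-it's


def entry_triples(entry_id, lemma, other_forms, n_senses, lang="it"):
    """Build (subject, predicate, object) triples for one lexical entry.

    IRIs are plain strings; literals are written Turtle-style, e.g. '"casa"@it'.
    """
    entry = BASE + entry_id
    triples = [(entry, RDF_TYPE, ONTOLEX + "LexicalEntry")]

    # Morphological layer: the lemma is the canonical form.
    lemma_form = f"{entry}_form_lemma"
    triples += [
        (entry, ONTOLEX + "canonicalForm", lemma_form),
        (lemma_form, RDF_TYPE, ONTOLEX + "Form"),
        (lemma_form, ONTOLEX + "writtenRep", f'"{lemma}"@{lang}'),
    ]

    # Morphological layer: every inflected form is an "other form".
    for i, written in enumerate(other_forms, start=1):
        form = f"{entry}_form_{i}"
        triples += [
            (entry, ONTOLEX + "otherForm", form),
            (form, RDF_TYPE, ONTOLEX + "Form"),
            (form, ONTOLEX + "writtenRep", f'"{written}"@{lang}'),
        ]

    # Semantic layer: one LexicalSense node per sense of the entry.
    for i in range(1, n_senses + 1):
        sense = f"{entry}_sense_{i}"
        triples += [
            (entry, ONTOLEX + "sense", sense),
            (sense, RDF_TYPE, ONTOLEX + "LexicalSense"),
        ]
    return triples


# Hypothetical entry: the noun "casa" with plural "case" and two senses.
triples = entry_triples("casa_noun", "casa", ["case"], n_senses=2)
```

Each entry is thus a small graph: one node for the entry, one per form, and one per sense, which is what allows the morphological and semantic layers to be queried together.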
PAROLE reference corpus
The PAROLE project (Preparatory Action for Linguistic Resources Organization for Language Engineering) has produced a set of harmonized corpora and lexicons for a large number of European languages. Each corpus, made up of 20 million words, was built as a reference corpus for Human Language Technology applications, to provide full information about a large variety of text types in the language considered, to represent the use of contemporary language, and to become the first nucleus of an electronic text library. The texts have been stored in a common format following the standards recommended in the CES (Corpus Encoding Standard), according to flexibility and multifunctionality criteria. The texts belong to a wide range of media and genres, selected in proportions intended to reflect their prominence within society, and are classified according to medium, genre, topic, and time of production.
For more information, see also:
Goggi, Sara, Lisa Biagini, Remo Bindi, and Sergio Rossi. 1997. ‘Italian Corpus Documentation - LE-PAROLE WP2.11’, October. https://zenodo.org/records/8167985.
Marinelli, Rita, Lisa Biagini, Remo Bindi, Sara Goggi, Monica Monachini, Paola Orsolini, Eugenio Picchi, Sergio Rossi, Nicoletta Calzolari, and A. Zampolli. 1996. ‘The Italian “Parole” Corpus : An Overview’. Linguistica Computazionale Computational Linguistics in Pisa-Special Issue I (XVI/XVII, 1996/1997): 401–21.
https://doi.org/10.1400/18167.
https://www.ilc.cnr.it/wp-content/uploads/2022/05/Z224.pdf
The corpus is annotated at the textual level, with some Named Entity annotation.
A portion of this corpus was annotated with morpho-syntactic information and is available here:
Sara Goggi, Remo Bindi, Lisa Biagini and Sergio Rossi, 1997, Corpus Parole (3 million words), ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, in Pisa, http://hdl.handle.net/20.500.11752/ILC-1001
KIParla - KIPasti transcripts
The KIPasti corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface.
The KIPasti corpus was compiled within the framework of the “DiverSIta – Diversity in spoken Italian” project, funded by the Italian Ministry of University and Research (MUR) (PRIN 2022 PNRR Call).
It consists of over 40 hours of spoken data collected in thirteen different Italian regions (Abruzzo, Basilicata, Calabria, Campania, Emilia-Romagna, Lazio, Lombardy, Marche, Apulia, Sardinia, Tuscany, Umbria, Veneto) during mealtime conversations, generally within family settings. The interactions, recorded between 2020 and 2024, involved 145 speakers with different origins, ages, education levels, and occupations. Italian is predominantly used in all interactions, but in most of them (78%), various passages in dialect are also present. The transcriptions have been anonymized. Overall, the module is made up of 63 conversations.
This repository contains:
- metadata for both speakers (occupation, gender, age, origin, L1, educational achievement) and conversations (collection point, year, languages used), in the metadata subfolder;
- a description of the transcription conventions used for this module;
- for each conversation: an .eaf file in the eaf/ folder (time-aligned Jefferson-style transcription); a .txt file in the linear-jefferson/ folder (linearized Jefferson-style transcription); a .txt file in the linear-orthographic/ folder (linearized transcription retaining only orthographic words); and a .tsv file in the tsv/ folder (tokenised version of the transcription).
More information can be found in the README.md file.
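The folder layout described above can be navigated with a short helper. The following Python sketch maps a conversation identifier to the four transcript files expected for it. The folder names and extensions are those listed in the description; the assumption that each file is named after the conversation identifier, as well as the root folder and the identifier used in the example, are hypothetical and should be checked against the README.md file.

```python
from pathlib import Path

# Sub-folder name -> file extension, as listed in the repository description.
FORMATS = {
    "eaf": ".eaf",                  # time-aligned Jefferson-style transcription
    "linear-jefferson": ".txt",     # linearized Jefferson-style transcription
    "linear-orthographic": ".txt",  # orthographic words only
    "tsv": ".tsv",                  # tokenised transcription
}


def conversation_files(root, conversation_id):
    """Return the expected path of each transcript format for one conversation.

    Assumption (not confirmed by the description): every file is named
    <conversation_id><extension> inside its format folder.
    """
    root = Path(root)
    return {
        folder: root / folder / f"{conversation_id}{ext}"
        for folder, ext in FORMATS.items()
    }


def missing_formats(root, conversation_id):
    """List the formats whose expected file does not exist on disk."""
    paths = conversation_files(root, conversation_id)
    return [folder for folder, path in paths.items() if not path.is_file()]


# Hypothetical root folder and conversation identifier.
paths = conversation_files("kipasti", "CONV01")
```

Calling `missing_formats` on each conversation identifier found in the metadata is a quick way to confirm that a local copy of the repository is complete before processing it.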
Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
ItAnt Faliscan Corpus
ItAnt Faliscan Corpus is the digital corpus of new critical editions of a selection of inscriptions in the Neo-Faliscan language, created as part of the PRIN 2017 project 'Lingue e culture dell'Italia antica. Linguistica storica e modelli digitali' (Languages and Cultures of Ancient Italy. Historical Linguistics and Digital Models). The inscriptions are represented in XML using the TEI/EpiDoc encoding schema and enriched with shared, standardized metadata, allowing an accurate description of each inscription as both a linguistic and a material object. The corpus also includes a facsimile reproduction of the inscriptions. The following people also contributed to drafting the records: Mariarosaria Zinzi, Greta Mozzat
Sewer Network Ontology
The Sewer Network ontology (SewerNet) describes the structure of wastewater and stormwater networks, their elements, and their qualities. It also incorporates the events involved in the network management process. The ontology is based on the French géostandard for drinking water supply and sanitation networks (RAEPA) v1.2 and on the European INSPIRE directive. SewerNet is aligned with the DOLCE-lite foundational ontology and imports a few axioms from the Time ontology for interoperability, providing a well-established semantic basis for modeling sewer networks.
PMLAN
This corpus consists of written texts produced by 51 adolescents (14-18 years of age): 17 girls with a clinical diagnosis of Anorexia Nervosa and 34 normal-weight peers, matched for gender, age, educational level, and geographical origin. All participants were asked to produce three short texts (10-15 lines long). In the first task, the prompt was “How would you describe yourself? (Please, talk about your physical and personality traits, your hobbies, etc.)” (personal task). In the second, the prompt was “How do you usually spend time with your friends?” (neutral task). For the third task, participants were asked to describe a complex picture. The elicited responses were manually digitized by linguists, and the resulting corpus was subjected to PoS tagging and dependency parsing (CoNLL format). Approval was granted by the Bioethics Committee of Azienda Ospedaliero-Universitaria di Bologna, Policlinico Sant’Orsola-Malpighi, Italy (prot. 683/2019/Oss/AOUBo). At the time of submission to CLARIN, this is an ongoing project. Due to the Italian privacy policy, the raw data of the corpus (i.e., transcriptions and clinical information of the participants) are not available. Processed data (i.e., tables of lexical/syntactic values, with the names of the speakers masked through an alphanumeric acronym to ensure anonymity) are available from the contact person upon reasonable request.
ItaASD: Italian speech corpus Autism Spectrum Disorder
This is a corpus of semi-spontaneous speech produced by 34 children between 6 and 13 years of age, resident in the Campania region of Italy. Half of the participating children were diagnosed with high-functioning Autism Spectrum Disorder, and the other half were neurotypical children matched for age, gender, and geographical origin. All participants were administered three tasks: a complex image description task, a story-telling task, and a story-retelling task. This resulted in 4 hours and 19 minutes of recorded speech, which was then transcribed and annotated using ELAN. This research project was approved by the Bioethics Committee of the Alma Mater Studiorum - University of Bologna (no. 0173455/2022). Due to the Italian privacy policy, the raw data of the corpus (i.e., speech recordings, transcriptions, and clinical information of the participants) are not available. Processed data (i.e., tables of acoustic/rhythmic/lexical/syntactic values, with the names of the speakers masked through an alphanumeric acronym to ensure anonymity) are available from the contact person upon reasonable request.
Pan-Latin Geothermal Energy Lexicon
The Pan-Latin Geothermal Energy Lexicon (Lessico panlatino dell’energia geotermica), developed within the Realiter network, contains the basic terms related to geothermal energy in seven Romance languages (Italian, Catalan, Spanish, French, Galician, Portuguese, Romanian) and in English.
ItAntDSL
The bundle contains:
1. ANTLR Lexer and Parser for a Domain-Specific Language named ItAntDSL, compliant with the EpiDoc conceptual model, to describe inscriptions in the languages of ancient Italy (in particular Venetic and Faliscan);
2. Visitor to convert ItAntDSL into XML-ItAnt.
The development of XSL(T) stylesheets to convert XML-ItAnt to XML-TEI/EpiDoc is in progress.
Pan-Latin Lexicon of Collars and Sleeves in Fashion and Costume
The Pan-Latin Lexicon of Collars and Sleeves in Fashion and Costume, developed within the Pan-Latin Terminology Network (REALITER), aims to collect the main terms designating collars and sleeves in fashion and costume. It proposes a semiotic reference for a common referent, in order to establish terminological equivalences in this very technical and specialised field, which is characterised by several cultural traditions. The Lexicon is intended to give a multilingual (Italian, Catalan, Spanish, French, Portuguese, English) terminological description of this field, as a useful reference for those interested in the sector and for those who study, translate, write about, and work in fashion and costume. For Spanish, the equivalents in the Spanish of Spain, Argentina, and Mexico are provided. For Portuguese, the Brazilian Portuguese equivalents are also given.