Institute for Computational Linguistics “A. Zampolli”

ILC4CLARIN: Linguistic Data and NLP Tool

CompL-it

Author: Sciolette Flavia
Giovannetti Emiliano
Marchi Simone
Bellandi Andrea
Publication venue: Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR)
Publication date: 01/01/2024

CompL-it is a computational lexicon for Italian derived from LexicO (https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-977), integrating the following resources:
- M-GLF (https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-1002), a list of lemmatized forms generated by the morphological analyzer MAGIC (Battista and Pirrelli, 1999; Pirrelli and Battista, 2000);
- a set of treebanks for Italian (contained in https://lindat.cz/repository/xmlui/handle/11234/1-4611): ISDT, VIT, ParTUT, and ParlaMint-it.
The resource contains a morphological layer (lemmas, inflected forms, and morphological features) and a semantic layer (senses and the relations between them). Entries are encoded according to the OntoLex-Lemon model and made available as a semantic repository.
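To make the two layers concrete, here is a minimal sketch of how one entry could be expressed as OntoLex-Lemon triples. The class and property names (`LexicalEntry`, `Form`, `LexicalSense`, `canonicalForm`, `otherForm`, `writtenRep`, `sense`) come from the OntoLex-Lemon vocabulary; the base IRI, the IRI naming scheme, and the example word are hypothetical and are not taken from CompL-it.

```python
# Minimal sketch: one lexical entry as OntoLex-Lemon (subject, predicate, object) triples.
# BASE, the IRI naming scheme, and the example entry are hypothetical.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # hypothetical namespace


def entry_triples(lemma, other_forms, n_senses):
    """Return the triples describing one entry, its forms, and its senses."""
    entry = f"{BASE}{lemma}_entry"
    canonical = f"{BASE}{lemma}_form"
    triples = [
        (entry, RDF_TYPE, ONTOLEX + "LexicalEntry"),
        (entry, ONTOLEX + "canonicalForm", canonical),
        (canonical, RDF_TYPE, ONTOLEX + "Form"),
        (canonical, ONTOLEX + "writtenRep", f'"{lemma}"@it'),
    ]
    # Morphological layer: inflected forms are attached with ontolex:otherForm.
    for i, written in enumerate(other_forms, start=1):
        form = f"{BASE}{lemma}_form_{i}"
        triples.append((entry, ONTOLEX + "otherForm", form))
        triples.append((form, RDF_TYPE, ONTOLEX + "Form"))
        triples.append((form, ONTOLEX + "writtenRep", f'"{written}"@it'))
    # Semantic layer: each sense is an ontolex:LexicalSense linked from the entry.
    for i in range(1, n_senses + 1):
        sense = f"{BASE}{lemma}_sense_{i}"
        triples.append((entry, ONTOLEX + "sense", sense))
        triples.append((sense, RDF_TYPE, ONTOLEX + "LexicalSense"))
    return triples


triples = entry_triples("cane", ["cani"], n_senses=1)
```

Relations between senses, which the description also mentions, would add further triples between the sense IRIs; they are left out here because the source does not say which properties the resource uses for them.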

PAROLE reference corpus

Author: Marinelli Rita
Biagini Lisa
Bindi Remo
Goggi Sara
Monachini Monica
Orsolini Paola
Picchi Eugenio
Rossi Sergio
Calzolari Nicoletta
Zampolli Antonio
Publication venue: Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR)
Publication date: 20/03/2024

The PAROLE project (Preparatory Action for Linguistic Resources Organization for Language Engineering) produced a set of harmonized corpora and lexicons for a large number of European languages. Each corpus, made up of 20 million words, was built as a reference corpus for Human Language Technology applications: to provide full information about a large variety of text types in the language considered, to represent the use of contemporary language, and to become the first nucleus of an electronic text library. The texts are stored in a common format following the standards recommended in the CES (Corpus Encoding Standard), according to flexibility and multifunctionality criteria. They belong to a wide range of media and genres, selected in proportions intended to reflect their prominence within society, and are classified by medium, genre, topic and time of production.
For more information, see also:
- Goggi, Sara, Lisa Biagini, Remo Bindi, and Sergio Rossi. 1997. ‘Italian Corpus Documentation - LE-PAROLE WP2.11’, October. https://zenodo.org/records/8167985.
- Marinelli, Rita, Lisa Biagini, Remo Bindi, Sara Goggi, Monica Monachini, Paola Orsolini, Eugenio Picchi, Sergio Rossi, Nicoletta Calzolari, and A. Zampolli. 1996. ‘The Italian “Parole” Corpus : An Overview’. Linguistica Computazionale Computational Linguistics in Pisa-Special Issue I (XVI/XVII, 1996/1997): 401–21. https://doi.org/10.1400/18167. https://www.ilc.cnr.it/wp-content/uploads/2022/05/Z224.pdf
The corpus is annotated at the textual level, with some named-entity annotation. A portion of this corpus was annotated with morpho-syntactic information and is available here: Sara Goggi, Remo Bindi, Lisa Biagini and Sergio Rossi, 1997, Corpus Parole (3 million words), ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, in Pisa, http://hdl.handle.net/20.500.11752/ILC-1001

KIParla - KIPasti transcripts

Author: Mauri Caterina
Ballarè Silvia
Zucchini Eleonora
Publication venue: Alma Mater Studiorum – Università di Bologna
Publication date: 30/04/2024

The KIPasti corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The KIPasti corpus was compiled within the framework of the “DiverSIta – Diversity in spoken Italian” project, funded by the Italian Ministry of University and Research (MUR) (PRIN 2022 PNRR Call). It consists of over 40 hours of spoken data collected in thirteen Italian regions (Abruzzo, Basilicata, Calabria, Campania, Emilia-Romagna, Lazio, Lombardy, Marche, Apulia, Sardinia, Tuscany, Umbria, Veneto) during mealtime conversations, generally within family settings. The interactions, recorded between 2020 and 2024, involved 145 speakers of different origins, ages, education levels, and occupations. Italian is predominant in all interactions, but most of them (78%) also contain passages in dialect. The transcriptions have been anonymized. Overall, the module is made up of 63 conversations.
This repository contains:
- metadata for both speakers (occupation, gender, age, origin, L1, educational achievement) and conversations (collection point, year, languages used), in the metadata subfolder;
- descriptions of the set of transcription conventions used for this module;
- for each conversation: an .eaf file in the eaf/ folder (time-aligned Jefferson-style transcription); a .txt file in the linear-jefferson/ folder (linearized Jefferson-style transcription); a .txt file in the linear-orthographic/ folder (linearized transcription retaining only orthographic words); a .tsv file in the tsv/ folder (tokenised version of the transcription).
More information can be found in the README.md file. Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
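The per-conversation layout described for the KIPasti repository can be sketched as a small path helper. The folder names and file extensions come from the description; the idea that each file is named after a conversation identifier, the identifier `CONV001`, and the root folder `kipasti` are assumptions made only for illustration.

```python
from pathlib import PurePosixPath

# Folder names and extensions follow the repository description.
# Naming each file "<conversation id><extension>" is an assumption.
LAYOUT = {
    "eaf": ("eaf", ".eaf"),                                   # time-aligned transcription
    "linear_jefferson": ("linear-jefferson", ".txt"),         # linearized Jefferson-style
    "linear_orthographic": ("linear-orthographic", ".txt"),   # orthographic words only
    "tsv": ("tsv", ".tsv"),                                   # tokenised transcription
}


def conversation_files(root, conversation_id):
    """Map each transcription format to the path expected for one conversation."""
    root = PurePosixPath(root)
    return {
        key: root / folder / f"{conversation_id}{ext}"
        for key, (folder, ext) in LAYOUT.items()
    }


# Hypothetical root folder and conversation identifier.
files = conversation_files("kipasti", "CONV001")
```

The README.md mentioned in the description remains the authority on actual file names; the helper only shows how the four parallel formats relate to one conversation.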

ItAnt Faliscan Corpus

Author: Rigobianco Luca
Publication venue: Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR)
Publication date: 01/01/2024

ItAnt Faliscan Corpus is the digital corpus of new critical editions of a selection of inscriptions in the Neo-Faliscan language, produced within the PRIN 2017 project 'Lingue e culture dell'Italia antica. Linguistica storica e modelli digitali'. The inscriptions are represented in XML using the TEI/EpiDoc encoding schema and enriched with shared, standardized metadata, allowing an accurate description of each inscription as both a linguistic and a material object. The corpus also includes a facsimile reproduction of the inscriptions. The following people also contributed to drafting the records: Mariarosaria Zinzi, Greta Mozzat

Sewer Network Ontology

Author: Haydar Batoul
Chahinian Nanée
Keet Maria
Publication venue: Institut de Recherche pour le Développement
Publication date: 01/12/2024

The Sewer Network ontology (SewerNet) describes the structure of wastewater and stormwater networks, their elements, and their qualities. It also incorporates events involved in the network management process. The ontology is based on the French géostandard for drinking water supply and sanitation networks (RAEPA) v1.2 and the INSPIRE European directive. For interoperability, SewerNet is aligned with the DOLCE-lite foundational ontology and imports a few axioms from the Time ontology, providing a well-established semantic basis for modeling sewer networks.

PMLAN

Author: Gagliardi Gloria
Minori Giulia
Cuteri Vittoria
Melloni Francesca Maria
Tamburini Fabio
Malaspina Elisabetta
Gualandi Paola
Rossi Francesca
Moscano Milena
Francia Valentina
Parmeggiani Antonia
Publication venue: IRCCS Istituto delle Scienze Neurologiche di Bologna, Centro Regionale per i Disturbi della Nutrizione e dell'Alimentazione in età evolutiva, Child Neurology and Psychiatry Unit, Bologna, Italy
Publication date: 01/01/2023

This corpus consists of written texts produced by 51 adolescents (14-18 years of age): 17 girls with a clinical diagnosis of Anorexia Nervosa, and 34 normal-weight peers, matched for gender, age, educational level, and geographical origin. All participants were asked to produce three short texts (10-15 lines long). In the first task, the prompt was “How would you describe yourself? (Please, talk about your physical and personality traits, your hobbies, etc.)” (personal task). In the second, the prompt was “How do you usually spend time with your friends?” (neutral task). For the third task, participants were asked to describe a complex picture. The elicited responses were manually digitized by linguists, and the resulting corpus was subjected to PoS Tagging and Dependency Parsing (CoNLL format). Approval was granted by the Bioethics Committee of Azienda Ospedaliero-Universitaria di Bologna, Policlinico Sant’Orsola-Malpighi, Italy (prot. 683/2019/Oss/AOUBo). At the time of submission to CLARIN, this is an ongoing project. Due to the Italian privacy policy, raw data of the corpus (i.e., transcriptions and clinical information of the participants) is not available. Processed data (i.e., tables of lexical/syntactic values, with the names of the participants masked through an alphanumeric acronym to ensure anonymity) are available from the contact person upon reasonable request.
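As an illustration of what PoS-tagged and dependency-parsed text in a CoNLL format looks like, the sketch below reads tab-separated token lines into dictionaries. The description only says "CoNLL format", so the ten CoNLL-U column names used here are an assumption, and the sample sentence is invented and does not come from the corpus.

```python
# Column names follow CoNLL-U; whether PMLAN uses this exact variant is an assumption.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_conll(text):
    """Split CoNLL text into sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped in this sketch.
            continue
        current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Invented example sentence ("Leggo libri"), not taken from the corpus.
SAMPLE = "\n".join([
    "# text = Leggo libri",
    "1\tLeggo\tleggere\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tlibri\tlibro\tNOUN\t_\t_\t1\tobj\t_\t_",
    "",
])

sentences = parse_conll(SAMPLE)
```

Each token's `head` field points to the `id` of its syntactic governor (with `0` marking the root), which is the information from which tables of syntactic values can be computed.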

ItaASD: Italian speech corpus Autism Spectrum Disorder

Author: Imparato Syria Cira
Izzo Maria
Liguori Olga
Orsino Ernesto
Sensale Donata
Santarpia Tina
Perfetto Giulia
Gison Giovanna
Gagliardi Gloria
Publication venue: Centro Medico Riabilitativo – Pompei
Publication date: 01/01/2023

This is a corpus of semi-spontaneous speech produced by 34 children between 6 and 13 years of age, resident in the Campania region of Italy. Half of the participating children were diagnosed with high-functioning Autism Spectrum Disorder, and the other half were neurotypical children matched for age, gender, and geographical origin. All participants were administered three tasks: a complex image description task, a story-telling task, and a story-retelling task. This resulted in 4 hours and 19 minutes of recorded speech, which was then transcribed and annotated using ELAN. This research project was approved by the Bioethics Committee of the Alma Mater Studiorum - University of Bologna (no. 0173455/2022). Due to the Italian privacy policy, raw data of the corpus (i.e., speech recordings, transcriptions, and clinical information of the participants) is not available. Processed data (i.e., tables of acoustic/rhythmic/lexical/syntactic values, with the names of the speakers masked through an alphanumeric acronym to ensure anonymity) are available from the contact person upon reasonable request.

Pan-Latin Geothermal Energy Lexicon

Author: Zanola Maria Teresa
Publication venue: Educatt
Publication date: 18/05/2023

The Pan-Latin Geothermal Energy Lexicon (Lessico panlatino dell’energia geotermica), developed within the Realiter network, contains the basic terms related to geothermal energy in seven Romance languages (Italian, Catalan, Spanish, French, Galician, Portuguese, Romanian) and in English

ItAntDSL

Author: Boschetti Federico
Rigobianco Luca
Publication venue: Università Ca’ Foscari Venezia
Publication date: 31/08/2023

The bundle contains:
1. an ANTLR Lexer and Parser for a Domain-Specific Language named ItAntDSL, compliant with the EpiDoc conceptual model, to describe inscriptions in the languages of ancient Italy (in particular Venetic and Faliscan);
2. a Visitor to convert ItAntDSL to XML-ItAnt.
The development of XSL(T) stylesheets to convert XML-ItAnt to XML-TEI/EpiDoc is in progress.

Pan-Latin Lexicon of Collars and Sleeves in Fashion and Costume

Author: Zanola Maria Teresa
Dankova Klara
Grimaldi Claudio
Serpente Anna
Publication venue: Educatt
Publication date: 17/03/2023

The Pan-Latin Lexicon of Collars and Sleeves in Fashion and Costume, developed within the Pan-Latin Terminology Network (REALITER), aims at collecting the main terms designating collars and sleeves in fashion and costume. It proposes a semiotic reference for a common referent, in order to try to establish terminological equivalences in this very technical and specialised field, characterised by several cultural traditions. The Lexicon intends to give a multilingual (Italian, Catalan, Spanish, French, Portuguese, English) terminological description in this field, in order to provide a useful reference for those interested in this sector, those who study, translate, write and work on fashion and costume. In the case of the Spanish language, the equivalents in the Spanish of Spain, Argentina and Mexico are provided. For the Portuguese language, the Brazilian Portuguese equivalents are also given
