Institute for Computational Linguistics “A. Zampolli”
ILC4CLARIN: Linguistic Data and NLP Tool
955 research outputs found
ItAnt Oscan Corpus
ItAnt Oscan Corpus is the digital corpus of new critical editions of a selection of inscriptions in the Oscan language, produced as part of the PRIN 2017 project 'Languages and Cultures of Ancient Italy. Historical Linguistics and Digital Models' (original title: 'Lingue e culture dell'Italia antica. Linguistica storica e modelli digitali'). The inscriptions are represented in XML using the TEI/EpiDoc encoding schema and enriched with shared, standardized metadata, allowing an accurate description of each inscription as both a linguistic and a material object. The corpus also includes a facsimile reproduction of the inscriptions. The following people contributed to drafting the records: Mariarosaria Zinzi, Marco Ammazzino, Francesco Benassai and Alessia De Maria
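Since the inscriptions are encoded in TEI/EpiDoc XML, they can be read with any namespace-aware XML parser. The following is a minimal sketch, assuming the common EpiDoc convention of placing the transcription in a `<div type="edition">` element. The sample document and the `read_epidoc` function are invented for illustration and are not taken from the corpus; actual files may be structured differently.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by EpiDoc documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document (assumption), not a real ItAnt record.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample inscription</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="edition">
        <ab><lb n="1"/>first line <lb n="2"/>second line</ab>
      </div>
    </body>
  </text>
</TEI>"""


def read_epidoc(xml_text):
    """Return (title, edition text) from a TEI/EpiDoc string."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    edition = root.find(".//tei:div[@type='edition']", TEI_NS)
    if edition is None:
        return title, ""
    # Concatenate all text nodes and collapse runs of whitespace.
    text = " ".join("".join(edition.itertext()).split())
    return title, text


print(read_epidoc(SAMPLE))
```

The same approach should apply to the other ItAnt corpora listed here, which are described as using the same encoding schema.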
Athloi: annotation of themes and motifs related to Iliad 23 and Odyssey 8
Annotation of themes and motifs related to Iliad 23 and Odyssey 8 by means of a Domain-Specific Language. The resource provides the original annotations, the annotations converted to XML (with a proprietary scheme), and the context-free grammar (CFG).
300 annotations have been encoded.
For further information, contact the Help Desk of the DiPText-KC: https://diptext-kc.clarin-it.it/helpdesk
ItAnt Venetic Corpus
ItAnt Venetic Corpus is the digital corpus of new critical editions of a selection of inscriptions in the Venetic language, produced as part of the PRIN 2017 project 'Languages and Cultures of Ancient Italy. Historical Linguistics and Digital Models' (original title: 'Lingue e culture dell'Italia antica. Linguistica storica e modelli digitali'). The inscriptions are represented in XML using the TEI/EpiDoc encoding schema and enriched with shared, standardized metadata, allowing an accurate description of each inscription as both a linguistic and a material object. The corpus also includes a facsimile reproduction of the inscriptions. The following people contributed to drafting the records: Mariarosaria Zinzi, Greta Mozzat
Codice Pelavicino
This corpus contains all the data used for the digital edition of the Codice Pelavicino, consisting of 536 XML files.
The Codice Pelavicino (Pelavicino Code) is a parchment codex preserved in the Capitular Archive of Sarzana (Italy), composed of 446 pages organized in 37 fascicles. The content is structured into 7 distinct texts: the index of the liber iurium, the archive inventory, three isolated documents of 1275, 1297 and 1497, the liber magister with copies of 23 documents, and the liber iurium with copies of 498 documents, all dating from the 10th to the end of the 13th century.
The digital edition is available at these two links:
http://pelavicino.labcd.unipi.it (introduction, description and various materials)
http://pelavicino.labcd.unipi.it/evt (full image-based digital edition supported by EVT 1)
All the data related to the Codice Pelavicino are made available under the FAIR principles for conservation, possible reuse and further studies.
CASH - Corpus management Annotation and SearcH
CASH (Corpus, Annotation, and SearcH server) is back-end software for managing text collections, annotations, and associated metadata. The system was developed to handle richly annotated document collections, including both primary texts and extensive metadata on their historical and contextual background. Its native use case is a corpus of EpiDoc XML digital critical editions of archaic inscriptions, but it can also ingest CoNLL-x and plain text. CASH is designed to be modular and extensible in multiple ways, including document ingestion, annotation and metadata semantics, data export, and multi-level queries. The back-end services expose APIs documented via Swagger.
CASH was developed in the context of the PRIN 2017 project "Languages and Cultures of Ancient Italy. Historical Linguistics and Digital Models". ILC supervisor: Valeria Quochi
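To illustrate the kind of tabular input mentioned above, here is a minimal sketch of a reader for the CoNLL-X format (ten tab-separated columns per token, with a blank line between sentences). It is not CASH's ingestion code, whose internals are not described here; the `parse_conllx` function and the sample sentence are invented for illustration.

```python
from typing import Dict, List

# The ten columns defined by the CoNLL-X shared task format.
CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def parse_conllx(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-X text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLLX_FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Invented sample data (assumption), two sentences.
SAMPLE = (
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_\n"
    "2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_\n"
    "3\tbarks\tbark\tVERB\tVBZ\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tYes\tyes\tINTJ\tUH\t_\t0\troot\t_\t_\n"
)

for sentence in parse_conllx(SAMPLE):
    print([token["form"] for token in sentence])
```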
ItAnt Cisalpine Celtic Corpus
ItAnt Cisalpine Celtic Corpus is the digital corpus of new critical editions of a selection of inscriptions in the Celtic language of Italy, produced as part of the PRIN 2017 project 'Languages and Cultures of Ancient Italy. Historical Linguistics and Digital Models' (original title: 'Lingue e culture dell'Italia antica. Linguistica storica e modelli digitali'). The inscriptions are represented in XML using the TEI/EpiDoc encoding schema and enriched with shared, standardized metadata, allowing an accurate description of each inscription as both a linguistic and a material object. The following people contributed to drafting the records: Patrizia Solinas, Mariarosaria Zinzi, Luca Rigobianco
ELIta (Emotion Lexicon for Italian)
ELIta (Emotion Lexicon for Italian) is a resource that includes words and emojis, totaling 6905 entries in Italian. Each entry has been manually evaluated by human annotators to determine its level of association with eight basic emotions as defined by Plutchik's wheel of emotions (joy, sadness, anger, disgust, fear, trust, surprise, and anticipation). Additionally, the lexicon is annotated according to the psycholinguistic dimensions of valence (positive or negative), arousal (calm or excited), and dominance (submissive or dominant, that is, out of control or in control). The lexicon is also available in a non-aggregated version, with demographic metadata.
Enriched Data from Codice Pelavicino
Lists of the named entities of the digital edition of the Codice Pelavicino, in XML.
The same lists are also provided in an enriched form, where each named entity (Person, Place, Family, People, Institution) is supplemented with related information such as date, original document and geographic coordinates. These data are available in both JSON and CSV
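As an illustration of how the enriched CSV lists might be consumed, the sketch below groups entities by type and converts coordinates to numbers. The column names (`name`, `type`, `date`, `document`, `latitude`, `longitude`), the sample rows and the `group_by_type` function are all assumptions made for this example; the actual files may use a different layout.

```python
import csv
import io
from collections import defaultdict

# Invented sample with hypothetical column names and placeholder values.
SAMPLE_CSV = """name,type,date,document,latitude,longitude
Example Person,Person,1275,doc_001,,
Example Place,Place,1297,doc_002,10.0,20.0
"""


def group_by_type(csv_text):
    """Group entity rows by their 'type' column, parsing coordinates."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat, lon = row["latitude"], row["longitude"]
        # Entities without coordinates (e.g. persons) get None.
        row["coordinates"] = (float(lat), float(lon)) if lat and lon else None
        groups[row["type"]].append(row)
    return dict(groups)


entities = group_by_type(SAMPLE_CSV)
for entity_type, rows in entities.items():
    print(entity_type, [(r["name"], r["coordinates"]) for r in rows])
```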
DigItAnt Search
DigItAnt-search is the GUI web application of the DigItAnt platform, designed to explore, visualise and navigate the different sources of information created or linked within the national ItAnt project (https://www.prin-italia-antica.unifi.it/). DigItAnt is a platform designed to support historical linguistic and epigraphic studies, and to assist researchers in the creation, management and consultation of digital linguistic resources for the fragmentary ancient languages.
DigItAnt-search lets users interactively explore various sources of information in a unified and easily accessible environment.
The development of DigItAnt was funded by the Ministry of University and Research under the program Research Projects of Relevant National Interest (PRIN) 2017.
This front-end application was developed by Michele Mallia under the supervision of Valeria Quochi and thanks to continuous discussion and exchange with the team, composed of: Andrea Bellandi, Alessandro Tommasi, Cesare Zavattari, Silvia Piccini, Michela Bandini, Chiara Fazzone
KIParla - ParlaBO transcripts
The ParlaBO corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface.
The ParlaBO corpus was compiled within the framework of the “DiverSIta – Diversity in spoken Italian” project, funded by the Italian Ministry of University and Research (MUR) (PRIN 2022 PNRR Call). It was also supported by the Project PNRR PE5: CHANGES – Cultural Heritage Active Innovation for Next-Gen Sustainable Society (Spoke 3: Digital libraries, archives and philology. WP5: Languages and their legacies in oral digital archives: synchronic interdisciplinary perspectives on multilingualism, language minorities, dialects and cultural contact in Italy).
It consists of over 65 hours of spoken data collected in Bologna and its province through semi-structured interviews. The interviews, conducted between 2021 and 2024, involved more than 150 speakers with different origins, ages, education levels, and occupations and covered a variety of topics (study, work, leisure activities, retirement, memories of the past, life in the city, traditions, local customs, etc.). The transcriptions have been anonymized.
Overall, the module is made up of 86 conversations and includes 155 speakers.
This repository contains:
• metadata for both speakers (occupation, gender, age, origin, L1, educational achievement) and conversations (collection point, year, languages used), in the metadata subfolder
• descriptions of the set of transcription conventions used for this module
• for each conversation: an .eaf file in the eaf/ folder (time-aligned Jefferson-style transcription); a .txt file in the linear-jefferson/ folder (linearized Jefferson-style transcription); a .txt file in the linear-orthographic/ folder (linearized transcription retaining only orthographic words); a .tsv file in the tsv/ folder (tokenised version of the transcription).
More information can be found in the README.md file.
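The .eaf files are ELAN annotation documents, an XML format in which annotations refer to time slots expressed in milliseconds. The sketch below shows one way to extract time-aligned annotations from such a file. The sample document, the tier names and the `read_eaf` function are invented for illustration, and the assumption that each tier corresponds to one speaker should be checked against the README.

```python
import xml.etree.ElementTree as ET

# Invented minimal EAF document (assumption), not taken from ParlaBO.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="2000"/>
  </TIME_ORDER>
  <TIER TIER_ID="SPK_B">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>hi</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="SPK_A">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_eaf(xml_text):
    """Return (tier_id, start_ms, end_ms, text) tuples sorted by start time."""
    root = ET.fromstring(xml_text)
    # Map each time slot id to its value in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))
    # Unaligned annotations (no start time) are placed last.
    rows.sort(key=lambda r: (r[1] is None, r[1] or 0))
    return rows


for row in read_eaf(SAMPLE_EAF):
    print(row)
```

Sorting by start time interleaves the tiers, which gives a rough chronological view of the conversation, including overlapping turns.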
Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License