Institute for Computational Linguistics “A. Zampolli”
ILC4CLARIN: Linguistic Data and NLP Tool
Digital edition of opera libretti
The project provides a digital edition of the libretti staged for the election of the Council of the Elders in the Republic of Lucca. The celebration, known as funzione delle Tasche, was repeated every three years from 1636 to 1797. The present edition collects the works from 1636 to 1705 to analyze changes and recurring motifs throughout the 17th century in a republican context.
TrAVaSI_VoDIM Corpus
The TrAVaSI_VoDIM Corpus is a sample of the corpus built for the Vocabolario Dinamico Dell’Italiano Moderno (VoDIM, Marazzini and Maconi, 2018), gathering Italian texts from the Unification of Italy (1861) to the present day. TrAVaSI_VoDIM is balanced and representative of different prose domains (art, gastronomy, law, newspapers, literature, popular fiction, science), for a total of about 21,000 tokens. TrAVaSI_VoDIM is morpho-syntactically annotated and lemmatized. The annotation, which conforms to the Universal Dependencies standard (UD, De Marneffe et al. 2021), was carried out semi-automatically: the corpus was first annotated automatically with the Stanza “combined” model for Italian, and the automatic annotation was then manually revised. The resulting corpus has also been used to retrain Stanza to deal with historical varieties of the Italian language, with encouraging results.
Archilochus of Paros: Elegiac Fragments - XML Archive
Archilochus of Paros: Elegiac Poems – XML Archive
The sources are the outcome of seminars on the digitization of ancient Greek fragments held at the University of Parma in February 2021 (16th, 17th, 19th, 24th and 26th), December 2021 (13th, 14th, 17th, 20th) and June 2022 (from 13th to 17th). The seminars provided practical skills in XML/TEI encoding of ancient literary texts (text and critical apparatus), as well as an overview of the state of the art in scientific and digital critical editions. Students tested the potential of digital philology by working on the Elegies of Archilochus of Paros (7th century BC). The full corpus of Archilochus’ Elegies is now available.
The work was carried out under the direction of Anika Nicolosi (University of Parma) together with Beatrice Nava (University of Bologna) and Filippo Boni (University of Parma). The seminars and the resources produced are part of the DEA (Digital Edition of Archilochus) project.
TrAVaSI_GDLI-quotation corpus
The TrAVaSI_GDLI-quotation corpus (TrAVaSI_GDLI-QC) is a first nucleus of a diachronic corpus for Italian. It collects a sample of the quotations of a historical dictionary, namely the "Grande Dizionario della Lingua Italiana" (GDLI) by Salvatore Battaglia, which includes a huge collection of quotations covering the entire history of the Italian language, from the Middle Ages to the present day. Different criteria guided the composition of the corpus. Among the most cited authors, those covering the widest chronological span were selected. Representativeness of different text typologies (e.g. chronicle, literary prose, poetry, treatises) was also taken into account. The resulting TrAVaSI_GDLI-QC consists of two balanced sub-corpora, with quotations from works written between the 14th and the 20th century: one collecting 1500 prose quotes from 15 authors (100 each) for a total of about 35,000 tokens, and the other gathering 500 poetry quotes from 10 authors (50 each) for a total of about 10,000 tokens. TrAVaSI_GDLI-QC is morpho-syntactically annotated and lemmatized. The annotation, which conforms to the Universal Dependencies standard (UD, De Marneffe et al. 2021), was carried out semi-automatically: both sub-corpora were first annotated automatically with the Stanza “combined” model for Italian, and the automatic annotation was then manually revised. The resulting corpus has also been used to retrain Stanza to deal with historical varieties of the Italian language, with encouraging results.
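The description above does not state the file format of the release. As a minimal sketch, assuming the annotation is distributed in CoNLL-U (the usual exchange format for Universal Dependencies data), the lemmas and part-of-speech tags could be read as follows. The sample sentence and its annotation are invented for illustration and are not taken from the corpus.

```python
from collections import Counter

# The ten CoNLL-U columns, in the order fixed by the UD format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comments and metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "3.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Invented example sentence ("Il gatto dorme."), built with explicit tabs.
SAMPLE_CONLLU = "\n".join(
    ["# text = Il gatto dorme."]
    + ["\t".join(row) for row in [
        ["1", "Il", "il", "DET", "_", "_", "2", "det", "_", "_"],
        ["2", "gatto", "gatto", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
        ["3", "dorme", "dormire", "VERB", "_", "_", "0", "root", "_", "_"],
        ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
    ]]
) + "\n"

sentences = read_conllu(SAMPLE_CONLLU)
lemmas = [token["lemma"] for token in sentences[0]]
upos_counts = Counter(token["upos"] for token in sentences[0])
print(lemmas, dict(upos_counts))
```

A plain reader like this avoids any dependency; if the files turn out to use another format, only `read_conllu` would need to change.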
It-Sr-NER: CLARIN compatible NER and geoparsing web services for parallel texts: case study Italian and Serbian
It-Sr-NER-corp is an Italian/Serbian bilingual corpus of 10,000 aligned sentences, compiled within the It-Sr-project from samples of several Italian novels translated into Serbian and vice versa. Its purpose is to support the development of a CLARIN-compatible NER web service for parallel texts, with Italian and Serbian as the case study. The set of 10,000 natural language segments is split into 4 files: one of 1,000 segments and three of 3,000 segments (1*1000+3*3000). The corpus comprises: 1) text versions, Italian and Serbian, with one segment per line; 2) TMX (Translation Memory eXchange) bilingual aligned segments; 3) monolingual text and TMX files with automatically annotated named entities for six NER classes: demonyms (DEMO), works of art (WORK), person names (PERS), places (LOC), events (EVENT) and organizations (ORG). It-Sr-NER annotation uses a Convolutional Neural Network architecture within the spaCy tool: WikiNER for Italian (Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, James R Curran) and SrpCNNER for Serbian (Cvetana Krstev, Ranka Stanković, Milica Ikonić Nešić, Branislava Šandrih Todorović).
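As a minimal sketch of how the TMX-aligned segments mentioned above might be read, the following uses only the Python standard library. It assumes plain TMX translation units whose `<tuv>` elements carry `xml:lang` values of `it` and `sr`; the actual language codes used in the corpus files, and the way named entities are marked inside segments, are not given in the description. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src="it", tgt="sr"):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):                      # one translation unit per aligned pair
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


# Invented two-unit sample, not taken from It-Sr-NER-corp.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="it" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="it"><seg>Buongiorno.</seg></tuv>
      <tuv xml:lang="sr"><seg>Dobar dan.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="it"><seg>Grazie mille.</seg></tuv>
      <tuv xml:lang="sr"><seg>Hvala lepo.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX)
for italian, serbian in pairs:
    print(italian, "|", serbian)
```

Units missing either language are skipped rather than raising an error, which keeps the alignment intact when a file contains incomplete entries.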
Computational Historical Semantics
Computational Historical Semantics is a cooperative project involving the universities of Bielefeld, Frankfurt, Regensburg and Tübingen, coordinated at Goethe-University Frankfurt by an interdisciplinary team led by Bernhard Jussen and Alexander Mehler, and funded by the German Federal Ministry for Education and Research. The text database of the project gathers more than 4000 texts spanning from the 2nd to the 15th century AD. The section of the database linked to LiLa comprises 5 texts for a total of approximately 1 million words.
DemCorpus-Basilicata: Dementia Corpus
This corpus consists of semi-spontaneous speech data produced by elderly residents of the Basilicata region in Italy. In total, 40 individuals participated: the patient group consists of 20 participants with a diagnosis of dementia (9 cases of Alzheimer’s disease, 2 patients with mixed dementia, 5 patients with not-further-specified dementia, 3 patients with vascular dementia, and 1 patient with frontotemporal dementia), and the control group consists of 20 healthy individuals matched for age, gender, and geographical origin. Three linguistic tasks were administered to all participants: two narrative tasks (the first about an excursion or a trip, the second about Christmas festivities) and an image description task. This resulted in 8 hours and 50 minutes of recorded semi-spontaneous speech, which was then transcribed, segmented, and annotated using ELAN. This research project was approved by the Bioethics Committee of the Alma Mater Studiorum - University of Bologna (no. 0072032/2022). Due to the Italian privacy policy, the raw data of the corpus (i.e., speech recordings, transcriptions, and clinical information of the participants) are not available. Processed data (i.e., tables of acoustic/rhythmic/lexical/syntactic values, with the names of the speakers masked through an alphanumeric acronym to ensure anonymity) are available from the contact person upon reasonable request.
Fragmenta Vaticana - Fragmenta quae dicuntur Vaticana / digital edition published by BIA ― Bibliotheca Iuris Antiqui
Fragmenta Vaticana - Fragmenta quae dicuntur Vaticana
Text digitized and encoded within the BIA project from the print edition Baviera, Johannes (ed.), Fontes iuris Romani anteiustiniani. Pars Altera: Auctores, apud S.A.G. Barbera, Florentiae, 1940.
The paratextual materials of the print source have been omitted from the digital version.
Project home page: https://bia.igsg.cnr.it/
Documentation: https://bia.igsg.cnr.it/pdf/BIANET_manual_en.pd
Institutiones Iustiniani - Institutiones Iustiniani / digital edition published by BIA ― Bibliotheca Iuris Antiqui
Institutiones Iustiniani - Institutiones Iustiniani
Text digitized and encoded within the BIA project from the print edition Mommsen, Theodor, Corpus Iuris civilis. Editio stereotypa XII. Volumen primum: Institutiones, apud Weidmannos, Berolini, 1911.
The paratextual materials of the print source have been omitted from the digital version.
Project home page: https://bia.igsg.cnr.it/
Documentation: https://bia.igsg.cnr.it/pdf/BIANET_manual_en.pd
Italian Sense Inventory
The present Sense Inventory is an Italian language resource automatically derived from two Italian computational lexicons: ItalWordNet (https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-62) and PAROLE-SIMPLE-CLIPS (https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-88). It was built in collaboration with the CNR Institute of Computational Linguistics as an experiment related to the ELEXIS project (https://elex.is/), with the aim of producing a synthetic and structured inventory of senses to be used for the sense annotation of the ELEXIS WSD test corpus. This Sense Inventory is thus based on the selection of lemmas occurring in the ELEXIS test corpus and on the merged sense information derived from the two existing lexicons.
The Python program developed for the automatic construction of the Sense Inventory takes the ELEXIS dataset as input, extracts the lemmas from its sentences and searches for all related senses in the above-mentioned resources. It also makes use of a sense mapping database for the two lexicons, 'iwnmapdb', available upon request from CNR-ILC. The extracted and checked data are then arranged in a formal structure in which the following details are given for each lemma-PoS pair:
- Unmapped senses extracted from PAROLE-SIMPLE-CLIPS (PSC),
- Mapped senses extracted from the mapping database 'iwnmapdb',
- Unmapped senses extracted from ItalWordNet (IWN).
All fields with no value are filled with None.
The tab-separated format thus has the following columns, in order:
LEMMA, POS, CONCATENATED DEFINITION PSC-IWN, USEMID PSC, DEFINITION PSC, EXAMPLE PSC, SEMANTIC TYPE PSC, SYNSETID IWN, SENSEID IWN, DEFINITION IWN
The total number of lemmas (with an ADV/ADJ/NOUN/VERB part of speech) included in the Sense Inventory amounts to 3860. There are 12,944 senses and mappings reported in the Sense Inventory, out of a total of 15,672 senses extracted from PAROLE-SIMPLE-CLIPS and ItalWordNet; 3461 mappings were extracted from the mapping database 'iwnmapdb' and then included in the Sense Inventory as relevant senses.
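As a minimal sketch of how the tab-separated layout described above might be consumed, the following reads rows into dictionaries and converts the literal filler string None into a real null value. Two points are assumptions: the grouping of the header words into ten column names, and the rule that a row counts as "mapped" when both the PSC and the IWN identifiers are filled. The sample rows and identifiers are invented for illustration and are not taken from the resource.

```python
import csv
import io
from collections import Counter

# Assumed column names, in the order given by the format description.
COLUMNS = [
    "LEMMA", "POS", "CONCATENATED DEFINITION PSC-IWN",
    "USEMID PSC", "DEFINITION PSC", "EXAMPLE PSC", "SEMANTIC TYPE PSC",
    "SYNSETID IWN", "SENSEID IWN", "DEFINITION IWN",
]


def read_inventory(handle):
    """Yield one dict per sense row, mapping the filler string 'None' to None."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row == COLUMNS:   # skip blank lines and a header row, if any
            continue
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        yield {name: (None if value == "None" else value)
               for name, value in zip(COLUMNS, row)}


def classify(row):
    """Label a row by which lexicon identifiers are present (assumed rule)."""
    has_psc = row["USEMID PSC"] is not None
    has_iwn = row["SYNSETID IWN"] is not None
    if has_psc and has_iwn:
        return "mapped"
    if has_psc:
        return "psc_only"
    if has_iwn:
        return "iwn_only"
    return "empty"


# Invented rows with made-up identifiers, joined with explicit tabs.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    ["casa", "NOUN", "edificio abitato", "USEM_X1", "edificio abitato",
     "None", "Building", "SYN_Y1", "casa_1", "edificio per abitazione"],
    ["casa", "NOUN", "dinastia", "None", "None",
     "None", "None", "SYN_Y2", "casa_2", "dinastia"],
    ["correre", "VERB", "muoversi velocemente", "USEM_X2", "muoversi velocemente",
     "None", "Move", "None", "None", "None"],
])

rows = list(read_inventory(io.StringIO(SAMPLE)))
kinds = Counter(classify(row) for row in rows)
print(dict(kinds))
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the inventory has been saved.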