Institute for Computational Linguistics “A. Zampolli”

ILC4CLARIN: Linguistic Data and NLP Tool

955 research outputs found

RAC - Recovery from Ana/Anorexia Corpus

Authors: Donati Melissa; Vernillo Paola; Polidori Ludovica; Gagliardi Gloria
Publication venue: Alma Mater Studiorum – Università di Bologna
Publication date: 01/01/2023

RAC - Recovery from Ana/Anorexia Corpus is a collection of Italian eating disorder (ED) recovery community content downloaded from TikTok. It consists of 1000 videos from 27 TikTok channels (26 females and 1 male). Given the wide variety of features and formatting styles that characterize TikTok videos, we organized the data into 4 categories:
1) "Speech-only" videos, in which the user talks without background music or written text.
2) "Playback" videos, in which the user sings over a song played in the background.
3) "Text-only" videos, in which there is neither background music nor speech by the user, only written text.
4) "Mixed" videos, in which the features above appear in various combinations.
"Speech-only" and "playback" videos were transcribed automatically using the Google Web Speech API, and the transcriptions were then checked manually. "Text-only" and "mixed" videos were transcribed manually.
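The four-way categorization above can be sketched as a small decision function. This is only an illustration: the record does not say how videos were assigned to categories (it may have been done by hand), and the boolean feature names `speech`, `singing`, `music` and `text` are hypothetical.

```python
def categorize(speech: bool, singing: bool, music: bool, text: bool) -> str:
    """Map hypothetical per-video feature flags to one of the four RAC categories."""
    # Speech-only: the user talks, with no background music and no written text.
    if speech and not (singing or music or text):
        return "speech-only"
    # Playback: the user sings over a background song (assumed: no talking, no text).
    if singing and music and not (speech or text):
        return "playback"
    # Text-only: written text with neither music nor the user's voice.
    if text and not (speech or singing or music):
        return "text-only"
    # Any other combination is treated as mixed (an assumption of this sketch).
    return "mixed"


def needs_manual_transcription(category: str) -> bool:
    """Per the description, text-only and mixed videos were transcribed manually."""
    return category in {"text-only", "mixed"}
```

For example, `categorize(True, False, True, True)` falls through to "mixed", which the description says was transcribed manually rather than automatically.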

Pan-Latin Photovoltaic Systems Lexicon

Author: Zanola Maria Teresa
Publication venue: Educatt
Publication date: 18/05/2023

The Pan-Latin Photovoltaic Systems Lexicon (Lessico panlatino dei sistemi fotovoltaici), developed within the Realiter network, contains the basic terms related to photovoltaic systems in seven Romance languages (Italian, Catalan, Spanish, French, Galician, Portuguese, Romanian) and in English.

EpiLexO

Authors: Mallia Michele; Bellandi Andrea; Tommasi Alessandro; Zavattari Cesare; Bandini Michela; Quochi Valeria
Publication venue: CLARIN-IT
Publication date: 01/01/2023

EpiLexO is a user-friendly web application for creating and editing an integrated system of language resources for ancient fragmentary languages. The system is centered on the lexicon and complies with current digital humanities and Linked Open Data principles. EpiLexO supports the editing of lexica with all relevant cross-references: links to their testimonies, to bibliographic information, and to other (external) resources and common vocabularies. This front-end application rests on a Service-Oriented Architecture with two main back-end components, the LexO-server and the CASH-server, which manage lexica and textual documents respectively via RESTful web services, plus additional services for other aspects such as access and authentication and XML rendering. All code is available on GitHub (https://github.com/DigItAnt/). The application was developed in the context of a project on the fragmentarily attested languages of ancient Italy, but it can be applied to other similar contexts.

Pan-Latin Smart City Lexicon

Authors: Grimaldi Claudio; Romagnoli Elisa
Publication venue: Educatt
Publication date: 14/05/2023

The Pan-Latin Smart City Lexicon (Lessico panlatino della Smart City), developed within the Realiter network, contains the basic terms related to the Smart City concept in seven Romance languages (Italian, Catalan, Spanish, French, Galician, Portuguese, Romanian) and in English.

Diccionario de Arquitectura_ES-Dictionnaire d'Architecture_FR

Author: BARTOLOME-DIAZ ZAIDA
Publication venue: Universidad de Las Palmas de Gran Canaria
Publication date: 27/04/2022

The Diccionario de Arquitectura_ES – Dictionnaire d’Architecture_FR is a bilingual Spanish–French lexical resource focused on contemporary architecture. It has been designed as a structured database that combines linguistic and conceptual information, aiming to support both academic research and professional practice in architecture and related fields. The dictionary is based on two parallel corpora of Spanish and French texts from the last 20 years (academic publications, professional reports, specialized magazines, and digital resources). Key lexical units were extracted automatically using NLP tools, then enriched through semantic relations (hyperonyms, synonyms, collocations) and bilingual alignment. Entries follow the TEI-LMF encoding model, ensuring interoperability and compliance with current standards for digital lexicography. Each entry includes:
Linguistic information: lemma, part of speech, gender, and grammatical features.
Conceptual information: definitions from specialized dictionaries and online resources.
Usage examples: extracted from the corpora to illustrate real usage.
Cross-references: links to translations, equivalent concepts (e.g. DBpedia, WordNet), and related entries.
Multimodal enrichment: images and figures when relevant.

It-Sr-NER

Authors: Perišić Olja; Stanković Ranka; Iković Nešić Milica; Škorić Mihailo; Vitas Duško; Krstev Cvetana
Publication venue: Università degli studi di Torino
Publication date: 20/09/2022

The It-Sr-NER tool is a CLARIN-compatible NER web service for parallel texts, with a case study on Italian and Serbian; it can be used to recognize and classify named entities in bilingual natural language texts. Input parallel texts should be TMX (Translation Memory eXchange) files, e.g. Sr-It. It-Sr-NER can recognize six NER classes: demonyms (DEMO), works of art (WORK), person names (PERS), places (LOC), events (EVENT) and organisations (ORG). The service can also be used for monolingual NER annotation with the available spaCy NER models. It-Sr-NER uses a CNN architecture within spaCy and performs named entity linking with Wikidata using spaCyOpenTapioca. API usage is described at http://ners.jerteh.rs/4api. A Postman example is available at https://github.com/rankastankovic/It-Sr-NER/blob/main/static/Postman_call_ners-mono.PN
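Since the service expects TMX input, a minimal sketch of how aligned Serbian-Italian segments could be read from a TMX file before annotation may be helpful. It relies only on the standard TMX elements (`tu`, `tuv`, `seg`) and the `xml:lang` attribute. The function name, the language codes `sr` and `it`, and the sample sentences are assumptions made for illustration and are not taken from the tool.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample; real input would be a TMX file exported from an aligner.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="sr" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="sr"><seg>Beograd je glavni grad Srbije.</seg></tuv>
      <tuv xml:lang="it"><seg>Belgrado è la capitale della Serbia.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src_lang="sr", tgt_lang="it"):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "sr-Latn" is treated as "sr".
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

Each returned pair could then be passed to a spaCy pipeline per language, or the TMX file itself could be sent to the web service as described in its API documentation.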

Augustine's Confessions

Authors: Passarotti Marco; Mambrini Francesco; Iurescia Federica; Cecchini Flavio Massimiliano; Moretti Giovanni; Testori Marinella; Pedonese Giulia
Publication venue: CIRCSE Research Centre, Università Cattolica del Sacro Cuore
Publication date: 2022

The digital text of the 13 books of the "Confessiones" by Augustinus is taken from The Latin Library (http://www.thelatinlibrary.com/august.html). The original text was lemmatized and PoS-tagged with the UDPipe tool (using the model trained on PROIEL). The output of UDPipe was then checked manually at the CIRCSE Research Centre of the Università Cattolica del Sacro Cuore, Milan, Italy. The text was also linked to the Lemma Bank of the LiLa Knowledge Base at CIRCSE.
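UDPipe writes CoNLL-U by default, so the lemmatized and tagged text can be read with a few lines of code. The sketch below assumes the resource is distributed in CoNLL-U, which the record does not state. The two sample tokens are illustrative and are not copied from the released data.

```python
def read_conllu_tokens(text):
    """Return (form, lemma, upos) triples from a CoNLL-U string."""
    tokens = []
    for line in text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        token_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in token_id or "." in token_id:
            continue
        tokens.append((form, lemma, upos))
    return tokens


# Illustrative sample only: the tags are plausible but not from the resource.
SAMPLE_CONLLU = "\n".join([
    "# text = Magnus es",
    "\t".join(["1", "Magnus", "magnus", "ADJ", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "es", "sum", "AUX", "_", "_", "1", "cop", "_", "_"]),
])
```

Calling `read_conllu_tokens(SAMPLE_CONLLU)` yields one triple per token, which is enough to build a lemma frequency list or to look lemmas up in an external lemma collection.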

LexicO

Authors: Sciolette Flavia; Giovannetti Emiliano; Marchi Simone
Publication venue: Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR)
Publication date: 14/07/2022

LexicO is a resource derived from Parole-Simple-Clips (PSC, http://hdl.handle.net/20.500.11752/ILC-88). It contains all four levels of linguistic information represented in PSC (phonology, morphology, syntax, and semantics), which were automatically analysed to find redundant, erroneous and missing data. The update process that led from PSC to the current version of LexicO included: i) the removal of all entries known to be redundant (i.e. duplicates) at all four linguistic levels; ii) the creation of tables dedicated to candidate redundant entries, detected by considering specific similarities among entries; iii) the correction of missing semantic and syntax-semantics interface relations among the entries of the lexicon.

French ELTEC NER Open Dataset

Authors: Brando Carmen; Frontini Francesca; Galleron Ioana
Publication venue: Université Sorbonne Nouvelle, laboratoire Lattice - UMR 8094
Publication date: 19/10/2022

This dataset is derived from the annotation of named entities in a collection of 100 French novels from the "long" 19th century. The collection was assembled in the framework of the COST Action 16204 "Distant reading" and can be found at the following address: https://distantreading.github.io/ELTeC/fra/index.html. From these 100 novels, samples of varying size were extracted and annotated with Stanza-NER. The result was loaded onto Tagtog for manual verification and re-annotation. We used 8 categories of named entities:
e_1 PERS: names of persons
e_2 LOC: place names
e_3 ORG: names of institutions, organisations
e_4 OTHER
e_5 WORK: works of art (only if they can be identified with certainty, e.g. "Mona Lisa" and not "a painting by Leonardo da Vinci")
e_6 DEMO: names of distinct peoples or social groups (do not annotate "the weavers", but annotate "the Jacobins")
e_7 ROLE: occupation, social position, family role of the person
e_8 EVENT: designation of historical events, which sometimes, but not necessarily, implies a date (e.g. "the revolution of 18..", "the battle of Jarnac")
The data are provided in the export formats offered by Tagtog: JSON for annotations and HTML for text (without annotations). For more information on the steps of data elaboration, annotation choices and quality control, see the data paper mentioned above. The NER annotation of the entire ELTeC corpus is described in: Francesca Frontini, Carmen Brando, Joanna Byszuk, Ioana Galleron, Diana Santos, and Ranka Stanković. "Named Entity Recognition for Distant Reading in ELTeC". CLARIN Annual Conference 2020 (5-7 October), Virtual Edition. Madrid, Spain: CLARIN, 2020, pp. 37-41, ISSN 2773-2177. https://office.clarin.eu/v/CE-2020-1738-CLARIN2020_ConferenceProceedings.pdf
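The class identifiers e_1 to e_8 can be mapped back to their labels when reading the JSON annotation files. The sketch below assumes the Tagtog export layout in which each file has a top-level `entities` list whose items carry a `classId` field. This should be checked against the actual files, and the sample document is invented for illustration.

```python
import json
from collections import Counter

# Mapping taken from the category list in the dataset description.
ENTITY_CLASSES = {
    "e_1": "PERS",
    "e_2": "LOC",
    "e_3": "ORG",
    "e_4": "OTHER",
    "e_5": "WORK",
    "e_6": "DEMO",
    "e_7": "ROLE",
    "e_8": "EVENT",
}


def count_entity_classes(ann_json: str) -> Counter:
    """Count annotated entities per label in one annotation file (as a string)."""
    doc = json.loads(ann_json)
    counts = Counter()
    # "entities" and "classId" are assumed field names (see the note above).
    for entity in doc.get("entities", []):
        label = ENTITY_CLASSES.get(entity.get("classId"), "UNKNOWN")
        counts[label] += 1
    return counts


# Invented sample that mimics the assumed layout; not taken from the dataset.
SAMPLE_ANN = json.dumps({
    "entities": [
        {"classId": "e_1", "part": "s1p1", "offsets": [{"start": 0, "text": "Julien"}]},
        {"classId": "e_6", "part": "s1p1", "offsets": [{"start": 20, "text": "Jacobins"}]},
        {"classId": "e_1", "part": "s1p2", "offsets": [{"start": 5, "text": "Mathilde"}]},
    ]
})
```

Summing the counters over all files gives a per-class distribution for the whole sample, which is a quick sanity check before training or evaluating a NER model on the data.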

Pan-Latin Textile Fibres Vocabulary

Authors: Dankova Klara; Zanola Maria Teresa; Calvi Silvia
Publication venue: Educatt
Publication date: 21/01/2022

The Pan-Latin Textile Fibres Vocabulary (Lessico panlatino delle fibre tessili), developed within the Realiter network, contains the basic terms designating textile fibres in seven Romance languages (Italian, Catalan, Spanish, French, Galician, Portuguese, Romanian) and in English.

Full texts: 1
Metadata records: 955

Updated in last 30 days.
