1,720,981 research outputs found

Gestione ed Analisi di Big Data: Sfide e Opportunità nell'Integrazione e nell'Estrazione di Conoscenza dai Dati

Author: Paganelli Matteo
Publication venue
Publication date: 2021
Field of study

In the Big Data era, managing and consuming data adequately is one of the most challenging activities, owing to a series of critical issues that are usually grouped into five key concepts: volume, velocity, variety, veracity and variability. In response to these needs, a large number of algorithms and technologies have been proposed in recent years; however, many problems remain open and new challenges have emerged. Among these, to name a few, are the need for annotated data to train machine learning techniques, the need to interpret the logic of the systems used, the need to reduce the impact of managing them in production (the so-called technical debt) and the need for tools that support human-machine interaction. This thesis studies in depth the challenges affecting data integration and the modern management of relational DBMSs (in terms of their readjustment to new requirements). The main problem affecting data integration concerns its evaluation in real contexts, which typically requires the costly and time-consuming involvement of domain experts. From this perspective, tools that support and automate this critical task, as well as unsupervised ways of carrying it out, would be very useful. In this context, my contribution can be summarized in the following points: 1) the realization of techniques for the unsupervised evaluation of data integration tasks and 2) the development of automatic approaches for the configuration of rule-based matching models. As for relational DBMSs, over the last few decades they have proved to be the workhorse of many companies, thanks to their simplicity of governance, security, auditability and high performance. Today, however, we are witnessing a partial rethinking of their use compared to the original design. For example, they are being used to solve more advanced tasks, such as classification, regression and clustering, typical of the machine learning field. Establishing a symbiotic relationship between these two research fields could be essential to solving some of the critical issues listed above. In this context, my main contribution was to verify the possibility of performing in-DBMS inference of machine learning pipelines at serving time.
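As a rough illustration of what in-DBMS inference can look like, the sketch below translates a linear regression model into a SQL expression and evaluates it inside SQLite, so that predictions are computed by the database rather than by application code. This is not the method of the thesis; the table `houses`, its columns, the weights and the helper `linear_model_to_sql` are all hypothetical names chosen for the example.

```python
import sqlite3

def linear_model_to_sql(weights, intercept, table):
    """Build a SELECT that computes intercept + sum(w_i * column_i) per row.

    Hypothetical helper: `weights` maps column names to coefficients.
    Column and table names are trusted here; real code should validate them.
    """
    terms = " + ".join(f"({w!r} * {col})" for col, w in weights.items())
    return f"SELECT id, {intercept!r} + {terms} AS prediction FROM {table} ORDER BY id"

# Toy data held entirely inside the database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (id INTEGER PRIMARY KEY, rooms REAL, area REAL)")
conn.executemany("INSERT INTO houses VALUES (?, ?, ?)", [(1, 3, 80), (2, 5, 120)])

# Coefficients of an already trained model (made-up values).
weights = {"rooms": 10.0, "area": 2.0}
query = linear_model_to_sql(weights, 5.0, "houses")

# The arithmetic of the model runs inside the SQL engine.
predictions = conn.execute(query).fetchall()
print(predictions)
```

Only a linear model is shown because its scoring function maps directly onto SQL arithmetic; pipelines with preprocessing steps or non-linear models would need a richer translation.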

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

A multi-facet analysis of BERT-based entity matching models

Author: Guerra Francesco
Tiano Donato
Paganelli Matteo
Publication venue
Publication date: 01/01/2023
Field of study

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Parallelizing computations of full disjunctions

Author: Beneventano Domenico
Guerra Francesco
Sottovia Paolo
Paganelli Matteo
Publication venue
Publication date: 01/01/2019
Field of study

In relational databases, the full disjunction operator is an associative extension of the full outer join to an arbitrary number of relations. Its goal is to maximize the information that can be extracted from a database by connecting all tables through all join paths. The use of full disjunctions has been envisaged in several scenarios, such as data integration and knowledge extraction. One of the main obstacles to its adoption in real business scenarios is the long time its computation requires. This paper overcomes this limitation by introducing a novel approach, parafd, based on parallel computing techniques, which implements the full disjunction operator in both an exact and an approximate version. Our proposal has been compared with state-of-the-art algorithms, which have also been re-implemented to run in parallel. The experiments show that it outperforms existing approaches in running time. Finally, we have experimented with treating the full disjunction as a collection of documents indexed by a textual search engine. In this way, we provide a simple technique for performing keyword search over relational databases. The results obtained against a benchmark show high precision and recall levels, even compared with existing proposals.
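To make the operator concrete, here is a minimal brute-force sketch of a full disjunction over small in-memory relations. It is not the parafd algorithm and is not parallel: it enumerates every combination of tuples (exponential in the number of relations), keeps those that are connected through shared attributes and agree on them, and then discards results subsumed by a more informative one. The relations `customers`, `orders` and `shipments` and the function names are assumptions made for the example; input tuples are assumed to contain no nulls.

```python
from itertools import combinations, product

def _merge_if_consistent(tuples):
    """Merge tuples into one dict, or return None if shared attributes disagree."""
    merged = {}
    for t in tuples:
        for attr, value in t.items():
            if merged.setdefault(attr, value) != value:
                return None
    return merged

def _connected(tuples):
    """True if the tuples form one component when linked by shared attribute names."""
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(tuples)):
            if j not in seen and tuples[i].keys() & tuples[j].keys():
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(tuples)

def full_disjunction(relations):
    """Brute-force full disjunction; each relation is a list of dicts."""
    candidates = set()
    for size in range(1, len(relations) + 1):
        for subset in combinations(relations, size):
            for tuples in product(*subset):  # one tuple from each chosen relation
                if not _connected(tuples):
                    continue
                merged = _merge_if_consistent(tuples)
                if merged is not None:
                    candidates.add(frozenset(merged.items()))
    # Drop every result whose attribute/value pairs are a proper subset of another.
    return [dict(c) for c in candidates if not any(c < other for other in candidates)]

customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}]
orders = [{"oid": 10, "cid": 1}, {"oid": 11, "cid": 3}]
shipments = [{"oid": 10, "carrier": "DHL"}, {"oid": 12, "carrier": "UPS"}]

result = full_disjunction([customers, orders, shipments])
for row in result:
    print(row)
```

In this toy input, Ann, order 10 and the DHL shipment combine into a single row, while Bob, order 11 and the UPS shipment have no matching partners and survive as partial rows; missing attributes correspond to the nulls a padded relational result would contain.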

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Evaluating the integration of datasets

Author: Guerra Francesco
Paganelli Matteo
Ferro Nicola
Del Buono Francesco
Publication venue
Publication date: 01/01/2022
Field of study

Evaluation is a bottleneck in data integration processes: it is performed by domain experts through onerous manual data inspections. This task is particularly heavy in real business scenarios, where the large amount of data makes checking all integrated tuples infeasible. Our idea is to address this issue by providing the experts with an unsupervised measure, based on word frequencies, which quantifies how representative one dataset is of another, giving an indication of how good the integration process is. The paper motivates and introduces the measure and provides extensive experimental evaluations that show the effectiveness and efficiency of the approach.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Archivio istituzionale della ricerca - Università di Padova

Explaining data with descriptions

Author: Sottovia Paolo
Guerra Francesco
Interlandi Matteo
Paganelli Matteo
Maccioni Antonio
Publication venue
Publication date: 01/01/2020
Field of study

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Finding Synonymous Attributes in Evolving Wikipedia Infoboxes

Author: Sottovia Paolo
Velegrakis Yannis
Guerra Francesco
Paganelli Matteo
Publication venue
Publication date: 01/01/2019
Field of study

Wikipedia Infoboxes are semi-structured data structures organized in an attribute-value fashion. For each type of entity represented in Wikipedia, policies establish, in the form of a template, the attribute names that the Infobox should contain. However, these requirements change over time, and users often choose not to obey them strictly. As a result, it is hard to treat the history of Wikipedia pages in an integrated way, which makes it difficult to analyze the temporal evolution of Wikipedia entities through their Infoboxes and impossible to compare entities of the same type directly. To address this challenge, we propose an approach that deals with the misalignment of attribute names and identifies clusters of synonymous Infobox attributes. Elements in the same cluster are considered a temporal evolution of the same attribute. To identify the clusters we use two different distance metrics. The first is the co-occurrence degree, which is treated as a negative distance; the second is the co-occurrence of similar values in the attributes, which is treated as positive evidence of synonymy. We formalize the problem as a correlation clustering problem over a weighted graph constructed with attributes as nodes and positive and negative evidence as edges. We solve it with a linear programming model that yields a good approximation. Our experiments over a collection of Infoboxes from the last 13 years show the potential of our approach.

Crossref

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Big Data Integration of Heterogeneous Data Sources: The Re-Search Alps Case Study

Author: Sottovia Paolo
Paganelli Matteo
Vincini Maurizio
Guerra Francesco
Publication venue
Publication date: 01/01/2019
Field of study

The application of big data integration techniques in real scenarios needs to address practical issues related to the scalability of the process and the heterogeneity of data sources. In this paper, we describe the pipeline developed in the context of Re-search Alps, a project funded by the EU Commission through the INEA Agency in the CEF Telecom framework, which aims to create an open dataset describing research centers located in the Alpine area.

Crossref

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Author Instructions

Author: Instructions Author
Publication venue
Publication date: 04/11/2013
Field of study

Crossref

Cartographic Perspectives (E-Journal - North American Cartographic Information Society, NACIS)

Going Beyond Counting First Authors in Author Co-citation Analysis

Author: Zhao Dangzhi
Publication venue
Publication date: 01/01/2005
Field of study

The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
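The difference between the two counting schemes can be sketched in a few lines. The example below is an illustrative simplification, not the study's actual procedure: each citing paper is a list of cited works, each cited work is a list of author names, and a pair of authors is counted once per citing paper that cites both. The function name, the `max_authors` parameter and the sample data are assumptions, and co-authors of the same cited work are also treated as co-cited, which a real analysis might handle differently.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors):
    """Count author pairs co-cited by the same paper.

    max_authors=1 gives first-author counting; max_authors=5 considers
    the first five authors of every cited work.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts

# Two hypothetical citing papers with their reference lists.
citing_papers = [
    [["Smith", "Lee"], ["Garcia"]],
    [["Smith", "Lee"], ["Chen", "Garcia"]],
]

first_author = cocitation_counts(citing_papers, max_authors=1)
first_five = cocitation_counts(citing_papers, max_authors=5)
print(first_author)
print(first_five)
```

In this toy data, Garcia and Smith are co-cited once under first-author counting, because Garcia is a second author in the second reference list, and twice when up to five authors per work are considered.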

E-LIS

Understanding Data in the Blink of an Eye

Author: Sottovia Paolo
Interlandi Matteo
Guerra Francesco
Maccioni Antonio
Paganelli Matteo
Publication venue
Publication date: 01/01/2019
Field of study

Many data analysis and knowledge mining tasks require a basic understanding of the content of a dataset prior to any data access. In this demo, we showcase how data descriptions (sets of compact, readable and insightful formulas of boolean predicates) can be used to guide users in understanding datasets. Finding the best description for a dataset is, unfortunately, both computationally hard and task-specific. This demo shows not only that we can generate descriptions at interactive speed, but also that diverse user needs, from anomaly detection to data exploration, can be accommodated through a user-driven process that exploits dynamic programming in concert with a set of heuristics.

Crossref

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia