Management and Analysis of Big Data: Challenges and Opportunities in Data Integration and Knowledge Extraction
In the Big Data era, managing and consuming data adequately is one of the most challenging activities, owing to a series of critical issues usually grouped under five key concepts: volume, velocity, variety, veracity and variability. In response, a large number of algorithms and technologies have been proposed in recent years; however, many problems remain open and new challenges have emerged. These include, to name a few, the need for annotated data to train machine learning techniques, to interpret the logic of the systems in use, to reduce the impact of managing them in production (the so-called technical debt), and to provide tools that support human-machine interaction.
This thesis studies in depth the challenges affecting data integration and the modern management of relational DBMSs, that is, their adaptation to new requirements.
The main problem affecting data integration concerns its evaluation in real contexts, which typically requires the costly and time-consuming involvement of domain experts. Tools that support and automate this critical task, or that solve it in an unsupervised way, would therefore be very useful. In this context, my contribution can be summarized in two points: 1) the development of techniques for the unsupervised evaluation of data integration tasks and 2) the development of automatic approaches for configuring rule-based matching models.
Relational DBMSs have proved to be, over the last few decades, the workhorse of many companies, thanks to their simplicity of governance, security, auditability and high performance. Today, however, we are witnessing a partial rethinking of their use with respect to their original design. For example, they are being used to solve more advanced tasks, such as classification, regression and clustering, typical of the machine learning field. Establishing a symbiotic relationship between these two research fields could be essential to solving some of the critical issues listed above. In this context, my main contribution was to verify the possibility of performing in-DBMS inference of machine learning pipelines at serving time.
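As a rough illustration of what in-DBMS inference can look like, the sketch below turns a linear model into a SQL expression and evaluates it inside SQLite. It is not the method of the thesis: the table `customers`, its columns, the weights, and the helper `linear_model_to_sql` are all hypothetical names chosen for this example.

```python
import sqlite3

def linear_model_to_sql(weights, intercept, table):
    """Build a SELECT that computes intercept + sum(w_i * column_i) per row.

    Hypothetical helper: assumes trusted column names and a numeric `id` key.
    """
    terms = " + ".join(f"({w!r} * {col})" for col, w in weights.items())
    return f"SELECT id, {intercept!r} + {terms} AS score FROM {table}"

# Toy data standing in for a production table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, age REAL, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 30.0, 2.0), (2, 50.0, 4.0)],
)

# Weights of an already trained model (made-up values).
weights = {"age": 0.5, "income": 2.0}
query = linear_model_to_sql(weights, 1.0, "customers")

# The prediction is computed by the database engine, not by application code.
scores = dict(conn.execute(query).fetchall())
print(scores)
```

The point of the design is that no rows leave the database: only the model parameters travel, embedded in the query text.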
Parallelizing computations of full disjunctions
In relational databases, the full disjunction operator is an associative extension of the full outer join to an arbitrary number of relations. Its goal is to maximize the information that can be extracted from a database by connecting all tables through all join paths. The use of full disjunctions has been envisaged in several scenarios, such as data integration and knowledge extraction. One of the main obstacles to its adoption in real business scenarios is the long time its computation requires. This paper overcomes this limitation by introducing parafd, a novel approach based on parallel computing techniques that implements the full disjunction operator in both an exact and an approximate version. Our proposal has been compared with state-of-the-art algorithms, which have also been re-implemented to run in parallel. The experiments show that it outperforms existing approaches in running time. Finally, we have experimented with treating the full disjunction as a collection of documents indexed by a textual search engine. In this way, we provide a simple technique for performing keyword search over relational databases. The results obtained on a benchmark show high precision and recall, even compared with existing proposals.
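To make the operator concrete, here is a minimal brute-force sketch of a full disjunction over tiny relations. It is not parafd and is neither parallel nor efficient (it is exponential in the number of relations). It assumes the usual definition in terms of maximal sets of tuples that are join-consistent and connected, assumes input rows contain no nulls, and uses a hypothetical function name and toy relations.

```python
from itertools import combinations, product

def full_disjunction(relations):
    """Brute-force full disjunction.

    relations: dict mapping a relation name to a list of dict rows.
    Assumes no nulls in the input and one fixed schema per relation.
    """
    names = list(relations)
    schema = {n: set().union(*(row.keys() for row in relations[n])) for n in names}
    all_attrs = sorted(set().union(*schema.values()))

    def connected(subset):
        # Two relations are linked when their schemas share an attribute.
        seen, frontier = {subset[0]}, [subset[0]]
        while frontier:
            cur = frontier.pop()
            for other in subset:
                if other not in seen and schema[cur] & schema[other]:
                    seen.add(other)
                    frontier.append(other)
        return len(seen) == len(subset)

    def consistent(a, b):
        # Rows must agree on every attribute they have in common.
        return all(a[k] == b[k] for k in a.keys() & b.keys())

    # Enumerate every join-consistent, connected set of tuples.
    candidates = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if not connected(subset):
                continue
            for idxs in product(*(range(len(relations[n])) for n in subset)):
                rows = [relations[n][i] for n, i in zip(subset, idxs)]
                if all(consistent(a, b) for a, b in combinations(rows, 2)):
                    candidates.append(frozenset(zip(subset, idxs)))

    # Keep only the maximal sets, then merge each into one padded row.
    maximal = [s for s in candidates if not any(s < t for t in candidates)]
    result = []
    for tuple_set in maximal:
        merged = dict.fromkeys(all_attrs)  # None plays the role of NULL
        for name, i in tuple_set:
            merged.update(relations[name][i])
        result.append(merged)
    return result

toy = {
    "R1": [{"a": 1, "b": 2}],
    "R2": [{"b": 2, "c": 3}, {"b": 9, "c": 4}],
    "R3": [{"c": 3, "d": 5}],
}
fd = full_disjunction(toy)
```

In this toy input the three matching rows are merged into a single result row, while the unmatched row of `R2` survives on its own, padded with `None`.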
Evaluating the integration of datasets
Evaluation is a bottleneck in data integration processes: it is performed by domain experts through onerous manual data inspections. This task is particularly heavy in real business scenarios, where the large amount of data makes checking all integrated tuples infeasible. Our idea is to address this issue by providing the experts with an unsupervised measure, based on word frequencies, which quantifies how representative one dataset is of another, giving an indication of how good the integration process is. The paper motivates and introduces the measure and provides extensive experimental evaluations that show the effectiveness and the efficiency of the approach.
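The abstract does not define the measure, so the following is only a hypothetical stand-in showing what a word-frequency-based representativeness score could look like: the share of the source dataset's word occurrences that also appear in the integrated dataset. The function names, the tokenization, and the formula are assumptions made for this sketch, not the measure proposed in the paper.

```python
import re
from collections import Counter

def word_counts(rows):
    """Count lowercase word tokens over all values of all rows."""
    counts = Counter()
    for row in rows:
        for value in row:
            counts.update(re.findall(r"\w+", str(value).lower()))
    return counts

def representativeness(source_rows, integrated_rows):
    """Fraction of source word occurrences covered by the integrated data.

    Assumed formula: sum over words of min(source count, integrated count),
    divided by the total number of source word occurrences.
    """
    src = word_counts(source_rows)
    integ = word_counts(integrated_rows)
    total = sum(src.values())
    if total == 0:
        return 1.0  # an empty source is trivially represented
    covered = sum(min(count, integ[word]) for word, count in src.items())
    return covered / total

source = [("Acme Corp", "Rome"), ("Beta Ltd", "Milan")]
integrated = [("Acme Corp", "Rome")]
score = representativeness(source, integrated)
```

A score near 1 would suggest that little source content was lost during integration, while a low score would flag a result worth inspecting by hand.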
Finding Synonymous Attributes in Evolving Wikipedia Infoboxes
Wikipedia Infoboxes are semi-structured data structures organized in an attribute-value fashion. Policies establish, for each type of entity represented in Wikipedia, the attribute names that the Infobox should contain, in the form of a template. However, these requirements change over time, and users often choose not to obey them strictly. As a result, it is hard to treat the history of Wikipedia pages in an integrated way, which makes it difficult to analyze the temporal evolution of Wikipedia entities through their Infoboxes and impossible to compare entities of the same type directly. To address this challenge, we propose an approach that deals with the misalignment of attribute names and identifies clusters of synonymous Infobox attributes. Elements in the same cluster are considered a temporal evolution of the same attribute. To identify the clusters we use two different distance metrics. The first is the co-occurrence degree, which is treated as a negative distance, and the second is the co-occurrence of similar values in the attributes, which is treated as positive evidence of synonymy. We formalize the problem as a correlation clustering problem over a weighted graph constructed with attributes as nodes and positive and negative evidence as edges. We solve it with a linear programming model that yields a good approximation. Our experiments over a collection of Infoboxes from the last 13 years show the potential of our approach.
Big Data Integration of Heterogeneous Data Sources: The Re-Search Alps Case Study
The application of big data integration techniques in real scenarios needs to address practical issues related to the scalability of the process and the heterogeneity of data sources. In this paper, we describe the pipeline developed in the context of Re-Search Alps, a project funded by the EU Commission through the INEA Agency in the CEF Telecom framework, which aims at creating an open dataset describing research centers located in the Alpine area.
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
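The two counting schemes compared in the study can be sketched as follows. This is a simplified illustration, not the study's exact procedure: it assumes each author pair is counted at most once per citing paper and that co-authors of the same cited work also count as co-cited. The function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations.

    citing_papers: list of reference lists; each reference is a list of
    author names in byline order. Only the first `max_authors` authors of
    each cited work are credited (1 = traditional first-author counting).
    """
    counts = Counter()
    for references in citing_papers:
        credited = set()
        for authors in references:
            credited.update(authors[:max_authors])
        # Every pair of credited authors is co-cited once by this paper.
        for pair in combinations(sorted(credited), 2):
            counts[pair] += 1
    return counts

papers = [
    [["A", "B"], ["C"]],   # citing paper 1 cites works by (A, B) and (C)
    [["A"], ["C", "D"]],   # citing paper 2 cites works by (A) and (C, D)
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

With first-author counting only the pair (A, C) is ever observed, whereas counting the first five authors also brings B and D into the co-citation matrix.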
Understanding Data in the Blink of an Eye
Many data analysis and knowledge mining tasks require a basic understanding of the content of a dataset prior to any data access. In this demo, we showcase how data descriptions (sets of compact, readable and insightful formulas of boolean predicates) can be used to guide users in understanding datasets. Finding the best description for a dataset is, unfortunately, both computationally hard and task-specific. This demo shows not only that we can generate descriptions at interactive speed, but also that diverse user needs, from anomaly detection to data exploration, can be accommodated through a user-driven process that exploits dynamic programming in concert with a set of heuristics.
