Sopj: A scalable online provenance join for data integration
Data integration is a technique used to combine different sources of data to provide a unified view over them. MOMIS [1] is an open-source data integration framework developed by the DBGroup. The goal of our work is to enable MOMIS to scale out as the number of input data sources increases, without introducing a noticeable performance penalty. In particular, we present a full outer join method capable of efficiently integrating multiple sources at the same time by using data streams and provenance information. To evaluate the scalability of this approach, we developed a join engine built on a distributed data processing framework. Our solution is able to process input data sources in the form of continuous streams, execute the join operation on the fly, and produce outputs as soon as they are generated. In this way, the join can return partial results before the input streams have been completely received or processed, optimizing the entire execution
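The idea of a join that emits partial results while the streams are still arriving can be illustrated with a minimal Python sketch. This is not the Sopj algorithm or MOMIS code: the event shape `(source, key, value)`, the function name `streaming_outer_join`, and the in-memory state are all assumptions made for illustration, and nothing here is distributed.

```python
from collections import defaultdict


def streaming_outer_join(events):
    """Consume (source, key, value) events and yield merged records on the fly.

    Hypothetical sketch: for every incoming event, the current merged view of
    that key is emitted together with its provenance, i.e. the set of sources
    that have contributed to it so far. Keys seen in only one source are still
    emitted, which gives full-outer-join behaviour.
    """
    state = defaultdict(dict)  # key -> {source: value}
    for source, key, value in events:
        state[key][source] = value
        # Emit a partial result immediately, without waiting for other streams.
        yield key, dict(state[key]), frozenset(state[key])


events = [("A", 1, "x"), ("B", 1, "y"), ("B", 2, "z")]
results = list(streaming_outer_join(events))
```

Each emitted tuple supersedes earlier ones for the same key, so a consumer can act on partial results and refine them as more sources report.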
Enhancing entity resolution efficiency with loosely schema-aware techniques - Discussion paper
Entity Resolution, the task of identifying records that refer to the same real-world entity, is a fundamental step in data integration. Blocking is a widely employed technique to avoid comparing all possible record pairs in a dataset (an inefficient approach). Forgoing schema information for blocking has been shown to limit the chance of missing matches (i.e., it guarantees high recall), at the cost of low precision. Meta-blocking alleviates this issue by restructuring a block collection, removing redundant and superfluous comparisons. Yet, existing meta-blocking techniques rely exclusively on schema-agnostic features. In this paper, we investigate how loose schema information, induced directly from the data, can be exploited in a holistic loosely schema-aware (meta-)blocking approach that outperforms state-of-the-art meta-blocking in terms of precision, without sacrificing a high level of recall. We implemented our idea in a system called Blast, and experimentally evaluated it on real-world datasets
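As a rough illustration of the two building blocks mentioned above, the following Python sketch performs schema-agnostic token blocking and then a very simple form of meta-blocking that keeps only the pairs sharing a minimum number of blocks. It is not Blast and does not use loose schema information; the function names, the `min_shared` threshold, and the sample records are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Schema-agnostic blocking: one block per token, regardless of attribute."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # Blocks with a single record produce no comparisons, so drop them.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks, min_shared=2):
    """Weight each candidate pair by its number of shared blocks and prune."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return {pair for pair, weight in weights.items() if weight >= min_shared}


records = {
    1: {"name": "Canon EOS 5D"},
    2: {"title": "canon eos 5d camera"},
    3: {"name": "Nikon camera"},
}
blocks = token_blocking(records)
candidates = meta_blocking(blocks, min_shared=2)
```

Here records 2 and 3 share only the generic token "camera", so that comparison is pruned, while records 1 and 2 share three blocks and are kept.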
A model for visual building SPARQL queries
LODeX is a Semantic Web tool that, leveraging a summarized representation of a LOD source structure (i.e., the Schema Summary), helps users explore and query SPARQL endpoints by hiding the complexity of Semantic Web technologies. By leveraging the Schema Summary of a LOD source, LODeX guides the user in composing visual queries that are automatically translated into correct SPARQL queries through a SPARQL compiler. In this work we inspect how LODeX can deal with the high expressivity of SPARQL. In particular, we propose a formal model that allows queries to be defined over the Schema Summary (i.e., the Basic Query), and we analyze how this model can handle the different join patterns used in SPARQL queries. Finally, we inspect how LODeX can satisfy the needs of real-world users by analyzing the query logs contained in the LSQ dataset. We show that LODeX could generate 77.6% of the 5 million queries contained in the LSQ dataset
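To make the notion of compiling a visual query into SPARQL concrete, here is a minimal Python sketch. It is not the LODeX compiler or its Basic Query model: representing a query as a list of `(subject_var, property_iri, object_var)` edges, the function name `compile_basic_query`, and the example IRI are all assumptions.

```python
def compile_basic_query(edges, select):
    """Translate a list of graph edges into a SPARQL SELECT query string.

    Hypothetical representation: each edge is (subject_var, property_iri,
    object_var); variables shared between edges express the joins.
    """
    head = "SELECT " + " ".join(f"?{var}" for var in select)
    patterns = [f"  ?{s} <{p}> ?{o} ." for s, p, o in edges]
    return head + "\nWHERE {\n" + "\n".join(patterns) + "\n}"


query = compile_basic_query(
    [("paper", "http://example.org/author", "person")],
    ["paper", "person"],
)
```

Adding a second edge that reuses `paper` or `person` would yield a join between the two triple patterns.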
Extraction of Informations From Highly Heterogeneous Source of Textual Data
Extracting information from multiple, highly heterogeneous sources of textual data and integrating it in order to provide true information is a challenging research topic in the database area. In order to illustrate problems and solutions, one of the most interesting projects facing this problem, TSIMMIS, is presented. Furthermore, a Description Logics approach, able to provide interesting solutions both for data integration and data querying, is introduced. 1 Introduction The availability of large numbers of network information sources (and the recent explosion of the Internet) makes it possible to access a very large number of information sources all over the world. As a consequence of the increased amount of available information, for a given query the set of potentially interesting sites is very large, but only very few sites are really relevant. Furthermore, information is highly heterogeneous both in its structure and in its origin. In particular, n..
The E/S knowledge representation system
This paper introduces the E/S knowledge representation model and describes a system based on that model. The model takes ideas from KL-ONE and ER, and its main strength is the direct representation of n-ary relationships. The system is classification-based, and therefore organizes its knowledge in hierarchies of structured intensional objects and offers a set of services to reason about intensional objects, to store extensional objects and to make inferences on the stored knowledge. © 1994
Towards declarative imperative data-parallel systems ?
Driven by recent developments in the fields of declarative networking and data-parallel computation, we propose a first investigation of a declarative imperative parallel programming model that tries to combine the two worlds. We identify a set of requirements that the model should satisfy and introduce a conceptual sketch of a system implementing the envisioned model
A semantic multi-lingual method for publishing linked open data
Recent years have seen an increase in open data initiatives promoting the free publication of data produced by public administrations (such as public spending, health care, education, etc.). However, the great majority of these data are published in an unstructured format (such as spreadsheets or CSV) and are typically accessed only by closed communities. To address this problem, we propose a semi-automatic, multi-lingual, and semantic method that helps resource providers publish public data into the Linked Open Data (LOD) cloud, and helps consumers (companies and citizens) access and query them efficiently. The method has been applied to a real case involving a set of data provided in Italian
Entity resolution on camera records without machine learning
This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities. First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in such a context, not reaching 0.49 F-score. Then, we propose an efficient solution that exploits annotated lists and regular expressions generated by humans and reaches a 0.99 F-score. In our experience, our approach was not more expensive, in terms of human effort, than the labeling of match/non-match pairs required by ML-based methods
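A rule-based approach of this kind can be sketched in a few lines of Python. This is not the contest solution: the brand list `BRANDS`, the pattern `MODEL_RE`, and the sample titles are hypothetical, and a realistic rule set would need many more hand-written rules (for example, to tell apart suffixes such as "Mark II" and "Mark III", which this sketch ignores).

```python
import re
from collections import defaultdict

# Hypothetical hand-curated brand list and model pattern (assumptions).
BRANDS = ["canon", "nikon", "sony"]
# A model code is a token mixing letters and digits, e.g. "5d" or "d750".
MODEL_RE = re.compile(r"\b([a-z]+\d+[a-z0-9]*|\d+[a-z]+[a-z0-9]*)\b")


def signature(title):
    """Return a (brand, model) pair extracted from a title, or None."""
    text = title.lower()
    brand = next((b for b in BRANDS if re.search(rf"\b{b}\b", text)), None)
    model = MODEL_RE.search(text)
    if brand is None or model is None:
        return None
    return brand, model.group(1)


def resolve(records):
    """Group record ids whose titles yield the same (brand, model) signature."""
    clusters = defaultdict(list)
    for rid, title in records.items():
        sig = signature(title)
        if sig is not None:
            clusters[sig].append(rid)
    return dict(clusters)


records = {
    "a": "Canon EOS 5D Mark III",
    "b": "CANON 5D camera kit",
    "c": "Nikon D750 body",
}
clusters = resolve(records)
```

Records with the same signature are treated as the same entity, so no labeled match/non-match pairs are required, only the curated list and pattern.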
DEXA 2008: Second international workshop on Semantic Web Architectures for Enterprises - SWAE'08
The aim of the second edition of the workshop on Semantic Web Architectures for Enterprises (SWAE) is to evaluate how, and to what extent, the Semantic Web vision has met its promises with respect to business and market needs. On the basis of our research experience within the Italian basic research project NeP4B (http://www.dbgroup.unimo.it/nep4b/it/index.htm) and the European projects SEWASIE (www.sewasie.org), STASIS (http://www.dbgroup.unimo.it/stasis/), OKKAM (www.okkam.org) and Papyrus (www.ict-papyrus.eu), we focus on the permeation of Semantic Web technologies in industrial and real-world applications