1,721,004 research outputs found
Schema Mapping and Data Exchange Tools: Time for the Golden Age
In the last 10 years, schema mapping management has become an important research area in data transformation, exchange, and integration systems. The reasons for its success can be found in the declarative nature of its building block (thus enabling clean semantics and easy to use design tools) paired with the efficiency and modularity in the deployment step. In this paper we sketch a line of evolution in schema-mappings and data exchange systems, through what we identify as three main ages. We start presenting the foundations of schema mapping tools and the first tools aimed at translating data from a source to a target schema in the first, heroic age. We then discuss the silver age, when schema mapping tools have grown their way into complex systems and have been translated into both commercial and open-source tools. Finally, we show how recent results in schema-mapping and data-exchange research may be considered the starting point for a forthcoming golden age, with novel research opportunities and a new generation of systems capable of dealing with a significantly larger class of real-life applications
Evaluating Ambiguous Questions in Semantic Parsing
Tabular Representation Learning and Large Language Models have recently achieved promising results in solving the Semantic Parsing (SP) task. Given a question posed in natural language on a relational table, the goal is to return to the end-users executable SQL declarations. However, models struggle to produce the correct output when questions are ambiguously defined w.r.t. the table schema. Assessing the robustness to data-ambiguity can be particularly time-consuming as entails seeking ambiguous patterns on a large number of queries with varying complexity. To automate this process, we propose Data-Ambiguity Tester, a pipeline for data-ambiguity testing tailored to SP. It first automatically generates non-ambiguous natural language questions and SQL queries of varying complexity. Then, it injects ambiguous patterns, extracted from a human-annotated set of relational tables, in the natural language questions. Finally, it quantifies the level of ambiguity using customized performance metrics. Results show strengths and limitations of existing models in coping with ambiguity between questions and tabular data
QATCH: Benchmarking SQL-centric tasks with Table Representation Learning Models on Your Data
Table Representation Learning (TRL) models are commonly pre-trained on large open-domain datasets comprising millions of tables and then used to address downstream tasks. Choosing the right TRL model to use on proprietary data can be challenging, as the best results depend on the content domain, schema, and data quality. Our purpose is to support end-users in testing TRL models on proprietary data in two established SQL-centric tasks, i.e., Question Answering (QA) and Semantic Parsing (SP). We present QATCH (Query-Aided TRL Checklist), a toolbox to highlight TRL models' strengths and weaknesses on relational tables unseen at training time. For an input table, QATCH automatically generates a testing checklist tailored to QA and SP. Checklist generation is driven by a SQL query engine that crafts tests of different complexity. This design facilitates inherent portability, allowing the checks to be used by alternative models. We also introduce a set of cross-task performance metrics evaluating the TRL model's performance over its output. Finally, we show how QATCH automatically generates tests for proprietary datasets to evaluate various state-of-the-art models including TAPAS, TAPEX, and CHATGPT
Cleaning data with Llunatic
Data cleaning (or data repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a given set of constraints. In recent years, repairing methods have been proposed for several classes of constraints. These methods, however, tend to hard-code the strategy to repair conflicting values and are specialized toward specific classes of constraints. In this paper, we develop a general chase-based repairing framework, referred to as Llunatic, in which repairs can be obtained for a large class of constraints and by using different strategies to select preferred values. The framework is based on an elegant formalization in terms of labeled instances and partially ordered preference labels. In this context, we revisit concepts such as upgrades, repairs and the chase. In Llunatic, various repairing strategies can be slotted in, without the need for changing the underlying implementation. Furthermore, Llunatic is the first data repairing system which is DBMS-based. We report experimental results that confirm its good scalability and show that various instantiations of the framework result in repairs of good quality
Attribute Ambiguity Discovery: A Deep Learning Approach via Unsupervised Learning
Several applications, such as text-to-SQL and computational fact checking, exploit the relationship between relational data and natural language text. However, state of the art solutions simply fail in managing “data-ambiguity", i.e., the case when there are multiple interpretations of the relationship between text and data. Given the ambiguity in language, text can be mapped to different subsets of data, but existing training corpora only have examples in which every sentence/question is annotated precisely w.r.t. the relation. This unrealistic assumption leaves the target applications unable to handle ambiguous cases. To tackle this problem, we present a deep learning method that identifies every pair of data ambiguous attributes and a label that describes both columns. Such metadata can then be used to generate examples with data ambiguities for any input table
Data Ambiguity Profiling for the Generation of Training Examples
Several applications, such as text-to-SQL and computational fact checking, exploit the relationship between relational data and natural language text. However, state of the art solutions simply fail in managing "data-ambiguity", i.e., the case when there are multiple interpretations of the relationship between text and data. Given the ambiguity in language, text can be mapped to different subsets of data, but existing training corpora only have examples in which every sentence/question is annotated precisely w.r.t. the relation. This unrealistic assumption leaves the target applications unable to handle ambiguous cases. To tackle this problem, we present an end-to-end solution that, given a table D, generates examples that consist of text, annotated with its data evidence, with factual ambiguities w.r.t. D. We formulate the problem of profiling relational tables to identify row and attribute data ambiguity. For the latter, we propose a deep learning method that identifies every pair of data ambiguous attributes and a label that describes both columns. Such metadata is then used to generate examples with data ambiguities for any input table. To enable scalability, we finally introduce a SQL approach that can generate millions of examples in seconds. We show the high accuracy of our solution in profiling relational tables and report on how our automatically generated examples lead to drastic quality improvements in two fact-checking applications, including a website with thousands of users, and in a text-to-SQL system
- …
