1,721,137 research outputs found
BESS: bounded evaluation SQL systems
It could be prohibitively costly to query big relations, even with the power of parallel processing on clusters containing thousands of machines. The tremendous resource burden drives us to find an alternative method to answer big datasets, which need to be affordable for small businesses and lightweight software. The theory of bounded evaluation could reduce queries on big data to computations on small data. It advocates an unconventional query evaluation paradigm under an access schema A, a combination of cardinality constraints and associated indices. For practical use to emerge from the work, it is urgent to design a database system with bounded evaluation capacity.
Firstly, the thesis proposes the design and framework of bounded evaluation systems named BESS. The framework is conducted by two workflows, online query processing and offline access schema management. Its unique features are discussed, which differ from the framework paradigm and conventional database systems. Besides, the thesis shows the feasibility of the framework with two implementation variants. The challenges and advantages of each approach are discussed during the comparison of the workflows.
Secondly, the thesis presents a prototype named BEAS, targeting traditional DBMS. Under this prototype, the thesis proposes the algorithms for online query processing, from checking bounded evaluability to generating bounded query plans. Moreover, the thesis extends bounded evaluation to RAaggr (RA with aggregation) queries under bag semantics to make the prototype more practical in real-life circumstances.
Thirdly, the thesis presents another prototype, named Zidian, based on KV stores to speed up SQL query evaluation over NoSQL. As the foundation, the thesis proposes a block-as-a-value data (BaaV) model and its corresponding relational algebra on this prototype. Beyond the bounded evaluation, the thesis studies scan-free data access. Also, the thesis studies the query evaluation process, which verified that Zidian substantially reduces data access and communication cost. Moreover, Zidian could be plugged into existing SQL-over-NoSQL systems while retaining horizontal scalability.
Finally, to make the framework more practical, the thesis studies the problem of schema discovery for BESS. In the form of a schema selection problem for the BaaV model, the thesis presents a full treatment to generate schema for parametric SQL queries over keyed blocks on KV stores. The thesis develops and verifies the framework and criteria for the schema selection problem, considering both storage and computation. Meanwhile, the thesis presents practical algorithms for selection that guarantees certain optimality under practical conditions
The ACM PODS Alberto O. Mendelzon Test-of-Time Award 2023
In 2007, the PODS Executive Committee decided to establish a Test-of-Time Award, named after the late Alberto O. Mendelzon, in recognition of his scientific legacy, and his service and dedication to the database community. Mendelzon was an international leader in database theory, whose pioneering and fundamental work has inspired and influenced both database theoreticians and practitioners, and continues to be applied in a variety of advanced settings. He served the database community in many ways; in particular, he served as the General Chair of the PODS conference, and was instrumental in bringing together the PODS and SIGMOD conferences. He also was an outstanding educator, who guided the research of numerous doctoral students and postdoctoral fellows. The Award is to be awarded each year to a paper or a small number of papers published in the PODS proceedings ten years prior, that had the most impact (in terms of research, methodology, or transfer of practice) over the intervening decade. The decision was approved by SIGMOD and ACM. The funds for the Award were contributed by IBM Toronto. The PODS Executive Chair has appointed us to serve as the Award Committee for 2023. After careful consideration, we have decided to select the following two papers as the award winners for 2023: The first paper introduces the Massively Parallel Communication (MPC) model to analyze the tradeoff between the number of rounds and the amount of communication required in a massively parallel computing environment. A decade ago there was a growing interest in the study of data processing on large distributed clusters, which resulted in the introduction of various models focusing on different aspects. The MPC model can be seen as the culmination of these models and was designed as an abstraction for the shared-nothing architecture which remains the architecture used by large data processing systems today. Since then, the MPC model is widely adopted in the literature. In the paper, the authors obtain both lower and upper bounds for computing full conjunctive queries in the one round and multi-round case. In particular, they discover an interesting connection between the number of bits that are required to be sent in the single-round case for computing a conjunctive query and the fractional vertex covering number of the hypergraph associated to that query. The monograph "Algorithmic Aspects of Parallel Data Processing" in Foundations and Trends in Databases (2018) presents a detailed overview of the research on the MPC model since its introduction. Ontology-based data access (OBDA) provides a formalization of the problem of querying a database enhanced with an ontology. This simple formal model provides a unifying view for problems in many different areas, and in particular for the widely studied issue of extracting information from a knowledge graph. As such, OBDA has been extensively studied, mainly following two lines of research: the development of efficient query answering algorithms for some classes of query and ontology languages, and the characterization of combinations of these two elements that lead to intractability. In the second paper, the authors follow a different path to study OBDA, which has brought new tools to the area and, as such, has become a fruitful way to address the fundamental challenges in OBDA. More precisely, the authors consider the expressive power of the settings used in OBDA, whose main parameters are the query and ontology
The ACM PODS Alberto O. Mendelzon Test-of-Time Award 2023
In 2007, the PODS Executive Committee decided to establish a Test-of-Time Award, named after the late Alberto O. Mendelzon, in recognition of his scientific legacy, and his service and dedication to the database community. Mendelzon was an international leader in database theory, whose pioneering and fundamental work has inspired and influenced both database theoreticians and practitioners, and continues to be applied in a variety of advanced settings. He served the database community in many ways; in particular, he served as the General Chair of the PODS conference, and was instrumental in bringing together the PODS and SIGMOD conferences. He also was an outstanding educator, who guided the research of numerous doctoral students and postdoctoral fellows. The Award is to be awarded each year to a paper or a small number of papers published in the PODS proceedings ten years prior, that had the most impact (in terms of research, methodology, or transfer of practice) over the intervening decade. The decision was approved by SIGMOD and ACM. The funds for the Award were contributed by IBM Toronto. The PODS Executive Chair has appointed us to serve as the Award Committee for 2023. After careful consideration, we have decided to select the following two papers as the award winners for 2023: The first paper introduces the Massively Parallel Communication (MPC) model to analyze the tradeoff between the number of rounds and the amount of communication required in a massively parallel computing environment. A decade ago there was a growing interest in the study of data processing on large distributed clusters, which resulted in the introduction of various models focusing on different aspects. The MPC model can be seen as the culmination of these models and was designed as an abstraction for the shared-nothing architecture which remains the architecture used by large data processing systems today. Since then, the MPC model is widely adopted in the literature. In the paper, the authors obtain both lower and upper bounds for computing full conjunctive queries in the one round and multi-round case. In particular, they discover an interesting connection between the number of bits that are required to be sent in the single-round case for computing a conjunctive query and the fractional vertex covering number of the hypergraph associated to that query. The monograph "Algorithmic Aspects of Parallel Data Processing" in Foundations and Trends in Databases (2018) presents a detailed overview of the research on the MPC model since its introduction. Ontology-based data access (OBDA) provides a formalization of the problem of querying a database enhanced with an ontology. This simple formal model provides a unifying view for problems in many different areas, and in particular for the widely studied issue of extracting information from a knowledge graph. As such, OBDA has been extensively studied, mainly following two lines of research: the development of efficient query answering algorithms for some classes of query and ontology languages, and the characterization of combinations of these two elements that lead to intractability. In the second paper, the authors follow a different path to study OBDA, which has brought new tools to the area and, as such, has become a fruitful way to address the fundamental challenges in OBDA. More precisely, the authors consider the expressive power of the settings used in OBDA, whose main parameters are the query and ontology
Making Queries Tractable on Big Data with Preprocessing
A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditionaland effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of -tractable queries, denoted by T0 Q, to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are -tractable and are feasible on big data. (3) We also study a set TQ of query classes that can be effectively converted to -tractable queries by re-factorizing its data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for TQ. (5) We also show that T0 Q ⊂ P unless P = NC, i.e., the set T0 Q of all -tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, TQ = P, i.e., all PTIME query classes can be made -tractable via proper refactorizations. This work is a step towards understanding the tractability of queries in the context of big data
Boosting graph computation with generic methods: partitioning and incrementalization
In this thesis we develop a package of generic methods for boosting the velocity of
graph computations, regarding partitioning and incrementalization. The former is to
deal with the challenges of volume and velocity over big data, and make computations
scalable. The latter is to speed up computations for dynamic graph analytics. On the
one hand, existing partitioners handle graphs in an “one-size-fits-all” way and do not
consider the applications running on the top. This results in sub-optimal performances
for distributed computations. On the other hand, however, current incremental graph
algorithms are ad hoc designed, which require a lot of efforts even for domain experts.
Worse still, most of these algorithms lack provable guarantees for their efficiency. To
this end, this thesis presents a series of generic yet effective approaches for boosting
the efficiency and scalability of graph computation.
Firstly, for graph partitioning, we propose an application-driven hybrid partitioning
strategy that takes the top-level application into consideration. Given a graph algorithm
A, the strategy learns a cost model for A as polynomial regression. We develop partitioners that given the learned cost model, refine an edge-cut or vertex-cut partition to a
hybrid partition and reduce the parallel cost of A. Moreover, we extend the partitioners
to handle multiple cost models of mixed workloads in one batch.
Secondly, we study generic guidelines for incrementalizing graph algorithms. We
identify a class of incrementalizable algorithms abstracted in a fixpoint model. We
show how to deduce an incremental algorithm A∆ from such an algorithm A. Moreover, A∆ can be made bounded relative to A, i.e., its cost is determined by the size
of changes to graphs and changes to the affected area that is necessarily checked by
batch algorithm A. We provide generic conditions under which a deduced algorithm
A∆ warrants to be correct and relatively bounded, by adopting the same logic and data
structures of A, at most using timestamps as an additional auxiliary structure.
Finally, we go beyond the incrementalization of generic graph algorithms, and
focus on graph partitioners where the algorithms are heuristic and exact results are
not required. We propose to incrementalize widely-used graph partitioners A into
heuristically-bounded incremental algorithms A∆. Given graph G, updates ∆G to G
and a partition A(G) of G by A, A∆ computes changes ∆O to A(G) such that (1) applying ∆O to A(G) produces a new partition of the updated graph although it may not
be exactly the one derived by A, (2) it retains the same bounds on balance and cut sizes
as A, and (3) ∆O is decided by ∆G alone. We show that we can deduce A∆ from both
vertex-cut and edge-cut partitioners A, retaining their bound
Tackling the veracity and variety of big data
This thesis tackles the veracity and variety challenges of big data, especially focusing
on graphs and relational data. We start with proposing a class of graph association
rules (GARs) to specify regularities between entities in graphs, which capture both
missing links and inconsistencies. A GAR is a combination of a graph pattern and a
dependency; it may take as predicates machine learning classifiers for link prediction.
We formalize association deduction with GARs in terms of the chase, and prove its
Church-Rosser property. We show that the satisfiability, implication and association
deduction problems for GARs are coNP-complete, NP-complete and NP-complete, respectively.
The incremental deduction problem is DP-complete for GARs. In addition,
we provide parallel algorithms for association deduction and incremental deduction.
We next develop a parallel algorithm to discover GARs, which applies an applicationdriven
strategy to cut back rules and data that are irrelevant to users’ interest, by training
a machine learning model to identify data pertaining to a given application. Moreover,
we introduce a sampling method to reduce a big graph G to a set H of small
sample graphs. Given expected support and recall bounds, this method is able to deduce
samples in H and mine rules from H to satisfy the bounds in the entire G.
Then we propose a class of temporal association rules (TACOs) for event prediction
in temporal graphs. TACOs are defined on temporal graphs in terms of change patterns
and (temporal) conditions, and may carry machine learning predicates for temporal
event prediction. We settle the complexity of reasoning about TACOs, including their
satisfiability, implication and prediction problems. We develop a system that discovers
TACOs by iteratively training a rule creator based on generative models in a creatorcritic
framework, and predicts events by applying the discovered TACOs in parallel.
Finally, we propose an approach to querying relations D and graphs G taken together
in SQL. The key idea is that if a tuple t in D and a vertex v in G are determined
to refer to the same real-world entity, then we join t and v, correlate their information
and complement tuple t with additional attributes of v from graphs. We show how to
do this in SQL extended with only syntactic sugar, for both static joins when t is a tuple
in D and dynamic joins when t comes from intermediate results of sub-queries on D.
To support the semantic joins, we propose an attribute extraction scheme based on Kmeans
clustering, to identify and fetch graph properties that are linked to v via paths.
Moreover, we develop a scheme to extract a relation schema for entities in graphs, and
a heuristic join method based on the schema to strike a balance between the complexity
and accuracy of dynamic joins
On unifying heterogeneous data management: from philosophy foundation to system implementation
This thesis aims to unify heterogeneous data anagement with a revised relational model for uniformly querying and updating across data in different formats without performance degradation. To address this well-recognized crucial challenge for extracting value from the variety of big data, the discussion begins with a prior philosophical reflection, though its necessity is often overlooked, on this variety itself: the existence of numerous incompatible data models. We clarify that various data models are all essentially cognitions of the task of database management, artificially developed under different considerations but all with the identical goal of making raw data manageable in the same physical world, rather than Cartesian mirror-images of distinct objective realities. The typical ignorance of this neglects the identity behind the opposites among the structures and operations they impose, thereby undermining the ideal of developing a general method for the task of database management.
Regaining this long-dusted ideal, we recognize a pathway to achieving it through an investigation into the distinctiveness of the relational model that has enabled it to dominate the field of database management for decades. We show that the key characteristics of the relational model can directly address the model-independent challenges inherent in this task, thereby enabling the systems developed accordingly to perform better, which thus makes them indispensable for any data model to succeed in practice and therefore renders them also “relational”. This is not meant to establish the relational model as a “codex”, but rather to shed light on the principle of evolving it to catch up with the continuously progressing task of database management, so that the overall landscape of this task, amidst endless emerging challenges, can still be addressed by a single model rather than a combination of heterogeneous ones.
In light of this, as a preliminary step toward unified database management, we provide an overarching framework to uniformly adopt different standpoints in accomplishing this task without combining heterogeneous models. Specifically, surrounding a proposed RG (Relational Generative) model, we develop different components for the options within the following aspects to be configurable: the connections can be represented either explicitly or implicitly; the consistency can be annotated on different data items and operations, across distinct isolation levels; and the analysis of data can be programmed both in the declarative and imperative paradigms. Practitioners would then no longer need to turn to an alternative, purpose-built model or system to adopt a particular method for managing their data, at least within these aspects, thereby paving the way toward the unification of heterogeneous data management.
More technically, in the first part, we introduce the RG model, an extension of the relational model with logic-level pointers to connect different tuples, to represent graphs in relations while preserving its topology. By incorporating graph exploration via logical-level pointers as an operator, we generalize the relational query evaluation workflow to provide an enlarged unified plan space for graph-relation hybrid queries. The optimal query plan can then be generated and executed on a typical relational database system seamlessly without the need for a customized execution engine, retaining the battle-hardened optimizations embedded in relational databases while enabling hybrid queries over graphs and relations.
In the second part, still concentrating on the analytical task, we identify redundant computations in query evaluation and introduce a recursive execution paradigm to eliminate them accordingly. Namely, we show that different subqueries might be automorphisms of each other, causing redundant computations that can only be eliminated by sharing results among subqueries. This will thus introduce recursive data passing, which breaks the separate execution of each join, even if those queries can still be described without recursion. We theoretically prove that such a recursive execution paradigm can help query evaluation to reach a complexity lower bound in string pattern matching that was once unachievable.
In the third part, we concentrate on transactional tasks. Building upon the approach above for unifying relations and graphs, we develop a way of running transactions across relations and graphs. In addition, we further propose a fine-grained isolation level to warrant data-driven isolation guarantees according to user-defined annotations in each transaction, reflecting the heterogeneity of the data. This allows data items touched by a single transaction to be treated separately and differently, protecting only the critical logic with relatively high isolation while avoiding unnecessary isolation guarantees, thereby improving transaction throughput without impairing the correctness of application logic during concurrent execution.
Finally, as a practical and accessible implementation of the proposed mechanisms, WhiteDB has been developed to assess the benefits they offer when applied to real-world data. It allows the same piece of data to be viewed both as “data model” and “object” and also to be accessed using both declarative and the imperative programming paradigms. Such a Versatility enables it to function as both C++ library with a variety of frequently utilized functionalities as well as an embedding database, thereby supporting an integrated multi-paradigm data analysis
From Relations to XML: Cleaning, Integrating and Securing Data
While relational databases are still the preferred approach for storing data, XML is emerging
as the primary standard for representing and exchanging data. Consequently, it has
been increasingly important to provide a uniform XML interface to various data sources—
integration; and critical to protect sensitive and confidential information in XML data —
access control. Moreover, it is preferable to first detect and repair the inconsistencies in
the data to avoid the propagation of errors to other data processing steps. In response to
these challenges, this thesis presents an integrated framework for cleaning, integrating and
securing data.
The framework contains three parts. First, the data cleaning sub-framework makes
use of a new class of constraints specially designed for improving data quality, referred
to as conditional functional dependencies (CFDs), to detect and remove inconsistencies in
relational data. Both batch and incremental techniques are developed for detecting CFD
violations by SQL efficiently and repairing them based on a cost model. The cleaned relational
data, together with other non-XML data, is then converted to XML format by using
widely deployed XML publishing facilities. Second, the data integration sub-framework
uses a novel formalism, XML integration grammars (XIGs), to integrate multi-source XML
data which is either native or published from traditional databases. XIGs automatically
support conformance to a target DTD, and allow one to build a large, complex integration
via composition of component XIGs. To efficiently materialize the integrated data, algorithms
are developed for merging XML queries in XIGs and for scheduling them. Third, to
protect sensitive information in the integrated XML data, the data security sub-framework
allows users to access the data only through authorized views. User queries posed on these
views need to be rewritten into equivalent queries on the underlying document to avoid the
prohibitive cost of materializing and maintaining large number of views. Two algorithms
are proposed to support virtual XML views: a rewriting algorithm that characterizes the
rewritten queries as a new form of automata and an evaluation algorithm to execute the
automata-represented queries. They allow the security sub-framework to answer queries
on views in linear time.
Using both relational and XML technologies, this framework provides a uniform approach
to clean, integrate and secure data. The algorithms and techniques in the framework
have been implemented and the experimental study verifies their effectiveness and efficiency
Extending dependencies for improving data quality
This doctoral thesis presents the results of my work on extending dependencies for
improving data quality, both in a centralized environment with a single database and
in a data exchange and integration environment with multiple databases.
The first part of the thesis proposes five classes of data dependencies, referred to as
CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly
found in practice in a centralized environment. For each class of these dependencies,
we investigate two central problems: the satisfiability problem and the implication
problem. The satisfiability problem is to determine given a set Σ of dependencies
defined on a database schema R, whether or not there exists a nonempty database D of
R that satisfies Σ. And the implication problem is to determine whether or not a set Σ
of dependencies defined on a database schema R entails another dependency φ on R.
That is, for each database D ofRthat satisfies Σ, the D must satisfy φ as well. These are
important for the validation and optimization of data-cleaning processes. We establish
complexity results of the satisfiability problem and the implication problem for all
these five classes of dependencies, both in the absence of finite-domain attributes and in
the general setting with finite-domain attributes. Moreover, SQL-based techniques are
developed to detect data inconsistencies for each class of the proposed dependencies,
which can be easily implemented on the top of current database management systems.
The second part of the thesis studies three important topics for data cleaning in a
data exchange and integration environment with multiple databases.
One is the dependency propagation problem, which is to determine, given a view
defined on data sources and a set of dependencies on the sources, whether another
dependency is guaranteed to hold on the view. We investigate dependency propagation
for views defined in various fragments of relational algebra, conditional functional
dependencies (CFDs) [FGJK08] as view dependencies, and for source dependencies
given as either CFDs or traditional functional dependencies (FDs). And we establish
lower and upper bounds, all matching, ranging from PTIME to undecidable. These not
only provide the first results for CFD propagation, but also extend the classical work
of FD propagation by giving new complexity bounds in the presence of a setting with
finite domains. We finally provide the first algorithm for computing a minimal cover of
all CFDs propagated via SPC views. The algorithm has the same complexity as one of
the most efficient algorithms for computing a cover of FDs propagated via a projection
view, despite the increased expressive power of CFDs and SPC views. Another one is matching records from unreliable data sources. A class of matching
dependencies (MDs) is introduced for specifying the semantics of unreliable data. As
opposed to static constraints for schema design such as FDs, MDs are developed for
record matching, and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs),
to determine what attributes to compare and how to compare them when matching
records across possibly different relations. We also propose a mechanism for inferring MDs with a sound and complete system, a departure from traditional implication
analysis, such that when we cannot match records by comparing attributes that contain
errors, we may still find matches by using other, more reliable attributes. We finally
provide a quadratic time algorithm for inferring MDs, and an effective algorithm for
deducing quality RCKs from a given set of MDs.
The last one is finding certain fixes for data monitoring [CGGM03, SMO07], which
is to find and correct errors in a tuple when it is created, either entered manually or
generated by some process. That is, we want to ensure that a tuple t is clean before it
is used, to prevent errors introduced by adding t. As noted by [SMO07], it is far less
costly to correct a tuple at the point of entry than fixing it afterward.
Data repairing based on integrity constraints may not find certain fixes that are
absolutely correct, and worse, may introduce new errors when repairing the data. We
propose a method for finding certain fixes, based on master data, a notion of certain
regions, and a class of editing rules. A certain region is a set of attributes that are
assured correct by the users. Given a certain region and master data, editing rules tell
us what attributes to fix and how to update them. We show how the method can be used
in data monitoring and enrichment. We develop techniques for reasoning about editing
rules, to decide whether they lead to a unique fix and whether they are able to fix all
the attributes in a tuple, relative to master data and a certain region. We also provide
an algorithm to identify minimal certain regions, such that a certain fix is warranted by
editing rules and master data as long as one of the regions is correct
- …
