1,721,149 research outputs found
Static Analysis and Query Answering for Incomplete Data Trees with Constraints
Data trees serve as an abstraction of XML documents: in such trees, every node comes with a label from a finite alphabet, as well as a data value from an infinite set. Incomplete data trees model XML documents with incomplete information; they may include both structural incompleteness and incompleteness of data. Here we study two basic problems for incomplete data trees under typical constraints such as keys and foreign keys. The first problem is consistency of specifications of incomplete data trees. We show that many of recently established results on consistency of constraints and schema descriptions can be transferred to the consistency of incomplete tree specifications without any increase in complexity. After that we examine query answering over incomplete data trees under constraints, and show that tractable bounds can be recovered under key constraints, but are lost under foreign keys
XML data exchange under expressive mappings
Data Exchange is the problem of transforming data in one format (the source schema)
into data in another format (the target schema). Its core component is a schema mapping,
which is a high level specification of how such transformation should be done. Relational
data exchange has been extensively studied, but exchanging XML data have been paid
much less attention. The goal of this thesis is to develop a theory of XML data exchange
with expressive schema mappings, extending a previous work using restricted mappings.
Our mapping language is based on tree patterns that can use horizontal navigation and
data comparison in addition to downward navigation.
First we look at static analysis problems concerning a single mapping. More specif-
ically, we consider consistency problems with different flavours. One such problem, for
instance, asks if any tree has a solution under the given mapping. Then we turn to analyse
the complexity of mapping themselves, i.e., recognising pairs of trees such that the one
is mapped to the other. For both problems, we provide classifications based on sets of
features used in the mappings.
Second we investigate the composition of XML schema mappings. Generally it is hard,
or rather simply impossible, to achieve closure under composition in XML settings unlike
in relational settings. Nevertheless we identify a class of XML schema mappings that is
closed under composition.
Lastly we consider the problem of query answering. It is important to exchange data
so that we can feasibly answer queries while it often leads to intractability. We identify the
dividing line between tractable and intractable cases: answering queries with extended
features is always intractable while tractability of answering simple queries can be retained
in extended mappings
Querying graphs on large-scale data
This doctoral thesis will present the results of my work into querying graphs on large-scale data, from both the data perspective and query perspective.
We first propose a scheme to reduce large graphs into small ones. It contracts obsolete parts, stars, cliques and paths into supernodes. We then build a hierarchical scheme to further reduce the graph, under limited resources. For both the contraction scheme and the hierarchy, we show that it is generic and lossless. We show that the same contracted graph is able to support multiple query classes at the same time, no matter whether their queries are label-based or not, local or non-local. Moreover, existing algorithms for these queries can be readily adapted to compute exact answers by using the synopses when possible, and decontracting the supernodes only when necessary. Using real-life graphs, we experimentally verify the efficiency and effectiveness of our contraction schemes.
Meanwhile, we propose an extension of graph patterns, referred to as conditional graph patterns (CGPs). The CGPs allows one to express several conventional queries in a conditioned one, and annotate similar patterns to compute answers for all patterns in a single enumeration. In a CGP, one can specify a simple condition on each edge such that the edge exists if and only if the condition is satisfied. We show that CGPs allow us to catch missing links, increase the expressivity of graph functional dependencies, and provide a succinct representation of graph patterns. We studies their consistency, matching, incremental matching and containment problems and settles down their complexity bounds. We develop algorithms for matching, incremental matching and parallel matching of CGPs, and for (incremental, parallel) multi-CGP matching and optimization. We empiricallyverify the efficiency and effectiveness of our algorithms on real-life and synthetic graphs
Design Guidelines for Reducing Redundancy in Relational and XML Data
In this dissertation, we propose new design guidelines to reduce the amount of redundancy that databases carry. We use techniques from information theory to define a measure that evaluates a database design based on the worst possible redundancy carried in the instances. We then continue by revisiting the design problem of relational data with functional dependencies, and measure the lowest price, in terms of redundancy, that has to be paid to guarantee a dependency-preserving normalization for all schemas. We provide a formal justification for the Third Normal Form (3NF) by showing that we can achieve this lowest price by doing a good 3NF normalization.
We then study the design problem for XML documents that are views of relational data. We show that we can design a redundancy-free XML representation for some relational schemas while preserving all data dependencies. We present an algorithm for converting a relational schema to such an XML design.
We finally study the design problem for XML documents that are stored in relational databases. We look for XML design criteria that ensure a relational storage with low redundancy. First, we characterize XML designs that have a redundancy-free relational storage. Then we propose a restrictive condition for XML functional dependencies that guarantees a low redundancy for data values in the relational storage.Ph
Querying big data with bounded data access
Query answering over big data is cost-prohibitive. A linear scan of a dataset D may
take days with a solid state device if D is of PB size and years if D is of EB size. In
other words, polynomial-time (PTIME) algorithms for query evaluation are already
not feasible on big data. To tackle this, we propose querying big data with bounded
data access, such that the cost of query evaluation is independent of the scale of D.
First of all, we propose a class of boundedly evaluable queries. A query Q is boundedly
evaluable under a set A of access constraints if for any dataset D that satisfies
constraints in A, there exists a subset DQ ⊆ D such that (a) Q(DQ) = Q(D), and (b) the
time for identifying DQ from D, and hence the size |DQ| of DQ, are independent of |D|.
That is, we can compute Q(D) by accessing a bounded amount of data no matter how
big D grows.We study the problem of deciding whether a query is boundedly evaluable
under A. It is known that the problem is undecidable for FO without access constraints.
We show that, in the presence of access constraints, it is decidable in 2EXPSPACE for
positive fragments of FO queries, but is already EXPSPACE-hard even for CQ.
To handle the undecidability and high complexity of the analysis, we develop effective
syntax for boundedly evaluable queries under A, referred to as queries covered
by A, such that, (a) any boundedly evaluable query under A is equivalent to a query
covered by A, (b) each covered query is boundedly evaluable, and (c) it is efficient to
decide whether Q is covered by A. On top of DBMS, we develop practical algorithms
for checking whether queries are covered by A, and generating bounded plans if so.
For queries that are not boundedly evaluable, we extend bounded evaluability
to resource-bounded approximation and bounded query rewriting using views.
(1) Resource-bounded approximation is parameterized with a resource ratio a ∈ (0,1],
such that for any query Q and dataset D, it computes approximate answers with an
accuracy bound h by accessing at most a|D| tuples. It is based on extended access constraints
and a new accuracy measure. (2) Bounded query rewriting tackles the problem
by incorporating bounded evaluability with views, such that the queries can be exactly
answered by accessing cached views and a bounded amount of data in D. We study the
problem of deciding whether a query has a bounded rewriting, establish its complexity
bounds, and develop effective syntax for FO queries with a bounded rewriting.
Finally, we extend bounded evaluability to graph pattern queries, by extending
access constraints to graph data. We characterize bounded evaluability for subgraph
and simulation patterns and develop practical algorithms for associated problems
Boosting graph computation with generic methods: partitioning and incrementalization
In this thesis we develop a package of generic methods for boosting the velocity of
graph computations, regarding partitioning and incrementalization. The former is to
deal with the challenges of volume and velocity over big data, and make computations
scalable. The latter is to speed up computations for dynamic graph analytics. On the
one hand, existing partitioners handle graphs in an “one-size-fits-all” way and do not
consider the applications running on the top. This results in sub-optimal performances
for distributed computations. On the other hand, however, current incremental graph
algorithms are ad hoc designed, which require a lot of efforts even for domain experts.
Worse still, most of these algorithms lack provable guarantees for their efficiency. To
this end, this thesis presents a series of generic yet effective approaches for boosting
the efficiency and scalability of graph computation.
Firstly, for graph partitioning, we propose an application-driven hybrid partitioning
strategy that takes the top-level application into consideration. Given a graph algorithm
A, the strategy learns a cost model for A as polynomial regression. We develop partitioners that given the learned cost model, refine an edge-cut or vertex-cut partition to a
hybrid partition and reduce the parallel cost of A. Moreover, we extend the partitioners
to handle multiple cost models of mixed workloads in one batch.
Secondly, we study generic guidelines for incrementalizing graph algorithms. We
identify a class of incrementalizable algorithms abstracted in a fixpoint model. We
show how to deduce an incremental algorithm A∆ from such an algorithm A. Moreover, A∆ can be made bounded relative to A, i.e., its cost is determined by the size
of changes to graphs and changes to the affected area that is necessarily checked by
batch algorithm A. We provide generic conditions under which a deduced algorithm
A∆ warrants to be correct and relatively bounded, by adopting the same logic and data
structures of A, at most using timestamps as an additional auxiliary structure.
Finally, we go beyond the incrementalization of generic graph algorithms, and
focus on graph partitioners where the algorithms are heuristic and exact results are
not required. We propose to incrementalize widely-used graph partitioners A into
heuristically-bounded incremental algorithms A∆. Given graph G, updates ∆G to G
and a partition A(G) of G by A, A∆ computes changes ∆O to A(G) such that (1) applying ∆O to A(G) produces a new partition of the updated graph although it may not
be exactly the one derived by A, (2) it retains the same bounds on balance and cut sizes
as A, and (3) ∆O is decided by ∆G alone. We show that we can deduce A∆ from both
vertex-cut and edge-cut partitioners A, retaining their bound
GRAPE: Parallel Graph Query Engine
The need for graph computations is evident in a multitude of use cases. To support
computations on large-scale graphs, several parallel systems have been developed.
However, existing graph systems require users to recast algorithms into new models,
which makes parallel graph computations as a privilege to experienced users only.
Moreover, real world applications often require much more complex graph processing
workflows than previously evaluated. In response to these challenges, the thesis
presents GRAPE, a distributed graph computation system, shipped with various applications
for social network analysis, social media marketing and functional dependencies
on graphs.
Firstly, the thesis presents the foundation of GRAPE. The principled approach of
GRAPE is based on partial evaluation and incremental computation. Sequential graph
algorithms can be plugged into GRAPE with minor changes, and get parallelized as a
whole. The termination and correctness are guaranteed under a monotonic condition.
Secondly, as an application on GRAPE, the thesis proposes graph-pattern association
rules (GPARs) for social media marketing. GPARs help users discover regularities
between entities in social graphs and identify potential customers by exploring social
influence. The thesis studies the problem of discovering top-k diversified GPARs and
the problem of identifying potential customers with GPARs. Although both are NP-
hard, parallel scalable algorithms on GRAPE are developed, which guarantee a polynomial
speedup over sequential algorithms with the increase of processors.
Thirdly, the thesis proposes quantified graph patterns (QGPs), an extension of
graph patterns by supporting simple counting quantifiers on edges. QGPs naturally express
universal and existential quantification, numeric and ratio aggregates, as well as
negation. The thesis proves that the matching problem of QGPs remains NP-complete
in the absence of negation, and is DP-complete for general QGPs. In addition, the
thesis introduces quantified graph association rules defined with QGPs, to identify potential
customers in social media marketing.
Finally, to address the issue of data consistency, the thesis proposes a class of functional
dependencies for graphs, referred to as GFDs. GFDs capture both attribute-value
dependencies and topological structures of entities. The satisfiability and implication
problems for GFDs are studied and proved to be coNP-complete and NP-complete,
respectively. The thesis also proves that the validation problem for GFDs is coNP-
complete. The parallel algorithms developed on GRAPE verify that GFDs provide an
effective approach to detecting inconsistencies in knowledge and social graphs
BESS: bounded evaluation SQL systems
It could be prohibitively costly to query big relations, even with the power of parallel processing on clusters containing thousands of machines. The tremendous resource burden drives us to find an alternative method to answer big datasets, which need to be affordable for small businesses and lightweight software. The theory of bounded evaluation could reduce queries on big data to computations on small data. It advocates an unconventional query evaluation paradigm under an access schema A, a combination of cardinality constraints and associated indices. For practical use to emerge from the work, it is urgent to design a database system with bounded evaluation capacity.
Firstly, the thesis proposes the design and framework of bounded evaluation systems named BESS. The framework is conducted by two workflows, online query processing and offline access schema management. Its unique features are discussed, which differ from the framework paradigm and conventional database systems. Besides, the thesis shows the feasibility of the framework with two implementation variants. The challenges and advantages of each approach are discussed during the comparison of the workflows.
Secondly, the thesis presents a prototype named BEAS, targeting traditional DBMS. Under this prototype, the thesis proposes the algorithms for online query processing, from checking bounded evaluability to generating bounded query plans. Moreover, the thesis extends bounded evaluation to RAaggr (RA with aggregation) queries under bag semantics to make the prototype more practical in real-life circumstances.
Thirdly, the thesis presents another prototype, named Zidian, based on KV stores to speed up SQL query evaluation over NoSQL. As the foundation, the thesis proposes a block-as-a-value data (BaaV) model and its corresponding relational algebra on this prototype. Beyond the bounded evaluation, the thesis studies scan-free data access. Also, the thesis studies the query evaluation process, which verified that Zidian substantially reduces data access and communication cost. Moreover, Zidian could be plugged into existing SQL-over-NoSQL systems while retaining horizontal scalability.
Finally, to make the framework more practical, the thesis studies the problem of schema discovery for BESS. In the form of a schema selection problem for the BaaV model, the thesis presents a full treatment to generate schema for parametric SQL queries over keyed blocks on KV stores. The thesis develops and verifies the framework and criteria for the schema selection problem, considering both storage and computation. Meanwhile, the thesis presents practical algorithms for selection that guarantees certain optimality under practical conditions
Tackling the veracity and variety of big data
This thesis tackles the veracity and variety challenges of big data, especially focusing
on graphs and relational data. We start with proposing a class of graph association
rules (GARs) to specify regularities between entities in graphs, which capture both
missing links and inconsistencies. A GAR is a combination of a graph pattern and a
dependency; it may take as predicates machine learning classifiers for link prediction.
We formalize association deduction with GARs in terms of the chase, and prove its
Church-Rosser property. We show that the satisfiability, implication and association
deduction problems for GARs are coNP-complete, NP-complete and NP-complete, respectively.
The incremental deduction problem is DP-complete for GARs. In addition,
we provide parallel algorithms for association deduction and incremental deduction.
We next develop a parallel algorithm to discover GARs, which applies an applicationdriven
strategy to cut back rules and data that are irrelevant to users’ interest, by training
a machine learning model to identify data pertaining to a given application. Moreover,
we introduce a sampling method to reduce a big graph G to a set H of small
sample graphs. Given expected support and recall bounds, this method is able to deduce
samples in H and mine rules from H to satisfy the bounds in the entire G.
Then we propose a class of temporal association rules (TACOs) for event prediction
in temporal graphs. TACOs are defined on temporal graphs in terms of change patterns
and (temporal) conditions, and may carry machine learning predicates for temporal
event prediction. We settle the complexity of reasoning about TACOs, including their
satisfiability, implication and prediction problems. We develop a system that discovers
TACOs by iteratively training a rule creator based on generative models in a creatorcritic
framework, and predicts events by applying the discovered TACOs in parallel.
Finally, we propose an approach to querying relations D and graphs G taken together
in SQL. The key idea is that if a tuple t in D and a vertex v in G are determined
to refer to the same real-world entity, then we join t and v, correlate their information
and complement tuple t with additional attributes of v from graphs. We show how to
do this in SQL extended with only syntactic sugar, for both static joins when t is a tuple
in D and dynamic joins when t comes from intermediate results of sub-queries on D.
To support the semantic joins, we propose an attribute extraction scheme based on Kmeans
clustering, to identify and fetch graph properties that are linked to v via paths.
Moreover, we develop a scheme to extract a relation schema for entities in graphs, and
a heuristic join method based on the schema to strike a balance between the complexity
and accuracy of dynamic joins
- …
