363 research outputs found
Graph pattern matching on social network analysis
Graph pattern matching is fundamental to social network analysis. Its effectiveness
for identifying social communities and social positions, making recommendations and
so on has been repeatedly demonstrated. However, the social network analysis raises
new challenges to graph pattern matching. As real-life social graphs are typically
large, it is often prohibitively expensive to conduct graph pattern matching over such
large graphs, e.g., NP-complete for subgraph isomorphism, cubic time for bounded
simulation, and quadratic time for simulation. These hinder the applicability of graph
pattern matching on social network analysis. In response to these challenges, the thesis
presents a series of effective techniques for querying large, dynamic, and distributively
stored social networks.
First of all, we propose a notion of query preserving graph compression, to compress
large social graphs relative to a class Q of queries. We then develop both batch
and incremental compression strategies for two commonly used pattern queries. Via
both theoretical analysis and experimental studies, we show that (1) using compressed
graphs Gr benefits graph pattern matching dramatically; and (2) the computation of Gr
as well as its maintenance can be processed efficiently.
Secondly, we investigate the distributed graph pattern matching problem, and explore
parallel computation for graph pattern matching. We show that our techniques
possess following performance guarantees: (1) each site is visited only once; (2) the total
network traffic is independent of the size of G; and (3) the response time is decided
by the size of largest fragment of G rather than the size of entire G. Furthermore, we
show how these distributed algorithms can be implemented in the MapReduce framework.
Thirdly, we study the problem of answering graph pattern matching using views
since view based techniques have proven an effective technique for speeding up query
evaluation. We propose a notion of pattern containment to characterise graph pattern
matching using views, and introduce efficient algorithms to answer graph pattern
matching using views. Moreover, we identify three problems related to graph pattern
containment, and provide efficient algorithms for containment checking (approximation
when the problem is intractable).
Fourthly, we revise graph pattern matching by supporting a designated output node,
which we treat as “query focus”. We then introduce algorithms for computing the top-k
relevant matches w.r.t. the output node for both acyclic and cyclic pattern graphs, respectively,
with early termination property. Furthermore, we investigate the diversified
top-k matching problem, and develop an approximation algorithm with performance
guarantee and a heuristic algorithm with early termination property.
Finally, we introduce an expert search system, called ExpFinder, for large and dynamic
social networks. ExpFinder identifies top-k experts in social networks by graph
pattern matching, and copes with the sheer size of real-life social networks by integrating
incremental graph pattern matching, query preserving compression and top-k
matching computation. In particular, we also introduce bounded (resp. unbounded)
incremental algorithms to maintain the weighted landmark vectors which are used for
incremental maintenance for cached results
Making Queries Tractable on Big Data with Preprocessing
A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditionaland effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of -tractable queries, denoted by T0 Q, to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are -tractable and are feasible on big data. (3) We also study a set TQ of query classes that can be effectively converted to -tractable queries by re-factorizing its data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for TQ. (5) We also show that T0 Q ⊂ P unless P = NC, i.e., the set T0 Q of all -tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, TQ = P, i.e., all PTIME query classes can be made -tractable via proper refactorizations. This work is a step towards understanding the tractability of queries in the context of big data
Fishing Fort: A System for Graph Analytics with ML Prediction and Logic Deduction
This paper reports Fishing Fort, a graph analytic system developed in response to the following questions. What practical value can we get out of graph analytics? How can we effectively deduce the value from a real-life graph? Where can we get clean graphs to make accurate analyses possible? To answer these questions, Fishing Fort advocates to unify logic deduction and ML prediction by proposing Graph Association Rules (GARs), a class of logic rules in which ML models can be embedded as predicates. It employs GARs to deduce graph associations, enrich graphs and clean graphs. It has been deployed in production lines and proven effective in online recommendation, drug discovery, credit risk assessment, battery manufacturing and cybersecurity, among other things
Boosting graph computation with generic methods: partitioning and incrementalization
In this thesis we develop a package of generic methods for boosting the velocity of
graph computations, regarding partitioning and incrementalization. The former is to
deal with the challenges of volume and velocity over big data, and make computations
scalable. The latter is to speed up computations for dynamic graph analytics. On the
one hand, existing partitioners handle graphs in an “one-size-fits-all” way and do not
consider the applications running on the top. This results in sub-optimal performances
for distributed computations. On the other hand, however, current incremental graph
algorithms are ad hoc designed, which require a lot of efforts even for domain experts.
Worse still, most of these algorithms lack provable guarantees for their efficiency. To
this end, this thesis presents a series of generic yet effective approaches for boosting
the efficiency and scalability of graph computation.
Firstly, for graph partitioning, we propose an application-driven hybrid partitioning
strategy that takes the top-level application into consideration. Given a graph algorithm
A, the strategy learns a cost model for A as polynomial regression. We develop partitioners that given the learned cost model, refine an edge-cut or vertex-cut partition to a
hybrid partition and reduce the parallel cost of A. Moreover, we extend the partitioners
to handle multiple cost models of mixed workloads in one batch.
Secondly, we study generic guidelines for incrementalizing graph algorithms. We
identify a class of incrementalizable algorithms abstracted in a fixpoint model. We
show how to deduce an incremental algorithm A∆ from such an algorithm A. Moreover, A∆ can be made bounded relative to A, i.e., its cost is determined by the size
of changes to graphs and changes to the affected area that is necessarily checked by
batch algorithm A. We provide generic conditions under which a deduced algorithm
A∆ warrants to be correct and relatively bounded, by adopting the same logic and data
structures of A, at most using timestamps as an additional auxiliary structure.
Finally, we go beyond the incrementalization of generic graph algorithms, and
focus on graph partitioners where the algorithms are heuristic and exact results are
not required. We propose to incrementalize widely-used graph partitioners A into
heuristically-bounded incremental algorithms A∆. Given graph G, updates ∆G to G
and a partition A(G) of G by A, A∆ computes changes ∆O to A(G) such that (1) applying ∆O to A(G) produces a new partition of the updated graph although it may not
be exactly the one derived by A, (2) it retains the same bounds on balance and cut sizes
as A, and (3) ∆O is decided by ∆G alone. We show that we can deduce A∆ from both
vertex-cut and edge-cut partitioners A, retaining their bound
Querying big data with bounded data access
Query answering over big data is cost-prohibitive. A linear scan of a dataset D may
take days with a solid state device if D is of PB size and years if D is of EB size. In
other words, polynomial-time (PTIME) algorithms for query evaluation are already
not feasible on big data. To tackle this, we propose querying big data with bounded
data access, such that the cost of query evaluation is independent of the scale of D.
First of all, we propose a class of boundedly evaluable queries. A query Q is boundedly
evaluable under a set A of access constraints if for any dataset D that satisfies
constraints in A, there exists a subset DQ ⊆ D such that (a) Q(DQ) = Q(D), and (b) the
time for identifying DQ from D, and hence the size |DQ| of DQ, are independent of |D|.
That is, we can compute Q(D) by accessing a bounded amount of data no matter how
big D grows.We study the problem of deciding whether a query is boundedly evaluable
under A. It is known that the problem is undecidable for FO without access constraints.
We show that, in the presence of access constraints, it is decidable in 2EXPSPACE for
positive fragments of FO queries, but is already EXPSPACE-hard even for CQ.
To handle the undecidability and high complexity of the analysis, we develop effective
syntax for boundedly evaluable queries under A, referred to as queries covered
by A, such that, (a) any boundedly evaluable query under A is equivalent to a query
covered by A, (b) each covered query is boundedly evaluable, and (c) it is efficient to
decide whether Q is covered by A. On top of DBMS, we develop practical algorithms
for checking whether queries are covered by A, and generating bounded plans if so.
For queries that are not boundedly evaluable, we extend bounded evaluability
to resource-bounded approximation and bounded query rewriting using views.
(1) Resource-bounded approximation is parameterized with a resource ratio a ∈ (0,1],
such that for any query Q and dataset D, it computes approximate answers with an
accuracy bound h by accessing at most a|D| tuples. It is based on extended access constraints
and a new accuracy measure. (2) Bounded query rewriting tackles the problem
by incorporating bounded evaluability with views, such that the queries can be exactly
answered by accessing cached views and a bounded amount of data in D. We study the
problem of deciding whether a query has a bounded rewriting, establish its complexity
bounds, and develop effective syntax for FO queries with a bounded rewriting.
Finally, we extend bounded evaluability to graph pattern queries, by extending
access constraints to graph data. We characterize bounded evaluability for subgraph
and simulation patterns and develop practical algorithms for associated problems
Extending dependencies for improving data quality
This doctoral thesis presents the results of my work on extending dependencies for
improving data quality, both in a centralized environment with a single database and
in a data exchange and integration environment with multiple databases.
The first part of the thesis proposes five classes of data dependencies, referred to as
CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly
found in practice in a centralized environment. For each class of these dependencies,
we investigate two central problems: the satisfiability problem and the implication
problem. The satisfiability problem is to determine given a set Σ of dependencies
defined on a database schema R, whether or not there exists a nonempty database D of
R that satisfies Σ. And the implication problem is to determine whether or not a set Σ
of dependencies defined on a database schema R entails another dependency φ on R.
That is, for each database D ofRthat satisfies Σ, the D must satisfy φ as well. These are
important for the validation and optimization of data-cleaning processes. We establish
complexity results of the satisfiability problem and the implication problem for all
these five classes of dependencies, both in the absence of finite-domain attributes and in
the general setting with finite-domain attributes. Moreover, SQL-based techniques are
developed to detect data inconsistencies for each class of the proposed dependencies,
which can be easily implemented on the top of current database management systems.
The second part of the thesis studies three important topics for data cleaning in a
data exchange and integration environment with multiple databases.
One is the dependency propagation problem, which is to determine, given a view
defined on data sources and a set of dependencies on the sources, whether another
dependency is guaranteed to hold on the view. We investigate dependency propagation
for views defined in various fragments of relational algebra, conditional functional
dependencies (CFDs) [FGJK08] as view dependencies, and for source dependencies
given as either CFDs or traditional functional dependencies (FDs). And we establish
lower and upper bounds, all matching, ranging from PTIME to undecidable. These not
only provide the first results for CFD propagation, but also extend the classical work
of FD propagation by giving new complexity bounds in the presence of a setting with
finite domains. We finally provide the first algorithm for computing a minimal cover of
all CFDs propagated via SPC views. The algorithm has the same complexity as one of
the most efficient algorithms for computing a cover of FDs propagated via a projection
view, despite the increased expressive power of CFDs and SPC views. Another one is matching records from unreliable data sources. A class of matching
dependencies (MDs) is introduced for specifying the semantics of unreliable data. As
opposed to static constraints for schema design such as FDs, MDs are developed for
record matching, and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs),
to determine what attributes to compare and how to compare them when matching
records across possibly different relations. We also propose a mechanism for inferring MDs with a sound and complete system, a departure from traditional implication
analysis, such that when we cannot match records by comparing attributes that contain
errors, we may still find matches by using other, more reliable attributes. We finally
provide a quadratic time algorithm for inferring MDs, and an effective algorithm for
deducing quality RCKs from a given set of MDs.
The last one is finding certain fixes for data monitoring [CGGM03, SMO07], which
is to find and correct errors in a tuple when it is created, either entered manually or
generated by some process. That is, we want to ensure that a tuple t is clean before it
is used, to prevent errors introduced by adding t. As noted by [SMO07], it is far less
costly to correct a tuple at the point of entry than fixing it afterward.
Data repairing based on integrity constraints may not find certain fixes that are
absolutely correct, and worse, may introduce new errors when repairing the data. We
propose a method for finding certain fixes, based on master data, a notion of certain
regions, and a class of editing rules. A certain region is a set of attributes that are
assured correct by the users. Given a certain region and master data, editing rules tell
us what attributes to fix and how to update them. We show how the method can be used
in data monitoring and enrichment. We develop techniques for reasoning about editing
rules, to decide whether they lead to a unique fix and whether they are able to fix all
the attributes in a tuple, relative to master data and a certain region. We also provide
an algorithm to identify minimal certain regions, such that a certain fix is warranted by
editing rules and master data as long as one of the regions is correct
GRAPE: Parallel Graph Query Engine
The need for graph computations is evident in a multitude of use cases. To support
computations on large-scale graphs, several parallel systems have been developed.
However, existing graph systems require users to recast algorithms into new models,
which makes parallel graph computations as a privilege to experienced users only.
Moreover, real world applications often require much more complex graph processing
workflows than previously evaluated. In response to these challenges, the thesis
presents GRAPE, a distributed graph computation system, shipped with various applications
for social network analysis, social media marketing and functional dependencies
on graphs.
Firstly, the thesis presents the foundation of GRAPE. The principled approach of
GRAPE is based on partial evaluation and incremental computation. Sequential graph
algorithms can be plugged into GRAPE with minor changes, and get parallelized as a
whole. The termination and correctness are guaranteed under a monotonic condition.
Secondly, as an application on GRAPE, the thesis proposes graph-pattern association
rules (GPARs) for social media marketing. GPARs help users discover regularities
between entities in social graphs and identify potential customers by exploring social
influence. The thesis studies the problem of discovering top-k diversified GPARs and
the problem of identifying potential customers with GPARs. Although both are NP-
hard, parallel scalable algorithms on GRAPE are developed, which guarantee a polynomial
speedup over sequential algorithms with the increase of processors.
Thirdly, the thesis proposes quantified graph patterns (QGPs), an extension of
graph patterns by supporting simple counting quantifiers on edges. QGPs naturally express
universal and existential quantification, numeric and ratio aggregates, as well as
negation. The thesis proves that the matching problem of QGPs remains NP-complete
in the absence of negation, and is DP-complete for general QGPs. In addition, the
thesis introduces quantified graph association rules defined with QGPs, to identify potential
customers in social media marketing.
Finally, to address the issue of data consistency, the thesis proposes a class of functional
dependencies for graphs, referred to as GFDs. GFDs capture both attribute-value
dependencies and topological structures of entities. The satisfiability and implication
problems for GFDs are studied and proved to be coNP-complete and NP-complete,
respectively. The thesis also proves that the validation problem for GFDs is coNP-
complete. The parallel algorithms developed on GRAPE verify that GFDs provide an
effective approach to detecting inconsistencies in knowledge and social graphs
BESS: bounded evaluation SQL systems
It could be prohibitively costly to query big relations, even with the power of parallel processing on clusters containing thousands of machines. The tremendous resource burden drives us to find an alternative method to answer big datasets, which need to be affordable for small businesses and lightweight software. The theory of bounded evaluation could reduce queries on big data to computations on small data. It advocates an unconventional query evaluation paradigm under an access schema A, a combination of cardinality constraints and associated indices. For practical use to emerge from the work, it is urgent to design a database system with bounded evaluation capacity.
Firstly, the thesis proposes the design and framework of bounded evaluation systems named BESS. The framework is conducted by two workflows, online query processing and offline access schema management. Its unique features are discussed, which differ from the framework paradigm and conventional database systems. Besides, the thesis shows the feasibility of the framework with two implementation variants. The challenges and advantages of each approach are discussed during the comparison of the workflows.
Secondly, the thesis presents a prototype named BEAS, targeting traditional DBMS. Under this prototype, the thesis proposes the algorithms for online query processing, from checking bounded evaluability to generating bounded query plans. Moreover, the thesis extends bounded evaluation to RAaggr (RA with aggregation) queries under bag semantics to make the prototype more practical in real-life circumstances.
Thirdly, the thesis presents another prototype, named Zidian, based on KV stores to speed up SQL query evaluation over NoSQL. As the foundation, the thesis proposes a block-as-a-value data (BaaV) model and its corresponding relational algebra on this prototype. Beyond the bounded evaluation, the thesis studies scan-free data access. Also, the thesis studies the query evaluation process, which verified that Zidian substantially reduces data access and communication cost. Moreover, Zidian could be plugged into existing SQL-over-NoSQL systems while retaining horizontal scalability.
Finally, to make the framework more practical, the thesis studies the problem of schema discovery for BESS. In the form of a schema selection problem for the BaaV model, the thesis presents a full treatment to generate schema for parametric SQL queries over keyed blocks on KV stores. The thesis develops and verifies the framework and criteria for the schema selection problem, considering both storage and computation. Meanwhile, the thesis presents practical algorithms for selection that guarantees certain optimality under practical conditions
Towards effective analysis of big graphs: from scalability to quality
This thesis investigates the central issues underlying graph analysis, namely, scalability
and quality.
We first study the incremental problems for graph queries, which aim to compute
the changes to the old query answer, in response to the updates to the input graph.
The incremental problem is called bounded if its cost is decided by the sizes of the
query and the changes only. No matter how desirable, however, our first results are
negative: for common graph queries such as graph traversal, connectivity, keyword
search and pattern matching, their incremental problems are unbounded. In light of
the negative results, we propose two new characterizations for the effectiveness of
incremental computation, and show that the incremental computations above can still
be effectively conducted, by either reducing the computations on big graphs to small
data, or incrementalizing batch algorithms by minimizing unnecessary recomputation.
We next study the problems with regards to improving the quality of the graphs.
To uniquely identify entities represented by vertices in a graph, we propose a class of
keys that are recursively defined in terms of graph patterns, and are interpreted with
subgraph isomorphism. As an application, we study the entity matching problem,
which is to find all pairs of entities in a graph that are identified by a given set of
keys. Although the problem is proved to be intractable, and cannot be parallelized in
logarithmic rounds, we provide two parallel scalable algorithms for it.
In addition, to catch numeric inconsistencies in real-life graphs, we extend graph
functional dependencies with linear arithmetic expressions and comparison predicates,
referred to as NGDs. Indeed, NGDs strike a balance between expressivity and complexity,
since if we allow non-linear arithmetic expressions, even of degree at most 2, the
satisfiability and implication problems become undecidable. A localizable incremental
algorithm is developed to detect errors using NGDs, where the cost is determined by
small neighbors of nodes in the updates instead of the entire graph.
Finally, a rule-based method to clean graphs is proposed. We extend graph entity
dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of
ground truth, we fix violations of GEDs in the graph by combining data repairing and
object identification. The method finds certain fixes to errors detected by GEDs, i.e.,
as long as the GEDs and the ground truth are correct, the fixes are assured correct as
their logical consequences. Several fundamental results underlying the method are established,
and an algorithm is developed to implement the method. We also parallelize
the method and guarantee to reduce its running time with the increase of processors
The ACM PODS Alberto O. Mendelzon Test-of-Time Award 2023
In 2007, the PODS Executive Committee decided to establish a Test-of-Time Award, named after the late Alberto O. Mendelzon, in recognition of his scientific legacy, and his service and dedication to the database community. Mendelzon was an international leader in database theory, whose pioneering and fundamental work has inspired and influenced both database theoreticians and practitioners, and continues to be applied in a variety of advanced settings. He served the database community in many ways; in particular, he served as the General Chair of the PODS conference, and was instrumental in bringing together the PODS and SIGMOD conferences. He also was an outstanding educator, who guided the research of numerous doctoral students and postdoctoral fellows. The Award is to be awarded each year to a paper or a small number of papers published in the PODS proceedings ten years prior, that had the most impact (in terms of research, methodology, or transfer of practice) over the intervening decade. The decision was approved by SIGMOD and ACM. The funds for the Award were contributed by IBM Toronto. The PODS Executive Chair has appointed us to serve as the Award Committee for 2023. After careful consideration, we have decided to select the following two papers as the award winners for 2023: The first paper introduces the Massively Parallel Communication (MPC) model to analyze the tradeoff between the number of rounds and the amount of communication required in a massively parallel computing environment. A decade ago there was a growing interest in the study of data processing on large distributed clusters, which resulted in the introduction of various models focusing on different aspects. The MPC model can be seen as the culmination of these models and was designed as an abstraction for the shared-nothing architecture which remains the architecture used by large data processing systems today. Since then, the MPC model is widely adopted in the literature. In the paper, the authors obtain both lower and upper bounds for computing full conjunctive queries in the one round and multi-round case. In particular, they discover an interesting connection between the number of bits that are required to be sent in the single-round case for computing a conjunctive query and the fractional vertex covering number of the hypergraph associated to that query. The monograph "Algorithmic Aspects of Parallel Data Processing" in Foundations and Trends in Databases (2018) presents a detailed overview of the research on the MPC model since its introduction. Ontology-based data access (OBDA) provides a formalization of the problem of querying a database enhanced with an ontology. This simple formal model provides a unifying view for problems in many different areas, and in particular for the widely studied issue of extracting information from a knowledge graph. As such, OBDA has been extensively studied, mainly following two lines of research: the development of efficient query answering algorithms for some classes of query and ontology languages, and the characterization of combinations of these two elements that lead to intractability. In the second paper, the authors follow a different path to study OBDA, which has brought new tools to the area and, as such, has become a fruitful way to address the fundamental challenges in OBDA. More precisely, the authors consider the expressive power of the settings used in OBDA, whose main parameters are the query and ontology
- …
