Search CORE

1,721,149 research outputs found

Static Analysis and Query Answering for Incomplete Data Trees with Constraints

Author: Leonid Libkin
Libkin Leonid
Gheerbrant Amelie
Juan Reutter
Reutter de la Maza Juan Lorenzo
Reutter Juan
Amélie Gheerbrant
Publication venue
Publication date: 01/01/2013
Field of study

Data trees serve as an abstraction of XML documents: in such trees, every node comes with a label from a finite alphabet, as well as a data value from an infinite set. Incomplete data trees model XML documents with incomplete information; they may include both structural incompleteness and incompleteness of data. Here we study two basic problems for incomplete data trees under typical constraints such as keys and foreign keys. The first problem is consistency of specifications of incomplete data trees. We show that many of recently established results on consistency of constraints and schema descriptions can be transferred to the consistency of incomplete tree specifications without any increase in complexity. After that we examine query answering over incomplete data trees under constraints, and show that tractable bounds can be recovered under key constraints, but are lost under foreign keys

Pontificia Universidad Católica de Chile: Repositorio UC

Crossref

Edinburgh Research Explorer

XML data exchange under expressive mappings

Author: Amano Shun’ichi
Publication venue
Publication date: 01/01/2009
Field of study

Data Exchange is the problem of transforming data in one format (the source schema) into data in another format (the target schema). Its core component is a schema mapping, which is a high level specification of how such transformation should be done. Relational data exchange has been extensively studied, but exchanging XML data have been paid much less attention. The goal of this thesis is to develop a theory of XML data exchange with expressive schema mappings, extending a previous work using restricted mappings. Our mapping language is based on tree patterns that can use horizontal navigation and data comparison in addition to downward navigation. First we look at static analysis problems concerning a single mapping. More specif- ically, we consider consistency problems with different flavours. One such problem, for instance, asks if any tree has a solution under the given mapping. Then we turn to analyse the complexity of mapping themselves, i.e., recognising pairs of trees such that the one is mapped to the other. For both problems, we provide classifications based on sets of features used in the mappings. Second we investigate the composition of XML schema mappings. Generally it is hard, or rather simply impossible, to achieve closure under composition in XML settings unlike in relational settings. Nevertheless we identify a class of XML schema mappings that is closed under composition. Lastly we consider the problem of query answering. It is important to exchange data so that we can feasibly answer queries while it often leads to intractability. We identify the dividing line between tractable and intractable cases: answering queries with extended features is always intractable while tractability of answering simple queries can be retained in extended mappings

Edinburgh Research Archive

Querying graphs on large-scale data

Author: Li Yuanhao
Publication venue
Publication date: 31/07/2021
Field of study

This doctoral thesis will present the results of my work into querying graphs on large-scale data, from both the data perspective and query perspective. We first propose a scheme to reduce large graphs into small ones. It contracts obsolete parts, stars, cliques and paths into supernodes. We then build a hierarchical scheme to further reduce the graph, under limited resources. For both the contraction scheme and the hierarchy, we show that it is generic and lossless. We show that the same contracted graph is able to support multiple query classes at the same time, no matter whether their queries are label-based or not, local or non-local. Moreover, existing algorithms for these queries can be readily adapted to compute exact answers by using the synopses when possible, and decontracting the supernodes only when necessary. Using real-life graphs, we experimentally verify the efficiency and effectiveness of our contraction schemes. Meanwhile, we propose an extension of graph patterns, referred to as conditional graph patterns (CGPs). The CGPs allows one to express several conventional queries in a conditioned one, and annotate similar patterns to compute answers for all patterns in a single enumeration. In a CGP, one can specify a simple condition on each edge such that the edge exists if and only if the condition is satisfied. We show that CGPs allow us to catch missing links, increase the expressivity of graph functional dependencies, and provide a succinct representation of graph patterns. We studies their consistency, matching, incremental matching and containment problems and settles down their complexity bounds. We develop algorithms for matching, incremental matching and parallel matching of CGPs, and for (incremental, parallel) multi-CGP matching and optimization. We empiricallyverify the efficiency and effectiveness of our algorithms on real-life and synthetic graphs

Edinburgh Research Archive

Design Guidelines for Reducing Redundancy in Relational and XML Data

Author: Kolahi Solmaz
Publication venue
Publication date: 2008
Field of study

In this dissertation, we propose new design guidelines to reduce the amount of redundancy that databases carry. We use techniques from information theory to define a measure that evaluates a database design based on the worst possible redundancy carried in the instances. We then continue by revisiting the design problem of relational data with functional dependencies, and measure the lowest price, in terms of redundancy, that has to be paid to guarantee a dependency-preserving normalization for all schemas. We provide a formal justification for the Third Normal Form (3NF) by showing that we can achieve this lowest price by doing a good 3NF normalization. We then study the design problem for XML documents that are views of relational data. We show that we can design a redundancy-free XML representation for some relational schemas while preserving all data dependencies. We present an algorithm for converting a relational schema to such an XML design. We finally study the design problem for XML documents that are stored in relational databases. We look for XML design criteria that ensure a relational storage with low redundancy. First, we characterize XML designs that have a redundancy-free relational storage. Then we propose a restrictive condition for XML functional dependencies that guarantees a low redundancy for data values in the relational storage.Ph

Querying big data with bounded data access

Author: Cao Yang
Publication venue
Publication date: 29/11/2016
Field of study

Query answering over big data is cost-prohibitive. A linear scan of a dataset D may take days with a solid state device if D is of PB size and years if D is of EB size. In other words, polynomial-time (PTIME) algorithms for query evaluation are already not feasible on big data. To tackle this, we propose querying big data with bounded data access, such that the cost of query evaluation is independent of the scale of D. First of all, we propose a class of boundedly evaluable queries. A query Q is boundedly evaluable under a set A of access constraints if for any dataset D that satisfies constraints in A, there exists a subset DQ ⊆ D such that (a) Q(DQ) = Q(D), and (b) the time for identifying DQ from D, and hence the size |DQ| of DQ, are independent of |D|. That is, we can compute Q(D) by accessing a bounded amount of data no matter how big D grows.We study the problem of deciding whether a query is boundedly evaluable under A. It is known that the problem is undecidable for FO without access constraints. We show that, in the presence of access constraints, it is decidable in 2EXPSPACE for positive fragments of FO queries, but is already EXPSPACE-hard even for CQ. To handle the undecidability and high complexity of the analysis, we develop effective syntax for boundedly evaluable queries under A, referred to as queries covered by A, such that, (a) any boundedly evaluable query under A is equivalent to a query covered by A, (b) each covered query is boundedly evaluable, and (c) it is efficient to decide whether Q is covered by A. On top of DBMS, we develop practical algorithms for checking whether queries are covered by A, and generating bounded plans if so. For queries that are not boundedly evaluable, we extend bounded evaluability to resource-bounded approximation and bounded query rewriting using views. (1) Resource-bounded approximation is parameterized with a resource ratio a ∈ (0,1], such that for any query Q and dataset D, it computes approximate answers with an accuracy bound h by accessing at most a|D| tuples. It is based on extended access constraints and a new accuracy measure. (2) Bounded query rewriting tackles the problem by incorporating bounded evaluability with views, such that the queries can be exactly answered by accessing cached views and a bounded amount of data in D. We study the problem of deciding whether a query has a bounded rewriting, establish its complexity bounds, and develop effective syntax for FO queries with a bounded rewriting. Finally, we extend bounded evaluability to graph pattern queries, by extending access constraints to graph data. We characterize bounded evaluability for subgraph and simulation patterns and develop practical algorithms for associated problems

Edinburgh Research Archive

Boosting graph computation with generic methods: partitioning and incrementalization

Author: Xu Ruiqi
Publication venue
Publication date: 30/11/2021
Field of study

In this thesis we develop a package of generic methods for boosting the velocity of graph computations, regarding partitioning and incrementalization. The former is to deal with the challenges of volume and velocity over big data, and make computations scalable. The latter is to speed up computations for dynamic graph analytics. On the one hand, existing partitioners handle graphs in an “one-size-fits-all” way and do not consider the applications running on the top. This results in sub-optimal performances for distributed computations. On the other hand, however, current incremental graph algorithms are ad hoc designed, which require a lot of efforts even for domain experts. Worse still, most of these algorithms lack provable guarantees for their efficiency. To this end, this thesis presents a series of generic yet effective approaches for boosting the efficiency and scalability of graph computation. Firstly, for graph partitioning, we propose an application-driven hybrid partitioning strategy that takes the top-level application into consideration. Given a graph algorithm A, the strategy learns a cost model for A as polynomial regression. We develop partitioners that given the learned cost model, refine an edge-cut or vertex-cut partition to a hybrid partition and reduce the parallel cost of A. Moreover, we extend the partitioners to handle multiple cost models of mixed workloads in one batch. Secondly, we study generic guidelines for incrementalizing graph algorithms. We identify a class of incrementalizable algorithms abstracted in a fixpoint model. We show how to deduce an incremental algorithm A∆ from such an algorithm A. Moreover, A∆ can be made bounded relative to A, i.e., its cost is determined by the size of changes to graphs and changes to the affected area that is necessarily checked by batch algorithm A. We provide generic conditions under which a deduced algorithm A∆ warrants to be correct and relatively bounded, by adopting the same logic and data structures of A, at most using timestamps as an additional auxiliary structure. Finally, we go beyond the incrementalization of generic graph algorithms, and focus on graph partitioners where the algorithms are heuristic and exact results are not required. We propose to incrementalize widely-used graph partitioners A into heuristically-bounded incremental algorithms A∆. Given graph G, updates ∆G to G and a partition A(G) of G by A, A∆ computes changes ∆O to A(G) such that (1) applying ∆O to A(G) produces a new partition of the updated graph although it may not be exactly the one derived by A, (2) it retains the same bounds on balance and cut sizes as A, and (3) ∆O is decided by ∆G alone. We show that we can deduce A∆ from both vertex-cut and edge-cut partitioners A, retaining their bound

Edinburgh Research Archive

Proceedings of the 10th International Conference on Database Theory (ICDT 2005)

Author
Publication venue
Publication date: 01/01/2005
Field of study

reposiTUm (TUW Vienna)

GRAPE: Parallel Graph Query Engine

Author: Xu Jingbo
Publication venue
Publication date: 30/11/2017
Field of study

The need for graph computations is evident in a multitude of use cases. To support computations on large-scale graphs, several parallel systems have been developed. However, existing graph systems require users to recast algorithms into new models, which makes parallel graph computations as a privilege to experienced users only. Moreover, real world applications often require much more complex graph processing workflows than previously evaluated. In response to these challenges, the thesis presents GRAPE, a distributed graph computation system, shipped with various applications for social network analysis, social media marketing and functional dependencies on graphs. Firstly, the thesis presents the foundation of GRAPE. The principled approach of GRAPE is based on partial evaluation and incremental computation. Sequential graph algorithms can be plugged into GRAPE with minor changes, and get parallelized as a whole. The termination and correctness are guaranteed under a monotonic condition. Secondly, as an application on GRAPE, the thesis proposes graph-pattern association rules (GPARs) for social media marketing. GPARs help users discover regularities between entities in social graphs and identify potential customers by exploring social influence. The thesis studies the problem of discovering top-k diversified GPARs and the problem of identifying potential customers with GPARs. Although both are NP- hard, parallel scalable algorithms on GRAPE are developed, which guarantee a polynomial speedup over sequential algorithms with the increase of processors. Thirdly, the thesis proposes quantified graph patterns (QGPs), an extension of graph patterns by supporting simple counting quantifiers on edges. QGPs naturally express universal and existential quantification, numeric and ratio aggregates, as well as negation. The thesis proves that the matching problem of QGPs remains NP-complete in the absence of negation, and is DP-complete for general QGPs. In addition, the thesis introduces quantified graph association rules defined with QGPs, to identify potential customers in social media marketing. Finally, to address the issue of data consistency, the thesis proposes a class of functional dependencies for graphs, referred to as GFDs. GFDs capture both attribute-value dependencies and topological structures of entities. The satisfiability and implication problems for GFDs are studied and proved to be coNP-complete and NP-complete, respectively. The thesis also proves that the validation problem for GFDs is coNP- complete. The parallel algorithms developed on GRAPE verify that GFDs provide an effective approach to detecting inconsistencies in knowledge and social graphs

Edinburgh Research Archive

BESS: bounded evaluation SQL systems

Author: Yuan Tengfei
Publication venue
Publication date: 11/07/2022
Field of study

It could be prohibitively costly to query big relations, even with the power of parallel processing on clusters containing thousands of machines. The tremendous resource burden drives us to find an alternative method to answer big datasets, which need to be affordable for small businesses and lightweight software. The theory of bounded evaluation could reduce queries on big data to computations on small data. It advocates an unconventional query evaluation paradigm under an access schema A, a combination of cardinality constraints and associated indices. For practical use to emerge from the work, it is urgent to design a database system with bounded evaluation capacity. Firstly, the thesis proposes the design and framework of bounded evaluation systems named BESS. The framework is conducted by two workflows, online query processing and offline access schema management. Its unique features are discussed, which differ from the framework paradigm and conventional database systems. Besides, the thesis shows the feasibility of the framework with two implementation variants. The challenges and advantages of each approach are discussed during the comparison of the workflows. Secondly, the thesis presents a prototype named BEAS, targeting traditional DBMS. Under this prototype, the thesis proposes the algorithms for online query processing, from checking bounded evaluability to generating bounded query plans. Moreover, the thesis extends bounded evaluation to RAaggr (RA with aggregation) queries under bag semantics to make the prototype more practical in real-life circumstances. Thirdly, the thesis presents another prototype, named Zidian, based on KV stores to speed up SQL query evaluation over NoSQL. As the foundation, the thesis proposes a block-as-a-value data (BaaV) model and its corresponding relational algebra on this prototype. Beyond the bounded evaluation, the thesis studies scan-free data access. Also, the thesis studies the query evaluation process, which verified that Zidian substantially reduces data access and communication cost. Moreover, Zidian could be plugged into existing SQL-over-NoSQL systems while retaining horizontal scalability. Finally, to make the framework more practical, the thesis studies the problem of schema discovery for BESS. In the form of a schema selection problem for the BaaV model, the thesis presents a full treatment to generate schema for parametric SQL queries over keyed blocks on KV stores. The thesis develops and verifies the framework and criteria for the schema selection problem, considering both storage and computation. Meanwhile, the thesis presents practical algorithms for selection that guarantees certain optimality under practical conditions

Edinburgh Research Archive

Tackling the veracity and variety of big data

Author: Jin Ruochun
Publication venue
Publication date: 21/06/2022
Field of study

This thesis tackles the veracity and variety challenges of big data, especially focusing on graphs and relational data. We start with proposing a class of graph association rules (GARs) to specify regularities between entities in graphs, which capture both missing links and inconsistencies. A GAR is a combination of a graph pattern and a dependency; it may take as predicates machine learning classifiers for link prediction. We formalize association deduction with GARs in terms of the chase, and prove its Church-Rosser property. We show that the satisfiability, implication and association deduction problems for GARs are coNP-complete, NP-complete and NP-complete, respectively. The incremental deduction problem is DP-complete for GARs. In addition, we provide parallel algorithms for association deduction and incremental deduction. We next develop a parallel algorithm to discover GARs, which applies an applicationdriven strategy to cut back rules and data that are irrelevant to users’ interest, by training a machine learning model to identify data pertaining to a given application. Moreover, we introduce a sampling method to reduce a big graph G to a set H of small sample graphs. Given expected support and recall bounds, this method is able to deduce samples in H and mine rules from H to satisfy the bounds in the entire G. Then we propose a class of temporal association rules (TACOs) for event prediction in temporal graphs. TACOs are defined on temporal graphs in terms of change patterns and (temporal) conditions, and may carry machine learning predicates for temporal event prediction. We settle the complexity of reasoning about TACOs, including their satisfiability, implication and prediction problems. We develop a system that discovers TACOs by iteratively training a rule creator based on generative models in a creatorcritic framework, and predicts events by applying the discovered TACOs in parallel. Finally, we propose an approach to querying relations D and graphs G taken together in SQL. The key idea is that if a tuple t in D and a vertex v in G are determined to refer to the same real-world entity, then we join t and v, correlate their information and complement tuple t with additional attributes of v from graphs. We show how to do this in SQL extended with only syntactic sugar, for both static joins when t is a tuple in D and dynamic joins when t comes from intermediate results of sub-queries on D. To support the semantic joins, we propose an attribute extraction scheme based on Kmeans clustering, to identify and fetch graph properties that are linked to v via paths. Moreover, we develop a scheme to extract a relation schema for entities in graphs, and a heuristic join method based on the schema to strike a balance between the complexity and accuracy of dynamic joins

Edinburgh Research Archive