1,721,074 research outputs found
Static analysis of XML transformation and schema languages
XML (eXtensible Markup Language) is tijdens de laatste jaren ge¨evolueerd tot het
standaard-dataformaat voor de uitwisseling van gegevens op het internet [ABS99].
De belangrijkste voordelen van XML zijn dat het een standaard- en intuitive manier
voorziet om een zeer groot bereik aan gegevens te modelleren, en dat het gebruikers
toelaat om hun eigen tags in documenten te defini¨eren. Dit laatste geeft gebruikers de
mogelijkheid om een eigen formaat voor XML-documenten te ontwikkelen, hetwelk
wordt gedefinieerd aan de hand van een XML-schema. De aanwezigheid van zo’n
schema verbetert de effici¨entie van de automatisering van vele taken, zoals bijvoorbeeld het verwerken van query’s, query-optimalisatie en automatische data-integratie
Static analysis of XML transformation and schema languages
XML (eXtensible Markup Language) is tijdens de laatste jaren ge¨evolueerd tot het
standaard-dataformaat voor de uitwisseling van gegevens op het internet [ABS99].
De belangrijkste voordelen van XML zijn dat het een standaard- en intuitive manier
voorziet om een zeer groot bereik aan gegevens te modelleren, en dat het gebruikers
toelaat om hun eigen tags in documenten te defini¨eren. Dit laatste geeft gebruikers de
mogelijkheid om een eigen formaat voor XML-documenten te ontwikkelen, hetwelk
wordt gedefinieerd aan de hand van een XML-schema. De aanwezigheid van zo’n
schema verbetert de effici¨entie van de automatisering van vele taken, zoals bijvoorbeeld het verwerken van query’s, query-optimalisatie en automatische data-integratie
Database Principles and Challenges in Text Analysis
A common conceptual view of text analysis is that of a two-step process, where we first extract relations from text documents and then apply a relational query over the result. Hence, text analysis shares technical challenges with, and can draw ideas from, relational databases. A framework that formally instantiates this connection is that of the document spanners. In this article, we review recent advances in various research efforts that adapt fundamental database concepts to text analysis through the lens of document spanners. Among others, we discuss aspects of query evaluation, aggregate queries, provenance, and distributed query planning
Typechecking top-down XML transformations: Fixed input or output schemas
AbstractTypechecking consists of statically verifying whether the output of an XML transformation always conforms to an output type for documents satisfying a given input type. In this general setting, both the input and output schema as well as the transformation are part of the input for the problem. However, scenarios where the input or output schema can be considered to be fixed, are quite common in practice. In the present work, we investigate the computational complexity of the typechecking problem in the latter setting
Weight Annotation in Information Extraction
The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of vset-automata-a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the (positive) RA operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. As a specific instantiation, we explore weighted vset-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of numeric semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.Funding
This work was supported by the German-Israeli Foundation for Scientific Research and
Development (GIF), grant I-1502-407.6/2019. The work of Johannes Doleschal and Wim Martens
was also supported by the Deutsche Forschungsgemeinschaft (DFG), grant MA 4938/4-1. The work
of Benny Kimelfeld and Liat Peterfreund was also supported by the Israel Science Foundation (ISF),
grants 1295/15 and 768/19, and the DFG project 412400621 (DIP program).
Liat Peterfreund: A part of the work was donewhile the author was affiliated with the Technion.
Acknowledgements
We are grateful to Matthias Niewerth for many useful discussions and his help
regarding Theorem 7.1 and Shaull Almagor for many helpful comments regarding weighted automata.
Furthermore, we thank the anonymous reviewers for ICDT 2020 for many helpful remarks
Which XML Schemas Admit 1-Pass Preorder Typing?
It is shown that the class of regular tree languages admitting one-pass preorder typing is exactly the class defined by restrained competition tree grammars introduced by Murata et al. [14]. In a streaming context, the former is the largest class of XSDs where every element in a document can be typed when its opening tag is met. The main technical machinery consists of semantical characterizations of restrained competition grammars and their subclasses. In particular, they can be characterized in terms of the context of nodes, closure properties, allowed patterns and guarded DTDs. It is further shown that deciding whether a schema is restrained competition is tractable. Deciding whether a schema is equivalent to a restained competition tree grammar, or one of its subclasses, is much more difficult: it is complete for EXPTIME. We show that our semantical characterizations allow for easy optimization and minimization algorithms. Finally, we relate the notion of one-pass preorder typing to the existing XML Schema standard
The Complexity of Aggregates over Extractions by Regular Expressions
Regular expressions with capture variables, also known as regex-formulas,
extract relations of spans (intervals identified by their start and end
indices) from text. In turn, the class of regular document spanners is the
closure of the regex formulas under the Relational Algebra. We investigate the
computational complexity of querying text by aggregate functions, such as sum,
average, and quantile, on top of regular document spanners. To this end, we
formally define aggregate functions over regular document spanners and analyze
the computational complexity of exact and approximate computation. More
precisely, we show that in a restricted case, all studied aggregate functions
can be computed in polynomial time. In general, however, even though exact
computation is intractable, some aggregates can still be approximated with
fully polynomial-time randomized approximation schemes (FPRAS)
BonXai
While the migration from DTD to XML Schema was driven by a need for increased expressivity and flexibility, the latter was also significantly more complex to use and understand. Whereas DTDs are characterized by their simplicity, XML Schema Documents are notoriously difficult. In this article, we introduce the XML specification language BonXai, which incorporates many features of XML Schema but is arguably almost as easy to use as DTDs. In brief, the latter is achieved by sacrificing the explicit use of types in favor of simple patterns expressing contexts for elements. The goal of BonXai is not to replace XML Schema but rather to provide a simpler alternative for users who want to go beyond the expressiveness and features of DTD but do not need the explicit use of types. Furthermore, XML Schema processing tools can be used as a back-end for BonXai, since BonXai can be automatically converted into XML Schema. A particularly strong point of BonXai is its solid foundation rooted in a decade of theoretical work around pattern-based schemas. We present a formal model for a core fragment of BonXai and the translation algorithms to and from a core fragment of XML Schema. We prove that BonXai and XML Schema can be converted back-and-forth on the level of tree languages and we formally study the size trade-offs between the two languages.</jats:p
Chisel
Chisel is a tool for flexible manipulation of CSV-like data, motivated by the recent effort of the World Wide Web Consortium
(W3C) towards a recommendation for tabular data and metadata
on the Web. In brief, Chisel supports an expressive built-in schema
language for CSV-like data, that can handle both tabular and nontabular data. Furthermore, it supports a simple programming language for transforming tabular and non-tabular CSV-like data.
In the demo, we showcase the system for specifying and validating schemas, building transformations, and setting up a pipeline for
automatic conversion of “wild” CSV-like data into structured tabular data. We present use cases for Chisel specifically targeted at
exemplifying the ease of specifying, modifying, and understanding
Sculpt schemas as well as extracting and transforming data
Satisfiability for SCULPT-Schemas for CSV-Like Data
SCULPT is a simple schema language inspired by the recent working effort towards a recommendation by the World Wide Web Consortium (W3C) for tabular data and metadata on the Web. In its core, a SCULPT schema consists of a set of rules where left-hand sides select sets of regions in the tabular data and the right-hand sides describe the contents of these regions. A document (divided in cells by row- and column-delimiters) then satisfies a schema if it satisfies every rule. In this paper, we study the satisfiability problem for SCULPT schemas. As SCULPT describes grid-like structures, satisfiability obviously becomes undecidable rather quickly even for very restricted schemas. We define a schema language called L-SCULPT (Lego SCULPT) that restricts the walking power of SCULPT by selecting rectangular shaped areas and only considers tables for which selected regions do not intersect. Depending on the axes used by L-SCULPT, we show that satisfiability is PTIME-complete or undecidable. One of the tractable fragments is practically useful as it extends the structural core of the current W3C proposal for schemas over tabular data. We therefore see how the navigational power of the W3C proposal can be extended while still retaining tractable satisfiability tests
- …
