Search CORE

1,721,074 research outputs found

Static analysis of XML transformation and schema languages

Author: MARTENS Wim
Publication venue
Publication date: 01/01/2006
Field of study

XML (eXtensible Markup Language) is tijdens de laatste jaren ge¨evolueerd tot het standaard-dataformaat voor de uitwisseling van gegevens op het internet [ABS99]. De belangrijkste voordelen van XML zijn dat het een standaard- en intuitive manier voorziet om een zeer groot bereik aan gegevens te modelleren, en dat het gebruikers toelaat om hun eigen tags in documenten te defini¨eren. Dit laatste geeft gebruikers de mogelijkheid om een eigen formaat voor XML-documenten te ontwikkelen, hetwelk wordt gedefinieerd aan de hand van een XML-schema. De aanwezigheid van zo’n schema verbetert de effici¨entie van de automatisering van vele taken, zoals bijvoorbeeld het verwerken van query’s, query-optimalisatie en automatische data-integratie

Document Server@UHasselt (Universiteit Hasselt)

Static analysis of XML transformation and schema languages

Author: MARTENS Wim
Publication venue
Publication date: 01/01/2006
Field of study

Document Server@UHasselt

Database Principles and Challenges in Text Analysis

Author: MARTENS Wim
DOLESCHAL Johannes
Kimelfeld Benny
Publication venue
Publication date: 2021
Field of study

A common conceptual view of text analysis is that of a two-step process, where we first extract relations from text documents and then apply a relational query over the result. Hence, text analysis shares technical challenges with, and can draw ideas from, relational databases. A framework that formally instantiates this connection is that of the document spanners. In this article, we review recent advances in various research efforts that adapt fundamental database concepts to text analysis through the lens of document spanners. Among others, we discuss aspects of query evaluation, aggregate queries, provenance, and distributed query planning

Document Server@UHasselt (Universiteit Hasselt)

Typechecking top-down XML transformations: Fixed input or output schemas

Author: Gyssens Marc
Neven Frank
Martens Wim
Publication venue
Publication date: 01/01/2008
Field of study

AbstractTypechecking consists of statically verifying whether the output of an XML transformation always conforms to an output type for documents satisfying a given input type. In this general setting, both the input and output schema as well as the transformation are part of the input for the problem. However, scenarios where the input or output schema can be considered to be fixed, are quite common in practice. In the present work, we investigate the computational complexity of the typechecking problem in the latter setting

Elsevier - Publisher Connector

Document Server@UHasselt (Universiteit Hasselt)

Document Server@UHasselt

Weight Annotation in Information Extraction

Author: DOLESCHAL Johannes
Peterfreund Liat
Martens Wim
Kimelfeld Benny
Publication venue
Publication date: 2020
Field of study

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of vset-automata-a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the (positive) RA operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. As a specific instantiation, we explore weighted vset-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of numeric semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.Funding This work was supported by the German-Israeli Foundation for Scientific Research and Development (GIF), grant I-1502-407.6/2019. The work of Johannes Doleschal and Wim Martens was also supported by the Deutsche Forschungsgemeinschaft (DFG), grant MA 4938/4-1. The work of Benny Kimelfeld and Liat Peterfreund was also supported by the Israel Science Foundation (ISF), grants 1295/15 and 768/19, and the DFG project 412400621 (DIP program). Liat Peterfreund: A part of the work was donewhile the author was affiliated with the Technion. Acknowledgements We are grateful to Matthias Niewerth for many useful discussions and his help regarding Theorem 7.1 and Shaull Almagor for many helpful comments regarding weighted automata. Furthermore, we thank the anonymous reviewers for ICDT 2020 for many helpful remarks

Document Server@UHasselt (Universiteit Hasselt)

Document Server@UHasselt

Which XML Schemas Admit 1-Pass Preorder Typing?

Author: MARTENS Wim
Schwentick Thomas
Frank Neven
Wim Martens
NEVEN Frank
Thomas Schwentick
Publication venue
Publication date: 01/01/2004
Field of study

It is shown that the class of regular tree languages admitting one-pass preorder typing is exactly the class defined by restrained competition tree grammars introduced by Murata et al. [14]. In a streaming context, the former is the largest class of XSDs where every element in a document can be typed when its opening tag is met. The main technical machinery consists of semantical characterizations of restrained competition grammars and their subclasses. In particular, they can be characterized in terms of the context of nodes, closure properties, allowed patterns and guarded DTDs. It is further shown that deciding whether a schema is restrained competition is tractable. Deciding whether a schema is equivalent to a restained competition tree grammar, or one of its subclasses, is much more difficult: it is complete for EXPTIME. We show that our semantical characterizations allow for easy optimization and minimization algorithms. Finally, we relate the notion of one-pass preorder typing to the existing XML Schema standard

Crossref

Document Server@UHasselt (Universiteit Hasselt)

Document Server@UHasselt

The Complexity of Aggregates over Extractions by Regular Expressions

Author: Doleschal Johannes
Kimelfeld B
Johannes Doleschal
Benny Kimelfeld
Martens Wim
Wim Martens
Kimelfeld Benny
Publication venue
Publication date: 08/08/2023
Field of study

Regular expressions with capture variables, also known as regex-formulas, extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS)

arXiv.org e-Print Archive

Episciences.org

Directory of Open Access Journals

Document Server@UHasselt (Universiteit Hasselt)

Document Server@UHasselt

BonXai

Author: MARTENS Wim
Matthias Niewerth
Schwentick Thomas
Frank Neven
Wim Martens
NEVEN Frank
Niewerth Matthias
Thomas Schwentick
Publication venue
Publication date: 01/01/2017
Field of study

While the migration from DTD to XML Schema was driven by a need for increased expressivity and flexibility, the latter was also significantly more complex to use and understand. Whereas DTDs are characterized by their simplicity, XML Schema Documents are notoriously difficult. In this article, we introduce the XML specification language BonXai, which incorporates many features of XML Schema but is arguably almost as easy to use as DTDs. In brief, the latter is achieved by sacrificing the explicit use of types in favor of simple patterns expressing contexts for elements. The goal of BonXai is not to replace XML Schema but rather to provide a simpler alternative for users who want to go beyond the expressiveness and features of DTD but do not need the explicit use of types. Furthermore, XML Schema processing tools can be used as a back-end for BonXai, since BonXai can be automatically converted into XML Schema. A particularly strong point of BonXai is its solid foundation rooted in a decade of theoretical work around pattern-based schemas. We present a formal model for a core fragment of BonXai and the translation algorithms to and from a core fragment of XML Schema. We prove that BonXai and XML Schema can be converted back-and-forth on the level of tree languages and we formally study the size trade-offs between the two languages.</jats:p

Crossref

Document Server@UHasselt

Chisel

Author: MARTENS Wim
Johannes Doleschal
DOLESCHAL Johannes
Höllerich Nico
Nico Höllerich
Frank Neven
Wim Martens
NEVEN Frank
Publication venue
Publication date: 01/01/2018
Field of study

Chisel is a tool for flexible manipulation of CSV-like data, motivated by the recent effort of the World Wide Web Consortium (W3C) towards a recommendation for tabular data and metadata on the Web. In brief, Chisel supports an expressive built-in schema language for CSV-like data, that can handle both tabular and nontabular data. Furthermore, it supports a simple programming language for transforming tabular and non-tabular CSV-like data. In the demo, we showcase the system for specifying and validating schemas, building transformations, and setting up a pipeline for automatic conversion of “wild” CSV-like data into structured tabular data. We present use cases for Chisel specifically targeted at exemplifying the ease of specifying, modifying, and understanding Sculpt schemas as well as extracting and transforming data

Crossref

Document Server@UHasselt

Satisfiability for SCULPT-Schemas for CSV-Like Data

Author: Doleschal Johannes
Neven Frank
Witkowski Adam
Martens Wim
Publication venue
Publication date: 01/01/2018
Field of study

SCULPT is a simple schema language inspired by the recent working effort towards a recommendation by the World Wide Web Consortium (W3C) for tabular data and metadata on the Web. In its core, a SCULPT schema consists of a set of rules where left-hand sides select sets of regions in the tabular data and the right-hand sides describe the contents of these regions. A document (divided in cells by row- and column-delimiters) then satisfies a schema if it satisfies every rule. In this paper, we study the satisfiability problem for SCULPT schemas. As SCULPT describes grid-like structures, satisfiability obviously becomes undecidable rather quickly even for very restricted schemas. We define a schema language called L-SCULPT (Lego SCULPT) that restricts the walking power of SCULPT by selecting rectangular shaped areas and only considers tables for which selected regions do not intersect. Depending on the axes used by L-SCULPT, we show that satisfiability is PTIME-complete or undecidable. One of the tractable fragments is practically useful as it extends the structural core of the current W3C proposal for schemas over tabular data. We therefore see how the navigational power of the W3C proposal can be extended while still retaining tractable satisfiability tests

DROPS Dagstuhl Research Online Publication Server

Document Server@UHasselt (Universiteit Hasselt)

Document Server@UHasselt

REIN UW Institutional Repository of the University of Warsaw