1,721,025 research outputs found
Database Principles and Challenges in Text Analysis
A common conceptual view of text analysis is that of a two-step process, where we first extract relations from text documents and then apply a relational query over the result. Hence, text analysis shares technical challenges with, and can draw ideas from, relational databases. A framework that formally instantiates this connection is that of the document spanners. In this article, we review recent advances in various research efforts that adapt fundamental database concepts to text analysis through the lens of document spanners. Among others, we discuss aspects of query evaluation, aggregate queries, provenance, and distributed query planning
Weight Annotation in Information Extraction
The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of vset-automata-a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the (positive) RA operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. As a specific instantiation, we explore weighted vset-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of numeric semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.Funding
This work was supported by the German-Israeli Foundation for Scientific Research and
Development (GIF), grant I-1502-407.6/2019. The work of Johannes Doleschal and Wim Martens
was also supported by the Deutsche Forschungsgemeinschaft (DFG), grant MA 4938/4-1. The work
of Benny Kimelfeld and Liat Peterfreund was also supported by the Israel Science Foundation (ISF),
grants 1295/15 and 768/19, and the DFG project 412400621 (DIP program).
Liat Peterfreund: A part of the work was donewhile the author was affiliated with the Technion.
Acknowledgements
We are grateful to Matthias Niewerth for many useful discussions and his help
regarding Theorem 7.1 and Shaull Almagor for many helpful comments regarding weighted automata.
Furthermore, we thank the anonymous reviewers for ICDT 2020 for many helpful remarks
Split-Correctness in Information Extraction
Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel
evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components.
We propose a new formal framework for split-correctness within the formalism of document spanners. Our preliminary analysis studies the complexity of split-correctness over regular spanners. We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with “split constraints”.We thank the anonymous reviewers for PODS 2019 for many helpful remarks. The work of Johannes Doleschal was supported by the Special Research Fund (BOF) of Hasselt University and the Deutsche Forschungsgemeinschaft (DFG), Grant MA 4938/4-1. The work of Benny Kimelfeld and Yoav Nahshon was supported in part by the Israel Science Foundation (ISF), Grant 1295/15
The Complexity of Aggregates over Extractions by Regular Expressions
Regular expressions with capture variables, also known as "regex-formulas", extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS)
SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow
Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from the industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying. This demonstration presents SPANNERLIB-a library for embedding document spanners in Python code. SPANNERLIB facilitates the development of IE programs by providing an implementation of Spannerlog (Datalog-based document spanners) that interacts with the Python code in two directions: rules can be embedded inside Python, and they can invoke custom Python code (e.g., calls to ML-based NLP models) via user-defined functions. The demonstration scenarios showcase IE programs, with increasing levels of complexity, within Jupyter Notebook.This work was funded by the Israel Science Foundation (ISF) under grant 768/19. The work of Dean Light and Benny Kimelfeld has been funded by the German Research Foundation (DFG) under grant KI 2348/1-1. S. Vansummeren was supported by the Bijzonder Onderzoeksfonds (BOF) of Hasselt University under Grant No. BOF20ZAP02 as well as the Flanders AI research program
The theory of infinite probabilistic databases
Probabilistic (relational) databases extend the relational model of databases by a concept of uncertainty that is expressed in terms of probability distributions. Such an extension is reasonable and useful for a variety of application scenarios in which there is either inherent uncertainty about the concrete shape of data, or which profit from the introduction of additional uncertainty. The standard model of probabilistic databases is based on a “possible worlds”-approach, that associates every possible manifestation of the data with probabilities. Although databases, logic, and stochastic models are usually defined over infinite universes, for example the real numbers, the conventional model only covers discrete probability spaces over finitely many facts.In this work, we extend the formal “possible worlds”-approach of probabilistic databases to the (uncountable) infinite setting. This introduces a number of new, non-trivial problems and questions, first and foremost the construction of suitable probability spaces and the well-definedness of query semantics. Our model of standard probabilistic databases that we develop for this purpose is based on the probability-theoretic notion of finite point processes, and we show that it indeed has the desired properties. In addition to the plain “possible worlds”-approach, various independence assumptions play a major role in finite probabilistic databases. Therefore, we investigate such assumptions also for infinite probabilistic databases. We characterize the existence of probabilistic databases with such properties and introduce new notions for independence assumptions in different new settings. This is based, again, on notions from point process theory. With respect to independence assumptions we present first studies regarding representations, representability and query answering. Finally, we discuss a particular probabilistic language—generative Datalog, which has been developed by Bárány, ten Cate, Kimelfeld, Olteanu and Vagena in order to conjoin probabilistic programming with the declarative nature of database query languages. This language is both suitable as a query language, and for representing infinite probabilistic databases. We embed the language into our framework of standard probabilistic databases and equip it with an improved semantics that also supports continuous distributions and probabilistic inputs. With our results and concepts, we establish a theory of infinite probabilistic databases and make an important contribution towards the understanding of the mathematical background of probabilistic relational databases in general
Facets of Probabilistic Databases (Invited Talk)
Probabilistic databases are commonly known in the form of the tuple-independent model, where the validity of every tuple is an independent random event. Conceptually, the notion is more general, as a probabilistic database refers to any probability distribution over ordinary databases. A central computational problem is that of marginal inference for database queries: what is the probability that a given tuple is a query answer? In this talk, I will discuss recent developments in several research directions that, collectively, position probabilistic databases as the common and natural foundation of various challenges at the core of data analytics. Examples include reasoning about uncertain preferences from conventional distributions such as the Mallows model, data cleaning and repairing in probabilistic paradigms such as the HoloClean system, and the explanation of query answers through concepts from cooperative game theory such as the Shapley value and the Banzhaf Power Index. While these challenges manifest different facets of probabilistic databases, I will show how they interrelate and, moreover, how they relate to the basic theory of inference over tuple-independent databases
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
- …
