1,720,978 research outputs found
Weight Annotation in Information Extraction
The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of vset-automata-a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the (positive) RA operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. As a specific instantiation, we explore weighted vset-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of numeric semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.Funding
This work was supported by the German-Israeli Foundation for Scientific Research and
Development (GIF), grant I-1502-407.6/2019. The work of Johannes Doleschal and Wim Martens
was also supported by the Deutsche Forschungsgemeinschaft (DFG), grant MA 4938/4-1. The work
of Benny Kimelfeld and Liat Peterfreund was also supported by the Israel Science Foundation (ISF),
grants 1295/15 and 768/19, and the DFG project 412400621 (DIP program).
Liat Peterfreund: A part of the work was donewhile the author was affiliated with the Technion.
Acknowledgements
We are grateful to Matthias Niewerth for many useful discussions and his help
regarding Theorem 7.1 and Shaull Almagor for many helpful comments regarding weighted automata.
Furthermore, we thank the anonymous reviewers for ICDT 2020 for many helpful remarks
Grammars for Document Spanners
We propose a new grammar-based language for defining information-extractors
from documents (text) that is built upon the well-studied framework of document
spanners for extracting structured data from text. While previously studied
formalisms for document spanners are mainly based on regular expressions, we
use an extension of context-free grammars, called {extraction grammars}, to
define the new class of context-free spanners. Extraction grammars are simply
context-free grammars extended with variables that capture interval positions
of the document, namely spans. While regular expressions are efficient for
tokenizing and tagging, context-free grammars are also efficient for capturing
structural properties. Indeed, we show that context-free spanners are strictly
more expressive than their regular counterparts. We reason about the expressive
power of our new class and present a pushdown-automata model that captures it.
We show that extraction grammars can be evaluated with polynomial data
complexity. Nevertheless, as the degree of the polynomial depends on the query,
we present an enumeration algorithm for unambiguous extraction grammars that,
after quintic preprocessing, outputs the results sequentially, without
repetitions, with a constant delay between every two consecutive ones
Playing Guess Who with Your Kids
Guess who is a two-player search game in which each player chooses a character from a deck of 24 cards, and has to infer the other player’s character by asking yes-no questions. A simple binary search strategy allows the starting player find the opponent’s character by asking 5 questions only, when the opponent is honest.
Real-life observations show that in more realistic scenarios, the game is played against adversaries that do not strictly follow the rules, e.g., kids. Such players might decide to answer all questions at once, answer only part of the questions as they do not know the answers to all, and even lie occasionally. We devise strategies for such scenarios using techniques from error-correcting and erasure codes. This connects to a recent line of work on search problems on graphs and trees with unreliable auxiliary information, and could be of independent interest
Handling SQL Nulls with Two-Valued Logic
The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic with truth values true and false, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness, but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes. We show that, contrary to the widely held view, SQL could have been designed based on the standard Boolean logic, without any loss of expressiveness and without giving up nulls. The approach itself follows SQL’s evaluation which only retains tuples for which conditions in the WHERE clause evaluate to true. We show that conflating unknown, resulting from nulls, with false leads to an equally expressive version of SQL that does not use the third truth value. Queries written under the two-valued semantics can be efficiently translated into the standard SQL and thus executed on any existing RDBMS. These results cover the core of the SQL 1999 Standard, including SELECT-FROM-WHERE-GROUP BY-HAVING queries extended with subqueries and IN/EXISTS/ANY/ALL conditions, and recursive queries. We provide two extensions of this result showing that no other way of converting 3VL into Boolean logic, nor any other many-valued logic for treating nulls could have possibly led to a more expressive language. These results not only present small modifications of SQL that eliminate the source of many programmer errors without the need to reimplement database internals, but they also strongly suggest that new query languages for various data models do not have to follow the much criticized SQL’s three-valued approach
SQL Nulls and Two-Valued Logic
The design of SQL is based on a three-valued logic (3VL), rather than the familiar two-valued Boolean logic (2VL). In addition to true and false, 3VL adds unknown to handle nulls. Viewed as indispensable for SQL expressiveness, it is often criticized for unintuitive behavior of queries and for being a source of programmer mistakes. We show that, contrary to the widely held view, SQL could have been designed based on 2VL, without any loss of expressiveness. Similarly to SQL’s WHERE clause, which only keeps true tuples, we conflate false and unknown for conditions involving nulls to obtain an equally expressive 2VL-based version of SQL. This applies to the core of the 1999 SQL Standard. Queries written under the 2VL semantics can be efficiently trans lated into the 3VL SQL and thus executed on any existing RDBMS. We show that 2VL enables additional optimizations. To gauge its applicability, we establish criteria under which 2VL and 3VL seman tics coincide, and analyze common benchmarks such as TPC-H and TPC-DS to show that most of their queries are such. For queries that behave differently under 2VL and 3VL, we undertake a user study to show a consistent preference for the 2VL semantics
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
Enumerating grammar-based extractions
International audienceWe propose a new grammar-based language for defining information extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones
Playing Guess Who with your kids: Code-word strategy against adversaries
International audienceGuess Who is a two-player search game in which each player chooses a character from a deck of 24 cards, and has to infer the other player's character by asking yes-no questions. A simple binary search strategy allows the starting player find the opponent's character by asking 5 questions only, when the opponent is honest. Real-life observations show that in more realistic scenarios, the game is played against adversaries that do not strictly follow the rules, e.g., kids. Such players might decide to answer all questions at once, answer only part of the questions as they do not know the answers to all, and even lie occasionally. We devise strategies for such scenarios using techniques from error-correcting and erasure codes. This connects to a recent line of work on search problems on graphs and trees with unreliable auxiliary information, and could be of independent interest
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
- …
