Crowdsourcing large scale wrapper inference

Author: QIU DISHENG
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2015

We present a crowdsourcing system for large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms that delegate the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single-worker algorithm (ALFη) and a multiple-worker algorithm (ALFRED). Both algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. ALFRED estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation, conducted with both synthetic workers and real workers recruited on a crowdsourcing platform, shows that our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.

Crossref

Archivio della Ricerca - Università di Roma 3

Future locations prediction with uncertain data

Author: Blanco L.
QIU DISHENG
Papotti P.
Publication date: 01/01/2013

Archivio della Ricerca - Università di Roma 3

ALFRED: crowd assisted data extraction

Author: QIU DISHENG
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2013

The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches, but the costs of training data, i.e., annotations over a set of sample pages, limit their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks submitted to these platforms should be extremely simple, so that they can be performed by non-experts, and their number should be minimized to contain the costs. We demonstrate ALFRED, a wrapper inference system supervised by the workers of a crowdsourcing platform. Training data are labeled values generated by means of membership queries, the simplest form of queries, posed to the crowd. ALFRED includes several original features: it automatically selects a representative sample set from the input collection of pages; in order to minimize the wrapper inference costs, it dynamically sets the expressiveness of the wrapper formalism and adopts an active learning algorithm to select the queries posed to the crowd; and it is able to manage inaccurate answers that can be provided by the workers engaged by crowdsourcing platforms.

Archivio della Ricerca - Università di Roma 3

A framework for learning web wrappers from the crowd

Author: Disheng Qiu
Valter Crescenzi
Paolo Merialdo
Publication date: 01/01/2013

Crossref

Archivio della Ricerca - Università di Roma 3

Big data linkage for product specification pages

Author: Paolo Merialdo
Divesh Srivastava
Disheng Qiu
Luciano Barbosa
Valter Crescenzi
Publication date: 01/01/2018

An increasing number of product pages are available from thousands of web sources, each page associated with a product, containing its attributes and one or more product identifiers. The sources provide overlapping information about the products, using diverse schemas, making web-scale integration extremely challenging. In this paper, we take advantage of the opportunity that sources publish product identifiers to perform big data linkage across sources at the beginning of the data integration pipeline, before schema alignment. To realize this opportunity, several challenges need to be addressed: identifiers need to be discovered on product pages, made difficult by the diversity of identifiers; the main product identifier on the page needs to be identified, made difficult by the many related products presented on the page; and identifiers across pages need to be resolved, made difficult by the ambiguity between identifiers across product categories. We present our RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources. We perform a thorough empirical evaluation of our RaF approach using the publicly available Dexter dataset consisting of 1.9M product pages from 7.1k sources of 3.5k websites, and demonstrate its effectiveness in practice.

Crossref

Archivio della Ricerca - Università di Roma 3

Web-Scale Extension of RDF Knowledge Bases from Templated Websites

Author: Both A
Usbeck R
Saleem M
QIU DISHENG
Bühmann L
MERIALDO PAOLO
CRESCENZI VALTER
Ngonga Ngomo A. C.
Publication date: 01/01/2014

Archivio della Ricerca - Università di Roma 3

Minimizing the Costs of the Training Data for Learning Web Wrappers

Author: Rolando Creo
QIU DISHENG
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2012

Data extraction from the Web is an important problem. Several approaches have been developed to bring the wrapper generation process to web scale. Although they rely on different techniques and formalisms, they all learn a wrapper given a set of sample pages. Unsupervised approaches require just a set of sample pages; supervised ones also need training data. Unfortunately, the accuracy obtained by unsupervised techniques is not sufficient for many applications. On the other hand, obtaining training data is not cheap at web scale. This paper addresses the issue of minimizing the costs of collecting training data for learning web wrappers. We show that two interleaved problems affect this issue: the choice of the sample pages, and the expressiveness of the wrapper language. We propose a solution that leverages contributions in the field of learning theory, and we discuss the promising results of an experimental evaluation of our approach.

Archivio della Ricerca - Università di Roma 3

Author Instructions

Author: Instructions Author
Publication date: 04/11/2013

Crossref

Cartographic Perspectives (E-Journal - North American Cartographic Information Society, NACIS)

Going Beyond Counting First Authors in Author Co-citation Analysis

Author: Zhao Dangzhi
Publication date: 01/01/2005

The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.

E-LIS

Variations on the Author

Author: Sayad Cecilia
Publication date: 01/01/2016

“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

Crossref

Kent Academic Repository