WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

Author: MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2008

Several studies have concentrated on the generation of wrappers for web data sources. As wrappers can be easily described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, that nicely abstracts the structures usually found in HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a PTIME unsupervised learning algorithm has been previously developed. Unfortunately, many real-life web pages do not fall in this class of languages. In this article we analyze the roots of the problem and propose a technique to transform pages in order to bring them into the class of prefix mark-up languages. This yields a practical solution without giving up the formal background defined within the grammatical inference framework. We report on experiments conducted on real-life web pages to evaluate the approach; the results demonstrate the effectiveness of the presented techniques.

Archivio della Ricerca - Università di Roma 3

Structure and Semantics of Data-Intensive Web Pages: An Experimental Study on their Relationships

Author: Crescenzi Valter
Blanco Lorenzo
Merialdo Paolo
Publication date: 01/01/2008

In data-intensive web sites, pages are generated by scripts that embed data from a backend database into HTML templates. There is usually a relationship between the semantics of the data in a page and its corresponding template. For example, in a web site about sports events, it is likely that pages with data about athletes are associated with a template that differs from the template used to generate pages about coaches or referees. This article presents a method to classify web pages according to the associated template. Given a web page, the goal of our method is to accurately find the pages that are about the same topic. Our method leverages a simple, yet effective model to abstract some structural features of a web page. We present the results of an extensive experimental analysis that show the performance of our method in terms of both recall and precision on a large number of real-world web pages.
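The abstract does not spell out the structural model, so the following is only an illustrative sketch of the general idea of comparing pages by template structure. It assumes, hypothetically, that a page can be abstracted as the set of its root-to-element tag paths and that two pages are compared with Jaccard similarity; the names `TagPathCollector`, `tag_paths`, and `structural_similarity` are invented for this example and are not taken from the paper.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page.

    Assumption for illustration only: this is not the model used in the paper.
    """

    # Void elements never get a closing tag, so they are not pushed on the stack.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard similarity between the tag-path sets of two pages."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under these assumptions, two pages generated from the same template but with different data score 1.0, while pages from different templates score lower; a classifier could then group pages whose similarity exceeds a chosen threshold.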

ZENODO

Archivio della Ricerca - Università di Roma 3

ARPHA OAI-PMH Endpoint

ARPHA Preprints

The RoadRunner Project: Towards Automatic Extraction of Web Data

Author: Merialdo Paolo
MECCA Giansalvatore
Crescenzi Valter
Publication date: 01/01/2001

Archivio della Ricerca - Università della Basilicata

Crowdsourcing large scale wrapper inference

Author: QIU DISHENG
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2015

We present a crowdsourcing system for large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms that delegate the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single worker algorithm (ALFη) and a multiple workers algorithm (alfred). Both algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation, conducted with both synthetic workers and real workers recruited on a crowdsourcing platform, shows that our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.
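To make the link between worker error rate and the number of workers concrete, here is a minimal sketch. It is not the alfred algorithm: it assumes, for illustration only, that workers answer independently with the same known error rate and that their answers are combined by simple majority vote. The function names `majority_accuracy` and `workers_needed` are hypothetical.

```python
from math import comb


def majority_accuracy(n_workers, error_rate):
    """Probability that a strict majority of n independent workers is correct."""
    p = 1.0 - error_rate
    needed = n_workers // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_workers, k) * p**k * error_rate ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )


def workers_needed(error_rate, target, max_workers=99):
    """Smallest odd number of workers whose majority vote meets the target.

    Returns None if the target is not reached within max_workers.
    """
    if not 0.0 <= error_rate < 0.5:
        raise ValueError("majority voting needs error_rate < 0.5")
    for n in range(1, max_workers + 1, 2):  # odd sizes avoid ties
        if majority_accuracy(n, error_rate) >= target:
            return n
    return None
```

Under these simplifying assumptions, noisier workers push the required crowd size up, which is the trade-off between cost and quality that the abstract describes; a real system would estimate the error rate from the answers rather than take it as given.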

Crossref

Archivio della Ricerca - Università di Roma 3

Back to the Gold's age: Bridging the gap between traditional grammar inference and web information extraction

Author: G. MECCA
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2002

Archivio della Ricerca - Università di Roma 3

The RoadRunner Web Data Extraction System

Author: G. MECCA
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2001

Archivio della Ricerca - Università di Roma 3

Improving the expressiveness of ROADRUNNER

Author: G. MECCA
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2004

Archivio della Ricerca - Università di Roma 3

Wrapper Generation Supervised by a Noisy Crowd

Author: Qiu D.
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2013

Archivio della Ricerca - Università di Roma 3

Grammars have exceptions

Author: Giansalvatore Mecca
Valter Crescenzi
Publication date: 01/01/1998

Crossref

Archivio della Ricerca - Università di Roma 3

Handling irregularities in ROADRUNNER

Author: MECCA Giansalvatore
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2004

We report on some recent advances in the development of the ROADRUNNER system, which is able to automatically infer a wrapper for HTML pages. One of the major drawbacks of the ROADRUNNER approach was its limited ability to handle irregularities in the source pages. To overcome this issue, we have developed a technique to deal with chunks of unstructured HTML code. Several experiments have been conducted to evaluate the effectiveness of the approach, producing encouraging results. Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Archivio della Ricerca - Università della Basilicata

Archivio della Ricerca - Università di Roma 3