WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES
Several studies have concentrated on the generation of wrappers for web data sources. As
wrappers can be easily described as grammars, the grammatical inference heritage could play a
significant role in this research field. Recent results have identified a new subclass of regular
languages, called prefix mark-up languages, that nicely abstract the structures usually found in
HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a
PTIME unsupervised learning algorithm has been previously developed. Unfortunately, many
real-life web pages do not fall in this class of languages. In this article we analyze the roots of
the problem and we propose a technique to transform pages in order to bring them into the class
of prefix mark-up languages. In this way, we obtain a practical solution without giving up
the formal background defined within the grammatical inference framework. We report on some
experiments that we have conducted on real-life web pages to evaluate the approach; the results
of this activity demonstrate the effectiveness of the presented techniques.
Structure and Semantics of Data-Intensive Web Pages: An Experimental Study on their Relationships
In data-intensive web sites, pages are generated by scripts that embed data from a backend database into HTML templates. There is usually a relationship between the semantics of the data in a page and its corresponding template. For example, in a web site about sports events, it is likely that pages with data about athletes are associated with a template that differs from the template used to generate pages about coaches or referees. This article presents a method to classify web pages according to the associated template. Given a web page, the goal of our method is to accurately find the pages that are about the same topic. Our method leverages a simple yet effective model that abstracts some structural features of a web page. We present the results of an extensive experimental analysis that show the performance of our method, in terms of both recall and precision, on a large number of real-world web pages.
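The abstract does not say which structural features the model uses, so the following is only a minimal sketch under an assumption: pages are compared by the set of root-to-element tag paths they contain, using Jaccard similarity. The names `tag_paths` and `structural_similarity` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string (assumed feature model)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(page_a, page_b):
    """Jaccard similarity of the two pages' tag-path sets, in [0, 1]."""
    paths_a, paths_b = tag_paths(page_a), tag_paths(page_b)
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


# Illustrative pages: two share an "athlete" template, one uses a "coach" template.
athlete_1 = "<html><body><h1>Ann</h1><table><tr><td>100m</td></tr></table></body></html>"
athlete_2 = "<html><body><h1>Joe</h1><table><tr><td>Long jump</td></tr></table></body></html>"
coach = "<html><body><h1>Bob</h1><ul><li>Team A</li></ul></body></html>"

same_template = structural_similarity(athlete_1, athlete_2)
different_template = structural_similarity(athlete_1, coach)
```

Under this assumed model, pages generated from the same template score close to 1 regardless of the data they embed, so a similarity threshold could be used to group pages by template.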
Crowdsourcing large scale wrapper inference
We present a crowdsourcing system for large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms that delegate the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single-worker algorithm (ALFη) and a multiple-worker algorithm (alfred). Both algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation, conducted with both synthetic workers and real workers recruited on a crowdsourcing platform, shows that our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.
Back to the Gold's age: Bridging the gap between traditional grammar inference and web information extraction
Handling irregularities in ROADRUNNER
We report on some recent advances in the development of the ROADRUNNER system, which is able to automatically infer a wrapper for HTML pages. One of the major drawbacks of the ROADRUNNER approach was its limited ability to handle irregularities in the source pages. To overcome this issue, we have developed a technique to deal with chunks of unstructured HTML code. Several experiments have been conducted to evaluate the effectiveness of the approach, producing encouraging results. Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
