WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES
Several studies have concentrated on the generation of wrappers for web data sources. As
wrappers can be easily described as grammars, the grammatical inference heritage could play a
significant role in this research field. Recent results have identified a new subclass of regular
languages, called prefix mark-up languages, that nicely abstract the structures usually found in
HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a
PTIME unsupervised learning algorithm has been previously developed. Unfortunately, many
real-life web pages do not fall in this class of languages. In this article we analyze the roots of
the problem and we propose a technique to transform pages in order to bring them into the class
of prefix mark-up languages. In this way, we obtain a practical solution without giving up
the formal background defined within the grammatical inference framework. We report on
experiments conducted on real-life web pages to evaluate the approach; the results
demonstrate the effectiveness of the presented techniques.
Roman medieval documents meet machine learning: An experience of research and teaching
This contribution is dedicated to recounting our experience of researching the documents of medieval Rome, initiated in 2016 thanks to In codice ratio. In codice ratio is a research project conceived by Paolo Merialdo that aims to develop innovative methodologies and tools for the analysis and study of manuscript sources. In its initial phase, the project involved the collaboration of the Department of Humanistic Studies of Roma Tre University and the Vatican Apostolic Archive in the application of machine learning techniques to create software capable of automatically recognizing and transcribing manuscript sources from the medieval period.
Structure and Semantics of Data-Intensive Web Pages: An Experimental Study on their Relationships
In data-intensive web sites, pages are generated by scripts that embed data from a backend database into HTML templates. There is usually a relationship between the semantics of the data in a page and its corresponding template. For example, in a web site about sports events, it is likely that pages with data about athletes are associated with a template that differs from the template used to generate pages about coaches or referees. This article presents a method to classify web pages according to the associated template. Given a web page, the goal of our method is to accurately find the pages that are about the same topic. Our method leverages a simple yet effective model to abstract some structural features of a web page. We present the results of an extensive experimental analysis that show the performance of our method in terms of both recall and precision on a large number of real-world web pages.
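The idea of grouping pages by template can be illustrated with a minimal sketch. The abstract does not specify the structural model, so the following is only an assumption: each page is abstracted as the set of its root-to-element tag paths, and two pages are compared with Jaccard similarity. The names `TagPathCollector`, `tag_paths`, and `jaccard` are hypothetical and are not taken from the article.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of a page (assumed model)."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements
        self.paths = set()   # e.g. "html/body/table/tr"

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open element.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural abstraction of a page as a set of tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


athlete_1 = "<html><body><h1>A</h1><table><tr><td>x</td></tr></table></body></html>"
athlete_2 = "<html><body><h1>B</h1><table><tr><td>y</td></tr></table></body></html>"
coach = "<html><body><h1>C</h1><ul><li>z</li></ul></body></html>"

same_template = jaccard(tag_paths(athlete_1), tag_paths(athlete_2))
other_template = jaccard(tag_paths(athlete_1), tag_paths(coach))
```

Under this assumed model, pages produced by the same template score close to 1 regardless of their text content, while pages from different templates share only their common page skeleton.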
Crowdsourcing large scale wrapper inference
We present a crowdsourcing system for large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms which delegate the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single worker algorithm (ALFη) and a multiple workers algorithm (alfred). Both algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation, conducted with both synthetic workers and real workers recruited on a crowdsourcing platform, shows that our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.
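The relationship between worker error rate, number of workers, and a quality target can be illustrated with a simple calculation. This is not the alfred algorithm, whose details the abstract does not give. It is a sketch that assumes independent workers with a common, known error rate below 0.5, whose answers to a yes/no query are combined by plain majority vote. The function names are hypothetical.

```python
from math import comb


def majority_correct_prob(n, error_rate):
    """Probability that a majority of n independent workers answers correctly."""
    p = 1.0 - error_rate
    need = n // 2 + 1  # votes required for a strict majority
    return sum(comb(n, k) * p**k * error_rate**(n - k)
               for k in range(need, n + 1))


def workers_needed(error_rate, target, max_workers=99):
    """Smallest odd number of workers whose majority vote meets the target.

    Returns None if the target is not reached within max_workers.
    Assumes 0 <= error_rate < 0.5 (workers are better than random guessing).
    """
    if not 0.0 <= error_rate < 0.5:
        raise ValueError("error_rate must be in [0, 0.5)")
    for n in range(1, max_workers + 1, 2):  # odd sizes avoid ties
        if majority_correct_prob(n, error_rate) >= target:
            return n
    return None


needed = workers_needed(error_rate=0.2, target=0.9)
```

With a 20% error rate, three workers give a majority-vote accuracy of 0.896, just short of a 0.9 target, so five workers are needed under these assumptions.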
Back to the Gold's age: Bridging the gap between traditional grammar inference and web information extraction
