1,721,043 research outputs found

    WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

    No full text
    Several studies have concentrated on the generation of wrappers for web data sources. As wrappers can be easily described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, that nicely abstract the structures usually found in HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a PTIME unsupervised learning algorithm has been previously developed. Unfortunately, many real-life web pages do not fall in this class of languages. In this article we analyze the roots of the problem and we propose a technique to transform pages in order to bring them into the class of prefix mark-up languages. In this way, we have a practical solution without renouncing to the formal background defined within the grammatical inference framework. We report on some experiments that we have conducted on real-life web pages to evaluate the approach; the results of this activity demonstrate the effectiveness of the presented techniques

    Roman medieval documents meet machine learning An experience of research and teaching.

    No full text
    This contribution is dedicated to recounting our experience of researching the doc- uments of medieval Rome, initiated in 2016 thanks to In codice ratio2. In codice ratio is a research project conceived by Paolo Merialdo that aims to develop innovative methodologies and tools for the analysis and study of manuscript sources. In its ini- tial phase, the project involved the collaboration of the Department of Humanistic Studies of Roma Tre University and the Vatican Apostolic Archive in the appli- cation of machine learning techniques to create software capable of automatically recognizing and transcribing manuscript sources from the medieval period

    Structure and Semantics of Data-IntensiveWeb Pages: An Experimental Study on their Relationships

    No full text
    In data-intensive web sites pages are generated by scripts that embed data from a backend database into HTML templates. There is usually a relationship between the semantics of the data in a page and its corresponding template. For example, in a web site about sports events, it is likely that pages with data about athletes are associated with a template that differs from the template used to generate pages about coaches or referees. This article presents a method to classify web pages according to the associated template. Given a web page, the goal of our method is to accurately find the pages that are about the same topic. Our method leverages on a simple, yet effective model to abstract some structural features of a web page. We present the results of an extensive experimental analysis that show the performance of our methods in terms of both recall and precision regarding a large number of real-world web pages

    Crowdsourcing large scale wrapper inference

    Full text link
    We present a crowdsourcing system for large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms which demand the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single worker algorithm (ALFη) and a multiple workers algorithm (alfred). Both the algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation conducted with both synthetic workers and real workers recruited on a crowdsourcing platform show that our approach is able to produce accurate wrappers at a low cost, even in presence of workers with a significant error rate.We present a crowdsourcing system for large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms which demand the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single worker algorithm (ALFη) and a multiple workers algorithm (alfred). Both the algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation conducted with both synthetic workers and real workers recruited on a crowdsourcing platform show that our approach is able to produce accurate wrappers at a low cost, even in presence of workers with a significant error rate
    corecore