WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

Author: MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2008

Several studies have concentrated on the generation of wrappers for web data sources. As wrappers can be easily described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, that nicely abstract the structures usually found in HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a PTIME unsupervised learning algorithm has been previously developed. Unfortunately, many real-life web pages do not fall in this class of languages. In this article we analyze the roots of the problem and propose a technique to transform pages in order to bring them into the class of prefix mark-up languages. In this way, we obtain a practical solution without giving up the formal background defined within the grammatical inference framework. We report on experiments conducted on real-life web pages to evaluate the approach; the results demonstrate the effectiveness of the presented techniques.

Archivio della Ricerca - Università di Roma 3

The Startup Ecosystem: a Quick Tour

Author: MERIALDO PAOLO
Publication date: 01/01/2015

Archivio della Ricerca - Università di Roma 3

Roman medieval documents meet machine learning: An experience of research and teaching

Author: Merialdo Paolo
Ammirati Serena
Publication date: 01/01/2025

This contribution is dedicated to recounting our experience of researching the documents of medieval Rome, initiated in 2016 thanks to In codice ratio. In codice ratio is a research project conceived by Paolo Merialdo that aims to develop innovative methodologies and tools for the analysis and study of manuscript sources. In its initial phase, the project involved the collaboration of the Department of Humanistic Studies of Roma Tre University and the Vatican Apostolic Archive in the application of machine learning techniques to create software capable of automatically recognizing and transcribing manuscript sources from the medieval period.

Archivio della Ricerca - Università di Roma 3

Web Site Evaluation: Methodology and Case Study

Author: ATZENI Paolo
MERIALDO PAOLO
SINDONI G.
Publication date: 01/01/2001

Archivio della Ricerca - Università di Roma 3

The RoadRunner Project: Towards Automatic Extraction of Web Data

Author: Merialdo Paolo
MECCA Giansalvatore
Crescenzi Valter
Publication date: 01/01/2001

Archivio della Ricerca - Università della Basilicata

Structure and Semantics of Data-Intensive Web Pages: An Experimental Study on their Relationships

Author: Crescenzi Valter
LORENZO BLANCO
MERIALDO PAOLO
Publication date: 01/01/2008

In data-intensive web sites, pages are generated by scripts that embed data from a backend database into HTML templates. There is usually a relationship between the semantics of the data in a page and its corresponding template. For example, in a web site about sports events, it is likely that pages with data about athletes are associated with a template that differs from the template used to generate pages about coaches or referees. This article presents a method to classify web pages according to the associated template. Given a web page, the goal of our method is to accurately find the pages that are about the same topic. Our method relies on a simple yet effective model to abstract some structural features of a web page. We present the results of an extensive experimental analysis that show the performance of our method in terms of both recall and precision on a large number of real-world web pages.

ZENODO

Archivio della Ricerca - Università di Roma 3

ARPHA OAI-PMH Endpoint

ARPHA Preprints

Semistructured and Structured Data in the Web: Going Back and Forth

Author: PAOLO ATZENI
GIANSALVATORE MECCA
MERIALDO PAOLO
Publication date: 01/01/1997

Archivio della Ricerca - Università di Roma 3

Design and Maintenance of Data-Intensive Web Sites

Author: MECCA G.
ATZENI Paolo
MERIALDO PAOLO
Publication date: 01/01/1998

Archivio della Ricerca - Università di Roma 3

Crowdsourcing large scale wrapper inference

Author: QIU DISHENG
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2015

We present a crowdsourcing system for large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms which delegate the burden of generating training data to workers recruited on a crowdsourcing platform. Workers are paid for answering simple queries carefully chosen by the system. We present two algorithms: a single worker algorithm (ALFη) and a multiple workers algorithm (alfred). Both algorithms deal with the inherent uncertainty of the workers’ responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers should be recruited to achieve a quality target. The system has been fully implemented and tested: the experimental evaluation, conducted with both synthetic workers and real workers recruited on a crowdsourcing platform, shows that our approach is able to produce accurate wrappers at a low cost, even in the presence of workers with a significant error rate.

Crossref

Archivio della Ricerca - Università di Roma 3

Back to the Gold's age: Bridging the gap between traditional grammar inference and web information extraction

Author: G. MECCA
MERIALDO PAOLO
CRESCENZI VALTER
Publication date: 01/01/2002

Archivio della Ricerca - Università di Roma 3