Efficient Queries over Web Views
Large Web sites are becoming repositories of structured information that can benefit from being viewed and queried as relational databases. However, querying these views efficiently requires new techniques. Data usually resides at a remote site and is organized as a set of related HTML documents, with network access being a primary cost factor in query evaluation. This cost can be reduced by exploiting the redundancy often found in site design. We use a simple data model, a subset of the Araneus data model, to describe the structure of a Web site. We augment the model with link and inclusion constraints that capture the redundancies in the site. We map relational views of a site to a navigational algebra and show how to use the constraints to rewrite algebraic expressions, reducing the number of network accesses. We show that similar techniques can be used to maintain materialized views over sets of HTML pages.
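The effect of a link constraint on network cost can be illustrated with a minimal Python sketch. It is not the paper's navigational algebra: the `Site` class, the page layout, and the assumed constraint (the anchor text of each link on a listing page equals the `name` attribute of the linked page) are all hypothetical and serve only to show why a rewritten plan needs fewer page fetches.

```python
class Site:
    """Hypothetical in-memory stand-in for a remote site that counts page fetches."""

    def __init__(self, pages):
        self.pages = pages      # maps URL -> page content (a dict)
        self.fetches = 0        # number of simulated network accesses

    def get(self, url):
        self.fetches += 1
        return self.pages[url]


def names_by_navigation(site, list_url):
    """Naive plan: fetch the listing page, then follow every link to read the name."""
    listing = site.get(list_url)
    return [site.get(link["url"])["name"] for link in listing["links"]]


def names_by_constraint(site, list_url):
    """Rewritten plan: under the assumed link constraint the anchor text already
    carries the name, so the linked pages never need to be fetched."""
    listing = site.get(list_url)
    return [link["anchor"] for link in listing["links"]]


def demo_pages():
    """Build a toy site: one listing page linking to three professor pages."""
    names = ["Ada", "Bob", "Eve"]
    pages = {f"/prof/{n}": {"name": n} for n in names}
    pages["/profs"] = {"links": [{"url": f"/prof/{n}", "anchor": n} for n in names]}
    return pages
```

On the toy site built by `demo_pages()`, both plans return the same names, but the naive plan fetches one page per link plus the listing page, while the rewritten plan fetches the listing page only.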
Data-Intensive Web Sites: Design and Maintenance
A methodology for designing and maintaining data-intensive Web sites is introduced. Leveraging ideas well established in the database field, the approach relies heavily on the use of models for the description of Web sites. The design process is composed of two intertwined activities: database design and hypertext design. Each of these is further divided into a conceptual phase and a logical phase, based on specific data models. The methodology strongly supports site maintenance: the various models provide a concise description of the site structure, and they make it possible to reason about the overall organization of pages in the site and, if needed, to restructure it.
Managing Web-based data - Database models and transformations
Database research, traditionally aimed at data management methods and tools in various frameworks, now requires a broader focus. Building on recent successes in business applications, researchers in database technology need to widen their spectrum of interest to confront new data management opportunities, particularly in the context of the Internet. Indeed, in the Asilomar Report on Database Research, experts from industry and academia called for researchers to "make it easy for everyone to store, organize, access, and analyze the majority of human information online" within the next 10 years.
AND GIANSALVATORE MECCA
Abstract. Information extraction from websites is now an important problem, usually handled by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent possible, in order to allow for large-scale extraction tasks even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the article lie in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known websites, and discuss opportunities and limitations of the proposed approach. Categories and Subject Descriptors: F.4.3 [Mathematical Logic and Formal Languages]: Formal Languages - Classes defined by grammars or automata; H.2.4 [Database Management]: Systems - Relational database
An Automatic Data Grabber for Large Web Sites
This chapter investigates a system to automatically grab data from data-intensive Websites. The system first infers a model that describes the Website as a collection of classes. Each class represents a set of structurally homogeneous pages, and it is associated with a small set of representative pages. Based on the model, a library of wrappers, one per class, is then inferred with the help of an external wrapper generator. The model, together with the library of wrappers, can thus be used to navigate the site and extract the data. The inference process is performed incrementally. The system starts from a given entry point that becomes the first member of the first class in the model. It then refines the model by exploring its boundaries to gather new pages. At each iteration, the system selects an outbound link collection from the model and iteratively fetches a page by following one of the links in the collection. In order to reduce the number of pages actually visited, after each download the system makes a guess on the class of the remaining pages. If the pages already downloaded provide sufficient evidence that the guess is right, the remaining pages of the collection are assigned to classes without actually fetching them. The process iterates until all the link collections are typed with a known class.
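The idea of typing a link collection without fetching all of its pages can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the system described in the chapter: `fetch` and `classify` are hypothetical callables supplied by the caller, and "sufficient evidence" is reduced to the assumption that the first `sample_size` downloaded pages of a collection all receive the same class.

```python
def infer_classes(collections, fetch, classify, sample_size=2):
    """Assign a class to every URL while fetching as few pages as possible.

    collections: dict mapping a collection name to a list of URLs
    fetch:       hypothetical callable, URL -> page content
    classify:    hypothetical callable, page content -> class label
    sample_size: assumed evidence threshold (number of agreeing pages)
    """
    assignment = {}   # URL -> class label
    fetched = []      # URLs actually downloaded, in order

    for urls in collections.values():
        labels = []
        for i, url in enumerate(urls):
            page = fetch(url)
            fetched.append(url)
            label = classify(page)
            labels.append(label)
            assignment[url] = label

            # Guess: the rest of the collection has the same class. Accept the
            # guess once enough downloaded pages agree, and stop fetching.
            if len(labels) >= sample_size and len(set(labels)) == 1:
                for remaining in urls[i + 1:]:
                    assignment[remaining] = label
                break
        # If the downloaded pages disagree, the loop simply fetches every page.

    return assignment, fetched
```

A homogeneous collection is typed after `sample_size` downloads, whereas a collection whose sampled pages disagree is fetched in full, which mirrors the trade-off between network cost and confidence in the guess.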
- …
