
    Beyond Topic Modeling: Comparative Evaluation of Topic Interpretation by Large Language Models

    This study investigates the application of Large Language Models (LLMs) for interpreting topics derived from topic modeling within the domain of restaurant reviews in Brazilian Portuguese. Traditional topic modeling techniques often produce outputs that require further interpretation to fully capture the nuanced meanings within the data. This research leverages the advanced natural language processing capabilities of LLMs to provide deeper insights into these topics, aiming to bridge the gap between computational topic identification and human-like understanding. A comparative analysis of several LLMs, including ChatGPT versions 3.5 and 4.0 and Google’s BARD, was conducted to assess their efficacy in interpreting the generated topics from a large dataset of restaurant reviews. The topics identified were subjected to interpretation by both human evaluators and LLMs, enabling a direct comparison between human and machine-generated interpretations. Preliminary results indicate that LLMs, especially with well-crafted prompts, can produce interpretations that are closely aligned with human understanding, underscoring their potential utility in qualitative data analysis within NLP applications. This research not only sheds light on the interpretative capabilities of LLMs but also opens new pathways for automating complex interpretive tasks across diverse linguistic and cultural landscapes.

    Design and Maintenance of Data-Intensive Web Sites

    A methodology for designing and maintaining large Web sites is introduced. It is especially useful when the data to be published on the site are managed using a DBMS. The design process is composed of two intertwined activities: database design and hypertext design. Each of these is further divided into a conceptual phase and a logical phase, based on specific data models proposed in our project. The methodology strongly supports site maintenance: the various models provide a concise description of the site structure, making it possible to reason about the overall organization of pages in the site and, where needed, to restructure it.

    Semistructured and Structured Data in the Web: Going Back and Forth

    Database systems offer efficient and reliable technology to query structured data. However, because of the explosion of the World Wide Web, an increasing amount of information is stored in repositories organized according to less rigid structures, usually as hypertextual documents, and data access is based on browsing and information retrieval techniques. Since browsing and search engines present important limitations, several query languages for the Web have been recently proposed. These approaches are mainly based on a loose notion of structure, and tend to see the Web as a huge collection of unstructured objects, organized as a graph. Clearly, traditional database techniques are of little use in this field, and new techniques need to be developed. In this paper, we present the approach to the management of Web data taken in the ARANEUS project, carried out by the database group at University Roma Tre. Our approach is based on a generalization of the notion of view to the Web framework.

    To Weave the Web

    The paper discusses the issue of views in the Web context. We introduce a set of languages for managing and restructuring data coming from the World Wide Web. We present a specific data model, called the ARANEUS Data Model, inspired by the structures typically present in Web sites. The model allows us to describe the scheme of a Web hypertext, in the spirit of databases. Based on the data model, we develop two languages to support a sophisticated view definition process: the first, called ULIXES, is used to build database views of the Web, which can then be analyzed and integrated using database techniques; the second, called PENELOPE, allows the definition of derived Web hypertexts from relational views. This can be used to generate hypertextual views over the Web.

    Araneus in the Era of XML

    A large body of research has been recently motivated by the attempt to extend database manipulation techniques to data on the Web. Most of these research efforts -- which range from the definition of Web query languages and the related optimizations, to systems for Web site development and management, and to integration techniques -- started before XML was introduced, and therefore have strived for a long time to handle the highly heterogeneous nature of HTML pages. Meanwhile, Web data sources have evolved from small, home-made collections of HTML pages into complex platforms for distributed data access and application development, and XML promises to impose itself as a more appropriate format for this new breed of Web sites. XML brings data on the Web closer to databases, since, unlike HTML, it is based on a clean distinction between the way the data, its logical structure (the DTD), and the chosen presentation (the stylesheet) are specified. By virtue of this, most of the early research proposals for data management on the Web are now being reconsidered in this new perspective. In this paper, we discuss the impact of XML on the research work conducted in the last few years by our group in the framework of the Araneus project. Araneus started as an attempt to investigate the chances of re-applying traditional database concepts and abstractions, such as those of data model and query language, to data on the Web. In this spirit, we have developed several tools and techniques to handle both structured and semistructured data, in the Web style, as follows: (i) a data model called ADM for modeling Web documents and hypertexts; (ii) languages for wrapping and querying Web sites; (iii) tools and techniques for Web site design and implementation.

    RoadRunner: Towards Automatic Data Extraction from Large Web Sites

    The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.
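
The idea of deriving a wrapper from the similarities and differences between pages can be illustrated with a small Python sketch. This is not the RoadRunner algorithm itself, which the abstract does not detail; it is a simplified stand-in that aligns two token streams with the standard library's `difflib.SequenceMatcher`. The names `tokenize`, `infer_template`, `#FIELD`, and `#OPTIONAL` are hypothetical and chosen only for this example.

```python
import re
from difflib import SequenceMatcher

# A token is either a complete tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and stripped, non-empty text tokens."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def infer_template(page_a, page_b):
    """Align two pages; shared tokens form the template, mismatches become slots.

    Simplifying assumption: a token-level diff stands in for the page
    comparison step. Tokens present in both pages are treated as template;
    tokens that differ are treated as data fields; tokens present in only
    one page are marked as optional.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    template = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])      # common structure
        elif op == "replace":
            template.append("#FIELD")      # differing content: a data slot
        else:
            template.append("#OPTIONAL")   # present in only one of the pages
    return template


page_a = "<html><b>Title:</b> Databases <b>Year:</b> 1997</html>"
page_b = "<html><b>Title:</b> Wrappers <b>Year:</b> 2001</html>"
print(infer_template(page_a, page_b))
```

For the two sample pages, the tags and the labels `Title:` and `Year:` are kept as template tokens, while the differing title and year values collapse into `#FIELD` slots.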

    Hybrid crowd-machine wrapper inference

    Wrapper inference deals with generating programs to extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy underlying assumptions of the inference algorithm, their accuracy is compromised. On the other hand, supervised approaches produce accurate wrappers, but since they need training data, their scalability is limited. The recent advent of crowdsourcing platforms has opened new opportunities for supervised approaches, as they make possible the production of large amounts of training data with the support of workers recruited online. Nevertheless, involving human workers has monetary costs. We present an original hybrid crowd-machine wrapper inference system that offers the benefits of both approaches by exploiting the cooperation of crowd workers and unsupervised algorithms. Based on a principled probabilistic model that estimates the quality of wrappers, human workers are recruited only when unsupervised wrapper induction algorithms are not able to produce sufficiently accurate solutions.
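
The recruitment rule described in this abstract, asking the crowd only when no automatically inferred wrapper looks accurate enough, might be sketched as follows in Python. The abstract does not specify the probabilistic model, so the quality estimates are taken here as given inputs; the `Candidate` class, the `ask_crowd` callback, and the `0.9` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    probability: float  # assumed: estimated probability that the wrapper is correct


def needs_crowd(candidates, threshold=0.9):
    """Return True when no candidate wrapper is judged accurate enough."""
    best = max((c.probability for c in candidates), default=0.0)
    return best < threshold


def select_wrapper(candidates, ask_crowd, threshold=0.9):
    """Pick the best unsupervised wrapper, falling back to paid workers.

    `ask_crowd` is a hypothetical callback that takes the candidates and
    returns the one chosen with the help of crowd workers.
    """
    if needs_crowd(candidates, threshold):
        return ask_crowd(candidates)
    return max(candidates, key=lambda c: c.probability)


confident = [Candidate("w1", 0.95), Candidate("w2", 0.40)]
uncertain = [Candidate("w1", 0.55), Candidate("w2", 0.45)]
print(needs_crowd(confident), needs_crowd(uncertain))
```

Keeping the decision separate from the selection makes the cost trade-off explicit: the crowd callback, which carries a monetary cost, is invoked only on the uncertain branch.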