
    A KDD Platform based on the Application Service Provider Paradigm

    Nowadays, Small and Medium Enterprises (SMEs) are forced to compete on a global market and to make strategic decisions in short periods of time. To give SMEs access to information technologies that can support their competitiveness on a global scale, public administrations are fostering the setting up of Digital Districts. In this paper, we describe a distributed collaborative data mining platform, named KD-ASP, developed for a Digital District. It is based on the application service provider (ASP) paradigm, which allows SMEs to access data mining services over a network and to cut the costs of acquiring, upgrading, and maintaining them. KD-ASP allows users to collaborate on the design of a knowledge discovery process, whose execution is then delegated to a workflow engine. Tasks involved in a process are classified as data selection, pre-processing, data transformation, data mining, and data visualization, and are made available as Web services.

    A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce

    Recently, several algorithms based on the MapReduce framework have been proposed for frequent pattern mining in Big Data. However, the proposed solutions come with their own technical challenges, such as inter-communication costs, in-process synchronizations, balanced data distribution, and input parameter tuning, which negatively affect the computation time. In this paper we present MrAdam, a novel parallel, distributed algorithm which addresses these problems. The key principle underlying the design of MrAdam is that one can make reasonable decisions in the absence of perfect answers. Indeed, given the classical threshold for minimum support and a user-specified error bound, MrAdam exploits the Chernoff bound to mine "approximate" frequent itemsets with statistical error guarantees on their actual supports. These itemsets are generated in parallel and independently from subsets of the input dataset, by exploiting the MapReduce parallel computation framework. The resulting collections of frequent itemsets from each subset are aggregated and filtered using a novel technique to produce a single output collection. MrAdam scales well on gigabytes of data and tens of machines, as experimentally shown on real datasets. In the experiments we also show that the proposed algorithm returns a good, statistically bounded approximation of the exact results.
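
To make the role of a Chernoff-style bound concrete, the sketch below uses the Hoeffding inequality (an additive bound of the Chernoff family) to relate the size of a random sample of transactions to the error in an itemset's estimated support. It is a generic illustration under the assumption of i.i.d. sampling, not the MrAdam procedure itself; the function names `sample_size` and `error_bound` and the parameters `epsilon` and `delta` are introduced here for illustration only.

```python
import math


def sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that, by Hoeffding's inequality,
    P(|estimated_support - true_support| >= epsilon) <= delta
    for one itemset, assuming n transactions sampled i.i.d.

    Derived from 2 * exp(-2 * n * epsilon**2) <= delta.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def error_bound(n: int, delta: float) -> float:
    """Additive error on the support that holds with probability
    at least 1 - delta for a sample of n transactions."""
    if n <= 0 or not (0 < delta < 1):
        raise ValueError("n must be positive and delta in (0, 1)")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    # Hypothetical settings: 1% additive error, 95% confidence.
    n = sample_size(epsilon=0.01, delta=0.05)
    print(n, error_bound(n, delta=0.05))
```

The bound covers a single itemset; a guarantee that holds for many itemsets at once would need a further correction (for example a union bound), which is omitted here.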

    Unexpected results in automatic list extraction on the web

    The discovery and extraction of general lists on the Web continues to be an important problem facing the Web mining community. Numerous studies claim to automatically extract structured data (i.e., lists, record sets, tables, etc.) from the Web for various purposes. Our own recent experiences have shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches and tests them on various list pages. Based on our findings, we conclude that analyzing a Web page's DOM structure is not sufficient for the general list-finding task.

    Discovering Novelty Patterns from the Ancient Christian Inscriptions of Rome

    Studying Greek and Latin cultural heritage has always been considered essential to the understanding of important aspects of the roots of current European societies. However, only a small fraction of the total production of texts from ancient Greece and Rome has survived up to the present, leaving many gaps in the historiographic records. Epigraphy, which is the study of inscriptions (epigraphs), helps to fill these gaps. In particular, the goal of epigraphy is to clarify the meanings of epigraphs; to classify their uses according to their dating and cultural contexts; and to study aspects of the writing, the writers, and their “consumers.” Although several research projects have recently been promoted for digitally storing and retrieving data and metadata about epigraphs, there has been no attempt to apply data mining technologies to discover previously unknown cultural aspects. In this context, we propose to exploit the temporal dimension associated with epigraphs (dating) by applying a data mining method for novelty detection. The main goal is to discover relational novelty patterns, that is, patterns expressed as logical clauses describing significant variations (in frequency) over the different epochs, in terms of relevant features such as language, writing style, and material. As a case study, we considered the set of Inscriptiones Christianae Vrbis Romae stored in Epigraphic Database Bari, an epigraphic repository. Some patterns discovered by the data mining method were easily deciphered by experts since they captured relevant cultural changes, whereas others disclosed unexpected variations, which might be used to formulate new questions, thus expanding the research opportunities in the field of epigraphy.
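
The idea of a pattern whose frequency varies significantly across epochs can be sketched in a much simplified form. The example below uses flat attribute-value patterns rather than the relational (logical-clause) patterns the abstract describes, and a plain absolute-difference threshold rather than the authors' novelty criterion. The records, epoch labels, feature values, and the names `pattern_frequencies` and `novelty_epochs` are all invented for illustration.

```python
from collections import Counter


def pattern_frequencies(records, pattern):
    """Relative frequency of `pattern` (a dict feature -> value) in each epoch.

    `records` is an iterable of (epoch, features) pairs, where `features`
    is a dict describing one inscription.
    """
    totals = Counter()
    matches = Counter()
    for epoch, features in records:
        totals[epoch] += 1
        if all(features.get(key) == value for key, value in pattern.items()):
            matches[epoch] += 1
    return {epoch: matches[epoch] / totals[epoch] for epoch in totals}


def novelty_epochs(freqs, epoch_order, min_change):
    """Epochs whose pattern frequency differs from the preceding epoch
    by at least `min_change` (absolute difference)."""
    flagged = []
    for prev, curr in zip(epoch_order, epoch_order[1:]):
        if abs(freqs[curr] - freqs[prev]) >= min_change:
            flagged.append(curr)
    return flagged


# Invented toy records, not taken from any real epigraphic database.
records = [
    ("epoch_1", {"language": "greek", "material": "marble"}),
    ("epoch_1", {"language": "latin", "material": "marble"}),
    ("epoch_2", {"language": "latin", "material": "marble"}),
    ("epoch_2", {"language": "latin", "material": "tufa"}),
]

freqs = pattern_frequencies(records, {"language": "latin"})
flagged = novelty_epochs(freqs, ["epoch_1", "epoch_2"], min_change=0.4)
print(freqs, flagged)
```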