
    Efficient Stream Join Processing: Novel Approaches and Challenges

    Stream join is a fundamental data operator for processing real-time data, but it faces computational challenges during stream inequality join (theta join operators) due to frequent updates in indexing data structures. To tackle this problem, we build on three key insights: 1) identifying skewed data distributions in real time and implementing dedicated indexing structures for skewed keys to reduce index update costs; 2) leveraging optimized data structures, including insert-efficient mutable and search-efficient immutable structures, to speed up the search phase of the stream join; and 3) adopting learned indexes instead of conventional ones, which can provide up to 4x better performance. In this Ph.D. work, we propose novel solutions for distributed and multi-core stream join processing, including an indexing solution that uses a space-efficient dedicated filter and a two-stage data structure that effectively holds and processes sliding window items (bounded streaming contents). We are also exploring the adoption and benefits of learned indexes for real-time stream join processing. Despite non-trivial challenges like state management for distributed processing, processing guarantees, and efficient concurrency mechanisms, experiments on distributed stream processing systems show superior performance compared to state-of-the-art solutions.
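
The pairing of an insert-efficient mutable structure with search-efficient immutable ones can be illustrated with a minimal sketch. It is not the authors' data structure: the class name `TwoStageIndex`, the `buffer_capacity` parameter, and the `k < probe` predicate are assumptions made for illustration, and window eviction is omitted.

```python
from bisect import bisect_left


class TwoStageIndex:
    """Hypothetical two-stage index: an unsorted mutable buffer plus sorted immutable runs."""

    def __init__(self, buffer_capacity=4):
        self.buffer_capacity = buffer_capacity  # assumed tuning knob
        self.buffer = []  # insert-efficient stage: constant-time appends
        self.runs = []    # search-efficient stage: immutable sorted tuples

    def insert(self, key):
        self.buffer.append(key)
        # Once the buffer is full, freeze it into a sorted immutable run.
        if len(self.buffer) >= self.buffer_capacity:
            self.runs.append(tuple(sorted(self.buffer)))
            self.buffer = []

    def probe_less_than(self, probe):
        """Return every stored key k with k < probe (an inequality join predicate)."""
        matches = []
        for run in self.runs:
            # Binary search finds the cut point in each sorted run.
            matches.extend(run[:bisect_left(run, probe)])
        # The small mutable buffer is scanned linearly.
        matches.extend(k for k in self.buffer if k < probe)
        return matches


index = TwoStageIndex(buffer_capacity=4)
for key in [5, 1, 9, 3, 7, 2]:
    index.insert(key)
result = index.probe_less_than(6)
```

Inserts never touch the sorted runs, so the update cost stays low, while probes pay a binary search per run plus a short scan of the buffer.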

    Enhancing entity resolution efficiency with loosely schema-aware techniques - Discussion paper

    Entity Resolution, the task of identifying records that refer to the same real-world entity, is a fundamental step in data integration. Blocking is a widely employed technique to avoid the comparison of all possible record pairs in a dataset (an inefficient approach). Forgoing schema information for blocking has been shown to limit the chance of missing matches (i.e., it guarantees high recall), at the cost of low precision. Meta-blocking alleviates this issue by restructuring a block collection, removing redundant and superfluous comparisons. Yet, existing meta-blocking techniques exclusively rely on schema-agnostic features. In this paper, we investigate how loose schema information, induced directly from the data, can be exploited in a holistic loosely schema-aware (meta-)blocking approach that outperforms state-of-the-art meta-blocking in terms of precision, without sacrificing a high level of recall. We implemented our idea in a system called Blast, and experimentally evaluated it on real-world datasets.
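
As a point of reference, the schema-agnostic baseline that the abstract starts from can be sketched as follows. This is not Blast: the function names, the common-blocks edge weight, and the mean-weight pruning threshold are assumptions chosen to keep the example small.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Schema-agnostic blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for record_id, attributes in records.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(record_id)
    # Blocks with a single record produce no comparisons.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks):
    """Weight each candidate pair by its number of shared blocks, then prune weak pairs."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    if not weights:
        return set()
    mean_weight = sum(weights.values()) / len(weights)  # assumed pruning threshold
    return {pair for pair, weight in weights.items() if weight >= mean_weight}


records = {
    "a": {"name": "canon eos 5d", "brand": "canon"},
    "b": {"title": "eos 5d camera canon"},
    "c": {"name": "nikon d90 camera"},
}
candidates = meta_blocking(token_blocking(records))
```

Here the pair `("a", "b")` shares three blocks and survives, while `("b", "c")` shares only the `camera` block and is pruned.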

    Towards declarative imperative data-parallel systems?

    Motivated by recent developments in the field of declarative networking and data-parallel computation, we propose a first investigation of a declarative imperative parallel programming model that tries to combine the two worlds. We identify a set of requirements that the model should possess and introduce a conceptual sketch of a system implementing the envisioned model.

    The burden of extracutaneous manifestations in juvenile localized scleroderma: A literature review

    Objectives: Juvenile Localized Scleroderma (JLS) is an autoimmune disease leading to fibrosis of skin and subcutaneous tissues affecting children, that is characterized by extracutaneous manifestations (ECM) in about 20 % of patients. JLS and ECM can cause severe disabilities, potentially impacting patients' quality of life (QoL). We aimed to systematically review studies reporting ECM in young patients with JLS. Methods: Pubmed, Cochrane and Scopus databases were searched to identify studies evaluating ECM in children with LS. Selected papers focusing on QoL and multidisciplinary approach were separately analysed. Results: At the end of the selection process, 15 papers (encompassing 3604 children) focused on the description of ECM were included. Overall, ECM were reported in 958/3604 (26.5 %) children, and the 3 most frequent ones were musculoskeletal (24 %), neurological (10.3 %) and odontostomatological (7.6 %). Six papers (435 patients) focusing on QoL in children with JLS were found to be comparable. Three studies focusing on the role of a multidisciplinary team in the management of children and adolescents with JLS and ECM were also selected (216 children). Conclusions: Almost one-third of patients with JLS may present several clinical problems other than skin lesions that should be managed by a multidisciplinary team. However, evidence on the efficacy of multispecialty management is still lacking. The impact of ECM on QoL of these patients may be underestimated, as no specifically developed assessment tool has been applied so far, but recently proposed overall disease severity and disease-specific patient-reported outcome measures may improve the evaluation of this important clinical aspect.

    Entity resolution on camera records without machine learning

    This paper reports the runner-up solution to the ACM SIGMOD 2020 programming contest, whose target was to identify the specifications (i.e., records) collected across 24 e-commerce data sources that refer to the same real-world entities. First, we investigate the machine learning (ML) approach, but surprisingly find that existing state-of-the-art ML-based methods fall short in such a context, not reaching 0.49 F-score. Then, we propose an efficient solution that exploits annotated lists and regular expressions generated by humans that reaches a 0.99 F-score. In our experience, our approach was not more expensive than the dataset labeling of match/non-match pairs required by ML-based methods, in terms of human effort.
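
The general idea of resolving product records with curated lists and regular expressions can be sketched as below. This is only an illustration, not the contest solution: the `BRANDS` list, the `MODEL_PATTERN` expression, and the sample titles are all hypothetical.

```python
import re
from collections import defaultdict

BRANDS = ["canon", "nikon", "sony"]  # hypothetical human-curated brand list
# Hypothetical model pattern: 1-4 letters, optional hyphen, 2-4 digits, optional letter.
MODEL_PATTERN = re.compile(r"\b([a-z]{1,4}-?\d{2,4}[a-z]?)\b")


def signature(title):
    """Extract a (brand, model) key from a title, or None if either part is missing."""
    text = title.lower()
    brand = next((b for b in BRANDS if b in text), None)
    model = MODEL_PATTERN.search(text)
    if brand is None or model is None:
        return None
    # Normalize away the hyphen so "d-90" and "d90" yield the same key.
    return (brand, model.group(1).replace("-", ""))


def resolve(titles):
    """Group title indexes that share the same (brand, model) signature."""
    clusters = defaultdict(list)
    for index, title in enumerate(titles):
        key = signature(title)
        if key is not None:
            clusters[key].append(index)
    return dict(clusters)


titles = [
    "Nikon D-90 digital camera",
    "nikon d90 kit",
    "Sony DSC-W800 compact",
    "unbranded tripod",
]
clusters = resolve(titles)
```

Records with the same signature are treated as the same entity, so no labeled match/non-match pairs are needed; the human effort goes into maintaining the lists and patterns instead.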

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
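
The difference between the two counting schemes can be made concrete with a small sketch. It rests on assumptions that the abstract does not spell out: each citing paper contributes at most one count per author pair, and the function name, the `max_authors` parameter, and the author names are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author pairs cited together in the same paper.

    citing_papers: one reference list per citing paper; each reference is a
    list of author names in byline order.
    max_authors: 1 gives first-author counting, 5 gives first-five counting.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Assumption: a pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


papers = [
    [["Adams", "Baker"], ["Chen"]],
    [["Diaz", "Adams"], ["Chen", "Evans"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

Under first-author counting, Adams and Chen are co-cited once, because Adams is a second author in the second paper. Under first-five counting the same pair is co-cited twice.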

    Schema-agnostic progressive entity resolution

    Entity Resolution (ER) is the task of finding entity profiles that correspond to the same real-world entity. Progressive ER aims to efficiently resolve large datasets when limited time and/or computational resources are available. In practice, its goal is to provide the best possible partial solution by approximating the optimal comparison order of the entity profiles. So far, Progressive ER has only been examined in the context of structured (relational) data sources, as the existing methods rely on schema knowledge to save unnecessary comparisons: they restrict their search space to similar entities with the help of schema-based blocking keys (i.e., signatures that represent the entity profiles). As a result, these solutions are not applicable in Big Data integration applications, which involve large and heterogeneous datasets, such as relational and RDF databases, JSON files, Web corpora, etc. To cover this gap, we propose a family of schema-agnostic Progressive ER methods, which do not require schema information, thus applying to heterogeneous data sources of any schema variety. First, we introduce two naïve schema-agnostic methods, showing that straightforward solutions exhibit poor performance that does not scale well to large volumes of data. Then, we propose four different advanced methods. Through an extensive experimental evaluation over 7 real-world, established datasets, we show that all the advanced methods outperform to a significant extent both the naïve and the state-of-the-art schema-based ones. We also investigate the relative performance of the advanced methods, providing guidelines on the method selection.
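
A minimal sketch of the progressive, schema-agnostic idea follows: candidate pairs are emitted in decreasing order of estimated match likelihood, so that a caller with a limited budget compares the most promising pairs first. It does not reproduce any of the methods proposed in the paper; the shared-token score and the name `progressive_pairs` are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def progressive_pairs(profiles):
    """Yield profile pairs, best first, ranked by the number of tokens they share."""
    # Schema-agnostic: tokens come from all values, attribute names are ignored.
    token_index = defaultdict(set)
    for profile_id, profile in profiles.items():
        for value in profile.values():
            for token in str(value).lower().split():
                token_index[token].add(profile_id)

    shared = defaultdict(int)
    for ids in token_index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1

    # Highest score first; ties are broken by pair identifier for determinism.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    for pair, _ in ranked:
        yield pair


profiles = {
    "p1": {"name": "john smith", "city": "rome"},
    "p2": {"fullName": "smith john"},
    "p3": {"label": "rome airport"},
}
ordered = list(progressive_pairs(profiles))
```

Because the function is a generator, a caller can stop consuming pairs as soon as its time or comparison budget runs out. Materializing and sorting all pairs up front, as done here, is exactly the kind of straightforward approach that would not scale to large datasets.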