
    On Computing Entity Relatedness in Wikipedia, with Applications

    Many text mining tasks, such as clustering, classification, retrieval, and named entity linking, benefit from a measure of relatedness between entities in a knowledge graph. We present a thorough study of all entity relatedness measures in recent literature based on Wikipedia as the knowledge graph. To facilitate this study, we introduce a new dataset with human judgments of entity relatedness. No clear dominance is seen between measures based on textual similarity and graph proximity. Some of the better measures involve expensive global graph computations. We propose a new, space-efficient, computationally lightweight, two-stage framework for relatedness computation. In the first stage, a small weighted subgraph is dynamically grown around the two query entities; in the second stage, relatedness is derived based on computations on this subgraph. Our system shows better agreement with human judgment than existing proposals both on the new dataset and on an established one. Our framework also shows improvements with respect to the state-of-the-art on three different extrinsic evaluations in the domains of ranking entity pairs, entity linking, and synonym extraction.

    The Myriad Virtues of Wavelet Trees

    A new data structure, the wavelet tree, is analysed and discussed with particular attention to data compression.

    Compressing and Querying Integer Dictionaries Under Linearities and Repetitions

    We revisit the fundamental problem of compressing an integer dictionary that supports efficient rank and select operations by exploiting simultaneously two kinds of regularities arising in real data: repetitiveness and approximate linearity. We attack this problem by proposing two novel compressed indexing approaches that extend the classic Lempel-Ziv compression scheme and the more recent block tree data structure with new algorithms and data structures that allow them to also capture regularities in terms of the approximate linearity in the data. Finally, we corroborate these theoretical results with a wide set of experiments on real and synthetic datasets, which allow us to show that our approaches achieve new interesting space-time trade-offs that characterise them as more robust in most practical scenarios compared to the known data structures that exploit only one of the two regularities.
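
To make the two query operations named in the abstract above concrete, here is a minimal uncompressed baseline. It is not one of the compressed indexes the abstract proposes. The class name `SortedIntDictionary` and the conventions chosen (rank counts stored integers less than or equal to the query; select is 1-based) are assumptions for illustration, since definitions vary across the literature.

```python
from bisect import bisect_right


class SortedIntDictionary:
    """Hypothetical uncompressed baseline: a sorted array of distinct integers."""

    def __init__(self, values):
        # Deduplicate and sort once; queries then run in O(log n) or O(1).
        self.values = sorted(set(values))

    def rank(self, x):
        """Number of stored integers that are <= x (assumed convention)."""
        return bisect_right(self.values, x)

    def select(self, i):
        """The i-th smallest stored integer, with i starting at 1."""
        if not 1 <= i <= len(self.values):
            raise IndexError("select index out of range")
        return self.values[i - 1]
```

Under these conventions, `rank(select(i)) == i` for every valid `i`. A compressed dictionary has to answer the same queries while storing the integers in far less space than this plain array.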

    Two-level massive string dictionaries

    We study the problem of engineering space-time efficient data structures that support membership and rank queries on very large static dictionaries of strings. Our solution is based on a very simple approach that decouples string storage and string indexing by means of a block-wise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block. On top of this, we design an in-memory cache that, given a sample of the query workload, augments the Patricia trie with additional information to reduce the number of I/Os of future queries. Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries, compared to Patricia tries, do not provide significant benefits when used in a large-scale indexing setting, and (ii) our two-level approach enables the indexing and storage of 3.5 billion strings taking 273 GB in just under 200 MB of internal memory and 83 GB of compressed disk space, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future design.
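
The decoupling of storage and indexing described in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a sorted Python list of block heads with binary search stands in for the succinct Patricia trie, `zlib`-compressed byte strings held in memory stand in for the compressed blocks in external memory, and the workload-driven cache is omitted. The class name `TwoLevelDictionary`, the `block_size` parameter, and the newline separator (which assumes no stored string contains a newline) are all hypothetical.

```python
import zlib
from bisect import bisect_right


class TwoLevelDictionary:
    """Hypothetical two-level static string dictionary (illustrative only)."""

    def __init__(self, strings, block_size=4):
        keys = sorted(set(strings))
        self.block_size = block_size
        self.heads = []   # first string of each block: the in-memory level
        self.blocks = []  # compressed blocks: stand-in for external memory
        for start in range(0, len(keys), block_size):
            chunk = keys[start:start + block_size]
            self.heads.append(chunk[0])
            self.blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))

    def _load_block(self, b):
        # In a real system this would be the only step that costs an I/O.
        return zlib.decompress(self.blocks[b]).decode("utf-8").split("\n")

    def rank(self, query):
        """Number of stored strings that are <= query."""
        b = bisect_right(self.heads, query) - 1  # last block whose head <= query
        if b < 0:
            return 0
        block = self._load_block(b)
        return b * self.block_size + bisect_right(block, query)

    def contains(self, query):
        """Membership test: decompress at most one block."""
        b = bisect_right(self.heads, query) - 1
        if b < 0:
            return False
        return query in self._load_block(b)
```

Each query touches the in-memory level first and then decompresses a single block, which mirrors the property the abstract relies on: internal memory grows with the number of blocks rather than with the number of strings.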

    On the performance of learned data structures

    A recent trend in algorithm design consists of augmenting classic data structures with machine learning models, which are better suited to reveal and exploit patterns and trends in the input data so as to achieve outstanding practical improvements in space occupancy and time efficiency. This is especially known in the context of indexing data structures for big data where, despite a few attempts at evaluating their asymptotic efficiency, theoretical results showing that learned indexes are provably better than classic indexes, such as B-trees and their variants, are still missing. In this paper, we present the first mathematically-grounded answer to this problem by exploiting a link with a mean exit time problem over a proper stochastic process which, we show, is related to the space and time complexity of these learned indexes. As a corollary of this general analysis, we show that plugging this result into the (learned) PGM-index yields a learned data structure which is provably better than B-trees.
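
The general idea of a learned index mentioned in the abstract above can be illustrated with a deliberately simplified sketch: a model predicts the position of a key in a sorted array, and a search confined to the model's maximum error corrects the prediction. This is not the PGM-index, which uses a piecewise linear approximation with a fixed error bound; the single least-squares line below and the names `LinearLearnedIndex`, `max_error`, and `lookup` are assumptions made for illustration.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Hypothetical single-segment learned index over sorted integer keys."""

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        n = len(self.keys)
        if n == 0:
            raise ValueError("at least one key is required")
        # Fit position ~ slope * key + intercept by ordinary least squares.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys))
        self.slope = cov / var if var > 0 else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Largest gap between predicted and true position over the stored keys.
        self.max_error = max(abs(i - self._predict(k)) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        guess = self._predict(key)
        lo = max(0, guess - self.max_error)
        hi = min(len(self.keys), guess + self.max_error + 1)
        # Binary search restricted to the window the model guarantees.
        i = bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The search window has width at most `2 * max_error + 1`, so lookups are fast when the keys are close to linear in their positions. Replacing one line with several segments keeps that window small on less regular data.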

    Locality Filtering for Efficient Ride Sharing Platforms

    Ride sharing has a tremendous potential to reduce the number of vehicles needed to serve a certain mobility demand. However, although ride sourcing services have flourished in recent years and are widely available worldwide (e.g. Uber, Didi, Lyft, Via), known ride sharing techniques still suffer severe scalability limitations, especially if the goal is combining multiple on-demand ride requests into a single trip within a large urban area. In the context of on-demand mobility systems, a complete enumeration of all candidate trip requests is unfortunately not a practical approach to find the optimal ride sharing solution. An efficient filtering approach is therefore needed in order to avoid both the storage of quadratic shortest-path lookup tables and the exhaustive pairwise comparison of all mobility requests, with their GPS coordinates and time constraints. In this paper we present a ride sharing algorithm which, combined with the shareability networks method, is able to substantially speed up known approaches while only minimally impacting the quality of the computed solution. The key building block is a novel locality filter, which makes it possible to build a pruned version of the shareability network more efficiently in time and space than previous works. We corroborate this novel proposal with a large set of experiments executed over a dataset consisting of one month of trip requests (10^6) performed in two different urban areas, namely Manhattan (NYC) and Singapore. Our experiments show that our approach achieves a 5× speed-up, or even more during so-called 'rush times', and it is robust under different traffic conditions.

    An algorithm for the prediction of annotations on Pubmed

    The inference of novel knowledge and the generation of new hypotheses from the analysis of the current literature is a fundamental process in making new scientific discoveries. Especially in biomedicine, given the enormous amount of literature and knowledge bases available, this process is often complex, and researchers may focus too much on aspects already widely investigated due to poor literature mining. The automatic extraction of information in the form of semantically related terms (or tags) is becoming an aspect of great importance and extensive investigation (Kilicoglu et al., 2012; Stewart et al., 2012). Here we propose a method that consists of the combination of the TAGME algorithm (Ferragina and Scaiella, 2012) with the DT-Hybrid (Alaimo et al., 2013) technique for recommending novel semantically related tags. This combination will be designed in order to extract novel knowledge from a corpus of documents obtained from PubMed.