    Parallel and External-Memory Construction of Minimal Perfect Hash Functions with PTHash

    Full text link
    A function f: U → {0,...,n−1} is a minimal perfect hash function for a set S ⊆ U of size n if f bijectively maps S onto the first n natural numbers. These functions are important for many practical applications in computing, such as search engines, computer networks, and databases. Several algorithms have been proposed to build minimal perfect hash functions that scale well to large sets, retain fast evaluation time, and take very little space, e.g., 2–3 bits/key. PTHash is one such algorithm, achieving very fast evaluation in compressed space, typically many times faster than other techniques. In this work, we propose a new construction algorithm for PTHash enabling (1) multi-threading, to build functions either more quickly or more space-efficiently, and (2) external-memory processing, to scale to inputs much larger than the available internal memory. Only a few other algorithms in the literature share these features, despite their practical impact. We conduct an extensive experimental assessment on large real-world string collections and show that, with respect to other techniques, PTHash is competitive in construction time and space consumption while retaining 2–6× better lookup time.

    Fast Filtering of Search Results Sorted by Attribute

    No full text
    Modern search services often provide multiple options to rank the search results, e.g., sort "by relevance", "by price", or "by discount" in e-commerce. While the traditional rank by relevance effectively places the relevant results in the top positions of the results list, the rank by attribute could place many marginally relevant results at the head of the results list, leading to poor user experience. In the past, this issue has been addressed by investigating the relevance-aware filtering problem, which asks to select the subset of results maximizing the relevance of the attribute-sorted list. Recently, an exact algorithm has been proposed to solve this problem optimally. However, the high computational cost of the algorithm makes it impractical for the Web search scenario, which is characterized by huge lists of results and strict time constraints. For this reason, the problem is often solved using efficient yet inaccurate heuristic algorithms. In this article, we first prove the performance bounds of the existing heuristics. We then propose two efficient and effective algorithms to solve the relevance-aware filtering problem. First, we propose OPT-Filtering, a novel exact algorithm that is faster than the existing state-of-the-art optimal algorithm. Second, we propose an approximate and even more efficient algorithm, ε-Filtering, which, given an allowed approximation error ε, finds a (1−ε)-optimal filtering, i.e., the relevance of its solution is at least (1−ε) times the optimum. We conduct a comprehensive evaluation of the two proposed algorithms against state-of-the-art competitors on two real-world public datasets. Experimental results show that OPT-Filtering achieves a significant speedup of up to two orders of magnitude with respect to the existing optimal solution, while ε-Filtering further improves this result by trading effectiveness for efficiency. In particular, experiments show that ε-Filtering can achieve quasi-optimal solutions while being faster than all state-of-the-art competitors in most of the tested configurations.

    Learning bivariate scoring functions for ranking

    No full text
    State-of-the-art Learning-to-Rank algorithms, e.g., λMART, rely on univariate scoring functions to score a list of items. Univariate scoring functions score each item independently, i.e., without considering the other available items in the list. Nevertheless, ranking deals with producing an effective ordering of the items, and comparisons between items are helpful to achieve this task. Bivariate scoring functions allow the model to exploit dependencies between the items in the list, as they work by scoring pairs of items. In this paper, we exploit item dependencies in a novel framework, which we call the Lambda Bivariate (LB) framework, that makes it possible to learn effective bivariate scoring functions for ranking using gradient boosting trees. We discuss the three main ingredients of LB: (i) the invariance to permutations property, (ii) the function aggregating the scores of all pairs into the per-item scores, and (iii) the optimization process to learn bivariate scoring functions for ranking using any differentiable loss function. We apply LB to the λRank loss and show that it results in learning a bivariate version of λMART, which we call Bi-λMART, that significantly outperforms all neural-network-based and tree-based state-of-the-art algorithms for Learning-to-Rank. To show the generality of LB with respect to other loss functions, we also discuss its application to the Softmax loss.
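Ingredient (ii) above, aggregating pair scores into per-item scores, can be illustrated with a minimal sketch. The abstract does not specify the aggregation function, so the sum over all other items used here is an assumption, and `aggregate_pair_scores` and the toy `pair_score` are hypothetical names, not part of the LB framework.

```python
def aggregate_pair_scores(items, pair_score):
    """Turn a bivariate scorer into per-item scores.

    ASSUMPTION: the per-item score is the sum of pair_score(item, other)
    over every other item in the list. The paper's actual aggregation
    function may differ.
    """
    scores = []
    for i, xi in enumerate(items):
        total = sum(pair_score(xi, xj) for j, xj in enumerate(items) if j != i)
        scores.append(total)
    return scores


# Toy pair scorer (hypothetical): how much the first item "beats" the second.
def toy_pair_score(a, b):
    return a - b


items = [3.0, 1.0, 2.0]
scores = aggregate_pair_scores(items, toy_pair_score)
print(scores)
```

Because the sum ranges over all other items, an item's score does not depend on the order in which those items appear, which is one simple way to obtain invariance to permutations.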

    PTHash: Revisiting FCH Minimal Perfect Hashing

    Full text link
    Given a set S of n distinct keys, a function f that bijectively maps the keys of S onto the range {0,...,n-1} is called a minimal perfect hash function for S. Algorithms that find such functions when n is large and retain constant evaluation time are of practical interest; for instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm that is efficient in three different aspects: time to find f (construction time), time to evaluate f on a key of S (lookup time), and space of representation for f. Several algorithms have been proposed to trade off these aspects. In 1992, Fox, Chen, and Heath (FCH) presented an algorithm at SIGIR providing very fast lookup evaluation. However, the approach received little attention because of its long construction time and higher space consumption compared with subsequent techniques. Almost thirty years later, we revisit their framework and present an improved algorithm that scales well to large sets and reduces space consumption altogether, without compromising the lookup time. We conduct an extensive experimental assessment and show that the algorithm finds functions that are competitive in space with state-of-the-art techniques and provide 2–4× better lookup time.
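The definition above can be made concrete with a toy sketch. This is not PTHash or FCH: it is a brute-force seed search that only works for very small key sets, and the names `seeded_hash`, `build_mphf`, and `evaluate` are hypothetical. It only illustrates what "bijectively maps the keys onto {0,...,n-1}" means in practice.

```python
import hashlib


def seeded_hash(key: str, seed: int) -> int:
    """Deterministic 64-bit hash of (seed, key), built on SHA-256."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_mphf(keys, max_seed=1_000_000):
    """Brute-force search for a seed that makes hash % n a bijection.

    Only feasible for tiny sets: a random seed succeeds with
    probability n!/n^n, which shrinks exponentially in n.
    """
    n = len(keys)
    if len(set(keys)) != n:
        raise ValueError("keys must be distinct")
    for seed in range(max_seed):
        if len({seeded_hash(k, seed) % n for k in keys}) == n:
            return seed
    raise RuntimeError("no suitable seed found within max_seed tries")


def evaluate(key: str, seed: int, n: int) -> int:
    """Lookup: one hash evaluation and one modulo."""
    return seeded_hash(key, seed) % n


keys = ["apple", "banana", "cherry", "date", "elder", "fig"]
seed = build_mphf(keys)
print(sorted(evaluate(k, seed, len(keys)) for k in keys))
```

The exponential cost of this search is precisely why practical constructions split the keys into small buckets and search per bucket rather than for one global seed.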

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
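The two counting schemes compared above can be sketched as follows. This is a simplified illustration, not the study's procedure: it assumes that a pair of authors is counted at most once per citing paper and that coauthors of the same cited work count as co-cited. The function name `cocitation_counts` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations.

    citing_papers: list of reference lists; each reference is the ordered
    author list of one cited work.
    max_authors=1 gives first-author counting; max_authors=5 considers
    the first five authors of each cited work.
    ASSUMPTION: a pair is counted once per citing paper.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


papers = [
    [["A", "B"], ["C"]],   # citing paper 1 cites works by (A, B) and (C)
    [["A"], ["C", "D"]],   # citing paper 2 cites works by (A) and (C, D)
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(dict(first_author))
print(dict(first_five))
```

In this toy data, first-author counting sees only the pair (A, C), while the first-five variant also surfaces B and D.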

    Variations on the Author

    Full text link
    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who propels discourses rather than expressing opinions, an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    Full text link
    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
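The difference between the two best-known measures named above can be shown with a small sketch. It illustrates one mathematical property only (appending positions where both profiles are zero leaves the cosine unchanged but shifts the Pearson correlation) and does not reproduce the paper's own argument or its two new measures. The profile vectors are made-up toy data.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Toy cocitation profiles of two authors (hypothetical data).
x, y = [1, 2, 3], [3, 2, 1]
# Same profiles after adding two authors with whom neither is cocited.
x_padded, y_padded = x + [0, 0], y + [0, 0]

print(cosine(x, y), cosine(x_padded, y_padded))
print(pearson(x, y), pearson(x_padded, y_padded))
```

In this example the Pearson correlation moves from perfectly negative to positive once the zero entries are added, while the cosine stays the same.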

    Data on the distribution of the uncommon Mediterranean sponge Pachymatisma johnstonia (Porifera: Demospongiae)

    No full text
    The sponge Pachymatisma johnstonia (Bowerbank in Johnston, 1842), surveyed mainly along the north-east Atlantic coast, is recorded for the first time in the Southern Adriatic Sea. The specimen was collected at a depth of 228 m, off the Gargano coast (Apulia, Italy). The present study analyzes morphological characters, skeletal elements (spicules), and habitat of P. johnstonia and discusses a comparison with the Atlantic specimens. Moreover, this record extends the distribution of this uncommon species in the Mediterranean Sea.

    An Optimal Algorithm for Finding Champions in Tournament Graphs

    Full text link
    A tournament graph is a complete directed graph, which can be used to model a round-robin tournament between n players. In this paper, we address the problem of finding a champion of the tournament, also known as the Copeland winner, which is a player that wins the highest number of matches. In detail, we aim to investigate algorithms that find the champion by playing a low number of matches. Solving this problem allows us to speed up several Information Retrieval and Recommender System applications, including question answering, conversational search, etc. Indeed, these applications often search for the champion by inducing a round-robin tournament among the players, employing a machine learning model to estimate who wins each pairwise comparison. Our contribution thus allows finding the champion by performing a low number of model inferences. We prove that any deterministic or randomized algorithm finding a champion with constant success probability requires Ω(ℓn) comparisons, where ℓ is the number of matches lost by the champion. We then present an asymptotically optimal deterministic algorithm matching this lower bound without knowing ℓ, and we extend our analysis to three variants of the problem. Lastly, we conduct a comprehensive experimental assessment of the proposed algorithms on a question answering task on public data. Results show that our proposed algorithms speed up the retrieval of the champion by up to 13× with respect to the state-of-the-art algorithm that performs the full tournament.
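For reference, the full-tournament baseline mentioned at the end of the abstract can be sketched as below. This is the brute-force approach that plays all n(n−1)/2 matches, not the paper's optimal algorithm; `copeland_winner` and the `beats` callback are hypothetical names, and `beats` stands in for the machine learning model that decides each pairwise comparison.

```python
from itertools import combinations


def copeland_winner(players, beats):
    """Find a champion by playing every match (full round-robin).

    beats(a, b) must return True if a wins against b, False otherwise.
    Returns the champion, the win counts, and the number of comparisons.
    """
    wins = {p: 0 for p in players}
    comparisons = 0
    for a, b in combinations(players, 2):
        comparisons += 1
        if beats(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    champion = max(players, key=lambda p: wins[p])
    return champion, wins, comparisons


# Toy tournament (hypothetical): the larger number always wins.
players = [0, 1, 2, 3]
champion, wins, comparisons = copeland_winner(players, lambda a, b: a > b)
print(champion, wins, comparisons)
```

Each call to `beats` corresponds to one model inference, so the comparison count is the quantity that a faster algorithm would aim to reduce.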