
    Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms

    We present a technical survey of state-of-the-art approaches to data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching, and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower-bounding techniques.

    On Finding the Jaccard Center

    We initiate the study of finding the Jaccard center of a given collection N of sets. For two sets X, Y, the Jaccard index is defined as $|X \cap Y| / |X \cup Y|$ and the corresponding distance is $1 - |X \cap Y| / |X \cup Y|$. The Jaccard center is a set C minimizing the maximum distance to any set of N. We show that the problem is NP-hard to solve exactly, and that it admits a PTAS while no FPTAS can exist unless P = NP. Furthermore, we show that the problem is fixed-parameter tractable in the maximum Hamming norm between the Jaccard center and any input set. Our algorithms are based on a compression technique similar in spirit to coresets for the Euclidean 1-center problem. In addition, we also show that, contrary to the previously studied median problem by Chierichetti et al. (SODA 2010), the continuous version of the Jaccard center problem admits a simple polynomial-time algorithm.
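    To make the definitions above concrete, the following is a minimal sketch in Python. It is not the paper's algorithm: `brute_force_center` is a hypothetical helper that enumerates every subset of the union of the input sets, so it is exponential and only usable on toy instances. Restricting candidates to that union is safe because an element outside it can only enlarge the union term without enlarging any intersection. Treating two empty sets as being at distance 0 is an assumed convention.

```python
from itertools import chain, combinations


def jaccard_distance(x, y):
    """Return 1 - |X ∩ Y| / |X ∪ Y| for two finite sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0  # assumed convention: two empty sets coincide
    return 1.0 - len(x & y) / len(union)


def brute_force_center(sets):
    """Exhaustively search for a set minimizing the maximum Jaccard distance.

    Hypothetical illustration only: runs in time exponential in the size of
    the union of the (non-empty) input collection.
    """
    universe = sorted(set().union(*sets))
    candidates = chain.from_iterable(
        combinations(universe, r) for r in range(len(universe) + 1)
    )

    def radius(candidate):
        return max(jaccard_distance(candidate, s) for s in sets)

    best = min(candidates, key=radius)
    return set(best), radius(best)


if __name__ == "__main__":
    center, r = brute_force_center([{1, 2}, {2, 3}])
    print(center, r)
```

    On the two-set toy input, the search compares candidates such as {2} (maximum distance 1/2) against {1, 2, 3} (maximum distance 1/3), which illustrates that the center need not be one of the input sets.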

    Sketch 'Em All: Fast Approximate Similarity Search for Dynamic Data Streams

    Recommender systems are an integral part of many web applications. With increasingly larger user bases, scalability has become an important issue. Many of the most scalable algorithms with respect to both space and running times are based on locality-sensitive hashing (LSH). However, a significant drawback is that these methods are only able to handle insertions to user profiles and tend to perform poorly when items may be removed. We initiate the study of scalable locality-sensitive hashing for dynamic input. Specifically, using the Jaccard index as similarity measure, we design (1) a sketching algorithm for similarity estimation via a black-box reduction to l0 norm estimation and (2) a locality-sensitive hashing scheme maintainable in fully dynamic data streams that quickly filters out low-similarity pairs. Our algorithms have little to no overhead in terms of running time compared to previous LSH approaches for the insertion-only case, and drastically outperform previous algorithms when deletions occur.

    PEPPA: a project for evolutionary predator prey algorithms

    The predator-prey model, based on aspects of the natural interplay of predators and prey, has become an alternative method for tackling multi-objective optimization problems. In this process, each predator targets a single objective, and it is expected that the joint influence of all predators affects the prey population in such a way that good solutions survive. This paper describes PEPPA, a modular software framework for designing and analyzing predator-prey models. It allows users to model arbitrary world environments, complex predator behavior, and dynamic prey adaptation. Further, PEPPA provides various tools for modeling, visualization, and parallelization. We explain the architecture and handling of the framework and provide exemplary results on a simple multi-objective benchmark problem.

    Fair Projections as a Means Towards Balanced Recommendations

    The goal of recommender systems is to provide users with suggestions that match their interests, with the eventual goal of increasing their satisfaction, as measured by the number of transactions (clicks, purchases, etc.). Often, this leads to providing recommendations that are of a particular type. For some contexts (e.g., browsing videos for information) this may be undesirable, as it may enforce the creation of filter bubbles. This is because of the existence of underlying bias in the input data of prior user actions. Reducing hidden bias in the data and ensuring fairness in algorithmic data analysis has recently received significant attention. In this paper, we consider both the densest subgraph and the k-clustering problem, two primitives that are being used by some recommender systems. We are given a coloring on the nodes, respectively the points, and aim to compute a fair solution S, consisting of a subgraph or a clustering, such that none of the colors is disparately impacted by the solution. Unfortunately, introducing fair solutions typically makes these problems substantially more difficult. Unlike the unconstrained densest subgraph problem, which is solvable in polynomial time, the fair densest subgraph problem is NP-hard even to approximate. For k-clustering, the fairness constraints make the problem very similar to capacitated clustering, which is a notoriously hard problem to even approximate. Despite such negative premises, we are able to provide positive results in important use cases. In particular, we are able to prove that a suitable spectral embedding allows recovery of an almost optimal, fair, dense subgraph hidden in the input data, whenever one is present, a result that is further supported by experimental evidence. We also show a polynomial-time 2-approximation algorithm for the problem of fair densest subgraph, assuming that there exist only two colors and both colors occur equally often in the graph. This result turns out to be optimal assuming the small set expansion hypothesis. For fair k-clustering, we show that we can recover high-quality fair clusterings effectively and efficiently. For the special case of k-median and k-center, we offer additional, fast and simple approximation algorithms as well as new hardness results. The above theoretical findings drive the design of heuristics, which we experimentally evaluate on a scenario based on real data, in which our aim is to strike a good balance between diversity and highly correlated items from Amazon co-purchasing graphs and Facebook contacts.

    Solving the Minimum String Cover Problem

    A string cover C of a set of strings S is a set of substrings from S such that every string in S can be written as a concatenation of the strings in C. Given costs assigned to each substring from S, the Minimum String Cover (MSC) problem asks for a cover of minimum total cost. This NP-hard problem has so far only been approached from a purely theoretical perspective. A previous integer linear programming (ILP) formulation was designed for a special case, in which each string in S must be generated by a (small) constant number of substrings. If this restriction is removed, the ILP has an exponential number of variables, for which we show the pricing problem to be NP-hard. We propose an alternative flow-based ILP formulation of polynomial size, whose structure is particularly favorable for a Lagrangian relaxation approach. By making use of the strong bounds obtained through a repeated shortest path computation in a branch-and-bound manner, we show for the first time that non-trivial MSC instances can be solved to provable optimality in reasonable time. We also provide and solve real-world instances derived from the classic text "Alice in Wonderland". On almost all instances, our Lagrangian relaxation approach outperforms a CPLEX-based implementation by an order of magnitude. Our software is available under the terms of the GNU General Public License.
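    The definition of a string cover can be checked mechanically. The sketch below, in Python, only verifies whether a given candidate set is a cover and sums its cost; it does not solve MSC and is unrelated to the ILP or Lagrangian approach described in the abstract. The function names and the `cost` dictionary are hypothetical, chosen for illustration.

```python
def can_concatenate(s, cover):
    """Return True if s can be written as a concatenation of strings in cover.

    reachable[i] records whether the prefix s[:i] can be built from pieces.
    """
    reachable = [False] * (len(s) + 1)
    reachable[0] = True  # the empty prefix needs no pieces
    for i in range(len(s)):
        if not reachable[i]:
            continue
        for piece in cover:
            # Skip empty pieces; they never make progress.
            if piece and s.startswith(piece, i):
                reachable[i + len(piece)] = True
    return reachable[len(s)]


def is_string_cover(strings, cover):
    """Check that every string in `strings` is generated by `cover`."""
    return all(can_concatenate(s, cover) for s in strings)


def cover_cost(cover, cost):
    """Total cost of a cover, given a (hypothetical) cost per substring."""
    return sum(cost[piece] for piece in cover)


if __name__ == "__main__":
    strings = ["abab", "ba"]
    cost = {"a": 1, "b": 1, "ab": 1, "ba": 3}
    for cover in ({"ab", "ba"}, {"a", "b"}, {"ab"}):
        valid = is_string_cover(strings, cover)
        print(sorted(cover), valid, cover_cost(cover, cost))
```

    In the toy instance, both {"ab", "ba"} and {"a", "b"} are valid covers but have different total costs under the assumed cost table, which is the trade-off the optimization problem is about.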

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
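    The two counting schemes compared above can be sketched as follows in Python. This is an illustrative reading, not the study's procedure: the data layout (one list of cited works per citing paper, each work given as an ordered author list) and the `max_authors` parameter are assumptions, and so is the simplification that co-authors of the same cited work count as co-cited with each other.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists, max_authors=1):
    """Count how many citing papers co-cite each pair of authors.

    reference_lists: one entry per citing paper; each entry is a list of
        cited works, and each cited work is an ordered list of author names.
    max_authors: 1 reproduces first-author counting; 5 takes the first five
        authors of every cited work into account.
    """
    counts = Counter()
    for cited_works in reference_lists:
        authors = set()
        for work_authors in cited_works:
            authors.update(work_authors[:max_authors])
        # Each unordered author pair is counted once per citing paper.
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy data: two citing papers, two cited works each.
    reference_lists = [
        [["Ann", "Bob"], ["Cy"]],
        [["Ann"], ["Cy", "Dee"]],
    ]
    print(cocitation_counts(reference_lists, max_authors=1))
    print(cocitation_counts(reference_lists, max_authors=5))
```

    With first-author counting, only the pair (Ann, Cy) appears in the toy data; with the first five authors included, Bob and Dee enter the co-citation data as well.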

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
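    As a small numerical illustration of how the two measures can disagree, the Python sketch below compares the Pearson correlation and the cosine on two made-up cocitation profiles. The abstract does not state the authors' specific objection to the Pearson correlation; the sensitivity to added zero entries shown here is one commonly discussed concern and is offered as an assumption, not as their argument. The two previously unused measures mentioned in the abstract are not reproduced.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors.

    Undefined (division by zero) if either vector is constant.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


if __name__ == "__main__":
    # Hypothetical cocitation profiles of two authors over three other authors.
    x = [3, 1, 2]
    y = [1, 3, 2]
    # The same profiles after adding five authors cocited with neither.
    padding = [0] * 5
    print(cosine(x, y), cosine(x + padding, y + padding))
    print(pearson(x, y), pearson(x + padding, y + padding))
```

    In this toy case the cosine is unchanged by the added zero entries, while the Pearson correlation moves from negative to positive.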