1,720,978 research outputs found

    Center-Based Approximation of a Drifting Distribution

    No full text
    We present a novel technique for computing a center-based approximation of a drifting distribution. Given k ≥ 1 and a stream of data, whose distribution is changing over time, the goal is to compute, at each step, the best k centers representation of the current distribution, despite possibly having only a single sample from the most recent distribution. In data mining, this is traditionally attempted through the sliding-window mechanism, where the analysis is performed on the most recent fixed-size segment of the data. The problems with this approach are twofold: (1) setting the correct window size is challenging; and (2) a fixed window size cannot effectively track changes in the distribution happening at variable speed. In this paper, we propose a new methodology that dynamically adjusts the window size based on the recent drift of the data. The challenge is that it is not possible to explicitly estimate the drift, as we may have only a single data point from each distribution. Our ..

    Distributed graph diameter approximation

    Full text link
    We present an algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. In order to be efficient in terms of both time and space, our algorithm is based on a decomposition strategy which partitions the graph into disjoint clusters of bounded radius. Theoretically, our algorithm uses linear space and yields a polylogarithmic approximation guarantee; most importantly, for a large family of graphs, it features a round complexity asymptotically smaller than the one exhibited by a natural approximation algorithm based on the state-of-the-art ∆-stepping SSSP algorithm, which is its only practical, linear-space competitor in the distributed setting. We complement our theoretical findings with a proof-of-concept experimental analysis on large benchmark graphs, which suggests that our algorithm may attain substantial improvements in terms of running time compared to the aforementioned competitor, while featuring, in practice, a similar approximation ratio

    Mining top-K frequent itemsets through progressive sampling

    No full text
    We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets’ frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) enable to approximate the actual top-K frequent itemsets with accuracy much higher than what analytically proved

    Tight Bounds on Information Dissemination in Sparse Mobile Networks

    No full text
    Motivated by the growing interest in mobile systems, we study the dynamics of information dissemination between agents moving independently on a plane. Formally, we consider k mobile agents performing independent random walks on an n-node grid. At time 0, each agent is located at a random node of the grid and one agent has a rumor. The spread of the rumor is governed by a dynamic communication graph process {G_t(r) | t >=0}, where two agents are connected by an edge in G_t(r) if their distance at time t is within their transmission radius r. Modeling the physical reality that the speed of radio transmission is much faster than the motion of the agents, we assume that the rumor can travel throughout a connected component of G_t before the graph is altered by the motion. We study the broad- cast time T_B of the system, which is the time it takes for all agents to know the rumor. We focus on the sparse case (below the percolation point r_c ~ sqrt(n/k) ) where, with high probability, no connected component in Gt has more than a logarithmic number of agents and the broadcast time is dominated by the time it takes for many independent random walks to meet one other. Quite surprisingly, we show that for a system below the percolation point, the broadcast time does not depend on the transmission radius. In fact, we prove that TB = Theta~(n/sqrt(k)) for any 0 <= r < r_c, even when the transmission range is signicantly larger than the mobility range in one step, giving a tight characterization up to logarithmic factors. Our result complements a recent result of Peres et al. (SODA 2011) who showed that above the percolation point the broadcast time is polylogarithmic in k

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed

    MADMX: A Novel Strategy for Maximal Dense Motif Extraction

    No full text
    We develop, analyze and experiment with a new tool, called madmx, which extracts frequent motifs, possibly including don't care characters, from biological sequences. We introduce density, a simple and flexible measure for bounding the number of don't cares in a motif, defined as the ratio of solid (i.e., different from don't care) characters to the total length of the motif. By extracting only maximal dense motifs, madmx reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by madmx

    MADMX: A Strategy for Maximal Dense Motif Extraction

    No full text
    We develop, analyze, and experiment with a new tool, called madmx, which extracts frequent motifs from biological sequences. We introduce the notion of density to single out the ‘‘significant’’ motifs. The density is a simple and flexible measure for bounding the number of don’t cares in a motif, defined as the fraction of solid (i.e., different from don’t care) characters in the motif. A maximal dense motif has density above a certain threshold, and any further specialization of a don’t care symbol in it or any extension of its boundaries decreases its number of occurrences in the input sequence. By extracting only maximal dense motifs, madmx reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by madmx

    Variations on the Author

    Full text link
    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship

    Appropriate Similarity Measures for Author Cocitation Analysis

    Full text link
    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.information science;Pearson correlation;cosine;similarity measure;author cocitation analysis
    corecore