1,720,978 research outputs found
Center-Based Approximation of a Drifting Distribution
We present a novel technique for computing a center-based approximation of a drifting distribution. Given k ≥ 1 and a stream of data, whose distribution is changing over time, the goal is to compute, at each step, the best k centers representation of the current distribution, despite possibly having only a single sample from the most recent distribution. In data mining, this is traditionally attempted through the sliding-window mechanism, where the analysis is performed on the most recent fixed-size segment of the data. The problems with this approach are twofold: (1) setting the correct window size is challenging; and (2) a fixed window size cannot effectively track changes in the distribution happening at variable speed. In this paper, we propose a new methodology that dynamically adjusts the window size based on the recent drift of the data. The challenge is that it is not possible to explicitly estimate the drift, as we may have only a single data point from each distribution. Our ..
Distributed graph diameter approximation
We present an algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. In order to be efficient in terms of both time and space, our algorithm is based on a decomposition strategy which partitions the graph into disjoint clusters of bounded radius. Theoretically, our algorithm uses linear space and yields a polylogarithmic approximation guarantee; most importantly, for a large family of graphs, it features a round complexity asymptotically smaller than the one exhibited by a natural approximation algorithm based on the state-of-the-art ∆-stepping SSSP algorithm, which is its only practical, linear-space competitor in the distributed setting. We complement our theoretical findings with a proof-of-concept experimental analysis on large benchmark graphs, which suggests that our algorithm may attain substantial improvements in terms of running time compared to the aforementioned competitor, while featuring, in practice, a similar approximation ratio
Mining top-K frequent itemsets through progressive sampling
We study the use of sampling for efficiently mining the top-K
frequent itemsets of cardinality at most w. To this purpose, we define an
approximation to the top-K frequent itemsets to be a family of itemsets
which includes (resp., excludes) all very frequent (resp., very infrequent)
itemsets, together with an estimate of these itemsets’ frequencies with
a bounded error. Our first result is an upper bound on the sample size
which guarantees that the top-K frequent itemsets mined from a random
sample of that size approximate the actual top-K frequent itemsets, with
probability larger than a specified value. We show that the upper bound
is asymptotically tight when w is constant.
Our main algorithmic contribution is a progressive sampling approach,
combined with suitable stopping conditions, which on appropriate inputs
is able to extract approximate top-K frequent itemsets from samples
whose sizes are smaller than the general upper bound. In order to test
the stopping conditions, this approach maintains the frequency of all
itemsets encountered, which is practical only for small w. However, we
show how this problem can be mitigated by using a variation of Bloom
filters.
A number of experiments conducted on both synthetic and real benchmark
datasets show that using samples substantially smaller than the
original dataset (i.e., of size defined by the upper bound or reached
through the progressive sampling approach) enable to approximate the
actual top-K frequent itemsets with accuracy much higher than what
analytically proved
Tight Bounds on Information Dissemination in Sparse Mobile Networks
Motivated by the growing interest in mobile systems, we
study the dynamics of information dissemination between
agents moving independently on a plane. Formally, we consider
k mobile agents performing independent random walks
on an n-node grid. At time 0, each agent is located at a
random node of the grid and one agent has a rumor. The
spread of the rumor is governed by a dynamic communication
graph process {G_t(r) | t >=0}, where two agents are
connected by an edge in G_t(r) if their distance at time t
is within their transmission radius r. Modeling the physical
reality that the speed of radio transmission is much faster
than the motion of the agents, we assume that the rumor
can travel throughout a connected component of G_t before
the graph is altered by the motion. We study the broad-
cast time T_B of the system, which is the time it takes for
all agents to know the rumor. We focus on the sparse case
(below the percolation point r_c ~ sqrt(n/k) ) where, with high
probability, no connected component in Gt has more than
a logarithmic number of agents and the broadcast time is
dominated by the time it takes for many independent random
walks to meet one other. Quite surprisingly, we show
that for a system below the percolation point, the broadcast
time does not depend on the transmission radius. In fact, we prove that TB = Theta~(n/sqrt(k)) for any 0 <= r < r_c, even
when the transmission range is signicantly larger than the
mobility range in one step, giving a tight characterization
up to logarithmic factors. Our result complements a recent
result of Peres et al. (SODA 2011) who showed that above
the percolation point the broadcast time is polylogarithmic
in k
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
MADMX: A Novel Strategy for Maximal Dense Motif Extraction
We develop, analyze and experiment with a new tool, called madmx, which extracts frequent motifs, possibly including don't care characters, from biological sequences. We introduce density, a simple and flexible measure for bounding the number of don't cares in a motif, defined as the ratio of solid (i.e., different from don't care) characters to the total length of the motif. By extracting only maximal dense motifs, madmx reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by madmx
MADMX: A Strategy for Maximal Dense Motif Extraction
We develop, analyze, and experiment with a new tool, called madmx, which extracts frequent motifs from biological sequences. We introduce the notion of density to single out the ‘‘significant’’ motifs. The density is a simple and flexible measure for bounding the number of don’t cares in a motif, defined as the fraction of solid (i.e., different from don’t care) characters in the motif. A maximal dense motif has density above a certain threshold, and any further specialization of a don’t care symbol in it or any extension of its boundaries decreases its number of occurrences in the input sequence. By extracting only maximal dense motifs, madmx reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by madmx
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.information science;Pearson correlation;cosine;similarity measure;author cocitation analysis
- …
