1,721,198 research outputs found

    Mining frequent items in unstructured P2P networks

    Full text link
    Large scale decentralized systems, such as P2P, sensor or loT device networks are becoming increasingly common, and require robust protocols to address the challenges posed by the distribution of data and the large number of peers belonging to the network. In this paper, we deal with the problem of mining frequent items in unstructured P2P networks. This problem, of practical importance, has many useful applications. We design P2PSS, a fully decentralized, gossip-based protocol for frequent items discovery, leveraging the Space-Saving algorithm. We formally prove the correctness and theoretical error bound. Extensive experimental results clearly show that P2PSS provides very good accuracy and scalability, also in the presence of highly dynamic P2P networks with churning. To the best of our knowledge, this is the first gossip-based distributed algorithm providing strong theoretical guarantees for both the Approximate Frequent Items Problem in Unstructured P2P Networks and for the frequency estimation of discovered frequent items

    AFQN: approximate Qn estimation in data streams

    Full text link
    We present afqn (Approximate Fast Qn), a novel algorithm for approximate computation of the Qn scale estimator in a streaming setting, in the sliding window model. It is well-known that computing the Qn estimator exactly may be too costly for some applications, and the problem is a fortiori exacerbated in the streaming setting, in which the time available to process incoming data stream items is short. In this paper we show how to efficiently and accurately approximate the Qn estimator. As an application, we show the use of afqn for fast detection of outliers in data streams. In particular, the outliers are detected in the sliding window model, with a simple check based on the Qn scale estimator. Extensive experimental results on synthetic and real datasets confirm the validity of our approach by showing up to three times faster updates per second. Our contributions are the following ones: (i) to the best of our knowledge, we present the first approximation algorithm for online computation of the Qn scale estimator in a streaming setting and in the sliding window model; (ii) we show how to take advantage of our UDDSketch algorithm for quantile estimation in order to quickly compute the Qn scale estimator; (iii) as an example of a possible application of the Qn scale estimator, we discuss how to detect outliers in an input data stream

    Fast online computation of the Qn estimator with applications to the detection of outliers in data streams

    Full text link
    We present FQN (Fast Qn), a novel algorithm for online computation of the Qn scale estimator. The algorithm works in the sliding window model, cleverly computing the Qn scale estimator in the current window. We thoroughly compare our algorithm for online Qn with the state of the art competing algorithm by Nunkesser et al., and show that FQN (i) is faster, requiring only O(s) time in the worst case where s is the length of the window (ii) its computational complexity does not depend on the input distribution and (iii) it requires less space. To the best of our knowledge, our algorithm is the first that allows online computation of the Qn scale estimator in worst case time linear in the size of the window. As an example of a possible application, besides its use as a robust measure of statistical dispersion, we show how to use the Qn estimator for fast detection of outliers in data streams. Extensive experimental results on both synthetic and real datasets confirm the validity of our approach

    Data stream fusion for accurate quantile tracking and analysis

    No full text
    UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse two or more data streams (or datasets) by leveraging the mergeability of the UDDSKETCH data summaries. In general, two summaries on two data streams are said to be mergeable if there exists an algorithm that allows combining the two summaries into a single one related to the union of the two datasets, simultaneously preserving the error and size guarantees. The property of mergeability of a sketch enables the parallel and distributed processing of big volume data streams that can be compressed and fused by means of such mergeable data structures. Among the applications strictly related to accurate tracking of quantiles, requiring parallel and/or distributed processing we recall here estimating the latency of a web site, database query optimizers and the need of succinctly summarizing the distribution of values occurring over a sensor network. We prove that UDDSKETCH is fully mergeable and introduce PUDDSKETCH, a parallel version of UDDSKETCH suitable for message-passing based architectures. We formally prove its correctness and compare it to a parallel version of DDSKETCH, showing through extensive experimental results that our parallel algorithm almost always outperforms the parallel DDSKETCH algorithm with regard to the overall accuracy in determining the quantiles. Moreover, we also design and implement parallel versions of both the state of the art KLL and REQ sequential algorithms in order to compare and contrast PUDDSKETCH versus the corresponding parallel algorithms. Our experiments clearly show that PUDDSKETCH is faster or on par with regard to parallel running time, whilst providing simultaneously greater accuracy

    Toward Enhanced Support for Ship Sailing

    Full text link
    Ship sailing is a complex endeavour, requiring carefully considered proactive and reactive strategies in choosing the course of action that best suits the various events to be managed. Humans are already supported by different technologies for sailing, however these technologies are usually available in isolation. In this paper we show how to use simultaneously three different technologies by fusing their information in order to provide enhanced support for ship sailing. To the best of our knowledge no similar approach is reported in the literature from an operational point of view. In particular, we show how to fuse the video acquired from a camera with the information available from a radar/Lidar and an AIS receiver. The video frames are analyzed in order to automatically detect surrounding ships and seamarks, the Lidar is used to determine the average or minimum distance from the ship to the acquired targets and finally the AIS receiver logs are queried to determine, if available, useful information related to the surrounding ships, such as their geographic location, type of ship etc. Our experimental results are promising and encouraging. We believe that the simultaneous use of these technologies is a step towards fully autonomous ship sailing

    An Adaptive Clustering Approach for Distributed Outlier Detection in Data Streams

    No full text
    Many real-world problems deal with collections of high-dimensional data, i.e., data with many different features. A dataset exhibiting a high number of features incurs the so-called curse of dimensionality: when the dimensionality increases, the volume of the space increases at a fast rate, causing the sparseness of the data. This makes challenging clustering high-dimensional data for outlier detection purposes. In this paper, we design and implement a distributed peer to peer version of an algorithm that addresses the curse of dimensionality by generating candidate subspaces from the high-dimensional space through Principal Component Analysis. The experimental results show that if the parameters of the distributed algorithm are properly set, then the distributed algorithm converges to the results provided by the sequential algorithm, which is a fundamental and highly desirable property

    Deterministic, Fast and Accurate Solution of the Heavy Hitters q-Tail Latencies Problem

    No full text
    The heavy hitters q -tail latencies problem has been introduced recently. This problem, framed in the context of data stream monitoring, requires approximating the quantiles of the heavy hitters items of an input stream whose elements are pairs (item, latency). The underlying rationale is that heavy hitters are obviously among the most important items to be monitored, and their associated latency quantiles are of extreme interest in several network monitoring applications. Currently, two randomized (SQUARE and SQUAD) and one deterministic (QUASI) algorithms are available to solve the problem. In this paper, we present a novel deterministic algorithm and empirically show that it outperforms QUASI, the current state of the art deterministic algorithm for the problem, with regard to accuracy and speed

    Parallel Mining of Correlated Heavy Hitters on Distributed and Shared-Memory Architectures

    No full text
    We present parallel algorithms for mining Correlated Heavy Hitters from a two-dimensional data stream. In particular, we design and implement a message-passing, a shared-memory and a hybrid algorithm. To the best of our knowledge, these are the first parallel algorithms solving the problem. We show, through experimental results, that our algorithms provide very good scalability, whilst retaining the accuracy of their sequential counterpart

    High Throughput Protein Similarity Searches in the LIBI Grid Problem Solving Environment

    No full text
    Bioinformatics applications are naturally distributed, due to distribution of involved data sets, experimental data and biological databases. They require high computing power, owing to the large size of data sets and the complexity of basic computations, may access heterogeneous data, where heterogeneity is in data format, access policy, distribution, etc., and require a secure infrastructure, because they could access private data owned by different organizations. The Problem Solving Environment (PSE) is an approach and a technology that can fulfil such bioinformatics requirements. The PSE can be used for the definition and composition of complex applications, hiding programming and configuration details to the user that can concentrate only on the specific problem. Moreover, Grids can be used for building geographically distributed collaborative problem solving environments and Grid aware PSEs can search and use dispersed high performance computing, networking, and data resources. In this work, the PSE solution has been chosen as the integration platform of bioinformatics tools and data sources. In particular an experiment of multiple sequence alignment on large scale, supported by the LIBI PSE, is presented
    corecore