1,720,971 research outputs found

    Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees

    Full text link
    Mining time series motifs is a fundamental, yet expensive task in exploratory data analytics. In this paper, we therefore propose a fast method to find the top-k motifs with probabilistic guarantees. Our probabilistic approach is based on Locality Sensitive Hashing and allows to prune most of the distance computations, leading to huge speedups. We improve on a straightforward application of LSH to time series data by developing a self-tuning algorithm that adapts to the data distribution. Furthermore, we include several optimizations to the algorithm, reducing redundant computations and leveraging the structure of time series data to speed up LSH computations. We prove the correctness of the algorithm and provide bounds to the cost of the basic operations it performs. An experimental evaluation shows that our algorithm is able to tackle time series of one billion points on a single CPU-based machine, performing orders of magnitude faster than the GPU-based state of the art

    What’s New in Temporal Databases?

    No full text
    Temporal databases has been an active research area since many decades, ranging from research work on query processing, most dominantly on selection and join queries, to new directions in models and semantics, such as for instance temporal probabilistic or streaming data. At the same time more database vendors have been integrating temporal features into their systems, most notably, the temporal features of the SQL standard. In this paper, we summarize the latest research developments as presented in 30 research papers over the last five years in the context of temporal relational databases. Additionally, we also describe the developments of industrial database systems and vendors

    The role of local dimensionality measures in benchmarking nearest neighbor search

    Full text link
    This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concepts of local intrinsic dimensionality (LID), local relative contrast (RC), and query expansion allow to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of the distribution of these dimensionality measures on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that allow to get a more fine-grained overview of the inner workings of nearest neighbor search principles. Interactive visualizations are available on the companion website.1 The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well

    A General Coreset-Based Approach to Diversity Maximization under Matroid Constraints

    Full text link
    Diversity maximization is a fundamental problem in web search and data mining. For a given dataset S of n elements, the problem requires to determine a subset of S containing kg n "representatives"which maximize some diversity function expressed in terms of pairwise distances, where distance models dissimilarity. An important variant of the problem prescribes that the solution satisfy an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size k of a given matroid). While unconstrained diversity maximization admits efficient coreset-based strategies for several diversity functions, known approaches dealing with the additional matroid constraint apply only to one diversity function (sum of distances), and are based on an expensive, inherently sequential, local search over the entire input dataset. We devise the first coreset-based algorithms for diversity maximization under matroid constraints for various diversity functions, together with efficient sequential, MapReduce, and Streaming implementations. Technically, our algorithms rely on the construction of a small coreset, that is, a subset of S containing a feasible solution which is no more than a factor 1-I away from the optimal solution for S. While our algorithms are fully general, for the partition and transversal matroids, if I is a constant in (0,1) and S has bounded doubling dimension, the coreset size is independent of n and it is small enough to afford the execution of a slow sequential algorithm to extract a final, accurate, solution in reasonable time. Extensive experiments show that our algorithms are accurate, fast, and scalable, and therefore they are capable of dealing with the large input instances typical of the big data scenario

    Implementing Distributed Approximate Similarity Joins using Locality Sensitive Hashing

    No full text
    ABSTRACTSimilarity joins are a basic primitive in data mining. Given twosets of points, we are interested in reporting all pairs of pointswhose similarity is above a user-defined threshold. Solving theproblem naively entails verifying all possible pairs, which canbe infeasible for large inputs. In such contexts, Locality Sensitive Hashing (LSH) is often considered to reduce the number ofpairs to verify. However, while it provides subquadratic runningtime, large input sets make it nevertheless necessary to resortto distributed computing. Hu, Yi, and Tao (PODS’17, TODS’19)proposed a nearly load-optimal LSH-based join algorithm andprovided a small-scale experimental study in a distributed setting.This paper provides further analysis of their approach. It showsthat the load-minimizing parameter settings by Hu et al. incurtoo much local work, rendering it impractical.To remedy this drawback, we propose two approaches: Thefirst distributes work in a data-independent way, while the secondadapts to the data distribution using LSH. Both schemes then useLSH to solve subproblems locally. This allows to balance loadand the amount of local work.Through an experimental evaluation, we show that the transition from theory to practice for Hu et al.’s approach is challenging: it is hard to strike a good tradeoff between the load and theamount of local work of each processor, load balancing is itselfan issue, and LSH may introduce duplicates in the output. Ourextensive experimental evaluation is supported by an efficientopen source implementation of all the methods we test.Our results highlight the need for an holistic approach: onlyfocusing on the load, as tradition in the MPC model, might notmake efficient use the available resources and better trade-offsbetween local work and load are possible

    Benchmarking Nearest Neighbor Search: Influence of Local Intrinsic Dimensionality and Result Diversity in Real-World Datasets

    Full text link
    This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) allows to choose query sets of a wide range of diculty for real-world datasets. Moreover, the eect of dierent LID distributions on the running time performance of implementations is empirically studied. To this end, dierent visualization concepts are introduced that allow to get a more ne-grained overview of the inner workings of nearest neighbor earch principles. The paper closes with remarksabout the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well

    The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search

    Full text link
    This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) allows to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that allow to get a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well

    Running Experiments with Confidence and Sanity

    Full text link
    Analyzing data from large experimental suites is a daily task for anyone doing experimental algorithmics. In this paper we report on several approaches we tried for this seemingly mundane task in a similarity search setting, reflecting on the challenges it poses.We conclude by proposing a workflow, which can be implemented using several tools, that allows to analyze experimental data with confidence.The extended version of this paper and the support code are provided at https://github.com/Cecca/running-experiments
    corecore