
    Anomaly Detection and Repair for Accurate Predictions in Geo-distributed Big Data

    The increasing presence of geo-distributed sensor networks implies the generation of huge volumes of data from multiple geographical locations at an increasing rate. This raises important issues, which become more challenging when the final goal is to analyze the data for forecasting purposes or, more generally, for predictive tasks. This paper proposes a framework which supports predictive modeling tasks from streaming data coming from multiple geo-referenced sensors. In particular, we propose a distance-based anomaly detection strategy which considers objects described by embedding features learned via a stacked auto-encoder. We then devise a repair strategy which repairs the data detected as anomalous by exploiting non-anomalous data measured by sensors in nearby spatial locations. Subsequently, we adopt Gradient Boosted Trees (GBTs) to predict or forecast the values assumed by a target variable of interest for the repaired, newly arriving (unlabeled) data, using either the original feature representation or the embedding feature representation learned via the stacked auto-encoder. The workflow is implemented with distributed Apache Spark programming primitives and tested in a cluster environment. We perform experiments to assess the performance of each module, separately and in combination, considering the predictive modeling of one-day-ahead energy production for multiple renewable energy sites. Accuracy results show that the proposed framework reduces the error by up to 13.56%. Moreover, scalability results demonstrate the efficiency of the proposed framework in terms of speedup, scaleup and execution time under a stress test.
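
The detect-then-repair idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the centroid-distance rule, the fixed `threshold`, the neighbor count `k`, and all function names are assumptions made for illustration, and the embeddings are taken as already computed (no auto-encoder or Spark code is shown).

```python
import math
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_anomalies(embeddings, threshold):
    """Flag objects whose embedding lies farther than `threshold` from the centroid.

    Assumption: a simple centroid-distance rule stands in for the
    distance-based strategy, whose details the abstract does not give.
    """
    dim = len(embeddings[0])
    centroid = [mean(e[i] for e in embeddings) for i in range(dim)]
    return [euclidean(e, centroid) > threshold for e in embeddings]


def repair(values, locations, is_anomalous, k=2):
    """Replace each anomalous reading with the mean of the k spatially
    nearest non-anomalous readings (hypothetical repair rule)."""
    repaired = list(values)
    clean = [i for i, flag in enumerate(is_anomalous) if not flag]
    for i, flag in enumerate(is_anomalous):
        if flag and clean:
            nearest = sorted(
                clean, key=lambda j: euclidean(locations[i], locations[j])
            )[:k]
            repaired[i] = mean(values[j] for j in nearest)
    return repaired


if __name__ == "__main__":
    # Toy data: four sensors on a unit square; the last one misbehaves.
    embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
    locations = [(0, 0), (0, 1), (1, 0), (1, 1)]
    values = [10.0, 12.0, 11.0, 90.0]
    flags = detect_anomalies(embeddings, threshold=3.0)
    print(flags)
    print(repair(values, locations, flags, k=2))
```

In this toy setup the fourth sensor is flagged and its reading is replaced by the average of its two nearest clean neighbors. A predictive model would then be trained on the repaired values.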

    Deep learning versus conventional learning in data streams with concept drifts

    In many real-world applications, the characteristics of data collected by activity logs, sensors and mobile devices change over time. This behavior is known as concept drift. In complex environments, which produce high-dimensional data streams, machine learning tasks become cumbersome, as models become outdated very quickly. In our study, we assess hundreds of combinations of data characteristics and methods on network traffic data. Specifically, we focus on seven conventional machine learning and deep learning methods and compare their generalization power in the presence of concept drift. Our results show that Convolutional Neural Networks (CNNs) outperform conventional methods, even when compared to an idealized upper bound on their performance created in a piecewise manner by selecting the best method and its best configuration at each point in time, thus mimicking the output of a perfect meta-learning architecture. In the context of sequential data subject to concept drift, our results appear to defy the usually accepted 'No Free Lunch (NFL) Theorem', which stipulates that no method dominates all the others in every situation. While this is by no means a rejection of the NFL Theorem, which captures a much more complex phenomenon, it is nonetheless a surprising result worth further investigation. Our results also show that, when data availability is limited, a meta-learning approach is preferable to CNNs, as it requires less data for training.
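
The piecewise upper bound described in this abstract (picking the best method at each point in time) can be sketched as follows. This is only an illustration under assumptions: the method names, the error values, and the function name `oracle_selection` are hypothetical, and per-window errors are assumed to be already measured.

```python
def oracle_selection(errors_by_method):
    """For each time window, pick the method with the lowest error.

    `errors_by_method` maps a method name to its list of per-window
    errors; all lists are assumed to have the same length.
    Returns a list of (best_method, lowest_error) pairs, one per window.
    """
    n_windows = len(next(iter(errors_by_method.values())))
    best = []
    for t in range(n_windows):
        name = min(errors_by_method, key=lambda m: errors_by_method[m][t])
        best.append((name, errors_by_method[name][t]))
    return best


if __name__ == "__main__":
    # Hypothetical per-window errors for two conventional methods.
    errors = {
        "svm": [0.2, 0.5, 0.3],
        "random_forest": [0.3, 0.1, 0.4],
    }
    for window, (method, err) in enumerate(oracle_selection(errors)):
        print(window, method, err)
```

Such an oracle cannot be built in practice, since it needs the true errors in advance. It serves only as a reference line against which a single model can be compared.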

    Spark-GHSOM: Growing Hierarchical Self-Organizing Map for large scale mixed attribute datasets

    The Growing Hierarchical Self-Organizing Map (GHSOM) algorithm has shown its potential for performing several tasks, such as exploratory analysis, anomaly detection and forecasting, in a variety of domains including the financial and cyber-security domains. GHSOM is a dynamic variant of the SOM algorithm which generates a multi-level hierarchy of SOM maps based solely on input data. However, in order to generate this multi-level structure, GHSOM requires multiple iterations over the input dataset, thus making it intractable on large datasets. Moreover, the conventional GHSOM algorithm is designed to handle datasets with numeric attributes only. This represents an important limitation, as most modern real-world datasets are characterized by mixed attributes, both numerical and categorical. In this work, we propose an extension of the conventional GHSOM algorithm called Spark-GHSOM, which exploits the Spark platform to process massive datasets in a distributed manner. Moreover, we leverage a method known as the distance hierarchy approach to modify the optimization function of GHSOM so that it can also coherently handle mixed-attribute datasets. We test our new method with respect to accuracy, scalability and descriptive power. The results obtained using different datasets demonstrate the superior predictive and descriptive capabilities of Spark-GHSOM, as well as its applicability to large-scale datasets which could not be analyzed before.

    Scalable auto-encoders for gravitational waves detection from time series data

    Gravitational waves represent a new opportunity to study and interpret phenomena from the universe. In order to efficiently detect and analyze them, advanced and automatic signal processing and machine learning techniques could help to support standard tools and techniques. Another challenge relates to the large volume of data collected by the detectors on a daily basis, which creates a gap between the amount of data generated and the amount effectively analyzed. In this paper, we propose two approaches involving deep auto-encoder models to analyze time series collected from gravitational wave detectors and provide a classification label (noise or real signal). The purpose is to discard noisy time series accurately and identify time series that potentially contain a real phenomenon. Experiments carried out on three datasets show that the proposed approaches, implemented using the Apache Spark framework, represent a valuable machine learning tool for astrophysical analysis, offering competitive accuracy and scalability with respect to state-of-the-art methods.

    ECHAD: Embedding-Based Change Detection from Multivariate Time Series in Smart Grids

    Smart grids are power grids where clients may actively participate in energy production, storage and distribution. Smart grid management raises several challenges, including possible changes and evolutions in energy consumption and production, which must be taken into account in order to properly regulate the energy distribution. In this context, machine learning methods can be fruitfully adopted to support the analysis and to predict the behavior of smart grids, by exploiting the large amount of streaming data generated by sensor networks. In this article, we propose a novel change detection method, called ECHAD (Embedding-based CHAnge Detection), that leverages embedding techniques, one-class learning, and a dynamic detection approach that incrementally updates the learned model to reflect the new data distribution. Our experiments show that ECHAD achieves optimal performance on synthetic data representing challenging scenarios. Moreover, a qualitative analysis of the results obtained on real data from a real power grid reveals the quality of the change detection of ECHAD. Specifically, a comparison with state-of-the-art approaches shows ECHAD's ability to identify additional relevant changes, not detected by competitors, while avoiding false positive detections.

    Spatially-Aware Autoencoders for Detecting Contextual Anomalies in Geo-Distributed Data

    The huge amount of data generated by sensor networks enables many potential analyses. However, one important limiting factor for the analysis of sensor data is the possible presence of anomalies, which may affect the validity of any conclusion we could draw. This aspect motivates the adoption of a preliminary anomaly detection method. Existing methods usually do not consider the spatial nature of data generated by sensor networks. Properly modeling the spatial nature of the data, by explicitly considering spatial autocorrelation phenomena, has the potential to highlight the degree of agreement or disagreement among measurements from multiple sensors located in different geographical positions. The intuition is that one could improve anomaly detection performance by considering the spatial context. In this paper, we propose a spatially-aware anomaly detection method based on a stacked auto-encoder architecture. Specifically, the proposed architecture includes a specific encoding stage that models the spatial autocorrelation in data observed at different locations. Finally, a distance-based approach leverages the embedding features returned by the auto-encoder to identify possible anomalies. Our experimental evaluation on real-world geo-distributed data collected from renewable energy plants shows the effectiveness of the proposed method, also when compared to state-of-the-art anomaly detection methods.

    A review of performance evaluation measures for hierarchical classifiers

    Criteria for evaluating the performance of a classifier are an important part of its design. They make it possible to estimate the behavior of the generated classifier on unseen data and can also be used to compare its performance against that of classifiers generated by other classification algorithms. There are currently several performance measures for binary and flat classification problems. For hierarchical classification problems, where there are multiple classes which are hierarchically related, the evaluation step is more complex. This paper reviews the main evaluation metrics proposed in the literature to evaluate hierarchical classification models.
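
To illustrate why evaluation differs in the hierarchical setting, here is a minimal sketch of one widely used family of measures: hierarchical precision, recall and F-measure computed over ancestor sets. The abstract does not say which measures the review covers, so treating this one as representative is an assumption, and the toy class tree and function names are hypothetical.

```python
def ancestors(label, parent):
    """Return `label` plus all its ancestors, excluding the root.

    `parent` maps each non-root class to its parent class; the root
    is assumed not to appear as a key.
    """
    result = set()
    while label in parent:
        result.add(label)
        label = parent[label]
    return result


def hierarchical_prf(true_labels, pred_labels, parent):
    """Hierarchical precision, recall and F1 over ancestor-set overlap."""
    overlap = pred_total = true_total = 0
    for t, p in zip(true_labels, pred_labels):
        t_set, p_set = ancestors(t, parent), ancestors(p, parent)
        overlap += len(t_set & p_set)
        pred_total += len(p_set)
        true_total += len(t_set)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical class tree rooted at "animal".
    parent = {"dog": "mammal", "cat": "mammal",
              "mammal": "animal", "bird": "animal"}
    print(hierarchical_prf(["dog", "cat", "bird"],
                           ["cat", "cat", "dog"], parent))
```

Unlike flat accuracy, this measure gives partial credit when the predicted class shares ancestors with the true class, so confusing "dog" with "cat" is penalized less than confusing "bird" with "dog".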

    One-Class Ensembles for Rare Genomic Sequences Identification

    The next-generation sequencing revolution has impacted biological research by allowing the collection and analysis of very large datasets. However, despite the wide availability of data, current computational methods used by biologists present some limitations in challenging domains, such as extremely imbalanced datasets characterized by almost only negative examples. In this paper, we address the problem of identifying sequences from the zebra finch (songbird) germline-restricted chromosome (GRC), which is present only in reproductive tissues and missing from all other cells. Since the germline contains the GRC in addition to other chromosomes, sequencing germline DNA must be followed by separation into GRC or non-GRC sequences. The complexity of this task depends on the limited availability of known GRC sequences. We propose a one-class ensemble learning method to solve this problem, and we compare its performance with state-of-the-art methods for one-class classification. Our results show that the proposed method is able to identify positive sequences with high accuracy, having been trained only with negative sequences and tuned with a limited number of positive sequences. Moreover, a biological analysis revealed that positive sequences from a verified GRC gene were ranked in the top third of all the sequences, showing that our method is successful in demarcating GRC from non-GRC sequences. Our method thus represents a valuable tool for biologists, since model predictions can allow them to focus their limited resources on the experimental validation of a subset of higher-confidence sequences.

    HURI: Hybrid user risk identification in social networks

    The massive adoption of social networks has increased the need to analyze users’ data and interactions to detect and block the spread of propaganda and harassment behaviors, as well as to prevent actions influencing people towards illegal or immoral activities. In this paper, we propose HURI, a method for social network analysis that accurately classifies users as safe or risky, according to their behavior in the social network. Specifically, the proposed hybrid approach leverages both the topology of the network of interactions and the semantics of the content shared by users, leading to an accurate classification even in the presence of noisy data, such as users who may appear to be risky due to the topic of their posts, but are actually safe according to their relationships. The strength of the proposed approach lies in the full and simultaneous exploitation of both aspects, giving each of them equal consideration during the combination phase. This characteristic makes HURI different from other approaches that fully consider only a single aspect and graft partial or superficial elements of the other onto the first. The performance achieved in the analysis of a real-world Twitter dataset shows that the proposed method is competitive with eight state-of-the-art approaches.