1,720,978 research outputs found

Distributed and explainable GHSOM for anomaly detection in sensor networks

Authors: Corizzo R., Ceci M., Mignone P.
Publication date: 01/01/2024

The identification of anomalous activities is a challenging and crucially important task in sensor networks. This task is becoming increasingly complex with the growing volume of data generated in real-world domains, and greatly benefits from the use of predictive models to identify anomalies in real time. A key use case for this task is the identification of misbehavior that may be caused by involuntary faults or deliberate actions. However, currently adopted anomaly detection methods are often affected by limitations such as the inability to analyze large-scale data, reduced effectiveness when data presents multiple densities, a strong dependence on user-defined threshold configurations, and a lack of explainability in the extracted predictions. In this paper, we propose a distributed deep learning method that extends growing hierarchical self-organizing maps, originally designed for clustering tasks, to address anomaly detection tasks. The SOM-based modeling capabilities of the method enable the analysis of data with multiple densities, by exploiting multiple SOMs organized as a hierarchy. Our map-reduce implementation under Apache Spark allows the method to process and analyze large-scale sensor network data. An automatic threshold-tuning strategy reduces user effort and increases the robustness of the method with respect to noisy instances. Moreover, an explainability component resorting to instance-based feature ranking emphasizes the most salient features influencing the decisions of the anomaly detection model, supporting users in their understanding of raised alerts. Experiments are conducted on five real-world sensor network datasets, including wind and photovoltaic energy production, vehicular traffic, and pedestrian flows. Our results show that the proposed method outperforms state-of-the-art anomaly detection competitors. Furthermore, a scalability analysis reveals that the method is able to scale linearly as the data volume presented increases, leveraging multiple worker nodes in a distributed computing setting. Qualitative analyses on the level of anomalous pollen in the air further emphasize the effectiveness of our proposed method, and its potential in determining the level of danger in raised alerts.
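As a rough illustration of the ideas in this abstract, the sketch below scores an instance by its distance to the closest prototype of an already trained map, derives a threshold from training scores, and ranks features by their contribution to that distance. It is not the authors' implementation: the flat `codebook`, the mean-plus-k-standard-deviations rule in `tune_threshold`, and all function names are assumptions made for this example, and the hierarchy and Spark distribution are left out.

```python
import math
import statistics

def best_matching_unit(instance, codebook):
    """Return (index, distance) of the prototype closest to the instance."""
    distances = [math.dist(instance, prototype) for prototype in codebook]
    index = min(range(len(codebook)), key=distances.__getitem__)
    return index, distances[index]

def tune_threshold(training_scores, k=3.0):
    """Assumed rule: mean plus k population standard deviations of training scores."""
    return statistics.mean(training_scores) + k * statistics.pstdev(training_scores)

def rank_features(instance, prototype):
    """Order feature indices by their share of the squared distance, largest first."""
    contributions = [(x - p) ** 2 for x, p in zip(instance, prototype)]
    return sorted(range(len(instance)), key=lambda i: contributions[i], reverse=True)

def explain(instance, codebook, threshold):
    """Flag the instance and report which features drive its anomaly score."""
    index, score = best_matching_unit(instance, codebook)
    return {
        "anomalous": score > threshold,
        "score": score,
        "ranking": rank_features(instance, codebook[index]),
    }

if __name__ == "__main__":
    # Hypothetical two-prototype map over two features.
    codebook = [(0.0, 0.0), (10.0, 10.0)]
    print(explain((10.0, 13.0), codebook, threshold=2.0))
```

The ranking lists feature indices, so a user can see which measurement moved the instance away from its closest prototype.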

Archivio istituzionale della ricerca - Università di Bari

Anomaly Detection and Repair for Accurate Predictions in Geo-distributed Big Data

Authors: Corizzo R., Ceci M., Japkowicz N.
Publication date: 01/01/2019

The increasing presence of geo-distributed sensor networks implies the generation of huge volumes of data from multiple geographical locations at an increasing rate. This raises important issues which become more challenging when the final goal is to analyze the data for forecasting purposes or, more generally, for predictive tasks. This paper proposes a framework which supports predictive modeling tasks from streaming data coming from multiple geo-referenced sensors. In particular, we propose a distance-based anomaly detection strategy which considers objects described by embedding features learned via a stacked auto-encoder. We then devise a repair strategy which repairs the data detected as anomalous by exploiting non-anomalous data measured by sensors in nearby spatial locations. Subsequently, we adopt Gradient Boosted Trees (GBTs) to predict/forecast values assumed by a target variable of interest for the repaired newly arriving (unlabeled) data, using the original feature representation or the embedding feature representation learned via the stacked auto-encoder. The workflow is implemented with distributed Apache Spark programming primitives and tested on a cluster environment. We perform experiments to assess the performance of each module, separately and in a combined manner, considering the predictive modeling of one-day-ahead energy production for multiple renewable energy sites. Accuracy results show that the proposed framework reduces the error by up to 13.56%. Moreover, scalability results demonstrate the efficiency of the proposed framework in terms of speedup, scaleup and execution time under a stress test.
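The detect-then-repair idea described above can be sketched as follows. This is a minimal single-machine illustration under stated assumptions, not the paper's Spark workflow: the embeddings are taken as given (no auto-encoder is trained here), the anomaly criterion (mean distance to the k nearest neighbours above a fixed `threshold`) and the repair rule (average of non-anomalous readings within `radius`) are assumptions, and all names are hypothetical.

```python
import math

def detect_anomalies(embeddings, k=2, threshold=1.0):
    """Flag a point when its mean distance to its k nearest neighbours exceeds the threshold."""
    flags = []
    for i, point in enumerate(embeddings):
        dists = sorted(math.dist(point, other)
                       for j, other in enumerate(embeddings) if j != i)
        flags.append(sum(dists[:k]) / k > threshold)
    return flags

def repair(values, flags, locations, radius):
    """Replace each flagged value with the mean of unflagged values from sensors within radius."""
    repaired = list(values)
    for i, flagged in enumerate(flags):
        if not flagged:
            continue
        neighbours = [values[j] for j in range(len(values))
                      if j != i and not flags[j]
                      and math.dist(locations[i], locations[j]) <= radius]
        if neighbours:  # leave the value untouched when no clean neighbour exists
            repaired[i] = sum(neighbours) / len(neighbours)
    return repaired

if __name__ == "__main__":
    # Hypothetical one-dimensional embeddings, readings and sensor coordinates.
    embeddings = [(0.0,), (0.1,), (0.2,), (5.0,)]
    readings = [10.0, 12.0, 11.0, 90.0]
    locations = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    flags = detect_anomalies(embeddings, k=2, threshold=1.0)
    print(flags, repair(readings, flags, locations, radius=1.0))
```

The repaired readings would then feed a downstream regressor such as gradient boosted trees.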

Archivio istituzionale della ricerca - Università di Bari

GAP-LSTM: Graph-Based Autocorrelation Preserving Networks for Geo-Distributed Forecasting

Authors: Corizzo R., Ceci M., Altieri M.
Publication date: 01/01/2024

Forecasting methods are important decision support tools in geo-distributed sensor networks. However, challenges such as the multivariate nature of data, the existence of multiple nodes, and the presence of spatio-temporal autocorrelation increase the complexity of the task. Existing forecasting methods are unable to address these challenges in a combined manner, resulting in suboptimal model accuracy. In this article, we propose GAP-LSTM, a novel geo-distributed forecasting method that leverages the synergistic interaction of graph convolution, attention-based long short-term memory (LSTM), 2-D convolution, and latent memory states to effectively exploit spatio-temporal autocorrelation in multivariate data generated by multiple nodes, resulting in improved modeling capabilities. Our extensive evaluation, involving real-world datasets on traffic, energy, and pollution domains, showcases the ability of our method to outperform state-of-the-art forecasting methods. An ablation study confirms that all method components provide a positive contribution to the accuracy of the extracted forecasts. The method also provides an interpretable visualization that complements forecasts with additional insights for domain experts.

Archivio istituzionale della ricerca - Università di Bari

Deep learning versus conventional learning in data streams with concept drifts

Authors: Corizzo R., Ryan S., Kiringa I., Japkowicz N.
Publication date: 01/01/2019

In many real-world applications, the characteristics of data collected by activity logs, sensors and mobile devices change over time. This behavior is known as concept drift. In complex environments, which produce high-dimensional data streams, machine learning tasks become cumbersome, as models become outdated very quickly. In our study, we assess hundreds of combinations of data characteristics and methods on network traffic data. Specifically, we focus on seven conventional machine learning and deep learning methods and compare their generalization power in the presence of concept drift. Our results show that Convolutional Neural Networks (CNNs) outperform conventional methods, even when compared to an idealized upper bound on their performance created in a piecewise manner by selecting the best method and its best configuration at each point in time, thus mimicking the output of a perfect meta-learning architecture. In the context of sequential data subject to concept drift, our results appear to defy the usually accepted 'No Free Lunch (NFL) Theorem', which stipulates that no method dominates all the others in every situation. While this is by no means a rejection of the NFL Theorem, which captures a much more complex phenomenon, it is nonetheless a surprising result worth further investigation. As a matter of fact, our results show that, when data availability is limited, a meta-learning approach is preferable to CNNs, as it requires less data for training.
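The piecewise upper bound mentioned above (picking the best method at each point in time) can be illustrated with a few lines of Python. This is only a sketch of the selection idea: the method names and scores are made up for the example, and the paper's actual evaluation protocol is not reproduced.

```python
def oracle_upper_bound(scores_by_method):
    """For each time window, return (best_method, best_score) across all methods.

    scores_by_method maps a method name to a list of per-window scores
    (higher is better); all lists are assumed to have the same length.
    """
    methods = list(scores_by_method)
    n_windows = len(scores_by_method[methods[0]])
    selection = []
    for t in range(n_windows):
        best = max(methods, key=lambda name: scores_by_method[name][t])
        selection.append((best, scores_by_method[best][t]))
    return selection

if __name__ == "__main__":
    # Hypothetical per-window accuracies for two conventional methods.
    scores = {"tree": [0.9, 0.6, 0.7], "svm": [0.8, 0.7, 0.65]}
    print(oracle_upper_bound(scores))
```

A single method is then compared against the sequence of per-window winners, which no individual conventional method can exceed.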

Archivio istituzionale della ricerca - Università di Bari

Explainable Spatio-Temporal Graph Modeling

Authors: Corizzo R., Ceci M., Altieri M.
Publication date: 01/01/2023

Explainable AI (XAI) focuses on designing inference explanation methods and tools to complement machine learning and black-box deep learning models. Such capabilities are crucially important with the rising adoption of AI models in real-world applications, which require domain experts to understand how model predictions are extracted in order to make informed decisions. Despite the increasing number of XAI approaches for tabular, image, and graph data, their effectiveness in contexts with a spatial and temporal dimension is rather limited. As a result, available methods do not properly explain predictive models’ inferences when dealing with spatio-temporal data. In this paper, we fill this gap by proposing an XAI method that focuses on spatio-temporal geo-distributed sensor network data, where observations are collected at regular time intervals and at different locations. Our model-agnostic method performs perturbations on the feature space of the data to uncover relevant factors that influence model predictions, and generates explanations for multiple analytical views, such as features, timesteps, and node location. Our qualitative and quantitative experiments with real-world forecasting datasets show the effectiveness of the proposed method in providing valuable explanations of model predictions.
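A generic perturbation-based explanation over the three views named above (features, timesteps, nodes) might look like the sketch below. It is an assumption-laden illustration rather than the paper's method: the input layout `[node][timestep][feature]`, the constant `baseline` used as the perturbation, the scalar-output `model` callable, and every name are hypothetical.

```python
import copy

def perturbation_importance(model, x, axis, baseline=0.0):
    """Score each slice along `axis` by how much the prediction changes when it is masked.

    x is a nested list indexed [node][timestep][feature]; axis is one of
    "node", "timestep" or "feature"; model maps such an input to a number.
    """
    n_nodes, n_steps, n_feats = len(x), len(x[0]), len(x[0][0])
    size = {"node": n_nodes, "timestep": n_steps, "feature": n_feats}[axis]
    reference = model(x)
    scores = []
    for idx in range(size):
        perturbed = copy.deepcopy(x)
        for n in range(n_nodes):
            for t in range(n_steps):
                for f in range(n_feats):
                    position = {"node": n, "timestep": t, "feature": f}
                    if position[axis] == idx:
                        perturbed[n][t][f] = baseline
        scores.append(abs(model(perturbed) - reference))
    return scores

def toy_model(x):
    """Hypothetical black box: only feature 0 influences the output."""
    return sum(2.0 * step[0] for node in x for step in node)

if __name__ == "__main__":
    x = [[[1.0, 5.0], [2.0, 5.0]], [[3.0, 5.0], [4.0, 5.0]]]
    for view in ("feature", "timestep", "node"):
        print(view, perturbation_importance(toy_model, x, view))
```

Because only `model(x)` is called, the same routine works with any predictor that accepts this input layout.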

Archivio istituzionale della ricerca - Università di Bari

Spark-GHSOM: Growing Hierarchical Self-Organizing Map for large scale mixed attribute datasets

Authors: Corizzo R., Kiringa I., Ceci M., Japkowicz N., Malondkar A.
Publication date: 01/01/2019

The Growing Hierarchical Self-Organizing Map (GHSOM) algorithm has shown its potential for performing several tasks such as exploratory analysis, anomaly detection and forecasting on a variety of domains including the financial and cyber-security domains. GHSOM is a dynamic variant of the SOM algorithm which generates a multi-level hierarchy of SOMs based solely on input data. However, in order to generate this multi-level structure, GHSOM requires multiple iterations over the input dataset, thus making it intractable on large datasets. Moreover, the conventional GHSOM algorithm is designed to handle datasets with numeric attributes only. This represents an important limitation as most modern real-world datasets are characterized by mixed attributes - numerical and categorical. In this work, we propose an extension of the conventional GHSOM algorithm called Spark-GHSOM, which exploits the Spark platform to process massive datasets in a distributed manner. Moreover, we leverage a method known as the distance hierarchy approach to modify the optimization function of GHSOM so that it can (also) coherently handle mixed-attribute datasets. We test our new method with respect to accuracy, scalability and descriptive power. The results obtained using different datasets demonstrate the superior predictive and descriptive capabilities of Spark-GHSOM, as well as its applicability to large-scale datasets which could not be analyzed before.

Archivio istituzionale della ricerca - Università di Bari

Multi-aspect renewable energy forecasting

Authors: Corizzo R., Ceci M., Fanaee-T H., Gama J.
Publication date: 01/01/2021

The increasing presence of renewable energy plants has created new challenges such as grid integration, load balancing and energy trading, making it fundamental to provide effective prediction models. Recent approaches in the literature have shown that exploiting spatio-temporal autocorrelation in data coming from multiple plants can lead to better predictions. Although tensor models and techniques are suitable to deal with spatio-temporal data, they have received little attention in the energy domain. In this paper, we propose a new method based on the Tucker tensor decomposition, capable of extracting a new feature space for the learning task. For evaluation purposes, we have investigated the performance of predictive clustering trees with the new feature space, compared to the original feature space, on three renewable energy datasets. The results are favorable for the proposed method, also when compared with state-of-the-art algorithms.

Archivio istituzionale della ricerca - Università di Bari

Scalable auto-encoders for gravitational waves detection from time series data

Authors: Corizzo R., Zdravevski E., Ceci M., Japkowicz N.
Publication date: 01/01/2020

Gravitational waves represent a new opportunity to study and interpret phenomena from the universe. In order to efficiently detect and analyze them, advanced and automatic signal processing and machine learning techniques could help to support standard tools and techniques. Another challenge relates to the large volume of data collected by the detectors on a daily basis, which creates a gap between the amount of data generated and the amount effectively analyzed. In this paper, we propose two approaches involving deep auto-encoder models to analyze time series collected from gravitational wave detectors and provide a classification label (noise or real signal). The purpose is to discard noisy time series accurately and identify time series that potentially contain a real phenomenon. Experiments carried out on three datasets show that the proposed approaches, implemented using the Apache Spark framework, represent a valuable machine learning tool for astrophysical analysis, offering competitive accuracy and scalability with respect to state-of-the-art methods.

Archivio istituzionale della ricerca - Università di Bari

Spatial autocorrelation and entropy for renewable energy forecasting

Authors: Rashkovska A., Corizzo R., Malerba D., Ceci M.
Publication date: 01/01/2019

In renewable energy forecasting, data are typically collected by geographically distributed sensor networks, which poses several issues. (i) Data represent physical properties that are subject to concept drift, i.e., their characteristics could change over time. To address the concept drift phenomenon, adaptive online learning methods should be considered. (ii) The error distribution is typically non-Gaussian. Therefore, traditional quality performance criteria during training, like the mean-squared error, are less suitable. In the literature, entropy-based criteria have been proposed to deal with this problem. (iii) Spatially-located sensors introduce some form of autocorrelation, that is, values collected by sensors show a correlation strictly due to their relative spatial proximity. Although all these issues have already been investigated in the literature, they have not been investigated in combination. In this paper, we propose a new method which learns artificial neural networks by addressing all these issues. The method performs online adaptive training and enriches the entropy measures with spatial information of the data, in order to take into account spatial autocorrelation. Experimental results on two photovoltaic power production datasets are clearly favorable for entropy-based measures that take into account spatial autocorrelation, also when compared with state-of-the-art methods.

Archivio istituzionale della ricerca - Università di Bari

DENCAST: distributed density-based clustering for multi-target regression

Authors: Corizzo R., Malerba D., Ceci M., Pio G.
Publication date: 01/01/2019

Recent developments in sensor networks and mobile computing have led to a huge increase in generated data that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: it is able to significantly outperform (p-value < 0.05) state-of-the-art distributed regression methods, in both single- and multi-target settings.
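To give a feel for how locality sensitive hashing can limit neighbour searches to small candidate groups, the sketch below hashes vectors with random hyperplanes and groups them by signature. The abstract does not say which LSH family DENCAST uses, so the random-hyperplane scheme, the parameters, and all names here are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals of the given dimension."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )

def bucketize(vectors, hyperplanes):
    """Group vector indices by signature; similar directions tend to share a bucket."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, hyperplanes)].append(idx)
    return dict(buckets)

if __name__ == "__main__":
    planes = make_hyperplanes(dim=3, n_bits=8, seed=42)
    data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
    for key, members in bucketize(data, planes).items():
        print(key, members)
```

Only vectors that land in the same bucket need to be compared when estimating density, which is what keeps the neighbour search tractable on large, high-dimensional data.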

Archivio istituzionale della ricerca - Università di Bari