An Empirical Approach for Clustering-Based Time Series Summarisation Assessment
In recent decades, the rise of Big Data solutions has significantly advanced the analysis of time series data as a representation of dynamic phenomena through sequences of observations. Recent research efforts have advocated the adoption of data summarisation techniques, such as incremental clustering, to promptly capture data evolution, thus helping domain experts make informed and proactive decisions by capitalising on a compact representation of time series. Nevertheless, although incremental clustering effectively reduces data volume while preserving relevant statistical information, it is crucial to estimate the degree of approximation between the original time series data and its summarised version. This evaluation is pivotal whenever the summarisation output is the starting point for complex analytical pipelines (e.g., for pattern recognition and anomaly detection). Building on practical and empirical considerations drawn from both a synthetic and a real-world dataset, we propose in this paper a variant of a well-known quality metric for incremental clustering, to assess the extent to which the time series summary accurately captures the dynamics of the original data.
A Methodological Approach for Data-Intensive Web Application Design on Top of Data Lakes
Data exploration and decision making may benefit from the availability of data-intensive web applications that enable domain experts to navigate across massive, dynamic and heterogeneous data sources stored in so-called Data Lakes. However, traditional design strategies for this kind of application require well-defined and cleaned data structures in the background. Conceptual modelling may be fruitfully employed to provide web developers with a comprehensive view of the Data Lake sources on which web applications are designed. Nevertheless, the cumbersome nature of Data Lakes turns the conceptual model into a dynamic entity, which must be properly managed. In this paper, we propose a methodological approach to design data-intensive web applications on top of a Data Lake. A conceptual data model, woven over Data Lake sources, is leveraged to identify the relevant information to be included in the web application. The methodology makes the model evolve both with new data source content emerging from the Data Lake, through a zone-based operations pipeline that prepares a curated version of the raw data (bottom-up), and with additional domain knowledge provided by web developers, derived from the data-intensive web application design (top-down). The approach, independent of any specific implementation technology, is applied in the context of a real case study from an ongoing research project in the cultural heritage domain.
PICTURE - A Framework to Assess the Degree of Approximation of Summarized Time Series
The analysis of time series data, which represents dynamic phenomena through sequences of observations, is greatly influenced by Big Data. Both the sheer volume and the advanced capabilities of Big Data significantly affect how these analyses are conducted, enabling more comprehensive and detailed insights. Recent studies have promoted the use of data summarization techniques, for instance through incremental clustering, to address the challenges of Big Data volume. These techniques quickly capture data evolution, thereby helping domain experts make informed and proactive decisions by leveraging a concise representation of time series. However, although incremental clustering efficiently reduces data volume and retains key statistical information, it is important to evaluate the accuracy of the summarized version compared to the original time series data. This assessment is critical when the summarized data is used as the basis for complex analytical pipelines, such as those for pattern recognition and anomaly detection.
Motivated by these premises and building on empirical experience in defining a metric to assess the adherence of summarised time series to the original data stream, in this paper: (i) we propose a variant of a well-known quality metric for incremental clustering, based on an abstract model of clustering data structures, to assess the extent to which the time series summary accurately captures the dynamics of the original data; (ii) we present PICTURE (Python-based Incremental Clustering for Time series Representation and Evaluation), a framework featuring four widely used incremental clustering algorithms from the literature, equipped with modules for execution, representation, and evaluation of clustering results applied to time series according to the abstract model; (iii) we conduct an extensive qualitative and quantitative analysis of incremental clustering results on one synthetic and two real-world datasets using the PICTURE framework, to demonstrate the effectiveness of the proposed metric in assessing the degree of approximation of summarised time series.
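As an illustration of the general idea only, the following minimal sketch summarises a stream with a simple one-pass, threshold-based clustering and then measures how far the observations lie from the centroids that stand in for them. It is not the metric proposed in the paper, nor the PICTURE API: the `MicroCluster` structure, the `radius` parameter, and the `mean_approximation_error` measure are all assumptions introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt


@dataclass
class MicroCluster:
    """Hypothetical running summary: point count and per-dimension sum."""
    n: int
    linear_sum: list[float]

    @property
    def centroid(self) -> list[float]:
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point: list[float]) -> None:
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]


def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarise(stream: list[list[float]], radius: float):
    """One pass over the stream: absorb each point into the nearest cluster
    if it lies within `radius` of its centroid, otherwise open a new cluster."""
    clusters: list[MicroCluster] = []
    assignment: list[int] = []
    for point in stream:
        best = min(
            range(len(clusters)),
            key=lambda i: distance(point, clusters[i].centroid),
            default=None,
        )
        if best is not None and distance(point, clusters[best].centroid) <= radius:
            clusters[best].absorb(point)
        else:
            clusters.append(MicroCluster(1, list(point)))
            best = len(clusters) - 1
        assignment.append(best)
    return clusters, assignment


def mean_approximation_error(stream, clusters, assignment) -> float:
    """Assumed measure: average distance between each observation and the
    final centroid of the cluster it was assigned to."""
    errors = [distance(p, clusters[i].centroid) for p, i in zip(stream, assignment)]
    return sum(errors) / len(errors)
```

A lower value suggests that the centroids stay close to the observations they replace; a real assessment would also need to account for the temporal order of the data, which this sketch ignores.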
A semantics-enabled approach for personalised Data Lake exploration
The increasing availability of Big Data is changing the way data exploration for Business Intelligence is performed, due to the volume, velocity and uncontrolled variety of data on which exploration relies. In particular, data exploration is required in Data Lakes, which have been proposed to host heterogeneous data sources, given their flexibility to cope with the cumbersome properties of Big Data. However, as data grows, new methods and techniques are required for extracting value and knowledge from data stored within Data Lakes, aggregating data into indicators according to multiple analysis dimensions, to enable a large number of users with different roles and competencies to capitalise on available information. In this paper, we propose PERSEUS (PERSonalised Exploration by User Support), a computer-aided approach for data exploration on top of a Data Lake, structured over three phases: (1) the construction of a semantic metadata catalog on top of the Data Lake, leveraging tools and metrics to ease the annotation of the Data Lake metadata; (2) modelling of indicators and analysis dimensions, guided by an openly available Multi-Dimensional Ontology to enable conformance checking of indicators and let users explore Data Lake contents; (3) enrichment of the definition of indicators with personalisation aspects, based on users’ profiles and preferences, to make the exploration of data easier and more usable for a large number of users. Results of an experimental evaluation in the Smart City domain are presented with the aim of demonstrating the feasibility of the approach.
Relevance-Based Big Data Exploration for Smart Road Maintenance
In recent years, the progressive digitalisation of Smart City ecosystems has fuelled an increasing availability of data from sensor networks, which is considered a valuable asset for improving mobility resilience. In particular, data coming from sensors in vehicles can be leveraged to obtain useful information about the quality of the area-wide road surface in near real-time, and may be used by road maintainers to focus monitoring and maintenance activities on urban and public infrastructure. To bring such an application scenario into the field, road maintainers should be equipped with valuable tools to gain insights from the data and ensure a safer and more efficient infrastructure. In this paper, we present a methodological approach, based on big data exploration techniques, to support road maintainers in analysing and assessing the surface conditions of roads. Specifically, the proposed approach is grounded on three components: (i) a multi-dimensional model, apt to represent the road network and to enable data exploration; (ii) data summarisation techniques, in order to simplify the overall view over high volumes of data collected by vehicles; (iii) a measure of relevance, aimed at focusing the attention of the maintainers on relevant data only. The paper illustrates the design of multiple exploration scenarios on top of the three components, and their implementation and preliminary evaluation in an ongoing research project on sustainable and resilient mobility.
A big data exploration approach to exploit in-vehicle data for smart road maintenance
In modern Smart Cities, pervasive collection of sensor-based and IoT data streams is a challenging opportunity for improving mobility resilience. Among the potential applications, sensor-based data streams provide valuable information about the quality of the area-wide road surface. Modern vehicle black boxes are also able to estimate the type of anomaly (e.g., bump, hole, rough ground, depression), based on real-time analysis of acceleration data streams. Road maintainers may use all this information to improve monitoring and maintenance activities. However, the volume of data streams, the variety of the road network and the different degrees of seriousness of detected anomalies call for methods to support maintainers in the exploration of available data. To this aim, in this paper, we propose a methodological approach, based on big data exploration techniques. The approach is grounded on: (i) a multi-dimensional model, apt to organise data streams according to different dimensions and enable data exploration; (ii) data summarisation techniques, based on an incremental clustering algorithm, to simplify the overall view over massive data streams and to cope with their dynamic nature; (iii) a measure of relevance, to focus the attention on road portions that present critical conditions. The innovative contributions regard the formalisation of the exploration methodology, the definition of exploration scenarios, based on road maintainers’ goals and the measure of relevance, and an extensive experimentation on a real-world case study, addressed in a research project on smart and resilient mobility. Experimental results show that relevance evaluation is able to efficiently draw the road maintainers’ attention to road portions that present the most critical conditions, and that the proposed incremental clustering algorithm outperforms existing algorithms in the literature.
Personalised Exploration Graphs on top of Data Lakes
The volume, velocity and uncontrolled variety of Big Data are changing the way data exploration for data-driven decision making is performed on top of Data Lakes. As data grows, novel methods are needed for data aggregation by means of indicators and multi-dimensional analysis of Data Lake content, enabling exploration of data according to various dimensions, thus empowering users with diverse roles and competencies to capitalise on the available information. In this paper, we present a computer-aided approach (named PERSEUS, PERSonalised Exploration by User Support) for data exploration on top of a Data Lake. The approach is structured over three phases: (i) the construction of a semantic metadata catalog on top of the Data Lake; (ii) the creation of an Exploration Graph, based on metadata contained in the catalog, containing the semantic representation of indicators and analysis dimensions; (iii) the enrichment of the definition of indicators with personalisation aspects (based on users' profiles and preferences) to identify Exploration Contexts, in turn delimiting portions of the Exploration Graph for a personalised and interactive exploration of indicators. Results of an experimental evaluation in the Smart City domain are presented with the aim of demonstrating the feasibility of the approach.
In-Vehicle Big Data Exploration for Road Maintenance (Discussion Paper)
Big Data Exploration techniques may benefit from the availability of huge amounts of data (e.g., collected from IoT infrastructures) for improving the resilience of monitored systems. In this paper, we discuss the application of such techniques in a research project to pursue mobility resilience in Smart City applications. Among the aspects to be considered for enabling resilience in mobility, we specifically focus on road maintenance, gathering data streams from vehicles equipped with sensors and designing proper exploration scenarios. Scenarios rely on three components as the main pillars of the proposed approach: (i) a multi-dimensional model apt to represent the road network and to enable data exploration; (ii) data summarisation techniques, in order to simplify exploration of high data volumes; (iii) a measure of relevance, aimed at focusing the attention of road maintainers on relevant data only.
