Predictive perception for detecting human motion anomalies and procedural mistakes
Computer Vision emerges as a cornerstone field within Artificial Intelligence, enabling digital systems to sense the world through images, mirroring the human ability to see and interpret one's surroundings.
This ability is paramount, as it allows autonomous systems to interact with humans, promising to reliably extend the applications of AI to productive systems.
For example, in Human-Robot collaboration (HRC), accurate vision-based techniques can prevent accidents by providing the cobot with the ability to interpret and swiftly respond to human worker actions.
Similarly, in smart manufacturing, Computer Vision methods allow for the timely detection of errors and anomalies in production lines, enhancing quality control and safety. In video surveillance, they monitor environments for security threats, promptly identifying unusual behaviors or hazardous situations before they escalate. However, the deployment of Computer Vision technologies in real-world scenarios is hampered by significant challenges.
These include the requirement for real-time responsiveness, the ability to function reliably in diverse and unpredictable environments, and the development of comprehensive metrics for assessing detection accuracy and system reliability.
This thesis explores machine perception's role in enhancing safety and productive integrity across several domains. By leveraging cutting-edge methodologies such as Denoising Diffusion Probabilistic Models and Large Language Models in novel domains, we propose innovative solutions for applications that require a fine understanding of human behaviors and environments to promote effectiveness, safety, and efficiency.
First, we delve into the HRC domain.
Aiming to improve the current methods' efficiency, we devise a lightweight Separable-Sparse Graph Convolutional model that we dub \emph{SeS-GCN}. SeS-GCN bottlenecks the interaction of the GCN's spatial, temporal, and channel-wise dimensions and further learns sparse adjacency matrices by a teacher-student framework. These modeling choices lower the model's memory footprint, providing a practical solution that proves effective both in Human-Pose Forecasting and Collision Avoidance. Moreover, the Cobots and Humans in Industrial COllaboration (CHICO) dataset is proposed to foster research in this field. For the first time, CHICO encompasses 3D-synchronized views and recorded poses of humans and cobots while collaborating in a real industrial scenario, representing a precious resource for advancing safe human-robot collaboration.
Safety often depends on promptly detecting and responding to mistakes or anomalies, which otherwise risk escalating into dangerous collisions or production inefficiencies.
Thus, following a review of the latest advancements in Video Anomaly Detection methodologies, this thesis builds on the established one-class classification framework, proposing two techniques for human-related Anomaly Detection. The first study investigates adopting non-Euclidean latent spaces to set the one-class-classification's metric objective.
We leverage the unique properties of the hyperbolic and spherical metric manifolds to improve human-related anomaly detection. The second proposal introduces a Motion Conditioned Diffusion-based approach for Anomaly Detection (\emph{MoCoDAD}). MoCoDAD is the first video anomaly detection method that exploits diffusion models to spot anomalies in motion sequences. We revisit the common reconstruction-based technique, coupling it with the generative ability of diffusion probabilistic models; this extends the state of the art in human-related Video Anomaly Detection and provides insights that serve as the foundation for online mistake detection.
Next, this thesis deals with error anticipation in procedural activities. Acknowledging the absence of a proper benchmark for this task, we apply the insights from the one-class-classification paradigm and Video Anomaly Detection and propose two novel datasets, metrics, and baseline methods for detecting errors in industrial procedural videos. Moreover, we present an innovative technique that exploits the emerging reasoning capabilities of Large Language Models to detect mistakes in procedural video sequences.
This results in a novel multimodal approach that leverages an action recognition module to classify the steps of egocentric procedural videos and couples it with a Language Model to analyze the resulting procedural transcripts and detect mistakes.
This work offers empirical validation through extensive testing on established and newly introduced datasets; bridging the gap between Video Anomaly Detection and Procedural Mistake Detection, it presents a robust foundation for future research and practical applications.
We advance the understanding of procedural mistakes as open-set phenomena and emphasize the crucial need for online detection mechanisms, thus enhancing safety and operational efficiency in these environments.
These findings lay the foundation for future research, shaping the development of safer, more adaptive automated industrial systems.
Mixtures of von Mises Distributions for People Trajectory Shape Analysis
People trajectory analysis is a recurrent task in many pattern recognition applications, such as surveillance, behavior analysis, video annotation, and many others. In this paper, we propose a new framework for analyzing trajectory shape, invariant to spatial shifts of the people's motion in the scene. To cope with the noise and uncertainty of the trajectory samples, we describe each trajectory as a sequence of angles modeled by distributions from circular statistics, i.e., a mixture of von Mises (MovM) distributions. To deal with MovM, we define a new specific expectation-maximization (EM) algorithm for estimating the parameters and derive a closed form of the Bhattacharyya distance between single von Mises pdfs. Trajectories are then modeled as sequences of symbols, each corresponding to the most suitable distribution in the mixture, and compared with each other after a global alignment procedure that copes with trajectories of different lengths. The trajectories in the training set are clustered according to their shape similarity in an off-line phase, and testing trajectories are then classified with a specific on-line EM, based on sufficient statistics. The approach is particularly suitable for classifying people trajectories in video surveillance, searching for abnormal (i.e., infrequent) paths. Tests on synthetic and real data are provided, along with a complete comparison with other circular statistical and alignment methods.
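As a rough illustration of two ingredients named in this abstract, the sketch below turns a trajectory into a sequence of step angles (unchanged by spatial shifts) and evaluates a Bhattacharyya distance between two single von Mises pdfs. The distance uses the common definition, minus the log of the Bhattacharyya coefficient, which for von Mises densities reduces to a ratio of modified Bessel functions; the exact expression derived in the paper is not given in the abstract, so this form, the function names, and the series-based Bessel routine are assumptions made for illustration.

```python
import math

def trajectory_angles(points):
    """Direction of each step in radians; translating the path leaves it unchanged."""
    return [math.atan2(y1 - y0, x1 - x0)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

def bessel_i0(x, terms=60):
    """Modified Bessel function I0 via its power series (fine for moderate x)."""
    total, term = 0.0, 1.0
    q = (x / 2.0) ** 2
    for k in range(terms):
        total += term
        term *= q / ((k + 1) ** 2)
    return total

def bhattacharyya_von_mises(mu1, kappa1, mu2, kappa2):
    """-ln of the Bhattacharyya coefficient between two von Mises pdfs.

    Assumed form: BC = I0(r / 2) / sqrt(I0(kappa1) * I0(kappa2)), with
    r^2 = kappa1^2 + kappa2^2 + 2 * kappa1 * kappa2 * cos(mu1 - mu2).
    """
    r_sq = kappa1 ** 2 + kappa2 ** 2 + 2.0 * kappa1 * kappa2 * math.cos(mu1 - mu2)
    r = math.sqrt(max(0.0, r_sq))  # guard against tiny negative round-off
    coeff = bessel_i0(r / 2.0) / math.sqrt(bessel_i0(kappa1) * bessel_i0(kappa2))
    return -math.log(min(1.0, coeff))  # coefficient cannot exceed 1
```

Identical distributions give a distance of zero, and the distance grows as the mean directions move apart or the concentrations increase.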
Bayesian-competitive Consistent Labeling for People Surveillance
This paper presents a novel and robust approach to consistent labeling for people surveillance in multicamera systems. A general framework scalable to any number of cameras with overlapped views is devised. An offline training process automatically computes ground-plane homography and recovers epipolar geometry. When a new object is detected in any one camera, hypotheses for potential matching objects in the other cameras are established. Each of the hypotheses is evaluated using a prior and likelihood value. The prior accounts for the positions of the potential matching objects, while the likelihood is computed by warping the vertical axis of the new object onto the field of view of the other cameras and measuring the amount of match. In the likelihood, two contributions (forward and backward) are considered so as to correctly handle the case of groups of people merged into single objects. Finally, a maximum-a-posteriori approach estimates the best label assignment for the new object. Comparisons with other methods based on homography and extensive outdoor experiments demonstrate that the proposed approach is accurate and robust in coping with segmentation errors and in disambiguating groups.
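The maximum-a-posteriori step described above can be sketched as follows. The abstract does not say how the forward and backward likelihood contributions are combined, so taking their maximum is an assumption here, as are the function name, the input layout, and the example numbers.

```python
def map_label(hypotheses):
    """Pick the label with the highest posterior.

    hypotheses maps a candidate label to (prior, forward_lik, backward_lik).
    Combining the two likelihood contributions with max() is an assumption.
    """
    scores = {label: prior * max(fwd, bwd)
              for label, (prior, fwd, bwd) in hypotheses.items()}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no hypothesis has non-zero support")
    posteriors = {label: s / total for label, s in scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors

# Hypothetical example: two candidate labels seen in another camera.
best, posteriors = map_label({
    "A": (0.5, 0.2, 0.1),
    "B": (0.5, 0.1, 0.9),  # strong backward match, e.g. a merged group
})
```

In this toy case label "B" wins because its backward contribution is high even though its forward match is weak, which is the kind of situation the two-direction likelihood is meant to handle.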
Integrate tool for online analysis and offline mining of people trajectories
In the literature, online alarm-based video surveillance and offline forensic data mining systems are often treated separately, even by different scientific communities. However, the underlying techniques are almost the same and, despite some examples in commercial systems, the cases in which an integrated approach is followed are limited. For this reason, this study describes an integrated tool that combines these two subsystems effectively. Despite its generality, the proposal is reported here for the case of people trajectory analysis, both in real time and offline. Trajectories are modelled based on either their spatial location or their shape, and proper similarity measures are proposed. Special solutions to meet real-time requirements in both cases are also presented, and the trade-off between efficiency and efficacy is analysed by comparing results with and without a statistical model. Examples of results on large datasets acquired on the University campus are reported as a preliminary evaluation of the system.
The LAICA project: Experiments on Multicamera People Tracking and Logging
Logging information on moving objects is crucial in video surveillance systems. Distributed multi-camera systems can provide the appearance of objects/people from different viewpoints and at different resolutions, allowing a more complete and precise logging of the information. This is achieved through consistent labeling to correlate collected information of the same person. This paper proposes a novel approach to consistent labeling that is also capable of fully characterizing groups of people and managing mis-segmentations. The ground-plane homography and the epipolar geometry are automatically learned and exploited to warp objects' principal axes between overlapped cameras. A MAP estimator that exploits two contributions (forward and backward) is used to choose the most probable label configuration to be assigned at the handoff of a new object. Extensive experiments demonstrate the accuracy of the proposed method in detecting single and simultaneous handoffs, mis-segmentations, and groups.
Behavioral lEarning in Surveilled Areas with Feature Extraction
The project aims at exploring how visual features can be automatically extracted from video using computer vision techniques and exploited by a classifier (generated by machine learning) to detect and identify suspicious people behavior in public places in real time. In this sense, CV and ML are jointly developed and studied to provide a better mix of innovative techniques
HECOL: Homography and Epipolar-based Consistent Labeling for Outdoor Park Surveillance
Outdoor surveillance is one of the most attractive applications of video processing and analysis. Robust algorithms must be defined and tuned to cope with the non-idealities of outdoor scenes. For instance, in a public park, an automatic video surveillance system must discriminate between shadows, reflections, waving trees, people standing still or moving, and other objects. Visual knowledge coming from multiple cameras can disambiguate cluttered and occluded targets by providing a continuous consistent labeling of tracked objects among the different views. This work proposes a new approach for coping with this problem in multi-camera systems with overlapped Fields of View (FoVs). The presence of overlapped zones allows the definition of a geometry-based approach to reconstruct correspondences between FoVs, using only homography and epipolar lines (hereinafter HECOL: Homography and Epipolar-based COnsistent Labeling) computed automatically in a training phase. We also propose a complete system that provides segmentation and tracking of people in each camera module. Segmentation is performed by means of the SAKBOT (Statistical and Knowledge Based Object Tracker) approach, suitably modified to cope with multimodal backgrounds, reflections and other artefacts typical of outdoor scenes. The extracted objects are tracked using a statistical appearance model robust against occlusions and segmentation errors. The main novelty of this paper is the approach to consistent labeling. A specific Camera Transition Graph is adopted to efficiently select the possible correspondence hypotheses between labels. A Bayesian MAP optimization assigns consistent labels to objects detected from several points of view: the object axis is computed from the shape tracked in each camera module, and homography and epipolar lines allow a correct axis warping in other image planes. Both forward and backward probability contributions from the two different warping directions make the approach robust against segmentation errors and capable of disambiguating groups of people. The system has been tested in a real setup in an urban public park, within the Italian LAICA (Laboratory of Ambient Intelligence for a friendly city) project. The experiments show how the system can correctly track and label objects in a distributed system with real-time performance. Comparisons with simpler consistent labeling methods and extensive outdoor experiments with ground truth demonstrate the accuracy and robustness of the proposed approach.
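The two geometric tools mentioned above can be sketched in a few lines. A ground-plane homography maps points lying on the ground (such as the foot of an object's axis) from one view to another, while the fundamental matrix maps any image point (such as the top of the axis) to an epipolar line in the other view. How HECOL combines the two to warp a full axis is not detailed in the abstract; the function names and the matrices used in the example are hypothetical.

```python
def warp_ground_point(h, point):
    """Map a ground-plane image point through a 3x3 homography h."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return (u / w, v / w)

def epipolar_line(f, point):
    """Return (a, b, c) of the line a*u + b*v + c = 0 in the other view.

    f is a 3x3 fundamental matrix; the line is f applied to the
    homogeneous point (x, y, 1).
    """
    x, y = point
    return tuple(row[0] * x + row[1] * y + row[2] for row in f)

# Hypothetical matrices: a pure image shift and a horizontal-stereo geometry.
H_SHIFT = [[1, 0, 2], [0, 1, 3], [0, 0, 1]]
F_HORIZONTAL = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]

foot_in_other_view = warp_ground_point(H_SHIFT, (1, 1))
head_line = epipolar_line(F_HORIZONTAL, (3, 5))
```

Only the foot is warped with the homography because the homography is valid for ground-plane points; the head, which is off the ground, is constrained to its epipolar line instead.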
A Markerless Approach for Consistent Action Recognition in a Multi-camera System
This paper presents a method for recognizing human actions in a multi-camera setup. The proposed method automatically extracts significant points on the human body, without the need for artificial markers. A sophisticated appearance-based tracker able to cope with occlusions is exploited to extract a probability map for each moving object. A segmentation technique based on a mixture of Gaussians is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The point tracking produces a set of 3D trajectories that are compared with other trajectories by means of global alignment and dynamic programming techniques. Preliminary experiments show the potential of the proposed approach.
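Global alignment by dynamic programming, as mentioned above, can be illustrated with a Needleman-Wunsch style score over two symbol sequences. This is a generic sketch and not the paper's procedure: the symbol encoding of trajectories and the match, mismatch, and gap scores are all assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b (Needleman-Wunsch)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j symbols of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

# Hypothetical symbol sequences of different lengths.
score = global_alignment_score("ABCD", "ABD")
```

Because gaps are allowed, sequences of different lengths can still be compared: here three symbols match and one gap is paid, so the score is 2.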
