Single View 3D Face Reconstruction
The chapter describes the recent literature in the field of 3D face reconstruction.
Continuous localization and mapping of a pan-tilt-zoom camera for wide area tracking
Pan–tilt–zoom (PTZ) cameras are well suited for object identification and recognition in far-field scenes. However, their effective use is complicated by two facts: continuous online camera calibration is needed, and the absolute pan, tilt and zoom values provided by the camera actuators cannot be used because they are not synchronized with the video stream. Accurate calibration must therefore be extracted directly from the visual content of the frames. Moreover, the large and abrupt scale changes, the scene background changes due to camera operation, and the need for camera motion compensation make target tracking with these cameras extremely challenging. In this paper, we present a solution that provides continuous online calibration of PTZ cameras and is robust to rapid camera motion and to changes in the environment caused by varying illumination or moving objects. The approach also scales beyond thousands of scene landmarks extracted with the SURF keypoint detector. The method directly derives the relationship between the position of a target in the ground plane and the corresponding scale and position in the image, and allows real-time tracking of multiple targets with a high and stable degree of accuracy even at far distances and at any zoom level.
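The relationship between a target's ground-plane position and its position and scale in the image can be illustrated with a plain pinhole model. The sketch below is not the paper's calibration method: it assumes the pan, tilt and focal length are already known, and the names `ptz_projection`, `project` and `target_in_image`, the axis conventions, and the 1.8 m target height are all hypothetical choices made for illustration.

```python
import numpy as np

def ptz_projection(focal_px, pan, tilt, cam_height, center):
    """Build a 3x4 pinhole projection matrix (assumed conventions).

    World frame: X right, Y forward, Z up, ground plane at Z = 0.
    Camera frame: x right, y down, z along the optical axis.
    pan rotates about the vertical axis, tilt points the camera downward.
    """
    cp, sp = np.cos(pan), np.sin(pan)
    ct, st = np.cos(tilt), np.sin(tilt)
    rz = np.array([[cp, -sp, 0.0], [sp, cp, 0.0], [0.0, 0.0, 1.0]])
    # Rows are the camera x, y, z axes expressed in world coordinates (pan = 0).
    axes = np.array([[1.0, 0.0, 0.0], [0.0, -st, -ct], [0.0, ct, -st]])
    R = axes @ rz.T                      # world-to-camera rotation
    C = np.array([0.0, 0.0, cam_height])  # camera centre above the ground
    cx, cy = center
    K = np.array([[focal_px, 0.0, cx], [0.0, focal_px, cy], [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, (-R @ C)[:, None]])

def project(P, point3d):
    """Project a 3D world point to pixel coordinates."""
    x = P @ np.append(np.asarray(point3d, dtype=float), 1.0)
    return x[:2] / x[2]

def target_in_image(P, ground_xy, target_height=1.8):
    """Return the image position of the feet and the apparent height in pixels."""
    gx, gy = ground_xy
    foot = project(P, (gx, gy, 0.0))
    head = project(P, (gx, gy, target_height))
    return foot, float(np.linalg.norm(foot - head))

P = ptz_projection(focal_px=1000.0, pan=0.0, tilt=0.0, cam_height=10.0,
                   center=(640.0, 360.0))
foot, scale = target_in_image(P, (0.0, 20.0))
print(foot, scale)
```

Under these assumptions, doubling the focal length (zooming in) doubles the apparent target height, and doubling the distance of the target from the camera halves it, which is why a tracker needs the current calibration to predict target scale.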
Online Deep Clustering with Video Track Consistency
Several unsupervised and self-supervised approaches have been developed in
recent years to learn visual features from large-scale unlabeled datasets.
Their main drawback however is that these methods are hardly able to recognize
visual features of the same object if it is simply rotated or the perspective
of the camera changes. To overcome this limitation and at the same time exploit
a useful source of supervision, we take into account video object tracks.
Following the intuition that two patches in a track should have similar visual
representations in a learned feature space, we adopt an unsupervised
clustering-based approach and constrain such representations to be labeled as
the same category since they likely belong to the same object or object part.
Experimental results on two downstream tasks on different datasets demonstrate
the effectiveness of our Online Deep Clustering with Video Track Consistency
(ODCT) approach compared to prior work, which did not leverage temporal
information. In addition, we show that exploiting an unsupervised,
class-agnostic, yet noisy track generator yields better accuracy than
relying on costly and precise track annotations.
Comment: Accepted at ICPR2022 as ora
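One simple way to picture the track-consistency constraint is to force all patches of a track to share a single cluster pseudo-label. The sketch below is only an illustration under that assumption and is not necessarily the rule used by ODCT: the function name `track_consistent_labels` and the choice of averaging the soft assignments over the track before taking the argmax are hypothetical.

```python
import numpy as np

def track_consistent_labels(cluster_probs, track_ids):
    """Assign one pseudo-label per track (illustrative rule, an assumption).

    cluster_probs: (n_patches, n_clusters) soft cluster assignments.
    track_ids:     (n_patches,) id of the video track each patch belongs to.
    Every patch in a track receives the cluster with the highest mean
    probability over that track.
    """
    cluster_probs = np.asarray(cluster_probs, dtype=float)
    track_ids = np.asarray(track_ids)
    labels = np.empty(len(track_ids), dtype=int)
    for track in np.unique(track_ids):
        mask = track_ids == track
        labels[mask] = cluster_probs[mask].mean(axis=0).argmax()
    return labels

probs = [[0.6, 0.4], [0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]
tracks = [0, 0, 0, 1]
print(track_consistent_labels(probs, tracks))
```

In this toy input the first patch on its own would prefer cluster 0, but the rest of its track pulls it to cluster 1, which is the kind of viewpoint-robust agreement the abstract describes.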
Multimedia at work: Natural interfaces to enhance visitors' experiences
The authors present a multimedia system that really works in a cultural public space. Indeed, if you go to Florence and visit the museum of Palazzo Medici Riccardi, you might see a queue of tourists from around the world waiting for their turn to play with a digital version of the famous fresco The Journey of the Magi, appearing on two large screens. Visitors stand in front of the screens and point with their hands to the part of the painting they're interested in. Two cameras capture the pointing gesture and an algorithm calculates the exact part of the painting the person selected. In response to the pointing, an audio response gives information on the subjects or objects. Visitors seem to deeply enjoy their interaction with the system, which does feel natural. Visitors wear no special equipment and use no complex hardware; the fresco is extremely well displayed, and typically the information is precise and interesting, with different levels of information available.
Scene-dependent proposals for efficient person detection
In this paper, we present a new method that provides a substantial speed-up of person detection while maintaining high classification accuracy. Our method learns a Gaussian Mixture Model of the locations and scales of the persons in the scene under observation. The model is learnt in an unsupervised way from a set of detections extracted from a small number of frames, so that each component of the mixture represents the expectation of finding a target in a region of the image at a specific scale. At runtime, the windows that most likely contain a person are sampled from the components and evaluated by the classifier. Experimental results show that replacing the classic sliding window approach with our scene-dependent proposals in state-of-the-art person detectors drastically reduces the computational complexity while delivering equal or higher accuracy.
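The idea of learning a mixture over detection locations and scales and then sampling candidate windows from it can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each detection is a triple (centre x, centre y, height), uses a diagonal-covariance EM written directly in NumPy, and the names `fit_gmm` and `sample_proposals` and the fixed aspect ratio of 0.41 are hypothetical.

```python
import numpy as np

def fit_gmm(data, k, iters=50, seed=0):
    """Fit a diagonal-covariance Gaussian mixture with plain EM."""
    rng = np.random.default_rng(seed)
    n, _ = data.shape
    means = data[rng.choice(n, size=k, replace=False)].copy()
    variances = np.tile(data.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities computed in log space for stability.
        diff = data[:, None, :] - means[None, :, :]
        log_p = -0.5 * (diff ** 2 / variances
                        + np.log(2 * np.pi * variances)).sum(axis=2)
        log_p += np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and per-dimension variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp.T @ data) / nk[:, None]
        diff = data[:, None, :] - means[None, :, :]
        variances = (resp[:, :, None] * diff ** 2).sum(axis=0) / nk[:, None] + 1e-6
    return weights, means, variances

def sample_proposals(weights, means, variances, n, aspect=0.41, seed=0):
    """Sample n windows (x1, y1, x2, y2) from the mixture."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n, p=weights)
    samples = rng.normal(means[comp], np.sqrt(variances[comp]))
    cx, cy = samples[:, 0], samples[:, 1]
    h = np.maximum(samples[:, 2], 1.0)   # keep window heights positive
    w = aspect * h                        # assumed fixed width/height ratio
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

# Synthetic detections: small persons far away, large persons close by.
rng = np.random.default_rng(1)
far = rng.normal([100.0, 200.0, 50.0], 5.0, size=(200, 3))
near = rng.normal([500.0, 300.0, 120.0], 5.0, size=(200, 3))
weights, means, variances = fit_gmm(np.vstack([far, near]), k=2)
boxes = sample_proposals(weights, means, variances, n=100)
print(boxes[:3])
```

Only the sampled windows would then be passed to the person classifier, instead of every position and scale of a sliding window.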
FLODCAST: Flow and depth forecasting via multimodal recurrent architectures
Forecasting the motion and spatial positions of objects is of fundamental importance, especially in safety-critical settings such as autonomous driving. In this work, we address the issue by forecasting two different modalities that carry complementary information, namely optical flow and depth. To this end, we propose FLODCAST, a flow and depth forecasting model that leverages a multitask recurrent architecture, trained to jointly forecast both modalities at once. We stress the importance of training using flows and depth maps together, demonstrating that both tasks improve when the model is informed of the other modality. We also train the proposed model to perform predictions for several timesteps in the future. This provides better supervision and leads to more precise predictions, while retaining the capability of the model to yield outputs autoregressively for any future time horizon. We test our model on the challenging Cityscapes dataset, obtaining state-of-the-art results for both flow and depth forecasting. Thanks to the high quality of the generated flows, we also report benefits on the downstream task of segmentation forecasting, injecting our predictions into a flow-based mask-warping framework.
Multitarget tracking in 3D with PTZ camera networks
The increasingly widespread availability of video acquisition sensors has greatly increased the possibility of setting up camera networks for video surveillance. At the same time, the availability of sensors that can be steered in the desired direction and have a variable focal length raises a new problem: how best to manage the sensor so as to maximize the amount of information collected during monitoring. However, this greater availability and quality of low-cost technology has not been matched by adequate progress in automatic video surveillance systems. In practice, many monitoring tasks are still delegated to human operators, whose attention decays very quickly once the amount of information to handle exceeds a certain threshold.
In this article we present a solution for active video surveillance of large outdoor areas that can effectively exploit a network of Pan Tilt Zoom (PTZ) cameras to track moving targets and to acquire high-resolution images that facilitate the identification, by humans or by automated systems, of the observed events and subjects.
Zero-Shot Image Retrieval with Human Feedback
Composed image retrieval extends traditional content-based image retrieval (CBIR) by combining a query image with additional descriptive text to express user intent and specify supplementary requests related to the visual attributes of the query image. This approach holds significant potential for e-commerce applications, such as interactive multimodal searches and chatbots. In our demo, we present an interactive composed image retrieval system based on the SEARLE approach, which tackles this task efficiently and effectively in a zero-shot manner. The demo allows users to perform image retrieval and iteratively refine the results using textual feedback.
