1,721,499 research outputs found

    ALICE physics: Theoretical overview

    No full text
    Alessandro B, Aurenche P, Baier R, Becattini F, Botje M. ALICE physics: Theoretical overview. CERN-Alice-Internal-Note. 2002;2002-025

    Joint-Based Action Progress Prediction

    Full text link
    Action understanding is a fundamental computer vision branch for several applications, ranging from surveillance to robotics. Most works deal with localizing and recognizing the action in both time and space, without providing a characterization of its evolution. Recent works have addressed the prediction of action progress, which is an estimate of how far the action has advanced as it is performed. In this paper, we propose to predict action progress using a different modality compared to previous methods: body joints. Human body joints carry very precise information about human poses, which we believe are a much more lightweight and effective way of characterizing actions and therefore their execution. Estimating action progress can in fact be determined based on the understanding of how key poses follow each other during the development of an activity. We show how an action progress prediction model can exploit body joints and integrate it with modules providing keypoint and action information in order to be run directly from raw pixels. The proposed method is experimentally validated on the Penn Action Dataset

    Understanding Human Reactions Looking at Facial Microexpressions With an Event Camera

    No full text
    With the establishment of Industry 4.0 , machines are now required to interact with workers. By observing biometrics they can assess if humans are authorized, or mentally and physically fit to work. Understanding body language, makes human–machine interaction more natural, secure, and effective. Nonetheless, traditional cameras have limitations; low frame rate and dynamic range hinder a comprehensive human understanding. This poses a challenge, since faces undergo frequent instantaneous microexpressions. In addition, this is privacy-sensitive information that must be protected. We propose to model expressions with event cameras, bio-inspired vision sensors that have found application within the Industry 4.0 scope. They capture motion at millisecond rates and work under challenging conditions like low illumination and highly dynamic scenes. Such cameras are also privacy-preserving, making them extremely interesting for industry. We show that using event cameras, we can understand human reactions by only observing facial expressions. Comparison with red-green-blue (RGB)-based modeling demonstrates improved effectiveness and robustness

    Events in crowded places: A smart service management

    No full text
    In this work we present eSERVANT (eventS in crowdEd places: a smaRt serVice mANagemenT), an ICT hardware and software infrastructure that allows the management of services tied to big events such as concerts, sport matches or fairs that take place in dedicated facilities. Among the main features of eSERVANT there is a vision-based dynamic routing that exploits computer vision techniques to detect people and track their movements within a facility. Thanks to an estimate of room occupancies and inbound/outbound flows, we propose an algorithm for dynamically selecting the best indoor route and avoid hazardous situations. At the same time, we perform user profiling finalized to promoting social networking through recommendation. To assess the usefulness of our infrastructure we established a test case over a six month experimentation and quantitatively evaluated user satisfaction

    Is GPT-3 All You Need for Visual Question Answering in Cultural Heritage?

    No full text
    The use of Deep Learning and Computer Vision in the Cultural Heritage domain is becoming highly relevant in the last few years with lots of applications about audio smart guides, interactive museums and augmented reality. All these technologies require lots of data to work effectively and be useful for the user. In the context of artworks, such data is annotated by experts in an expensive and time consuming process. In particular, for each artwork, an image of the artwork and a description sheet have to be collected in order to perform common tasks like Visual Question Answering. In this paper we propose a method for Visual Question Answering that allows to generate at runtime a description sheet that can be used for answering both visual and contextual questions about the artwork, avoiding completely the image and the annotation process. For this purpose, we investigate on the use of GPT-3 for generating descriptions for artworks analyzing the quality of generated descriptions through captioning metrics. Finally we evaluate the performance for Visual Question Answering and captioning tasks

    Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage

    Full text link
    Cultural heritage applications and advanced machine learning models are creating a fruitful synergy to provide effective and accessible ways of interacting with artworks. Smart audio-guides, personalized art-related content and gamification approaches are just a few examples of how technology can be exploited to provide additional value to artists or exhibitions. Nonetheless, from a machine learning point of view, the amount of available artistic data is often not enough to train effective models. Off-the-shelf computer vision modules can still be exploited to some extent, yet a severe domain shift is present between art images and standard natural image datasets used to train such models. As a result, this can lead to degraded performance. This paper introduces a novel approach to address the challenges of limited annotated data and domain shifts in the cultural heritage domain. By leveraging generative vision-language models, we augment art datasets by generating diverse variations of artworks conditioned on their captions. This augmentation strategy enhances dataset diversity, bridging the gap between natural images and artworks, and improving the alignment of visual cues with knowledge from general-purpose datasets. The generated variations assist in training vision and language models with a deeper understanding of artistic characteristics and that are able to generate better captions with appropriate jargon

    Multiple future prediction leveraging synthetic trajectories

    No full text
    Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can yield to an effective planning, ensuring safety for the autonomous vehicle as well for the observed entities. In this work we propose a data driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple future trajectory predictor. The advantages are twofold: on the one hand synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, it allows to generate samples with multiple ground truths, corresponding to diverse equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem and we show that combining synthetic and real data leads to prediction improvements, obtaining state of the art results

    Fashion recommendation based on style and social events

    No full text
    Fashion recommendation is often declined as the task of finding complementary items given a query garment or retrieving outfits that are suitable for a given user. In this work we address the problem by adding an additional semantic layer based on the style of the proposed dressing. We model style according to two important aspects: the mood and the emotion concealed behind color combination patterns and the appropriateness of the retrieved garments for a given type of social event. To address the former we rely on Shigenobu Kobayashi’s color image scale, which associated emotional patterns and moods to color triples. The latter instead is analyzed by extracting garments from images of social events. Overall, we integrate in a state of the art garment recommendation framework a style classifier and an event classifier in order to condition recommendation on a given query

    Transformer-based Graph Neural Networks for Outfit Generation

    No full text
    Suggesting complementary clothing items to compose an outfit is a process of emerging interest, yet it involves a fine understanding of fashion trends and visual aesthetics. Previous works have mainly focused on recommendation by scoring visual appeal and representing garments as ordered sequences or as collections of pairwise-compatible items. This limits the full usage of relations among clothes. We attempt to bridge the gap between outfit recommendation and generation by leveraging a graph-based representation of items in a collection. The work carried out in this paper, tries to build a bridge between outfit recommendation and generation, by discovering new appealing outfits starting from a collection of pre-existing ones. We propose a transformer-based architecture, named TGNN, which exploits multi-headed self attention to capture relations between clothing items in a graph as a message passing step in Convolutional Graph Neural Networks. Specifically, starting from a seed, i.e. one or more garments, outfit generation is performed by iteratively choosing the garment that is most compatible with the previously chosen ones. Extensive experimentations are conducted with two different datasets, demonstrating the capability of the model to perform seeded outfit generation as well as obtaining state of the art results on compatibility estimation tasks
    corecore