
    What does CLIP know about peeling a banana?

    Humans show an innate capability to identify tools to support specific actions. The association between object parts and the actions they facilitate is usually called affordance. Being able to segment object parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.

    Toward human-robot cooperation: unsupervised domain adaptation for egocentric action recognition

    With the advent of collaborative manipulators, the community is pushing the limits of human-robot interaction with novel control, planning, and task allocation strategies. For a purposeful interaction, however, the robot is also required to understand and predict the action of the human not only at a kinematic level (i.e. motion estimation), but also at a higher level of abstraction (i.e. action recognition), ideally from the human's own perspective. Dealing with egocentric videos comes with the benefit that the data source already embeds an intrinsic attention mechanism, driven by the focus of the user. However, the deployment of such technology in realistic use-cases cannot ignore the large variability of background characteristics when changing environment, resulting in a domain shift in feature space not learnable from labels at training time. In this paper, we discuss a method to perform Domain Adaptation with no external supervision, which we test on the EPIC-Kitchens-100 UDA Challenge in Action Recognition. More specifically, we move from our previous work on Relative Norm Alignment and extend the approach to unlabelled target data, enabling a simpler adaptation of the model to the target distribution in an unsupervised fashion. To this purpose, we enhanced our framework with multi-level adversarial alignment and with a set of losses aimed at reducing the classifier’s uncertainty on the target data. Extensive experiments demonstrate that our approach is capable of performing Multi-Source Multi-Target Domain Adaptation, thus minimising both temporal (i.e. different recording times) and environmental (i.e. different kitchens) biases.

    JIST: Joint Image and Sequence Training for Sequential Visual Place Recognition

    Visual Place Recognition aims at recognizing previously visited places by relying on visual clues, and it is used in robotics applications for SLAM and localization. Since a mobile robot typically has access to a continuous stream of frames, this task is naturally cast as a sequence-to-sequence localization problem. Nevertheless, obtaining sequences of labelled data is much more expensive than collecting isolated images, which can be done in an automated way with little supervision. To mitigate this problem, we propose a novel Joint Image and Sequence Training (JIST) protocol that leverages large uncurated sets of images through a multi-task learning framework. With JIST we also introduce SeqGeM, an aggregation layer that revisits the popular GeM pooling to produce a single robust and compact embedding from a sequence of single-frame embeddings. We show that our model is able to outperform the previous state of the art while being faster, using eight times smaller descriptors, having a lighter architecture, and being able to process sequences of various lengths.
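
    The idea of pooling a sequence of single-frame embeddings into one vector can be illustrated with plain generalized-mean (GeM) pooling applied along the sequence axis. This is only a minimal sketch under stated assumptions: the function name gem_pool_sequence, the exponent p=3 and the clamping constant eps are illustrative choices, and the code is not the SeqGeM layer from the paper.

```python
def gem_pool_sequence(frames, p=3.0, eps=1e-6):
    """Pool equal-length frame embeddings into one vector with a generalized mean.

    Hypothetical helper for illustration only; not the authors' SeqGeM layer.
    Values are clamped to eps, as GeM assumes non-negative inputs.
    """
    if not frames:
        raise ValueError("need at least one frame embedding")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frame embeddings must have the same length")

    pooled = []
    for d in range(dim):
        # Mean of the p-th powers across the sequence, then the p-th root.
        mean_pow = sum(max(f[d], eps) ** p for f in frames) / len(frames)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


# Two frames with 2-dimensional embeddings (made-up numbers).
sequence = [[1.0, 2.0], [3.0, 4.0]]
average = gem_pool_sequence(sequence, p=1.0)   # p = 1 is the arithmetic mean
sharper = gem_pool_sequence(sequence, p=3.0)   # larger p moves toward the max
```

    With p = 1 the result is the ordinary average of the frames, while larger values of p weight the strongest activations more heavily. The output length does not depend on the number of frames, so sequences of different lengths yield descriptors of the same size.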

    Are Local Features All You Need for Cross-Domain Visual Place Recognition?

    Visual Place Recognition is a task that aims to predict the coordinates of an image (called the query) based solely on visual clues. Most commonly, a retrieval approach is adopted, where the query is matched to the most similar images from a large database of geotagged photos, using learned global descriptors. Despite recent advances, recognizing the same place when the query comes from a significantly different distribution is still a major hurdle for state-of-the-art retrieval methods. Examples are heavy illumination changes (e.g. night-time images) or substantial occlusions (e.g. transient objects). In this work we explore whether re-ranking methods based on spatial verification can tackle these challenges, following the intuition that local descriptors are inherently more robust than global features to domain shifts. To this end, we provide a new, comprehensive benchmark on current state-of-the-art models. We also introduce two new demanding datasets with night and occluded queries, to be matched against a citywide database. Code and datasets are available at https://github.com/gbarbarani/re-ranking-for-VPR
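
    The retrieval step described above (match the query to the most similar geotagged database images by their global descriptors) can be sketched as a nearest-neighbour search. This is a minimal illustration, assuming cosine similarity as the measure and using made-up descriptors and coordinates; the helper names cosine and retrieve are hypothetical and are not taken from the linked repository. The spatial-verification re-ranking studied in the paper is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, database, k=1):
    """Return the coordinates of the k database entries most similar to the query.

    database is a list of (descriptor, (latitude, longitude)) pairs.
    """
    ranked = sorted(database, key=lambda entry: cosine(query, entry[0]), reverse=True)
    return [coords for _, coords in ranked[:k]]


# Toy database with 2-dimensional descriptors (illustrative values only).
database = [
    ([1.0, 0.0], (45.07, 7.69)),
    ([0.0, 1.0], (41.90, 12.50)),
    ([0.7, 0.7], (43.77, 11.26)),
]
top1 = retrieve([0.9, 0.1], database, k=1)
top2 = retrieve([0.9, 0.1], database, k=2)
```

    A re-ranking stage would take a shortlist such as top2 and reorder it with a second, more expensive comparison based on local features.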

    Collaborative Visual Place Recognition through Federated Learning

    Visual Place Recognition (VPR) aims to estimate the location of an image by treating it as a retrieval problem. VPR uses a database of geo-tagged images and leverages deep neural networks to extract a global representation, called a descriptor, from each image. While the training data for VPR models often originates from diverse, geographically scattered sources (geo-tagged images), the training process itself is typically assumed to be centralized. This research revisits the task of VPR through the lens of Federated Learning (FL), addressing several key challenges associated with this adaptation. VPR data inherently lacks well-defined classes, and models are typically trained using contrastive learning, which necessitates a data mining step on a centralized database. Additionally, client devices in federated systems can be highly heterogeneous in terms of their processing capabilities. The proposed FedVPR framework not only presents a novel approach for VPR but also introduces a new, challenging, and realistic task for FL research. This has the potential to spur the application of FL to other image retrieval tasks.

    Software-based solutions for the optimization of a building electric bill using integrated PV and storage systems: a case study

    Green energies have established themselves as a driving sector over the last decade, enabling economic and technological opportunities still to be investigated. This article proposes a solution for energy management that merges photovoltaics and storage systems, focusing on the urban environment and taking as the main case study the typical multi-storey building characterized by a high density of households. The proposed solution optimizes the cost of the electrical bill using a predictive algorithm, stemming from an economic analysis based on the production and consumption of the system.

    Deep Visual Geo-localization Benchmark

    In this paper, we propose a new open-source benchmarking framework for Visual Geo-localization (VG) that allows building, training, and testing a wide range of commonly used architectures, with the flexibility to change individual components of a geo-localization pipeline. The purpose of this framework is twofold: i) gaining insights into how different components and design choices in a VG pipeline impact the final results, both in terms of performance (recall@N metric) and system requirements (such as execution time and memory consumption); ii) establishing a systematic evaluation protocol for comparing different methods. Using the proposed framework, we perform a large suite of experiments which provide criteria for choosing backbone, aggregation and negative mining depending on the use-case and requirements. We also assess the impact of engineering techniques like pre/post-processing, data augmentation and image resizing, showing that better performance can be obtained through somewhat simple procedures: for example, downscaling the images' resolution to 80% can lead to similar results with a 36% savings in extraction time and dataset storage requirement. Code and trained models are available at https://deep-vg-bench.herokuapp.com/. Comment: CVPR 2022 (Oral)
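
    The recall@N metric mentioned above is commonly computed as the fraction of queries for which at least one of the top N retrieved database images is a correct match. The sketch below follows that common definition; the function name recall_at_n and the toy data are illustrative assumptions and are not taken from the benchmark's code.

```python
def recall_at_n(ranked_predictions, positives, n):
    """Fraction of queries with at least one correct match among the top n results.

    ranked_predictions: one list of database indices per query, best match first.
    positives: one set of correct database indices per query.
    """
    if len(ranked_predictions) != len(positives):
        raise ValueError("need one set of positives per query")
    if not ranked_predictions:
        raise ValueError("need at least one query")

    hits = 0
    for predictions, correct in zip(ranked_predictions, positives):
        # A query counts as a hit if any of its first n predictions is correct.
        if any(index in correct for index in predictions[:n]):
            hits += 1
    return hits / len(ranked_predictions)


# Two toy queries: the first is matched at rank 2, the second at rank 3.
ranked = [[3, 1, 2], [5, 4, 0]]
truth = [{1}, {0}]
r1 = recall_at_n(ranked, truth, 1)
r2 = recall_at_n(ranked, truth, 2)
r3 = recall_at_n(ranked, truth, 3)
```

    Here recall@1 is 0.0, recall@2 is 0.5 and recall@3 is 1.0, which shows how the metric grows as more candidates per query are considered.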