1,720,972 research outputs found

    Multi-Modal Learning for Cross-Domain Analysis of Egocentric Action and Object Recognition

    Full text link
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Recurrent convolutional fusion for RGB-D object recognition

    No full text
    Providing robots with the ability to recognize objects like humans has always been one of the primary goals of robot vision. The introduction of RGB-D cameras has paved the way for a significant leap forward in this direction thanks to the rich information provided by these sensors. However, the robot vision community still lacks an effective method to synergically use the RGB and depth data to improve object recognition. In order to take a step in this direction, we introduce a novel end-to-end architecture for RGB-D object recognition called recurrent convolutional fusion (RCFusion). Our method generates compact and highly discriminative multi-modal features by combining RGB and depth information representing different levels of abstraction. Extensive experiments on two popular datasets, RGB-D Object Dataset and JHUIT-50, show that RCFusion significantly outperforms state-of-the-art approaches in both the object categorization and instance recognition tasks. In addition, experiments on the more challenging Object Clutter Indoor Dataset confirm the validity of our method in the presence of clutter and occlusion. The code is publicly available at: “ https://github.com/MRLoghmani/rcfusion .

    Toward human-robot cooperation: unsupervised domain adaptation for egocentric action recognition

    Full text link
    With the advent of collaborative manipulators, the community is pushing the limits of human-robot interaction with novel control, planning, and task allocation strategies. For a purposeful interaction, however, the robot is also required to understand and predict the action of the human not only at a kinematic level (i.e. motion estimation), but also at an higher level of abstraction (i.e. action recognition), ideally from the human own perspective. Dealing with egocentric videos comes with the benefit that the data source already embeds an intrinsic attention mechanism, driven by the focus of the user. However, the deployment of such technology in realistic use-cases cannot ignore the large variability of background characteristics when changing environment, resulting in a domain shift in features space not learnable from labels at training time. In this paper, we discuss a method to perform Domain Adaptation with no external supervision, which we test on the EPIC-Kitchens-100 UDA Challenge in Action Recognition. More specifically, we move from our previous work on Relative Norm Alignment and extend the approach to unlabelled target data, enabling a simpler adaptation of the model to the target distribution in an unsupervised fashion. To this purpose, we enhanced our framework with multi-level adversarial alignment and with a set of losses aimed at reducing the classifier’s uncertainty on the target data. Extensive experiments demonstrate how our approach is capable to perform Multi-Source Multi-Target Domain Adaptation, thus minimising both temporal (i.e. different recording times) and environmental (i.e. different kitchens) biases

    Domain generalization through audio-visual relative norm alignment in first person action recognition

    Full text link
    First person action recognition is becoming an increasingly researched area thanks to the rising popularity of wearable cameras. This is bringing to light cross-domain issues that are yet to be addressed in this context. Indeed, the information extracted from learned representations suffers from an intrinsic "environmental bias". This strongly affects the ability to generalize to unseen scenarios, limiting the application of current methods to real settings where labeled data are not available during training. In this work, we introduce the first domain generalization approach for egocentric activity recognition, by proposing a new audiovisual loss, called Relative Norm Alignment loss. It rebalances the contributions from the two modalities during training, over different domains, by aligning their feature norm representations. Our approach leads to strong results in domain generalization on both EPIC-Kitchens-55 and EPIC-Kitchens-100, as demonstrated by extensive experiments, and can be extended to work also on domain adaptation settings with competitive results

    DA4Event: towards bridging the Sim-to-Real Gap for Event Cameras using Domain Adaptation

    Full text link
    Event cameras are novel bio-inspired sensors, which asynchronously capture pixel-level intensity changes in the form of ``events". The innovative way they acquire data presents several advantages over standard devices, especially in poor lighting and high-speed motion conditions. However, the novelty of these sensors results in the lack of a large amount of training data capable of fully unlocking their potential. The most common approach implemented by researchers to address this issue is to leverage extit{simulated event data}. Yet, this approach comes with an open research question: extit{how well simulated data generalize to real data?} To answer this, we propose to exploit, in the event-based context, recent Domain Adaptation (DA) advances in traditional computer vision, showing that DA techniques applied to event data help reduce the extit{sim-to-real} gap. To this purpose, we propose a novel architecture, which we call {Multi-View DA4E} ({MV-DA4E}), that better exploits the peculiarities of frame-based event representations while also promoting domain invariant characteristics in features. Through extensive experiments, we prove the effectiveness of DA methods and {MV-DA4E} on N-Caltech101. Moreover, we validate their soundness in a real-world scenario through a cross-domain analysis on the popular RGB-D Object Dataset (ROD), which we extended to the event modality (RGB-E)

    Leveraging over depth in egocentric activity recognition

    Full text link
    Activity recognition from first person videos is a growing research area. The increasing diffusion of egocentric sensors in various devices makes it timely to develop approaches able to recognize fine grained first person actions like picking up, putting down, pouring and so forth. While most of previous work focused on RGB data, some authors pointed out the importance of leveraging over depth information in this domain. In this paper we follow this trend and we propose the first deep architecture that uses depth maps as an attention mechanism for first person activity recognition. Specifically, we blend together the RGB and depth data, so to obtain an enriched input for the network. This blending puts more or less emphasis on different parts of the image based on their distance from the observer, hence acting as an attention mechanism. To further strengthen the proposed activity recognition protocol, we opt for a self labeling approach. This, combined with a Conv-LSTM block for extracting temporal information from the various frames, leads to the new state of the art on two publicly available benchmark databases. An ablation study completes our experimental findings, confirming the effectiveness of our approac

    Bringing Online Egocentric Action Recognition into the wild

    Full text link
    To enable a safe and effective human-robot cooperation, it is crucial to develop models for the identification of human activities. Egocentric vision seems to be a viable solution to solve this problem, and therefore many works provide deep learning solutions to infer human actions from first person videos. However, although very promising, most of these do not consider the major challenges that comes with a realistic deployment, such as the portability of the model, the need for real-time inference, and the robustness with respect to the novel domains (i.e., new spaces, users, tasks). With this paper, we set the boundaries that egocentric vision models should consider for realistic applications, defining a novel setting of egocentric action recognition in the wild, which encourages researchers to develop novel, applications-aware solutions. We also present a new model-agnostic technique that enables the rapid repurposing of existing architectures in this new context, demonstrating the feasibility to deploy a model on a tiny device (Jetson Nano) and to perform the task directly on the edge with very low energy consumption (2.4W on average at 50 fps). The code is publicly available at: https://github.com/EgocentricVision/EgoWild.Comment: Accepted to RA-L, for associated video, see https://www.youtube.com/watch?v=7rtynmoYnuw&t=9

    Egocentric zone-aware action recognition across environments

    Full text link
    Human activities exhibit a strong correlation between actions and the places where these are performed, such as washing something at a sink. More specifically, in daily living environments we may identify particular locations, hereinafter named activity-centric zones, which may afford a set of homogeneous actions. Their knowledge can serve as a prior to favor vision models to recognize human activities. However, the appearance of these zones is scene-specific, limiting the transferability of this prior information to unfamiliar areas and domains. This problem is particularly relevant in egocentric vision, where the environment takes up most of the image, making it even more difficult to separate the action from the context. In this paper, we discuss the importance of decoupling the domain-specific appearance of activity-centric zones from their universal, domain-agnostic representations, and show how the latter can improve the cross-domain transferability of Egocentric Action Recognition (EAR) models. We validate our solution on the Epic-Kitchens-100 and Argo1M datasets
    corecore