A CNN-ViT hybrid architecture search benchmark on a large-scale dataset
In recent years, Neural Architecture Search (NAS) has emerged as a promising methodology to automate the design of deep neural networks, enabling the discovery of high-performing architectures across a wide range of tasks. Due to the high computational cost associated with NAS, several benchmarks have been introduced to support the development and evaluation of NAS methods. However, existing benchmarks are often limited in scope, typically relying on small-scale datasets or narrow search spaces, mostly based on Convolutional Neural Networks (CNNs) only. To address these limitations, we introduce HyViTas-Bench, a novel NAS benchmark specifically tailored for hybrid CNN-Vision Transformer (ViT) architectures. HyViTas-Bench contains 6,561 unique models, each trained three times on a reduced, yet large-scale, version of ImageNet-1k, offering an evaluation setting that better reflects realistic data. Each architecture is evaluated on 19 hardware platforms (CPU, GPU, and edge devices) for latency measurements, while robustness is validated through repeated training. We also provide an analysis of Out-of-Distribution (OoD) generalization using three external datasets. HyViTas-Bench enables a multifaceted assessment of NAS methods in terms of accuracy, latency, generalization capability, and model size. As such, it represents a valuable resource for advancing research on hybrid architectures and for facilitating the design and comparison of NAS strategies under more realistic and diverse evaluation criteria.
Toward human-robot cooperation: unsupervised domain adaptation for egocentric action recognition
With the advent of collaborative manipulators, the community is pushing the limits of human-robot interaction with novel control, planning, and task allocation strategies. For a purposeful interaction, however, the robot is also required to understand and predict the action of the human not only at a kinematic level (i.e. motion estimation), but also at a higher level of abstraction (i.e. action recognition), ideally from the human's own perspective. Dealing with egocentric videos comes with the benefit that the data source already embeds an intrinsic attention mechanism, driven by the focus of the user. However, the deployment of such technology in realistic use-cases cannot ignore the large variability of background characteristics when changing environment, resulting in a domain shift in feature space not learnable from labels at training time. In this paper, we discuss a method to perform Domain Adaptation with no external supervision, which we test on the EPIC-Kitchens-100 UDA Challenge in Action Recognition. More specifically, we move from our previous work on Relative Norm Alignment and extend the approach to unlabelled target data, enabling a simpler adaptation of the model to the target distribution in an unsupervised fashion. To this purpose, we enhanced our framework with multi-level adversarial alignment and with a set of losses aimed at reducing the classifier’s uncertainty on the target data. Extensive experiments demonstrate how our approach is capable of performing Multi-Source Multi-Target Domain Adaptation, thus minimising both temporal (i.e. different recording times) and environmental (i.e. different kitchens) biases.
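A common way to reduce a classifier's uncertainty on unlabelled target data is to penalise the entropy of its predictions. The sketch below illustrates that generic idea only; the abstract does not specify which losses the authors use, so the function names and the choice of plain Shannon entropy are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)

def target_entropy_loss(batch_logits):
    """Mean prediction entropy over an unlabelled target batch.

    Assumption: this is a generic entropy-minimisation term, not
    necessarily one of the losses used in the paper.
    """
    return sum(entropy(softmax(l)) for l in batch_logits) / len(batch_logits)
```

Minimising this quantity pushes the classifier toward confident predictions on target samples without requiring any target labels.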
The revenge of BiSeNet: Efficient Multi-Task Image Segmentation
Recent advancements in image segmentation have focused on enhancing the efficiency of the models to meet the demands of real-time applications, especially on edge devices. However, existing research has primarily concentrated on single-task settings, especially on semantic segmentation, leading to redundant efforts and specialized architectures for different tasks. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation, capable of handling various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which leverages the efficiency of two-stream semantic segmentation architectures and extends them into a mask classification framework. Our approach maintains the efficient spatial and context paths to capture detailed and semantic information, respectively, while leveraging an efficient transformer-based segmentation head that computes the binary masks and class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on popular datasets, Cityscapes and ADE20K, demonstrating impressive inference speeds while maintaining competitive accuracy compared to state-of-the-art architectures. Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks, bridging the gap between model efficiency and task adaptability.
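In a mask classification framework, each query of the segmentation head yields a binary mask and a class distribution, and a semantic map can be obtained by combining the two. The sketch below shows the standard combination rule used by mask-classification models in general; whether BiSeNetFormer uses exactly this rule is an assumption, and the function name and the nested-list data layout are hypothetical.

```python
def semantic_inference(class_probs, mask_probs):
    """Combine per-query outputs into a per-pixel label map.

    class_probs: Q x C nested list, class probabilities per query
                 (any "no object" column is assumed already dropped).
    mask_probs:  Q x H x W nested list, mask probabilities in [0, 1].
    Returns an H x W nested list of class indices.
    """
    num_queries = len(class_probs)
    num_classes = len(class_probs[0])
    height = len(mask_probs[0])
    width = len(mask_probs[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of
            # (probability the query is class c) * (mask probability).
            scores = [
                sum(class_probs[q][c] * mask_probs[q][y][x]
                    for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels
```

Because the same per-query masks and class scores can also be grouped into instances, a single head of this kind can serve both semantic and panoptic segmentation.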
What does CLIP know about peeling a banana?
Humans show an innate capability to identify tools to support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.
FreeREA: Training-Free Evolution-based Architecture Search
In the last decade, most research in Machine Learning contributed to the improvement of existing models, with the aim of increasing the performance of neural networks for the solution of a variety of different tasks. However, such advancements often come at the cost of an increase in model memory and computational requirements. This represents a significant limitation for the deployability of research output in realistic settings, where the cost, the energy consumption, and the complexity of the framework play a crucial role. To solve this issue, the designer should search for models that maximise performance while limiting their footprint. Typical approaches to reach this goal rely either on manual procedures, which cannot guarantee the optimality of the final design, or on Neural Architecture Search algorithms to automate the process, at the expense of extremely high computational time. This paper provides a solution for the fast identification of a neural network that maximises model accuracy while respecting the size and computational constraints typical of tiny devices. Our approach, named FreeREA, is a custom cell-based evolution NAS algorithm that exploits an optimised combination of training-free metrics to rank architectures during the search, thus without the need for model training. Our experiments, carried out on the common benchmarks NAS-Bench-101 and NATS-Bench, demonstrate that i) FreeREA is a fast, efficient, and effective search method for automatic model design; ii) it outperforms state-of-the-art training-based and training-free techniques in all the datasets and benchmarks considered; and iii) it can easily generalise to constrained scenarios, representing a competitive solution for fast Neural Architecture Search in generic constrained applications. The code is available at https://github.com/NiccoloCavagnero/FreeREA
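The general pattern of an evolutionary search that ranks candidates with a training-free score under a size constraint can be sketched as follows. This is a generic aging-evolution loop on a toy search space, not the FreeREA algorithm itself: the operation set, the cost table, and `proxy_score` are hypothetical placeholders, and a real search would score an untrained network instead.

```python
import random

OPS = [0, 1, 2, 3]        # hypothetical operation ids, one per cell edge
OP_COST = [0, 1, 2, 4]    # hypothetical size cost of each operation

def size(arch):
    """Toy model size: total cost of the chosen operations."""
    return sum(OP_COST[op] for op in arch)

def proxy_score(arch):
    """Placeholder for a training-free metric (no training involved)."""
    return len(set(arch)) + 0.5 * size(arch)

def mutate(arch, rng):
    """Replace the operation on one randomly chosen edge."""
    child = list(arch)
    i = rng.randrange(len(child))
    child[i] = rng.choice([op for op in OPS if op != child[i]])
    return child

def evolve(n_edges=6, pop_size=10, sample_size=3, steps=200,
           max_size=12, seed=0):
    rng = random.Random(seed)

    def random_arch():
        # Rejection-sample until the size constraint is met.
        while True:
            arch = [rng.choice(OPS) for _ in range(n_edges)]
            if size(arch) <= max_size:
                return arch

    population = [random_arch() for _ in range(pop_size)]
    best = max(population, key=proxy_score)
    for _ in range(steps):
        # Tournament selection: best of a random sample becomes the parent.
        parent = max(rng.sample(population, sample_size), key=proxy_score)
        child = mutate(parent, rng)
        if size(child) > max_size:
            continue                  # discard infeasible children
        population.append(child)
        population.pop(0)             # aging: drop the oldest individual
        if proxy_score(child) > proxy_score(best):
            best = child
    return best
```

Calling `evolve()` returns the highest-scoring feasible architecture encountered; since every candidate is ranked by the proxy alone, the cost of a search step is one score evaluation rather than one training run.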
AMEGO: Active Memory from long EGOcentric videos
Egocentric videos provide a unique perspective into individuals’ daily experiences, yet their unstructured nature presents challenges for perception. In this paper, we introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos. Inspired by the human ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from one egocentric video, capturing key locations and object interactions. This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content. Additionally, to evaluate our understanding of very-long egocentric videos, we introduce the new Active Memories Benchmark (AMB), composed of more than 20K highly challenging visual queries from EPIC-KITCHENS. These queries cover different levels of video reasoning (sequencing, concurrency and temporal grounding) to assess detailed video understanding capabilities. We showcase improved performance of AMEGO on AMB, surpassing other video QA baselines by a substantial margin.
Rethinking Cross-Modal Interaction for Efficient Referring Image Segmentation
Referring Image Segmentation (RIS), the task of finding and segmenting objects in an image conditioned on a natural language description, is crucial for human-robot collaboration. However, current RIS methods often implement visual-text alignment relying on computationally intensive Transformer-based self-attention mechanisms, which impairs deployment on robots, especially those with limited computational resources. Indeed, beyond accuracy, practical robotic applications demand efficient models with small footprints. This letter introduces ERIS, an Efficient RIS approach designed for real-world deployment. ERIS achieves effective multi-modal interaction through a novel dual-branch architecture: a Visual Text Alignment branch and a Text Visual Refinement branch. This design implements bilateral alignment between textual and visual modalities without the computational burden of self-attention. Of note, the progressive alignment in ERIS enhances interpretability, revealing how textual cues guide segmentation. For the sake of efficiency, our alignment strategy produces structured embeddings which can be directly mapped into the final segmentation mask, without the need for additional segmentation heads. Thus, the footprint of ERIS scales linearly with respect to the number of visual and text tokens, making it suitable for both cloud-based and edge deployment. Experimental results demonstrate that ERIS achieves competitive or superior performance compared to state-of-the-art methods while significantly reducing computational cost, proving that efficiency and accuracy are not mutually exclusive.
PoliTO-IIT-CINI Submission to the EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
Entropic Score Metric: Decoupling Topology and Size in Training-Free NAS
Neural network design is a complex and often daunting task, particularly for resource-constrained scenarios typical of mobile-sized models. Neural Architecture Search is a promising approach to automate this process, but existing competitive methods require large training time and computational resources to generate accurate models. To overcome these limits, this paper contributes: i) a novel training-free metric, named Entropic Score, to estimate model expressivity through the aggregated element-wise entropy of its activations; ii) a cyclic search algorithm to separately yet synergistically search model size and topology. Entropic Score shows remarkable ability in searching for the topology of the network, and a proper combination with LogSynflow, to search for model size, yields superior capability to completely design high-performance Hybrid Transformers for edge applications in less than 1 GPU hour, resulting in the fastest and most accurate NAS method for ImageNet classification.
Bringing Online Egocentric Action Recognition into the wild
To enable safe and effective human-robot cooperation, it is crucial to
develop models for the identification of human activities. Egocentric vision
seems to be a viable solution to solve this problem, and therefore many works
provide deep learning solutions to infer human actions from first person
videos. However, although very promising, most of these do not consider the
major challenges that come with a realistic deployment, such as the
portability of the model, the need for real-time inference, and the robustness
with respect to novel domains (i.e., new spaces, users, tasks). With this
paper, we set the boundaries that egocentric vision models should consider for
realistic applications, defining a novel setting of egocentric action
recognition in the wild, which encourages researchers to develop novel,
applications-aware solutions. We also present a new model-agnostic technique
that enables the rapid repurposing of existing architectures in this new
context, demonstrating the feasibility of deploying a model on a tiny device
(Jetson Nano) and of performing the task directly on the edge with very low energy
consumption (2.4W on average at 50 fps). The code is publicly available at:
https://github.com/EgocentricVision/EgoWild.
Comment: Accepted to RA-L; for the associated video, see
https://www.youtube.com/watch?v=7rtynmoYnuw&t=9