Learning robust and efficient point cloud representations
The abstract is in the attachment
A CNN-ViT hybrid architecture search benchmark on a large-scale dataset
In recent years, Neural Architecture Search (NAS) has emerged as a promising methodology to automate the design of deep neural networks, enabling the discovery of high-performing architectures across a wide range of tasks. Due to the high computational cost associated with NAS, several benchmarks have been introduced to support the development and evaluation of NAS methods. However, existing benchmarks are often limited in scope, typically relying on small-scale datasets or narrow search spaces, mostly based on Convolutional Neural Networks (CNNs) only. To address these limitations, we introduce HyViTas-Bench, a novel NAS benchmark specifically tailored for hybrid CNN-Vision Transformer (ViT) architectures. HyViTas-Bench contains 6,561 unique models trained three times on a reduced, yet large-scale, version of ImageNet-1k, offering an evaluation setting that better reflects realistic data. Each architecture is evaluated on 19 hardware platforms (CPU, GPU, and edge devices) for latency measurements, while robustness is validated through repeated training. We also provide an analysis of Out-of-Distribution (OoD) generalization using three external datasets. HyViTas-Bench enables a multifaceted assessment of NAS methods in terms of accuracy, latency, generalization capability, and model size. As such, it represents a valuable resource for advancing research on hybrid architectures and for facilitating the design and comparison of NAS strategies under more realistic and diverse evaluation criteria
Rethinking Cross-Modal Interaction for Efficient Referring Image Segmentation
Referring Image Segmentation (RIS), the task of finding and segmenting objects in an image conditioned on a natural language description, is crucial for human-robot collaboration. However, current RIS methods often implement visual-text alignment by relying on computationally intensive Transformer-based self-attention mechanisms, which impairs deployment on robots, especially those with limited computational resources. Indeed, beyond accuracy, practical robotic applications demand efficient models with small footprints. This letter introduces ERIS, an Efficient RIS approach designed for real-world deployment. ERIS achieves effective multi-modal interaction through a novel dual-branch architecture: a Visual Text Alignment branch and a Text Visual Refinement branch. This design implements bilateral alignment between textual and visual modalities without the computational burden of self-attention. Of note, the progressive alignment in ERIS enhances interpretability, revealing how textual cues guide segmentation. For the sake of efficiency, our alignment strategy produces structured embeddings which can be directly mapped into the final segmentation mask, without the need for additional segmentation heads. Thus, the footprint of ERIS scales linearly with respect to the number of visual and text tokens, making it suitable for both cloud-based and edge deployment. Experimental results demonstrate that ERIS achieves competitive or superior performance compared to state-of-the-art methods while significantly reducing computational cost, proving that efficiency and accuracy are not mutually exclusive
Point Cloud Normal Estimation with Graph-Convolutional Neural Networks
Surface normal estimation is a basic task for many point cloud processing algorithms. However, it can be challenging to capture the local geometry of the data, especially in the presence of noise. Recently, deep learning approaches have shown promising results. Nevertheless, applying convolutional neural networks to point clouds is not straightforward, due to the irregular positioning of the points. In this paper, we propose a normal estimation method based on graph-convolutional neural networks to deal with such an irregular point cloud domain. The graph-convolutional layers build hierarchies of localized features to solve the estimation problem. We show state-of-the-art performance and robust results even in the presence of noise
Entropic Score Metric: Decoupling Topology and Size in Training-Free NAS
Neural network design is a complex and often daunting task, particularly for resource-constrained scenarios typical of mobile-sized models. Neural Architecture Search (NAS) is a promising approach to automate this process, but existing competitive methods require long training times and large computational resources to generate accurate models. To overcome these limits, this paper contributes: i) a novel training-free metric, named Entropic Score, to estimate model expressivity through the aggregated element-wise entropy of its activations; ii) a cyclic search algorithm to separately yet synergistically search model size and topology. Entropic Score shows remarkable ability in searching for the topology of the network, and a proper combination with LogSynflow, to search for model size, yields superior capability to completely design high-performance Hybrid Transformers for edge applications in less than 1 GPU hour, resulting in the fastest and most accurate NAS method for ImageNet classification
Learning Graph-Convolutional Representations for Point Cloud Denoising
Point clouds are an increasingly relevant data type, but they are often corrupted by noise. We propose a deep neural network based on graph-convolutional layers that can elegantly deal with the permutation-invariance problem encountered by learning-based point cloud processing methods. The network is fully convolutional and can build complex hierarchies of features by dynamically constructing neighborhood graphs from similarity among the high-dimensional feature representations of the points. When coupled with a loss promoting proximity to the ideal surface, the proposed approach significantly outperforms state-of-the-art methods on a variety of metrics. In particular, it is able to improve in terms of Chamfer measure and of quality of the surface normals that can be estimated from the denoised data. We also show that it is especially robust both at high noise levels and in the presence of structured noise such as that encountered in real LiDAR scans
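As a rough illustration of the Chamfer measure mentioned in this abstract, the sketch below computes a symmetric Chamfer distance between two small point sets in plain Python. The abstract does not state which convention the authors use, so the choice of squared Euclidean distances averaged in both directions is an assumption, and the function name `chamfer_distance` is hypothetical.

```python
def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point sets.

    Assumed convention (not taken from the abstract): squared Euclidean
    distance to the nearest neighbour, averaged over each set, with the
    two directions summed.
    """
    def sq_dist(a, b):
        # Squared Euclidean distance between two points of equal dimension.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Average distance from each point in p to its nearest point in q.
    p_to_q = sum(min(sq_dist(a, b) for b in q) for a in p) / len(p)
    # Average distance from each point in q to its nearest point in p.
    q_to_p = sum(min(sq_dist(b, a) for a in p) for b in q) / len(q)
    return p_to_q + q_to_p


# Toy example: a clean two-point cloud and a copy perturbed along z.
clean = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
noisy = [(0.0, 0.0, 0.1), (1.0, 0.0, -0.1)]
print(chamfer_distance(clean, noisy))
```

A lower value means the denoised cloud lies closer to the reference surface samples. This brute-force version costs O(N·M) distance evaluations, so it is only suitable for small examples.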
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
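The difference between the two counting schemes described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's procedure: the function name `cocitation_counts`, the toy data, the choice to count each author pair at most once per citing paper, and the choice to treat co-authors of the same cited work as co-cited are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count how many citing papers co-cite each pair of authors.

    citing_papers: list of reference lists; each reference is a list of
        author names in byline order.
    max_authors: number of leading authors of each cited work to count
        (1 = first-author counting, 5 = first-five-authors counting).
    """
    counts = Counter()
    for references in citing_papers:
        # Collect the counted authors across all works this paper cites.
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Every unordered pair of counted authors is co-cited once
        # by this citing paper (assumption: binary count per paper).
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: two citing papers, each citing two works.
papers = [
    [["A", "B"], ["C"]],
    [["A"], ["C", "D"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(first_author)
print(first_five)
```

With first-author counting only the pair (A, C) appears, while counting up to five authors also brings in B and D, which is the kind of additional data the non-traditional scheme feeds into the subsequent analyses.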
PEM: Prototype-Based Efficient MaskFormer for Image Segmentation
Recent transformer-based architectures have shown impressive results in the field of image segmentation. Thanks to their flexibility, they obtain outstanding performance in multiple segmentation tasks, such as semantic and panoptic, under a single unified framework. To achieve such impressive performance, these architectures employ intensive operations and require substantial computational resources, which are often not available, especially on edge devices. To fill this gap, we propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate in multiple segmentation tasks. PEM proposes a novel prototype-based cross-attention which leverages the redundancy of visual features to restrict the computation and improve the efficiency without harming the performance. In addition, PEM introduces an efficient multi-scale feature pyramid network, capable of extracting features that have high semantic content in an efficient way, thanks to the combination of deformable convolutions and context-based self-modulation. We benchmark the proposed PEM architecture on two tasks, semantic and panoptic segmentation, evaluated on two different datasets, Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset, outperforming task-specific architectures while being comparable to, and even better than, computationally expensive baselines
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives
Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by also enabling reasoning across diverse temporal granularities, expanding its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously
