1,720,971 research outputs found
Contextual Convolutions for Scalable Forward-Only Learning on Tiny Devices
On-device training on resource-constrained hardware, such as microcontrollers with limited memory and fixed-function convolutional accelerators, remains an open challenge in embedded computer vision. Standard backpropagation is often impractical due to its high memory requirements and reliance on operations unsupported by typical inference-optimized accelerators. Recent forward-only learning methods, such as Forward-Forward and PEPITA, offer lightweight alternatives by eliminating the backward pass, enabling training on ultra-low-power devices. However, these methods tend to degrade in performance on more complex tasks involving deeper networks and larger output spaces. In this work, we introduce the Contextual Convolution Block, a novel architectural module that enhances the representational capacity of forward-only networks by injecting ground-truth class information during training. This allows the network to specialize convolutional kernels for specific classes without relying on gradients or weight transport. We further present an optimized implementation of this block using an im2col-based formulation, enabling efficient training on severely constrained devices. Our method significantly improves the scalability of forward-only training approaches, achieving stronger performance on complex classification tasks while preserving compatibility with embedded hardware limitations.
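The abstract mentions an im2col-based formulation without describing it. As general background only, the following is a minimal pure-Python sketch of the standard im2col idea: each kernel-sized patch is flattened into a row, so the convolution reduces to dot products against the flattened kernel. It is not the authors' implementation. The function names `im2col` and `conv2d_im2col`, the single-channel input, stride 1, and the absence of padding are all assumptions made for illustration.

```python
def im2col(image, kh, kw):
    """Flatten every kh x kw patch of a 2D list into one row.

    Assumes a single channel, stride 1, and no padding (illustrative only).
    """
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append(
                [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            )
    return rows


def conv2d_im2col(image, kernel):
    """Cross-correlate image with kernel via the im2col patch matrix."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    # One dot product per patch replaces the nested sliding-window loops.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    out_w = len(image[0]) - kw + 1
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_im2col(image, kernel))  # → [[6, 8], [12, 14]]
```

On real hardware the patch matrix and the kernel would typically be multiplied with an optimized matrix routine; the list comprehension here only stands in for that step.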
XiNet: Efficient Neural Networks for tinyML
The recent interest in the edge-to-cloud continuum paradigm has emphasized the need for simple and scalable architectures to deliver optimal performance on computationally constrained devices. However, resource-efficient neural networks usually optimize for parameter count and thus use operators such as depthwise convolutions, which do not maximally exploit the efficiency of resource-constrained devices. In this article, we propose XiNet, a novel convolutional neural architecture that targets edge devices. We derived the XiNet architecture from an extensive real-world efficiency analysis of various neural network operators (e.g., standard, depthwise, and pointwise convolutions). Compared to other mobile architectures, our approach substantially improves the performance-complexity trade-off by optimizing the number of operations, parameters, and working memory (RAM). Moreover, we show how XiNet can be easily adapted to different devices thanks to Hardware Aware Scaling (HAS), which enables disjoint optimization of RAM, FLASH, and operation count. We analyze the scaling properties of our architecture under different hardware constraints and validate it on the image classification task. Finally, we evaluate the performance of XiNet for object detection on the MS-COCO and VOC-2012 benchmarks and compare it with state-of-the-art mobile neural networks, achieving a 70% reduction in energy requirements with similar performance.
Linear Transformers beat YOLO for Embedded Object Detection
Vision transformers (ViTs) have recently become the go-to standard for solving various computer vision tasks due to their superior performance and generalization capabilities. However, these architectures are difficult to use on embedded and heavily resource-constrained devices for two main reasons: their high memory requirements and the use of complex operators seldom supported by embedded inference pipelines. Meanwhile, in embedded environments, it is still common to use older architectures that perform worse but offer reduced memory consumption and higher compatibility with embedded runtimes, which usually support only a limited number of operators. In this paper, we present a neural architecture based on a novel linear transformer block capable of bridging the gap between the performance achieved by modern computer vision models and the broader support offered by architectures currently used in embedded environments. We also propose a solution for one-shot scaling of our architecture, called Hardware-Aware Scaling. This approach allows us to develop architectures tailored to embedded devices with different computational resources without requiring a lengthy network architecture search or manual architecture tuning. We tested our architecture on an object detection task and achieved performance comparable to recent versions of YOLO, with lower latency and parameter count while maximizing compatibility.
TinyVocos: Neural Vocoders on MCUs
Neural vocoders convert time-frequency representations, such as mel-spectrograms, into corresponding time-domain representations. Vocoders are essential for generative applications in audio (e.g., text-to-speech and text-to-audio). This paper presents a scalable vocoder architecture for small-footprint edge devices, inspired by Vocos and adapted with XiNets and PhiNets. We test the developed model capabilities qualitatively and quantitatively on single-speaker and multi-speaker datasets and benchmark inference speed and memory consumption on four microcontrollers. Additionally, we study the power consumption on an ARM Cortex-M7-powered board. Our results demonstrate the feasibility of deploying neural vocoders on resource-constrained edge devices, potentially enabling new applications in Internet of Sounds (IoS) and Embedded Audio scenarios. Our best-performing model achieves a mean opinion score (MOS) of 3.95/5 while utilizing 1.5 MiB of FLASH and 517 KiB of RAM and consuming 252 mW during inference on a 1 s audio clip.
Improving latency performance trade-off in keyword spotting applications at the edge
Keyword Spotting (KWS) is useful in many innovative ambient intelligence applications, such as smart cities and home automation. While solving KWS on GP/GPUs has become a trivial task in recent years, many benefits arise when KWS applications run at the edge (e.g., privacy by design and infrastructure sustainability), where resources are limited. Hardware-aware scaling (HAS) is a novel paradigm that brings neural architectures to low-resource platforms. With HAS, it is possible to optimize neural architectures to fit on embedded platforms (e.g., microcontrollers) while maximizing the performance-complexity tradeoff and the performance-latency tradeoff. This paper shows how HAS, coupled with a neural network with appropriate scaling capabilities, can outperform architectures designed with neural architecture search techniques, such as MCUNet. Our method achieves 94.5% accuracy when classifying the 35 keywords in Google Speech Commands v2, with only 70 ms of latency and overall energy consumption of less than 10 mJ.
PhiNets: a scalable backbone for low-power AI at the edge
In the Internet of Things era, where we see many interconnected and heterogeneous mobile and fixed smart devices, distributing the intelligence from the cloud to the edge has become a necessity. Due to limited computational and communication capabilities, low memory, and limited energy budget, bringing artificial intelligence algorithms to peripheral devices, such as end-nodes of a sensor network, is a challenging task and requires the design of innovative solutions. In this work, we present PhiNets, a new scalable backbone optimized for deep-learning-based image processing on resource-constrained platforms. PhiNets are based on inverted residual blocks specifically designed to decouple the computational cost, working memory, and parameter memory, thus exploiting all available resources for a given platform. With a YoloV2 detection head and Simple Online and Realtime Tracking, the proposed architecture achieves state-of-the-art results in (i) detection on the COCO and VOC2012 benchmarks, and (ii) tracking on the MOT15 benchmark. PhiNets obtain a reduction in parameter count of around 90% with respect to previous state-of-the-art models (EfficientNetv1, MobileNetv2) and achieve better performance with lower computational cost. Moreover, we demonstrate our approach on a prototype node based on an STM32H743 microcontroller (MCU) with 2 MB of internal Flash and 1 MB of RAM and achieve power requirements on the order of 10 mW. The code for the PhiNets is publicly available on GitHub.
XimSwap: many-to-many face swapping for TinyML
The unprecedented development of deep learning approaches for video processing has caused growing privacy concerns. To enable data analysis while maintaining privacy, it is essential to address how to protect individuals’ identities. One solution is to anonymize data at the source, avoiding the transmission or storage of information that could lead to identification. This study introduces XimSwap, a novel deep learning technique for real-time video anonymization, which can remove facial identification features directly on edge devices with minimal computational resources. Our approach offers a comprehensive solution that guarantees privacy by design. This novel method for implementing face-swapping ensures that the pose and expression of a target face remain unchanged and can be used on embedded devices with very limited computational resources. By incorporating style transfer layers into convolutional ones and optimizing the network’s operation, we achieved a reduction of over 98% in the required operations and parameters compared to state-of-the-art architectures. Our approach also significantly reduces RAM usage, making it possible to implement the anonymization process on tiny edge devices, including microcontrollers, such as the STM32H743.
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
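The difference between the two counting schemes can be illustrated with a small Python sketch. This is a simplified reading of the abstract, not the study's actual procedure. The function name `cocitation_counts`, the `max_authors` parameter, and the toy author names are hypothetical. The sketch also assumes that two authors are co-cited once per citing paper whenever both appear among the counted authors of that paper's references, which means co-authors of the same cited work also form a pair.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations across citing papers.

    citing_papers: list of reference lists; each reference is an ordered
    list of author names. max_authors=1 mimics first-author counting,
    max_authors=5 mimics counting the first 5 authors (assumed reading).
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered author pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Toy data: two citing papers, each citing two works (hypothetical names).
papers = [
    [["Smith", "Lee"], ["Chen", "Lee"]],
    [["Smith"], ["Chen", "Park"]],
]

first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)

print(first_only[("Chen", "Smith")])  # → 2
print(first_only[("Lee", "Smith")])   # → 0
print(first_five[("Lee", "Smith")])   # → 1
```

In this toy data, first-author counting never sees Lee or Park, while counting up to five authors brings them into the co-citation matrix.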
