Origami: A 803-GOp/s/W Convolutional Network Accelerator

Author: Benini Luca
Cavigelli Lukas
Publication date: 01/01/2017

An ever-increasing number of computer vision and image/video processing challenges are being approached using deep convolutional neural networks, obtaining state-of-the-art results in object recognition and detection, semantic segmentation, action recognition, optical flow, and super resolution. Hardware acceleration of these algorithms is essential to adopt these improvements in embedded and mobile computer vision systems. We present a new architecture, design, and implementation, as well as the first reported silicon measurements of such an accelerator, outperforming previous work in terms of power, area, and I/O efficiency. The manufactured device provides up to 196 GOp/s on 3.09 mm² of silicon in UMC 65-nm technology and can achieve a power efficiency of 803 GOp/s/W. The massively reduced bandwidth requirements make it the first architecture scalable to TOp/s performance.

ETHzürich Repository for Publications and Research Data

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

PULP: Extreme Energy Efficiency for Extreme Edge AI Acceleration

Author: Benini Luca
Publication date: 01/01/2022

The next wave of pervasive AI pushes machine learning (ML) acceleration toward the extreme edge, with mW power budgets, while at the same time it raises the bar in terms of accuracy and capabilities, with new ML models being proposed on a daily basis. To succeed in this balancing act, we need principled ways to walk the line between flexible and highly specialized ML acceleration architectures. In this talk I will detail how to walk the line, drawing from the experience of the open PULP (Parallel Ultra-Low Power) platform, based on ML-enhanced RISC-V processors coupled with domain-specific acceleration engines.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

From Nano-Drones to Cars - A RISC-V Open Platform for next-generation Vehicles

Author: Benini Luca
Publication date: 01/01/2023

The next generation of highly autonomous vehicles, with form factors ranging from tiny palm-sized drones to cars, pushes signal processing and machine learning aggressively towards the edge, near sensors and actuators, with strong energy-efficiency, safety, and security requirements, while at the same time raising the bar in terms of flexibility and performance. To succeed in this balancing act, we need principled ways to walk the line between conflicting non-functional requirements. In the talk, I will describe our experience in leveraging the open RISC-V ISA and open hardware approaches to innovate across the board and pave the way for an open embedded computing platform for autonomous vehicles.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Sub-PicoJoule per operation scalable computing

Author: Benini Luca
Publication date: 01/01/2016

The "internet of everything" envisions trillions of connected objects loaded with high-bandwidth sensors requiring massive amounts of local signal processing, fusion, pattern extraction, and classification. From the computational viewpoint, the challenge is formidable and can be addressed only by pushing computing fabrics toward massive parallelism and brain-like energy efficiency levels. CMOS technology can still take us a long way toward this vision. Our recent results with the open-source PULP (parallel ultra-low power) chips demonstrate that pJ/OP (GOPS/mW) computational efficiency is within reach in today's 28nm CMOS FDSOI technology. In this talk, I will look at the next 1000x of energy efficiency improvement, which will require heterogeneous 3D integration, mixed-signal, approximate processing, and non-Von-Neumann architectures for scalable acceleration.
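
The abstract treats pJ/OP and GOPS/mW as the same efficiency level. A short worked conversion, added here as an illustration and not taken from the talk itself, shows why the two units coincide. The symbol $E_{\mathrm{op}}$ (energy per operation) is introduced only for this sketch.

```latex
\[
1\,\frac{\mathrm{GOPS}}{\mathrm{mW}}
  = \frac{10^{9}\ \mathrm{op/s}}{10^{-3}\ \mathrm{J/s}}
  = 10^{12}\ \frac{\mathrm{op}}{\mathrm{J}}
\quad\Longrightarrow\quad
E_{\mathrm{op}} = \frac{1}{10^{12}\ \mathrm{op/J}} = 10^{-12}\ \frac{\mathrm{J}}{\mathrm{op}} = 1\ \mathrm{pJ/op}.
\]
```

By the same arithmetic, a further 1000x improvement corresponds to 1 fJ/op, or 1 TOPS/mW.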

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Plenty of room at the bottom? Micropower deep learning for cognitive cyber physical systems

Author: Benini Luca
Publication date: 01/01/2017

Summary form only given. Deep convolutional neural networks are regarded today as an extremely effective and flexible approach for extracting actionable, high-level information from the wealth of raw data produced by a wide variety of sensory data sources. CNNs are, however, computationally demanding: today they typically run on GPU-accelerated compute servers or high-end embedded platforms. Industry and academia are racing to bring CNN inference (first) and training (next) within ever tighter power envelopes, while at the same time meeting real-time requirements. Recent results, including our PULP and ORIGAMI chips, demonstrate there is plenty of room at the bottom: pJ/OP (GOPS/mW) computational efficiency, needed for deploying CNNs in the mobile/wearable scenario, is within reach. However, this is not enough: 1000x energy efficiency improvement, within a mW power envelope and with low-cost CMOS processes, is required for deploying CNNs in the most demanding CPS scenarios. The fJ/OP milestone will require heterogeneous (3D) integration with ultra-efficient die-to-die communication, mixed-signal pre-processing, and event-based approximate computing, while still meeting real-time requirements.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Trikarenos: A Fault-Tolerant RISC-V-based Microcontroller for CubeSats in 28nm

Author: Benini Luca
Rogenmoser Michael
Publication date: 01/01/2023

One of the key challenges when operating microcontrollers in harsh environments such as space is radiation-induced Single Event Upsets (SEUs), which can lead to errors in computation. Common countermeasures rely on proprietary radiation-hardened technologies, low density technologies, or extensive replication, leading to high costs and low performance and efficiency. To combat this, we present Trikarenos, a fault-tolerant 32-bit RISC-V microcontroller SoC in an advanced TSMC 28nm technology. Trikarenos alleviates the replication cost by employing a configurable triple-core lockstep configuration, allowing three Ibex cores to execute applications reliably, operating on ECC-protected memory. If reliability is not needed for a given application, the cores can operate independently in parallel for higher performance and efficiency. Trikarenos consumes 15.7mW at 250MHz executing a fault-tolerant matrix-matrix multiplication, a 21.5x efficiency gain over state-of-the-art, and performance is increased by 2.96x when reliability is not needed for processing, with a 2.36x increase in energy efficiency.

Comment: 4 pages, 4 figures, accepted by IEEE International Conference on Electronics Circuits and Systems (ICECS) 202

arXiv.org e-Print Archive

ETHzürich Repository for Publications and Research Data

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Lightweight virtual memory support for zero-copy sharing of pointer-rich data structures in heterogeneous embedded SoCs

Author: Vogel Pirmin
Benini Luca
Marongiu Andrea
Publication date: 01/01/2017

While high-end heterogeneous systems are increasingly supporting heterogeneous uniform memory access (hUMA), their low-power counterparts still lack basic features like virtual memory support for accelerators. Instead of simply passing pointers, explicit data management involving copies is needed, which hampers programmability and performance. In this work, we evaluate a mixed hardware/software solution for lightweight virtual memory support for many-core accelerators in heterogeneous embedded systems-on-chip. Based on an input/output translation lookaside buffer managed by a host kernel-level driver, and compiler extensions protecting the accelerator's accesses to shared data, our solution is non-intrusive to the architecture of the accelerator cores and enables zero-copy sharing of pointer-rich data structures.

ETHzürich Repository for Publications and Research Data

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Energy-efficiency analysis of analog and digital compressive sensing in wireless sensors

Author: Bellasi David E
Benini Luca
Publication date: 01/01/2015

Compressive sensing (CS) is a signal acquisition strategy that, based on the assumption of sparsity, promises to relax the design constraints of signal acquisition systems with respect to conventional strategies. In this paper, we contrast signal acquisition systems for low-rate applications based on analog CS encoding with systems based on digital CS encoding. We consider the complete signal chain from acquisition to reconstruction, with particular attention to the effects of quantization, and show that the two schemes differ significantly in encoder precision, measurement resolution, compression ratio, and reconstruction quality. Further, we develop first-order power estimation models to assess the relative energy-efficiency of different CS and conventional signal acquisition systems. Our numerical evaluations suggest that when the power consumption of data storage/communication outweighs the power consumption of data acquisition and processing, analog CS systems can outperform their digital counterparts, despite their higher hardware complexity. Moreover, we provide evidence that the common special case of analog and digital encoding, known as non-uniform sampler, performs best under all conditions.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

CAS-CNN: A deep convolutional neural network for image compression artifact suppression

Author: Hager Pascal
Benini Luca
Cavigelli Lukas
Publication date: 01/01/2017

Lossy image compression algorithms are pervasively used to reduce the size of images transmitted over the web and recorded on data storage media. However, we pay for their high compression rate with visual artifacts degrading the user experience. Deep convolutional neural networks have become a widespread tool to address high-level computer vision tasks very successfully. Recently, they have found their way into the areas of low-level computer vision and image processing to solve regression problems, mostly with relatively shallow networks. We present a novel 12-layer deep convolutional network for image compression artifact suppression with hierarchical skip connections and a multi-scale loss function. We achieve a boost of up to 1.79 dB in PSNR over ordinary JPEG and an improvement of up to 0.36 dB over the best previous ConvNet result. We show that a network trained for a specific quality factor (QF) is resilient to the QF used to compress the input image: a single network trained for QF 60 provides a PSNR gain of more than 1.5 dB over the wide QF range from 40 to 76.

ETHzürich Repository for Publications and Research Data

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Accelerating real-time embedded scene labeling with convolutional networks

Author: Benini Luca
Cavigelli Lukas
Magno Michele
Publication date: 01/01/2015

Today there is a clear trend towards deploying advanced computer vision (CV) systems in a growing number of application scenarios with strong real-time and power constraints. Brain-inspired algorithms capable of achieving record-breaking results combined with embedded vision systems are the best candidates for the future of CV and video systems due to their flexibility and high accuracy in the area of image understanding. In this paper, we present an optimized convolutional network implementation suitable for real-time scene labeling on embedded platforms. We show that our algorithm can achieve up to 96 GOp/s, running on the Nvidia Tegra K1 embedded SoC. We present experimental results, compare them to the state-of-the-art, and demonstrate that for scene labeling our approach achieves a 1.5x improvement in throughput when compared to a modern desktop CPU at a power budget of only 11 W.

ETHzürich Repository for Publications and Research Data

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna