Origami: A 803-GOp/s/W Convolutional Network Accelerator
An ever-increasing number of computer vision and image/video processing challenges are being approached using deep convolutional neural networks, obtaining state-of-the-art results in object recognition and detection, semantic segmentation, action recognition, optical flow, and super resolution. Hardware acceleration of these algorithms is essential to adopt these improvements in embedded and mobile computer vision systems. We present a new architecture, design, and implementation, as well as the first reported silicon measurements of such an accelerator, outperforming previous work in terms of power, area, and I/O efficiency. The manufactured device provides up to 196 GOp/s on 3.09 mm² of silicon in UMC 65-nm technology and can achieve a power efficiency of 803 GOp/s/W. The massively reduced bandwidth requirements make it the first architecture scalable to TOp/s performance
PULP: Extreme Energy Efficiency for Extreme Edge AI Acceleration
The next wave of pervasive AI pushes machine learning (ML) acceleration toward the extreme edge, with mW power budgets, while at the same time it raises the bar in terms of accuracy and capabilities, with new ML models being proposed on a daily basis. To succeed in this balancing act, we need principled ways to walk the line between flexible and highly specialized ML acceleration architectures. In this talk I will detail how to walk the line, drawing from the experience of the open PULP (Parallel Ultra-Low Power) platform, based on ML-enhanced RISC-V processors coupled with domain-specific acceleration engines
From Nano-Drones to Cars - A RISC-V Open Platform for next-generation Vehicles
The next generation of highly autonomous vehicles, with form factors ranging from tiny palm-sized drones to cars, pushes signal processing and machine learning aggressively towards the edge, near sensors and actuators, with strong energy-efficiency, safety, and security requirements, while at the same time raising the bar in terms of flexibility and performance. To succeed in this balancing act, we need principled ways to walk the line between conflicting non-functional requirements. In the talk, I will describe our experience in leveraging the open RISC-V ISA and open hardware approaches to innovate across the board and pave the way for an open embedded computing platform for autonomous vehicles
Sub-PicoJoule per operation scalable computing
The "internet of everything" envisions trillions of connected objects loaded with high-bandwidth sensors requiring massive amounts of local signal processing, fusion, pattern extraction, and classification. From the computational viewpoint, the challenge is formidable and can be addressed only by pushing computing fabrics toward massive parallelism and brain-like energy efficiency levels. CMOS technology can still take us a long way toward this vision. Our recent results with the open-source PULP (parallel ultra-low power) chips demonstrate that pJ/OP (GOPS/mW) computational efficiency is within reach in today's 28nm CMOS FDSOI technology. In this talk, I will look at the next 1000x of energy efficiency improvement, which will require heterogeneous 3D integration, mixed-signal, approximate processing, and non-von-Neumann architectures for scalable acceleration
Plenty of room at the bottom? Micropower deep learning for cognitive cyber physical systems
Summary form only given. Deep convolutional neural networks are regarded today as an extremely effective and flexible approach for extracting actionable, high-level information from the wealth of raw data produced by a wide variety of sensory data sources. CNNs are, however, computationally demanding: today they typically run on GPU-accelerated compute servers or high-end embedded platforms. Industry and academia are racing to bring CNN inference (first) and training (next) within ever tighter power envelopes, while at the same time meeting real-time requirements. Recent results, including our PULP and ORIGAMI chips, demonstrate there is plenty of room at the bottom: pJ/OP (GOPS/mW) computational efficiency, needed for deploying CNNs in the mobile/wearable scenario, is within reach. However, this is not enough: a 1000x energy efficiency improvement, within a mW power envelope and with low-cost CMOS processes, is required for deploying CNNs in the most demanding CPS scenarios. The fJ/OP milestone will require heterogeneous (3D) integration with ultra-efficient die-to-die communication, mixed-signal pre-processing, and event-based approximate computing, while still meeting real-time requirements
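The pJ/OP and GOPS/mW figures quoted in these abstracts are two views of the same quantity: 1 GOPS/mW equals 10^12 operations per joule, which is 1 pJ per operation. The short sketch below only illustrates that unit arithmetic; the function names are hypothetical and the 803 GOp/s/W input is the Origami figure quoted earlier in this listing.

```python
def gops_per_watt_to_pj_per_op(gops_per_watt: float) -> float:
    """Convert a power efficiency in GOp/s/W to energy per operation in pJ."""
    ops_per_joule = gops_per_watt * 1e9   # (op/s)/W is the same as op/J
    return 1e12 / ops_per_joule           # 1 J = 1e12 pJ


def pj_per_op_to_gops_per_mw(pj_per_op: float) -> float:
    """Convert energy per operation in pJ to efficiency in GOPS/mW."""
    # 1 pJ/op <=> 1e12 op/J <=> 1e9 op/s per 1e-3 W <=> 1 GOPS/mW
    return 1.0 / pj_per_op


if __name__ == "__main__":
    origami_pj = gops_per_watt_to_pj_per_op(803.0)
    print(f"803 GOp/s/W is about {origami_pj:.3f} pJ/op")
    # A further 1000x improvement would land in the fJ/op range.
    print(f"1000x better: about {origami_pj:.3f} fJ/op")
```

Read this way, a 1000x gain over roughly 1 pJ/op is what the abstract calls the fJ/OP milestone.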
Trikarenos: A Fault-Tolerant RISC-V-based Microcontroller for CubeSats in 28nm
One of the key challenges when operating microcontrollers in harsh
environments such as space is radiation-induced Single Event Upsets (SEUs),
which can lead to errors in computation. Common countermeasures rely on
proprietary radiation-hardened technologies, low density technologies, or
extensive replication, leading to high costs and low performance and
efficiency. To combat this, we present Trikarenos, a fault-tolerant 32-bit
RISC-V microcontroller SoC in an advanced TSMC 28nm technology. Trikarenos
alleviates the replication cost by employing a configurable triple-core
lockstep configuration, allowing three Ibex cores to execute applications
reliably, operating on ECC-protected memory. If reliability is not needed for a
given application, the cores can operate independently in parallel for higher
performance and efficiency. Trikarenos consumes 15.7mW at 250MHz executing a
fault-tolerant matrix-matrix multiplication, a 21.5x efficiency gain over
state-of-the-art, and performance is increased by 2.96x when reliability is not
needed for processing, with a 2.36x increase in energy efficiency.
Comment: 4 pages, 4 figures, accepted by IEEE International Conference on Electronics Circuits and Systems (ICECS) 202
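The triple-core lockstep idea can be illustrated with a generic bitwise majority vote over three redundant results. This is a minimal software sketch of triple modular redundancy in general, not Trikarenos's actual hardware voter; the function names and the mismatch flag are assumptions made for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def vote_and_flag(a: int, b: int, c: int) -> tuple[int, bool]:
    """Return the voted value and whether any replica disagreed.

    The mismatch flag is a hypothetical hook for triggering recovery
    (for example, resynchronising the cores); it is not taken from the paper.
    """
    voted = majority_vote(a, b, c)
    mismatch = not (a == b == c)
    return voted, mismatch


if __name__ == "__main__":
    good = 0b1010
    upset = good ^ 0b1000          # single-bit flip in one replica
    value, mismatch = vote_and_flag(good, good, upset)
    print(bin(value), mismatch)
```

A single upset in one replica is outvoted by the other two, which is why the scheme tolerates one faulty core at a time.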
Lightweight virtual memory support for zero-copy sharing of pointer-rich data structures in heterogeneous embedded SoCs
While high-end heterogeneous systems are increasingly supporting heterogeneous uniform memory access (hUMA), their low-power counterparts still lack basic features like virtual memory support for accelerators. Instead of simply passing pointers, explicit data management involving copies is needed, which hampers programmability and performance. In this work, we evaluate a mixed hardware/software solution for lightweight virtual memory support for many-core accelerators in heterogeneous embedded systems-on-chip. Based on an input/output translation lookaside buffer managed by a host kernel-level driver, and compiler extensions protecting the accelerator's accesses to shared data, our solution is non-intrusive to the architecture of the accelerator cores and enables zero-copy sharing of pointer-rich data structures
Energy-efficiency analysis of analog and digital compressive sensing in wireless sensors
Compressive sensing (CS) is a signal acquisition strategy that, based on the assumption of sparsity, promises to relax the design constraints of signal acquisition systems with respect to conventional strategies. In this paper, we contrast signal acquisition systems for low-rate applications based on analog CS encoding with systems based on digital CS encoding. We consider the complete signal chain from acquisition to reconstruction, with particular attention to the effects of quantization, and show that the two schemes differ significantly in encoder precision, measurement resolution, compression ratio, and reconstruction quality. Further, we develop first-order power estimation models to assess the relative energy efficiency of different CS and conventional signal acquisition systems. Our numerical evaluations suggest that when the power consumption of data storage/communication outweighs the power consumption of data acquisition and processing, analog CS systems can outperform their digital counterparts, despite their higher hardware complexity. Moreover, we provide evidence that the common special case of analog and digital encoding, known as the non-uniform sampler, performs best under all conditions
CAS-CNN: A deep convolutional neural network for image compression artifact suppression
Lossy image compression algorithms are pervasively used to reduce the size of images transmitted over the web and recorded on data storage media. However, we pay for their high compression rate with visual artifacts degrading the user experience. Deep convolutional neural networks have become a widespread tool to address high-level computer vision tasks very successfully. Recently, they have found their way into the areas of low-level computer vision and image processing to solve regression problems mostly with relatively shallow networks. We present a novel 12-layer deep convolutional network for image compression artifact suppression with hierarchical skip connections and a multi-scale loss function. We achieve a boost of up to 1.79 dB in PSNR over ordinary JPEG and an improvement of up to 0.36 dB over the best previous ConvNet result. We show that a network trained for a specific quality factor (QF) is resilient to the QF used to compress the input image - a single network trained for QF 60 provides a PSNR gain of more than 1.5 dB over the wide QF range from 40 to 76
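The dB figures above are PSNR gains, that is, differences between the PSNR of the restored image and the PSNR of the plain JPEG output, both measured against the uncompressed original. The sketch below uses the standard PSNR definition for 8-bit samples; the flat-list image representation and the function names are simplifying assumptions, not the paper's evaluation code.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], degraded: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(degraded) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_gain(original: Sequence[float], compressed: Sequence[float],
              restored: Sequence[float]) -> float:
    """PSNR improvement (dB) of the restored image over the compressed one."""
    return psnr(original, restored) - psnr(original, compressed)
```

A positive `psnr_gain` means the restored image is closer to the original than the compressed input was.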
Accelerating real-time embedded scene labeling with convolutional networks
Today there is a clear trend towards deploying advanced computer vision (CV) systems in a growing number of application scenarios with strong real-time and power constraints. Brain-inspired algorithms capable of achieving record-breaking results combined with embedded vision systems are the best candidate for the future of CV and video systems due to their flexibility and high accuracy in the area of image understanding. In this paper, we present an optimized convolutional network implementation suitable for real-time scene labeling on embedded platforms. We show that our algorithm can achieve up to 96 GOp/s, running on the Nvidia Tegra K1 embedded SoC. We present experimental results, compare them to the state-of-the-art, and demonstrate that for scene labeling our approach achieves a 1.5x improvement in throughput when compared to a modern desktop CPU at a power budget of only 11 W
