    FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things

    The growing number of low-power smart devices in the Internet of Things is coupled with the concept of 'edge computing' that is moving some of the intelligence, especially machine learning, toward the edge of the network. Enabling machine learning algorithms to run on resource-constrained hardware, typically on low-power smart devices, is challenging in terms of hardware (optimized and energy-efficient integrated circuits), algorithmic, and firmware implementations. This article presents FANN-on-MCU, an open-source toolkit built upon the fast artificial neural network (FANN) library to run lightweight and energy-efficient neural networks on microcontrollers based on both the ARM Cortex-M series and the novel RISC-V-based parallel ultralow-power (PULP) platform. The toolkit takes multilayer perceptrons trained with FANN and generates code targeted to low-power microcontrollers. This article also presents detailed analyses of energy efficiency across the different cores, and the optimizations to handle different network sizes. Moreover, it provides a detailed analysis of parallel speedups and degradations due to parallelization overhead and memory transfers. Further evaluations include experimental results for three different applications using a self-sustainable wearable multisensor bracelet. The experimental results show a measured latency on the order of only a few microseconds and power consumption of a few milliwatts while keeping the memory requirements below the limitations of the targeted microcontrollers. In particular, the parallel implementation on the octa-core RISC-V platform reaches a speedup of 22× and a 69% reduction in energy consumption with respect to a single-core implementation on Cortex-M4 for continuous real-time classification

    Extended Bit-Plane Compression for Convolutional Neural Network Accelerators

    After the tremendous success of convolutional neural networks in image classification, object detection, speech recognition, etc., there is now rising demand for deployment of these compute-intensive ML models on tightly power-constrained embedded and mobile systems at low cost as well as for pushing the throughput in data centers. This has triggered a wave of research towards specialized hardware accelerators. Their performance is often constrained by I/O bandwidth and the energy consumption is dominated by I/O transfers to off-chip memory. We introduce and evaluate a novel, hardware-friendly compression scheme for the feature maps present within convolutional neural networks. We show that an average compression ratio of 4.4× relative to uncompressed data and a gain of 60% over existing methods can be achieved for ResNet-34 with a compression block requiring <300 bits of sequential cells and minimal combinational logic

    Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine

    Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute and memory intensive, which makes them unsuitable for mW-devices such as IoT end-nodes. Aggressive quantization of these networks dramatically reduces the computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding I/O bandwidth and system-level efficiency, which are crucial for the deployment of accelerators in ultra-low power devices. We present Hyperdrive: a BWN accelerator that dramatically reduces the I/O bandwidth by exploiting a novel binary-weight streaming approach. It can be used for an arbitrarily sized convolutional neural network architecture and input resolution by exploiting the natural scalability of the compute units both at chip-level and system-level, arranging Hyperdrive chips systolically in a 2D mesh while processing the entire feature map together in parallel. Hyperdrive achieves 4.3 TOp/s/W system-level efficiency (i.e., including I/Os), 3.1× higher than state-of-the-art BWN accelerators, even though its core uses resource-intensive FP16 arithmetic for increased robustness

    Ara2: Exploring Single- and Multi-Core Vector Processing With an Efficient RVV 1.0 Compliant Open-Source Processor

    Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: ∼40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency

    ChewBaccaNN: A flexible 223 TOPS/W BNN accelerator

    Binary Neural Networks enable smart IoT devices, as they significantly reduce the required memory footprint and computational complexity while retaining a high network performance and flexibility. This paper presents ChewBaccaNN, a 0.7 mm² sized binary convolutional neural network (CNN) accelerator designed in GlobalFoundries 22 nm technology. By exploiting efficient data re-use, data buffering, latch-based memories, and voltage scaling, a throughput of 241 GOPS is achieved while consuming just 1.1 mW at 0.4V/154MHz during inference of binary CNNs with up to 7×7 kernels, leading to a peak core energy efficiency of 223 TOPS/W. ChewBaccaNN's flexibility allows it to run a much wider range of binary CNNs than other accelerators, drastically improving the accuracy-energy tradeoff beyond what can be captured by the TOPS/W metric. In fact, it can perform CIFAR-10 inference at 86.8% accuracy with merely 1.3 μJ, thus exceeding the accuracy while at the same time lowering the energy cost by 2.8× compared to even the most efficient and much larger analog processing-in-memory devices, while keeping the flexibility of running larger CNNs for higher accuracy when needed. It also runs a binary ResNet-18 trained on the 1000-class ILSVRC dataset and improves the energy efficiency by 4.4× over accelerators of similar flexibility. Furthermore, it can perform inference on a binarized ResNet-18 trained with 8-bases Group-Net to achieve a 67.5% Top-1 accuracy with only 3.0 mJ/frame, at an accuracy drop of merely 1.8% from the full-precision ResNet-18

    Laelaps: An Energy-Efficient Seizure Detection Algorithm from Long-term Human iEEG Recordings without False Alarms

    We propose Laelaps, an energy-efficient and fast learning algorithm with no false alarms for epileptic seizure detection from long-term intracranial electroencephalography (iEEG) signals. Laelaps uses end-to-end binary operations by exploiting symbolic dynamics and brain-inspired hyperdimensional computing. Laelaps's results surpass those yielded by state-of-the-art (SoA) methods [1], [2], [3], including deep learning, on a new very large dataset containing 116 seizures of 18 drug-resistant epilepsy patients in 2656 hours of recordings - each patient implanted with 24 to 128 iEEG electrodes. Laelaps trains 18 patient-specific models by using only 24 seizures: 12 models are trained with one seizure per patient, the others with two seizures. The trained models detect 79 out of 92 unseen seizures without any false alarms across all the patients as a big step forward in practical seizure detection. Importantly, a simple implementation of Laelaps on the Nvidia Tegra X2 embedded device achieves 1.7×-3.9× faster execution and 1.4×-2.9× lower energy consumption compared to the best result from the SoA methods. Our source code and anonymized iEEG dataset are freely available at http://ieeg-swez.ethz.ch

    EBPC: Extended Bit-Plane Compression for Deep Neural Network Inference and Training Accelerators

    In the wake of the success of convolutional neural networks in image classification, object recognition, speech recognition, etc., the demand for deploying these compute-intensive ML models on embedded and mobile systems with tight power and energy constraints at low cost, as well as for boosting throughput in data centers, is growing rapidly. This has sparked a surge of research into specialized hardware accelerators. Their performance is typically limited by I/O bandwidth, power consumption is dominated by I/O transfers to off-chip memory, and on-chip memories occupy a large part of the silicon area. We introduce and evaluate a novel, hardware-friendly, and lossless compression scheme for the feature maps present within convolutional neural networks. We present hardware architectures and synthesis results for the compressor and decompressor in 65 nm. With a throughput of one 8-bit word/cycle at 600 MHz, they fit into 2.8 kGE and 3.0 kGE of silicon area, respectively - together the size of less than seven 8-bit multiply-add units at the same throughput. We show that an average compression ratio of 5.1× for AlexNet, 4× for VGG-16, 2.4× for ResNet-34 and 2.2× for MobileNetV2 can be achieved - a gain of 45-70% over existing methods. Our approach also works effectively for various number formats, has a low frame-to-frame variance on the compression ratio, and achieves compression factors for gradient map compression during training that are even better than for inference

    ECG-TCN: Wearable Cardiac Arrhythmia Detection with a Temporal Convolutional Network

    Personalized ubiquitous healthcare solutions require energy-efficient wearable platforms that provide an accurate classification of bio-signals while consuming low average power for long-term battery-operated use. Single lead electrocardiogram (ECG) signals provide the ability to detect, classify, and even predict cardiac arrhythmia. In this paper we propose a novel temporal convolutional network (TCN) that achieves high accuracy while still being feasible for wearable platform use. Experimental results on the ECG5000 dataset show that the TCN has a similar accuracy (94.2%) score as the state-of-the-art (SoA) network while achieving an improvement of 16.5% in the balanced accuracy score. This accurate classification is done with 27× fewer parameters and 37× fewer multiply-accumulate operations. We test our implementation on two publicly available platforms, the STM32L475, which is based on ARM Cortex M4F, and the GreenWaves Technologies GAP8 on the GAPuino board, based on 1+8 RISC-V CV32E40P cores. Measurements show that the GAP8 implementation respects the real-time constraints while consuming 0.10 mJ per inference. With 9.91 GMAC/s/W, it is 23.0× more energy-efficient and 46.85× faster than an implementation on the ARM Cortex M4F (0.43 GMAC/s/W). Overall, we obtain 8.1% higher accuracy while consuming 19.6× less energy and being 35.1× faster compared to a previous SoA embedded implementation

    HR-SAR-Net: A Deep Neural Network for Urban Scene Segmentation from High-Resolution SAR Data

    Synthetic aperture radar (SAR) data is becoming increasingly available to a wide range of users through commercial service providers with resolutions reaching 0.5 m/px. Segmenting SAR data still requires skilled personnel, limiting the potential for large-scale use. We show that it is possible to automatically and reliably perform urban scene segmentation from next-gen resolution SAR data (0.15 m/px) using deep neural networks (DNNs), achieving a pixel accuracy of 95.19% and a mean intersection-over-union (mIoU) of 74.67% with data collected over a region of merely 2.2 km². The presented DNN is not only effective, but is very small with only 63k parameters and computationally simple enough to achieve a throughput of around 500 Mpx/s using a single GPU. We further identify that additional SAR receive antennas and data from multiple flights massively improve the segmentation accuracy. We describe a procedure for generating a high-quality segmentation ground truth from multiple inaccurate building and road annotations, which has been crucial to achieving these segmentation results