
    Energy-Efficient Radix-4 Belief Propagation Polar Code Decoding Using an Efficient Sign-Magnitude Adder and Clock Gating

    Polar encoding is the first information coding method that has been proven to achieve channel capacity for binary-input discrete memoryless channels. Since its introduction, much research has been done on improving decoding performance, execution time and energy efficiency. Classic belief propagation uses radix-2 decoding, but a recent study proposed radix-4 decoding, which reduces memory usage by 50%. A drawback, however, is its higher computational complexity, which negatively impacts energy usage and throughput. In this paper we present an energy-efficient radix-4 belief propagation polar decoder architecture that uses a new sign-magnitude adder that does not require conversion to two's complement and back. In addition, we propose clock gating of input values by checking whether all R inputs of the decoder are zero. These two key contributions lead to a more energy-efficient design that is smaller and has a higher maximum clock speed and throughput. Post-layout simulation results show that, compared to the previously proposed 1024-bit radix-4 belief propagation polar code decoder, our decoder uses between 30.22% and 32.80% less power and is 5.2% smaller at the same clock speed. Our design can also achieve a 15.7% higher clock speed, at which it is still up to 10.76% more power efficient and 4.8% smaller.
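The abstract above does not describe the internals of the proposed adder. As a rough illustration of what adding in sign-magnitude form without a detour through two's complement means, here is a minimal behavioural sketch. The function name `sm_add`, the `(sign, magnitude)` tuple encoding, the 8-bit default width and the saturation on overflow are all assumptions made for this example, not details from the paper.

```python
def sm_add(a, b, width=8):
    """Add two sign-magnitude numbers given as (sign, magnitude) tuples.

    sign is 0 for positive and 1 for negative. The result saturates at the
    largest magnitude that fits in `width - 1` bits (an assumed policy).
    """
    (sign_a, mag_a), (sign_b, mag_b) = a, b
    max_mag = (1 << (width - 1)) - 1

    if sign_a == sign_b:
        # Same sign: add magnitudes, keep the sign, saturate on overflow.
        return sign_a, min(mag_a + mag_b, max_mag)

    # Different signs: subtract the smaller magnitude from the larger one;
    # the result takes the sign of the operand with the larger magnitude.
    if mag_a >= mag_b:
        sign, mag = sign_a, mag_a - mag_b
    else:
        sign, mag = sign_b, mag_b - mag_a

    # Normalise negative zero to positive zero.
    return (0, 0) if mag == 0 else (sign, mag)


print(sm_add((0, 5), (1, 3)))   # +5 + (-3)
print(sm_add((1, 5), (0, 3)))   # -5 + (+3)
```

A hardware implementation would realise the compare-and-subtract path with dedicated logic; the Python version only mirrors the arithmetic behaviour.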

    Analysis of Graph Processing in Reconfigurable Devices for Edge Computing Applications

    Graph processing has received significant attention in recent years due to the substantial expansion of industries relying on data analytics. Alongside its vital role in finding relations in social networks, graph processing is also widely used in transportation to find optimal routes and in biological networks to analyse sequences. The main bottleneck in graph processing is irregular memory access rather than computational intensity. Since computational intensity is not a driving factor, we propose a method to perform graph processing at the edge more efficiently. We believe current cloud computing solutions are still very costly and have latency issues. The results demonstrate the benefits of a dedicated sparse graph processing algorithm compared with dense graph processing when analysing data with low density. As graph datasets grow exponentially, traversal algorithms such as breadth-first search (BFS), fundamental to many graph processing applications and metrics, become more costly to compute. Our work reviews other implementations of breadth-first search designed for low-power systems and proposes a solution that uses advanced enhancements to achieve up to 9.2x better performance in terms of millions of traversed edges per second (MTEPS) compared to other state-of-the-art solutions, with a power usage of 2.32 W.

    Partial Evaluation in Junction Trees

    One prominent method to perform inference on probabilistic graphical models is the probability propagation in trees of clusters (PPTC) algorithm. In this paper, we demonstrate the use of partial evaluation, an established technique from the compiler domain, to improve the performance of online Bayesian inference using the PPTC algorithm in the context of observed evidence. We present a metaprogramming-based method to transform a base program into an optimized version by precomputing the static input at compile time while guaranteeing behavioral equivalence. We achieve an inference time reduction of 21% on average for the Promedas benchmark.

    CELR: Cloud Enhanced Local Reconstruction from low-dose sparse Scanning Electron Microscopy images

    Current Scanning Electron Microscopy (SEM) acquisition techniques are far too slow to capture large volumes in a feasible time. One solution is to use low-dose and sparse imaging. By computationally denoising and inpainting, an image of acceptable quality can be approximated. This approach, however, requires significant compute resources. Therefore, this paper proposes CELR, a framework that hides the computationally expensive workload of reconstructing low-dose sparse SEM images, such that (delayed) live reconstruction is possible. Live reconstruction is made possible by Convolutional Neural Networks (CNNs) that approximate a classical reconstruction algorithm such as GOAL. The reconstruction by CNNs is done locally, while recurring training of the CNNs is done in the cloud. Moreover, training labels are generated by GOAL in the cloud. In addition to the framework, this paper evaluates and optimizes the CNN reconstruction throughput by employing Nvidia's TensorRT. This paper also touches upon open research questions about on-the-fly CNN training. The combination of CELR and TensorRT enables large volume acquisitions with a dwell time of 1 μs and 10% pixel coverage to be reconstructed on a single GPU.

    Quantization: how far should we go?

    Machine learning, and specifically Deep Neural Networks (DNNs), impacts all parts of daily life. Although DNNs can be large and compute intensive, requiring processing on big servers (such as in the cloud), we see a move of DNNs into IoT edge-based systems, adding intelligence to these systems. These systems are often energy constrained and too small to satisfy the huge computation and memory demands of DNNs. DNN model quantization may come to the rescue. Instead of 32-bit floating-point numbers, much smaller formats can be used, down to 1-bit binary numbers. Although this may largely solve the compute and memory problems, it comes at a high price: reduced model accuracy. This problem spawned a lot of research into model repair methods, especially for binary neural networks (BNNs). Heavy quantization triggers a lot of debate; we even see some movement back to higher precision using brainfloats. This paper therefore evaluates the trade-off between energy reduction through extreme quantization and accuracy loss. The evaluation is based on ResNet-18 with the ImageNet dataset, mapped to BrainTTA, a fully programmable architecture with special support for 8-bit and 1-bit deep learning. We show that, after applying repair methods, the use of extremely quantized DNNs makes sense. They have superior energy efficiency compared to DNNs based on 8-bit precision of weights and data, while having only slightly lower accuracy. There is still an accuracy gap, requiring further research, but the results are promising. A side effect of the much lower energy requirements of BNNs is that external DRAM becomes more dominant. This certainly requires further attention.
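The two precision levels compared in the abstract can be illustrated with a small sketch. The schemes below (symmetric per-tensor 8-bit quantization, and sign binarization with a mean-absolute-value scale) are common textbook choices assumed for this example; the abstract does not say which quantizers or repair methods the paper uses.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: w is approximated by q * scale, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def binarize(weights):
    """1-bit quantization: w is approximated by alpha * sign(w)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, alpha


weights = [0.5, -1.0, 0.25, 0.0]
q8, scale = quantize_int8(weights)
signs, alpha = binarize(weights)

# Reconstruction error shows the accuracy cost of each format.
err8 = max(abs(q * scale - w) for q, w in zip(q8, weights))
err1 = max(abs(s * alpha - w) for s, w in zip(signs, weights))
print(err8, err1)
```

The binary format stores one bit per weight plus a single scale, which is where the memory and energy savings come from, while the much larger reconstruction error hints at why repair methods are needed.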

    DNAsim: Evaluation Framework for Digital Neuromorphic Architectures

    Neuromorphic architectures implement low-power machine learning applications using spike-based biological neuron models trained with bio-inspired or machine learning algorithms. Prior work on simulating Spiking Neural Networks (SNNs) focused on simulating emerging compute-in-memory (CIM) architectures, while prior work on mapping SNNs focused mainly on minimizing inter-core communication or resource utilization and targeted either emerging CIM architectures or specific target platforms. SNN mapping choices on a neuromorphic multi-processor platform can impact performance and energy consumption. In this paper, we introduce a simulation framework that evaluates application mapping on a user-defined NoC-based multi-core digital neuromorphic architecture. Our simulator evaluates latency and energy based on the mapping and on abstract spike activity traces, which indicate the firing of neurons at specific discrete timesteps defined by the application. We create two hardware models based on work reported in the literature and show the evaluation of different mapping scenarios for a state-of-the-art SNN benchmark.

    Evaluation of Early-exit Strategies in Low-cost FPGA-based Binarized Neural Networks

    In this paper, we investigate the application of early-exit strategies to quantized neural networks with binarized weights, mapped to low-cost FPGA SoC devices. The increasing complexity of network models means that hardware reuse and heterogeneous execution are needed, which opens the opportunity to evaluate the prediction confidence level early on. We apply the early-exit strategy to a network model suitable for ImageNet classification that combines weights with floating-point and binary arithmetic precision. The experiments show an improvement in inference speed of around 20% using an early-exit network, compared with using a single primary neural network, with a negligible accuracy drop of 1.56%.

    ARTS: An adaptive regularization training schedule for activation sparsity exploration

    Brain-inspired event-based processors have attracted considerable attention for edge deployment because of their ability to efficiently process Convolutional Neural Networks (CNNs) by exploiting sparsity. On such processors, one critical feature is that the speed and energy consumption of CNN inference are approximately proportional to the number of non-zero values in the activation maps. Thus, to achieve top performance, an efficient training algorithm is required to largely suppress the activations in CNNs. We propose a novel training method, called Adaptive-Regularization Training Schedule (ARTS), which dramatically decreases the non-zero activations in a model by adaptively altering the regularization coefficient through training. We evaluate our method across an extensive range of computer vision applications, including image classification, object recognition, depth estimation, and semantic segmentation. The results show that our technique can achieve 1.41× to 6.00× more activation suppression on top of ReLU activation across various networks and applications, and outperforms the state-of-the-art methods in terms of training time, activation suppression gains, and accuracy. A case study for a commercially available event-based processor, Neuronflow, shows that the activation suppression achieved by ARTS effectively reduces CNN inference latency by up to 8.4× and energy consumption by up to 14.1×.
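The quantities involved in activation-sparsity training can be sketched as follows. The abstract does not give the ARTS schedule, so the `update_coeff` rule below is a purely hypothetical placeholder (raise the coefficient while accuracy holds, lower it otherwise); only the L1 activation penalty and the non-zero count are standard ingredients.

```python
def relu(xs):
    """ReLU: negative pre-activations become exact zeros."""
    return [x if x > 0 else 0.0 for x in xs]


def l1_penalty(activations, coeff):
    """Regularization term added to the loss to push activations toward zero."""
    return coeff * sum(abs(a) for a in activations)


def nonzero_fraction(activations):
    """Share of non-zero activations; event-based cost scales roughly with this."""
    return sum(1 for a in activations if a != 0) / len(activations)


def update_coeff(coeff, accuracy, target_accuracy, factor=1.5):
    """HYPOTHETICAL adaptive rule, not the schedule from the paper."""
    if accuracy >= target_accuracy:
        return coeff * factor   # accuracy is fine: regularize harder
    return coeff / factor       # accuracy dropped: back off


acts = relu([-1.0, 2.0, 0.0, 3.0])
print(nonzero_fraction(acts), l1_penalty(acts, 0.1))
print(update_coeff(0.1, accuracy=0.75, target_accuracy=0.7))
```

In a real training loop the penalty would be summed over all activation maps and added to the task loss, with the coefficient revised between epochs.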

    AI-based segmentation of intraoperative glioblastoma hyperspectral images

    Glioblastoma surgical resection is a challenging task for neurosurgeons. Complete tumor resection improves patients' chances of recovery and their prognosis, whilst excessive resection could lead to neurological deficits. Nevertheless, surgeons can hardly trace the tumor's extent and boundaries by sight. Indeed, most surgical procedures result in subtotal resections. Histopathological testing might enable complete tumor elimination, though it is not feasible due to the time required for tissue investigation. Several studies have reported that tumor cells have unique molecular signatures and properties. Hyperspectral Imaging (HSI) is an emerging, non-contact, non-ionizing, label-free and minimally invasive optical imaging technique able to extract information about the observed tissue at the molecular level. Here, we exploited extensive data augmentation, transfer learning, and the U-Net++ and DeepLab-V3+ architectures to perform automatic end-to-end segmentation of intraoperative glioblastoma hyperspectral images, achieving processing times and segmentation results competitive with the gold-standard procedure. Based on ground truths provided by the HELICoiD framework, we dramatically improved HSI processing times, enabling end-to-end segmentation of glioblastomas that targets real-time processing during open craniotomy, thus improving on the gold-standard ML pipeline. We measured inference times competitive with the standard CUDA environment offered by MATLAB 2020a. The fastest parallel version of HELICoiD took 1.68 s to process the most prominent image of the database, whilst our methodology performs segmentation inference in 0.29 ± 0.17 s, hence being real-time compliant with respect to the 21-second constraint imposed on processing. Furthermore, we evaluated our segmentation results qualitatively and quantitatively against the ground truth produced by HELICoiD.