    A Reconfigurable 2D-Convolution Accelerator for DNNs Quantized with Mixed-Precision

    Mixed-precision quantization uses, in each layer of a Deep Neural Network, the minimum bit-width that preserves accuracy. In this context, our new Reconfigurable 2D-Convolution Module (RCM) computes N = 1, 2 or 4 Multiply-and-Accumulate operations in parallel with configurable precision from 1 to 16/N bits. Our design-space exploration via high-level synthesis obtains the best points in the latency vs area space, varying the size of the tensor tile handled by our RCM and its parallelism. A comparison with a non-configurable module on a 28-nm technology shows many reconfigurable Pareto points for low bit-width configurations, making our RCM a promising mixed-precision accelerator for inference.
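
As a rough software illustration of the N = 1, 2 or 4 parallel MAC idea described above, the sketch below packs N signed subwords of 16/N bits into one 16-bit word and accumulates their dot product. It is only a behavioral model under stated assumptions: the names `pack`, `unpack` and `st_mac` are hypothetical, the subwords are assumed to be two's-complement fields of exactly 16/N bits, and nothing here reflects the actual RCM hardware.

```python
def pack(values, total_bits=16):
    """Pack len(values) signed integers into one word of total_bits bits (assumed layout)."""
    width = total_bits // len(values)
    mask = (1 << width) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * width)  # keep only the low `width` bits
    return word


def unpack(word, n, total_bits=16):
    """Split a word into n signed subwords of total_bits // n bits each."""
    width = total_bits // n
    mask = (1 << width) - 1
    fields = []
    for i in range(n):
        raw = (word >> (i * width)) & mask
        if raw >= 1 << (width - 1):  # reinterpret as two's complement
            raw -= 1 << width
        fields.append(raw)
    return fields


def st_mac(acc, a_word, b_word, n):
    """Accumulate the dot product of the n subword pairs held in a_word and b_word."""
    if n not in (1, 2, 4):
        raise ValueError("n must be 1, 2 or 4")
    products = (x * y for x, y in zip(unpack(a_word, n), unpack(b_word, n)))
    return acc + sum(products)


# Four 4-bit operand pairs processed in one call: 3*1 + (-2)*2 + 1*(-3) + 7*4
print(st_mac(0, pack([3, -2, 1, 7]), pack([1, 2, -3, 4]), 4))  # 24
```

With N = 1 the same function degenerates to a single 16-bit multiply-accumulate, which is the trade-off the abstract refers to: more operations per call at lower operand precision.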

    A Reconfigurable Depth-Wise Convolution Module for Heterogeneously Quantized DNNs

    In Deep Neural Networks (DNNs), the depth-wise separable convolution has often replaced the standard 2D convolution because it has far fewer parameters and operations. Another common technique to compress DNNs is heterogeneous quantization, which uses a different bit-width for each layer. In this context we propose, for the first time, a Reconfigurable Depth-wise convolution Module (RDM), which uses multipliers that can be reconfigured to support 1, 2 or 4 operations at the same time at increasingly lower operand precision. We leveraged High-Level Synthesis to produce five RDM variants with different levels of channel parallelism to cover a wide range of DNNs. Comparisons with a non-configurable Standard Depth-wise convolution Module (SDM) on a CMOS FDSOI 28-nm technology show a significant latency reduction for a given silicon area in the low-precision configurations.
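
To make the "fewer parameters" claim concrete, the sketch below compares weight counts for a standard 2D convolution and a depth-wise separable one (a depth-wise K x K stage followed by a 1 x 1 point-wise stage). The function names and the example layer shape are illustrative assumptions, biases are ignored, and the numbers are not taken from the paper.

```python
def conv2d_params(k, c_in, c_out):
    """Weights of a standard K x K convolution: one K x K x C_in filter per output channel."""
    return k * k * c_in * c_out


def dw_separable_params(k, c_in, c_out):
    """Weights of a depth-wise K x K stage plus a 1 x 1 point-wise stage."""
    depthwise = k * k * c_in   # one K x K filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels
standard = conv2d_params(3, 32, 64)
separable = dw_separable_params(3, 32, 64)
print(standard, separable)  # 18432 2336
```

For a fixed output feature-map size, the multiply-accumulate count scales by the same ratio, since each weight is applied once per output position.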

    Design-Space Exploration of Mixed-precision DNN Accelerators based on Sum-Together Multipliers

    Mixed-precision quantization (MPQ) is gaining momentum in academia and industry as a way to improve the trade-off between accuracy and latency of Deep Neural Networks (DNNs) in edge applications. MPQ requires dedicated hardware to support different bit-widths. One approach uses Precision-Scalable MAC units (PSMACs) based on multipliers operating in Sum-Together (ST) mode. These can be configured to compute N = 1, 2, 4 multiplications/dot-products in parallel with operands at 16/N bits. We contribute to the State of the Art (SoA) in three directions: we compare for the first time the SoA ST multiplier architectures in performance, power and area; compared to previous work, we contribute to the portfolio of ST-based accelerators by proposing three designs for the most common DNN algorithms: 2D-Convolution, Depth-wise Convolution and Fully-Connected; we show how these accelerators can be obtained with a High-Level Synthesis (HLS) flow. In particular, we perform a design-space exploration (DSE) in area, latency and power, varying many knobs, including PSMAC unit parallelism, clock frequency and ST multiplier type. From the DSE on a 28-nm technology we observe that, both at multiplier level and at accelerator level, there is no one-size-fits-all solution for every possible scenario. Our findings allow accelerator designers to choose, out of a rich variety, the best combination of ST multiplier and HLS knobs depending on the target: high performance, low area, or low power.

    Accelerating Quantized DNN Layers on RISC-V with a STAR MAC Unit

    To support quantized neural networks in low-end CPUs, we propose STAR MAC, a reconfigurable multiply-and-accumulate unit based on a modified Baugh-Wooley architecture that operates at a variable reduced precision. We integrated it in a small RISC-V processor called Ibex, obtaining a speedup of up to 5.8× in Fully-Connected (FC) layers, 3.7× in 2D-Convolution (2DConv) layers, and 2.8× in Depth-Wise Convolution (DWConv) layers with respect to the original Ibex core (Orig.), and up to 4.5× in FC layers, 3.0× in 2DConv layers, and 2.3× in DWConv layers against a modified Ibex core supporting standard 32-bit MAC operations (Orig.+MAC). Area and power in a 28-nm technology with 200 and 600 MHz target clock frequency are 0.015 and 0.017 mm², and 1.5 and 4.3 mW, respectively, with a limited overhead within 10% and 3% with respect to Orig., and within 3% and 3% against Orig.+MAC.

    High-Level Design of Precision-Scalable DNN Accelerators Based on Sum-Together Multipliers

    Precision-scalable (PS) multipliers are gaining traction in Deep Neural Network accelerators, particularly for enabling mixed-precision (MP) quantization in Deep Learning at the edge. This paper focuses on the Sum-Together (ST) class of PS multipliers, which are subword-parallel multipliers that can execute a standard multiplication at full precision or a dot-product with parallel low-precision operands. Our contributions in this area encompass multiple aspects: we enrich our previous comparison of SoA ST multipliers by including our recent radix-4 Booth ST multiplier and two novel designs; we extend the explanation of the architecture and the design flow of our previously proposed ST-based PS hardware accelerators designed for 2D-Convolution, Depth-wise Convolution, and Fully-Connected layers that we developed using High-Level Synthesis (HLS); we implement the uniform integer quantization equations in hardware; we conduct a broad HLS-driven design space exploration of our ST-based accelerators, varying numerous hardware parameters; finally, we showcase the advantages of ST-based accelerators when integrated into System-on-Chips (SoCs) in three different scenarios (low-area, low-power, and low-latency), running inference on MP-quantized MLPerf Tiny models as a case study. Across the three scenarios, the results show an average latency speedup of 1.46x, 1.33x, and 1.29x, a reduced energy consumption in most of the cases, and a marginal area overhead of 0.9%, 2.5% and 8.0%, compared to SoCs with accelerators based on fixed-precision 16-bit multipliers. To sum up, our work provides a comprehensive understanding of ST-based accelerators’ performance in an SoC context, paving the way for future enhancements and the solution of identified inefficiencies.
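
The abstract does not spell out its uniform integer quantization equations, so the following is only a generic sketch of the common affine form, q = clamp(round(x / s) + z), with dequantization x ≈ s (q - z). The function names, the signed integer range, and the use of Python's built-in `round` (which rounds halves to even) are assumptions for illustration, not the paper's hardware formulation.

```python
def quantize(x, scale, zero_point, bits):
    """Map a real value to a signed `bits`-wide integer using uniform (affine) quantization."""
    qmin = -(1 << (bits - 1))
    qmax = (1 << (bits - 1)) - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # saturate to the representable range


def dequantize(q, scale, zero_point):
    """Recover the approximate real value represented by the integer q."""
    return scale * (q - zero_point)


# 4-bit example with an assumed scale of 0.25 and zero point 0
print(quantize(0.5, 0.25, 0, 4))   # 2
print(quantize(10.0, 0.25, 0, 4))  # 7, saturated at the 4-bit maximum
print(dequantize(2, 0.25, 0))      # 0.5
```

In a mixed-precision setting, `bits` (and typically `scale`) would differ from layer to layer, which is what motivates hardware that can switch operand width at run time.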

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
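
The difference between the two counting schemes can be sketched in a few lines. This is a simplified illustration, not the study's procedure: the function name `cocitation_counts`, the input layout (one list of cited works per citing paper, each work given as an ordered author list), and the choice to count each author pair at most once per citing paper are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author pairs cited together, keeping the first `max_authors` authors per cited work."""
    counts = Counter()
    for references in citing_papers:
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Two hypothetical citing papers; each inner list is one cited work's author list
papers = [
    [["A", "B"], ["C"]],
    [["A", "D"], ["C", "B"]],
]
first_author_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(first_author_only[("A", "C")])  # 2
print(len(first_five))                # 6 distinct pairs
```

Note that this simple version also counts co-authors of the same cited work as co-cited; a real analysis would have to decide whether to keep or exclude such pairs.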

    A Machine-Learning Based Microwave Sensing Approach to Food Contaminant Detection

    To detect contaminants accidentally included in packaged foods, food industries use an array of systems ranging from metal detectors to X-ray imagers. Low-density plastic or glass contaminants, however, are not easily detected with standard methods. If the dielectric contrast between the packaged food and these contaminants in the microwave spectrum is appreciable, Microwave Sensing (MWS) can be used as a contactless detection method, which is particularly useful when the food is already packaged. In this paper we propose using MWS combined with Machine Learning (ML). In particular, we report on experiments we conducted with packaged cocoa-hazelnut spread and show the accuracy of our approach. We also present an FPGA acceleration that runs the ML processing in real time so as to keep up with the throughput of a production line.

    Enhanced Machine-Learning Flow for Microwave-Sensing Systems to Detect Contaminants in Food

    The presence of foreign bodies in packaged food is a serious concern for both final consumers (allergies, injuries, choking) and food manufacturers (reputation and economic losses). In particular, low-density plastics, glass and wood splinters are hard to detect even by the most advanced X-ray imagers. One solution is Machine-Learning-based Microwave Sensing (MLMWS): a non-invasive, contactless, and real-time method which uses a machine-learning (ML) classifier to analyze the microwaves scattered from the irradiated target object. In this paper, we extend our previous work on contaminant detection in cocoa-hazelnut spread jars by proposing an enhanced ML flow to increase the accuracy of the ML classifier. For the first time in this case study, we use a multi-class classifier and train it with scattering parameters measured at multiple microwave frequencies, with a new pre-processing scaler, data augmentation, quantization-aware training and a pruning schedule. The results show a multi-class contaminant detection accuracy of 94.167% with a latency of 26 μs when targeting an AMD/Xilinx Kria K26 FPGA. Finally, we released our datasets publicly on OpenML.