1,720,999 research outputs found
A Reconfigurable Depth-Wise Convolution Module for Heterogeneously Quantized DNNs
In Deep Neural Networks (DNN), the depth-wise separable convolution has often replaced the standard 2D convolution having much fewer parameters and operations. Another common technique to squeeze DNNs is heterogeneous quantization, which uses a different bitwidth for each layer. In this context we propose for the first time a novel Reconfigurable Depth-wise convolution Module (RDM), which uses multipliers that can be reconfigured to support 1, 2 or 4 operations at the same time at increasingly lower precision of the operands. We leveraged High Level Synthesis to produce five RDM variants with different channels parallelism to cover a wide range of DNNs. The comparisons with a non-configurable Standard Depth-wise convolution module (SDM) on a CMOS FDSOI 28-nm technology show a significant latency reduction for a given silicon area for the low-precision configurations
HLS-Based Flexible Hardware Accelerator for PCA Algorithm on a Low-Cost ZYNQ SoC
Principal Component Analysis (PCA) is a widely used approach for dimensionality reduction in image processing. In microwave imaging, for example, it is used as an intermediate step toward image reconstruction. An FPGA hardware implementation of PCA is highly beneficial, especially as an accelerator for a low-cost embedded environment. In this paper we propose a flexible PCA hardware accelerator that can be used for different input data dimensions and input precisions. In addition, it supports both floating-point and fixed-point arithmetic representations. The target hardware is a ZYNQ SoC. We used High Level Synthesis (HLS) to quickly explore the design space and so to find the best implementation for a given setting of the application parameters and given the characteristics of the target hardware. We show the impact on performance of different hardware optimization techniques enabled by HLS. The proposed method outperforms a similar state-of-the-art HLS design in terms of latency and resource usage
Design-Space Exploration of Mixed-precision DNN Accelerators based on Sum-Together Multipliers
Mixed-precision quantization (MPQ) is gaining momentum in academia and industry as a way to improve the trade-off between accuracy and latency of Deep Neural Networks (DNNs) in edge applications. MPQ requires dedicated hardware to support different bit-widths. One approach uses Precision-Scalable MAC units (PSMACs) based on multipliers operating in Sum-Together (ST) mode. These can be configured to compute N = 1, 2, 4 multiplications/dot-products in parallel with operands at 16/N bits. We contribute to the State of the Art (SoA) in three directions: we compare for the first time the SoA ST multipliers architectures in performance, power and area; compared to previous work, we contribute to the portfolio of ST-based accelerators proposing three designs for the most common DNN algorithms: 2D-Convolution, Depth-wise Convolution and Fully-Connected; we show how these accelerators can be obtained with a High-Level Synthesis (HLS) flow. In particular, we perform a design-space exploration (DSE) in area, latency, power, varying many knobs, including PSMAC units parallelism, clock frequency and ST multipliers type. From the DSE on a 28-nm technology we observe that both at multiplier level and at accelerator level there is no one-fits-all solution for each possible scenario. Our findings allow accelerators’ designers to choose, out of a rich variety, the best combination of ST multiplier and HLS knobs depending on the target, either high performance, low area, or low power
Accelerating Quantized DNN Layers on RISC-V with a STAR MAC Unit
To support quantized neural networks in low-end CPUs, we propose STAR MAC, a reconfigurable multiply-and-accumulate unit based on a modified Baugh-Wooley architecture that operates at a variable reduced precision. We integrated it in a small RISC-V processor called Ibex obtaining an acceleration up to 5.8 in Fully-Connected (FC) layers, 3.7 in 2D-Convolution (2DConv) layers, and 2.8 in Depth-Wise Convolution (DWConv) layers, with respect to the original Ibex core (Orig.), and up to 4.5 in FC layers, 3.0 in 2DConv layers, and 2.3 in DWConv layers, against a modified Ibex core supporting standard 32-bit MAC operations (Orig.+MAC). Area and power in a 28-nm technology with 200 and 600 MHz target clock frequency are 0.015 and 0.017 mm, and 1.5 and 4.3 mW, respectively, with a limited overhead within 10% and 3% with respect to Orig., and within 3% and 3% against Orig.+MAC
Efficient Training and Hardware Co-design of Machine Learning Models
To implement a Machine Learning (ML) model in hardware (Hw), usually a first Design Space Exploration (DSE) optimizes the model hyper-parameters in search of the best ML performance, while a second DSE finds the configuration with the best Hw performance. Multiple iterations of these steps might be needed as the optimal ML model may not necessarily be implementable. To reduce the design-time and provide the designer with a single exploration environment, we propose a general framework based on Bayesian Optimization (BO) and High-Level Synthesis (HLS), which performs at once both DSEs generating efficient Pareto curves in the space of ML and Hw performance
HLS-based dataflow hardware architecture for Support Vector Machine in FPGA
Implementing fast and accurate Support Vector Machine (SVM) classifiers in embedded systems with limited compute and memory capacity and in applications with real-time constraints, such as continuous medical monitoring for anomaly detection, can be challenging and calls for low cost, low power and resource efficient hardware accelerators. In this paper, we propose a flexible FPGA-based SVM accelerator highly optimized through a dataflow architecture. Thanks to High Level Synthesis (HLS) and the dataflow method, our design is scalable and can be used for large data dimensions when there is limited on-chip memory. The hardware parallelism is adjustable and can be specified according to the available FPGA resources. The performance of different SVM kernels are evaluated in hardware. In addition, an efficient fixed-point implementation is proposed to improve the speed. We compared our design with recent SVM accelerators and achieved a minimum of 10x speed-up compared to other HLS-based and 4.4x compared to HDL-based designs
High-Level Design of Precision-Scalable DNN Accelerators Based on Sum-Together Multipliers
Precison-scalable (PS) multipliers are gaining traction in Deep Neural Network accelerators, particularly for enabling mixed-precision (MP) quantization in Deep Learning at the edge. This paper focuses on the Sum-Together (ST) class of PS multipliers, which are subword-parallel multipliers that can execute a standard multiplication at full precision or a dot-product with parallel low-precision operands. Our contributions in this area encompass multiple aspects: we enrich our previous comparison of SoA ST multipliers by including our recent radix-4 Booth ST multiplier and two novel designs; we extend the explanation of the architecture and the design flow of our previously proposed ST-based PS hardware accelerators designed for 2D-Convolution, Depth-wise Convolution, and Fully-Connected layers that we developed using High-Level Synthesis (HLS); we implement the uniform integer quantization equations in hardware; we conduct a broad HLS-driven design space exploration of our ST-based accelerators, varying numerous hardware parameters; finally, we showcase the advantages of ST-based accelerators when integrated into System-on-Chips (SoCs) in three different scenarios (low-area, low-power, and low-latency), running inference on MP-quantized MLPerf Tiny models as case study. Across the three scenarios, the results show an average latency speedup of 1.46x, 1.33x, and 1.29x, a reduced energy consumption in most of the cases, and a marginal area overhead of 0.9%, 2.5% and 8.0%, compared to SoCs with accelerators based on fixed-precision 16-bit multipliers. To sum up, our work provides a comprehensive understanding of ST-based accelerators’ performance in an SoC context, paving the way for future enhancements and the solution of identified inefficiencies
Hardware Acceleration of Microwave Imaging Algorithms
Microwave Imaging (MWI) is a technique that allows to reconstruct an image of the internal structure of an object by irradiating the object with a known incident field and by acquiring and processing the scattered field. As such, MWI can be used as a diagnostic technique for various medical issues, such as brain stroke or breast cancer. However, the image reconstruction process entails a sequence of compute-intensive algorithms, which we call “kernels.” The kernels execution time might become an issue when the time to diagnosis is a key factor. To speed up the kernels execution, we can use hardware acceleration techniques. In this chapter, we will identify the complex computational kernels that are recurrent in MWI. Then, we present the methodologies to design specific hardware accelerators in Field-Programmable Gate Arrays (FPGAs) for these kernels by means of an efficient design method called High Level Synthesis (HLS) and by several HLS-based hardware optimization techniques. In addition, we will present new methodologies for design automation in HLS-based hardware designs. The results show that the presented HLS optimizations and design automation techniques can significantly improve the efficiency of the hardware accelerator designs for MWI
Efficient FPGA Implementation of PCA Algorithm for Large Data using High Level Synthesis
Principal Component Analysis (PCA) is a widely used method for dimensionality reduction in different application areas, including microwave imaging where the size of input data is large. Despite its popularity, one of the difficulties in using PCA is its high computational complexity, especially for large dimensional data. In recent years several FPGA implementations have been proposed to accelerate PCA computation. However, most of them use manual RTL design, which requires more time for design and development. In this paper, we propose an FPGA implementation of PCA using High Level Synthesis (HLS), which allows us to explore the design space more efficiently than with hand-coded RTL design. Starting from a PCA algorithm written in C++, we apply various hardware optimization techniques to the same code using Vivado HLS in order to quickly explore the design space. Our experiments show that the performance of the design obtained with the proposed method is superior to the state-of-the-art RTL design in terms of resource utilization, latency and frequency
FPGA Acceleration of 3D FDTD for Multi- Antennas Microwave Imaging Using HLS
Microwave Imaging (MI) for biomedical applications has attracted attention due to its harmless radiation compared to X-ray or MRI. One of the commonly used computing methods in MI is Finite Difference Time Domain (FDTD), which is executed several times in iterative loops, hence resulting in a high execution time. Although several hardware accelerators for FDTD have been recently introduced, they are not specifically designed for MI applications. In particular, only simple absorbing boundary conditions have been investigated, and the impact of dispersive materials on FDTD has not been considered. In this paper, we propose a multi-FPGA accelerator for 3D FDTD that is integrated in an MI algorithm, with Convolutional Perfectly Matched Layer (CPML) boundary conditions and an exact model for dispersive materials. By using High Level Synthesis (HLS), we obtain an optimized hardware accelerator that uses an efficient blocking method to reduce the data transfer time between external and local memories. We propose two alternative architectures that trade off performance and resource usage. In addition, our code, being developed at a high level, can also be run on GPUs whenever necessary. The results show that our multi-FPGA accelerator is superior to three similar GPU-based designs in terms of execution time and power consumption
- …
