VeriIntel2C: Abstracting RTL to C to Maximize High-Level Synthesis Design Space Exploration
Due to copyright restrictions and/or publisher's policy full text access from Treasures at UT Dallas is limited to current UTD affiliates (use the provided Link to Article). The design of integrated circuits (ICs) is typically done at the Register Transfer Level (RTL) using low-level Hardware Description Languages (HDLs) like Verilog or VHDL. These give full control over the generated hardware, as they allow designers to specify the detailed behaviour and structure of the architecture at every single clock cycle. The main drawback of these low-level HDLs is that it takes a very long time to create and verify large ICs with them. Moreover, it is hard to re-use HDL code for future projects that require changes in the micro-architecture. Thus, the industry is raising the level of abstraction to C-based VLSI design, where designers only have to specify the functionality of the program and High-Level Synthesis (HLS) tools generate the HDL automatically. One additional benefit of C-based VLSI design is that it makes it possible to explore the search space of possible micro-architectures from a single behavioral description. The result of a Design Space Exploration (DSE) is a trade-off curve of Pareto-optimal designs with unique area vs. performance metrics. Most VLSI design companies have large amounts of legacy HDL code. Thus, it makes sense to have an automatic flow to convert HDL designs into behavioral descriptions (e.g. C, C++ or SystemC) optimized for HLS DSE. This implies generating explorable constructs, e.g. loops and arrays, which upon exploration lead to very different micro-architectures (e.g. loops can be unrolled or folded, arrays can be mapped to RAMs or registers). In this paper, we propose a robust RTL to C translation method called VeriIntel2C to abstract RTL descriptions (written in Verilog) into ANSI-C descriptions optimized for HLS DSE by generating a large number of loops and arrays.
Our method generates these explorable constructs by using extended Hardware Petri Nets to extract the behaviour of the Verilog designs and to build a Control Data Flow Graph (CDFG) that allows the easy identification of these constructs. Experimental results on a wide range of designs show that VeriIntel2C expands the design space considerably and also improves the quality of the design space by 55% on average compared to previous work. Erik Jonsson School of Engineering and Computer Scienc
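The notion of a trade-off curve of Pareto-optimal designs mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names and the (area, latency) sample values are hypothetical, and both metrics are assumed to be minimized.

```python
from typing import List, Tuple

Design = Tuple[float, float]  # (area, latency); both assumed to be minimized


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b in both metrics and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the non-dominated designs, sorted by area."""
    front = [d for d in designs if not any(dominates(o, d) for o in designs)]
    return sorted(set(front))


# Hypothetical (area, latency) results of synthesizing one description
# with five different option sets.
explored = [(100, 50), (120, 40), (150, 45), (200, 20), (110, 60)]
print(pareto_front(explored))
```

Here (150, 45) and (110, 60) are discarded because another design is at least as good in both metrics; the remaining points form the trade-off curve.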
Partial Encryption of Behavioral IPs to Selectively Control the Design Space in High-Level Synthesis
Commercial High-Level Synthesis (HLS) tool vendors have started to offer ways to protect Behavioral IPs (BIPs) from being unlawfully used. The main approach is to provide tools to encrypt these BIPs so that they can be decrypted by the HLS tool only. The main problem with this approach is that encrypting the IP does not allow BIP users to insert synthesis directives into the source code in the form of pragmas (comments), and hence cancels out one of the most important advantages of C-based VLSI design: the ability to automatically generate micro-architectures with unique design metrics, e.g. area, power and performance. This work studies the impact on the search space when synthesis directives cannot be inserted into the encrypted IP source code while other options are still available to the BIP users (e.g. setting global synthesis options and limiting the number and type of functional units), and proposes a method that selectively controls the search space by encrypting different portions of the BIP. To achieve this goal we propose a fast heuristic based on a divide-and-conquer method. Experimental results show that our proposed method works well compared to an exhaustive search that leads to the optimal solution. © 2019 EDAA. Erik Jonsson School of Engineering and Computer Scienc
A machine learning based hard fault recuperation model for approximate hardware accelerators
The continuous pursuit of higher performance and energy efficiency has led to heterogeneous SoCs that contain multiple dedicated hardware accelerators. These accelerators exploit the inherent parallelism of tasks and are often tolerant of inaccuracies in their outputs, e.g. in image and digital signal processing applications. At the same time, permanent faults are escalating due to process scaling and power restrictions, leading to erroneous outputs. To address this issue, in this paper, we propose a low-cost, universal fault recovery/repair method that utilizes supervised machine learning techniques to ameliorate the effect of permanent fault(s) in hardware accelerators that can tolerate inexact outputs. The proposed compensation model does not require any information about the accelerator and is highly scalable with low area overhead. Experimental results show that the proposed method improves the accuracy by 50% and decreases the overall mean error rate by 90% with an area overhead of 5% compared to execution without fault compensation. Erik Jonsson School of Engineering and Computer Scienc
Common-Mode Failure Mitigation: Increasing Diversity through High-Level Synthesis
Fault tolerance is vital in many domains. One popular way to increase fault tolerance is through hardware redundancy. However, basic redundancy cannot cope with Common Mode Failures (CMFs). One way to address CMFs is through the use of diversity in combination with traditional hardware redundancy. This work proposes an automatic design space exploration (DSE) method to generate optimized redundant hardware accelerators with maximum diversity to protect against CMFs, given a single behavioral description for High-Level Synthesis (HLS). For this purpose, this work exploits one of the main advantages of C-based VLSI design over the traditional RT-level design based on low-level Hardware Description Languages (HDLs): the ability to generate micro-architectures with unique characteristics from the same behavioral description. Experimental results show that the proposed method provides a significant diversity increment compared to using traditional RTL-based exploration to generate diverse designs. © 2019 EDAA. Erik Jonsson School of Engineering and Computer Scienc
In-Situ Implementation and Training of Convolutional Neural Network on FPGAs
The main objective of this thesis is to investigate the efficiency of in-situ trainable Convolutional
Neural Networks (CNNs) on modern programmable System-on-Chip (SoC) Field Programmable
Gate Arrays (FPGAs) composed of embedded processors and reconfigurable fabric and to study
the robustness of the system when faults happen. One particular characteristic of this work is that
the CNN is developed exclusively using High-Level Synthesis (HLS), specifically in SystemC,
generating Verilog code. In this thesis, the feature maps are also being trained on the FPGA, which
is traditionally done offline. The CNN architecture is instantiated on the FPGA and weights are
trained through the software model on the ARM processor embedded into the FPGA and updated
in the architecture through the AXI bus interface.
Moreover, since the CNN is implemented in hardware, the resources used need to be minimized. This
allows choosing a smaller and cheaper FPGA, as well as reducing the total power consumption.
To address this, the effect of reducing the bitwidth of the CNN is investigated with respect to the
accuracy of handwritten character recognition. Finally, the robustness of the CNN is analyzed
by breaking internal connections of different neurons and studying how the accuracy drops when the
fault happens at different layers. If the accuracy is reduced, the CNN is re-trained in-situ to
increase its accuracy.
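As a rough illustration of what bitwidth reduction does to network weights, the following Python sketch rounds values to a signed fixed-point grid and reports the worst-case rounding error. It is only a sketch under assumptions: the thesis does not describe its number format in this abstract, so the fixed-point layout, the `quantize` helper and the sample weights are all hypothetical.

```python
def quantize(value: float, total_bits: int, frac_bits: int) -> float:
    """Round value to a signed fixed-point grid and saturate to its range.

    Assumed format: two's complement with total_bits bits, of which
    frac_bits are fractional.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = max(lowest, min(highest, round(value * scale)))
    return code / scale


# Hypothetical weights; two integer bits are kept at every bitwidth.
weights = [0.7312, -0.1289, 0.0041, -1.6]
for bits in (16, 8, 4):
    quantized = [quantize(w, bits, bits - 2) for w in weights]
    worst = max(abs(w - q) for w, q in zip(weights, quantized))
    print(bits, quantized, worst)
```

Narrower formats need fewer hardware resources per multiplier and register, at the price of a larger rounding error; how much accuracy the CNN loses for a given bitwidth is the kind of question the thesis investigates experimentally.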
Bespoke Behavioral Processors
Many emerging applications require simple controllers that run the exact same application
continuously. These include medical devices and IoT devices of different natures. Because of the
nature of these applications, they have to be ultra-low power and small. Most of these applications are mapped onto low-power processors, as they are computationally inexpensive and thus
amenable to execution on a simple microprocessor. One of the problems of using a general
purpose processor is that not all of its resources are required for a specific application; thus,
there is a large potential for simplifying the processor to achieve lower area and power. In
addition, these processors can be specified at the behavioral level using High-Level Synthesis
(HLS) to generate the RTL automatically. This opens a window for additional optimizations
as the processor can be pruned and re-synthesized at different VLSI design levels in order
to obtain a smaller and more power-efficient processor. This work presents a methodology
to customize a behavioral RISC processor automatically for a given workload such that its
area and power are significantly reduced compared to the original processor. Compared
to previous work that customizes a given processor at the gate netlist level only, the proposed
method helps reduce the area and power significantly by raising the level of abstraction.
Effective High-Level Synthesis Design Space Exploration Through a Novel Cost Function Formulation
In the last few decades, Integrated Circuit (IC) designers have had to manually translate
behavioral descriptions into Register-Transfer Level (RTL) code (e.g. Verilog or VHDL).
High-Level Synthesis (HLS) automates this process. HLS has many advantages compared to specifying hardware at the RT-Level. One big advantage is that the behavioral
description only needs to be designed and verified once, yet can be used to generate RTL designs with
different characteristics by simply specifying different synthesis options. This opens the door
to perform a fully automatic Design Space Exploration (DSE). The main goal in HLS DSE
is to find Pareto-optimal micro-architectures for the given untimed behavioral description.
For large untimed descriptions an exhaustive enumeration of all possible combinations of synthesis
options is not possible, hence heuristics are required. This work presents three metaheuristic algorithms to address this issue: Simulated Annealing (SA), Genetic Algorithm
(GA) and Ant Colony Optimization (ACO). These algorithms were originally designed to solve
Single-Objective (SO) problems, whereas DSE is Multi-Objective (MO), i.e. area vs. performance. To convert the MO problem into an SO one, this work proposes a new method called
ξ-constraint and compares its results with the traditional method
(weighted sum as cost function) for all three algorithms.
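The difference between the two ways of turning a multi-objective problem into a single-objective one can be sketched in Python. The weighted sum below is the traditional cost function named in the abstract. The second function is only a generic constraint-style scalarization (minimize one objective while bounding the other); it is an assumption made for illustration and not the ξ-constraint formulation of the thesis, whose details are not given here. All names and numbers are hypothetical.

```python
from typing import Tuple

Design = Tuple[float, float]  # (area, latency); both assumed to be minimized


def weighted_sum_cost(d: Design, w_area: float, ref: Design) -> float:
    """Traditional scalarization: weighted sum of normalized objectives."""
    return w_area * d[0] / ref[0] + (1.0 - w_area) * d[1] / ref[1]


def constrained_cost(d: Design, max_latency: float) -> float:
    """Generic constraint-style scalarization (assumed form): minimize area
    and reject any design whose latency exceeds the bound."""
    return d[0] if d[1] <= max_latency else float("inf")


designs = [(100, 50), (120, 40), (200, 20)]
ref = (200, 50)  # hypothetical normalization: largest area, largest latency

best_area_biased = min(designs, key=lambda d: weighted_sum_cost(d, 0.7, ref))
best_speed_biased = min(designs, key=lambda d: weighted_sum_cost(d, 0.3, ref))
best_bounded = min(designs, key=lambda d: constrained_cost(d, 45))
print(best_area_biased, best_speed_biased, best_bounded)
```

In this toy example the two weightings select the two extreme designs, while the latency bound of 45 selects the intermediate one. Sweeping the weight or the bound and re-running a single-objective optimizer is how such scalarizations are typically used to approximate a Pareto front.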
Efficient Hardware Acceleration on SoC-FPGA using OpenCL
Field Programmable Gate Arrays (FPGAs) are taking over from conventional processors in the field of high-performance computing. With the advent of FPGA architectures and High-Level Synthesis tools, FPGAs can now be easily used to accelerate computationally intensive applications such as AI and cognitive computing. One of the advantages of raising the level of hardware design abstraction is that multiple configurations with unique properties (i.e. area, performance and power) can be automatically generated without the need to re-write the input description. This is not possible when using traditional low-level hardware description languages like VHDL or Verilog. This thesis deals with this important topic: it accelerates multiple computationally intensive applications amenable to hardware acceleration and proposes a fast heuristic Design Space Exploration method to find dominant design solutions quickly.
In particular, in this work, we developed different computationally intensive applications in OpenCL and mapped them onto a heterogeneous SoC-FPGA. A Genetic Algorithm (GA) based meta-heuristic that performs automatic Design Space Exploration (DSE) on these applications was also developed, as GAs have been shown in the past to lead to good results in multi-objective optimization problems like this one. The developed explorer automatically inserts a set of control knobs into the OpenCL behavioral description, e.g., to control how to synthesize loops (unroll or not), and to replicate Compute Units (CUs). By tuning these control attributes over their possible values, thousands of different micro-architecture configurations can be obtained. Thus, an exhaustive search is not feasible and heuristics are needed. Each configuration is compiled using the Altera OpenCL SDK tool and executed on the Terasic DE1-SoC FPGA board platform to record the corresponding performance and logic utilization. In order to measure the quality of the proposed GA-based heuristic, each application is explored exhaustively (taking multiple days to finish for smaller designs) to find the dominant optimal solutions (Pareto-optimal designs). For complex and larger designs, exploring the entire design space exhaustively is not feasible due to the very large design space. The comparison is quantified using metrics like Dominance, Average Distance from Reference Set (ADRS) and run time speed-up, showing that our proposed heuristic leads to very good results in a fraction of the time of the exhaustive search.
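The ADRS metric mentioned above can be sketched in Python as follows. This follows one formulation commonly used in the HLS DSE literature and is assumed here rather than taken from the thesis: for each design in the reference (exhaustive) Pareto set, take the smallest normalized amount by which any heuristic design is worse, then average over the reference set. The sample fronts are hypothetical.

```python
from typing import List, Tuple

Design = Tuple[float, float]  # (area, latency); both assumed to be minimized


def adrs(reference: List[Design], approx: List[Design]) -> float:
    """Average Distance from Reference Set; 0.0 means the fronts coincide."""
    total = 0.0
    for ref_area, ref_lat in reference:
        # Distance from this reference design to the closest approximate one,
        # measured as the worst relative degradation in either metric.
        total += min(
            max(0.0, (area - ref_area) / ref_area, (lat - ref_lat) / ref_lat)
            for area, lat in approx
        )
    return total / len(reference)


# Hypothetical fronts: exhaustive search vs. a heuristic explorer.
exhaustive = [(100, 50), (120, 40), (200, 20)]
heuristic = [(100, 50), (130, 40), (200, 20)]
print(adrs(exhaustive, heuristic))
```

A lower value means the heuristic front lies closer to the reference front; in this example only the middle design is missed, by about 8% in area.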
Flexible Partial Reconfiguration Based Design Architecture for Dataflow Computation
In this thesis research we propose a generic semi-automatic partial-reconfiguration-based
design methodology which takes as input behavioral description files in
C/C++/SystemC for a dataflow process and outputs partial binaries to deploy on the SoC
FPGA. This methodology is coupled with a novel static design architectural framework
utilizing internal block RAM memory to store intermediate results. In order to prove the
efficacy of the proposed methodology and architecture in terms of area and timing, we
have implemented the JPEG Encoder from S2CBench v.2.0 spatially and then with partial
reconfiguration design methodologies. The proposed design method is abbreviated as PRBRAM;
it uses internal FPGA on-chip memory to store intermediate results when time
multiplexing kernels, whereas PRDDR is a partial reconfiguration based design method utilizing
external off-chip DDR memory. The reconfiguration time is a critical parameter determining
the performance of DPR designs. Reconfiguration time depends on the area of Reconfigurable
Partition (RP) and the generated partial bitstream. Thus, we study and show experimentally,
considering an equal RP area for both PRBRAM and PRDDR, that the former, proposed method
is more efficient in runtime and latency than the latter. We also examine the effects
of variations in reconfigurable partition area on running time, considering different numbers
of reconfigurations required for the application on the proposed PRBRAM architecture. We show that the implementation with the proposed PRBRAM architecture is area efficient
compared to the spatial implementation, with LUT area savings of up to 21.20% and FF area
savings of up to 30.41% for a partial bitstream size of 1598.896 KB. These percentages include
the additional resources utilized by the proposed static architecture. We also observed an
improvement in average hardware running time of 0.529363 s against PRDDR.
Reducing the Complexity of Fault-Tolerant Behavioral Hardware Accelerators
Continuous technology scaling has made it possible to integrate a large number of different hardware
components on the same integrated circuit (IC). Thus, these complex ICs are typically
called System-on-Chip (SoC). Area, power and performance have been traditionally the
most important design metrics, but for many safety critical applications, reliability is equally
important. Fault tolerance can therefore not be a second class citizen anymore and must be
considered early on in the design process of these complex ICs.
Due to the heterogeneity of these SoCs a single fault-tolerance solution is not possible.
Dedicated solutions have been proposed for the embedded processor, the memory, different
interfaces and for the dedicated hardware accelerators.
For example, in the processor case, the program execution relies on the control flow instructions that determine which section of code will be executed at run-time. A single event upset
(SEU) can impact the execution order of the program. Thus, in this thesis we study the
effect of transient errors on the corruption of program control flows, and present a methodology to detect these at the software level. This is done by inserting additional control flow
instructions directly at the assembly code after a static control flow analysis is performed.
Moreover, one key differentiating element between different SoCs is the hardware accelerators in them. Most of the other components in the SoCs are off-the-shelf modules, and the
main differentiating element among the different SoC offerings is typically the mix of hardware
accelerators that they include. Due to the long design cycles of these complex systems,
the design of these accelerators is often now done at the behavioral level and High-Level
Synthesis (HLS) is used to generate the Register Transfer Level (RTL) code of the accelerator. It is therefore imperative to introduce low overhead fault-tolerance techniques for these
accelerators described at the behavioral level. This thesis presents different techniques to
reduce the overhead associated with traditional N-modular redundancy techniques for these
accelerators.
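For context, the traditional N-modular redundancy baseline whose overhead the thesis aims to reduce can be sketched in Python as N replicas feeding a majority voter. This is a generic illustration, not the thesis's technique: the `accelerator` function, the fault-injection parameter and the bit-flip mask are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(outputs: List[int]) -> int:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def accelerator(x: int) -> int:
    """Hypothetical stand-in for the behavioral accelerator."""
    return 3 * x + 1


def run_nmr(x: int, n: int = 3, faulty: Optional[int] = None) -> int:
    """Run n replicas, optionally flip one bit in one replica's output
    to emulate a fault, and vote on the results."""
    outputs = [accelerator(x) for _ in range(n)]
    if faulty is not None:
        outputs[faulty] ^= 0x4  # injected single-bit error
    return majority_vote(outputs)


print(run_nmr(5), run_nmr(5, faulty=1))
```

With three replicas a single faulty output is masked, but the area and power roughly triple (plus the voter), which is the overhead that motivates lower-cost alternatives.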
