VeriIntel2C: Abstracting RTL to C to Maximize High-Level Synthesis Design Space Exploration

Author: Schaefer Benjamin Carrion
Mahapatra Anushree
Publication date: 23/08/2018

Due to copyright restrictions and/or publisher's policy, full text access from Treasures at UT Dallas is limited to current UTD affiliates (use the provided Link to Article).
The design of integrated circuits (ICs) is typically done using low-level Hardware Description Languages (HDLs) like Verilog or VHDL at the Register Transfer Level (RTL). These enable full controllability of the generated hardware design, as they allow designers to specify the detailed behaviour and structure of the architecture at every single clock cycle. The main drawback of using these low-level HDLs is that it takes a very long time to create and verify large ICs with them. Moreover, it is hard to re-use HDL code for future projects that require changes in the micro-architecture. Thus, the industry is raising the level of abstraction to C-based VLSI design, where designers only have to specify the functionality of the program and High-Level Synthesis (HLS) tools generate the HDL automatically. One additional benefit of C-based VLSI design is that it enables exploring the search space of possible micro-architectures from a single behavioral description. The result of a Design Space Exploration (DSE) is a trade-off curve of Pareto-optimal designs with unique area vs. performance metrics. Most VLSI design companies have large amounts of legacy HDL code. Thus, it makes sense to have an automatic flow to convert HDL designs into behavioral descriptions (e.g. C, C++ or SystemC) optimized for HLS DSE. This implies the generation of explorable constructs, e.g. loops and arrays, which upon exploration lead to very different micro-architectures (e.g. loops can be unrolled or folded, arrays can be mapped to RAMs or registers). In this paper, we propose a robust RTL to C translation method called VeriIntel2C to abstract RTL descriptions (written in Verilog) into ANSI-C descriptions optimized for HLS DSE by generating a large number of loops and arrays. Our method is able to generate these explorable constructs with the use of extended Hardware Petri Nets to extract the behaviour of the Verilog designs and to generate a Control Data Flow Graph (CDFG) that allows the easy identification of these constructs. From the experimental results, we are able to demonstrate that VeriIntel2C expands the design space considerably and also improves the quality of the design space by 55% on average compared to previous work, on a wide range of designs.
Erik Jonsson School of Engineering and Computer Science
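
The abstract describes the result of a DSE as a trade-off curve of Pareto-optimal designs. As a minimal sketch of that idea (not the authors' tool), the Python snippet below filters a list of hypothetical (area, latency) pairs down to the non-dominated ones. The `Design` type, the function names, and the sample numbers are assumptions made purely for illustration, and both metrics are assumed to be minimized.

```python
from typing import List, Tuple

# Hypothetical design point: (area, latency); both metrics are minimized.
Design = Tuple[float, float]


def dominates(a: Design, b: Design) -> bool:
    """Return True if a is no worse than b in both metrics and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep only the designs that no other design dominates, sorted by area."""
    front = [d for d in designs if not any(dominates(o, d) for o in designs)]
    return sorted(set(front))


# Illustrative numbers only; they do not come from the paper.
designs = [(100, 50), (120, 30), (150, 30), (200, 10), (110, 60)]
front = pareto_front(designs)
```

Here `(150, 30)` is dropped because `(120, 30)` is smaller at the same latency, and `(110, 60)` is dropped because `(100, 50)` is better in both metrics.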

Treasures @ UT Dallas

Partial Encryption of Behavioral IPs to Selectively Control the Design Space in High-Level Synthesis

Author: Wang Zi
Schaefer Benjamin Carrion
Publication date: 2019

Due to copyright restrictions and/or publisher's policy, full text access from Treasures at UT Dallas is limited to current UTD affiliates (use the provided Link to Article).
Commercial High-Level Synthesis (HLS) tool vendors have started to enable ways to protect Behavioral IPs (BIPs) from being unlawfully used. The main approach is to provide tools to encrypt these BIPs, which can then be decrypted by the HLS tool only. The main problem with this approach is that encrypting the IP does not allow BIP users to insert synthesis directives into the source code in the form of pragmas (comments), and hence cancels out one of the most important advantages of C-based VLSI design: the ability to automatically generate micro-architectures with unique design metrics, e.g. area, power and performance. This work studies the impact on the search space when synthesis directives cannot be inserted into the encrypted IP source code while other options are still available to the BIP users (e.g. setting global synthesis options and limiting the number and type of functional units), and proposes a method that selectively controls the search space by encrypting different portions of the BIP. To achieve this goal, we propose a fast heuristic based on a divide-and-conquer method. Experimental results show that our proposed method works well compared to an exhaustive search that leads to the optimal solution. © 2019 EDAA.
Erik Jonsson School of Engineering and Computer Science

Crossref

Treasures @ UT Dallas

A machine learning based hard fault recuperation model for approximate hardware accelerators

Author: Joseph Callenes-Sloan
Schaefer Benjamin Carrion
Farah Naz Taher
Publication date: 24/06/2018

Full text access from Treasures at UT Dallas is restricted to current UTD affiliates (use the provided Link to Article). Non-UTD affiliates will find the web address for this item by clicking the "Show full item record" link, copying the "dc.relation.uri" metadata and pasting it into a browser.
The continuous pursuit of higher performance and energy efficiency has led to heterogeneous SoCs that contain multiple dedicated hardware accelerators. These accelerators exploit the inherent parallelism of tasks and are often tolerant to inaccuracies in their outputs, e.g. in image and digital signal processing applications. At the same time, permanent faults are escalating due to process scaling and power restrictions, leading to erroneous outputs. To address this issue, in this paper we propose a low-cost, universal fault recovery/repair method that utilizes supervised machine learning techniques to ameliorate the effect of permanent fault(s) in hardware accelerators that can tolerate inexact outputs. The proposed compensation model does not require any information about the accelerator and is highly scalable with low area overhead. Experimental results show that the proposed method improves the accuracy by 50% and decreases the overall mean error rate by 90%, with an area overhead of 5%, compared to execution without fault compensation.
Erik Jonsson School of Engineering and Computer Science

Crossref

Treasures @ UT Dallas

Common-Mode Failure Mitigation: Increasing Diversity through High-Level Synthesis

Author: Matthew Joslin
Anjana Balachandran
Zhu Zhiqi
Schaefer Benjamin Carrion
Farah Naz Taher
Publication date: 25/03/2019

Due to copyright restrictions and/or publisher's policy, full text access from Treasures at UT Dallas is limited to current UTD affiliates (use the provided Link to Article).
Fault tolerance is vital in many domains. One popular way to increase fault tolerance is through hardware redundancy. However, basic redundancy cannot cope with Common Mode Failures (CMFs). One way to address CMFs is through the use of diversity in combination with traditional hardware redundancy. This work proposes an automatic design space exploration (DSE) method to generate optimized redundant hardware accelerators with maximum diversity to protect against CMFs, given a single behavioral description for High-Level Synthesis (HLS). For this purpose, this work exploits one of the main advantages of C-based VLSI design over traditional RT-level design based on low-level Hardware Description Languages (HDLs): the ability to generate micro-architectures with unique characteristics from the same behavioral description. Experimental results show that the proposed method provides a significant diversity increment compared to using traditional RTL-based exploration to generate diverse designs. © 2019 EDAA.
Erik Jonsson School of Engineering and Computer Science

Crossref

Treasures @ UT Dallas

In-Situ Implementation and Training of Convolutional Neural Network on FPGAs

Author: Krishnani Akshay Raju
Publication date: 2020

The main objective of this thesis is to investigate the efficiency of in-situ trainable Convolutional Neural Networks (CNNs) on modern programmable System-on-Chip (SoC) Field Programmable Gate Arrays (FPGAs), composed of embedded processors and reconfigurable fabric, and to study the robustness of the system when faults happen. One particular characteristic of this work is that the CNN is developed exclusively using High-Level Synthesis (HLS), specifically in SystemC, from which Verilog code is generated. In this thesis, the feature maps are also trained on the FPGA, which is traditionally done offline. The CNN architecture is instantiated on the FPGA, and the weights are trained through the software model on the ARM processor embedded in the FPGA and updated in the architecture through the AXI bus interface. Moreover, since the CNN is implemented in hardware, the resources used need to be minimized. This allows a smaller and cheaper FPGA to be chosen and reduces the total power consumption. To address this, the effect of bitwidth reduction of the CNN on the accuracy of handwritten character recognition is investigated. Finally, the robustness of the CNN is analyzed by breaking internal connections of different neurons and studying how the accuracy drops when the fault happens at different layers. If the accuracy is reduced, the CNN is re-trained in-situ to increase its accuracy.
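
The bitwidth reduction mentioned in the abstract can be pictured as storing weights in a narrower fixed-point format. The sketch below is a generic signed fixed-point quantizer, assumed here only for illustration; the thesis's actual number format, rounding mode, and bitwidths are not given in the abstract, and `quantize`, `total_bits`, and `frac_bits` are hypothetical names.

```python
def quantize(value: float, total_bits: int, frac_bits: int) -> float:
    """Map value to a signed fixed-point code and return the real number it represents.

    total_bits counts the sign bit; frac_bits of those bits sit after the binary point.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))      # most negative representable code
    highest = (1 << (total_bits - 1)) - 1  # most positive representable code
    code = round(value * scale)            # round to nearest (exact ties go to even)
    code = max(lowest, min(highest, code))  # saturate instead of wrapping around
    return code / scale


# Illustrative weight only; narrower formats introduce a larger rounding error.
error_8bit = abs(quantize(0.3, 8, 6) - 0.3)
error_4bit = abs(quantize(0.3, 4, 2) - 0.3)
```

With these assumed formats, the 8-bit version of the weight stays much closer to 0.3 than the 4-bit version, which is the kind of accuracy versus resource trade-off such a study would measure.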

Treasures @ UT Dallas

Bespoke Behavioral Processors

Author: Sreekumar Rohit
Publication date: 2020

Many emerging applications require simple controllers that run the exact same application continuously. These include medical devices and IoT devices of different kinds. Because of the nature of these applications, the controllers have to be ultra-low power and small. Most of these applications are mapped onto low-power processors, as they are computationally inexpensive and thus amenable to execution on a simple microprocessor. One of the problems of using a general-purpose processor is that not all of its resources are required for a specific application; thus, there is a large potential for simplifying the processor to achieve lower area and power. In addition, these processors can be specified at the behavioral level using High-Level Synthesis (HLS) to generate the RTL automatically. This opens a window for additional optimizations, as the processor can be pruned and re-synthesized at different VLSI design levels in order to obtain a smaller and more power-efficient processor. This work presents a methodology to customize a behavioral RISC processor automatically for a given workload such that its area and power are significantly reduced compared to the original processor. Compared to previous work that customizes a given processor at the gate netlist level only, the proposed method helps reduce the area and power significantly by raising the level of abstraction.

Treasures @ UT Dallas

Effective High-Level Synthesis Design Space Exploration Through a Novel Cost Function Formulation

Author: Gao Yiheng
Publication date: 2020

In the last few decades, Integrated Circuit (IC) designers have had to manually translate behavioral descriptions into Register-Transfer Level (RTL) code (e.g. Verilog or VHDL). High-Level Synthesis (HLS) automates this process. HLS has many advantages compared to specifying hardware at the RT-level. One big advantage is that the behavioral description only needs to be designed and verified once, yet it allows RTL designs with different characteristics to be generated by simply specifying different synthesis options. This opens the door to fully automatic Design Space Exploration (DSE). The main goal in HLS DSE is to find Pareto-optimal micro-architectures for the given untimed behavioral description. For large untimed descriptions, an exhaustive enumeration of all possible combinations of synthesis options is not possible; hence, heuristics are required. This work presents three metaheuristic algorithms to address this issue: Simulated Annealing (SA), Genetic Algorithm (GA) and Ant Colony Optimization (ACO). These algorithms were originally used to solve Single-Objective (SO) problems, whereas DSE is Multi-Objective (MO), i.e. area vs. performance. To convert the MO problem into an SO one, this work proposes a new method called ξ-constraint and compares its results with the traditional method (a weighted sum as cost function) for all three algorithms.
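
To make the multi-objective to single-objective conversion concrete, the sketch below contrasts a weighted-sum cost with a generic constraint-style cost that minimizes area subject to a latency bound. The constraint-style function is an assumption used for illustration; the abstract does not spell out the thesis's ξ-constraint formulation, which may differ. All names and numbers are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical design point: (area, latency); both metrics are minimized.
Design = Tuple[float, float]


def weighted_sum_cost(design: Design, w_area: float = 0.5) -> float:
    """Traditional scalarization: a single weighted sum of both objectives."""
    area, latency = design
    return w_area * area + (1.0 - w_area) * latency


def constrained_cost(design: Design, latency_limit: float) -> float:
    """Constraint-style scalarization: minimize area, reject designs over the latency bound."""
    area, latency = design
    return area if latency <= latency_limit else float("inf")


def best(designs: List[Design], cost: Callable[[Design], float]) -> Design:
    """Return the design with the lowest scalar cost."""
    return min(designs, key=cost)


# Illustrative numbers only; they do not come from the thesis.
designs = [(100, 60), (140, 30), (220, 10)]
by_weighted_sum = best(designs, weighted_sum_cost)
by_constraint = best(designs, lambda d: constrained_cost(d, latency_limit=30))
```

Sweeping `latency_limit` over several values visits different trade-off points, whereas a weighted sum only changes its pick when the weights change; on these sample points the two costs select `(100, 60)` and `(140, 30)` respectively.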

Treasures @ UT Dallas

Efficient Hardware Acceleration on SoC-FPGA using OpenCL

Author: Gogineni Susmitha
Publication date: 2018

Field Programmable Gate Arrays (FPGAs) are taking over from conventional processors in the field of High Performance Computing. With the advent of new FPGA architectures and High-Level Synthesis tools, FPGAs can now be easily used to accelerate computationally intensive applications such as AI and cognitive computing. One of the advantages of raising the level of hardware design abstraction is that multiple configurations with unique properties (i.e. area, performance and power) can be automatically generated without the need to re-write the input description. This is not possible when using traditional low-level hardware description languages like VHDL or Verilog. This thesis deals with this important topic: it accelerates multiple computationally intensive applications amenable to hardware acceleration and proposes a fast heuristic Design Space Exploration method to find dominant design solutions quickly. In particular, in this work we developed different computationally intensive applications in OpenCL and mapped them onto a heterogeneous SoC-FPGA. A Genetic Algorithm (GA) based meta-heuristic that performs automatic Design Space Exploration (DSE) on these applications was also developed, as GAs have been shown in the past to lead to good results in multi-objective optimization problems like this one. The developed explorer automatically inserts a set of control knobs into the OpenCL behavioral description, e.g. to control how to synthesize loops (unroll or not) and to replicate Compute Units (CUs). By tuning these control attributes with possible values, thousands of different micro-architecture configurations can be obtained. Thus, an exhaustive search is not feasible and other heuristics are needed. Each configuration is compiled using the Altera OpenCL SDK tool and executed on the Terasic DE1-SoC FPGA board platform to record the corresponding performance and logic utilization. In order to measure the quality of the proposed GA-based heuristic, each application is explored exhaustively (taking multiple days to finish for smaller designs) to find the dominant optimal solutions (Pareto-optimal designs). For complex and larger designs, exploring the entire design space exhaustively is not feasible because the design space is very large. The comparison is quantified using metrics such as Dominance, Average Distance from Reference Set (ADRS) and run-time speedup, showing that our proposed heuristic leads to very good results in a fraction of the time of the exhaustive search.
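
The ADRS metric named in the abstract measures how far an approximate Pareto set lies from a reference set. The sketch below follows one formulation commonly used in the HLS DSE literature (average, over the reference points, of the smallest relative worsening in any metric); the thesis may use a variant, and the function name, the (area, latency) layout, and the sample data are assumptions for illustration.

```python
from typing import List, Tuple

# Hypothetical design point: (area, latency); both metrics are minimized and positive.
Design = Tuple[float, float]


def adrs(reference: List[Design], approx: List[Design]) -> float:
    """Average distance from the reference set; 0.0 means the reference front is matched."""
    total = 0.0
    for ref_area, ref_latency in reference:
        # Distance to the closest approximate point, measured as the largest
        # relative degradation in either metric (never negative).
        total += min(
            max(0.0, (area - ref_area) / ref_area, (latency - ref_latency) / ref_latency)
            for area, latency in approx
        )
    return total / len(reference)


# Illustrative numbers only: the heuristic misses one reference point by 10% in area.
reference_front = [(100, 50), (120, 30), (200, 10)]
heuristic_front = [(100, 50), (132, 30), (200, 10)]
score = adrs(reference_front, heuristic_front)
```

In this example two of the three reference points are found exactly and the third is approximated by a design with 10% more area, so the score is roughly 0.033; lower values indicate a better approximation.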

Treasures @ UT Dallas

Flexible Partial Reconfiguration Based Design Architecture for Dataflow Computation

Author: Shah Mihir
Publication date: 2018

In this thesis research, we propose a generic semi-automatic partial reconfiguration based design methodology which takes inputs in the form of behavioral description files in C/C++/SystemC for a dataflow process and outputs partial binaries to deploy on the SoC FPGA. This methodology is coupled with a novel static design architectural framework that uses internal block RAM memory to store intermediate results. In order to prove the efficacy of the proposed methodology and architecture in terms of area and timing, we have implemented the JPEG Encoder from S2CBench v.2.0 spatially and then with partial reconfiguration design methodologies. The proposed design method, abbreviated as PRBRAM, uses internal FPGA on-chip memory to store intermediate results when time-multiplexing kernels, whereas PRDDR is a partial reconfiguration based design method utilizing external off-chip DDR memory. The reconfiguration time is a critical parameter determining the performance of DPR designs. Reconfiguration time depends on the area of the Reconfigurable Partition (RP) and the generated partial bitstream. Thus, we study and show experimentally that, for an equal RP area in both PRBRAM and PRDDR, the proposed PRBRAM method is more efficient in runtime and latency than PRDDR. We also examine the effects of variations in the reconfigurable partition area on running time, considering the different numbers of reconfigurations required for the application on the proposed PRBRAM architecture. We show that the implementation with the proposed PRBRAM architecture is area efficient compared to the spatial implementation, with LUT area savings of up to 21.20% and FF area savings of up to 30.41% for a partial bitstream size of 1598.896 KB. These percentages include the additional resources utilized by the proposed static architecture. We have also seen an improvement in average hardware running time of 0.529363 s against PRDDR.

Treasures @ UT Dallas

Reducing the Complexity of Fault-Tolerant Behavioral Hardware Accelerators

Author: Zhu Zhiqi
Publication date: 2020

Continuous technology scaling has allowed a large number of different hardware components to be integrated on the same integrated circuit (IC). These complex ICs are typically called Systems-on-Chip (SoCs). Area, power and performance have traditionally been the most important design metrics, but for many safety-critical applications reliability is equally important. Fault tolerance can therefore no longer be a second-class citizen and must be considered early on in the design process of these complex ICs. Due to the heterogeneity of these SoCs, a single fault-tolerance solution is not possible. Dedicated solutions have been proposed for the embedded processor, the memory, different interfaces and the dedicated hardware accelerators. For example, in the processor case, the program execution relies on the control flow instructions that determine which section of code will be executed at run-time. A single event upset (SEU) can impact the execution order of the program. Thus, in this thesis we study the effect of transient errors on the corruption of program control flows and present a methodology to detect these at the software level. This is done by inserting additional control flow instructions directly into the assembly code after a static control flow analysis is performed. Moreover, one key differentiating element between different SoCs is the hardware accelerators in them. Most of the other components in the SoCs are off-the-shelf modules, and the main differentiating element among SoC offerings is typically the mix of hardware accelerators that they include. Due to the long design cycles of these complex systems, these accelerators are now often designed at the behavioral level, and High-Level Synthesis (HLS) is used to generate the Register Transfer Level (RTL) code of the accelerator. It is therefore imperative to introduce low-overhead fault-tolerance techniques for these accelerators described at the behavioral level. This thesis presents different techniques to reduce the overhead associated with traditional N-modular redundancy techniques for these accelerators.
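
For context on the baseline that the thesis aims to make cheaper, traditional N-modular redundancy runs N copies of a module and passes their outputs through a majority voter. The sketch below models only that voter in software; it is a generic illustration rather than any of the thesis's overhead-reduction techniques, and `majority_vote` is a hypothetical name.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the redundant modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        # No value was produced by more than half of the replicas.
        raise ValueError("no majority among redundant outputs")
    return value


# Triple modular redundancy (N = 3): one faulty replica is outvoted.
voted = majority_vote([42, 42, 7])
```

With N = 3, a single faulty replica is masked, at the cost of roughly tripling the accelerator's area plus the voter, which is the overhead that motivates cheaper alternatives.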

Treasures @ UT Dallas