
    Autonomous soft-error tolerance of FPGA configuration bits

    Field-programmable gate arrays (FPGAs) are increasingly susceptible to radiation-induced single event upsets (SEUs). These upsets are predominant in a space environment; however, with the increasing use of static RAM (SRAM) in modern FPGAs, SEUs are gaining prominence even in a terrestrial environment. SEUs can flip the SRAM bits of an FPGA, potentially altering the functionality of the implemented design. This has motivated FPGA designers to investigate techniques to protect the FPGA configuration bits against such inadvertent bit flips (soft errors). Traditionally, triple modular redundancy (TMR) is used to protect against FPGA bit flips. Increasing design complexity and limited battery life motivate alternative approaches to soft-error tolerance. In this article, we propose a technique to improve the autonomous fault-masking capabilities of a design by maximizing the number of zeros or ones in lookup tables (LUTs). The technique analyzes critical configuration bits and utilizes spare resources (XOR gates and carry chains) of FPGAs to selectively manipulate the logic implemented in LUTs using two operations: LUT restructuring and LUT decomposition. We implemented the proposed approach for Xilinx Virtex-6 FPGAs and validated it with a wide set of designs from the MCNC, IWLS 2005, and ITC99 benchmark suites. Results demonstrate that the proposed logic restructuring maximizes logic 0 (or 1) of LUTs by an average of 20%, achieving 80% fault masking with no area overhead. The fault rate of the entire design is reduced by 60% on average as compared to the existing techniques. Furthermore, the logic decomposition algorithm provides incremental fault-tolerance capabilities and achieves an additional 5% fault masking with an average 7% increase in slice usage. The complete methodology is implemented in a tool for Xilinx FPGAs and is made available online for the benefit of the research community. The algorithms are lightweight, and the whole design flow (including Xilinx Place and Route) was completed in 75 minutes for the largest benchmark in the set.
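The abstract does not spell out the restructuring algorithm, so the following is only a minimal sketch of the general idea of maximizing zeros in a LUT. It assumes a LUT is stored as a list of truth-table bits, and that a spare XOR gate on the LUT output can undo a complemented table. The function names `restructure_lut` and `lut_output` are hypothetical and not taken from the paper or its tool.

```python
def restructure_lut(truth_table, prefer=0):
    """Return (table, inverted) so that `prefer` is the majority bit value.

    Assumption for illustration: if the preferred value is in the minority,
    the complemented table is stored instead and `inverted` is set, meaning
    a spare XOR gate on the LUT output must flip the result back.
    """
    preferred = sum(1 for bit in truth_table if bit == prefer)
    if 2 * preferred >= len(truth_table):
        return list(truth_table), False
    return [1 - bit for bit in truth_table], True


def lut_output(table, inverted, address):
    """Evaluate the restructured LUT, applying the output XOR if needed."""
    return table[address] ^ int(inverted)


# A 2-input NAND has three ones and a single zero in its truth table.
nand = [1, 1, 1, 0]
table, inverted = restructure_lut(nand, prefer=0)
```

In this sketch the stored table for the NAND holds three zeros instead of one, while `lut_output` still reproduces the original function for every input address.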

    Spectrum of run-time management for modern and next generation multi/many-core systems

    Run-time management of multi/many-core systems is becoming extremely challenging due to several factors, e.g., increasing demand to execute concurrent applications, inefficient exploitation of heterogeneous cores, workload variations over time, changing run-time scenarios, and the desire to optimize several metrics such as performance, energy consumption and reliability. For next generation multi/many-core systems, the challenges will further increase, mainly due to a higher number of cores and increased heterogeneity. To address one or more of these challenges, run-time management approaches are being extensively developed. This tutorial presents a spectrum of run-time management approaches investigated over the last couple of years. Depending upon the target problems, designers can employ these methodologies to achieve efficiency in multi/many-core systems in terms of performance, energy consumption and reliability.

    Reliability and energy-aware mapping and scheduling of multimedia applications on multiprocessor systems

    Lifetime reliability is an emerging concern in multiprocessor systems, as escalating power density and the resulting temperature variation continue to accelerate wear-out, leading to a growing prominence of device defects. In this paper, we propose a system-level approach that involves performance-aware mapping of multimedia applications on a multiprocessor system to jointly minimize energy consumption and temperature-related wear-out. Fundamental to this approach is a simplified temperature model that incorporates not only the transient and the steady-state behavior (temporal effect), but also the temperature dependency on the surrounding cores (spatial effect). This model is validated against the temperature obtained using the HotSpot tool with transient and steady-state simulations, and is shown to be accurate within 5.5 °C, leading to an MTTF estimation accuracy of an average 21% with respect to the state-of-the-art approaches. The proposed temperature model is integrated in a gradient-based fast heuristic that controls the voltage and frequency of the cores to limit the average and peak temperature, leading to a longer lifetime while simultaneously minimizing the energy consumption. Lifetime computation considers task remapping, which is a common feature available in modern multiprocessor systems. A linear programming approach is then proposed to distribute the cores of a multiprocessor system among concurrent applications to maximize the lifetime. Experiments conducted with a set of synthetic and real-life applications represented as synchronous data flow graphs demonstrate that the proposed approach minimizes energy consumption by an average 24% with a 47% increase in lifetime. For concurrent applications, the proposed lifetime-aware core distribution results in an average 10% improvement in lifetime as compared to performance-based core distribution.

    Communication and migration energy aware task mapping for reliable multiprocessor systems

    Heterogeneous multiprocessor systems-on-chip (MPSoCs) are emerging as a promising solution in deep sub-micron technology nodes to satisfy design performance and power requirements. However, shrinking transistor geometry and aggressive voltage scaling are negatively impacting the dependability of these MPSoCs by increasing the chances of failures. This paper proposes an offline (design-time) task remapping technique to minimize the communication energy and task migration overhead of an application mapped on a heterogeneous multiprocessor system for all processor fault-scenarios. The proposed technique involves two steps: (1) Communication Energy driven Design Space Exploration (CDSE) to select an initial mapping and (2) Communication energy and Migration overhead aware Task Mapping (CMTM) for different fault-scenarios. The CDSE is formulated as a Mixed Integer Quadratic Programming (MIQP) problem and solved using an energy-gradient based heuristic. The CMTM problem is solved using a fast heuristic, with the starting mapping selected using the CDSE step. The proposed two-step technique is compared with state-of-the-art approaches through rigorous simulations with synthetic and real application graphs. Results demonstrate that the proposed CDSE reduces design space exploration time by 99% with a maximum variation of 5% from the optimal solution obtained by solving the MIQP problem directly. Further, the proposed CMTM reduces communication energy by an average 35% and migration overhead by an average 20% for all single and double fault-scenarios as compared to the existing fault-tolerant techniques. The CMTM also achieves over 30x reduction in execution time for large problem sizes with a maximum deviation of 15% from the minimum communication energy achievable with the given application on a given architecture. For streaming multimedia applications, the proposed technique delivers 50% higher throughput per unit energy as compared to the existing approaches.
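To make the two quantities being traded off more concrete, here is a minimal sketch of how communication energy and migration overhead of a task mapping could be evaluated. The cost model is an assumption for illustration only: energy is taken as bytes transferred times a per-link energy, and migration overhead as the state size of every task that changes processor. The names `communication_energy`, `migration_overhead`, `energy_per_byte` and `state_bytes` are hypothetical and do not come from the paper.

```python
def communication_energy(edges, mapping, energy_per_byte):
    """Sum the energy of all inter-processor transfers under a mapping.

    edges: {(src_task, dst_task): bytes_transferred}
    mapping: {task: processor}
    energy_per_byte: {(src_proc, dst_proc): energy per byte on that link}
    Tasks mapped to the same processor are assumed to communicate for free.
    """
    total = 0.0
    for (src, dst), nbytes in edges.items():
        p, q = mapping[src], mapping[dst]
        if p != q:
            total += nbytes * energy_per_byte[(p, q)]
    return total


def migration_overhead(old_mapping, new_mapping, state_bytes):
    """Total state that must move when remapping after a processor fault."""
    return sum(state_bytes[task]
               for task in old_mapping
               if old_mapping[task] != new_mapping[task])


# Hypothetical three-task pipeline a -> b -> c on processors 0, 1 and 2.
edges = {("a", "b"): 100, ("b", "c"): 50}
energy_per_byte = {(0, 1): 2.0, (0, 2): 3.0}
initial = {"a": 0, "b": 1, "c": 1}
after_fault_on_1 = {"a": 0, "b": 0, "c": 2}
state_bytes = {"a": 5, "b": 10, "c": 20}
```

A remapping heuristic in this spirit would compare candidate mappings for each fault scenario using both functions; how the paper weighs the two terms is not stated in the abstract.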

    Energy-aware task mapping and scheduling for reliable embedded computing systems

    Task mapping and scheduling are critical in minimizing energy consumption while satisfying the performance requirement of applications enabled on heterogeneous multiprocessor systems. An area of growing concern for modern multiprocessor systems is the increase in the failure probability of one or more component processors. This is especially critical for applications where performance degradation (e.g., throughput) directly impacts the quality of service requirement. This article proposes a design-time (offline) multi-criterion optimization technique for application mapping on embedded multiprocessor systems to minimize energy consumption for all processor fault-scenarios. A scheduling technique is then proposed based on self-timed execution to minimize the schedule storage and construction overhead at runtime. Experiments conducted with synthetic and real applications from streaming and nonstreaming domains on heterogeneous MPSoCs demonstrate that the proposed technique minimizes energy consumption by 22% and design space exploration time by 100x, while satisfying the throughput requirement for all processor fault-scenarios. For scalable throughput applications, the proposed technique achieves 30% better throughput per unit energy, compared to the existing techniques. Additionally, the self-timed execution-based scheduling technique minimizes schedule construction time by 95% and storage overhead by 92%

    Execution trace-driven energy-reliability optimization for multimedia MPSoCs

    Multiprocessor systems-on-chip (MPSoCs) are becoming a popular design choice in current and future technology nodes to accommodate the heterogeneous computing demand of a multitude of applications enabled on these platforms. Streaming multimedia and other communication-centric applications constitute a significant fraction of the application space of these devices. The mapping of an application on an MPSoC is an NP-hard problem. This has attracted researchers to solve this problem both as stand-alone (best-effort) and in conjunction with other optimization objectives, such as energy and reliability. Most existing studies on energy-reliability joint optimization are static, that is, design-time based. These techniques fail to capture runtime variability such as resource unavailability and dynamism associated with application behaviors, which are typical of multimedia applications. The few studies that consider dynamic mapping of applications do not consider throughput degradation, which directly impacts user satisfaction. This article proposes a runtime technique to analyze the execution trace of an application modeled as a Synchronous Data Flow Graph (SDFG) to determine its mapping on a multiprocessor system with heterogeneous processing units for different fault scenarios. Further, communication energy is minimized for each of these mappings while satisfying the throughput constraint. Experiments conducted with synthetic and real SDFGs demonstrate that the proposed technique achieves significant improvement with respect to the state-of-the-art approaches in terms of throughput and storage overhead with less than 20% energy overhead.

    Post-Training Optimization of Cross-layer Approximate Computing for Edge Inference of Deep Learning Applications

    Over the past decade, the rapid development of deep learning (DL) algorithms has enabled extraordinary advances in perception tasks throughout different fields, from computer vision to audio signal processing. Additionally, increasing computational resources available in supercomputers and graphics processor clusters have provided a suitable environment to train larger and deeper deep neural network (DNN) models for improved performance. However, the resulting memory bandwidth and computational requirements of such DNN models restrict their deployment in embedded systems with constrained hardware resources. To overcome this challenge, it is important to establish new paradigms to reduce the computational workload of such DL algorithms while maintaining their original accuracy. A key observation of previous research is that DL models are resilient to input noise and computational errors; therefore, a reasonable approach to decreasing such hardware requirements is to embrace DNN resiliency and utilize approximate computing techniques at different system design layers. This approach requires, however, constant monitoring as well as a careful combination of approximation techniques to avoid performance degradation while maximizing computational savings. Within this context, the focus of this thesis is the simulation of cross-layer approximate computing (AC) methods for DNN computation and the development of optimization methods to compensate for AC errors in approximated DNNs. The first part of this thesis proposes the simulation framework ProxSim. This framework enables accelerated approximate computational unit (ACU) simulation for evaluation and training of approximated DNNs. ProxSim supports quantization and approximation of common neural layers such as fully connected (FC), convolutional, and recurrent layers. A performance evaluation using a variety of DNN architectures, as well as a comparison with the state of the art, is also presented. 
The author used ProxSim to implement and evaluate the following methods presented in this work. The second part of this thesis introduces an approach to model the approximation error in DNN computation. First, the author thoroughly analyzes the error caused by using approximate multipliers to compute the multiply and accumulate (MAC) operations in DNN models. From this analysis, a statistical model of the approximation error is obtained. Through various experiments with DNNs for image classification, the proposed model is verified and compared with other methods from the literature. The results demonstrate the validity of the approximation error model and reinforce a general understanding of approximate computing in DNNs. In the third part of this thesis, the author presents a methodology for uniform systematic approximation of DNNs. This methodology focuses on the optimization of full DNN approximation with a single type of ACU to minimize power consumption without accuracy loss. The backbone of this methodology is the set of custom fine-tuning methods the author proposes to compensate for the approximation error. These methods enable the use of ACUs with large approximation errors, which results in significant power savings and negligible accuracy losses. This process is corroborated by extensive experiments, where the estimated savings and the accuracy achieved after approximation are thoroughly examined using ProxSim. In the last part of this thesis, the author proposes two different methodologies to further boost energy savings after applying uniform approximation. This increment in energy savings is achieved by computing more resilient DNN elements (neurons or layers) with increased approximation levels. The first methodology focuses on iterative kernel-wise approximation and quantization enabled by a custom approximate MAC unit. 
The second method is based on flexible layer-wise approximation, and applied to bit-decomposed in-memory computing (IMC) architectures as a case study to demonstrate the effectiveness of the proposed approach
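As a rough illustration of what characterizing the error of an approximate multiplier can look like, the sketch below measures the mean and standard deviation of the error of a toy multiplier over all unsigned operand pairs. Everything here is an assumption: the truncation-based `approx_mul` is a stand-in and not one of the ACUs used in the thesis, and the exhaustive sweep in `error_statistics` is not the thesis's statistical model, which the abstract does not specify.

```python
import statistics


def approx_mul(a, b, dropped_bits=2):
    """Toy approximate multiplier: zero the low bits of both operands.

    Hypothetical stand-in for a hardware approximate multiplier; it only
    handles non-negative integers.
    """
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)


def error_statistics(bits=8, dropped_bits=2):
    """Mean and population std of (approximate - exact) over all operands."""
    errors = [approx_mul(a, b, dropped_bits) - a * b
              for a in range(1 << bits)
              for b in range(1 << bits)]
    return statistics.fmean(errors), statistics.pstdev(errors)


mean_err, std_err = error_statistics(bits=8, dropped_bits=2)
```

Because truncation only ever shrinks the operands, the mean error of this toy unit is negative; a bias of this kind is the sort of systematic effect that fine-tuning could be expected to compensate for.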

    Workload uncertainty characterization and adaptive frequency scaling for energy minimization of embedded systems

    A primary design optimization objective for battery-operated embedded systems is to minimize the energy consumption of applications while satisfying their performance requirement. A system-level approach to this problem is to scale the frequency of the hardware based on the readings obtained from the hardware performance monitors. We show that the performance monitor readings contain uncertainty, which becomes prominent when applications are executed in a multicore environment. These uncertainties (termed "noise") are attributed to factors such as cache contention and DRAM access time, which are very difficult to predict dynamically. In this paper, we propose a multinomial logistic regression model, which combines probabilistic interpretation with maximum likelihood (ML) estimation to classify an incoming noisy workload, at run-time, into a finite set of classes. Every workload class corresponds to a frequency that is pre-determined using an appropriate training set and that results in minimum energy consumption. The classifier incorporates (1) "noise" with arbitrary probability distribution to estimate the actual frame workload; and (2) the frequency switching overhead, neither of which is considered in any of the existing approaches. The classified frequency is applied on the processing cores to execute the workload. The proposed approach is engineered into an embedded multicore system and is validated with a set of standard multimedia applications. Results demonstrate that the proposed approach minimizes energy consumption by an average 20% as compared to the existing techniques.
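The classification step can be pictured with a minimal multinomial logistic regression (softmax) classifier, sketched below. It assumes the weights have already been fitted offline, and it leaves out the noise model and frequency switching overhead that the paper incorporates. The feature, weights, biases and the `FREQUENCIES_MHZ` table are made-up values for illustration, not figures from the paper.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_workload(features, weights, biases):
    """Return (most likely class index, class probabilities)."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical model: one normalized workload feature, three classes.
WEIGHTS = [[-4.0], [0.0], [4.0]]
BIASES = [2.0, 0.0, -2.0]
FREQUENCIES_MHZ = [400, 800, 1200]  # assumed frequency per workload class

label, probs = classify_workload([0.9], WEIGHTS, BIASES)
chosen_frequency = FREQUENCIES_MHZ[label]
```

Each class maps to one pre-determined frequency, so a run-time decision in this sketch reduces to one small matrix-vector product and an argmax.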

    Bayesian learning aided simultaneous sparse estimation of dual-wideband THz channels in multi-user hybrid MIMO systems

    This work conceives Bayesian Group-Sparse Regression (BGSR) for the estimation of a spatial and frequency wideband, i.e., a dual wideband, channel in Multi-User (MU) THz hybrid MIMO scenarios. We develop a practical dual wideband THz channel model that incorporates absorption losses, reflection losses, diffused ray modeling and angles of arrival/departure (AoAs/AoDs) using a Gaussian Mixture Model (GMM). Furthermore, a low-resolution analog-to-digital converter (ADC) is employed at each RF chain, which is crucial for wideband THz massive MIMO systems to reduce power consumption and hardware complexity, given the high sampling rates and large number of antennas involved. The quantized MU THz MIMO model is linearized using the popular Bussgang decomposition, followed by a BGSR-based channel learning framework that results in sparsity across different subcarriers, where each subcarrier has its unique dictionary matrix. Next, the Bayesian Cramér-Rao Bound (BCRB) is derived for bounding the normalized mean square error (NMSE) performance. Extensive simulations were performed to assess the performance improvements achieved by the proposed BGSR method compared to other sparse estimation techniques. The metrics considered for quantifying the performance improvements include the NMSE and bit error rate (BER).
