1,721,088 research outputs found
Recommended from our members
Efficient Machine Learning Acceleration at the Edge
My thesis is a result of a confluence of several trends that have emerged in recent years. First, the rapid proliferation of deep learning across the application and hardware landscapes is creating an immense demand for computing power. Second, the waning of Moore's Law is paving the way for domain-specific acceleration as a means of delivering performance improvements. Third, deep learning's inherent error tolerance is reviving long-forgotten approximate computing paradigms. Fourth, latency, energy, and privacy considerations are increasingly pushing deep learning towards edge inference, with its stringent deployment constraints. All of the above have created a unique, once-in-a-generation opportunity for accelerated widespread adoption of new classes of hardware and algorithms, provided they can deliver fast, efficient, and accurate deep learning inference within a tight area and energy envelope. One approach towards efficient machine learning acceleration that I have explored attempts to push a neural network model size to its absolute minimum. 3PXNet - Pruned, Permuted, Packed XNOR Networks combines two widely used model compression techniques: binarization and sparsity to deliver usable models with a size down to single kilobytes. It uses an innovative combination of weight permutation and packing to create structured sparsity that can be implemented efficiently in both software and hardware. 3PXNet has been deployed as an open-source library targeting microcontroller-class devices with various software optimizations, further improving runtime and storage requirements. The second line of work I have pursued is the application of stochastic computing (SC). It is an approximate, stream-based computing paradigm enabling extremely area-efficient implementations of basic arithmetic operations such as multiplication and addition. SC has been enjoying a renaissance over the past few years due to its unique synergy with deep learning. 
On the one hand, SC makes it possible to implement extremely dense multiply-accumulate (MAC) computational fabric well suited to computing large linear algebra kernels, which are the bread-and-butter of deep neural networks. On the other hand, those neural networks exhibit immense approximation tolerance levels, making SC a viable implementation candidate. However, several issues need to be solved to make the SC acceleration of neural networks feasible. The area efficiency comes at the cost of long stream processing latency. The conversion cost between fixed-point and stochastic representations can cancel out the gains from computation efficiency if not managed correctly. The above issues raise the question of how to design an accelerator architecture that best takes advantage of SC's benefits and minimizes its shortcomings. To address this, I proposed the ACOUSTIC (Accelerating Convolutional Neural Networks through Or-Unipolar Skipped Stochastic Computing) architecture and its extension - GEO (Generation and Execution Optimized Stochastic Computing Accelerator for Neural Networks). ACOUSTIC is an architecture that tries to maximize SC's compute density to amortize conversion costs and memory accesses, delivering system-level reduction in inference energy and latency. It has been taped out and demonstrated in silicon using a 14nm fabrication process. GEO addresses some of the shortcomings of ACOUSTIC. Through the introduction of near-memory computation fabric, GEO enables a more flexible selection of dataflows. A novel progressive buffering scheme unique to SC lowers the reliance on high memory bandwidth. Overall, my work tries to approach accelerator design from the systems perspective, making it stand apart from most recent SC publications targeting point improvements in the computation itself. As an extension to the above line of work, I have explored the combination of SC and sparsity, to apply it to new classes of applications and enable further benefits.
I have proposed the first SC accelerator that supports weight sparsity - SASCHA (Sparsity-Aware Stochastic Computing Hardware Architecture for Neural Network Acceleration), which can improve performance on pruned neural networks while maintaining the throughput when processing dense ones. SASCHA solves a series of unique, non-trivial challenges of combining SC with sparsity. I have also designed an architecture for accelerating event-based camera object tracking - SCIMITAR. Event-based cameras are relatively new imaging devices which only transmit information about pixels that have changed in brightness, resulting in very high input sparsity. SCIMITAR combines SC with computing-in-memory (CIM) and, through a series of architectural optimizations, is able to take advantage of this new data format to deliver low-latency object detection for tracking applications.
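The stochastic computing arithmetic described in this abstract can be illustrated with a minimal software sketch. It is not taken from ACOUSTIC or GEO. It assumes unipolar encoding (a value in [0, 1] is the probability of a 1 in the stream), multiplication with one AND gate per bit, and OR-based accumulation, which is one reading of the "Or-Unipolar" part of the ACOUSTIC name. The function names, stream length, and seed are hypothetical.

```python
import random


def to_stream(value, length, rng):
    """Encode a value in [0, 1] as a unipolar bitstream: P(bit = 1) = value."""
    return [1 if rng.random() < value else 0 for _ in range(length)]


def from_stream(stream):
    """Decode a unipolar bitstream back to a value: the fraction of 1s."""
    return sum(stream) / len(stream)


def sc_multiply(a, b):
    """Multiply two independent unipolar streams with one AND gate per bit."""
    return [x & y for x, y in zip(a, b)]


def sc_or_accumulate(streams):
    """Approximate addition with one OR gate per bit (assumed scheme).

    For independent streams the result encodes 1 - prod(1 - p_i), which is
    close to sum(p_i) only when the individual values are small.
    """
    return [int(any(bits)) for bits in zip(*streams)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 4096                # hypothetical stream length
x = to_stream(0.5, n, rng)
w = to_stream(0.4, n, rng)
product = from_stream(sc_multiply(x, w))        # estimate of 0.5 * 0.4

small = [to_stream(0.05, n, rng) for _ in range(3)]
or_sum = from_stream(sc_or_accumulate(small))   # estimate of 1 - 0.95**3
```

Longer streams shrink the estimation error but take more cycles to process, which is the latency cost the abstract mentions.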
Design, Evaluation and Co-optimization of Emerging Devices and Circuits
The continued push for traditional Silicon technology scaling faces the main challenge of non-scaling power density. Exploring alternative power-efficient technologies is essential for sustaining technology development. Many emerging technologies have been proposed as potential replacements for Silicon technology. However, these emerging technologies need rigorous evaluation in the contexts of circuits and systems to identify their value prior to commercial investment. We have developed evaluation frameworks covering emerging Boolean logic devices, memory devices, memory systems, and integration technologies. The evaluation metrics are delay, power, and reliability. According to the evaluation results, the development of emerging Boolean logic devices is still far from being able to replace Silicon technology, but magnetic random access memory (MRAM) is a promising memory technology showing benefits in performance and energy efficiency. As a specific example, we co-optimize MRAM with application circuits and systems. Optimized MRAM write and read design can significantly improve system performance. We have proposed a magnetic tunnel junction (MTJ) based process and temperature variation monitor, which enables variation-aware MRAM write and read optimization. We have also proposed utilizing negative differential resistance (NDR) to enable fast and energy-efficient write and zero-disturbance read for resistive memories including MRAM. In addition, we design and adapt MRAM technology into a low-power stochastic computing system to improve energy efficiency. To further improve the stochastic computing system, a promising VC-MTJ based true random stochastic bitstream generator is proposed and utilized.
New Methodologies for Evaluating Design Rules
Design Rules (DRs) are the biggest design-relevant quality metric for a technology. Even small changes in DRs can have significant impact on manufacturability as well as circuit characteristics including layout area, variability, power, and performance. Several works have been published to systematically evaluate design rules. The most recent among them is the Design Rule Evaluator (UCLA_DRE), a tool developed by the NanoCad lab at UCLA for fast and systematic evaluation of design rules and layout styles in terms of the major layout characteristics of area, manufacturability, and variability. The framework essentially creates a virtual standard-cell library and performs the evaluation based on the virtual layout using first-order models of variability and manufacturability (instead of relying on accurate simulation) and layout topology/congestion-based area estimates (instead of explicit and slow layout generation). However, UCLA_DRE suffers from a few major limitations. First, UCLA_DRE currently does not have the capability to evaluate the interaction between overlay design rules and overlay control, which is becoming more critical and more challenging with the move toward multiple-patterning (MP) lithography. Second, UCLA_DRE currently evaluates design rules at the cell level, which may lead to misleading conclusions because most designs are routing-limited and, hence, not every change in cell area results in a corresponding change in chip area. Third, delay is not evaluated, although it is well known that delay changes can affect chip area due to different buffering and gate sizing to meet timing requirements. The first part of this dissertation offers a framework to study the interaction between overlay design rules and overlay control options in terms of area, performance, and yield. The framework can also be used for designing informed, design-aware overlay metrology and control strategies.
In this work, the framework was used to explore the design impact of LELE double-patterning rules and the poly-line end extension rule defined between the poly and active layers for different overlay characteristics (i.e., within-field vs. field-to-field overlay) and different overlay models at the 14nm node. Interesting conclusions can be drawn from the results. For example, one result shows that increasing the minimum mask-overlap length by 1nm would allow the use of a third-order wafer/sixth-order field-level overlay model instead of a sixth-order wafer/sixth-order field-level model with negligible impact on design. In the second part of the dissertation, a new methodology called chipDRE, a framework to evaluate design rules at the chip level, is described. chipDRE uses a good chips per wafer metric to unify area, performance, variability, and functional yield. It uses UCLA_DRE to generate a virtual standard-cell library and uses a mix of physical design and semi-empirical models to estimate area change at the chip level due to both cell delay and cell area change. One interesting result for well to active spacing shows a non-monotonic relationship of "good chips per wafer" with the rule valu
Hardware-Enabled Design For Security (DFS) Solution
The Integrated Circuit (IC) supply chains of modern companies often involve multiple business entities on a global scale, including offshore manufacturing, system integration, and distribution of VLSI chips and systems. While the industry is trying to lower the risks imposed by the global supply chain production model, most existing techniques, such as Physical Unclonable Function (PUF), logic obfuscation, and hardware metering, often suffer from unreliability due to their parametric nature or from the high implementation cost of the whole security system. Therefore, IC/IP Design for Security (DFS) solutions that are efficient and practical for the industry are yet to be discovered. In this dissertation we study the behavior of PUFs and propose several sources of randomness to construct stability-guaranteed PUFs through Locally Enhanced Defectivity (LED) mechanisms, such as Directed Self Assembly (DSA) and transistor gate oxide breakdown. These PUFs are fabricated and demonstrated to be stable and random, and can be used as reliable sources of hardware root-of-trust for DFS techniques. To study the security of PUFs and to show the benefits of our proposed stability-guaranteed PUFs, we present a new unified framework for evaluating PUF security through guesswork analysis. This framework enables us to evaluate and quantify the effect of noise, bias, and model attacks on security. We also relate guesswork to other security measures such as min-entropy and mutual information. The model quantitatively measures the security of various PUFs under different scenarios, and by doing so enables us to compare the security level of different sorts of PUFs. To further utilize the stable PUFs, a secure lightweight entity authentication hardware primitive (SLATE) is proposed and shown to be much smaller than existing strong PUFs and lightweight ciphers.
The proposed SLATE is a practical DFS solution owing to its extremely lightweight implementation and is proven to be secure from both empirical and theoretical perspectives. Finally, the dissertation proposes an effective attack to reconstruct missing connections in 2.5D split manufacturing, which is a technique used to prevent reverse engineering by a malicious foundry. A Satisfiability Modulo Theories (SMT) based grouping algorithm that depends purely on circuit functionality, not physical implementation, is proposed to significantly reduce the runtime of the Boolean Satisfiability (SAT) solver, which is used to recover configuration keys of the connection network. Defense strategies against our attack are also studied.
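The relation between guesswork, bias, and min-entropy mentioned in this abstract can be made concrete with a small sketch. It assumes the standard definition of guesswork as the expected number of guesses when candidates are tried from most to least likely; it is not the dissertation's framework, and the response distributions and function names are hypothetical.

```python
import math


def guesswork(probs):
    """Expected number of guesses when trying outcomes from most to least likely."""
    ordered = sorted(probs, reverse=True)
    return sum(rank * p for rank, p in enumerate(ordered, start=1))


def min_entropy(probs):
    """Min-entropy in bits: -log2 of the most likely outcome's probability."""
    return -math.log2(max(probs))


# Hypothetical 2-bit PUF response distributions (illustrative numbers only).
uniform = [0.25, 0.25, 0.25, 0.25]   # unbiased response
biased = [0.70, 0.10, 0.10, 0.10]    # biased response is easier to guess

g_uniform = guesswork(uniform)       # (1 + 2 + 3 + 4) / 4
g_biased = guesswork(biased)         # 0.7 + 0.2 + 0.3 + 0.4
h_uniform = min_entropy(uniform)
h_biased = min_entropy(biased)
```

In this toy case bias lowers both measures together: the attacker needs fewer guesses on average, and the min-entropy drops below 2 bits.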
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
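The two counting schemes compared in this abstract can be sketched as follows. This is an illustrative reconstruction, not the study's code: the data, the function name, and the `max_authors` parameter are hypothetical. The sketch also pairs co-authors of the same cited work with each other, which is one possible convention and an assumption here.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors):
    """Count author co-citations, crediting the first `max_authors` authors
    of each cited work. max_authors=1 is traditional first-author counting."""
    counts = Counter()
    for cited_works in citing_papers:
        # Authors credited by this citing paper, each counted at most once.
        credited = set()
        for authors in cited_works:
            credited.update(authors[:max_authors])
        # Every unordered pair of credited authors is co-cited once.
        for pair in combinations(sorted(credited), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: each citing paper lists its cited works,
# and each cited work is an ordered author list.
papers = [
    [["Smith", "Lee"], ["Jones", "Wu"]],
    [["Smith", "Lee"], ["Garcia"]],
]

first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

With first-author counting, Lee and Wu never appear in the co-citation data at all; counting the first five authors brings them in and links more author pairs.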
Lightweight Opportunistic Memory Resilience
The reliability of memory subsystems is worsening rapidly and needs to be considered as one of the primary design objectives when designing today's computer systems. From on-chip embedded memories in Internet-of-Things (IoT) devices and on-chip caches to off-chip main memories, they have become the limiting factor in the reliability of these computing systems. Today's applications demand large capacity of on-chip or off-chip memory or both. With aggressive technology scaling, coupled with the increase in the total area devoted to memory in a chip, memories are becoming particularly sensitive to manufacturing process variation, environmental operating conditions, and aging-induced wearout. However, the challenge with memory reliability is that the resiliency techniques need to be effective but with minimal overhead. Today's typical error correcting schemes do not take into consideration the data value that they are protecting and are purely based on positional errors. This increases their overheads and makes them too expensive, especially for on-chip memories. Also, the drive for denser off-chip main memories is worsening their reliability. But strengthening today's error correction techniques will result in non-negligible increase in overheads. Hence, this dissertation proposes Lightweight Opportunistic Memory Resilience. We exploit the following three aspects to make memories more reliable with low overheads: (1) Underlying memory fault models, (2) Data value behavior of commonly used applications, and (3) The architecture of the memory itself. We opportunistically exploit these three aspects to provide stronger protection against memory errors. 
We design novel error detecting and correcting codes and develop several other architectural fault tolerance techniques at minimal overheads compared to the conventional reliability techniques used in today's memories. In part 1 of this dissertation, we address the reliability concerns in lightweight on-chip caches or embedded memories like scratchpads in IoT devices. These memories are becoming larger in size, but need to be low power. Using standard error correcting codes or traditional row/column sparing to recover from faults is too expensive for them. Here, we leverage the fact that manufacturing defects and aging-induced hard faults usually affect only a few bits in a memory. These bits, however, limit how low a voltage these chips can be operated at. Traditional software fails even when a small number of bits in a memory are faulty. For the first time, we provide two solutions, FaultLink and SAME-Infer, which help deal with these weak faulty cells in the memory by generating a custom-tailored fault-aware application binary image for each chip. Next, we designed Software-Defined Error Localization Code (SDELC) and Parity++ as lightweight runtime error recovery techniques that leverage the insight that data values have locality in them and certain ranges of data values occur more frequently than others. Conventional ECC is too expensive for these lightweight memories. SDELC uses novel ultra-lightweight error-localizing codes to localize the error to a chunk in the data. It then heuristically recovers from the localized error by exploiting side information about the application's memory contents. Parity++ is a novel unequal message protection scheme that preferentially provides stronger error protection to certain "special messages". This protection scheme provides Single Error Detection (SED) for all messages and Single Error Correction (SEC) for a subset of special messages.
Both these novel codes utilize data value behavior to provide single error correction at 2.5x-4x lower overhead than a conventional Hamming single error correcting code. In part 2 of this dissertation, we focus on off-chip main memory technologies. We primarily leverage the details of the memory architecture itself and their dominant fault mechanisms to effectively design reliability schemes. The need for larger main memory capacity in today's workstation or server environments is driving the use of non-volatile memories (NVM) or techniques to enable high density DRAMs. Due to aggressive scaling, the single-bit error rate in DRAMs is steadily increasing and DRAM manufacturers are adopting on-die error correction coding (ECC) schemes, along with in-controller ECC, to correct single-bit errors in the memory. In COMET we have shown that today’s standard on-die ECCs can lead to silent data corruption if not designed correctly. We propose a collaborative on-die and in-controller error correction scheme that prevents double-bit error induced silent data corruption and corrects 99.9997% of these double-bit errors with no additional storage, latency, or area overhead. Reliability is a major concern not only in DRAMs but also in most of the emerging NVM technologies. In Compression with Multi-ECC (CME), we propose a new opportunistic compression-based ECC protection scheme for magnetic memory-based main memories. CME compresses every memory line and uses the saved bits to add stronger protection. In some of these NVMs, error rates increase as we try to improve read/write latencies. In PCM-Duplicate, we propose an enhanced PCM architecture that reduces PCM read latency by more than 3x and makes it comparable to that of DRAM. We then use ECC to tolerate the additional errors that arise because of the proposed optimizations.
Overall, we have developed a complementary suite of novel methods for tolerating faults and correcting errors in different levels of the memory hierarchy. We exploit the memory architecture and fault mechanisms as well as the application data behavior to tune the proposed solutions to the particular memory characteristics: lightweight solutions for low-cost embedded memories and latency-critical on-chip caches, and stronger protection for off-chip main memory subsystems. With memory reliability being a major bottleneck in today’s systems, these novel solutions are expected to alleviate this problem, help cope with the unique outcomes of hardware variability in memory systems, and provide improved reliability at minimal cost.
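As background for the Single Error Detection (SED) baseline mentioned in this abstract, here is a generic even-parity sketch. It is not Parity++, SDELC, or any code from the dissertation; the message and function names are hypothetical. It shows that one parity bit detects a single flipped bit but neither locates it nor detects a double flip.

```python
def add_parity(data_bits):
    """Append one even-parity bit so the codeword has an even number of 1s."""
    return data_bits + [sum(data_bits) % 2]


def has_error(codeword):
    """Return True if the parity check fails (an odd number of bits flipped)."""
    return sum(codeword) % 2 == 1


word = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical 8-bit message
stored = add_parity(word)
clean = has_error(stored)         # no flips: check passes

corrupted = stored.copy()
corrupted[3] ^= 1                 # single bit flip
detected = has_error(corrupted)   # parity check fails, error detected

double = corrupted.copy()
double[5] ^= 1                    # a second flip restores even parity
missed = has_error(double)        # parity alone does not see this
```

Correcting the error, or catching the double flip, needs extra redundancy or side information beyond this single bit.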
Learned Approximate Computing for Machine Learning
Machine learning using deep neural networks is growing in popularity and, at the same time, demanding ever more computation. Approximate computing is a promising approach that trades accuracy for performance, and stochastic computing is an especially interesting approach that preserves the compute units of single-bit computation while allowing adjustable compute precision. This dissertation centers around enabling and improving stochastic computing for neural networks, while also discussing works that lead up to stochastic computing and how the techniques developed for stochastic computing are applied to other approximate computing methods and applications other than deep neural networks. We start with 3pxnet, which combines extreme quantization with model pruning. While 3pxnet achieves extremely compact models, it demonstrates limits of binarization, including the inability to scale to higher precision levels and performance bottlenecks from accumulation. This leads us to stochastic computing, which performs single-gate multiplications and additions on probabilistic bit streams. The initial SC neural network implementation in ACOUSTIC aims at maximizing SC performance benefits while achieving usable accuracy. This is achieved through design choices in stream representation, performance optimizations using pooling layers, and training modifications to make single-gate accumulation possible. The subsequent work in GEO improves the stream generation and computation aspects of stochastic computing and reduces the accuracy gap between stochastic computing and fixed-point computing. The accumulation part of SC is further optimized in REX-SC, which allows efficient modeling of SC accumulation during training. During these iterations of the SC algorithm, we developed efficient training pipelines that target various aspects of training for approximate computing.
Both forward and backward passes of training are optimized, which allows us to demonstrate model convergence results using SC and other approximate computing methods with limited hardware resources. Finally, we apply the training concept to other applications. In LAC, we show that an almost arbitrary parameterized application can be trained to perform well with approximate computing. At the same time, we can search for the optimal hardware configuration using NAS techniques.
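The binarized arithmetic that 3pxnet builds on can be sketched with the usual XNOR-popcount dot product. This is a generic illustration of binarization only, not the 3pxnet library: it does not cover pruning, permutation, or its packing format, and the function names and vectors are hypothetical. It assumes weights and activations in {+1, -1}, packed one bit each.

```python
def pack_bits(values):
    """Pack a +1/-1 vector into an int, one bit per element (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    # XNOR marks positions where the signs agree; each agreement contributes
    # +1 to the dot product and each disagreement contributes -1.
    matches = bin(~(a_word ^ b_word) & mask).count("1")
    return 2 * matches - n


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))   # ordinary integer dot product
packed = xnor_dot(pack_bits(a), pack_bits(b), len(a))
```

The multiplications collapse into one bitwise operation, but the popcount still has to sum every position, which is the kind of accumulation cost the abstract points to.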
Understanding Software Application Behaviour in Presence of Permanent and Intermittent Hardware Faults
Over the past three decades, technological advancement in the fabrication of VLSI ICs has been accompanied by shrinking device sizes and scaling of supply voltage. While power, area, and performance have constantly improved, hardware reliability is becoming a growing concern. Due to increased process, voltage, and temperature (PVT) variations, the infant mortality rate has gone up. Coupled with PVT variations, aging and wearout-induced failures have exacerbated the problem as devices unexpectedly fail while in operation. Although a significant fraction of emerging failure and wearout mechanisms result in intermittent or permanent faults in the hardware, their impact (as distinct from transient faults) on software applications has not been well studied. In this work, we analyze the impact of such failures on software applications and develop a distinguishing application characteristic, referred to as similarity, from a basic circuit-level understanding of the failure mechanisms. We present a mathematical definition and approximations for similarity computation for practical software applications and experimentally verify the relationship between similarity and fault rate. Leveraging the dependence of application robustness on the similarity metric, we present example architecture-independent code transformations to reduce similarity and thereby the worst-case fault rate with minimal performance degradation. The experiments with arithmetic unit faults show as much as 74% improvement in the worst-case fault rate on benchmark kernels with less than 10% performance degradation.
Cross-Layer Approaches for Monitoring, Margining and Mitigation of Circuit Variability
With technology scaling, circuit performance has become more sensitive to various sources of variability, including manufacturing variations, ambient fluctuations, and circuit wear-out. These increased variations have created new challenges for conventional hardware guardbanding, as the additional design margin diminishes the benefits of technology scaling. This dissertation aims at reducing total system design margin with cross-layer approaches on monitoring, margining and mitigation of circuit variability. Since hardware and software adaptation can be used to reduce design margin with the exposed hardware variability provided by hardware monitors, we start by proposing two different types of performance monitors that can achieve better monitoring accuracy and smaller monitoring overhead. We also demonstrate the use of these performance monitors in system adaptation with our end-to-end implementation of software testbeds. We also study the dynamic variations and reliability margining problem in the presence of monitor-and-actuate adaptation and emerging system contexts. In a system with monitor-and-actuate adaptation, dynamic variations require extra margin for monitor and actuate latencies. We analyze and study the margining problem considering different choices of the monitor and actuator types. System reliability margining strategies are also proposed for circuits in the “dark silicon” era, where the low-level design margin should consider the contexts of high-level power/thermal constraints. Last, we propose a clock gating methodology to mitigate the aging-induced clock skew, which is difficult to monitor and resolve through adaptation. For certain phenomena and variation sources, for example, soft error rates at different locations/altitudes, we also propose system/cloud-based monitors. An emulation platform is built to study the impacts of dynamic power management schemes on system reliability.
