RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs
The fast proliferation of extreme-edge applications using Deep Learning (DL) based algorithms requires dedicated hardware to satisfy their latency, throughput, and precision requirements. While inference is achievable in practical cases, online finetuning and adaptation of general DL models are still highly challenging.
One of the key stumbling blocks is the need for parallel floating-point operations, which are considered unaffordable on sub-100 mW extreme-edge SoCs. We tackle this problem with RedMulE (Reduced-precision matrix Multiplication Engine), a parametric low-power hardware accelerator for FP16 matrix multiplications - the main kernel of DL training and inference - conceived for tight integration within a cluster of tiny RISC-V cores based on the PULP (Parallel Ultra-Low-Power) architecture. In 22 nm technology, a 32-FMA RedMulE instance occupies just 0.07 mm^2 (14% of an 8-core RISC-V cluster) and achieves up to 666 MHz maximum operating frequency, for a throughput of 31.6 MAC/cycle (98.8% utilization). We reach a cluster-level power consumption of 43.5 mW and a full-cluster energy efficiency of 688 16-bit GFLOPS/W. Overall, RedMulE features up to 4.65x higher energy efficiency and 22x speedup over SW execution on 8 RISC-V cores
MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V Cores
Low-precision formats have recently driven major breakthroughs in neural network (NN) training and inference by reducing the memory footprint of the NN models and improving the energy efficiency of the underlying hardware architectures. Narrow integer data types have been vastly investigated for NN inference and have successfully been pushed to the extreme of ternary and binary representations. In contrast, most training-oriented platforms use at least 16-bit floating-point (FP) formats. Lower-precision data types such as 8-bit FP formats and mixed-precision techniques have only recently been explored in hardware implementations. We present MiniFloat-NN, a RISC-V instruction set architecture extension for low-precision NN training, providing support for two 8-bit and two 16-bit FP formats and expanding operations. The extension includes sum-of-dot-product instructions that accumulate the result in a larger format and three-term additions in two variations: expanding and non-expanding. We implement an ExSdotp unit to efficiently support in hardware both instruction types. The fused nature of the ExSdotp module prevents precision losses generated by the non-associativity of two consecutive FP additions while saving around 30% of the area and critical path compared to a cascade of two expanding fused multiply-add units. We replicate the ExSdotp module in a SIMD wrapper and integrate it into an open-source floating-point unit, which, coupled to an open-source RISC-V core, lays the foundation for future scalable architectures targeting low-precision and mixed-precision NN training. A cluster containing eight extended cores sharing a scratchpad memory, implemented in 12 nm FinFET technology, achieves up to 575 GFLOPS/W when computing FP8-to-FP16 GEMMs at 0.8 V, 1.26 GHz
Extending RISC-V for Efficient Overflow Recovery in Mixed-Precision Computations
Pushed by the fast exponential growth of machine learning models, low-precision floating-point (FP) formats, such as FP8 and FP16, are now supported by many commercial hardware platforms. Thanks to the available hardware support and their reduced storage and energy footprint, these low-precision formats are currently being investigated for many applications beyond neural network (NN) training and inference. These data types, however, rely on narrow exponent bitwidths, which directly translate to small dynamic ranges. Consequently, they are less robust to overflow with respect to FP32, especially during long accumulations. While overflowing values are often saturated in NN algorithms, this approach might not be sustainable in all scenarios, such as in the case of safety-critical applications. In this work, we propose a low-overhead hardware-software approach for overflow recovery. We devise an online recovery scheme, which leverages a RISC-V instruction set architecture (ISA) extension to minimize the overhead required to detect overflow and adjust the accumulation precision. For this purpose, branch instructions depending on the FP overflow flag and widening dot-product instructions working on 8-bit inputs and accumulating with 32 bits are added to a RISC-V core with mixed-precision capabilities. Our ISA extension adds less than 1% of hardware overhead to the RISC-V core and allows for less than 2% of performance penalty for overflow detection in a 128 x 128 matrix multiplication. Supporting overflow detection and recovery introduces negligible overhead with respect to a fragile baseline mixed-precision computation while maintaining its storage and performance advantages with respect to the full-precision baseline
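The detect-and-recover idea in this abstract can be illustrated in software. The following is a minimal sketch, not the paper's ISA extension: it assumes NumPy, uses float16 accumulation with a float32 fallback as a stand-in for the 8-bit-input, 32-bit-accumulation instructions, and checks for infinity instead of branching on a hardware overflow flag. The function name `dot_with_overflow_recovery` is hypothetical.

```python
import numpy as np

def dot_with_overflow_recovery(a, b):
    """Dot product accumulated in float16, recomputed in float32 on overflow.

    Returns (value, recovered), where `recovered` tells whether the
    wider-precision fallback was needed. Illustrative only: real hardware
    would test the FP overflow flag rather than compare against infinity.
    """
    acc = np.float16(0.0)
    with np.errstate(over="ignore"):  # silence NumPy's overflow warnings
        for x, y in zip(a, b):
            acc = np.float16(acc + np.float16(x) * np.float16(y))
            if np.isinf(acc):
                # Overflow detected: redo the accumulation in float32.
                wide = np.dot(a.astype(np.float32), b.astype(np.float32))
                return float(wide), True
    return float(acc), False

small = np.array([1, 2, 3], dtype=np.float16)
large = np.full(128, 100, dtype=np.float16)
print(dot_with_overflow_recovery(small, np.array([4, 5, 6], dtype=np.float16)))
print(dot_with_overflow_recovery(large, large))
```

The first call stays within the float16 range, so no recovery is triggered; the second exceeds the largest finite float16 value (65504) after a few terms and falls back to the wider accumulator.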
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
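The difference between the two counting schemes can be sketched in a few lines. This is an illustrative toy, not the study's procedure: the data layout (each citing paper is a list of cited works, each cited work an ordered author list), the function name `cocitation_counts`, and the choice to count a pair once per citing paper (including co-authors of the same cited work) are all assumptions.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors=1):
    """Count author pairs that appear together in a citing paper's references.

    max_authors=1 gives first-author counting; max_authors=5 considers the
    first five authors of every cited work.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts

# Hypothetical data: two citing papers, each citing two works.
papers = [
    [["A", "B"], ["C"]],
    [["A"], ["C", "D"]],
]
print(cocitation_counts(papers, max_authors=1))
print(cocitation_counts(papers, max_authors=5))
```

With first-author counting only the pair (A, C) is seen; extending to the first five authors also brings B and D into the co-citation data.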
RedMule: A mixed-precision matrix–matrix operation engine for flexible and energy-efficient on-chip linear algebra and TinyML training acceleration
The increasing interest in TinyML, i.e., near-sensor machine learning on power budgets of a few tens of mW, is currently pushing toward enabling TinyML-class training as opposed to inference only. Current training algorithms, based on various forms of error and gradient backpropagation, rely on floating-point matrix operations to meet the precision and dynamic range requirements. So far, the energy and power cost of these operations has been considered too high for TinyML scenarios. This paper addresses the open challenge of near-sensor training on a few mW power budget and presents RedMulE (Reduced-Precision Matrix Multiplication Engine), a low-power specialized accelerator conceived for multi-precision floating-point General Matrix–Matrix Operations (GEMM-Ops) acceleration, supporting FP16, as well as hybrid FP8 formats, with {sign, exponent, mantissa} = ({1, 4, 3}, {1, 5, 2}). We integrate RedMulE into a Parallel Ultra-Low-Power (PULP) cluster containing eight energy-efficient RISC-V cores sharing a tightly-coupled data memory and implement the resulting system in a 22 nm technology. At its best efficiency point (@ 470 MHz, 0.65 V), the RedMulE-augmented PULP cluster achieves 755 GFLOPS/W and 920 GFLOPS/W during regular General Matrix–Matrix Multiplication (GEMM), and up to 1.19 TFLOPS/W and 1.67 TFLOPS/W when executing GEMM-Ops, respectively, for FP16 and FP8 input/output tensors. At its best performance point (@ 613 MHz, 0.8 V), RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements and consuming less than 60 mW on average, thus enabling on-device training of deep learning models in TinyML application scenarios while retaining the flexibility to tackle other classes of common linear algebra problems efficiently
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
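To make the two named measures concrete, here is a minimal sketch of the cosine and the Pearson correlation applied to cocitation profiles. The profiles are invented numbers, the function names are hypothetical, and the code does not reproduce the paper's argument or its two additional measures. It only relies on the standard identity that the Pearson correlation equals the cosine of the mean-centered vectors.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation, computed as the cosine of mean-centered profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

# Hypothetical cocitation profiles of two authors against three others.
author_1 = [1, 2, 3]
author_2 = [3, 2, 1]
print(cosine(author_1, author_2))   # positive: the profiles share mass
print(pearson(author_1, author_2))  # negative: the profiles trend oppositely
```

The example shows that the two measures can disagree even in sign on the same pair of profiles, which is why the choice of measure matters in practice.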
Different outcomes among favourable and unfavourable intermediate-risk prostate cancer patients treated with hypofractionated radiotherapy and androgen deprivation therapy
BACKGROUND:
To evaluate the role of a risk stratification system in intermediate-risk prostate cancer (PCa) treated with hypofractionated radiotherapy (HyRT).
METHODS:
A total of 131 patients with intermediate-risk PCa were treated with HyRT at a total dose of 54.75 Gy in 15 fractions plus 9 months of androgen deprivation therapy (ADT). Patients were classified as favourable risk (FIR) if they had a single NCCN intermediate-risk factor (IRF), a Gleason score ≤3 + 4 = 7, and <50 % of biopsy cores containing cancer (PBCC). Patients who did not meet these criteria were classified as unfavourable risk (UIR). Univariate and multivariate analyses using the Cox proportional hazards model were performed for biochemical recurrence-free survival (bRFS), the risk of local recurrence and metastasis-free survival (MFS).
RESULTS:
After a median follow-up of 56.7 months (range 9.8 to 93.7 months), 11 patients (8.4 %) died, 2 of them (1.5 %) from PCa. In the univariate analysis, Gleason score, PPBCs, IRFs and PSA at first follow-up were prognostic factors for bRFS and LF, while Gleason score, PPBCs and PSA at first follow-up were significant predictors of MFS. In the multivariate analysis, only PSA at first follow-up was a prognostic factor for bRFS and MFS. Patients with PSA at first follow-up <0.7 ng/mL, compared with those with PSA ≥0.7 ng/mL, had a 5y-bRFS of 93.3 % vs. 57.5 %, 5y-MFS of 99.0 % vs. 78.9 % and 5y-LF of 5.8 % vs. 38.3 %. Patients in the UIR PCa group with a PSA value <0.7 ng/mL at first follow-up had significantly better bRFS, LF and MFS.
CONCLUSIONS:
Risk factors currently not included in the guidelines are useful to stratify patients with intermediate-risk PCa into two groups with different prognoses, even when HyRT is delivered. PSA at first follow-up is useful in UIR PCa to guide the overall length of ADT
RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine
As safety-critical applications increasingly rely on data-parallel floating-point computations, there is a growing need for flexible and configurable fault tolerance in parallel floating-point accelerators such as tensor engines. Replication-based methods ensure reliability but incur high area and power costs, while error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator, balancing fault tolerance, area overhead, and performance impacts. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error-detecting codes to protect the data path, RedMulE-FT achieves an 11× uncorrected fault reduction with only area overhead. Full protection extends to control signals, resulting in no functional errors after 1M injections during our extensive fault injection simulation campaign, with a total area overhead of while maintaining a 500 MHz frequency in a 12 nm technology
