1,721,309 research outputs found
Transparent caching for RMA systems
The constantly increasing gap between communication and computation performance emphasizes the importance of communication-avoidance techniques. Caching is a well-known concept used to reduce accesses to slow local memories. In this work, we extend the caching idea to MPI3 Remote Memory Access (RMA) operations. Here, caching can avoid inter-node communications and achieve similar benefits for irregular applications as communication-avoiding algorithms for structured applications. We propose CLaMPI,
a caching library layered on top of MPI-3 RMA, to automatically optimize code with minimum user intervention. We demonstrate how cached RMA improves the performance of a Barnes Hut simulation and a Local Clustering Coefficient computation up to a factor of 1.8x and 5x, respectively. Due to the low overheads in the cache miss case and the potential benefits, we expect that our ideas around transparent RMA caching will soon be an integral part of many MPI librarie
Scaling betweenness centrality using communication-efficient sparse matrix multiplication
Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths
leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of p 1/3 less communication on p processors than the best known alternatives, for
graphs withn vertices and average degree k = n/p 2/3. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems
Extending RISC-V for Efficient Overflow Recovery in Mixed-Precision Computations
Pushed by the fast exponential growth of machine learning models, low-precision floating-point (FP) formats, such as FP8 and FP16, are now supported by many commercial hardware platforms. Thanks to the available hardware support and their reduced storage and energy footprint, these low-precision formats are currently being investigated for many applications beyond neural network (NN) training and inference. These data types, however, rely on narrow exponent bitwidths, which directly translate to small dynamic ranges. Consequently, they are less robust to overflow with respect to FP32, especially during long accumulations. While overflowing values are often saturated in NN algorithms, this approach might not be sustainable in all scenarios, such as in the case of safety-critical applications. In this work, we propose a low-overhead hardware-software approach for overflow recovery. We devise an online recovery scheme, which leverages a RISC-V instruction set architecture (ISA) extension to minimize the overhead required to detect overflow and adjust the accumulation precision. For this purpose, branch instructions depending on the FP overflow flag and widening dot-product instructions working on 8-bit inputs and accumulating with 32 bits are added to a RISC-V core with mixed-precision capabilities. Our ISA extension adds less than 1% of hardware overhead to the RISC-V core and allows for less than 2% of performance penalty for overflow detection in a 128 x 128 matrix multiplication. Supporting overflow detection and recovery introduces negligible overhead with respect to a fragile baseline mixed-precision computation while maintaining its storage and performance advantages with respect to the full-precision baseline
Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra
Sparse linear algebra is crucial in many application domains, but challenging to handle efficiently in both software and hardware, with one- and two-sided operand sparsity handled with distinct approaches. In this work, we enhance an existing memory-streaming RISC-V ISA extension to accelerate both one- and two-sided operand sparsity on widespread sparse tensor formats like compressed sparse row (CSR) and compressed sparse fiber (CSF) by accelerating the underlying operations of streaming indirection, intersection, and union. Our extensions enable single-core speedups over an optimized RISC-V baseline of up to 7.0x, 7.7x, and 9.8x on sparse-dense multiply, sparse-sparse multiply, and sparse-sparse addition, respectively, and peak FPU utilizations of up to 80% on sparse-dense problems. On an eight-core cluster, sparse-dense and sparse-sparse matrix-vector multiply using real-world matrices are up to 4.9x and 5.9x faster and up to 2.9x and 3.0x more energy efficient. We explore further applications for our extensions, such as stencil codes and graph pattern matching. Compared to recent CPU, GPU, and accelerator approaches, our extensions enable higher flexibility on data representation, degree of sparsity, and dataflow at a minimal hardware footprint, adding only 1.8% in area to a compute cluster. A cluster with our extensions running CSR matrix-vector multiplication achieves 9.9x and 1.7x higher peak floating-point utilizations than recent highly optimized sparse data structures and libraries for CPU and GPU, respectively, even when accounting for off-chip main memory (HBM) and on-chip interconnect latency and bandwidth effects
NeVerMore: Exploiting RDMA Mistakes in NVMe-oF Storage Applications
This paper presents a security analysis of the InfiniBand architecture, a prevalent RDMA standard, and NVMe-over-Fabrics (NVMe-oF), a prominent protocol for industrial disaggregated storage that exploits RDMA protocols to achieve low-latency and high-bandwidth access to remote solid-state devices. Our work, NeVerMore, discovers new vulnerabilities in RDMA protocols that unveils several attack vectors on RDMA-enabled applications and the NVMe-oF protocol, showing that the current security mechanisms of the NVMe-oF protocol do not address the security vulnerabilities posed by the use of RDMA. In particular, we show how an unprivileged user can inject packets into any RDMA connection created on a local network controller, bypassing security mechanisms of the operating system and its kernel, and how the injection can be used to acquire unauthorized block access to NVMe-oF devices. Overall, we implement four attacks on RDMA protocols and seven attacks on the NVMe-oF protocol and verify them on the two most popular implementations of NVMe-oF: SPDK and the Linux kernel. To mitigate the discovered attacks we propose multiple mechanisms that can be implemented by RDMA and NVMe-oF providers
A High-Performance, Energy-Efficient Modular DMA Engine Architecture
Data transfers are essential in today's computing systems as latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAES) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control plane binding to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations: In high-performance systems, we achieve speedups of up to 15.8 × with only 1% additional area compared to a base system without a DMAE. We achieve an area reduction of 10% while improving ML inference performance by 23% in ultra-low-energy edge AI systems over an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide its instantiation in various systems
HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement
2.5D integration is an important technique to tackle the growing cost of manufacturing chips in advanced technology nodes. This poses the challenge of providing high-performance inter-chiplet interconnects (ICIs). As the number of chiplets grows to tens or hundreds, it becomes infeasible to hand-optimize their arrangement in a way that maximizes the ICI performance. In this paper, we propose HexaMesh, an arrangement of chiplets that outperforms a grid arrangement both in theory (network diameter reduced by 42%; bisection bandwidth improved by 130%) and in practice (latency reduced by 19%; throughput improved by 34%). MexaMesh enables large-scale chiplet designs with high-performance ICIs
Sparse Hamming Graph: A Customizable Network-on-Chip Topology
Chips with hundreds to thousands of cores require scalable networks-on-chip (NoCs). Customization of the NoC topology is necessary to reach the diverse design goals of different chips. We introduce sparse Hamming graph, a novel NoC topology with an adjustable cost-performance trade-off that is based on four NoC topology design principles we identified. To efficiently customize this topology, we develop a toolchain that leverages approximate floorplanning and link routing to deliver fast and accurate cost and performance predictions. We demonstrate how to use our methodology to achieve desired cost-performance trade-offs while outperforming established topologies in cost, performance, or both
- …
