1,721,309 research outputs found

Hoefler, Torsten

Author: Hoefler Torsten
Publication date: 17/03/2016

Banwa Publications (University of the Philippines Mindanao)

Foreword EuroMPI 2019

Author: Hoefler Torsten
Träff Jesper Larsson
Publication date: 01/01/2019

reposiTUm (TUW Vienna)

Transparent caching for RMA systems

Author: Vella Flavio
Hoefler Torsten
Di Girolamo Salvatore
Publication date: 01/01/2017

The constantly increasing gap between communication and computation performance emphasizes the importance of communication-avoidance techniques. Caching is a well-known concept used to reduce accesses to slow local memories. In this work, we extend the caching idea to MPI-3 Remote Memory Access (RMA) operations. Here, caching can avoid inter-node communications and achieve similar benefits for irregular applications as communication-avoiding algorithms for structured applications. We propose CLaMPI, a caching library layered on top of MPI-3 RMA, to automatically optimize code with minimum user intervention. We demonstrate how cached RMA improves the performance of a Barnes Hut simulation and a Local Clustering Coefficient computation up to a factor of 1.8x and 5x, respectively. Due to the low overheads in the cache miss case and the potential benefits, we expect that our ideas around transparent RMA caching will soon be an integral part of many MPI libraries.
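
As a rough illustration of the caching idea in this abstract, the following Python sketch wraps a simulated remote read in a local cache. Everything in it is hypothetical: `RemoteWindow`, `CachedWindow`, and `get` are illustrative names and are not part of CLaMPI or the MPI-3 RMA interface. The "remote" memory is a local list with an access counter, and the sketch assumes the remote data is read-only, so no invalidation is modelled.

```python
class RemoteWindow:
    """Stand-in for remotely accessible memory (hypothetical, not an MPI API)."""

    def __init__(self, data):
        self._data = list(data)
        self.remote_reads = 0  # counts simulated network round trips

    def get(self, offset, count):
        self.remote_reads += 1
        return tuple(self._data[offset:offset + count])


class CachedWindow:
    """Serves repeated reads of the same (offset, count) range from a local dict."""

    def __init__(self, window):
        self._window = window
        self._cache = {}
        self.hits = 0

    def get(self, offset, count):
        key = (offset, count)
        if key in self._cache:
            self.hits += 1  # hit: no remote access needed
            return self._cache[key]
        value = self._window.get(offset, count)  # miss: go to the remote side
        self._cache[key] = value
        return value


window = RemoteWindow(range(100))
cached = CachedWindow(window)
first = cached.get(10, 4)   # miss, triggers one simulated remote read
second = cached.get(10, 4)  # hit, served locally
```

The caller uses `CachedWindow.get` exactly as it would use `RemoteWindow.get`, which is the sense in which such a cache can be called transparent. A real implementation would also need eviction and a consistency policy, both of which are omitted here.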

ETHzürich Repository for Publications and Research Data

Crossref

Archivio della ricerca- Università di Roma La Sapienza

Scaling betweenness centrality using communication-efficient sparse matrix multiplication

Author: Vella Flavio
Hoefler Torsten
Solomonik Edgar
Besta Maciej
Publication date: 01/01/2017

Betweenness centrality (BC) is a crucial graph problem that measures the significance of a vertex by the number of shortest paths leading through it. We propose Maximal Frontier Betweenness Centrality (MFBC): a succinct BC algorithm based on novel sparse matrix multiplication routines that performs a factor of $p^{1/3}$ less communication on $p$ processors than the best known alternatives, for graphs with $n$ vertices and average degree $k = n/p^{2/3}$. We formulate, implement, and prove the correctness of MFBC for weighted graphs by leveraging monoids instead of semirings, which enables a surprisingly succinct formulation. MFBC scales well for both extremely sparse and relatively dense graphs. It automatically searches a space of distributed data decompositions and sparse matrix multiplication algorithms for the most advantageous configuration. The MFBC implementation outperforms the well-known CombBLAS library by up to 8x and shows more robust performance. Our design methodology is readily extensible to other graph problems.
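
To make the quantity being computed concrete, here is a minimal sequential sketch of betweenness centrality. It uses the classical Brandes algorithm on an unweighted, undirected graph. It is not MFBC and does not use sparse matrix multiplication; the function name and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each vertex to a list of its neighbours.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s] = 1
        dist[s] = 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair (s, t) was counted once from each endpoint.
    return {v: c / 2.0 for v, c in bc.items()}


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = betweenness_centrality(path)
```

On the three-vertex path `a - b - c`, only `b` lies on a shortest path between two other vertices, so it is the only vertex with a non-zero score.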

Crossref

Archivio della ricerca- Università di Roma La Sapienza

Extending RISC-V for Efficient Overflow Recovery in Mixed-Precision Computations

Author: Bertaccini Luca
Shen Siyuan
Benini Luca
Hoefler Torsten
Publication date: 01/01/2024

Pushed by the fast exponential growth of machine learning models, low-precision floating-point (FP) formats, such as FP8 and FP16, are now supported by many commercial hardware platforms. Thanks to the available hardware support and their reduced storage and energy footprint, these low-precision formats are currently being investigated for many applications beyond neural network (NN) training and inference. These data types, however, rely on narrow exponent bitwidths, which directly translate to small dynamic ranges. Consequently, they are less robust to overflow than FP32, especially during long accumulations. While overflowing values are often saturated in NN algorithms, this approach might not be sustainable in all scenarios, such as in the case of safety-critical applications. In this work, we propose a low-overhead hardware-software approach for overflow recovery. We devise an online recovery scheme, which leverages a RISC-V instruction set architecture (ISA) extension to minimize the overhead required to detect overflow and adjust the accumulation precision. For this purpose, branch instructions depending on the FP overflow flag and widening dot-product instructions working on 8-bit inputs and accumulating with 32 bits are added to a RISC-V core with mixed-precision capabilities. Our ISA extension adds less than 1% hardware overhead to the RISC-V core and incurs a performance penalty of less than 2% for overflow detection in a 128 x 128 matrix multiplication. Supporting overflow detection and recovery introduces negligible overhead with respect to a fragile baseline mixed-precision computation while maintaining its storage and performance advantages with respect to the full-precision baseline.
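
The detect-then-widen idea can be mimicked in software. The sketch below is only an analogy under stated assumptions: it accumulates in FP16 rather than the 8-bit formats mentioned in the abstract, it uses the `OverflowError` raised by Python's `struct` module when packing the `e` (half-precision) format as a stand-in for a hardware overflow flag, and it recovers by simply redoing the accumulation in a wide accumulator. The helper names `to_fp16` and `dot_with_recovery` are hypothetical, and this is not the recovery scheme from the paper.

```python
import struct


def to_fp16(x):
    """Round x to IEEE half precision; raises OverflowError if out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_with_recovery(a, b):
    """Accumulate a dot product in FP16, switching to a wide accumulator on overflow.

    Returns (result, recovered), where recovered says whether the wide path ran.
    """
    acc = 0.0
    try:
        for x, y in zip(a, b):
            # Every product and every partial sum is rounded to FP16.
            acc = to_fp16(acc + to_fp16(x * y))
        return acc, False
    except OverflowError:
        # Recovery: redo the accumulation with a wide (Python float) accumulator.
        wide = 0.0
        for x, y in zip(a, b):
            wide += x * y
        return wide, True


small = dot_with_recovery([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
large = dot_with_recovery([200.0] * 4, [200.0] * 4)
```

In the second call each product (40000) fits in FP16, but the running sum exceeds the largest finite FP16 value (65504) after two terms, so the wide path is taken.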

ETHzürich Repository for Publications and Research Data

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra

Author: Zaruba Florian
Hoefler Torsten
Scheffler Paul
Schuiki Fabian
Benini Luca
Publication date: 01/01/2023

Sparse linear algebra is crucial in many application domains, but challenging to handle efficiently in both software and hardware, with one- and two-sided operand sparsity handled with distinct approaches. In this work, we enhance an existing memory-streaming RISC-V ISA extension to accelerate both one- and two-sided operand sparsity on widespread sparse tensor formats like compressed sparse row (CSR) and compressed sparse fiber (CSF) by accelerating the underlying operations of streaming indirection, intersection, and union. Our extensions enable single-core speedups over an optimized RISC-V baseline of up to 7.0x, 7.7x, and 9.8x on sparse-dense multiply, sparse-sparse multiply, and sparse-sparse addition, respectively, and peak FPU utilizations of up to 80% on sparse-dense problems. On an eight-core cluster, sparse-dense and sparse-sparse matrix-vector multiply using real-world matrices are up to 4.9x and 5.9x faster and up to 2.9x and 3.0x more energy efficient. We explore further applications for our extensions, such as stencil codes and graph pattern matching. Compared to recent CPU, GPU, and accelerator approaches, our extensions enable higher flexibility on data representation, degree of sparsity, and dataflow at a minimal hardware footprint, adding only 1.8% in area to a compute cluster. A cluster with our extensions running CSR matrix-vector multiplication achieves 9.9x and 1.7x higher peak floating-point utilizations than recent highly optimized sparse data structures and libraries for CPU and GPU, respectively, even when accounting for off-chip main memory (HBM) and on-chip interconnect latency and bandwidth effects
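
The intersection and union operations named in this abstract can be pictured with a plain software sketch. The Python below walks two sorted index streams with two cursors: intersection drives a sparse-sparse dot product, and union drives a sparse-sparse addition. This is only the textbook two-pointer formulation on index/value lists; the function names are hypothetical, and nothing here models the ISA extension or its streaming hardware.

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Sparse-sparse dot product via intersection of two sorted index lists."""
    i = j = 0
    total = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            total += val_a[i] * val_b[j]  # index present in both operands
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return total


def sparse_add(idx_a, val_a, idx_b, val_b):
    """Sparse-sparse addition via union of two sorted index lists."""
    i = j = 0
    idx_out, val_out = [], []
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            idx_out.append(idx_a[i])
            val_out.append(val_a[i] + val_b[j])
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            idx_out.append(idx_a[i])
            val_out.append(val_a[i])
            i += 1
        else:
            idx_out.append(idx_b[j])
            val_out.append(val_b[j])
            j += 1
    # Drain whichever stream still has entries.
    idx_out.extend(idx_a[i:])
    val_out.extend(val_a[i:])
    idx_out.extend(idx_b[j:])
    val_out.extend(val_b[j:])
    return idx_out, val_out


a_idx, a_val = [0, 2, 5], [1.0, 2.0, 3.0]
b_idx, b_val = [2, 3, 5], [4.0, 5.0, 6.0]
dot = sparse_dot(a_idx, a_val, b_idx, b_val)
sum_idx, sum_val = sparse_add(a_idx, a_val, b_idx, b_val)
```

Both loops are dominated by data-dependent index comparisons and branches rather than arithmetic, which is the kind of work the abstract describes accelerating.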

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

NeVerMore: Exploiting RDMA Mistakes in NVMe-oF Storage Applications

Author: De Sensi Daniele
Hoefler Torsten
Perrig Adrian
Taranov Konstantin
Rothenberger Benjamin
Publication date: 01/01/2022

This paper presents a security analysis of the InfiniBand architecture, a prevalent RDMA standard, and NVMe-over-Fabrics (NVMe-oF), a prominent protocol for industrial disaggregated storage that exploits RDMA protocols to achieve low-latency and high-bandwidth access to remote solid-state devices. Our work, NeVerMore, discovers new vulnerabilities in RDMA protocols that unveil several attack vectors on RDMA-enabled applications and the NVMe-oF protocol, showing that the current security mechanisms of the NVMe-oF protocol do not address the security vulnerabilities posed by the use of RDMA. In particular, we show how an unprivileged user can inject packets into any RDMA connection created on a local network controller, bypassing security mechanisms of the operating system and its kernel, and how the injection can be used to acquire unauthorized block access to NVMe-oF devices. Overall, we implement four attacks on RDMA protocols and seven attacks on the NVMe-oF protocol and verify them on the two most popular implementations of NVMe-oF: SPDK and the Linux kernel. To mitigate the discovered attacks we propose multiple mechanisms that can be implemented by RDMA and NVMe-oF providers.

ETHzürich Repository for Publications and Research Data

ZENODO

Archivio della ricerca- Università di Roma La Sapienza

A High-Performance, Energy-Efficient Modular DMA Engine Architecture

Author: Benz Thomas
Hoefler Torsten
Benini Luca
Ottaviano Alessandro
Riedel Samuel
Kurth Andreas
Rogenmoser Michael
Scheffler Paul
Publication date: 01/01/2024

Data transfers are essential in today's computing systems as latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control plane binding to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations: In high-performance systems, we achieve speedups of up to 15.8x with only 1% additional area compared to a base system without a DMAE. We achieve an area reduction of 10% while improving ML inference performance by 23% in ultra-low-energy edge AI systems over an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide its instantiation in various systems.
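
One way to picture a multi-dimensional transfer is as a descriptor that expands into a sequence of one-dimensional copies. The Python sketch below does this for a 2D strided transfer over a `bytearray`. It is a software illustration only: `expand_2d_transfer`, `run_transfers`, and their parameters are hypothetical names chosen for this example and do not reflect iDMA's actual interfaces.

```python
def expand_2d_transfer(src_base, dst_base, row_bytes, rows, src_stride, dst_stride):
    """Yield (src, dst, length) 1D copies equivalent to one 2D strided transfer."""
    for r in range(rows):
        yield (src_base + r * src_stride, dst_base + r * dst_stride, row_bytes)


def run_transfers(memory, transfers):
    """Execute each 1D copy on a flat byte-addressed memory."""
    for src, dst, length in transfers:
        memory[dst:dst + length] = memory[src:src + length]


# Bytes 0..15 hold a 4x4 row-major matrix; bytes 16..19 are a destination buffer.
memory = bytearray(range(16)) + bytearray(4)

# Copy the 2x2 tile whose top-left element is at row 1, column 1 (address 5),
# packing it densely into the destination buffer.
descriptor = list(expand_2d_transfer(src_base=5, dst_base=16, row_bytes=2,
                                     rows=2, src_stride=4, dst_stride=2))
run_transfers(memory, descriptor)
tile = list(memory[16:20])
```

Splitting the expansion step from the copy step loosely mirrors the separation the abstract describes between handling transfer patterns and moving the data, although the real architecture is hardware and far more general.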

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement

Author: Benini Luca
Cavalcante Matheus
Besta Maciej
Hoefler Torsten
Iff Patrick
Fischer Tim
Publication date: 01/01/2023

2.5D integration is an important technique to tackle the growing cost of manufacturing chips in advanced technology nodes. This poses the challenge of providing high-performance inter-chiplet interconnects (ICIs). As the number of chiplets grows to tens or hundreds, it becomes infeasible to hand-optimize their arrangement in a way that maximizes the ICI performance. In this paper, we propose HexaMesh, an arrangement of chiplets that outperforms a grid arrangement both in theory (network diameter reduced by 42%; bisection bandwidth improved by 130%) and in practice (latency reduced by 19%; throughput improved by 34%). HexaMesh enables large-scale chiplet designs with high-performance ICIs.

ETHzürich Repository for Publications and Research Data

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Sparse Hamming Graph: A Customizable Network-on-Chip Topology

Author: Benini Luca
Cavalcante Matheus
Besta Maciej
Hoefler Torsten
Iff Patrick
Fischer Tim
Publication date: 01/01/2023

Chips with hundreds to thousands of cores require scalable networks-on-chip (NoCs). Customization of the NoC topology is necessary to reach the diverse design goals of different chips. We introduce sparse Hamming graph, a novel NoC topology with an adjustable cost-performance trade-off that is based on four NoC topology design principles we identified. To efficiently customize this topology, we develop a toolchain that leverages approximate floorplanning and link routing to deliver fast and accurate cost and performance predictions. We demonstrate how to use our methodology to achieve desired cost-performance trade-offs while outperforming established topologies in cost, performance, or both

ETHzürich Repository for Publications and Research Data

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna