Prediction horizon vs. efficiency of optimal dynamic thermal control policies in HPC nodes

Author: Andrea Bartolini
Luca Benini
Daniele Cesarini
Publication date: 01/01/2017

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Benefits in Relaxing the Power Capping Constraint

Author: Andrea Bartolini
Luca Benini
Daniele Cesarini
Publication date: 01/01/2017

In this manuscript we evaluate the impact of HW power capping mechanisms on a real scientific application executed in parallel. By comparing the HW capping mechanism against static frequency allocation schemes, we show that a speedup can be achieved if the power constraint is enforced on average over the application run, instead of over short time periods. RAPL, which enforces the power constraint on a time scale of a few ms, fails to share the power budget between more demanding and less demanding application phases.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

An optimized task-based runtime system for resource-constrained parallel accelerators

Author: Luca Benini
Andrea Marongiu
Daniele Cesarini
Publication date: 01/01/2016

Manycore accelerators have recently proven a promising solution for increasingly powerful and energy efficient computing systems. This raises the need for parallel programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering flexible support to fine-grained and irregular parallelism. However, efficiently supporting this programming paradigm on resource-constrained parallel accelerators is a challenging task. In this paper, we present an optimized implementation of the OpenMP tasking model for embedded parallel accelerators, discussing the key design solutions that guarantee a small memory footprint and minimize performance overheads. We validate our design by comparing to several state-of-the-art tasking implementations, using the most representative parallelization patterns. The experimental results confirm that our solution achieves near-ideal speedups for tasks as small as 5K cycles.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking

Author: Andrea Marongiu
Daniele Cesarini
Giuseppe Tagliavini
Publication date: 01/01/2018

In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the need for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workloads and real programs, and we compare to several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12× for real benchmarks, which is 60% higher than what we observe with the original Kalray OpenMP implementation.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities

Author: Hiari Pizzini Cavagna
Andrea Bartolini
Daniele Cesarini
Publication date: 01/01/2025

The increasing demand for generative AI as Large Language Models (LLMs) services has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of Grayskull’s execution model, grid size, matrix dimensions, data formats, and numerical precision impact on computational efficiency. Furthermore, we compare Grayskull’s performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Energy Saving and Thermal Management Opportunities in a Workload-Aware MPI Runtime for a Scientific HPC Computing Node

Author: Luca Benini
Andrea Bartolini
Daniele Cesarini
Publication date: 01/01/2017

With the advent of a new generation of supercomputers characterized by tightly-coupled integration of a large number of powerful processing cores in the same die, energy and temperature walls are looming threats to the growth in computational power. Scientific computing is characterized by a single application running in parallel on multiple nodes and cores until termination. The message-passing programming model is a widely adopted paradigm for explicitly handling data-sharing between processes of the same application. As an effect of the MPI communication patterns among different processes, the application is characterized by phases which can be exploited by the OS power manager. In addition, the large number of cores integrated in the same silicon die introduces large thermal capacitance as well as on-die thermal heterogeneity. Jointly exploiting local workload imbalance and computational node heterogeneity can open interesting opportunities for advanced thermal and energy management. In this paper, we present an exploratory work to assess these opportunities and their limiting factors. We analyze application workload and we identify opportunities to reduce energy consumption and their impact on performance. We test our methodology on a widely used quantum-chemistry application, demonstrating potential benefits of combining the application flow with power and thermal management strategies.

ETHzürich Repository for Publications and Research Data

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

COUNTDOWN - A Run-time Library for Application-agnostic Energy Saving in MPI Communication Primitives

Author: Carlo Cavazzoni
Piero Bonfà
Luca Benini
Andrea Bartolini
Daniele Cesarini
Publication date: 01/01/2018

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Evaluation of NTP/PTP Fine-Grain Synchronization Performance in HPC Clusters

Author: Luca Benini
Andrea Bartolini
Antonio Libri
Daniele Cesarini
Publication date: 01/01/2018

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Paving the Way Toward Energy-Aware and Automated Datacentre

Author: Luca Benini
Andrea Borghesi
Andrea Bartolini
Daniele Cesarini
Antonio Libri
Francesco Beneventi
Carlo Cavazzoni
Publication date: 01/01/2019

ETHzürich Repository for Publications and Research Data

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Countdown Slack: A Run-Time Library to Reduce Energy Footprint in Large-Scale MPI Applications

Author: Luca Benini
Andrea Borghesi
Andrea Bartolini
Daniele Cesarini
Mathieu Luisier
Carlo Cavazzoni
Publication date: 01/01/2020

The power consumption of supercomputers is a major challenge for system owners, users, and society. It limits the capacity of system installations, it requires large cooling infrastructures, and it is the cause of a large carbon footprint. Reducing power during application execution without changing the application source code or increasing time-to-completion is highly desirable in real-life high-performance computing scenarios. The power management run-time frameworks proposed in the last decade are based on the assumption that the duration of communication and application phases in an MPI application can be predicted and used at run-time to trade off communication slack with power consumption. In this manuscript, we first show that this assumption is too general and leads to mispredictions, slowing down applications, thereby jeopardizing the claimed benefits. We then propose a new approach based on (i) the separation of communication phases and slack during MPI calls and (ii) a timeout algorithm to cope with the hardware power management latency, which jointly make it possible to achieve performance-neutral power saving in MPI applications without requiring labor-intensive and risky application source code modifications. We validate our approach in a tier-1 production environment with widely adopted scientific applications. Our approach has a time-to-completion overhead lower than 1%, while it successfully exploits slack in communication phases to achieve an average energy saving of 10%. If we focus on large-scale application runs, the proposed approach achieves 22% energy saving with an overhead of only 0.4%. With respect to state-of-the-art approaches, COUNTDOWN Slack is the only one that always leads to an energy saving with negligible overhead (< 3%).

ETHzürich Repository for Publications and Research Data

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna