Prediction horizon vs. efficiency of optimal dynamic thermal control policies in HPC nodes
Benefits in Relaxing the Power Capping Constraint
In this manuscript we evaluate the impact of HW power capping mechanisms on a real scientific application that executes in parallel. By comparing the HW capping mechanism against static frequency allocation schemes, we show that a speedup can be achieved if the power constraint is enforced on average over the application run, instead of over short time periods. RAPL, which enforces the power constraint on a time scale of a few ms, fails to share the power budget between more demanding and less demanding application phases.
An optimized task-based runtime system for resource-constrained parallel accelerators
Manycore accelerators have recently proven a promising solution for increasingly powerful and energy-efficient computing systems. This raises the need for parallel programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering flexible support for fine-grained and irregular parallelism. However, efficiently supporting this programming paradigm on resource-constrained parallel accelerators is a challenging task. In this paper, we present an optimized implementation of the OpenMP tasking model for embedded parallel accelerators, discussing the key design solutions that guarantee a small memory footprint and minimize performance overheads. We validate our design by comparing it to several state-of-the-art tasking implementations, using the most representative parallelization patterns. The experimental results confirm that our solution achieves near-ideal speedups for tasks as small as 5K cycles.
Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking
In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the need for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workloads and real programs, and we compare it to several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12× for real benchmarks, which is 60% higher than what we observe with the original Kalray OpenMP implementation.
Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities
The increasing demand for generative AI as Large Language Model (LLM) services has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of Grayskull’s execution model, grid size, matrix dimensions, data formats, and numerical precision impact on computational efficiency. Furthermore, we compare Grayskull’s performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.
Energy Saving and Thermal Management Opportunities in a Workload-Aware MPI Runtime for a Scientific HPC Computing Node
With the advent of a new generation of supercomputers characterized by tightly-coupled integration of a large number of powerful processing cores in the same die, energy and temperature walls are looming threats to the growth in computational power.
Scientific computing is characterized by a single application running in parallel on multiple nodes and cores until termination. The message-passing programming model is a widely adopted paradigm for explicitly handling data sharing between processes of the same application. As an effect of the MPI communication patterns among different processes, the application is characterized by phases which can be exploited by the OS power manager. In addition, the large number of cores integrated in the same silicon die introduces large thermal capacitance as well as on-die thermal heterogeneity. Jointly exploiting local workload imbalance and computational node heterogeneity can open interesting opportunities for advanced thermal and energy management. In this paper, we present an exploratory work to assess these opportunities and their limiting factors. We analyze the application workload and identify opportunities to reduce energy consumption, along with their impact on performance. We test our methodology on a widely used quantum-chemistry application, demonstrating the potential benefits of combining the application flow with power and thermal management strategies.
COUNTDOWN - A Run-time Library for Application-agnostic Energy Saving in MPI Communication Primitives
Countdown Slack: A Run-Time Library to Reduce Energy Footprint in Large-Scale MPI Applications
The power consumption of supercomputers is a major challenge for system owners, users, and society. It limits the capacity of system installations, it requires large cooling infrastructures, and it is the cause of a large carbon footprint. Reducing power during application execution without changing the application source code or increasing time-to-completion is highly desirable in real-life high-performance computing scenarios.
The power management run-time frameworks proposed in the last decade are based on the assumption that the duration of communication and application phases in an MPI application can be predicted and used at run-time to trade off communication slack with power consumption. In this manuscript, we first show that this assumption is too general and leads to mispredictions, slowing down applications, thereby jeopardizing the claimed benefits. We then propose a new approach based on (i) the separation of communication phases and slack during MPI calls and (ii) a timeout algorithm to cope with the hardware power management latency, which jointly make it possible to achieve performance-neutral power saving in MPI applications without requiring labor-intensive and risky application source code modifications. We validate our approach in a tier-1 production environment with widely adopted scientific applications. Our approach has a time-to-completion overhead lower than 1%, while it successfully exploits slack in communication phases to achieve an average energy saving of 10%. If we focus on large-scale application runs, the proposed approach achieves 22% energy saving with an overhead of only 0.4%. With respect to state-of-the-art approaches, COUNTDOWN Slack is the only one that always leads to an energy saving with negligible overhead (< 3%).
