1,446 research outputs found

    Benefits in Relaxing the Power Capping Constraint

    Full text link
    In this manuscript we evaluate the impact of HW power capping mechanisms on a real scientific application that executes in parallel. By comparing the HW capping mechanism against static frequency allocation schemes, we show that a speedup can be achieved if the power constraint is enforced on average over the application run instead of over short time periods. RAPL, which enforces the power constraint on a time scale of a few ms, fails to share the power budget between more demanding and less demanding application phases.
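The contrast between a short-window cap and an average cap can be illustrated with a toy model. The sketch below is not the paper's methodology: it assumes that a phase's speed scales linearly with the power it is granted, and the phase durations, power figures, and function names are hypothetical values chosen only for illustration.

```python
# Toy model (assumption): a phase's speed scales linearly with granted power.
# Each phase is (seconds at full speed, watts demanded at full speed).
# All numbers are hypothetical, not measurements from the paper.
PHASES = [(10.0, 60.0), (10.0, 100.0)]
CAP_WATTS = 80.0


def runtime_windowed_cap(phases, cap_watts):
    """Runtime when the cap is enforced in every short window.

    A phase that demands more than the cap is slowed down; a phase that
    demands less cannot lend its unused budget to another phase.
    """
    total = 0.0
    for seconds, watts in phases:
        granted = min(watts, cap_watts)
        total += seconds * watts / granted
    return total


def average_power(phases):
    """Average power over the whole run when every phase runs at full speed."""
    energy = sum(seconds * watts for seconds, watts in phases)
    duration = sum(seconds for seconds, _ in phases)
    return energy / duration


windowed = runtime_windowed_cap(PHASES, CAP_WATTS)
uncapped = sum(seconds for seconds, _ in PHASES)
print("windowed cap runtime:", windowed)
print("average power at full speed:", average_power(PHASES))
print("runtime if only the average must meet the cap:", uncapped)
```

In this hypothetical case the full-speed run already meets the cap on average, so enforcing the constraint on average costs no time, while the windowed cap stretches the demanding phase.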

    An optimized task-based runtime system for resource-constrained parallel accelerators

    No full text
    Manycore accelerators have recently proven to be a promising solution for increasingly powerful and energy-efficient computing systems. This raises the need for parallel programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering flexible support for fine-grained and irregular parallelism. However, efficiently supporting this programming paradigm on resource-constrained parallel accelerators is a challenging task. In this paper, we present an optimized implementation of the OpenMP tasking model for embedded parallel accelerators, discussing the key design solutions that guarantee a small memory footprint and minimize performance overheads. We validate our design by comparing it to several state-of-the-art tasking implementations, using the most representative parallelization patterns. The experimental results confirm that our solution achieves near-ideal speedups for tasks as small as 5K cycles.

    Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking

    Full text link
    In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the need for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workloads and real programs, and we compare it to several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12× for real benchmarks, which is 60% higher than what we observe with the original Kalray OpenMP implementation.

    Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities

    No full text
    The increasing demand for generative AI services such as Large Language Models (LLMs) has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of Grayskull's execution model, grid size, matrix dimensions, data formats, and numerical precision impact on computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.

    Energy Saving and Thermal Management Opportunities in a Workload-Aware MPI Runtime for a Scientific HPC Computing Node

    No full text
    With the advent of a new generation of supercomputers characterized by tightly-coupled integration of a large number of powerful processing cores on the same die, energy and temperature walls are looming threats to the growth in computational power. Scientific computing is characterized by a single application running in parallel on multiple nodes and cores until termination. The message-passing programming model is a widely adopted paradigm for explicitly handling data sharing between processes of the same application. As an effect of the MPI communication patterns among different processes, the application is characterized by phases that can be exploited by the OS power manager. In addition, the large number of cores integrated on the same silicon die introduces large thermal capacitance as well as on-die thermal heterogeneity. Jointly exploiting local workload imbalance and computational node heterogeneity can open interesting opportunities for advanced thermal and energy management. In this paper, we present an exploratory work to assess these opportunities and their limiting factors. We analyze the application workload and identify opportunities to reduce energy consumption, together with their impact on performance. We test our methodology on a widely used quantum-chemistry application, demonstrating the potential benefits of combining the application flow with power and thermal management strategies.

    Countdown Slack: A Run-Time Library to Reduce Energy Footprint in Large-Scale MPI Applications

    Full text link
    The power consumption of supercomputers is a major challenge for system owners, users, and society. It limits the capacity of system installations, it requires large cooling infrastructures, and it is the cause of a large carbon footprint. Reducing power during application execution without changing the application source code or increasing time-to-completion is highly desirable in real-life high-performance computing scenarios. The power management run-time frameworks proposed in the last decade are based on the assumption that the duration of communication and application phases in an MPI application can be predicted and used at run-time to trade off communication slack with power consumption. In this manuscript, we first show that this assumption is too general and leads to mispredictions, slowing down applications, thereby jeopardizing the claimed benefits. We then propose a new approach based on (i) the separation of communication phases and slack during MPI calls and (ii) a timeout algorithm to cope with the hardware power management latency, which jointly make it possible to achieve performance-neutral power saving in MPI applications without requiring labor-intensive and risky application source code modifications. We validate our approach in a tier-1 production environment with widely adopted scientific applications. Our approach has a time-to-completion overhead lower than 1%, while it successfully exploits slack in communication phases to achieve an average energy saving of 10%. If we focus on large-scale application runs, the proposed approach achieves 22% energy saving with an overhead of only 0.4%. With respect to state-of-the-art approaches, COUNTDOWN Slack is the only one that always leads to an energy saving with negligible overhead (<3%).
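The timeout idea mentioned in this abstract can be sketched in a few lines. This is not the COUNTDOWN Slack implementation: it assumes a simplified policy in which a low-power state is requested only once a wait has lasted longer than a fixed timeout, so short waits never pay the hardware transition latency. The function name, timeout value, and wait durations are hypothetical.

```python
# Simplified timeout policy (assumption, not the library's actual code):
# during a wait inside an MPI call, request a low-power state only after
# the wait has already lasted `timeout_us` microseconds.
# All durations are hypothetical and expressed in microseconds.
TIMEOUT_US = 500
WAIT_DURATIONS_US = [100, 2000, 50, 1500]


def low_power_time_us(wait_durations_us, timeout_us):
    """Total time spent in the low-power state under the timeout policy."""
    total = 0
    for duration in wait_durations_us:
        if duration > timeout_us:
            # Only the part of the wait beyond the timeout is spent in low power.
            total += duration - timeout_us
    return total


saved = low_power_time_us(WAIT_DURATIONS_US, TIMEOUT_US)
print("time in low-power state (us):", saved)
```

With these hypothetical values, the two short waits are ignored and only the remainder of the two long waits is spent in the low-power state.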