
    Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

    Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy-efficient than expert-crafted Intel CPU implementations. Comment: Accepted for publication in ACM Transactions on Reconfigurable Technology and Systems (TRETS).

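    A Learning-Based Methodology for Hybrid Mapping of Applications onto Heterogeneous MPSoCs (original title: Eine Lernbasierte Methodik für Hybride Abbildung von Anwendungen auf Heterogene MPSoCs)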

    Heterogeneous multiprocessor systems-on-chip (MPSoCs) have emerged as a solution to address both the physical limitations of miniaturizing processing cores and the growing computational demands of modern embedded systems. These MPSoCs incorporate numerous heterogeneous processing resources and must often manage applications that experience dynamic workload fluctuations during execution. Optimizing the execution of such dynamic real-time applications on large, heterogeneous architectures presents a complex multi-objective challenge. This challenge involves balancing task mapping and dynamic voltage and frequency scaling (DVFS) to minimize energy consumption while maintaining critical real-time properties, such as meeting application deadlines. This dissertation presents a novel solution to this challenge by proposing a learning-based, scalable, composable, and adaptive scenario-aware hybrid application mapping (HAM) methodology for heterogeneous tile-based MPSoCs. Unlike previous methods, our methodology does not presuppose functional knowledge about the application or its input data and is capable of handling unknown run-time workload profiles. The methodology tackles run-time uncertainty by learning patterns in the applications’ execution behavior at design time and then developing scalable run-time management strategies to adapt the mappings and DVFS settings to the current workload profiles. Adhering to the concept of composability, the proposed methodology allocates separate processing resources for each application, supporting high scalability and independent optimization. To combat the abundance of possible workload profiles elicited by the interplay of the concurrently running applications, the approach is moreover scenario-based. As existing approaches inadequately capture the wide range of workload profiles induced by different input data, this work utilizes the concept of data scenarios. 
These data scenarios explicitly encapsulate common workload patterns triggered by diverse input data and enable the efficient optimization of tailored mappings and DVFS settings. To determine data scenarios and simultaneously optimize scenario-specific mappings and DVFS settings, the proposed hybrid optimization scheme integrates a scenario-aware multi-objective design-time optimization phase. This approach resolves the interdependence between scenario, mapping, and DVFS exploration through its iterative multi-phase design. By optimizing synergistic mapping and DVFS settings at design time, the methodology facilitates scalable scenario-aware run-time management, achieving low energy consumption and low deadline miss rates despite an uncertain execution environment. The design-time phase establishes a foundation for our proposed scenario-aware run-time manager that dynamically adapts mappings and DVFS settings based on current workload profiles and data scenarios. To detect and select the best-suited data scenarios at run time, this thesis introduces learning-based scenario identification and scenario selection components. Based on the selected data scenarios, scenario-aware, learning-based operating point selection components determine high-quality operating points tailored to the execution environment. The dissertation proposes selection methodologies with varying degrees of run-time overhead and adaptivity, providing a range of different run-time trade-offs to efficiently handle diverse application and architecture types. The scenario and operating point selection components finely adapt their strategy to the run-time environment by learning generalizable features at design time. However, this can lead to subpar execution characteristics when the learned features diverge between the design-time training process and the run-time deployment of the manager.
To address this issue, this thesis proposes the first domain-adaptive hybrid application mapping (HAM) methodology that is resilient to divergences caused by shifts between the design-time and run-time environments. Another key contribution of this thesis is the development of an architectural transfer methodology to tackle architectural degradation and enable efficient architectural exploration. This transfer methodology determines mappings for novel target architecture candidates in seconds when optimized mappings for another MPSoC architecture are available. In summary, this thesis contributes to the management of multi-application workloads by proposing a scenario-aware run-time methodology for adaptive resource redistribution between applications based on their workloads. The proposed approach solves the fundamental problem of composable mapping methodologies, which often result in underprovisioning or overprovisioning processing resources due to dynamic workload fluctuations. The proposed multi-application tile mediation approach provides a holistic data-scenario-aware treatment of both intra-application and inter-application management. With its thorough experimental evaluation, this dissertation demonstrates significant improvements in energy efficiency and deadline adherence compared to existing approaches, confirming the efficacy of the proposed HAM scheme in managing dynamic workload variations in modern embedded systems.

    Software exploitation of traditional interfaces for modern technologies

    Modern computer technologies are advancing into realms that seemed unimaginable only years ago. Quantum-effect, petabyte-sized storage devices and deep cache hierarchies acting within nanoseconds are only a few examples. At the same time, the interfaces used to communicate with such technologies are settled and remain largely unaffected by this development. Loading and storing a word at a given memory address was the standard interface to memory devices in the very early stages of computer systems, and it retains a similar shape today. Unsurprisingly, modern computing technologies come with an increasing demand for management, such as lifetime management for non-volatile memory (NVM) or prefetching and eviction strategies for cache hierarchies. Leaving this management solely to the hardware provides only a limited design space and limited room for optimization. Consequently, software has to find ways to manage these technologies, directly or indirectly, over the traditional interfaces. This dissertation takes up this need and studies selected modern technologies and their need for management. It presents methods that systematically exploit existing traditional interfaces in order to provide extended functionality for the management of modern technologies. The exploitations in this thesis are conducted solely at the software level and do not require any changes to the available hardware. The first part targets memory technologies, studying non-volatile memory (NVM) in greater detail. This thesis discusses the lifetime issue of these technologies and the resulting need for wear-leveling. Various approaches are introduced that allow different forms of wear-leveling at different levels of the software stack. These range from wear-leveling procedures inside the operating system and the system software to direct application integration to extend the memory lifetime.
Apart from the lifetime issue, the latency and energy properties of a specific type of emerging memory, namely racetrack memory (RTM), are considered. Dedicated to the application of random forest (RF) models, the access properties are optimized directly at the application level. In the last part of this thesis, the focus moves from memories to arithmetic computation. Random forest (RF) models are kept as the target application, and their execution on modern computation technologies is considered. The usage of floating-point numbers is made a major focus, and the memory behavior of floating-point numbers is optimized. By proposing alternative computation schemes for floating-point numbers, which are realized entirely in software and leave the hardware untouched, a significant performance improvement is gained.

    Design and Code Optimization for Systems with Next-generation Racetrack Memories

    With the rise of computationally expensive application domains such as machine learning, genomics, and fluid simulation, the quest for performance and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for next-generation systems are also questionable. This has led to the emergence and rise of nonvolatile memory (NVM) technologies. Today, several NVM technologies in different development stages are competing for rapid access to the market. Racetrack memory (RTM) is one such nonvolatile memory technology that promises SRAM-comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, RTM is sequential in nature, i.e., data in an RTM cell needs to be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM, requiring at most one shift per access, can easily outperform SRAM. However, in the worst-case shifting scenario, RTM can be an order of magnitude slower than SRAM. This thesis presents an overview of the RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that allow the programmability and modeling of RTM-based systems. For shift minimization, we propose a set of techniques including optimal, near-optimal, and evolutionary algorithms for efficient scalar and instruction placement in RTMs. For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs.
We present an automatic compilation framework that analyzes static control flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural-level simulation. Finally, to demonstrate the RTM potential in non-Von-Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional Von-Neumann models and state-of-the-art accelerators.
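
The link between data placement and shift count described in this abstract can be illustrated with a small sketch. It assumes a deliberately simplified model that is not taken from the thesis: a single track with one access port, where moving from one variable to the next costs the distance between their positions. The names `shift_cost` and `best_placement` are hypothetical, and the exhaustive search merely stands in for the optimal, near-optimal, and evolutionary algorithms the thesis mentions.

```python
from itertools import permutations


def shift_cost(placement, accesses):
    """Total shifts for an access sequence on one track with one port.

    Simplifying assumption: the port starts aligned with the first
    accessed variable, and each access costs the distance moved.
    """
    position = {var: idx for idx, var in enumerate(placement)}
    port = position[accesses[0]]
    total = 0
    for var in accesses:
        total += abs(position[var] - port)
        port = position[var]
    return total


def best_placement(variables, accesses):
    """Exhaustive search over placements; only feasible for a few variables."""
    return min(permutations(variables), key=lambda p: shift_cost(p, accesses))


accesses = ["a", "c", "a", "c", "b", "c"]
naive = ("a", "b", "c")
naive_cost = shift_cost(naive, accesses)
best = best_placement(("a", "b", "c"), accesses)
best_cost = shift_cost(best, accesses)
print(naive, naive_cost, best, best_cost)
```

Under this toy model, placing the frequently alternating variables next to each other lowers the shift count, which is the intuition behind placement and layout optimizations for RTM.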

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
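
The difference between the two counting schemes compared in this abstract can be sketched as follows. This is a minimal illustration and not the study's procedure: the function name `cocitation_counts`, the placeholder author names, and the choice to count co-authors of the same cited work as co-cited with each other are all assumptions made here.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count how often two authors are cited together by the same paper.

    citing_papers: list of reference lists; each reference is the ordered
    author list of one cited work. max_authors=1 gives first-author
    counting; max_authors=5 considers the first five authors.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered author pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Two hypothetical citing papers with placeholder author names.
citing_papers = [
    [["A", "B"], ["C"]],
    [["B", "A"], ["C", "D"]],
]
first_author = cocitation_counts(citing_papers, max_authors=1)
first_five = cocitation_counts(citing_papers, max_authors=5)
print(first_author)
print(first_five)
```

In this toy data, first-author counting never pairs A with B, while the first-five scheme does, which shows how the choice of counting rule changes the input to all later analyses.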

    Improving Model-Based Software Synthesis: A Focus on Mathematical Structures

    Computer hardware keeps increasing in complexity. Software design needs to keep up with this. The right models and abstractions empower developers to leverage the novelties of modern hardware. This thesis deals primarily with Models of Computation, as a basis for software design, in a family of methods called software synthesis. We focus on Kahn Process Networks and dataflow applications as abstractions, both for programming and for deriving an efficient execution on heterogeneous multicores. The latter we accomplish by exploring the design space of possible mappings of computation and data to hardware resources. Mapping algorithms are not at the center of this thesis, however. Instead, we examine the mathematical structure of the mapping space, leveraging its inherent symmetries or geometric properties to improve mapping methods in general. This thesis thoroughly explores the process of model-based design, aiming to go beyond the more established software synthesis on dataflow applications. We start with the problem of assessing these methods through benchmarking, and go on to formally examine the general goals of benchmarks. In this context, we also consider the role modern machine learning methods play in benchmarking. We explore different established semantics, stretching the limits of Kahn Process Networks. We also discuss novel models, like Reactors, which are designed to be a deterministic, adaptive model with time as a first-class citizen. By investigating abstractions and transformations in the Ohua language for implicit dataflow programming, we also focus on programmability. The focus of the thesis is on the models and methods, but we evaluate them in diverse use cases, generally centered around Cyber-Physical Systems. These include the 5G telecommunication standard as well as the automotive and signal processing domains. We even go beyond embedded systems and discuss use cases in GPU programming and microservice-based architectures.

    Deterministic Reactive Programming for Cyber-physical Systems

    Today, cyber-physical systems (CPSs) are ubiquitous. Whether it is robotics, electric vehicles, the smart home, autonomous driving, or smart prosthetics, CPSs shape our day-to-day lives. Yet, designing and programming CPSs becomes ever more challenging as the overall complexity of systems increases. CPSs need to interface (potentially distributed) computation with concurrent processes in the physical world while fulfilling strict safety requirements. Modern and popular frameworks for designing CPS applications, such as ROS and AUTOSAR, address the complexity challenges by emphasizing scalability and reactivity. This, however, comes at the cost of compromising determinism and the time predictability of applications, which ultimately compromises safety. This thesis argues that this compromise is not a necessity and demonstrates that scalability can be achieved while ensuring a predictable execution. At the core of this thesis is the novel reactor model of computation (MoC) that promises to provide timed semantics, reactivity, scalability, and determinism. A comprehensive study of related models indicates that there is indeed no other MoC that provides similar properties. The main contribution of this thesis is the introduction of a complete set of tools that make the reactor model accessible for CPS design and a demonstration of their ability to facilitate the development of scalable deterministic software. After introducing the reactor model, we discuss its key principles and utility through an adaptation of reactors in the DEAR framework. This framework integrates reactors with a popular runtime for adaptive automotive applications developed by AUTOSAR. An existing AUTOSAR demonstrator application serves as a case study that exposes the problem of nondeterminism in modern CPS frameworks. We show that the reactor model and its implementation in the DEAR framework are applicable for achieving determinism in industrial use cases.
Building on the reactor model, we introduce the polyglot coordination language Lingua Franca (LF), which enables the definition of reactor programs independent of a concrete target programming language. Based on the DEAR framework, we develop a full-fledged C++ reactor runtime and a code generation backend for LF. Various use cases studied throughout the thesis illustrate the general applicability of reactors and LF to CPS design, and a comprehensive performance evaluation using an optimized version of the C++ reactor runtime demonstrates the scalability of LF programs. We also discuss some limitations of the current scheduling mechanisms and show how they can be overcome by partitioning programs. Finally, we consider design space exploration (DSE) techniques to further improve the scalability of LF programs and manage hardware complexity by automating the process of allocating hardware resources to specific components in the program. This thesis contributes the Mocasin framework, a modular platform for prototyping and researching DSE flows. While a concrete integration with LF remains for future work, Mocasin provides a foundation for exploring DSE in Lingua Franca.

    Towards Implicit Parallel Programming for Systems

    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard, especially when the program requires state, which many system programs use for optimization, such as a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realm of an associated optimizing compiler. This prevents the compiler from introducing parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design to provide a new programming model for implicit parallel programming with state and a compiler that can optimize the program for parallel execution. We define the notion of a stateful function along with its composition and control structures. An example implementation of a highly scalable server shows that stateful functions integrate smoothly into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows existing code bases to be adapted gradually. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific semantics-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O, and allows for non-blocking live updates.