1,720,962 research outputs found
Flexible task-DAG management in PHAST library: Data-parallel tasks and orchestration support for heterogeneous systems
Heterogeneous architectures proved successful in achieving unprecedented performance and energy-efficiency. However, taking advantage of these diverse processing elements is still hard. Programmers need to code through the different approaches suitable for each target architecture and need to decide the distribution of activities on the different resources. The majority of current frameworks focuses on either performance or productivity. The former mainly provides low-level target-specific programming interfaces, and the latter offers high-level tools that often fail in achieving high-performance. In both cases, the design is usually data-parallel, as task-parallelism is not supported. In this work, we propose a task-based solution within the data-parallel heterogeneous single-source PHAST library. Tasks can be coded in a target-agnostic fashion, can be compiled and parallelized on multi-core CPUs and NVIDIA GPUs automatically and support the choice of the execution platform at runtime. We evaluate the capabilities of the proposed task-directed acyclic graph support in case of an extensive set of randomly generated task-based applications with different sizes and characteristics. We compare it against a SYCL implementation in terms of performance and complexity metrics, highlighting that PHAST achieves about 1.56× and 2.60× speedup over SYCL for multi-core CPU and GPU, respectively, while improving also code complexity metrics
Task-DAG support in single-source PHAST library: Enabling flexible assignment of tasks to CPUs and GPUs in heterogeneous architectures
Nowadays, the majority of desktop, mobile, and embedded devices in the consumer and industrial markets are heterogeneous, as they contain at least multi-core CPU and GPU resources in the same system. However, exploiting the performance and energy-efficiency of these diverse processing elements does not come for free from a software point of view: programmers need to a) code each activity through the specific approaches, libraries, and frameworks suitable for their target architecture (e.g., CPUs and GPUs) along with the orchestration of such heterogeneous execution, and b) decide the distribution of sequential and parallel activities towards the different parallel hardware resources available. Current frameworks typically provide either low-abstraction-level target-specific and/or generic but not high-performance interfaces, which complicate the exploration of different task assignments, with DAG1 precedence relationship, to the available heterogeneous resources. To enable this, tasks would typically need to be coded one time for each target architecture due to the profound differences in their programming. In this work, we include the support of tasks and DAGs of data-parallel tasks within the single-source PHAST library, which currently supports both multi-core CPUs and NVIDIA GPUs, so that tasks are coded in a target-agnostic fashion and their targeting to multi-core or GPU architectures is automatic and efficient. The integration of this coding approach with tasks can help to postpone the choice of the execution platform for each task up to the testing, or even to the runtime, phase. Finally, we demonstrate the effects of this approach in the case of a sample image pipeline benchmark from the computer vision domain. We compare our implementation to a SYCL implementation from a productivity point of view. Also, we show that various task assignments can be seamlessly explored by implementing both the PEFT2 mapping technique along with an exhaustive search in the mapping space
Single-source library for enabling seamless assignment of data-parallel task-DAGs to CPUs and GPUs in heterogeneous architectures
Currently, the majority of devices is heterogeneous and comprises at least a multi-core CPU and a GPU. Exploiting these modules requires programmers to a) assign parallel activities to the different hardware resources, and b) code each activity through target-specific approaches x(e.g., multi-core CPUs and GPUs). Current frameworks do not provide high-productivity while guaranteeing performance comparable to low-level ones. Also, they often lack the easy exploration of possible task assignments to the heterogeneous resources, mainly because the assignment needs to be done in advance and the targets have software incompatibilities. This paper introduces task-DAGs (Directed Acyclic Graph) concepts into PHAST, a single-source data-parallel library that supports multi-core CPUs and NVIDIA GPUs. The resulting framework allows postponing the task-to-target assignment to the testing or runtime phase. In fact, each task is coded in a target-agnostic fashion and can be automatically optimized for a specific platform. The consequent productivity effects are shown for a computer-vision benchmark comparing to a SYCL implementation and performing an automatic task assignment for maximum resources utilization
Parallel programming in cyber-physical systems
The growing diffusion of heterogeneous Cyber-Physical Systems (CPSs) poses a problem of security. The employment of cryptographic strategies and techniques is a fundamental part in the attempt of finding a solution to it. Cryptographic algorithms, however, need to increase their security level due to the growing computational power in the hands of potential attackers. To avoid a consequent performance worsening and keep CPSs functioning and secure, these cryptographic techniques must be implemented so to exploit the aggregate computational power that modern parallel architectures provide. In this chapter we investigate the possibility to parallelize two very common basic operations in cryptography: modular exponentiation and Karatsuba multiplication. For the former, we propose two different techniques (m-ary and exponent slicing) that reduce calculation time of 30/40%. For the latter, we show various implementations of a 3-thread parallelization scheme that provide up to 60% better performance with respect to a sequential implementation
A survey on hardware accelerators: Taxonomy, trends, challenges, and perspectives
In recent years, the limits of the multicore approach emerged in the so-called “dark silicon” issue and diminishing returns of an ever-increasing core count. Hardware manufacturers, out of necessity, switched their focus to accelerators, a new paradigm that pursues specialization and heterogeneity over generality and homogeneity. They are special-purpose hardware structures separated from the CPU with aspects that exhibit a high degree of variability. We define a taxonomy based on fourteen of these aspects, grouped in four macro-categories: general aspects, host coupling, architecture, and software aspects. According to it, we categorize around 100 accelerators of the last decade from both industry and academia, and critically analyze emerging trends. We complete our discussion with throughput and efficiency figures. Then, we discuss some prominent open challenges that accelerators are facing, analyzing state-of-the-art solutions, and suggesting prospective research directions for the future
A General Framework for Accelerator Management Based on ISA Extension
Thanks to the promised improvements in performance and energy efficiency, hardware accelerators are taking momentum in many computing contexts, both in terms of variety and relative weight in the silicon area of many chips. Commonly, the way an application interacts with these hardware modules has many accelerator-specific traits and requires ad-hoc drivers that usually rely on potentially expensive system calls to manage accelerator resources and access orchestration. As a consequence, driver-based interfacing is far from uniform and can expose high latency, limiting the set of tasks suitable for acceleration. In this paper, we propose a uniform and low-latency interface based on Instruction Set Architecture (ISA) extension. All the previous studies that proposed extensions, were deeply tailored to address a single accelerator. One of the biggest disadvantages of those methods is their inability to scale. Adding more of these accelerators to one System-on-Chip (SoC) would result in ISA bloat, increasing power consumption and complexifying the decoding phase proportionally. Our proposed framework consists of a six-instruction ISA extension and the corresponding architectural support that implements the interface abstraction and the reservation logic at the hardware level. Our proposal allows controlling a broad class of integrated accelerators directly from the CPU. The proposed framework is ISA-independent, which means that it is applicable to all the existing ISAs. We implement it on the gem5 simulator by extending the RISC-V ISA. We evaluate it by simulating three compute-intensive accelerators and comparing our interfacing with a conventional driver-based one. The benchmarks highlight the performance benefits brought by our framework, with up to 10.38x speed up, as well as the ability to seamlessly support different accelerators with the same interface. The speed up advantage of our technique diminishes as the granularity of the workloads increases and the overhead for driver-based accelerators becomes less important. We also show that the impact of its hardware components on chip area and power consumption is limited
High-Performance Multi-Object Tracking for Autonomous Driving in Urban Scenarios with Heterogeneous Embedded Boards
Autonomous Driving is emerging as a paradigm shift in the way we conceive people and goods transportation. It promises to improve road safety, reduce traffic congestion, and increase overall transportation efficiency. It is made possible by a plethora of modern technologies, such as AI, low-power hardware, and complex computing. Common stages of Autonomous Driving systems are the identification of objects in the scene (Object Detection), and the ability to predict the evolution of the tracked objects' states - usually, positions and velocities (Multi-Object Tracking). In this paper, we address the issues behind high-performance Multi-Object Tracking (MOT) algorithms for a real-world urban transportation scenario. The main objective is to exploit the high-performance capabilities of NVIDIA heterogeneous embedded platforms, which are not automatically compatible with the peculiar features of these algorithms. We propose and discuss some non-trivial software design choices and implementation strategies that are needed to match the specific computational and memory access needs of MOT strategies, and the architectural opportunities of NVIDIA embedded GPGPU systems. Our code is made available as open-source. We compare our solution against a highly optimized multi-core version and show that we are able to trigger significant performance speedups (up to 7.19×) in all power-modes, sometimes even surpassing the CPU reference with the GPU in a lower-power operating mode
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
Performance portability in a real world application: PHAST applied to Caffe
This work covers the PHAST Library’s employment, a hardware-agnostic programming library, to a real-world application like the Caffe framework. The original implementation of Caffe consists of two different versions of the source code: one to run on CPU platforms and another one to run on the GPU side. With PHAST, we aim to develop a single-source code implementation capable of running efficiently on CPU and GPU. In this paper, we start by carrying out a basic Caffe implementation performance analysis using PHAST. Then, we detail possible performance upgrades. We find that the overall performance is dominated by few ‘heavy’ layers. In refining the inefficient parts of this version, we find two different approaches: improvements to the Caffe source code and improvements to the PHAST Library itself, which ultimately translates into improved performance in the PHAST version of Caffe. We demonstrate that our PHAST implementation achieves performance portability on CPUs and GPUs. With a single source, the PHAST version of Caffe provides the same or even better performance than the original version of Caffe built from two different codebases. For the MNIST database, the PHAST implementation takes an equivalent amount of time as native code in CPU and GPU. Furthermore, PHAST achieves a speedup of 51% and a 49% with the CIFAR-10 database against native code in CPU and GPU, respectively. These results provide a new horizon for software development in the upcoming heterogeneous computing era
- …
