AgrUNet: A Multi-GPU UNet Based Model for Crops Classification
Agriculture acts as a catalyst for comprehensive economic growth, boosting income levels, mitigating poverty, and combating hunger. For these reasons, it is important to monitor agricultural practices and the use of parcels carefully and automatically to support the sustainable use of natural resources. The deployment of high-resolution satellite missions, like Landsat and Copernicus Sentinel, combined with AI Deep Learning (DL) methodologies has revolutionized Earth Observation science, enabling studies on yield prediction, soil classification, and crop mapping over large areas, and the analysis and processing of Big Data using innovative approaches. This approach requires high-performance computing systems, since DL algorithms are known to be computationally heavy, and recent multi-GPU HPC systems can boost the processing power of classical CPU-only computing systems by one or two orders of magnitude. In this study, we develop AgrUNet, a scalable, fast, and reliable UNet-based DL model that performs crop classification on multispectral multitemporal satellite data, implemented and optimized to run on single- and multi-GPU HPC systems. Our model achieves a Dice score of approximately 0.90 and a peak throughput of 59 and 605 /s for the training and inference steps respectively, improving by approximately a factor of 7 on the best results reported in the literature, with nearly ideal speedup on both 4X V100 and 8X A100 GPU systems.
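For readers unfamiliar with the Dice score quoted in this abstract, the following is a minimal sketch of the standard definition, Dice = 2|A∩B| / (|A| + |B|), averaged over classes. It is not taken from the AgrUNet code: the function names `dice_score` and `mean_dice`, the flat label lists, and the convention of scoring an empty-versus-empty class as 1.0 are all assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # |A ∩ B|: pixels marked in both masks
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # |A| + |B|: pixels marked in each mask, counted separately
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks agree perfectly
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, classes):
    """Average the per-class Dice score over the given class labels."""
    scores = []
    for c in classes:
        pred_mask = [1 if p == c else 0 for p in pred_labels]
        target_mask = [1 if t == c else 0 for t in target_labels]
        scores.append(dice_score(pred_mask, target_mask))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical 6-pixel label maps with three crop classes (0, 1, 2)
    predicted = [0, 0, 1, 1, 2, 2]
    reference = [0, 1, 1, 1, 2, 0]
    print(mean_dice(predicted, reference, classes=[0, 1, 2]))
```

A score of 1.0 means the predicted and reference masks coincide, and 0.0 means they do not overlap at all.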
A Portable OpenCL Lattice Boltzmann Code for Multi- and Many-core Processor Architectures
The architecture of high performance computing systems is becoming more and more heterogeneous, as accelerators play an increasingly important role alongside traditional CPUs. Programming heterogeneous systems efficiently is a complex task that often requires the use of specific programming environments. Programming frameworks supporting codes portable across different high performance architectures have recently appeared, but one must carefully assess the relative costs of portability versus computing efficiency, and find a reasonable tradeoff point. In this paper we address precisely this issue, using as test-bench a Lattice Boltzmann code implemented in OpenCL. We analyze its performance on several different state-of-the-art processors: NVIDIA GPUs and Intel Xeon-Phi many-core accelerators, as well as more traditional Ivy Bridge and Opteron multi-core commodity CPUs. We also compare with results obtained with codes specifically optimized for each of these systems. Our work shows that a properly structured OpenCL code runs on many different systems, reaching performance levels close to those obtained by architecture-tuned CUDA or C codes.
Energy-Efficiency Evaluation of FPGAs for Floating-Point Intensive Workloads
In this work we describe a method to measure the computing performance and energy efficiency to be expected of an FPGA device. The motivation for this work is the possible use of FPGAs as accelerators for floating-point intensive HPC workloads. In the past, FPGA devices were not considered an efficient option for floating-point intensive computations, but more recently, with the advent of dedicated DSP units and the increased amount of resources in each chip, interest in these devices has risen. Another obstacle to a wide adoption of FPGAs in the HPC field has been the low-level hardware knowledge commonly required to program them using Hardware Description Languages (HDLs). This issue has also been mitigated recently by the introduction of higher-level programming frameworks adopting so-called High Level Synthesis approaches, which reduce development time and narrow the gap between the skills required to program FPGAs and the skills commonly held by HPC software developers. In this work we apply the proposed method to estimate the maximum floating-point performance and energy efficiency of the FPGA embedded in a Xilinx Zynq Ultrascale+ MPSoC hosted on a Trenz board.
Performance and Power Analysis of HPC Workloads on Heterogenous Multi-Node Clusters
Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, allowing for application optimizations. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques can be applied to different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU. The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [17], grant agreements n. 288777, 610402 and 671697. E.C. was partially funded by “Contributo 5 per mille assegnato all’Università degli Studi di Ferrara - dichiarazione dei redditi dell’anno 2014”. We thank the University of Ferrara and INFN Ferrara for the access to the COKA Cluster. We warmly thank the BSC tools group for supporting the smooth integration and testing of our setup within Extrae and Paraver. Peer Reviewed. Postprint (published version).
Porting a Lattice Boltzmann Simulation to FPGAs Using OmpSs
Reconfigurable computing, exploiting Field Programmable Gate Arrays (FPGAs), has become of great interest for both academic and industrial research thanks to the possibility of greatly accelerating a variety of applications. The interest has been further boosted by recent developments in FPGA programming frameworks, which allow applications to be designed at a higher level of abstraction, for example using directive-based approaches.
In this work we describe our first experiences in porting an HPC application to FPGAs. The application simulates the Rayleigh-Taylor instability of fluids with different density and temperature using Lattice Boltzmann Methods. This activity is carried out in the context of the FET HPC H2020 EuroEXA project, which is developing an energy-efficient HPC system, at exa-scale level, based on Arm processors and FPGAs. In this work we use the OmpSs directive-based programming model, one of the models available within the EuroEXA project. OmpSs is developed by the Barcelona Supercomputing Center (BSC) and makes it possible to target FPGA devices as accelerators, as well as commodity CPUs and GPUs, enabling code portability across different architectures. In particular, we describe the initial porting of this application, evaluating the programming effort required and assessing the preliminary performance on a Trenz development board hosting a Xilinx Zynq UltraScale+ MPSoC embedding 16nm FinFET+ programmable logic and a multi-core Arm CPU.
Optimization of lattice Boltzmann simulations on heterogeneous computers
High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts that enable the code to exploit the different parallel and vector options of the various accelerators efficiently, and that match the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.
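The idea of predicting a balanced host/accelerator partitioning can be illustrated with a deliberately simple model. This is not the model defined in the paper: it is a sketch that assumes each device processes lattice sites at a constant rate, that host and accelerator run fully concurrently, and that communication costs are negligible. The names `balanced_split` and `predicted_time` are hypothetical.

```python
def balanced_split(host_rate, accel_rate):
    """Fraction of the lattice to assign to the host so that both devices
    finish at the same time, assuming constant rates (sites per second)."""
    if host_rate <= 0 or accel_rate <= 0:
        raise ValueError("rates must be positive")
    return host_rate / (host_rate + accel_rate)


def predicted_time(n_sites, host_rate, accel_rate, host_fraction):
    """Time per step when host and accelerator work concurrently:
    the slower of the two partitions determines the total."""
    t_host = n_sites * host_fraction / host_rate
    t_accel = n_sites * (1.0 - host_fraction) / accel_rate
    return max(t_host, t_accel)


if __name__ == "__main__":
    # Hypothetical rates: the accelerator is three times faster than the host
    n_sites, host_rate, accel_rate = 400, 1.0, 3.0
    f = balanced_split(host_rate, accel_rate)
    print("host fraction:", f)
    print("balanced time:", predicted_time(n_sites, host_rate, accel_rate, f))
    print("offload-only time:", predicted_time(n_sites, host_rate, accel_rate, 0.0))
```

Under these assumptions the balanced split always does at least as well as offloading everything to the accelerator, which is the motivation the abstract gives for using the host CPUs as well.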
Energy-Performance Tradeoffs for HPC Applications on Low Power Processors
Energy efficiency is becoming more and more important in the HPC field; high-end processors are quickly evolving towards more advanced power-saving and power-monitoring technologies. On the other hand, low-power processors, designed for the mobile market, attract interest in the HPC area for their increasing computing capabilities, competitive pricing and low power consumption. In this work we study the energy and computing performance of a Tegra K1 mobile processor using an HPC Lattice Boltzmann application as a benchmark. We run this application on the ARM Cortex-A15 CPU and on the GK20A GPU, both available in this processor. Our analysis uses time-accurate measurements, obtained by a simple custom-developed current monitor. We discuss several energy and performance metrics, interesting per se and also in view of a prospective use of these processors in an HPC context.
Advanced Performance Analysis of HPC Workloads on Cavium ThunderX
The interest towards Arm-based platforms as HPC solutions has increased significantly during the last 5 years. In this paper we show that, in contrast to the early days of pioneering tests, several application performance analysis techniques can now also be applied to Arm-based SoCs. To show the possibilities offered by the available tools, we provide as an example the analysis of a Lattice Boltzmann HPC production code, highly optimized for several architectures and now also ported to Armv8. We tested it on a system based on production silicon, the Cavium CN8890 SoC. In particular, as performance analysis tools we adopt Extrae and Paraver, making use of the PAPI support, initially developed by us for the ThunderX platform and now also available upstream. The contribution of this paper is twofold: first, we demonstrate that performance analysis tools available on standard HPC platforms, independently of the CPU provider, are nowadays also available for Arm SoCs; second, we actually optimize an HPC application for this platform, showing similarities with other architectures. The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [15], grant agreements n. 288777, 610402 and 671697. E.C. was partially funded by “Contributo 5 per mille assegnato all’Università degli Studi di Ferrara - dichiarazione dei redditi dell’anno 2014”. Cavium Inc. has kindly supported this research by providing access to documentation and platforms. Postprint (author's final draft).
Multi-Node Advanced Performance and Power Analysis with Paraver
Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a preliminary performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. Moreover, we show how the same analysis techniques are applicable on different architectures, analyzing the same HPC application running on two clusters, based respectively on Intel Haswell and Arm Cortex-A57 CPUs. The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects, grant agreements n. 288777, 610402 and 671697. E.C. was partially funded by “Contributo 5 per mille assegnato all’Università degli Studi di Ferrara - dichiarazione dei redditi dell’anno 2014”. Peer Reviewed. Postprint (author's final draft).
Energy-Efficiency Evaluation of Intel KNL for HPC Workloads
In this work we focus on the energy performance of the Knights Landing Xeon Phi, the latest many-core architecture processor introduced by Intel for the HPC market. We consider the 64-core Xeon Phi 7230, and analyze its computing and energy efficiency using both the on-chip MCDRAM and the off-chip DDR4 memory as main storage for the application data domain. As a benchmark application we use a Lattice Boltzmann code heavily optimized for this architecture, and implemented using different memory data layouts to store the data domain. We then assess the energy consumption using different data layouts, memory configurations (DDR4 or MCDRAM), and numbers of threads per core.
- …
