Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning

Author: Hu Jia Cheng
Cavicchioli Roberto
Capotondi Alessandro
Publication date: 01/01/2023

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

On the effectiveness of OpenMP teams for cluster-based many-core accelerators

Author: Alessandro Capotondi
Andrea Marongiu
Publication date: 01/01/2016

With the introduction of more powerful and massively parallel embedded processors, embedded systems are becoming HPC-capable. Heterogeneous on-chip systems (SoCs) that couple a general-purpose host processor to a many-core accelerator are becoming increasingly widespread and provide tremendous peak performance/watt, well suited to executing HPC-class programs. The increased computation potential is, however, traded off for ease of programming. Application developers are required to manually outline the code parts suitable for acceleration, parallelize them efficiently over the many available cores, and orchestrate data transfers to/from the accelerator. In addition, since most many-cores are organized as a collection of clusters, featuring fast local communication but slow remote communication (i.e., to another cluster's local memory), the programmer must also map the parallel computation properly to avoid poor data locality. OpenMP v4.0 introduces new constructs for computation offloading, as well as directives to deploy parallel computation in a cluster-aware manner. In this paper we assess the effectiveness of OpenMP v4.0 at exploiting the massive parallelism available in embedded heterogeneous SoCs, comparing it to standard parallel loops over several computation-intensive applications from the linear algebra and image processing domains.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Enabling zero-copy OpenMP offloading on the PULP many-core accelerator

Author: Alessandro Capotondi
Andrea Marongiu
Publication date: 01/01/2017

Many-core heterogeneous designs are nowadays widely available among embedded systems. Initiatives such as the HSA push for a model where the host processor and the accelerator(s) communicate via coherent, Unified Virtual Memory (UVM). In this paper we describe our experience in porting the OpenMP v4 programming model to a low-end, heterogeneous embedded system based on the PULP many-core accelerator featuring lightweight (software-managed) UVM support. We describe a GCC-based toolchain which enables: i) the automatic generation of host and accelerator binaries from a single, high-level, OpenMP parallel program; ii) the automatic instrumentation of the accelerator program to transparently manage UVM. This enables up to 4× faster execution compared to traditional copy-based offload mechanisms.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Simplifying Many-Core-Based Heterogeneous SoC Programming with Offload Directives

Author: Tagliavini Giuseppe
Marongiu Andrea
Capotondi Alessandro
Benini Luca
Publication date: 01/01/2015

Multiprocessor systems-on-chip (MPSoC) are evolving into heterogeneous architectures based on one host processor plus many-core accelerators. While heterogeneous SoCs promise higher performance/watt, they are programmed at the cost of major code rewrites with low-level programming abstractions (e.g., OpenCL). We present a programming model based on OpenMP, with additional directives to program the accelerator from a single host program. As a test case, we evaluate an implementation of this programming model for the STMicroelectronics STHORM development board. We obtain near-ideal throughput for most benchmarks, performance very close to hand-optimized OpenCL codes at a significantly lower programming complexity, and up to 30× speedup versus host execution time.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

HERO: An open-source research platform for HW/SW exploration of heterogeneous manycore systems

Author: Marongiu A.
Capotondi Alessandro
Vogel P.
Benini L.
Kurth A.
Publication date: 01/01/2018

Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine “standard platform” software support (e.g., the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities. In this work, we present HERO, a HeSoC platform that tackles this challenge in a novel way. HERO’s host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source many-core processing engine, based on the extensible, open RISC-V ISA. We evaluate a prototype implementation of HERO, where the PMCA implemented on an FPGA fabric is coupled with a hard ARM Cortex-A host processor, and show that the run time overhead compared to manually written PMCA code operating on private physical memory is lower than 10% for pivotal benchmarks and operating conditions.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

An FPGA Overlay for Efficient Real-Time Localization in 1/10th Scale Autonomous Vehicles

Author: Bernardi Andrea
Brilli Gianluca
Burgio Paolo
Capotondi Alessandro
Marongiu Andrea
Publication date: 01/01/2022

Heterogeneous systems-on-chip (HeSoC) based on reconfigurable accelerators, such as Field-Programmable Gate Arrays (FPGA), represent an appealing option to deliver the performance/Watt required by the advanced perception and localization tasks employed in the design of Autonomous Vehicles. Unlike software-programmed GPUs, FPGA development involves significant hardware design effort, which in the context of HeSoCs is further complicated by the system-level integration of HW and SW blocks. High-Level Synthesis is increasingly being adopted to ease hardware IP design, allowing engineers to quickly prototype their solutions. However, automated tools still lack the required maturity to efficiently build the complex hardware/software interaction between the host CPU and the FPGA accelerator(s). In this paper we present a fully integrated system design where a particle filter for LiDAR-based localization is efficiently deployed as FPGA logic, while the rest of the compute pipeline executes on programmable cores. This design constitutes the heart of a fully functional 1/10th-scale autonomous racing car. In our design, accelerated IPs are controlled locally to the FPGA via a proxy core. Communication between the two and with the host CPU happens via shared memory banks also implemented as FPGA IPs. This allows for a scalable and easy-to-deploy solution from both the hardware and software viewpoints, while providing better performance and energy efficiency compared to state-of-the-art solutions.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Enabling Scalable and Fine-Grained Nested Parallelism on Embedded Many-cores

Author: Alessandro Capotondi
Luca Benini
Andrea Marongiu
Publication date: 01/01/2015

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Runtime Support for Multiple Offload-Based Programming Models on Embedded Manycore Accelerators

Author: Capotondi Alessandro
Haugou Germain
Marongiu Andrea
Benini Luca
Publication date: 01/01/2015

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

ShareBERT: Embeddings Are Capable of Learning Hidden Layers

Author: Berardinelli Giulia
Hu Jia Cheng
Cavicchioli Roberto
Capotondi Alessandro
Publication date: 01/01/2024

The deployment of Pre-trained Language Models in memory-limited devices is hindered by their massive number of parameters, which has motivated interest in developing smaller architectures. Established works in the model compression literature have shown that small models often exhibit a noticeable performance degradation and need to be paired with transfer learning methods, such as Knowledge Distillation. In this work, we propose a parameter-sharing method that consists of sharing parameters between embeddings and the hidden layers, enabling the design of near-zero parameter encoders. To demonstrate its effectiveness, we present an architecture design called ShareBERT, which can preserve up to 95.5% of BERT Base performance, using only 5M parameters (21.9× fewer parameters) without the help of Knowledge Distillation. We demonstrate empirically that our proposal does not negatively affect the model's learning capabilities and that it is even beneficial for representation learning. Code will be available at https://github.com/jchenghu/sharebert

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Association for the Advancement of Artificial Intelligence: AAAI Publications

A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment

Author: Alessandro Capotondi
Francesco Conti
Gianluca Bellocchi
Andrea Marongiu
Publication date: 01/01/2021

Modern cyber-physical systems (CPS) are increasingly adopting heterogeneous systems-on-chip (HeSoCs) as a computing platform to satisfy the demands of their sophisticated workloads. FPGA-based HeSoCs can reach high performance and energy efficiency at the cost of increased design complexity. High-Level Synthesis (HLS) can ease IP design, but automated tools still lack the maturity to efficiently and easily tackle system-level integration of the many hardware and software blocks included in a modern CPS. We present an innovative hardware overlay offering plug-and-play integration of HLS-compiled or handcrafted acceleration IPs thanks to a customizable wrapper that is attached to the overlay interconnect and provides shared-memory communication to the overlay cores. The latter are based on the open RISC-V ISA and offer simplified software management of the acceleration IP. Deploying the proposed overlay on a Xilinx ZU9EG shows ≈ 20% LUT usage and ≈ 4× speedup compared to program execution on the ARM host core.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia