Message from the Chairs

Authors: Luca Benini, Jörg Henkel, Sudhakar Yalamanchili, Davide Bertozzi
Publication date: 01/01/2015

Welcome to the 8th IEEE/ACM International Symposium on Networks-on-Chip (NOCS). NOCS is the premier event dedicated to interdisciplinary research on Networks-on-Chip innovations. It is a unique venue that brings together scientists and engineers from diverse but interrelated research communities, including computer architecture, general networking, circuits and systems, embedded systems, and design automation.

Archivio istituzionale della ricerca - Università di Ferrara

Designing next-generation smart sensor hubs for the Internet-of-Things

Author: Luca Benini
Publication venue: 5th IEEE International Workshop on Advances in Sensors and Interfaces (IWASI)
Publication date: 01/01/2013

Materializing the vision and the huge business opportunities offered by the Internet-of-Things requires a paradigm shift in sensor data processing, fusion, and understanding. Centralized approaches (sensors at the edges, with centralized intelligence in the cloud) are not scalable; hierarchical, distributed processing is a strong requirement. In this talk I will describe recent trends in the development of new computing platforms geared to distributed sensor data management, and discuss design challenges and research opportunities.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

StreamDrive: A Dynamic Dataflow Framework for Clustered Embedded Architectures

Authors: Arthur Stoutchinin, Luca Benini
Publication date: 01/01/2019

In this paper, we present StreamDrive, a dynamic dataflow framework for programming clustered embedded multicore architectures. StreamDrive simplifies development of dynamic dataflow applications starting from sequential reference C code and allows seamless handling of heterogeneous and application-specific processing elements by applications. We address issues of efficient implementation of the dynamic dataflow runtime system in the context of constrained embedded environments, which have not been sufficiently addressed by previous research. We conducted a detailed performance evaluation of the StreamDrive implementation on our Application Specific MultiProcessor (ASMP) cluster using the Oriented FAST and Rotated BRIEF (ORB) algorithm, which is typical of the image processing domain. We used the proposed incremental development flow to transform the original ORB reference C code into an optimized dynamic dataflow implementation. Our implementation has less than 10% parallelization overhead, shows near-linear speedup when the number of processors increases from 1 to 8, and achieves 15 VGA frames per second with a small cluster configuration of 4 processing elements and 64KB of shared memory, and 30 VGA frames per second with 8 processors and 128KB of shared memory.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

A retrospective look at xpipes: The exciting ride from a design experience to a design platform for nanoscale networks-on-chip

Authors: Luca Benini, Davide Bertozzi
Publication date: 01/01/2012

This paper provides a retrospective look at the xpipes framework and documents its evolution from a promising network-on-chip (NoC) design experience to a comprehensive design platform for the next generation of nanoscale NoCs. Since the early days of xpipes, its cross-layer approach to NoC design has fostered the development and maturity of circuits, architectures and design flows, thus rapidly bridging the gap between the NoC concept and viable interconnect technology for industrial uptake.

Crossref

Archivio istituzionale della ricerca - Università di Ferrara

ViT-LR: Pushing the Envelope for Transformer-Based On-Device Embedded Continual Learning

Authors: Alberto Dequino, Francesco Conti, Luca Benini
Publication date: 01/01/2022

State-of-the-Art Edge Artificial Intelligence (AI) is currently mostly targeted at a train-then-deploy paradigm: edge devices are exclusively responsible for inference, whereas training is delegated to data centers, leading to high energy and CO2 impact. On-Device Continual Learning could help make Edge AI more sustainable by specializing AI models directly in the field. We deploy a continual image recognition model on a Jetson Xavier NX embedded system and experimentally investigate how Attention influences performance and its viability as a Continual Learning backbone, analyzing the redundancy of its components in order to prune them and further improve the efficiency of our solution. We achieve up to 83.81% accuracy on the Core50 new instances and classes scenario, starting from a pre-trained tiny Vision Transformer, surpassing AR1*free with Latent Replay, and reach performance comparable or superior to the state of the art (SoA) without relying on growing Replay Examples.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Slotted ALOHA Overlay on LoRaWAN: a Distributed Synchronization Approach

Authors: Davide Brunelli, Luca Benini, Tommaso Polonelli
Publication date: 01/01/2018

LoRaWAN is one of the most promising standards for IoT applications. Nevertheless, the high density of end-devices expected for each gateway and the absence of an effective synchronization scheme between gateway and end-devices challenge the scalability of these networks. In this article, we propose to regulate the communication of LoRaWAN networks using Slotted ALOHA (S-ALOHA) instead of the classic ALOHA approach used by LoRa. The implementation is an overlay on top of the standard LoRaWAN; thus, no modification to pre-existing LoRaWAN firmware and libraries is necessary. Our method is based on a novel distributed synchronization service that is suitable for low-cost IoT end-nodes. S-ALOHA supported by our synchronization service significantly improves the performance of traditional LoRaWAN networks in terms of packet loss rate and network throughput.
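
As a rough illustration of why slotting can help, the sketch below evaluates the textbook throughput formulas for pure and slotted ALOHA under the usual idealized assumptions (Poisson arrivals, equal-length frames, no capture effect). It is not taken from the article and does not model LoRaWAN or the proposed synchronization service; the function names are hypothetical.

```python
import math


def pure_aloha_throughput(offered_load: float) -> float:
    """Textbook pure ALOHA throughput S = G * exp(-2G).

    A frame survives only if no other frame starts within two frame times.
    """
    return offered_load * math.exp(-2.0 * offered_load)


def slotted_aloha_throughput(offered_load: float) -> float:
    """Textbook slotted ALOHA throughput S = G * exp(-G).

    Aligning transmissions to slots halves the vulnerable period.
    """
    return offered_load * math.exp(-offered_load)


if __name__ == "__main__":
    # Offered load G is measured in frames per frame time (assumed values).
    for load in (0.25, 0.5, 1.0, 2.0):
        print(
            f"G={load:4.2f}  pure={pure_aloha_throughput(load):.3f}  "
            f"slotted={slotted_aloha_throughput(load):.3f}"
        )
```

In this idealized model the peak throughput is 1/(2e) at G = 0.5 for pure ALOHA and 1/e at G = 1 for slotted ALOHA; real networks may behave differently.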

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

A highly efficient, thread-safe software cache implementation for tightly-coupled multicore clusters

Authors: Luca Benini, Christian Pinto
Publication venue: 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors
Publication date: 01/01/2013

A widely adopted design paradigm for many-core accelerators features processing elements grouped in clusters. For reasons of area, power, and design simplicity, processors in the same cluster are often not equipped with data caches but rather share a tightly coupled data memory (TCDM). Even if the use of a TCDM is more energy and area efficient than a cache, it requires a higher programming effort because memory needs to be explicitly managed with DMA-based L3-to-TCDM copies. In this context, software caches can be used to automatically transfer data between the local TCDM and the external memory, simplifying the task of the programmer. In this paper we present an implementation of a software cache for the STMicroelectronics STHORM many-core accelerator, featuring an L1 TCDM shared by 16 processors in a cluster. Our main contribution is the design of a fast and thread-safe cache allowing parallel access from different processing elements inside the same cluster. We evaluate our implementation with micro-benchmarks as well as a real-world application from the computer vision domain. Results show that a software cache provides major performance improvements with respect to L3 allocation of large data structures, even when it is aggressively shared among many parallel threads.
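
The general idea of a thread-safe software cache can be sketched as follows. This is a minimal, read-only, direct-mapped example written for illustration; it is not the STHORM implementation described in the paper, and the class name, parameters, and per-line locking scheme are assumptions. A Python list stands in for external (L3) memory and the cached lines stand in for TCDM contents.

```python
import threading


class SoftwareCache:
    """Illustrative direct-mapped, read-only software cache (hypothetical design)."""

    def __init__(self, backing, num_lines=4, line_size=8):
        self.backing = backing            # stands in for slow external memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines    # backing line number held by each slot
        self.lines = [None] * num_lines   # cached data (stands in for the TCDM)
        # One lock per slot, so accesses to different slots do not serialize.
        self.locks = [threading.Lock() for _ in range(num_lines)]
        self.misses = 0
        self._stats_lock = threading.Lock()

    def read(self, address):
        line_no = address // self.line_size
        slot = line_no % self.num_lines
        with self.locks[slot]:
            if self.tags[slot] != line_no:
                # Miss: copy the whole line in (a DMA transfer on real hardware).
                start = line_no * self.line_size
                self.lines[slot] = list(self.backing[start:start + self.line_size])
                self.tags[slot] = line_no
                with self._stats_lock:
                    self.misses += 1
            return self.lines[slot][address % self.line_size]
```

For example, with `cache = SoftwareCache(list(range(64)))`, calling `cache.read(10)` misses once and a following `cache.read(11)` hits the same line. The refill here is synchronous: the caller waits while the line is copied.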

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Exploring DMA-assisted prefetching strategies for software caches on multicore clusters

Authors: Luca Benini, Christian Pinto
Publication date: 01/01/2014

Modern many-core programmable accelerators are often composed of several computing units grouped in clusters, with a shared per-cluster scratchpad data memory. The main programming challenge imposed by these architectures is to hide the latency of transfers from external memory to the on-chip scratchpad, overlapping memory transfers with actual computation as much as possible. This problem is usually tackled using complex DMA-based programming patterns (e.g. double buffering), which require heavy refactoring of applications. Software caches are an alternative to hand-optimized DMA programming. However, even if a software cache can reduce the programming effort, it still relies on synchronous memory transfers. In fact, on a cache miss, the new line is copied into the cache and the requesting processor has to wait for the transfer to complete. While waiting, processors are not able to perform any other computation. Cache line prefetching can be used to reduce the number of synchronous memory transfers and increase the active time of each processor by loading cache lines before they are actually needed. In this work we explore various DMA-based prefetching techniques applied to a software cache implementation, presenting both automatic and programmer-assisted prefetch mechanisms applied to computer vision kernels.
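
The overlap of transfers and computation mentioned above can be sketched with a simple next-block prefetch loop. This is an illustrative example only, not one of the techniques evaluated in the paper: a single worker thread stands in for the DMA engine, and `fetch_block` and `process_with_prefetch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(memory, index, block_size):
    """Stand-in for a DMA transfer from external memory to the scratchpad."""
    start = index * block_size
    return memory[start:start + block_size]


def process_with_prefetch(memory, block_size, compute):
    """Apply compute() to each block while the next block is fetched in the background."""
    num_blocks = (len(memory) + block_size - 1) // block_size
    results = []
    if num_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Start the first transfer before entering the loop.
        pending = dma.submit(fetch_block, memory, 0, block_size)
        for index in range(num_blocks):
            current = pending.result()  # blocks only if the transfer is unfinished
            if index + 1 < num_blocks:
                # Issue the next transfer, then compute on the current block.
                pending = dma.submit(fetch_block, memory, index + 1, block_size)
            results.append(compute(current))
    return results
```

For example, `process_with_prefetch(list(range(10)), 4, sum)` sums the blocks `[0..3]`, `[4..7]`, and `[8, 9]` in order. On real hardware the benefit depends on how transfer time compares with compute time per block.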

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Hierarchically Focused Guardbanding: An Adaptive Approach to Mitigate PVT Variations and Aging

Authors: Rajesh K. Gupta, Luca Benini, Abbas Rahimi
Publication date: 01/01/2013

This paper proposes a new model of functional units for variation-induced timing errors due to PVT variations and device Aging (PVTA). The model takes into account PVTA parameter variations, clock frequency, and the physical details of Placed-and-Routed (P&R) functional units in a 45nm TSMC analysis flow. Using this model and PVTA monitoring circuits, we propose Hierarchically Focused Guardbanding (HFG) as a method to adaptively mitigate PVTA variations. We demonstrate the effectiveness of HFG on a GPU architecture at two granularities of observation and adaptation: (i) fine-grained instruction-level; and (ii) coarse-grained kernel-level. Using coarse-grained PVTA monitors with kernel-level adaptation, the throughput increases by 70% on average. By comparison, instruction-by-instruction monitoring and adaptation enhances throughput by a factor of 1.8×–2.1×, depending on the configuration of PVTA monitors and the type of instructions executed in the kernels.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

COUNTDOWN - A Run-time Library for Application-agnostic Energy Saving in MPI Communication Primitives

Authors: Carlo Cavazzoni, Piero Bonfà, Luca Benini, Andrea Bartolini, Daniele Cesarini
Publication date: 01/01/2018

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna