Message from the Chairs
Welcome to the 8th IEEE/ACM International Symposium on Networks-on-Chip (NOCS). NOCS is the premier event dedicated to interdisciplinary research on Networks-on-Chip innovations. It is a unique venue that brings together scientists and engineers from diverse but inter-related research communities, including computer architecture, general networking, circuits and systems, embedded systems, and design automation
Designing next-generation smart sensor hubs for the Internet-of-Things
5th IEEE International Workshop on Advances in Sensors and Interfaces IWASI
Materializing the vision and the huge business opportunities offered by the Internet-of-Things requires a
paradigm shift in sensor data processing, fusion, and understanding. Centralized approaches (sensors at the
edges, with centralized intelligence in the cloud) are not scalable; hierarchical, distributed processing is a
strong requirement. In this talk I will describe recent trends in the development of new computing
platforms geared to distributed sensor data management, and discuss design challenges and research
opportunities
StreamDrive: A Dynamic Dataflow Framework for Clustered Embedded Architectures
In this paper, we present StreamDrive, a dynamic dataflow framework for programming clustered embedded multicore architectures. StreamDrive simplifies development of dynamic dataflow applications starting from sequential reference C code and allows seamless handling of heterogeneous and application-specific processing elements by applications. We address issues of efficient implementation of the dynamic dataflow runtime system in the context of constrained embedded environments, which have not been sufficiently addressed by previous research. We conducted a detailed performance evaluation of the StreamDrive implementation on our Application Specific MultiProcessor (ASMP) cluster using the Oriented FAST and Rotated BRIEF (ORB) algorithm, typical of the image processing domain. We have used the proposed incremental development flow to transform the original ORB reference C code into an optimized dynamic dataflow implementation. Our implementation has less than 10% parallelization overhead, shows near-linear speedup when the number of processors increases from 1 to 8, and achieves 15 VGA frames per second with a small cluster configuration of 4 processing elements and 64KB of shared memory, and 30 VGA frames per second with 8 processors and 128KB of shared memory
A retrospective look at xpipes: The exciting ride from a design experience to a design platform for nanoscale networks-on-chip
This paper provides a retrospective look at the xpipes framework, and documents its evolution from a promising network-on-chip (NoC) design experience to a comprehensive design platform for the next-generation of nanoscale NoCs. Since the early days of xpipes, its cross-layer approach to NoC design has fostered the development and maturity of circuits, architectures and design flows, thus rapidly bridging the gap between the NoC concept and viable interconnect technology for industrial uptake
ViT-LR: Pushing the Envelope for Transformer-Based On-Device Embedded Continual Learning
State-of-the-Art Edge Artificial Intelligence (AI) is currently mostly targeted at a train-then-deploy paradigm: edge devices are exclusively responsible for inference, whereas training is delegated to data centers, leading to high energy and CO2 impact. On-Device Continual Learning could help in making Edge AI more sustainable by specializing AI models directly on-field. We deploy a continual image recognition model on a Jetson Xavier NX embedded system, and experimentally investigate how Attention influences performance and its viability as a Continual Learning backbone, analyzing the redundancy of its components to prune and further improve our solution efficiency.
We achieve up to 83.81% accuracy on the Core50’s new instances and classes scenario, starting from a pre-trained tiny Vision Transformer, surpassing AR1*free with Latent Replay, and reach performance comparable and superior to the SoA without relying on growing Replay Examples
Slotted ALOHA Overlay on LoRaWAN: a Distributed Synchronization Approach
LoRaWAN is one of the most promising standards for IoT applications.
Nevertheless, the high density of end-devices expected for each gateway and the
absence of an effective synchronization scheme between gateway and end-devices
challenge the scalability of these networks. In this article, we propose to
regulate the communication of LoRaWAN networks using a Slotted-ALOHA (S-ALOHA)
scheme instead of the classic ALOHA approach used by LoRa. The implementation is an
overlay on top of the standard LoRaWAN; thus no modification in pre-existing
LoRaWAN firmware and libraries is necessary. Our method is based on a novel
distributed synchronization service that is suitable for low-cost IoT
end-nodes. S-ALOHA supported by our synchronization service significantly
improves the performance of traditional LoRaWAN networks regarding packet loss
rate and network throughput
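
As a rough illustration of why slotting helps, the classic textbook throughput models for pure and slotted ALOHA can be compared. This is a minimal sketch under the standard assumption of Poisson-distributed offered load; it is not the evaluation method of the article, and the function names are hypothetical.

```python
import math


def pure_aloha_throughput(g: float) -> float:
    """Textbook pure ALOHA throughput for offered load g (frames per frame time).

    A frame succeeds only if no other frame starts within a window of
    two frame times, hence the exp(-2g) factor.
    """
    return g * math.exp(-2.0 * g)


def slotted_aloha_throughput(g: float) -> float:
    """Textbook slotted ALOHA throughput for offered load g.

    Aligning transmissions to slot boundaries halves the vulnerable
    window to one frame time, hence the exp(-g) factor.
    """
    return g * math.exp(-g)


if __name__ == "__main__":
    for load in (0.25, 0.5, 1.0, 2.0):
        print(load, pure_aloha_throughput(load), slotted_aloha_throughput(load))
```

In these idealized models the peak throughput is 1/(2e) at g = 0.5 for pure ALOHA and 1/e at g = 1 for slotted ALOHA. Real LoRaWAN behavior also depends on effects the models ignore, such as duty-cycle limits and clock drift between end-devices.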
A highly efficient, thread-safe software cache implementation for tightly-coupled multicore clusters
2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors
A widely adopted design paradigm for many-core accelerators features processing elements grouped in clusters. For reasons of area, power, and design simplicity, processors in the same cluster are often not equipped with data caches but rather share a tightly coupled data memory (TCDM). Even though a TCDM is more energy and area efficient than a cache, it requires a higher programming effort because memory needs to be explicitly managed with DMA-based L3 to TCDM copies. In this context, software caches can be used to automatically transfer data between the local TCDM and the external memory, simplifying the task of the programmer. In this paper we present a software cache implementation for the STMicroelectronics STHORM many-core accelerator, featuring an L1 TCDM shared by 16 processors in a cluster. Our main contribution is the design of a fast and thread-safe cache allowing parallel access from different processing elements inside the same cluster. We evaluate our implementation with micro-benchmarks as well as a real-world application from the computer vision domain. Results show that a software cache provides major performance improvements with respect to L3 allocation of large data structures, even when it is aggressively shared among many parallel threads
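
The general idea of a thread-safe software cache can be sketched as follows. This is a minimal illustration, assuming a direct-mapped organization with one lock per cache line; it is not the STHORM implementation described in the paper. The class name, the parameters, and the use of a Python list to stand in for external (L3) memory are all hypothetical.

```python
import threading


class SoftwareCache:
    """Direct-mapped, read-only software cache (illustrative sketch only)."""

    def __init__(self, backing, num_lines=4, line_size=8):
        self.backing = backing          # stands in for external (L3) memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # block number held by each line
        self.lines = [None] * num_lines # line contents (stands in for TCDM)
        # One lock per line, so threads touching different lines do not
        # serialize on a single global lock.
        self.locks = [threading.Lock() for _ in range(num_lines)]
        self.stats_lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        block = addr // self.line_size
        idx = block % self.num_lines
        offset = addr % self.line_size
        with self.locks[idx]:
            hit = self.tags[idx] == block
            if not hit:
                # Miss: copy the whole line in (a DMA transfer on real hardware).
                start = block * self.line_size
                self.lines[idx] = self.backing[start:start + self.line_size]
                self.tags[idx] = block
            value = self.lines[idx][offset]
        with self.stats_lock:
            if hit:
                self.hits += 1
            else:
                self.misses += 1
        return value
```

Per-line locking is only one possible design choice. It keeps the lookup simple while still allowing parallel access to distinct lines from different processing elements.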
Exploring DMA-assisted prefetching strategies for software caches on multicore clusters
Modern many-core programmable accelerators are
often composed of several computing units grouped in clusters,
with a shared per-cluster scratchpad data memory. The main
programming challenge imposed by these architectures is to hide
the external memory to on-chip scratchpad memory transfer
latency, trying to overlap as much as possible memory transfers
with actual computation. This problem is usually tackled using
complex DMA-based programming patterns (e.g. double buffering),
which require a heavy refactoring of applications. Software
caches are an alternative to hand-optimized DMA programming.
However, even if a software cache can reduce the programming
effort, it still relies on synchronous memory transfers. In
fact, in case of a cache miss, the new line is copied into the cache
and the requesting processor has to wait for the completion of
the transfer. While waiting, processors are not able to perform
any other computation. Cache line prefetching can be used
to reduce the number of synchronous memory transfers, and
increase the active time of each processor, by loading cache lines
before they are actually needed. In this work we explore various
DMA-based prefetching techniques applied to a software cache
implementation, presenting both automatic and programmer-assisted
prefetch mechanisms applied to computer vision kernels
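
The effect of line prefetching on synchronous misses can be illustrated with a small idealized model. This is a sketch under strong assumptions: the cache is unbounded, and a prefetched line is assumed to always arrive before it is first needed. It implements a simple next-line policy, not any of the specific mechanisms explored in the paper, and the function name and parameters are hypothetical.

```python
def count_sync_misses(addresses, line_size=8, prefetch_next=False):
    """Count accesses that would stall on a synchronous line transfer.

    With prefetch_next=True, every access also requests the following
    line in the background, so a later access to it does not stall
    (assuming the transfer completes in time).
    """
    resident = set()      # block numbers currently held in the cache
    sync_misses = 0
    for addr in addresses:
        block = addr // line_size
        if block not in resident:
            sync_misses += 1          # processor waits for this line
            resident.add(block)
        if prefetch_next:
            resident.add(block + 1)   # asynchronous next-line prefetch
    return sync_misses


if __name__ == "__main__":
    scan = range(64)  # sequential scan over 8 lines of 8 elements
    print(count_sync_misses(scan))
    print(count_sync_misses(scan, prefetch_next=True))
```

For the sequential scan, only the very first access stalls when next-line prefetching is enabled. Irregular access patterns would benefit less, which is one reason programmer-assisted hints can be useful.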
Hierarchically Focused Guardbanding: An Adaptive Approach to Mitigate PVT Variations and Aging
This paper proposes a new model of functional units for variation-induced timing errors due to PVT variations and device Aging (PVTA). The model takes into account PVTA parameter variations, clock frequency, and the physical details of Placed-and-Routed (P&R) functional units in a 45nm TSMC analysis flow. Using this model and PVTA monitoring circuits, we propose Hierarchically Focused Guardbanding (HFG) as a method to adaptively mitigate PVTA variations. We demonstrate the effectiveness of HFG on a GPU architecture at two granularities of observation and adaptation: (i) fine-grained instruction-level; and (ii) coarse-grained kernel-level. Using coarse-grained PVTA monitors with kernel-level adaptation, the throughput increases by 70% on average. By comparison, instruction-by-instruction monitoring and adaptation enhances throughput by a factor of 1.8×–2.1×, depending on the configuration of PVTA monitors and the type of instructions executed in the kernels
COUNTDOWN - A Run-time Library for Application-agnostic Energy Saving in MPI Communication Primitives
- …
