
    Message from the Chairs

    Welcome to the 8th IEEE/ACM International Symposium on Networks-on-Chip (NOCS). NOCS is the premier event dedicated to interdisciplinary research on Networks-on-Chip innovations. It is a unique venue that brings together scientists and engineers from diverse, but inter-related research communities, including computer architecture, general networking, circuits and systems, embedded systems, and design automation.

    Designing next-generation smart sensor hubs for the Internet-of-Things
    5th IEEE International Workshop on Advances in Sensors and Interfaces (IWASI)

    Materializing the vision and the huge business opportunities offered by the Internet-of-Things requires a paradigm shift in sensor data processing, fusion, and understanding. Centralized approaches (sensors at the edges, with centralized intelligence in the cloud) are not scalable; hierarchical, distributed processing is a strong requirement. In this talk I will describe recent trends in the development of new computing platforms geared to distributed sensor data management, and discuss design challenges and research opportunities.

    StreamDrive: A Dynamic Dataflow Framework for Clustered Embedded Architectures

    In this paper, we present StreamDrive, a dynamic dataflow framework for programming clustered embedded multicore architectures. StreamDrive simplifies development of dynamic dataflow applications starting from sequential reference C code and allows seamless handling of heterogeneous and application-specific processing elements by applications. We address issues of efficient implementation of the dynamic dataflow runtime system in the context of constrained embedded environments, which have not been sufficiently addressed by previous research. We conducted a detailed performance evaluation of the StreamDrive implementation on our Application Specific MultiProcessor (ASMP) cluster using the Oriented FAST and Rotated BRIEF (ORB) algorithm, typical of the image processing domain. We used the proposed incremental development flow to transform the original ORB reference C code into an optimized dynamic dataflow implementation. Our implementation has less than 10% parallelization overhead, shows near-linear speedup when the number of processors increases from 1 to 8, and achieves 15 VGA frames per second with a small cluster configuration of 4 processing elements and 64KB of shared memory, and 30 VGA frames per second with 8 processors and 128KB of shared memory.

    A retrospective look at xpipes: The exciting ride from a design experience to a design platform for nanoscale networks-on-chip

    This paper provides a retrospective look at the xpipes framework, and documents its evolution from a promising network-on-chip (NoC) design experience to a comprehensive design platform for the next-generation of nanoscale NoCs. Since the early days of xpipes, its cross-layer approach to NoC design has fostered the development and maturity of circuits, architectures and design flows, thus rapidly bridging the gap between the NoC concept and viable interconnect technology for industrial uptake

    ViT-LR: Pushing the Envelope for Transformer-Based On-Device Embedded Continual Learning

    State-of-the-Art Edge Artificial Intelligence (AI) is currently mostly targeted at a train-then-deploy paradigm: edge devices are exclusively responsible for inference, whereas training is delegated to data centers, leading to high energy and CO2 impact. On-Device Continual Learning could help in making Edge AI more sustainable by specializing AI models directly on-field. We deploy a continual image recognition model on a Jetson Xavier NX embedded system, and experimentally investigate how Attention influences performance and its viability as a Continual Learning backbone, analyzing the redundancy of its components to prune and further improve our solution efficiency. We achieve up to 83.81% accuracy on the Core50’s new instances and classes scenario, starting from a pre-trained tiny Vision Transformer, surpassing AR1*free with Latent Replay, and reach performance comparable and superior to the SoA without relying on growing Replay Examples

    Slotted ALOHA Overlay on LoRaWAN: a Distributed Synchronization Approach

    LoRaWAN is one of the most promising standards for IoT applications. Nevertheless, the high density of end-devices expected for each gateway and the absence of an effective synchronization scheme between gateway and end-devices challenge the scalability of these networks. In this article, we propose to regulate the communication of LoRaWAN networks using Slotted-ALOHA (S-ALOHA) instead of the classic ALOHA approach used by LoRa. The implementation is an overlay on top of the standard LoRaWAN; thus no modification of pre-existing LoRaWAN firmware and libraries is necessary. Our method is based on a novel distributed synchronization service that is suitable for low-cost IoT end-nodes. S-ALOHA supported by our synchronization service significantly improves the performance of traditional LoRaWAN networks in terms of packet loss rate and network throughput.
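To make the ALOHA versus Slotted-ALOHA comparison concrete, the sketch below evaluates the classic textbook throughput formulas under a Poisson offered load G (frames per frame time). It is not the evaluation method of the paper: the function names are illustrative, and the model ignores LoRa-specific effects such as the capture effect, spreading factors, and duty-cycle limits.

```python
import math


def pure_aloha_throughput(g: float) -> float:
    """Textbook pure ALOHA: a frame survives if no other frame starts
    within a vulnerable window of two frame times, so S = G * exp(-2G)."""
    return g * math.exp(-2.0 * g)


def slotted_aloha_throughput(g: float) -> float:
    """Textbook slotted ALOHA: frames start only on slot boundaries, so the
    vulnerable window shrinks to one frame time and S = G * exp(-G)."""
    return g * math.exp(-g)


if __name__ == "__main__":
    # Compare both schemes over a few offered loads.
    for g in (0.25, 0.5, 1.0, 2.0):
        print(f"G={g:4.2f}  pure={pure_aloha_throughput(g):.3f}  "
              f"slotted={slotted_aloha_throughput(g):.3f}")
```

In this idealized model, pure ALOHA peaks at G = 0.5 with a throughput of 1/(2e), roughly 18.4%, while slotted ALOHA peaks at G = 1 with 1/e, roughly 36.8%. Synchronizing transmissions to slots is what buys that doubling of the theoretical maximum.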

    A highly efficient, thread-safe software cache implementation for tightly-coupled multicore clusters
    2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors

    A widely adopted design paradigm for many-core accelerators features processing elements grouped in clusters. For reasons of area, power, and design simplicity, processors in the same cluster are often not equipped with data caches but rather share a tightly coupled data memory (TCDM). Although a TCDM is more energy- and area-efficient than a cache, it requires a higher programming effort because memory needs to be explicitly managed with DMA-based L3-to-TCDM copies. In this context, software caches can be used to automatically transfer data between the local TCDM and the external memory, simplifying the task of the programmer. In this paper we present an implementation of a software cache for the STMicroelectronics STHORM many-core accelerator, featuring an L1 TCDM shared by 16 processors in a cluster. Our main contribution is the design of a fast and thread-safe cache allowing parallel access from different processing elements inside the same cluster. We evaluate our implementation with micro-benchmarks as well as a real-world application from the computer vision domain. Results show that a software cache provides major performance improvements with respect to L3 allocation of large data structures, even when it is aggressively shared among many parallel threads.
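The idea of a thread-safe software cache can be illustrated with a minimal sketch. It assumes a direct-mapped, read-only cache with one lock per line; the class name, parameters, and locking scheme are hypothetical and are not taken from the STHORM implementation. A Python list stands in for external (L3) memory, and a slice copy stands in for the DMA transfer.

```python
import threading


class SoftwareCache:
    """Direct-mapped, read-only software cache with per-line locks (sketch)."""

    def __init__(self, external_memory, num_lines=4, line_size=8):
        self.mem = external_memory          # stand-in for external L3 memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines      # which memory line each slot holds
        self.lines = [None] * num_lines     # cached data, one list per slot
        self.locks = [threading.Lock() for _ in range(num_lines)]
        self.stats_lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        line_addr = addr // self.line_size
        index = line_addr % self.num_lines  # direct-mapped placement
        offset = addr % self.line_size
        # Hold the slot's lock across lookup and refill so that no other
        # thread can evict the line between the tag check and the read.
        with self.locks[index]:
            hit = self.tags[index] == line_addr
            if not hit:
                start = line_addr * self.line_size
                # Synchronous refill; a real system would issue a DMA here.
                self.lines[index] = self.mem[start:start + self.line_size]
                self.tags[index] = line_addr
            value = self.lines[index][offset]
        with self.stats_lock:
            if hit:
                self.hits += 1
            else:
                self.misses += 1
        return value
```

Per-line locks let threads that touch different cache slots proceed in parallel, whereas a single global lock would serialize every access. Write support and replacement policies other than direct mapping are deliberately left out of this sketch.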

    Exploring DMA-assisted prefetching strategies for software caches on multicore clusters

    Modern many-core programmable accelerators are often composed of several computing units grouped in clusters, with a shared per-cluster scratchpad data memory. The main programming challenge imposed by these architectures is to hide the latency of transfers from external memory to on-chip scratchpad memory by overlapping memory transfers with actual computation as much as possible. This problem is usually tackled using complex DMA-based programming patterns (e.g. double buffering), which require heavy refactoring of applications. Software caches are an alternative to hand-optimized DMA programming. However, even if a software cache can reduce the programming effort, it still relies on synchronous memory transfers. On a cache miss, the new line is copied into the cache and the requesting processor has to wait for the transfer to complete. While waiting, the processor cannot perform any other computation. Cache line prefetching can be used to reduce the number of synchronous memory transfers and increase the active time of each processor by loading cache lines before they are actually needed. In this work we explore various DMA-based prefetching techniques applied to a software cache implementation, presenting both automatic and programmer-assisted prefetch mechanisms applied to computer vision kernels.
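The effect of prefetching on synchronous misses can be sketched with a small counting model. This is an illustration under strong assumptions, not one of the techniques evaluated in the paper: the cache has unlimited capacity, the only policy modeled is a hypothetical automatic next-line prefetch, and every prefetch is assumed to complete before the line is needed.

```python
def count_sync_misses(addresses, line_size=8, prefetch_next=False):
    """Count accesses that would stall the processor on a synchronous refill.

    Assumptions (all hypothetical): unlimited cache capacity, and a prefetch
    issued on one access is complete before the next access happens.
    """
    resident = set()      # line numbers currently held in the cache
    sync_misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in resident:
            sync_misses += 1          # processor waits for this transfer
            resident.add(line)
        if prefetch_next:
            resident.add(line + 1)    # asynchronous fetch of the next line
    return sync_misses


if __name__ == "__main__":
    sequential = range(64)            # touches lines 0..7 in order
    strided = range(0, 64, 16)        # touches lines 0, 2, 4, 6
    print(count_sync_misses(sequential))                      # no prefetch
    print(count_sync_misses(sequential, prefetch_next=True))  # next-line
    print(count_sync_misses(strided, prefetch_next=True))     # stride 2 lines
```

Under these assumptions a sequential scan stalls only on its first line once next-line prefetch is enabled, while the strided scan gains nothing because it never touches the prefetched lines. That gap is one reason an automatic policy may be complemented by programmer-assisted hints for access patterns it cannot predict.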

    Hierarchically Focused Guardbanding: An Adaptive Approach to Mitigate PVT Variations and Aging

    This paper proposes a new model of functional units for variation-induced timing errors due to PVT variations and device Aging (PVTA). The model takes into account PVTA parameter variations, clock frequency, and the physical details of Placed-and-Routed (P&R) functional units in a 45nm TSMC analysis flow. Using this model and PVTA monitoring circuits, we propose Hierarchically Focused Guardbanding (HFG) as a method to adaptively mitigate PVTA variations. We demonstrate the effectiveness of HFG on a GPU architecture at two granularities of observation and adaptation: (i) fine-grained instruction-level; and (ii) coarse-grained kernel-level. Using coarse-grained PVTA monitors with kernel-level adaptation, the throughput increases by 70% on average. By comparison, instruction-by-instruction monitoring and adaptation enhances throughput by a factor of 1.8×–2.1×, depending on the configuration of PVTA monitors and the type of instructions executed in the kernels.