1,721,017 research outputs found

    A multi banked - Multi ported - Non blocking shared L2 cache for MPSoC platforms

    Full text link
    On-chip L2 cache architectures, well established in high-performance parallel computing systems, are now becoming a performance-critical component also for multi/many-core architectures targeted at lower-power, embedded applications. The very stringent requirements on power and cost of these systems result in one of the key challenges in many-core designs, mandating the deployment of highly efficient L2 caches. In this perspective, sharing the L2 cache layer among all system cores has important advantages, such as increased utilization, fast inter-core communication, and reduced aggregate footprint because no undesired replication of lines occurs. This paper presents a novel architecture for a shared L2 cache system with multi-port and multi-bank features. We target this L2 cache to a many-core platform based on hierarchical cluster structure that does not employ private data caches, and therefore does not require complex coherency mechanisms. In fact, our shared L2 cache can be seen logically as a Last Level Cache (LLC) adopting the terminology of higher-performance many-core products, although in these latter the LLC is more often an L3 layer. Our experimental results show a maximum aggregate bandwidth of 28GB/s (89% of the maximum channel capacity) for 100% hit traffic with random banking conflicts, as a realistic case. Physical implementation results in 28nm Fully-Depleted-Silicon-on-Insulator (FDSoI) show that our L2 cache can operate at up to 1GHz with a memory density loss of only 20% with respect to an L2 scratchpad for a 2 MB configuration

    PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision

    Full text link
    Novel pervasive devices such as smart surveillance cameras and autonomous micro-UAVs could greatly benefit from the availability of a computing device supporting embedded computer vision at a very low power budget. To this end, we propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology. We show that PULP performance can be scaled over a 1x-354x range, with a peak theoretical energy efficiency of 211 GOPS/W. We present performance results for several demanding kernels from the image processing and vision domain, with post-layout power modeling: a motion detection application that can run at an efficiency up to 192 GOPS/W (90 % of the theoretical peak); a ConvNet-based detector for smart surveillance that can be switched between 0.7 and 27fps operating modes, scaling energy consumption per frame between 1.2 and 12mJ on a 320 ×240 image; and FAST + Lucas-Kanade optical flow on a 128 ×128 image at the ultra-low energy budget of 14 μJ per frame at 60fps

    Logic-Base Interconnect Design for Near Memory Computing in the Smart Memory Cube

    No full text
    Hybrid memory cube (HMC) has promised to improve bandwidth, power consumption, and density for the next-generation main memory systems. In addition, 3-D integration gives a second shot for revisiting near memory computation to fill the gap between processors and memories. In this paper, we study the required infrastructure inside the HMC to support near memory computation in a modular and flexible fashion. We propose a fully backward compatible extension to the standard HMC called the smart memory cube, and design a high bandwidth, low latency, and Advanced eXtensible Interface-4.0 compatible logic base (LoB) interconnect to serve the huge bandwidth demand by the HMCs serial links, and to provide extra bandwidth to a generic processor-in-memory (PIM) device embedded in the LoB. This interconnect features a novel address scrambling mechanism for the reduction in the vault/bank conflicts and robust operation even in the presence of pathological traffic patterns. Our cycle accurate simulation results demonstrate that this interconnect can easily meet the demands of the latest HMC specifications (up to 205 GB/s read bandwidth with 4 serial links and 32 memory vaults for injected random traffic). It further shown that the default addressing scheme of the HMC (low interleaving) is not reliable enough and operates poorly in the presence of specific traffic patterns from real applications. This is while the proposed scrambling mechanism operates robustly even in those cases. The interference between the PIM traffic and the main links is shown to be negligible when the number of PIM ports is limited to 2, requesting up to 64 GB/s without pushing the system into saturation. Finally, logic synthesis with Synopsys Design Compiler confirms that our interconnect is implementable and effective in terms of power, area, and timing (power consumption less than 5 mW up to 1 GHz and area less than 0.4 mm2)

    A Modular Shared L2 Memory Design for 3-D Integration

    No full text
    Large required size, and tolerance to latency and variations in memory access time make L2 memory a suitable option for 3-D integration. In this paper, we present a synthesizable 3-D-stackable L2 memory IP component, which can be attached to a cluster-based multicore platform through its network-on-chip interfaces offering high-bandwidth memory access with low average latency. Our design implements a scalable 3-D-nonuniform memory access (NUMA) architecture based on low latency logarithmic interconnects, which allows stacking of multiple identical memory dies (MDs), supports multiple outstanding transactions, and achieves high clock frequencies due to its highly pipelined nature. We implemented our design with STMicroelectronics CMOS-28-nm low-power technology and obtained a clock frequency of 500 MHz (limited by the access time of the memory arrays, whereas its logic components can operate up to 1 GHz), up to eight stacked dies (4 MB) with a memory density loss of 9%. Benchmark simulation results demonstrate that the addition of 3-D-NUMA to a multicluster system can lead to an average performance boost of 34%. Furthermore, experiments and estimations confirm that 3-D-NUMA is energy and power efficient (38% power reduction due to an architectural clock gating scheme), temperature friendly (over 40°C temperature reduction), and has unique features suitable for low-cost manufacturing ( 2.3\times cost reduction due to identical MD layouts). Finally, 22% yield improvement is achievable in 3-D-NUMA compared with its 2-D counterparts, using the state of the art through-silicon-via technologies

    A Hybrid Instruction Prefetching Mechanism for Ultra Low-Power Multicore Clusters

    Full text link
    The instruction memory hierarchy plays a critical role in performance and energy efficiency of ultralow-power (ULP) processors for the Internet-of-Things (IoT) end-nodes. This is mainly due to the extremely tight power envelope and area budgets, which imply small instruction-caches (I-Cache) operating at very low supply voltages (near-threshold). The challenge is aggravated by the fact that multiple processors, fetching in parallel, require plenty of bandwidth from the I-Caches. In this letter, we propose a low-cost and energy efficient hybrid instruction-prefetching mechanism to be integrated with a ULP multicore cluster. We study its performance for a wide range of IoT applications, from cryptography to computer vision, and show that it can effectively improve the hit-rate of almost all of them to above 95% (average performance improvement of over 2 \times ). In addition, we designed our prefetcher and integrated it in a 4-cores cluster in 28 nm fully-depleted silicon-on-insulator (FDSOI) technology. We show that system's power consumption increases only by about 11% and silicon area by less than 1%. Altogether, a total energy reduction of 1.9x is achieved, thanks to more than 2x performance improvement, enabling a significantly longer battery life

    Ultra-low-latency lightweight dma for tightly coupled multi-core clusters

    Get PDF
    The evolution of multi- and many-core platforms is rapidly increasing the available on-chip computational capabilities of embedded computing devices, while memory access is dominated by on-chip and off-chip interconnect delays which do not scale well. For this reason, the bottleneck of many applications is rapidly moving from computation to communication. More precisely, performance is often bound by the huge latency of direct memory accesses. In this scenario the challenge is to provide embedded multi and many-core systems with a powerful, low-latency, energy efficient and flexible way to move data through the memory hierarchy level. In this paper, a DMA engine optimized for clustered tightly coupled many-core systems is presented. The IP features a simple micro-coded programming interface and lock-free per-core command queues to improve flexibility while reducing the programming latency. Moreover it dramatically reduces the area and improves the energy efficiency with respect to conventional DMAs exploiting the cluster shared memory as local repository for data buffers. The proposed DMA engine improves the access and programming latency by one order of magnitude, it reduces IP area by 4x and power by 5x, with respect to a conventional DMA, while providing full bandwidth to 16 independent logical channels

    Ultra-Low-Power Digital Architectures for the Internet of Things

    No full text
    This chapter introduces the architectures implementing the digital processing platforms and control for Internet of things applications. It will provide a review of the state of the art Ultra-Low-Power (ULP) micro-controllers architecture, highlighting main challenges and perspectives, introducing the potential of exploiting parallelism in this field currently dominated by single issue processors

    Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes

    Full text link
    High-performance computing systems are moving towards 2.5D and 3D memory hierarchies, based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our co-design approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC) each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for Convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with a very low programming effort. NeuroCluster occupies only 8 percent of the total logic-base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall 11 W is consumed in a single SMC device, with 22.5 GFLOPS/W energy-efficiency which is 3.5X better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs

    Design and evaluation of a processing-in-memory architecture for the smart memory cube

    No full text
    3D integration of solid-state memories and logic, as demonstrated by the Hybrid Memory Cube (HMC), offers major opportunities for revisiting near-memory computation and gives new hope to mitigate the power and performance losses caused by the “memory wall”. Several publications in the past few years demonstrate this renewed interest. In this paper we present the first exploration steps towards design of the Smart Memory Cube (SMC), a new Processor-in-Memory (PIM) architecture that enhances the capabilities of the logic-base (LoB) die in HMC. An accurate simulation environment called SMCSim has been developed, along with a full featured software stack. The key contribution of this work is full system analysis of near memory computation including high-level software to low-level firmware and hardware layers, considering offloading and dynamic overheads caused by the operating system (OS), cache coherence, and memory management. A zero-copy pointer passing mechanism has been devised to allow low overhead data sharing between the host and the PIM. Benchmarking results demonstrate up to 2X performance improvement in comparison with the host Systemon-Chip (SoC), and around 1.5X against a similar host-side accelerator. Moreover, by scaling down the voltage and frequency of PIM’s processor it is possible to reduce energy by around 70% and 55% in comparison with the host and the accelerator, respectively
    corecore