Memory Decoupled Architectures and related issues Guest Editor's Introduction
It is my great pleasure to serve as guest editor for this special issue of TCCA Newsletter, which is hosting eight papers from the MEDEA (MEmory DEcoupled Architectures) Workshop, held jointly with the PACT-2000 conference. The rationale behind this workshop was to revive the original idea of Memory Access Decoupling, presented in the famous paper of Jim Smith, Decoupled Access/Execute Architectures. That paper proposed a novel architecture, emerging among the high performance architectures then appearing in industry (CDC Cyber 180/990, CSPI array processor) and academia (Illinois SMA). At that time, Jim Smith came back to the University of Wisconsin to fuel his ideas. The main concept in Memory Access Decoupling was to use two instruction streams and queues to separate memory accesses and pure computations. After about 20 years the scenario of high performance microprocessors has changed considerably. Superscalar and VLIW architectures are the dominant paradigms, and a variety of tricks are used to enhance performance or reduce consumption. Instruction Level Parallelism, Out-of-Order Execution, Speculative Loads, Branch Prediction, Multithreading, Chip Multiprocessors, and Dynamic Compilation are just some of the keywords that have entered our common vocabulary. In this new scenario we asked for contributions that could show how memory decoupling could be applied to achieve design goals, and possibly to explore new sources of parallelism. As observed by Roth, Zilles, and Sohi in this issue's first paper, today's processors can tolerate latencies of about 10 cycles, but we are approaching the case where the processor-memory gap is going to exceed 100 cycles. So, at this time
TERAFLUX: Ideas for the Future Many-Cores
Recent silicon advances promise to keep Moore's Forecast true for at least this decade. In numbers,
one TERA (10^12) transistors in a single chip or package will be available, posing three major challenges for
future computing systems: i) how to efficiently program these systems? ii) which architecture would lead to a
manageable complexity? iii) how do we keep the system reliable? The TERAFLUX project (http://teraflux.eu,
with a total cost of about 7.5 M-Euro) allows 10 academic and industrial partners to join forces in order to
propose a holistic solution able to address the three challenges above.
Many proposals for future many-core systems are gaining attention nowadays: CUDA-based systems already
contain 512 cores per chip, while x86 multi-core processors have already reached 12 cores. TERAFLUX leverages
Dataflow Parallelism to reach power efficiency, reliability, efficient parallel programmability, scalability, and data
bandwidth.
Dataflow is exploited both at task level and inside the threads: to offload accelerated codes, to localize the
computation, to manage the fault information with appropriate protocols, to easily migrate code to the
available/working components, to respect the power/performance/temperature/reliability envelope, to produce a
more predictable behavior, and to efficiently handle the parallelism with an easy and powerful execution model.
A special challenge is the evaluation of such a system, which targets at least 1000 cores. Our simulation
infrastructure relies on the COTSon simulator provided by HP-Labs (TERAFLUX partner). One more
contribution of this project is to provide an updated COTSon-based TERAFLUX simulator as an Open-Source
project
Evaluation of a Coherence Protocol for Eliminating Passive Sharing in Shared-Bus Multithreaded Multiprocessors
Single-chip multiprocessors and multiple-thread architectures are becoming an affordable solution for high-performance general-purpose workstations and servers. On these machines, the workload typically consists of both sequential and parallel applications. Shared-bus shared-memory multithreaded multiprocessors can be used to speed up the execution of such workloads. In this environment, the scheduler takes care of load balancing by allocating a ready process to the first available processor, thus producing process migration. Process migration and the persistence of private data in different caches produce an undesired form of sharing, named passive sharing. The copies due to passive sharing produce useless coherence traffic on the bus, and coping with this problem may be a challenging design issue for these machines. Many protocols use smart solutions to limit the overhead of maintaining coherence among shared copies. None of these studies treats passive sharing directly, although some indirect effect is present when dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but this technique does not adapt to all load conditions, especially when the effects of migration are massive. A simple coherence protocol is presented. This protocol eliminates passive sharing using information from the compiler that is normally available in operating system kernels. The performance of this protocol has been evaluated and compared against other solutions proposed in the literature by means of enhanced trace-driven simulation. The proposed solution outperforms the other protocols, especially in the case of a multithreaded processor, thus demonstrating its effectiveness on this kind of hardware platform. The complexity of the proposed approach has been evaluated in terms of the number of protocol states, additional bus lines and required software support.
The protocol further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications.
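The passive-sharing effect described in this abstract can be illustrated with a minimal sketch. It assumes a toy write-invalidate model with one process and a handful of private blocks; it is not the protocol proposed in the paper, and the names `Cache`, `Bus`, `write` and `run` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Bus:
    """Counts coherence transactions on the shared bus."""
    invalidations: int = 0


@dataclass
class Cache:
    """A toy per-processor cache: just the set of block addresses it holds."""
    blocks: set = field(default_factory=set)


def write(cpu, addr, caches, bus):
    """Toy write-invalidate step: drop every remote copy, then write locally."""
    for other, cache in enumerate(caches):
        if other != cpu and addr in cache.blocks:
            cache.blocks.discard(addr)
            bus.invalidations += 1  # bus traffic caused by a remote copy
    caches[cpu].blocks.add(addr)


def run(private_blocks, schedule):
    """Run one process that writes all its private blocks once per time slice.

    `schedule` lists the CPU on which the process runs in each time slice.
    Returns the number of invalidations seen on the bus.
    """
    caches = [Cache() for _ in range(max(schedule) + 1)]
    bus = Bus()
    for cpu in schedule:
        for addr in private_blocks:
            write(cpu, addr, caches, bus)
    return bus.invalidations


# Same process, same private data: only the scheduling differs.
pinned = run(range(4), [0, 0, 0, 0])     # process never migrates
migrating = run(range(4), [0, 1, 0, 1])  # process migrates every slice
print(pinned, migrating)
```

In this toy model the pinned process generates no bus traffic, while the migrating one pays an invalidation for every private block left behind in the previous cache, even though no other process ever touches that data.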
Scalable Embedded Computing through Reconfigurable Hardware: comparing DF-Threads, Cilk, OpenMPI and Jump
Data-Flow Threads (DF-Threads) is a new execution model that makes it possible to seamlessly distribute the workload across several cores (in a multi-core) and several nodes (in a multi-node/multi-board configuration).
In this paper, we show the progress made in deploying this execution model, which is being developed using a combination of a simulator model (i.e., the COTSon framework) and a reconfigurable hardware platform (i.e., the AXIOM-board). The AXIOM platform consists of a custom board based on the Xilinx Zynq Ultrascale+ ZU9EG, which incorporates the largest FPGA available on that System-on-Chip at the moment, four 64-bit ARM cores and two 32-bit ARM cores, up to 32GiB of main memory and several 16Gbit/s transceivers.
Although a complete DF-Threads system is still under development, it is already capable of running a full Linux OS and simple applications, so some initial results are presented here. In particular, well-known programming models used to exploit Thread-Level Parallelism, such as Cilk, OpenMPI and Jump, are compared with DF-Thread execution. Cilk is good for multi-cores, but it is not suitable for multi-node systems. In the latter case, the distribution of the workload could be managed partly by the programmer when using programming models such as message-passing (OpenMPI has been chosen for reference) or distributed shared-memory (Jump in our case).
The obtained results show that a DF-Thread execution on a cluster of eight 4-core boards can provide a speed-up of more than 14x compared to the same configuration when using OpenMPI and more than 80x when compared with an OpenMPI single-core, single-node execution
TERAFLUX: Exploiting Dataflow Parallelism in Teradevices
The TERAFLUX project is a Future and Emerging Technologies (FET) Large-Scale Project funded by the European Union. TERAFLUX is at the forefront of major research challenges such as programmability, manageable architecture design, and reliability of many-core or 1000+ core chips. In the near future, new computing systems will consist of a huge number of transistors, probably 1 Tera or 1000 billion by 2020: we call such systems Teradevices. In this project, the aim is to solve the three challenges at once by using the dataflow principles wherever they are applicable or make sense in the general economy of the system. An Instruction Set Extension (ISE) for the x86-64 is illustrated. This ISE supports the dataflow execution of threads
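The general idea of dataflow execution of threads can be sketched as follows: a thread becomes ready only once all of its inputs have been written. This is a minimal software illustration under that assumption, not the x86-64 ISE described in the abstract; `DataflowThread`, `Scheduler`, `schedule` and `write` are hypothetical names.

```python
from collections import deque


class DataflowThread:
    """A thread with an input frame and a count of inputs still missing."""

    def __init__(self, name, num_inputs, body):
        self.name = name
        self.body = body
        self.frame = {}             # input slots filled in by producers
        self.pending = num_inputs   # synchronization count


class Scheduler:
    """Toy scheduler: a thread is queued only when its count reaches zero."""

    def __init__(self):
        self.ready = deque()
        self.trace = []             # order in which threads actually ran

    def schedule(self, name, num_inputs, body):
        thread = DataflowThread(name, num_inputs, body)
        if thread.pending == 0:
            self.ready.append(thread)
        return thread

    def write(self, target, slot, value):
        # Assumes each slot is written exactly once by its producer.
        target.frame[slot] = value
        target.pending -= 1
        if target.pending == 0:
            self.ready.append(target)

    def run(self):
        while self.ready:
            thread = self.ready.popleft()
            self.trace.append(thread.name)
            thread.body(self, thread.frame)


# Compute (1 + 2) * (3 + 4): "mul" cannot run until both sums arrive.
results = {}
sched = Scheduler()
mul = sched.schedule("mul", 2, lambda s, f: results.update(out=f["x"] * f["y"]))
sched.schedule("add_ab", 0, lambda s, f: s.write(mul, "x", 1 + 2))
sched.schedule("add_cd", 0, lambda s, f: s.write(mul, "y", 3 + 4))
sched.run()
print(results["out"], sched.trace)
```

Although `mul` is created first, it runs last: execution order follows data availability rather than program order, which is the property the dataflow approach relies on.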
Core Design and Scalability of Tiled SDF Architecture
Embedded systems are making increasingly extensive use of multi-core chips to reach high performance goals. While current systems contain only a few cores, present trends and commercial/research roadmaps foresee that in the near future many cores will be integrated on the same chip to achieve the best tradeoff between power consumption and performance. At the same time, centralized designs are progressively being abandoned in favour of more modular and scalable approaches that explicitly address the wire-delay problem and aim to exploit application parallelism. Such designs are often referred to as tiled architectures. Here we present our idea of how the tiled paradigm can be applied to the SDF architecture
Scheduling and NoC Traffic Reduction in T-SDF Architecture
As transistor size shrinks and chip complexity increases, it is possible to place more transistors onto a single chip, and thus to integrate more than one processor on a single chip. Clock frequency has also increased, and because of wire delay it is not possible to reach all parts of a chip in a single clock cycle, so the interconnection network is becoming a bottleneck in such systems. Our research is focused on creating a multiprocessor-on-chip architecture based on the SDF architecture, by placing multiple processing cores on a single chip and interconnecting them to work together. We are investigating possible ways to connect cores, ways in which threads are scheduled on individual cores, and ways to reduce network traffic, especially the coherence traffic, by using the PSCR protocol
Issues in Embedded Single-Chip Multicore Architectures
Current and future embedded and special-purpose systems need a qualitative step forward in research rather than continued quantitative improvement of existing designs: it is time to scale out architectures instead of scaling up frequency. While transistor counts are still increasing as predicted by Moore's law, challenges such as wire delay, design complexity, and power requirements are becoming more and more important. These problems are preventing chip architectures from evolving in the directions followed in previous decades, when clock frequency could also scale up with Moore's law. Many researchers and companies have started to look at building multiprocessors on a single chip, following both past and novel design solutions: there is no doubt that we can all expect several cores on a single chip in the near future
JCacheSim: a visual memory-hierarchy simulator with an interpreter for MIPS programs
The memory hierarchy plays an essential role in the design of modern computers, but it is also one of the hardest aspects to present from a teaching standpoint. Graphical visualization tools are particularly useful. JCacheSim is a simulator able to reproduce the real behavior of the cache. It was developed as a Java Applet so that it is available on any platform; through a logging mechanism, JCacheSim lets the instructor collect information about the activity carried out by students. JCacheSim is the ideal framework for studying the cache in Computer Architecture ("Calcolatori Elettronici") courses
