1,720,987 research outputs found
Optimizing instruction cache performance of embedded systems
In the embedded domain, the gap between memory and processor performance and the increase in application complexity need to be supported without wasting precious system resources: die size, power, etc. For these reasons, effective exploitation of small and simple cache memories is of the utmost importance. However, programs running on such caches can experience serious inefficiencies due to cache conflicts.
We present a new Cache-Aware Code Allocation Technique (CAT), which transforms the structure of programs so that their behavior toward memory can meet the locality features the cache
is able to exploit. The proposed approach uses detailed information of program execution to place program areas into memory and employs the new idea of “look-forward estimation” that helps to seek better global layouts during the placement of each area. CAT-optimized programs outperform the original ones achieving the same miss rate on two times, and sometimes four times, smaller caches. Moreover, CAT improves the instruction miss rate by more than 40% if compared to the best procedure-reordering algorithm. CAT performances derive from the increased number of cache lines that support the execution of optimized applications and from a more balanced load on them
Embedded Processors and Systems: Architectural Issues and Solutions for Emerging Applications
Accelerating DSS Workloads through Coherence Protocols
n this work, we analyze how a DSS (Decision Support System) workload can be accelerated in the case of a shared-bus shared-memory multiprocessor, by adding simple and inexpensive support for a coherence protocol, in order to reduce the overhead produced by thread migration. Indeed, it is well known that, in this kind of systems, the bus is the critical element that can limit the scalability of the machine. Nevertheless, many factors that influence bus utilization have not been yet investigated for this kind of workload, in particular the effects of thread migration. For the sake of completeness the DSS workload is not running alone in the machine but together with other programs that might be spawned by the DSS application, like system commands or other software typically running in this kind of machine. The operating system effects are also considered in our evaluation. We analyzed a basic four-processor and a high-end sixteen-processor machine, implementing three different coherence protocols (including MESI and another solution from the literature). We show that even in the four-processor case, the overhead induced by the sharing of private data, as a consequence of process migration, namely passive sharing, cannot be neglected. Indeed, the analysis shows that a protocol based on a selective strategy for dealing with private and shared data has a better performance than protocols either relying on the detection of migratory access-pattern or purely using a Write-Invalidate strategy, like MESI.DAWe varied the architectural parameters to show how passive sharing and other coherence overhead are influenced by different cache choices. Then, we considered the sixteen-processor case, where the effects on performance are more evident. We also end up that performance can take advantage of large caches and cache affinity scheduling. However, even with affinity scheduling, a selective protocol delivers better performance
OS Effects on Memory Hierarchy of a SMP Multiprocessor Running a DBMS Workload
In this work, we characterized the impact of operating system activities like process migration on a shared-bus shared-memory multiprocessor running typical DBMS workload. Our workload has been set-up utilizing the TPC-D benchmark on the PostgreSQL DBMS. Analysis has been performed via trace driven simulation enhanced technique which includes most important operating system activities and analyzes the sharing overhead in detail. We evaluated a basic four-processor and a high-end sixteen-processor machine, implementing MESI and other coherence protocols that deal with migration of processes and data. Our results show that even in the four-processor case operating system effects may not be neglected. In fact, different coherence protocols can more effectively reduce the effects of process migration. The consequences on performance become more important in high-end machines (16 or more processors). In this case, even little sharing, as we found in DBMS applications can become crucial for system performance. Better speed up may be achieved adopting several alternatives including redesign of kernel data structure. Cache affinity is somewhat useful in reducing migration effect, but it is not effective in every load conditions
Reducing coherence overhead and boosting performance of high-end SMP multiprocessors running a DSS workload
In this work, we characterized the memory performance - and in particular the impact of coherence overhead and process migration - of a shared-bus shared-memory multiprocessor running a DSS workload. When the number of processors is increased in order to achieve higher computational power, the bus becomes a major bottleneck of such architecture. We evaluated solutions that can greatly reduce that bottleneck. An area where this kind of optimization is important regards data base systems. For this reason, we considered a DSS workload and we setup the experiments following TPC-D specifications on the PostgreSQL DBMS in order to explore different optimizations on same kind of workloads as evaluated in the literature. In this scenario, we compare possible solutions to boost performance and we show the impact of process migration on coherence overhead. We found that the consequences of coherence overhead and process migration on performance are very important in machines with 16 or more processors. In this case, even little sharing, as in DSS applications, can become crucial for system performance. Another important result of our analysis regards the interaction between the coherence protocol and the scheduler. The basic cache affinity scheduling is useful in reducing migration, but it is not effective in every load condition. Specific coherence protocols can help reduce the effects of process migration, especially in situations when the scheduler cannot apply the affinity requirement. In these conditions, the use of a write-update protocol with a selective invalidation strategy for private data improves performance (and scalability) of about 20% with respect to a classical MESI-based solution. This advantage is about 50% in the case of high cache-to-cache transfer
Process Migration Effects on Memory Performance of Multiprocessor Web-Server
In this work we put into evidence how the memory performance of a Web-Server machine may depend on the sharing induced by process migration. We considered a shared-bus shared-memory multiprocessor as the simplest multiprocessor architecture to be used for accelerating Web-based and commercial applications. It is well known that, in this kind of system, the bus is the critical element that may limit the scalability of the machine. Nevertheless, many factors that influence bus utilization, when process migration is permitted, have not been thoroughly investigated yet. We analyze a basic four-processor and a high-performance sixteen-processor machine. We show that, even in the four-processor case, the overhead induced by the sharing of private data as a consequence of process migration, namely passive sharing, cannot be neglected. Then, we consider the sixteen-processor case, where the effects on performance are more massive. The results show that even though the performance may take advantage of larger caches or from cache affinity scheduling, there is still a great amount of passive sharing, besides false sharing and active sharing. In order to limit false sharing overhead, we can adopt an accurate design of kernel data structures. Passive sharing can be reduced, or even eliminated, by using appropriate coherence protocols. The evaluation of two of such protocols (AMSD and PSCR) shows that we can achieve better processor utilization compared to the MESI case
A Proposal for Input-sensitivity Analysis of Profile-driven Optimizations on Embedded Applications
The ever-increasing gap between processor and memory speed is an issue also in embedded systems, because of the increased complexity of multimedia elaborations and the strict resource constraints of these devices.Profile-driven code optimization techniques can be effectively employed for tuning application-cache interaction and performances of cache system itself. In fact, applications running on such systems are usually known in advance and do not change over time. In a previous paper, we presented a profile-based code restructuring technique (CAT) that was able to dramatically increase cache exploitation of embedded applications.However, it is well known that profile-driven optimizations can suffer from input-sensitivity problems: an application that is optimized for a particular input can perform even worse than the original one, when subjected other inputs.In this paper we take into account jpeg and mpeg compressor/decompressor applications and analyze the input-sensitivity of CAT improved layouts over a wide range of inputs. The input sets were accurately determined through both black-box and white-box analysis of applications.We propose two metrics for measuring the input-sensitivity of application layouts, and show how our profile-driven code transformation technique is able to reduce the input-sensitivity of the considered applications up to 48% on caches ranging from 1 KByte to 8KByte
Web-based training on Computer Architecture: The case for JCachesim
This paper describes possible advantages of adding an interactive tool with log capabilities, in an online learning environment. We describe the interactive, Java-based tool named JCachesim, which is used for experimenting cache behavior with simple assembly programs while varying cache features. The tool has embedded features that allow the teacher to monitor the progress of each individual student
Performance analysis of electronic commerce multiprocessor server
In this paper, the performance of an Electronic Commerce server, i.e. a system running Electronic Commerce applications, is evaluated in the case of shared-bus multiprocessor architecture. In particular, we focused on the memory subsystem design. We have analyzed the common case of a system using the MESI coherence protocol, for maintaining coherency among the processor private caches. We have evaluated the miss ratio and the bus traffic of such a system by varying cache size, number of ways, scheduling policy and number of processors, highlighting the relations with different types of data sharing generated by the application or the kernel. We found that passive sharing and false sharing are the major sources of coherence overhead, in the case of relatively large caches (over 1M-byte size). False sharing is mainly due to kernel data, and can be eliminated by using appropriate data structure design techniques. A scheduling technique, like cache-affinity can reduce passive sharing, but it is not effective in every load conditions. Thus, a special coherence protocol could be a better solution to completely eliminate passive sharing overhead and boost performance
A Cache-aware program transformation technique suitable for embedded systems
In embedded systems caches are very precious for keeping low the memory bandwidth and to allow employing slow and narrow off-chip devices. Conversely, the power and die size resources consumed by the cache force the embedded system designers to use small and simple cache memories. This kind of caches can experience poor performance because of their not flexible placement policy. In this scenario, a big fraction of the misses can originate from the mismatch between the cache behavior and the memory accesses' locality features (conflict misses). In this paper we analyze the conflict miss phenomenon and define a cache utilization measure. Then we propose an object level Cache Aware allocation Technique (CAT) to transform the application to fit the cache structure, minimize the number of conflict misses and maximize cache exploitation. The solution transforms the program layout using the standard functionalities of a linker. The CAT approach allowed the considered applications to deliver the same performance on two times and sometimes four times smaller caches. Moreover the CAT improved programs on direct-mapped caches outperformed the original versions on set-associative caches. In this way, the results highlight that our approach can help embedded system designers to meet the system requirements with smaller and simpler cache memories. (C) 2002 Elsevier Science B.V. All rights reserved
- …
