I. INTRODUCTION
Miss Rate = (Number of Cache Misses) / (Total Memory Accesses)    (1)
Miss Rate is the ratio of cache misses to the total number of memory accesses. A lower miss rate means fewer accesses to the slower levels of the memory hierarchy, which in turn lowers both the average memory access time and the energy consumed by memory accesses.
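These relationships can be made concrete with the standard textbook metrics (an illustrative sketch; the formulas below are general definitions, not taken from this paper):

```python
def miss_rate(misses: int, accesses: int) -> float:
    """Fraction of memory accesses that miss in the cache: Equation (1)."""
    return misses / accesses

def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# Example: 1-cycle hit, 50 misses in 1000 accesses, 100-cycle miss penalty.
print(amat(1.0, miss_rate(50, 1000), 100.0))  # prints 6.0
```

A lower miss rate directly shrinks the second term, which is why dimensioning the cache to the application matters so much.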
The cache parameters described earlier directly influence the cache miss rate. A larger Block Size, Set Size or Associativity improves the probability that a requested datum is present in the cache, resulting in a lower miss rate. However, this also increases the total cache size C, which raises the Hit Time and implementation cost as well as the energy consumed by the cache. It has been demonstrated in [1] how the execution time varies with the cache miss rate for different cache configurations, using the G.721 encoder application. This gives rise to the need to accurately determine suitable cache parameter values, in order to achieve faster access times without incurring excess cost.
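The interplay between these parameters and the resulting design space can be sketched as a small enumeration (the parameter ranges below are illustrative assumptions, not values from the paper):

```python
from itertools import product

# Assumed, illustrative parameter ranges for a small design space.
BLOCK_SIZES = [16, 32, 64]          # bytes per block
SET_COUNTS  = [64, 128, 256, 512]   # number of sets
ASSOCS      = [1, 2, 4, 8]          # ways (associativity)

def capacity(block: int, sets: int, ways: int) -> int:
    """Total cache size in bytes: C = block size * set count * associativity."""
    return block * sets * ways

# Even these small ranges yield 3 * 4 * 4 = 48 distinct configurations.
configs = list(product(BLOCK_SIZES, SET_COUNTS, ASSOCS))
print(len(configs))            # prints 48
print(capacity(32, 256, 4))    # prints 32768 (a 32 KiB cache)
```

Growing any one parameter multiplies C, so each step toward a lower miss rate carries a capacity, hit-time and energy cost.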
Equally importantly, the utilization of a cache is highly application dependent. Since different applications exhibit different memory access patterns, the miss rate of a cache varies with the running application. This emphasizes the importance of dimensioning the cache parameters for application-specific processor systems, which are typically found in the embedded domain. In other words, for a given application an embedded system designer needs to determine suitable values for the different cache parameters which minimize the cache miss rate. This selection is subject to constraints imposed on performance, energy consumption and chip area, which are deemed very important in embedded systems. For example, it has been shown that caches can consume up to 43% of the power in a processor [2].
Thus the problem can be stated as follows: given an application and a cache hierarchy for a processor system, autonomously determine the values for the cache parameters that minimize the miss rate within the imposed constraints. This calls for methods to accurately determine the hit and miss rates incurred by different cache configurations, by exploring the cache design space through simulation. This paper aims to comprehensively analyze the previous work related to such simulation methodologies for uniprocessor and multiprocessor cache systems. It emphasizes certain aspects of the problem which need to be addressed in order to design a robust solution.
II. RELATED WORK
The other important attribute of an associative cache is the Replacement Policy, which selects the cache entry to evict in order to make space for another. The most commonly used replacement policies in caches are LRU (Least Recently Used) and FIFO (First In, First Out). Different values of these cache parameters combine to constitute a large space of cache configurations.
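The behavioral difference between the two policies can be sketched on a single cache set (a minimal illustration, assuming a fixed associativity and reducing each access to its tag):

```python
from collections import deque

def access(set_blocks: deque, tag: int, ways: int, policy: str) -> bool:
    """Simulate one access to a single cache set; return True on a hit."""
    if tag in set_blocks:
        if policy == "LRU":          # LRU refreshes a block on every hit;
            set_blocks.remove(tag)   # FIFO keeps the original arrival order.
            set_blocks.append(tag)
        return True
    if len(set_blocks) == ways:      # set is full: evict the oldest entry
        set_blocks.popleft()
    set_blocks.append(tag)
    return False

# The same access pattern can hit more often under LRU than under FIFO:
pattern = [1, 2, 1, 3, 1]
for policy in ("LRU", "FIFO"):
    s = deque()
    hits = sum(access(s, t, 2, policy) for t in pattern)
    print(policy, hits)              # prints "LRU 2" then "FIFO 1"
```

In the pattern above, LRU keeps block 1 resident because its hit refreshes it, while FIFO evicts it regardless of recent use.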
Various approaches have previously been presented to efficiently determine the miss rates of different cache configurations for a given application. Most of these methods use memory access patterns, often referred to as traces (memory access traces), to simulate different cache configurations. Traces are extracted from the execution of the application on an instruction set simulator, or on the hardware itself.
Isuru Nawinne and Sri Parameswaran are with the School of Computer
Science and Engineering, University of New South Wales, Sydney, NSW 2052,
Australia. (email: isurun,sridevan@cse.unsw.edu.au)
2013 IEEE 8th International Conference on Industrial and Information Systems, ICIIS 2013, Aug. 18-20, 2013, Sri Lanka
Fig. 1: Overview and flow of memory access trace driven simulation methods
avoid exploring the whole design space. Such methods are generally faster, at the expense of the accuracy of the result.
The suitability of a particular cache configuration is determined by the
effects it has on the constraints imposed on the system. Some commonly
found constraints, in the domain of embedded systems, when simulating
cache configurations are as follows.
It should also be noted that the majority of prior work focuses on uniprocessor systems with one or more levels of caches in the memory hierarchy. A few methods consider multiprocessor systems and the complications added by the coherency of cached data. These works are discussed in detail in the coming sections.
III.
Exact simulation tools such as Dinero IV [3] are widely used to calculate accurate miss rates through simulation. Such trace-driven simulators take a memory access trace as input and provide the cache miss rate for a predefined set of cache configurations (Fig. 1). However, Dinero IV simulates one configuration at a time; the designer therefore has to run the simulator repeatedly for different cache configurations to find the most suitable configuration for a given application's trace. A trace covering just a few seconds of an application's execution can consist of millions of memory accesses, so repeatedly executing the simulator for different cache configurations on such a trace can consume an enormous amount of time, on the order of hours or even days.
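This cost structure can be illustrated with a minimal single-configuration, trace-driven simulator in the spirit of (but far simpler than) Dinero IV; the toy trace and configuration values below are assumptions for illustration:

```python
def simulate(trace, block_size: int, num_sets: int) -> int:
    """Direct-mapped cache: count the misses for one configuration."""
    tags = [None] * num_sets
    misses = 0
    for addr in trace:
        block = addr // block_size          # which memory block is accessed
        idx, tag = block % num_sets, block // num_sets
        if tags[idx] != tag:                # tag mismatch (or cold set): miss
            misses += 1
            tags[idx] = tag
        # matching tag: hit, nothing to update in a direct-mapped cache
    return misses

trace = [0, 4, 8, 32, 0, 4]                 # toy trace of byte addresses
for num_sets in (1, 2, 4):                  # one full pass per configuration
    print(num_sets, simulate(trace, 16, num_sets))  # misses: 3, 3, 2
```

Each configuration costs a complete pass over the trace, so exploring N configurations on a trace of millions of accesses multiplies the runtime by N; this is exactly the overhead the single-pass simulators below try to eliminate.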
Fig. 2: Simulation data structures used by Janapsatya et al. in [1]: an array holds hit/miss counters for each configuration under simulation and pointers to the tree structures with the relevant block size. Each level in a tree corresponds to a set of cache configurations with the same block size and set size, and each tree node represents a cache set. A linked list is associated with each tree node, representing the cache ways and therefore the different associativities.
Fig. 3: Example CLT data structure used by Haque et al. in [9]: a CLT contains entries for each memory block. In a single block entry, there are as many records as the number of different cache set sizes in the simulation. Records indicate the availability of that memory block in a set of configurations with a fixed set size and different associativities.
Haque et al. proposed a set of intersection properties [10] for caches with FIFO replacement, which predict the availability of memory blocks in other configurations subject to certain conditions.
IV.
Fig. 5: T-SPaCS algorithm by Zang et al. in [12] (Si - set size for level i, B - block size, Wi - associativity for level i, K - conflict tables)
values in the conflict tables at the end of the simulation. The formulation of this method provides the benefit of single-pass simulation of a trace without consuming too much storage space. However, it results in a large number of operations being carried out on the stack structure, which consumes the majority of the simulation time. The results show that the SPCE simulator obtains the miss rates for a given trace 14.88 times faster than Dinero IV on average, for the applications in the Motorola PowerStone benchmark suite. This method is therefore not as time efficient as most of the other simulators discussed above, but it is efficient in terms of storage space.
Extending the work by Viana et al., Zang et al. proposed a stack-based single-pass cache simulator for two-level caches [12]. The major challenge in two-level cache simulation is producing the filtered access trace for the L2 cache. The L2 cache's access trace is composed of the accesses that missed in the L1 cache. Since a single-pass simulator analyzes a vast space of L1 cache configurations simultaneously, the L2 access trace for each of them is unique. This results in n different L2 cache simulations, where n is the number of L1 cache configurations, so the storage space and simulation time could grow beyond practical bounds. To avoid these complications, Zang et al. limit their scope to exclusive two-level caches with LRU replacement for L1 and FIFO replacement for L2. In exclusive caches, the content of each cache level is a disjoint set from the other: a cache block in one level is guaranteed not to exist in the other level. This enables the simulator to view the two cache levels as one single cache using the original access trace, with only a minimal loss of accuracy in the L2 miss rate estimation. Fig. 5 depicts the two-level cache simulator, named T-SPaCS. However, the combination of two caches enlarges the stack structure dramatically, which degrades performance further. To remedy this, the authors use tree and array data structures to determine conflicts faster for different set sizes and associativities.
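The exclusive property that T-SPaCS relies on can be sketched as follows (a simplified illustration with arbitrary victim selection, not the paper's stack-based algorithm; the level contents are modeled as plain sets):

```python
def exclusive_access(l1: set, l2: set, block: int, l1_capacity: int) -> str:
    """Access one block, keeping L1 and L2 disjoint (exclusive hierarchy)."""
    if block in l1:
        return "L1 hit"
    result = "L2 hit" if block in l2 else "miss"
    l2.discard(block)                 # the block leaves L2 if it was there...
    if len(l1) == l1_capacity:        # ...and an L1 victim is demoted to L2
        l2.add(l1.pop())              # arbitrary victim (replacement ignored)
    l1.add(block)                     # the block now lives only in L1
    assert not (l1 & l2)              # exclusivity invariant: disjoint levels
    return result

l1, l2 = {10}, {20}
print(exclusive_access(l1, l2, 20, l1_capacity=2))  # prints "L2 hit"
print(sorted(l1), sorted(l2))                       # prints "[10, 20] []"
```

Because every block lives in exactly one level, the union of the two levels behaves like a single larger cache, which is what lets T-SPaCS analyze both levels from the original, unfiltered trace.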
Zang et al. continued their work in [13] by modifying T-SPaCS to simulate unified two-level caches. In such architectures, there are separate instruction and data caches in the first level, while the second-level cache hosts both types of blocks. The modified simulator is called U-SPaCS. The memory access trace is divided into separate instruction and data traces, and two stacks are used accordingly. Separate analyses are carried out for the two L1 caches, and L2 analysis occurs in the event of a miss from either of the L1 caches. Both T-SPaCS and U-SPaCS support only exclusive two-level caches, and cannot dimension inclusive cache hierarchies.
It is obvious that, given the correct emulation of cache behavior,
exact simulation based on memory access traces can provide accurate
cache miss rates for different configurations. However the simulation
time and space taken by these algorithms are a significant concern. Even
though various enhancements by utilizing optimized data structures and
Fig. 7: Using the FPGA cache simulator with a running processor, to extract the memory
access trace in real-time
V.
VI.
configurations for the L1 caches of the system, separate access traces for each processor need to be derived from the original trace. This is shown as the secondary trace generation step in Fig. 9. Additional information is recorded in the secondary traces, which allows the simulator to account for cache coherence in the L1 simulation. For example, accesses in the secondary traces are annotated with whether they were a hit in the selected L2 configuration. Also, cache inclusion causes every miss in the L2 cache to also be a miss in the L1 cache. Making use of the inclusiveness of the cache hierarchy and the additional information recorded in the secondary traces, simulations for each L1 cache are carried out.
However, once the L1 caches are in place, the accesses seen by the L2 cache are in reality composed of the cache misses from the L1 caches. Therefore, in this method, the L2 cache configuration obtained with the original access trace may no longer be valid after the L1 caches are dimensioned. The other point worth noting is that the generated secondary traces might not be chronologically correct when considered under parallel execution: with the L1 caches present, the order of accesses to level 2 could differ from the original trace, owing to varying hit/miss latencies across the different L1 caches.
Obtaining accurate memory access traces is a vital part of exact simulation of multiprocessor cache configurations. Thus, extracting the memory access trace from actual hardware is preferable to software simulation of the execution. Wilson et al. based their work [22] on multiprocessor cache simulation for bus traffic analysis, obtaining traces from hardware.
Rawlins and Gordon-Ross proposed a run-time tuning methodology [23] for reconfigurable data caches in a dual-processor system. The main objective of the tuner is to reduce the energy consumption of the data caches. It uses a simple algorithm and heuristics: the caches are initialized with the smallest values for all parameters, which are then periodically incremented until no further decrease in energy is observed.
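The greedy tuning loop described above can be sketched as follows (the parameter ranges and energy model below are placeholder assumptions, not from Rawlins and Gordon-Ross, and the sketch runs offline rather than periodically at run time):

```python
def tune(param_ranges: dict, measure_energy):
    """Start each parameter at its smallest value; greedily grow one
    parameter at a time as long as the measured energy keeps decreasing."""
    config = {name: values[0] for name, values in param_ranges.items()}
    best = measure_energy(config)
    improved = True
    while improved:
        improved = False
        for name, values in param_ranges.items():
            i = values.index(config[name])
            if i + 1 < len(values):                    # try next larger value
                trial = dict(config, **{name: values[i + 1]})
                energy = measure_energy(trial)
                if energy < best:                      # keep it only if better
                    config, best, improved = trial, energy, True
    return config, best

# Placeholder energy model with a sweet spot at sets=256, ways=2.
def energy(cfg):
    return abs(cfg["sets"] - 256) / 64 + abs(cfg["ways"] - 2)

ranges = {"sets": [64, 128, 256, 512], "ways": [1, 2, 4]}
print(tune(ranges, energy))   # prints ({'sets': 256, 'ways': 2}, 0.0)
```

Like the heuristic it sketches, this loop never revisits a parameter once growing it stops helping, which is what makes it cheap enough for run-time use but also vulnerable to local minima.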
In summary, the current literature on accurate simulation of cache
configurations consists of trace-driven exact methods. Most methods
involve memory access traces extracted from instruction set simulation of
soft processors, while a few use traces extracted from hardware. Various
methods exploiting the correlation properties within cache configurations
were proposed to accelerate cache simulation. While many simulators
exist for cache configuration simulation of single processor systems, multiprocessor systems with complex cache hierarchies are seldom explored.
VII. CONCLUSION
REFERENCES
[1] A. Janapsatya and A. Ignjatovic, "Finding Optimal L1 Cache Configuration for Embedded Systems," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC'06), 2006, pp. 1-6.
[2] R. T. Witek, A. J. Black, E. M. Cooper, D. W. Dobberpuhl, P. M. Donahue, G. W. Hoeppner, T. H. Lee, P. C. M. Lin, L. Madden, M. H. Pearce, K. J. Snyder, and S. C. Thierauf, "A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor," IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, 1996.
[3] M. D. Hill. Dinero IV Trace-Driven Uniprocessor Cache Simulator. [Online]. Available: http://pages.cs.wisc.edu/~markhill/DineroIV/
[4] M. Hill and A. Smith, "Evaluating Associativity in CPU Caches," IEEE Transactions on Computers, vol. 38, no. 12, pp. 1612-1630, 1989.
[5] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, "Evaluation techniques for storage hierarchies," IBM Systems Journal, vol. 9, no. 2, pp. 78-117, 1970.
[6] N. Tojo, N. Togawa, M. Yanagisawa, and T. Ohtsuki, "Exact and fast L1 cache simulation for embedded systems," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC'09). IEEE, Jan. 2009, pp. 817-822.
[7] M. S. Haque, A. Janapsatya, and S. Parameswaran, "SuSeSim: A Fast Simulation Strategy to Find Optimal L1 Cache Configuration for Embedded Systems," in Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'09), 2009, pp. 295-304.
[8] M. S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran, "DEW: A Fast Level 1 Cache Simulation Approach for Embedded Processors with FIFO Replacement Policy," in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE'10), 2010, pp. 496-501.
[9] M. S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran, "SCUD: A Fast Single-pass L1 Cache Simulation Approach for Embedded Processors with Round-robin Replacement Policy," in Proceedings of the Design Automation Conference (DAC'10), 2010, pp. 356-361.
[10] M. S. Haque, J. Peddersen, and S. Parameswaran, "CIPARSim: Cache Intersection Property Assisted Rapid Single-pass FIFO Cache Simulation Technique," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD'11). IEEE, Nov. 2011, pp. 126-133.
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]