
2013 IEEE 8th International Conference on Industrial and Information Systems, ICIIS 2013, Aug. 18-20, 2013, Sri Lanka

A Survey on Exact Cache Design Space Exploration Methodologies for Application Specific SoC Memory Hierarchies

Isuru Nawinne, Student Member, IEEE, Sri Parameswaran, Member, IEEE
Abstract: Caching is the most widely used solution to improve the memory access speed of a processor. The behaviour of a cache memory is characterized by several parameters, such as the set size, associativity, block size and replacement policy, which compose the configuration of the cache. The cache hits and misses encountered by an application are decided by this configuration. While a cache improves memory performance, it also imposes additional costs in power consumption and chip area, which vary according to the configuration. Deciding the suitable set of cache memories for an application specific embedded system's memory hierarchy is a tedious problem: it requires exploring the design space of how different configurations behave for a given application, to accurately (exactly) determine the number of hits and misses for each configuration. The literature contains several different approaches to performing such explorations efficiently while reducing the design time taken. This paper presents a critical analysis of a representative set of such methods.

I. INTRODUCTION

The performance of a computing system is determined not only by the internal micro-architecture of the processor and its operating frequency, but also by the performance of the memory attached to the processor. Recent advancements in processor architectures and manufacturing technologies have enabled processors to operate at increasingly high frequencies. Unfortunately, the same cannot be said for memory systems: memory typically operates at a significantly lower speed than the processor, which makes it a performance bottleneck.

Caching is a widely used solution to avoid incurring high memory access latencies on every access. A cache memory is characterized by several parameters which describe its structure:

- Block Size B: the size of a single cache entry, in bytes
- Set Size S: the number of cache sets (a set represents the place, or the collection of places, in the cache where a given block from memory can reside)
- Associativity A: the number of locations within a set where a given block from memory can reside

The total size, or capacity, of a cache C can be described as follows:

C = B × S × A bytes    (1)
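For example, a cache with a block size of 32 bytes, 256 sets and 4-way associativity has a capacity of C = 32 × 256 × 4 = 32,768 bytes, i.e. 32 KB.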

Miss Rate describes the ratio of cache misses to total memory accesses. A lower miss rate means fewer accesses to the slower memory in the hierarchy, which in turn means a lower average memory access time and lower energy consumption for memory accesses.

The cache parameters described above directly influence the cache miss rate. A larger Block Size, Set Size or Associativity improves the probability that a requested datum is present in the cache, resulting in a lower miss rate. However, it also increases the cache capacity C, which increases the Hit Time and implementation cost, as well as the energy consumed by the cache. It has been demonstrated in [1] how the execution time varies with the cache miss rate for different cache configurations, using the G.721 encoder application. This gives rise to the need to accurately determine suitable cache parameter values, in order to achieve faster access times without incurring excess costs.
Equally importantly, the utilization of a cache is highly application dependent. Since different applications exhibit different memory access patterns, the miss rate of a cache naturally varies with the running application. This emphasizes the importance of dimensioning the cache parameters for application specific processor systems, which are typically found in the embedded domain. In other words, for a given application, an embedded system designer needs to determine suitable values for the different cache parameters which minimize the cache miss rate. This selection is subject to constraints imposed on the performance requirements, energy consumption and chip area, which are deemed very important in embedded systems. For example, it has been shown that caches can consume up to 43% of the power in a processor [2].
Thus the problem can be stated as follows: given an application and a cache hierarchy for a processor system, autonomously determine the values of the cache parameters that minimize the miss rate within the imposed constraints. This calls for methods to accurately determine the hit and miss rates incurred by different cache configurations, by exploring the cache design space through simulation. This paper aims to comprehensively analyze the previous work on such simulation methodologies for uniprocessor and multiprocessor cache systems. It emphasizes certain aspects of the problem which need to be addressed in order to design a robust solution.

The other important attribute of an associative cache is the Replacement Policy. It determines which cache entry is evicted to make space for another. The most commonly used replacement policies in caches are LRU (Least Recently Used) and FIFO (First In First Out). Different values of these cache parameters combine to constitute a large space of possible cache configurations.

The total memory access time T of a system with a cache can be described in terms of the Miss Rate M, the Hit Time H and the Miss Penalty P:

T = TotalNumberOfAccesses × (H + M × P)    (2)

II. RELATED WORK

Various approaches have been presented to efficiently determine the miss rates of different cache configurations for a given application. Most of these methods use memory access patterns, often referred to as traces (memory access traces), and simulate different cache configurations against them. Traces are extracted from the execution of the application on an instruction set simulator, or on the hardware itself.

The majority of the trace-driven simulation methods in the literature are exact cache simulations, where all the points in the configuration design space are explored. There, the behavior of a cache is closely simulated, except for the storing of the cached data itself. The aim of these algorithms is to accurately determine the cache miss rate. Such simulations characteristically tend to be highly time consuming. Most other approaches to dimensioning cache parameters use heuristics and design-of-experiment methods to avoid exploring the whole design space; these methods are generally faster, at the expense of the accuracy of the result.

Isuru Nawinne and Sri Parameswaran are with the School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia (email: {isurun, sridevan}@cse.unsw.edu.au).


Fig. 1: Overview and flow of memory access trace driven simulation methods
The suitability of a particular cache configuration is determined by the effects it has on the constraints imposed on the system. Commonly found constraints in the embedded systems domain, when simulating cache configurations, are:

- meeting the performance/timing requirements for memory accesses;
- meeting the energy/power consumption limits for the cache;
- meeting the chip area usage limits for the cache.

It should also be noted that the majority of the prior work focuses on uniprocessor systems with one or more levels of caches in the memory hierarchy. A few methods consider multiprocessor systems and the complications added by the coherency of cached data. These works will be discussed in detail in the coming sections.

III. TRACE-DRIVEN EXACT SIMULATION OF UNIPROCESSOR CACHE CONFIGURATIONS

Exact simulation tools like Dinero IV [3] are widely used to calculate accurate miss rates through simulation. Such trace-driven simulators take a memory access trace as input and provide the cache miss rate for a predefined cache configuration (Fig. 1). However, Dinero IV simulates one configuration at a time; the designer therefore has to run the simulator repeatedly for different cache configurations to find the most suitable configuration for a given application's trace. Typically, a trace covering a few seconds of an application can consist of millions of memory accesses, so repeatedly executing the simulator for different cache configurations using such a trace can consume an enormous amount of time, in the order of hours or even days.
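To make the cost of this repetition concrete, the following sketch shows the essence of a single-configuration, trace-driven LRU simulation in Python (a simplified illustration of the general technique, not Dinero IV's actual implementation); evaluating a design space of several dozen configurations requires as many full passes over the trace.

```python
from collections import OrderedDict

def simulate(trace, B, S, A):
    """Count hits/misses of one (B, S, A) LRU cache over an address trace."""
    sets = [OrderedDict() for _ in range(S)]  # per-set tag store, LRU order
    hits = misses = 0
    for addr in trace:
        block = addr // B          # strip the block offset
        index = block % S          # set index
        tag = block // S           # remaining bits form the tag
        ways = sets[index]
        if tag in ways:
            hits += 1
            ways.move_to_end(tag)  # refresh LRU position
        else:
            misses += 1
            if len(ways) >= A:     # evict the least recently used way
                ways.popitem(last=False)
            ways[tag] = True
    return hits, misses

# Exploring the design space this way needs one full pass per configuration:
# for B in (16, 32, 64):
#     for S in (64, 128, 256, 512):
#         for A in (1, 2, 4, 8):
#             print((B, S, A), simulate(trace, B, S, A))
```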

A. Binomial tree based simulation

To avoid this repetitive simulation, Janapsatya et al. introduced an exact simulation method [1] for level 1 caches, based on the formulations presented earlier by Hill et al. in [4]. The basic idea is to simulate multiple cache configurations simultaneously, in order to reduce the time taken to explore the design space. It is a single-pass simulation method, in the sense that the memory access trace is read just once to evaluate all the configurations. A forest of binomial trees, linked lists and an array are used to model the space of all cache configurations to be explored, as illustrated in Fig. 2.
The linked lists store the address tags, which are compared against the addresses from the trace to assess whether each access is a hit or a miss. However, visiting all of these nodes for every memory address in the access trace is an exhaustive task and consumes a large amount of time, especially with a vast number of cache configurations.

Fig. 2: Simulation data structures used by Janapsatya et al. in [1]: The array contains hit/miss counters for each configuration under simulation, and pointers to the tree structures for the relevant block sizes. Each level in a tree corresponds to a set of cache configurations with the same block size and set size, and each tree node represents a cache set. A linked list is associated with each tree node, representing the cache ways and therefore the different associativities.

To remedy this, Janapsatya et al. make use of two observations, first introduced by Mattson et al. in [5] and Hill et al. in [4].

Property 1: When a cache hit occurs for an address MA in a cache configuration (B, S, A), all configurations (B, S', A), where S'>S, with LRU replacement policy are also guaranteed to have hits for MA.

Property 2: A cache hit for an address MA in a configuration (B, S, A) implies cache hits for MA in all configurations (B, S, A'), where A'>A, with the same replacement policy.

These correlations between cache configurations were first used by Mattson et al. in [5] to find the frequency of access to different levels in a memory hierarchy, and by Hill et al. in [4] to analyze the effect of associativity on miss rate. Janapsatya et al. [1] use the same correlations in a manner that allows them to assess cache misses for a group of configurations at once, enabling rapid evaluation of miss rates.

Analytical models for system timing and energy consumption are employed to quantify the suitability of the explored cache configurations. The equations incorporate the calculated exact cache miss rate into finding the memory access delays and the consumed energy.

The work in [6] by Tojo et al. proposed further improvements to the approach of Janapsatya et al. They utilize the cache inclusion property presented in [5] to define a new heuristic. The cache inclusion property states that a cache configuration c1 is a subset of a cache c2 if all the contents of c1 are included in the contents of c2. It can be observed that a cache configuration can be a subset of another configuration with a higher number of sets. The heuristic is therefore constructed as follows.

Property 3: A cache hit for an address MA in a direct mapped cache configuration (B, S, 1) implies hits for MA in all set associative cache configurations (B, S', A), where S'>S.

Based on this, Tojo et al. proposed a modified algorithm called CRCB1, which further reduces the design space to be explored without compromising accuracy. The authors then extend their heuristic to cover additional ground.

Property 4: Consecutive accesses to the most recently accessed memory address MA result in hits in all cache configurations (B, S, A) with S>=1, B>=1, A>=1.

This is a generalization of Property 3. The observation is used in the CRCB2 algorithm, which is added on top of CRCB1 and reduces the number of hit/miss assessments by a significant amount. This approach is claimed to provide, on average, 1.8 times faster exact cache simulation compared to Janapsatya's method.

Haque and Janapsatya proposed enhancements to the algorithm from [1] in their subsequent work, SuSeSim [7]. Two additional correlations were observed in the space of cache configurations, which could be used to further reduce the total simulation time.

Property 5: When a cache miss occurs for an address MA in a cache configuration (B, S, A), all configurations (B, S', A), where S'<S, with LRU replacement policy are also guaranteed to have misses for MA. (This property is the contrapositive of Property 1.)
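To illustrate how Properties 1 and 2 allow many configurations to be evaluated in a single pass, the following Python sketch counts hits for every associativity from 1 to max_A, for one fixed block size and set count, using per-set LRU stack distances in the spirit of Mattson et al. [5]. It is a simplified illustration only; the actual simulators use the binomial tree and linked list structures of Fig. 2.

```python
def hits_per_associativity(trace, B, S, max_A):
    """Count LRU hits for all associativities 1..max_A of one (B, S) shape."""
    stacks = [[] for _ in range(S)]    # per-set LRU stacks of tags (MRU first)
    hits = [0] * (max_A + 1)
    for addr in trace:
        block = addr // B
        index, tag = block % S, block // S
        stack = stacks[index]
        if tag in stack:
            d = stack.index(tag)       # stack distance: 0 = most recently used
            for A in range(d + 1, max_A + 1):
                hits[A] += 1           # a hit at depth d is a hit for every A > d
            stack.remove(tag)
        stack.insert(0, tag)           # push as most recently used
        del stack[max_A:]              # deeper entries are misses for every A
    return hits[1:]                    # hits[0] is unused padding
```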


Property 6: For an address MA, a cache hit in a configuration (B, S, A) implies that cache misses will occur in the more recently used cache ways of all configurations (B, S', A), where S'<S, with LRU replacement policy.

This method therefore takes a bottom-up approach, where the cache configurations with a higher number of sets are evaluated first. Consequently, the algorithm counts cache misses for each configuration rather than cache hits. Haque et al. make use of doubly linked lists to store the cache tags, and incorporate forward and reverse search functions when searching for tag matches. This allows the tag searching to be 16% faster on average, and the overall algorithm to be 33% faster than the method in [6].
In designing embedded processor systems, the FIFO cache replacement policy is generally preferred over LRU. This is largely because FIFO replacement is comparatively simpler to implement, and consumes less chip area as well as less energy. Building on the previous works, Haque et al. proposed a single-pass cache simulation method named DEW [8] for caches using FIFO replacement. Each tag stored in the linked list is associated with a wave pointer, which points to the corresponding tag in the cache with the next larger set size. This additional information allows the simulation algorithm to directly access the location where a cache entry should exist, without having to search through a list. Hence the search time is dramatically reduced compared to the previous approaches. Additionally, each binomial tree node, which represents a cache set in a configuration, is associated with details of the most recently accessed address (MRA) and the most recently evicted address (MRE). Property 4, discussed above, states that subsequent accesses to the most recently accessed address are always hits. It follows that:

Property 7: An access to the most recently evicted memory address MA is always a miss, for all cache configurations (B, S, A) with S>=1, B>=1, A>=1.

Thus, storing the MRA and MRE tags in the cache set allows faster assessment of hits and misses respectively. By temporal locality, which states that recently accessed addresses are more likely to be accessed again, the MRA and MRE addresses are the most likely to be re-accessed out of all the resident and evicted cache entries. This enhancement potentially reduces the search time even further. Haque et al. claim that the DEW algorithm is up to 40 times faster than Dinero IV, and at least 8 times faster in the worst case, for MediaBench applications. However, it is worth noting that these improvements in simulation time are achieved at the expense of storage space for the simulator.
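The MRA/MRE shortcut can be pictured with the following Python sketch of a single FIFO cache set (an illustration of the idea only: the names are ours, and DEW's wave pointers across set sizes are omitted).

```python
from collections import deque

class FIFOSetWithShortcuts:
    """One cache set with FIFO replacement, plus MRA/MRE quick checks."""
    def __init__(self, assoc):
        self.ways = deque()          # resident tags, oldest first
        self.assoc = assoc
        self.mra = None              # most recently accessed tag -> fast hit
        self.mre = None              # most recently evicted tag  -> fast miss

    def access(self, tag):
        if tag == self.mra:          # Property 4: repeated access always hits
            return True
        if tag == self.mre:          # Property 7: just-evicted tag always misses
            hit = False
        else:
            hit = tag in self.ways   # full (linear) search only when needed
        if not hit:
            if len(self.ways) >= self.assoc:
                self.mre = self.ways.popleft()   # evict the oldest tag (FIFO)
            self.ways.append(tag)
            if self.mre == tag:      # the tag is resident again: clear marker
                self.mre = None
        self.mra = tag
        return hit
```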
The subsequent work of Haque et al. [9], named SCUD, presented a different approach from the previous continuations. It is still a trace-driven exact cache simulation; however, the simulation space is observed from the perspective of memory blocks, as opposed to cache locations. To this end, the authors use a data structure named the Central Lookup Table (CLT), as depicted in Fig. 3. The simulator maintains one CLT for each memory block size simulated, each sorted by block address.

Similar to the previous works, SCUD uses a binomial tree of cache set nodes in association with the CLT, updating it while moving through the memory access trace. The CLTs provide the SCUD simulator with the ability to quickly determine whether a memory access is a hit or a miss in all the configurations. This is made possible by the Count value associated with each CLT entry: a count of 0 for a particular block indicates a miss in all configurations, while the highest possible value indicates a hit in all of them. Haque et al. claim that the SCUD simulator is on average 19 times faster than Dinero IV for MediaBench applications, and 10 times faster for SPEC CPU2000 applications. The downside is that these speedups are obtained at a considerable expense of storage space.

Fig. 3: Example CLT data structure used by Haque et al. in [9]: A CLT contains an entry for each memory block. Within a single block entry, there are as many records as there are different cache set sizes in the simulation. Each record indicates the availability of that memory block in the set of configurations with a fixed set size and different associativities.
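A minimal sketch of the block-centric bookkeeping behind the quick all-hit/all-miss test might look as follows in Python (the field names and layout are assumptions for illustration; SCUD's actual CLT records availability per set size and associativity, as in Fig. 3).

```python
class CLTEntry:
    """One lookup-table record for a memory block (simplified)."""
    def __init__(self, num_configs):
        self.present = [False] * num_configs   # availability per configuration
        self.count = 0                         # how many configurations hold it

def classify(clt, block, num_configs):
    """Quick hit/miss classification of one access across all configurations."""
    entry = clt.get(block)
    if entry is None or entry.count == 0:
        return "miss in all configurations"
    if entry.count == num_configs:
        return "hit in all configurations"
    return "mixed: per-configuration records must be consulted"

clt = {}   # one such table (block address -> CLTEntry) per block size
```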

Most of the correlation properties studied above are based on the contents of a smaller configuration being a subset of the contents of a larger configuration, which makes it possible to draw conclusions about larger configurations while simulating a smaller one. Similarly, Haque et al. proposed a set of intersection properties [10] for caches with FIFO replacement, which predict the availability of memory blocks in other configurations subject to certain conditions.

B. Stack/Table based simulation

Viana et al. formulated a different trace-driven cache simulator, called SPCE, in their work [11]. They determine whether a memory access is a hit or a miss by keeping track of how many unique addresses, mapping to the same cache set, were accessed after the previous reference to the current address. The term Conflicts is used for this count. In other words, once a block is fetched into the cache, the number of conflicts on the same set determines when that block will be evicted from the cache due to a conflict. If the associativity of the configuration concerned is larger than the number of conflicts recorded, then the currently accessed block must still be available in the cache. A set of structures named Conflict Tables is used to analyze the conflicts: one table is created for each degree of associativity under consideration, containing entries for the different block sizes and set sizes.

The SPCE algorithm uses a stack to keep track of previously accessed memory blocks (Fig. 4). If an address is not found in the stack, it is pushed onto the top of the stack, and the access is deemed a miss for all configurations. Once an address is found, the stack is scanned to see how many conflicts occurred after it was previously accessed. This determines which levels of associativity allow that block to remain in the cache, and the conflict tables are updated accordingly. The cache inclusion property is used to determine the cache set sizes where hits could occur. The address is then removed and pushed back onto the top of the stack. The final hit and miss rates are calculated from the values in the conflict tables at the end of the simulation. This formulation provides the benefit of single-pass simulation of a trace without consuming too much storage space. However, it results in a large number of operations being carried out on the stack structure, which consumes the majority of the simulation time. The results show that the SPCE simulator obtains the miss rates for a given trace 14.88 times faster than Dinero IV on average, for the applications in the Motorola PowerStone benchmark suite. This method is therefore not as time efficient as most of the other simulators discussed above, but it is efficient in terms of storage space.

Fig. 4: SPCE Algorithm by Viana et al. in [11]
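The following Python sketch illustrates the conflict-counting idea for one block size under LRU (a simplified rendering: SPCE's actual conflict tables, one per associativity, and its use of the inclusion property to prune the scan are omitted).

```python
def spce_pass(trace, B, set_sizes, assocs):
    """Single pass over a trace, counting hits for every (S, A) combination."""
    stack = []                                    # most recent block first
    hits = {(S, A): 0 for S in set_sizes for A in assocs}
    for addr in trace:
        block = addr // B
        if block in stack:
            pos = stack.index(block)
            for S in set_sizes:
                # conflicts: unique intervening blocks mapping to the same set
                conflicts = sum(1 for b in stack[:pos] if b % S == block % S)
                for A in assocs:
                    if conflicts < A:             # block is still resident
                        hits[(S, A)] += 1
            stack.pop(pos)
        stack.insert(0, block)                    # re-push as most recent
    return hits
```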


Extending the work by Viana et al., Zang et al. proposed a stack-based single-pass cache simulator for two-level caches [12]. The major challenge in two-level cache simulation is producing the filtered access trace for the L2 cache: the L2 cache's access trace is comprised of the missed accesses from the L1 cache. Since a single-pass simulator analyzes a vast space of L1 cache configurations simultaneously, the L2 access trace for each of them is unique. This results in n different L2 cache simulations, where n is the number of L1 cache configurations, so the storage space and simulation time consumed could increase beyond bounds. In order to avoid these complications, Zang et al. limit their scope to exclusive two-level caches with LRU replacement for L1 and FIFO replacement for L2. In exclusive caches, the content of each cache level is a disjoint set from the other: a cache block in one level is guaranteed not to exist in the other level. This enables the simulator to view the two cache levels as one single cache using the original access trace, with only a minimal loss of accuracy in the L2 miss rate estimation. Fig. 5 depicts the two-level cache simulator, named T-SPaCS. However, the combination of two caches enlarges the stack structure dramatically, which degrades performance. To remedy this, the authors associate tree and array data structures with the stack, to determine conflicts faster for different set sizes and associativities.

Fig. 5: T-SPaCS Algorithm by Zang et al. in [12] (Si - set size for level i, B - block size, Wi - associativity for level i, K - conflict tables)

Zang et al. continued their work in [13] by modifying T-SPaCS to simulate unified two-level caches. In unified cache architectures, there are separate instruction and data caches in the first level, while the second level cache hosts both types of blocks. The modified simulator is called U-SPaCS. The memory access trace is divided into separate instruction and data traces, and two stacks are used accordingly. Separate analyses are carried out for the two L1 caches, and L2 analysis occurs in the event of a miss from either of the L1 caches. Both T-SPaCS and U-SPaCS support only exclusive two-level caches, and do not possess the ability to dimension inclusive cache hierarchies.

It is obvious that, given the correct emulation of cache behavior, exact simulation based on memory access traces can provide accurate cache miss rates for different configurations. However, the simulation time and space taken by these algorithms are a significant concern. Even though various enhancements utilizing optimized data structures and correlation properties of cache entries have been used to reduce the simulation time without compromising accuracy, exact methods still consume time in the order of hours for real application traces.

IV. TRACE-DRIVEN EXACT SIMULATION OF CACHE CONFIGURATIONS IN HARDWARE

The major limitation of exact simulation of cache configurations based on a memory access trace is the significantly high time consumption, which arises in two ways. The first is the time taken to extract the memory access trace of an application program running on a processor. Typically this is done by simulating the instruction set of the processor in software, which can take up to several days to generate the access trace for a few seconds of execution of the program. For example, encoding 24 low-resolution frames with MPEG2 took 72 hours to extract, yielding 11.15 billion instruction memory accesses, 1.316 billion data memory accesses, and 129.4 GB of trace storage. Extracting address traces using specialized hardware is faster, but such devices incur very high costs and capture only very small access traces. The other contribution to the high time consumption comes from the software simulation itself, which can take from several minutes to several hours depending on the number of simulated cache configurations and the size of the access trace.
In recent work, Schneider et al. proposed a novel solution to alleviate these extreme time penalties. Their simulator [14] is a hardware component designed in VHDL and implemented on an FPGA. As depicted in Fig. 6, the simulator runs on an FPGA and receives the memory access trace from the host computer attached to it through a PCIe connection. The simulation is done in parallel for the different associativities, and the hardware component is pipelined with respect to the different set sizes to improve simulation speed. The structure of this FPGA simulation core adopts most features of the binomial-tree-based simulation method, and utilizes Property 2 and Property 5, described in Section III-A, to reduce the number of logic blocks used by the simulator on the FPGA.

Fig. 6: Operation of the FPGA cache simulator
Schneider et al. demonstrate speed-ups of up to 53 times with the FPGA cache simulator, compared to the fastest software simulators available. The reported simulation times are in the order of seconds and milliseconds, which are substantial improvements over the previous methods.

The most interesting feature of the FPGA cache simulator is its potential to be used alongside a running processor. In other words, the FPGA simulator could be formulated and interfaced in such a way that it consumes the memory addresses generated by a processor running on the same FPGA (Fig. 7), eliminating the need to extract the memory access trace beforehand. This feature could potentially save designers a significant amount of time.

Fig. 7: Using the FPGA cache simulator with a running processor, to extract the memory
access trace in real-time


V. DYNAMIC CACHE TUNING

Analytical-model-based and simulation-based methods to dimension a cache are applied at design time, and can provide highly accurate measures of the cache miss rate for different configurations. An alternative is to configure the cache at run time, which is referred to as dynamic cache tuning, using a configurable cache. This aims to relieve the system designer of the task of finding the optimal cache configuration, reducing the design effort and time. The major drawbacks are the added system complexity, and the significant energy and chip area required by the cache tuner, as well as by a configurable cache compared to a conventional cache.

A dynamic cache tuning method needs to be extremely fast, and implemented in such a manner as not to interfere with the behavior of the original system. Since such an arrangement is essentially a control system, it needs to monitor data signals from the system which contribute to an optimization criterion; the monitored data could be the overall energy usage or the memory access time. Based on decisions made from these data, the parameters of the cache are modified without disrupting the system's processing, which means the tuner's reaction time needs to be extremely fast. Therefore dynamic cache tuning methods use heuristics to quickly adapt the cache parameters, rather than exact analysis.

Since dynamic cache tuning allows the cache to be adapted to different system behaviors at run time, it is suitable for applications with unpredictable memory behavior. The works in [15], [16], [17], [18], [19] attempt to implement cache tuning at run time, with configurable caches. They make use of co-processors which act as the control system for cache tuning, together with a configurable cache model. However, it is beyond the scope of this paper to analyze these in detail.

VI. EXACT SIMULATION OF MULTIPROCESSOR CACHE CONFIGURATIONS

The domain of embedded processing systems has, in recent years, seen a major shift towards Multi-Processor Systems-on-Chip (MPSoC) to achieve better performance. They allow overlapped and parallel execution of programs to achieve higher throughput. Sharing memory address spaces is a preferred way of facilitating communication between programs on a multiprocessor. Among the many shared memory models, the Symmetric Multi-Processor (SMP) is the most widely used architecture. There, all the processors in the system share a single memory (Fig. 8), with partitioned address spaces. The unique feature of this model is that each processor sees similar memory access times, as opposed to distributed memory models.

Fig. 8: An example Symmetric Multiprocessor System
Exploring the configuration design space for such cache hierarchies can consume even more design time, and is a seldom addressed version of the problem. It imposes additional complications due to the caching of shared data: when a processor writes to a shared memory block already cached by another processor, the cache entry in the second processor's cache becomes invalid (or stale). Different cache coherency techniques are used to make sure that the contents of all the caches are kept up to date [20].
When finding the best suited set of cache configurations for a particular multiprocessor system, it is imperative that the coherency of the caches is considered in the simulation. Haque et al. proposed a single-pass trace-driven cache simulation framework for SMP MPSoC architectures in [21], considering the cache coherency of the level 1 caches. It is the only available trace-driven exact simulator for multiprocessors. It assumes an inclusive cache hierarchy, with a shared L2 cache and a private L1 cache per processor. The aim of this simulator, named DIMSim (Fig. 9), is to find a suitable set of cache configurations which allows the system to meet the required memory access timing. It uses a memory access trace taken from the memory's point of view to first dimension the shared L2 cache. The simulator employed here is derived from CIPARSim [10], using the FIFO replacement policy.

Fig. 9: DIMSim Algorithm by Haque et al. in [21]
The original access trace is comprised of a time-ordered sequence of memory accesses to the main memory. In order to simulate the cache configurations for the L1 caches of the system, separate access traces for each processor need to be derived from the original trace. This is shown as the secondary trace generation step in Fig. 9. Additional information is recorded in the secondary traces, which allows the simulator to take cache coherence into account in the L1 simulation. For example, accesses in the secondary traces are annotated with whether they were hits in the selected L2 configuration. Also, cache inclusion causes every miss in the L2 cache to also be a miss in the L1 caches. Making use of the inclusiveness of the cache hierarchy and the additional information recorded in the secondary traces, simulations for each L1 cache are carried out.
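As an illustration, the secondary trace generation step could be sketched in Python as follows (the record format and names are assumptions; DIMSim records further details beyond the L2 hit/miss outcome).

```python
def generate_secondary_traces(trace, l2):
    """Split a memory-side trace per processor, annotating L2 hit/miss.

    trace: iterable of (processor_id, address) pairs;
    l2: a simulator object for the already-chosen L2 configuration.
    """
    secondary = {}                       # processor_id -> annotated trace
    for pid, addr in trace:
        l2_hit = l2.access(addr)         # outcome in the selected L2 config
        secondary.setdefault(pid, []).append((addr, l2_hit))
    return secondary
```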
However, once the L1 caches are in place, the accesses seen by the L2 cache are in reality composed of the cache misses from the L1 caches. Therefore, in this method, the L2 cache configuration obtained with the original access trace may no longer be valid once the L1 caches are dimensioned. The other point worth noting is that the generated secondary traces might not be chronologically correct when considering parallel execution: with the L1 caches present, the order of accesses to the level 2 cache could differ from the original trace, owing to the varying hit/miss times in the different L1 caches. Obtaining accurate memory access traces is a vital part of exact simulation of multiprocessor cache configurations; thus, extracting the memory access trace from actual hardware is preferable to software simulation of the execution. Wilson based his work in [22], multiprocessor cache simulation for bus traffic analysis, on traces obtained from hardware.
Rawlins and Gordon-Ross proposed a run-time tuning methodology [23] for reconfigurable data caches in a dual-processor system. The main objective of the tuner is to reduce the energy consumption of the data caches. It uses a simple algorithm and heuristics: the caches are initialized with the smallest values for all parameters, which are then periodically incremented until no further decrease in energy is observed.
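A minimal sketch of this kind of increase-until-worse heuristic is shown below (all names are illustrative assumptions; the actual tuner in [23] is a hardware mechanism with its own parameter ordering and conditions).

```python
def tune_data_cache(cache, measure_energy):
    """Grow each parameter from its smallest value while energy keeps falling."""
    best = measure_energy()                  # energy over one tuning interval
    for param in ("block_size", "set_size", "associativity"):
        while cache.can_increase(param):     # hypothetical reconfiguration API
            cache.increase(param)
            energy = measure_energy()
            if energy >= best:               # no further decrease: revert, stop
                cache.decrease(param)
                break
            best = energy
```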
In summary, the current literature on accurate simulation of cache configurations consists of trace-driven exact methods. Most methods use memory access traces extracted from instruction set simulation of the processors, while a few use traces extracted from hardware. Various methods exploiting the correlation properties within the space of cache configurations have been proposed to accelerate cache simulation. While many simulators exist for cache configuration simulation of single-processor systems, multiprocessor systems with complex cache hierarchies are seldom explored.

VII. CONCLUSION

As future processing platforms trend towards heterogeneous multiprocessor fabrics, for application specific systems as well as general purpose computing, deciding the most suitable set of configurations for a cache hierarchy is essential to achieve better performance with minimal power and chip area overheads. The importance of exact simulation of cache configurations lies in the highly accurate calculation of cache miss rates, which enables the estimation of properties such as performance and energy consumption. However, the available methods still take a considerable amount of design time (in the order of hours), even when using memory access traces of only a few seconds of execution; this grows by several orders of magnitude when the time taken to extract the memory access trace is considered. Since such simulations are used multiple times in the design process with varying parameters, it is essential that the results can be obtained much faster (in a matter of seconds). Incorporating specialized hardware into the design space exploration is a new trend for achieving significant simulation speed improvements.


Cache hierarchies with multiple private and shared caches extend the configuration design space exponentially, and with it the simulation time. Moreover, such systems impose additional complications on the simulation, such as counting misses occurring due to the management of cache coherency, and handling dependencies between different levels of caches. Methods to achieve these efficiently are a class of design automation problems that needs addressing. Using specialized hardware for simulation allows memory systems to be analyzed realistically and enables the realization of advanced solutions. It should also be noted that while most current simulation methods support associative caches with different replacement policies, support for simulating advanced caching techniques, such as data prefetching, is yet to be seen.

REFERENCES

[1] A. Janapsatya and A. Ignjatovic, "Finding optimal L1 cache configuration for embedded systems," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC'06), 2006, pp. 1-6.
[2] R. T. Witek, A. J. Black, E. M. Cooper, D. W. Dobberpuhl, P. M. Donahue, G. W. Hoeppner, T. H. Lee, P. C. M. Lin, L. Madden, M. H. Pearce, K. J. Snyder, and S. C. Thierauf, "A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor," IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, 1996.
[3] M. D. Hill, "Dinero IV trace-driven uniprocessor cache simulator." [Online]. Available: http://pages.cs.wisc.edu/~markhill/DineroIV/
[4] M. D. Hill and A. J. Smith, "Evaluating associativity in CPU caches," IEEE Transactions on Computers, vol. 38, no. 12, pp. 1612-1630, 1989.
[5] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, "Evaluation techniques for storage hierarchies," IBM Systems Journal, vol. 9, no. 2, pp. 78-117, 1970.
[6] N. Tojo, N. Togawa, M. Yanagisawa, and T. Ohtsuki, "Exact and fast L1 cache simulation for embedded systems," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC'09), Jan. 2009, pp. 817-822.
[7] M. S. Haque, A. Janapsatya, and S. Parameswaran, "SuSeSim: A fast simulation strategy to find optimal L1 cache configuration for embedded systems," in Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'09), 2009, pp. 295-304.
[8] M. S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran, "DEW: A fast level 1 cache simulation approach for embedded processors with FIFO replacement policy," in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE'10), 2010, pp. 496-501.
[9] M. S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran, "SCUD: A fast single-pass L1 cache simulation approach for embedded processors with round-robin replacement policy," in Proceedings of the Design Automation Conference (DAC'10), 2010, pp. 356-361.
[10] M. S. Haque, J. Peddersen, and S. Parameswaran, "CIPARSim: Cache intersection property assisted rapid single-pass FIFO cache simulation technique," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD'11), Nov. 2011, pp. 126-133.
[11] P. Viana, A. Gordon-Ross, E. Barros, and F. Vahid, "A table-based method for single-pass cache optimization," in Proceedings of the 18th ACM Great Lakes Symposium on VLSI (GLSVLSI'08), 2008, p. 71.
[12] W. Zang and A. Gordon-Ross, "T-SPaCS - A two-level single-pass cache simulation methodology," IEEE Transactions on Computers, vol. 62, no. 2, pp. 390-403, 2013.
[13] W. Zang and A. Gordon-Ross, "A single-pass cache simulation methodology for two-level unified caches," in Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS'12), Apr. 2012, pp. 168-177.
[14] J. Schneider, J. Peddersen, and S. Parameswaran, "A scorchingly fast FPGA-based precise L1 LRU cache simulator," to be published in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC'14), Jan. 2014.
[15] A. Gordon-Ross, F. Vahid, and N. Dutt, "Fast configurable-cache tuning with a unified second-level cache," in Proceedings of the 2005 International Symposium on Low Power Electronics and Design (ISLPED'05), 2005, pp. 323-326.
[16] C. Zhang, F. Vahid, and R. Lysecky, "A self-tuning cache architecture for embedded systems," ACM Transactions on Embedded Computing Systems (TECS), vol. 3, no. 2, pp. 407-425, May 2004.
[17] A. Gordon-Ross, J. Lau, and B. Calder, "Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy," in Proceedings of the 18th ACM Great Lakes Symposium on VLSI (GLSVLSI'08), 2008, pp. 379-382.
[18] A. Gordon-Ross, F. Vahid, and N. Dutt, "Automatic tuning of two-level caches to embedded applications," in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE'04), 2004, pp. 208-213.
[19] W. Wang, P. Mishra, and A. Gordon-Ross, "Dynamic cache reconfiguration for soft real-time systems," ACM Transactions on Embedded Computing Systems (TECS), vol. 11, no. 2, 2012.
[20] M. R. Marty, "Cache coherence techniques for multicore processors," Ph.D. dissertation, University of Wisconsin - Madison, 2008.
[21] M. S. Haque, R. Ragel, A. Ambrose, S. Radhakrishnan, and S. Parameswaran, "DIMSim: A rapid two-level cache simulation approach for deadline-based MPSoCs," in Proceedings of the Seventh IEEE/ACM/IFIP International Conference on Hardware/Software Co-design and System Synthesis (CODES+ISSS'12), 2012, pp. 151-160.
[22] A. W. Wilson Jr., "Multiprocessor cache simulation using hardware collected address traces," in Proceedings of the Twenty-Third Annual Hawaii International Conference on System Sciences, 1990, pp. 252-260.
[23] M. Rawlins and A. Gordon-Ross, "CPACT - The conditional parameter adjustment cache tuner for dual-core architectures," in IEEE 29th International Conference on Computer Design (ICCD'11), Oct. 2011, pp. 396-403.
