
Five techniques to reduce miss penalty

Multilevel caches
Critical word first and early restart
Giving priority to read misses over writes
Merging write buffer
Victim Caches

Multilevel caches
CPUs get faster.
Cache speed needs to keep up with the CPU!
Caches also need to be BIGGER.
We need both, but that is hard.

How do we solve the bigger-and-faster problem?


Add a second level of cache
The first-level cache is small enough to match the clock cycle time of
the fast CPU.
The second-level cache should be large enough to capture many accesses
that would otherwise go to main memory.
How do we analyze performance?

Miss rate
Local miss rate is the number of misses in a given cache divided by
the total number of memory accesses to this cache.

Miss rate L1 = Misses in L1 / Total CPU memory accesses, for L1.

Miss rate L2 = Misses in L2 / Accesses to L2
             = Misses in L2 / Misses in L1, for L2.

Miss rate
The local miss rate is large for 2nd-level caches
because the 1st-level cache skims the juicy memory accesses.
The global miss rate is a more useful measure:
it measures what fraction of memory accesses must
go all the way to memory.
Global miss rate of L2 = Miss rate of L1 x Local miss rate of L2.
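The distinction can be made concrete with a small calculation. The counts and latencies below are hypothetical, purely for illustration; the average memory access time (AMAT) line follows the standard two-level formula.

```python
# Local vs. global miss rates for a two-level cache, and the resulting
# average memory access time (AMAT). All counts and latencies are
# hypothetical, for illustration only.
cpu_accesses = 1000      # memory accesses issued by the CPU
l1_misses = 40           # accesses that miss in L1 (and go to L2)
l2_misses = 20           # of those, accesses that also miss in L2

l1_local = l1_misses / cpu_accesses   # = L1 global miss rate
l2_local = l2_misses / l1_misses      # large: L1 skimmed the easy hits
l2_global = l2_misses / cpu_accesses  # fraction reaching main memory

# AMAT = hit_L1 + miss_L1 * (hit_L2 + local_miss_L2 * penalty_L2)
hit_l1, hit_l2, mem_penalty = 1, 10, 100   # cycles, assumed
amat = hit_l1 + l1_local * (hit_l2 + l2_local * mem_penalty)

print(l1_local, l2_local, l2_global)   # 0.04 0.5 0.02
print(amat)                            # 3.4
```

Note that the L2 local miss rate (50%) looks alarming, while the global miss rate (2%) shows only 1 in 50 accesses reaches memory.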

L1 and L2
For data in the L1 cache, should it be in the L2 cache too?
Multilevel inclusion
L1 data is always present in L2; this simplifies
consistency between I/O and caches (or among caches).
Drawback: statistics suggest small block sizes for the L1 cache and
bigger block sizes for the L2 cache, which complicates maintaining
inclusion.
Still doable; the Pentium 4 does it.

Multilevel Exclusion
L1 data is NEVER found in the L2 cache.
If the designer can only afford an L2 cache that is slightly bigger
than the L1 cache, they don't want to waste L2 space on duplicates.
AMD Athlon uses exclusion.

Critical word first and early restart


Normally, one needs extra hardware to reduce miss penalty;
this technique does not.
The CPU normally needs just one word of a block at a time.
IDEA: don't wait for the full block to be loaded before sending the
requested word and restarting the CPU.

Two methods
Critical word first: request the missed word first from memory and
send it to the CPU as soon as it arrives;
let the CPU execute while the rest of the words in the block are filled in.
Early restart: fetch the words in normal order, but as soon as the
requested word of the block arrives, send it to the CPU and let it
work.

Drawback
These techniques only pay off when the block is large.
One problem: with spatial locality, there is a better-than-random
chance that the next access is to the remainder of the block.
The effective miss penalty is then the time from the miss until the
second piece arrives.
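The two methods can be compared with a back-of-the-envelope timing model. The latencies below (100 cycles to the first word, 1 cycle per word after that, 8-word blocks) are assumed for illustration, not taken from any real machine.

```python
# Miss-penalty sketch: full-block wait vs. critical word first vs.
# early restart. Timing parameters are assumed, for illustration only.
BLOCK_WORDS = 8
FIRST_WORD = 100   # cycles until the first word arrives from memory
PER_WORD = 1       # cycles per subsequent word

def penalty_full_block():
    """CPU restarts only after the whole block is loaded."""
    return FIRST_WORD + (BLOCK_WORDS - 1) * PER_WORD

def penalty_critical_word_first():
    """Missed word is fetched first; CPU restarts as soon as it arrives."""
    return FIRST_WORD

def penalty_early_restart(word_index):
    """Words arrive in block order; CPU restarts when word_index arrives."""
    return FIRST_WORD + word_index * PER_WORD

print(penalty_full_block())           # 107
print(penalty_critical_word_first())  # 100
print(penalty_early_restart(7))       # 107: worst case, last word of block
```

With these numbers the saving is at most 7 cycles out of 107, which is why the benefit only matters for large blocks.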

Giving priority to read misses over writes


Serve read misses before buffered writes have completed.
SW R3, 512(R0)
LW R1, 1024(R0)
LW R2, 512(R0)
Assume a direct-mapped, write-through cache with a write buffer, and
that 512 and 1024 map to the same block: R2 should receive the value
of R3, but if the read miss to 512 is served while the store still
sits in the write buffer, R2 gets the stale value from memory.

Solution
Wait until the write buffer is empty,
OR check the contents of the write buffer on a read miss: if there are
no conflicts and the memory system is available, let the read miss
continue.
This also saves cost on a write-back cache if the requested content is
found in the write buffer.
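The second solution amounts to an associative check of the write buffer on every read miss. A minimal sketch, with hypothetical addresses and values matching the SW/LW example above:

```python
# Sketch: on a read miss, check the write buffer for the requested
# address before reading memory. Addresses and values are hypothetical.
write_buffer = {}            # pending writes: address -> data
memory = {512: 0, 1024: 7}   # stale value 0 still at address 512

def store(addr, data):
    write_buffer[addr] = data   # buffered; memory is updated later

def load(addr):
    if addr in write_buffer:    # conflict: forward the buffered value
        return write_buffer[addr]
    return memory[addr]         # no conflict: read miss proceeds

store(512, 42)        # SW R3, 512(R0)
print(load(1024))     # LW R1, 1024(R0) -> 7, no conflict with the buffer
print(load(512))      # LW R2, 512(R0)  -> 42, forwarded from the buffer
```

Without the buffer check, the second load would return the stale 0 from memory.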

Merging Write Buffer


When the write buffer is full, the CPU must wait until the write
buffer has an empty entry.
When the write buffer is empty, the CPU puts the data and its
address in the write buffer and continues execution.
If the buffer contains other modified blocks, their addresses can be
checked to see if the address of the new data matches the address
of a valid write buffer entry.
If so, the new data are combined with that entry; this is called write
merging.
Multiword writes are faster than writes performed one word at a time.
This reduces stalls due to the write buffer being full.
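A toy model of the merging check, assuming block-aligned buffer entries of four words each (sizes and addresses are illustrative only):

```python
# Sketch of write merging: buffer entries hold a four-word aligned
# block; a new write whose address falls in a valid entry's block is
# merged into it instead of taking a fresh slot. Sizes are assumed.
WORDS_PER_ENTRY = 4
buffer = []   # list of entries: (block_address, {offset: data})

def write(addr, data):
    block, offset = addr // WORDS_PER_ENTRY, addr % WORDS_PER_ENTRY
    for baddr, words in buffer:
        if baddr == block:         # address matches a valid entry:
            words[offset] = data   # merge, no new slot consumed
            return
    buffer.append((block, {offset: data}))   # otherwise take a new slot

write(100, 'a')      # block 25: new entry
write(101, 'b')      # same block: merged, buffer still has 1 entry
write(108, 'c')      # block 27: second entry
print(len(buffer))   # 2
```

Without merging, the three writes would consume three buffer slots, making a full-buffer stall more likely.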

Victim caches
Remember what was discarded in case it is needed again.
Since the discarded data has already been fetched, it can be used
again at small cost.
Requires a small, fully associative cache between a cache and its
refill path.
The AMD Athlon uses a victim cache with 8 entries.
Victim caches of 1 to 5 entries are effective at reducing misses,
especially for small, direct-mapped data caches.
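The swap behavior can be sketched with a tiny direct-mapped cache plus a two-entry victim buffer. All sizes and addresses below are made up for illustration:

```python
# Sketch of a victim cache: a small fully associative buffer holding
# blocks recently evicted from a direct-mapped cache. Sizes assumed.
from collections import OrderedDict

CACHE_SETS = 4
VICTIM_ENTRIES = 2

cache = {}               # set index -> block address (direct-mapped)
victims = OrderedDict()  # tiny fully associative victim cache (FIFO)

def access(addr):
    idx = addr % CACHE_SETS
    if cache.get(idx) == addr:
        return 'hit'
    if addr in victims:       # discarded block is still nearby:
        victims.pop(addr)     # swap it back in at small cost
        if idx in cache:
            victims[cache[idx]] = True
        cache[idx] = addr
        return 'victim hit'
    if idx in cache:          # evicted block goes to the victim cache
        victims[cache[idx]] = True
        if len(victims) > VICTIM_ENTRIES:
            victims.popitem(last=False)
    cache[idx] = addr
    return 'miss'

print(access(0))   # miss
print(access(4))   # miss: conflicts with 0, which moves to the victims
print(access(0))   # victim hit: recovered without going to memory
```

A plain direct-mapped cache would ping-pong between addresses 0 and 4, missing every time; the victim cache turns those conflict misses into cheap swaps.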
