
Lecture 12:
Memory Hierarchy—Ways to Reduce Misses


Block Replacement

When a miss occurs, the cache controller must select a block to be replaced with
the desired data. A replacement policy determines which block should be replaced.
With direct-mapped placement the decision is simple because there is no choice:
only one block frame is checked for a hit and only that block can be replaced.
With fully-associative or set-associative placement, there is more than one block to
choose from on a miss.
Primary strategies:
Random - to spread allocation uniformly, candidate blocks are selected at random.
Advantage: simple to implement in hardware
Disadvantage: ignores the principle of locality
Least-Recently Used (LRU) - to reduce the chance of throwing out information that
will be needed soon, accesses to blocks are recorded. The block replaced is the one
that has been unused for the longest time (see the sketch below).
Advantage: takes locality into account
Disadvantage: as the number of blocks to keep track of increases, LRU becomes
more expensive: harder to implement, slower, and often only approximated
Other strategies:
First In First Out (FIFO)
Most-Recently Used (MRU)
Least-Frequently Used (LFU)
Most-Frequently Used (MFU)
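
A minimal sketch of LRU bookkeeping for one set of a 4-way set-associative
cache, using per-way timestamps. The structure and names are illustrative,
not taken from the lecture:

#include <stdint.h>
#include <stdbool.h>

#define WAYS 4

typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint32_t last_used[WAYS];   /* timestamp of most recent access */
} CacheSet;

/* Returns the way that holds 'tag', or the way chosen for eviction. */
int access_set(CacheSet *set, uint32_t tag, uint32_t now, bool *hit)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set->valid[w] && set->tag[w] == tag) {   /* hit */
            set->last_used[w] = now;
            *hit = true;
            return w;
        }
        /* eviction candidate: an invalid way, else the least recently used */
        if (!set->valid[w])
            victim = w;
        else if (set->valid[victim] &&
                 set->last_used[w] < set->last_used[victim])
            victim = w;
    }
    /* miss: replace the LRU (or an invalid) way */
    set->valid[victim] = true;
    set->tag[victim] = tag;
    set->last_used[victim] = now;
    *hit = false;
    return victim;
}

The per-way timestamps are why LRU grows expensive: a real design must
compare them all on every miss, which is why hardware often approximates LRU.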


Review: Four Questions for
Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Review: Cache Performance
CPUtime = Instruction Count x (CPIexecution + Memory accesses per instruction
          x Miss rate x Miss penalty) x Clock cycle time

Misses per instruction = Memory accesses per instruction x Miss rate

CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty)
          x Clock cycle time
To Improve Cache Performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache.
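
As a quick numeric check of the CPUtime formula above, a small C program
(all parameter values below are made-up for illustration):

#include <stdio.h>

/* CPU time from the cache-performance formula above. */
double cpu_time(double ic, double cpi_exec, double mem_refs_per_instr,
                double miss_rate, double miss_penalty, double cycle_time)
{
    double misses_per_instr = mem_refs_per_instr * miss_rate;
    return ic * (cpi_exec + misses_per_instr * miss_penalty) * cycle_time;
}

int main(void)
{
    /* Assumed values: 1e9 instructions, base CPI 1.0, 1.3 memory accesses
       per instruction, 2% miss rate, 50-cycle miss penalty, 2 ns cycle. */
    printf("%.1f s\n", cpu_time(1e9, 1.0, 1.3, 0.02, 50.0, 2e-9));  /* 4.6 s */
    return 0;
}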



Reducing Misses
• Classifying Misses: 3 Cs
– Compulsory—The first access to a block is not in the
cache, so the block must be brought into the cache. Also
called cold start misses or first reference misses.
(Misses in even an Infinite Cache)
– Capacity—If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur
due to blocks being discarded and later retrieved.
(Misses in Fully Associative Size X Cache)
– Conflict—If the block-placement strategy is set associative or
direct mapped, conflict misses (in addition to compulsory &
capacity misses) will occur because a block can be
discarded and later retrieved if too many blocks map to its
set. Also called collision misses or interference misses.
(Misses in N-way Associative, Size X Cache; see the classifier
sketch below)
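
The three categories can be checked mechanically. Below is a toy sketch
(assumed sizes and an assumed trace, not from the lecture) that classifies
each miss of a small direct-mapped cache: compulsory if the block was never
referenced before, capacity if a fully-associative LRU cache of equal size
also misses, and conflict otherwise:

#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS 1024    /* toy block-address space */
#define CACHE_BLKS 4    /* cache capacity, in blocks */

static bool seen[NBLOCKS];           /* block ever referenced? */
static int  fa_cache[CACHE_BLKS];    /* fully associative, LRU order: */
static int  fa_n;                    /*   fa_cache[0] is most recent  */
static bool dm_valid[CACHE_BLKS];    /* direct-mapped "real" cache */
static int  dm_block[CACHE_BLKS];

/* Fully-associative LRU cache of the same size; returns true on hit. */
static bool fa_access(int blk)
{
    int i, pos = -1;
    for (i = 0; i < fa_n; i++)
        if (fa_cache[i] == blk) { pos = i; break; }
    bool hit = (pos >= 0);
    if (!hit)                        /* miss: evict the LRU slot if full */
        pos = (fa_n < CACHE_BLKS) ? fa_n++ : CACHE_BLKS - 1;
    for (i = pos; i > 0; i--)        /* move block to the MRU position */
        fa_cache[i] = fa_cache[i - 1];
    fa_cache[0] = blk;
    return hit;
}

static bool dm_access(int blk)       /* the "real" direct-mapped cache */
{
    int idx = blk % CACHE_BLKS;
    bool hit = dm_valid[idx] && dm_block[idx] == blk;
    dm_valid[idx] = true;
    dm_block[idx] = blk;
    return hit;
}

int main(void)
{
    int trace[] = {0, 1, 2, 3, 0, 4, 0, 8, 0, 8};  /* assumed toy trace */
    int n = sizeof trace / sizeof trace[0];
    for (int i = 0; i < n; i++) {
        int blk = trace[i];
        bool fa_hit = fa_access(blk);
        if (!dm_access(blk)) {       /* classify only real-cache misses */
            if (!seen[blk])      printf("block %d: compulsory\n", blk);
            else if (!fa_hit)    printf("block %d: capacity\n", blk);
            else                 printf("block %d: conflict\n", blk);
        }
        seen[blk] = true;
    }
    return 0;
}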



3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type vs. cache size (1-128 KB) for 1-way, 2-way,
4-way, and 8-way associativity, with the miss rate split into Conflict,
Capacity, and Compulsory components. Note: the Compulsory component is
very small.]
2:1 Cache Rule

miss rate of a 1-way associative cache of size X
  = miss rate of a 2-way associative cache of size X/2

[Figure: the same miss-rate-per-type plot as above, annotated to
illustrate the 2:1 rule.]
How Can We Reduce Misses?
• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume total cache size not changed:
• What happens if:
1) Change Block Size:
Which of 3Cs is obviously affected?

2) Change Associativity:
Which of 3Cs is obviously affected?

3) Change Compiler:
Which of 3Cs is obviously affected?



1. Reduce Misses via Larger Block Size

[Figure: miss rate (0-25%) vs. block size (16-256 bytes) for cache sizes
of 1K, 4K, 16K, 64K, and 256K. Larger blocks reduce the miss rate up to a
point; for the smallest caches, very large blocks raise it again.]
2. Reduce Misses via Higher
Associativity
• 2:1 Cache Rule:
– Miss Rate of a direct-mapped cache of size N
≈ Miss Rate of a 2-way set-associative cache of size N/2
• Beware: execution time is the only final measure!
– Will clock cycle time increase?
– Hill [1988] suggested the hit time for 2-way vs. 1-way is
+10% for an external cache, +2% for an internal cache



Example: Avg. Memory Access
Time vs. Miss Rate
• Example: assume the clock cycle time (CCT) is 1.10 for 2-way, 1.12 for
4-way, and 1.14 for 8-way, relative to the CCT of a direct-mapped cache
(a numeric check follows the table)

Cache Size (KB)   1-way   2-way   4-way   8-way
        1         2.33    2.15    2.07    2.01
        2         1.98    1.86    1.76    1.68
        4         1.72    1.67    1.61    1.53
        8         1.46    1.48    1.47    1.43
       16         1.29    1.32    1.32    1.32
       32         1.20    1.24    1.25    1.27
       64         1.14    1.20    1.21    1.23
      128         1.10    1.17    1.18    1.20

(Red in the original slide marks entries where A.M.A.T. is not improved
by more associativity)
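
The 1 KB row can be reproduced from AMAT = Hit time + Miss rate x Miss
penalty. The miss rates and the 10-cycle miss penalty below are assumptions
chosen to match the table, since the slide does not list them:

#include <stdio.h>

/* Average memory access time, in units of the direct-mapped clock cycle. */
double amat(double hit_cct, double miss_rate, double miss_penalty)
{
    return hit_cct + miss_rate * miss_penalty;
}

int main(void)
{
    /* Assumed values: hit CCTs of 1.00 (1-way) and 1.10 (2-way) from the
       slide, a 10-cycle miss penalty, and miss rates for a 1 KB cache. */
    printf("1 KB, 1-way: %.2f\n", amat(1.00, 0.133, 10.0));  /* 2.33 */
    printf("1 KB, 2-way: %.2f\n", amat(1.10, 0.105, 10.0));  /* 2.15 */
    return 0;
}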
3. Reducing Misses via a
“Victim Cache”
• How can we combine the fast hit time of direct mapped and
still avoid conflict misses?
• Add a small buffer to hold data recently discarded from the cache
(see the sketch below)
• Jouppi [1990]: 4-entry victim cache removed 20% to
95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines
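
A minimal sketch of the victim-cache lookup path, assuming a direct-mapped
main cache backed by a small fully-associative buffer with FIFO replacement.
Sizes and names are illustrative, not from Jouppi's design:

#include <stdint.h>
#include <stdbool.h>

#define NSETS   64
#define NVICTIM 4

typedef struct { bool valid; uint32_t tag; } Line;

static Line cache[NSETS];     /* direct-mapped main cache */
static Line victim[NVICTIM];  /* victim entries store the full block address */
static int  next_victim;      /* simple FIFO replacement in the buffer */

bool access_cache(uint32_t block_addr)
{
    uint32_t idx = block_addr % NSETS;
    uint32_t tag = block_addr / NSETS;

    if (cache[idx].valid && cache[idx].tag == tag)
        return true;                      /* hit in the direct-mapped cache */

    for (int i = 0; i < NVICTIM; i++) {
        if (victim[i].valid && victim[i].tag == block_addr) {
            /* victim hit: swap the conflicting lines */
            victim[i] = (Line){ cache[idx].valid,
                                cache[idx].tag * NSETS + idx };
            cache[idx] = (Line){ true, tag };
            return true;
        }
    }
    /* full miss: the evicted line goes to the victim buffer */
    if (cache[idx].valid) {
        victim[next_victim] = (Line){ true, cache[idx].tag * NSETS + idx };
        next_victim = (next_victim + 1) % NVICTIM;
    }
    cache[idx] = (Line){ true, tag };
    return false;
}

Two blocks that conflict in the same direct-mapped set can now ping-pong
through the victim buffer instead of going to memory each time.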



5. Reducing Misses by Prefetching
Instructions & Data
• Instruction prefetching – sequentially prefetch
instructions from instruction memory (IM) to the
instruction queue (IQ), together with branch
prediction – all computers employ this.
• Data prefetching – it is difficult to predict which data
will be used in the future. The following questions
must be answered:
1. What to prefetch? – How do we know which data will
be used? Unnecessary prefetches waste memory/bus
bandwidth and replace useful data in the cache (the
cache pollution problem), hurting execution time.
2. When to prefetch? – It must be early enough for the
data to be useful, but prefetching too early also causes
cache pollution.
Data Prefetching
• Software Prefetching – explicit instructions to
prefetch data are inserted in the program. It is
difficult to decide where to place them; good
compiler analysis is needed. Some computers
already have prefetch instructions (see the sketch
below). Examples are:
– Load data into register (HP PA-RISC loads)
– Cache prefetch: load into cache
(MIPS IV, PowerPC, SPARC v. 9)
• Hardware Prefetching – difficult to predict and
design; gives different results for different
applications
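As a concrete illustration (not from the lecture), GCC and Clang expose the
__builtin_prefetch intrinsic, which compiles to the target's cache-prefetch
instruction where one exists. The prefetch distance of 16 elements is an
assumption that would need tuning per machine:

#include <stddef.h>

/* Sum an array while software-prefetching ahead of the loads. */
double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low locality */
        s += a[i];
    }
    return s;
}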



Reducing Cache Pollution
• E.g., Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer”
– On miss check stream buffer
• Works with data blocks too (sketch below):
– Jouppi [1990]: one data stream buffer caught 25% of the
misses from a 4KB cache; 4 stream buffers caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 stream
buffers caught 50% to 70% of the misses from
two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory
bandwidth that can be used without penalty
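
A minimal sketch of the sequential stream-buffer idea, assuming a single
buffer primed with the next few block addresses after a miss. This is
illustrative, not the Alpha 21064 design:

#include <stdint.h>
#include <stdbool.h>

#define DEPTH 4

/* One sequential stream buffer: after a miss on block B, it is primed
 * with blocks B+1 .. B+DEPTH. */
typedef struct {
    bool     valid[DEPTH];
    uint32_t block[DEPTH];
} StreamBuffer;

/* On a cache miss, check the stream buffer first. If it hits, the block
 * moves to the cache (not shown); a real buffer would also prefetch the
 * next sequential block to keep streaming. */
bool stream_lookup(StreamBuffer *sb, uint32_t miss_block)
{
    for (int i = 0; i < DEPTH; i++) {
        if (sb->valid[i] && sb->block[i] == miss_block) {
            sb->valid[i] = false;             /* consumed by the cache */
            return true;
        }
    }
    /* Stream-buffer miss: re-prime with the next sequential blocks. */
    for (int i = 0; i < DEPTH; i++) {
        sb->valid[i] = true;
        sb->block[i] = miss_block + 1 + i;
    }
    return false;
}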



Summary
CPUtime = IC x (CPIexecution + Memory accesses per instruction x Miss rate
          x Miss penalty) x Clock cycle time
• 3 Cs: Compulsory, Capacity, Conflict Misses


• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reduce Misses via Victim Cache
4. Reduce Misses via Pseudo-Associativity
5. Reduce Misses by HW Prefetching of Instr, Data
6. Reduce Misses by SW Prefetching of Data
7. Reduce Misses by Compiler Optimizations
• Remember danger of concentrating on just one
parameter when evaluating performance
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.



1. Reducing Miss Penalty:
Read Priority over Write on Miss
• Write-through with write buffers can create RAW hazards
with main-memory reads on cache misses
• Simply waiting for the write buffer to empty can increase
the read miss penalty (by 50% on the old MIPS 1000)
• Check write buffer contents before the read (sketch below);
if there are no conflicts, let the memory access continue
• Write Back?
– Read miss replacing a dirty block
– Normal: write the dirty block to memory, then do the read
– Instead: copy the dirty block to a write buffer, then do the
read, and then do the write
– The CPU stalls less since it restarts as soon as the read is done
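
A minimal sketch of the write-buffer check, assuming a small buffer
searched by address. The structure is illustrative:

#include <stdint.h>
#include <stdbool.h>

#define WBUF_ENTRIES 4

typedef struct {
    bool     valid[WBUF_ENTRIES];
    uint32_t addr[WBUF_ENTRIES];
    uint32_t data[WBUF_ENTRIES];
} WriteBuffer;

/* Returns true (and the data) if the read can be serviced from the
 * write buffer, resolving the RAW hazard without draining the buffer. */
bool read_check(const WriteBuffer *wb, uint32_t addr, uint32_t *data)
{
    for (int i = 0; i < WBUF_ENTRIES; i++) {
        if (wb->valid[i] && wb->addr[i] == addr) {
            *data = wb->data[i];   /* forward the pending write */
            return true;
        }
    }
    return false;                  /* no conflict: read memory directly */
}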



4. Reduce Miss Penalty: Non-blocking
Caches to reduce stalls on misses
• Non-blocking cache or lockup-free cache allow data
cache to continue to supply cache hits during a miss
– requires out-of-order execution CPU
• “hit under multiple miss” or “miss under miss” may
further lower the effective miss penalty by overlapping
multiple misses
– Significantly increases the complexity of the cache controller
as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise multiple
outstanding misses cannot be serviced)
– Pentium Pro allows 4 outstanding memory misses
• The technique requires a few miss status holding
registers (MSHRs) to hold the outstanding memory
requests (see the sketch below)
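
A minimal sketch of an MSHR file, with illustrative fields:

#include <stdint.h>
#include <stdbool.h>

#define NMSHR 4   /* e.g., the Pentium Pro allowed 4 outstanding misses */

/* One miss status holding register per outstanding miss. */
typedef struct {
    bool     valid;
    uint32_t block_addr;   /* block being fetched from memory */
    uint8_t  dest_reg;     /* where the data goes when it returns */
} MSHR;

static MSHR mshr[NMSHR];

/* Allocate an MSHR for a new miss; returns -1 if all are busy, in which
 * case the cache must stall (no more misses can be outstanding). */
int mshr_alloc(uint32_t block_addr, uint8_t dest_reg)
{
    for (int i = 0; i < NMSHR; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){ true, block_addr, dest_reg };
            return i;
        }
    }
    return -1;
}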
5th Miss Penalty Reduction:
Second Level Cache
• L2 Equations
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1

Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2

AMAT = Hit TimeL1 + Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2)

• Definitions:
– Local miss rate— misses in this cache divided by the total
number of memory accesses to this cache (Miss rateL2)
– Global miss rate—misses in this cache divided by the total
number of memory accesses generated by the CPU
(Miss RateL1 x Miss RateL2)
– Global Miss Rate is what matters
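
A small numeric sketch of these definitions; the miss rates and times
below are assumed values, not from the lecture:

#include <stdio.h>

int main(void)
{
    /* Assumed: L1 misses 4% of CPU accesses; half of those also miss
       in L2, so the local L2 miss rate is 50%. */
    double miss_rate_l1 = 0.04;   /* local = global for L1 */
    double local_l2     = 0.50;
    double global_l2    = miss_rate_l1 * local_l2;   /* 2% of CPU accesses */

    double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;  /* cycles */
    double amat = hit_l1 + miss_rate_l1 * (hit_l2 + local_l2 * penalty_l2);

    printf("global L2 miss rate = %.3f\n", global_l2);  /* 0.020 */
    printf("AMAT = %.2f cycles\n", amat);               /* 3.40 */
    return 0;
}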
An Example (p. 576)
Q: Suppose we have a processor with a base CPI of 1.0, assuming all
references hit in the primary cache, and a clock rate of 500 MHz. The
main memory access time is 200 ns. Suppose the miss rate per instruction
is 5%. What is the revised CPI? How much faster will the machine run if
we add a secondary cache (with 20-ns access time) that reduces the miss
rate to memory to 2%? Assume the access time is the same for a hit or a
miss.
A: The miss penalty to main memory is 200 ns = 100 cycles. Total CPI =
Base CPI + memory-stall cycles per instruction. Hence, revised CPI =
1.0 + 5% x 100 = 6.0
When an L2 with a 20-ns (10-cycle) access time is added, the miss rate
to memory drops to 2%. So, of the 5% of instructions that miss in L1,
3% hit in L2 and 2% miss.
The CPI is reduced to 1.0 + 5% x (10 + 40% x 100) = 3.5. Thus, the
machine with the secondary cache is faster by 6.0/3.5 = 1.7
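
The arithmetic can be verified directly; all values below come from the
example itself:

#include <stdio.h>

int main(void)
{
    double base_cpi   = 1.0;
    double miss_rate  = 0.05;    /* L1 misses per instruction */
    double mem_cycles = 100.0;   /* 200 ns at 500 MHz (2 ns/cycle) */
    double l2_cycles  = 10.0;    /* 20 ns L2 access */
    double l2_local   = 0.40;    /* 2% of the 5% still miss in L2 */

    double cpi_l1only = base_cpi + miss_rate * mem_cycles;
    double cpi_l2     = base_cpi + miss_rate * (l2_cycles + l2_local * mem_cycles);

    printf("CPI, L1 only: %.1f\n", cpi_l1only);       /* 6.0 */
    printf("CPI, with L2: %.1f\n", cpi_l2);           /* 3.5 */
    printf("speedup: %.2f\n", cpi_l1only / cpi_l2);   /* 1.71 */
    return 0;
}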



Reducing Miss Penalty Summary
CPUtime = IC x (CPIexecution + Memory accesses per instruction x Miss rate
          x Miss penalty) x Clock cycle time
• Five techniques
– Read priority over write on miss
– Subblock placement
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit under Miss, Miss under Miss)
– Second Level Cache
• Can be applied recursively to Multilevel Caches
– Danger is that the time to DRAM will grow with multiple
levels in between
– First attempts at L2 caches can make things worse, since
the increased worst-case miss penalty hurts



Cache Optimization Summary

Technique                            MR    MP    HT    Complexity

Miss rate:
Larger Block Size                    +     –            0
Higher Associativity                 +           –      1
Victim Caches                        +                  2
Pseudo-Associative Caches            +                  2
HW Prefetching of Instr/Data         +                  2
Compiler Controlled Prefetching      +                  3
Compiler Reduce Misses               +                  0

Miss penalty:
Priority to Read Misses                    +            1
Subblock Placement                         +     +      1
Early Restart & Critical Word 1st          +            2
Non-Blocking Caches                        +            3
Second Level Caches                        +            2

(MR = miss rate, MP = miss penalty, HT = hit time; + means the
technique improves that factor, – means it hurts it)
