
Lecture 12:
Memory Hierarchy—Ways to Reduce Misses


Block Replacement

When a miss occurs, the cache controller must select a block to be replaced with
the desired data. A replacement policy determines which block should be replaced.
With direct-mapped placement the decision is simple because there is no choice:
only one block frame is checked for a hit and only that block can be replaced.
With fully-associative or set-associative placement, there is more than one block to
choose from on a miss.
Primary strategies:
Random - to spread allocation uniformly, candidate blocks are selected at random.
Advantage: simple to implement in hardware
Disadvantage: ignores the principle of locality
Least-Recently Used (LRU) - to reduce the chance of throwing out information that
will be needed soon, accesses to blocks are recorded. The block replaced is the one
that has been unused for the longest time (see the sketch below).
Advantage: takes locality into account
Disadvantage: as the number of blocks to keep track of increases, LRU becomes
more expensive: harder to implement, slower, and often only approximated
Other strategies:
First In First Out (FIFO)
Most-Recently Used (MRU)
Least-Frequently Used (LFU)
Most-Frequently Used (MFU)
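
A minimal sketch of LRU bookkeeping for one set of a 4-way set-associative
cache, using per-way timestamps. The structure and names are illustrative,
not taken from the lecture:

#include <stdint.h>
#include <stdbool.h>

#define WAYS 4

typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint32_t last_used[WAYS];   /* timestamp of most recent access */
} CacheSet;

/* Returns the way that holds 'tag', or the way chosen for eviction. */
int access_set(CacheSet *set, uint32_t tag, uint32_t now, bool *hit)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set->valid[w] && set->tag[w] == tag) {   /* hit */
            set->last_used[w] = now;
            *hit = true;
            return w;
        }
        /* eviction candidate: an invalid way, else the least recently used */
        if (!set->valid[w])
            victim = w;
        else if (set->valid[victim] &&
                 set->last_used[w] < set->last_used[victim])
            victim = w;
    }
    /* miss: replace the LRU (or an invalid) way */
    set->valid[victim] = true;
    set->tag[victim] = tag;
    set->last_used[victim] = now;
    *hit = false;
    return victim;
}

The per-way timestamps are why LRU grows expensive: a real design must
compare them all on every miss, which is why hardware often approximates LRU.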


Review: Four Questions for
Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Review: Cache Performance
CPUtime = Instruction Count x (CPIexecution + Memory accesses per instruction
          x Miss rate x Miss penalty) x Clock cycle time

Misses per instruction = Memory accesses per instruction x Miss rate

CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty)
          x Clock cycle time
To Improve Cache Performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache.
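
As a quick numeric check of the CPUtime formula above, a small C program
(all parameter values below are made-up for illustration):

#include <stdio.h>

/* CPU time from the cache-performance formula above. */
double cpu_time(double ic, double cpi_exec, double mem_refs_per_instr,
                double miss_rate, double miss_penalty, double cycle_time)
{
    double misses_per_instr = mem_refs_per_instr * miss_rate;
    return ic * (cpi_exec + misses_per_instr * miss_penalty) * cycle_time;
}

int main(void)
{
    /* Assumed values: 1e9 instructions, base CPI 1.0, 1.3 memory accesses
       per instruction, 2% miss rate, 50-cycle miss penalty, 2 ns cycle. */
    printf("%.1f s\n", cpu_time(1e9, 1.0, 1.3, 0.02, 50.0, 2e-9));  /* 4.6 s */
    return 0;
}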



Reducing Misses
• Classifying Misses: 3 Cs
– Compulsory—The first access to a block is not in the
cache, so the block must be brought into the cache. Also
called cold start misses or first reference misses.
(Misses in even an Infinite Cache)
– Capacity—If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur
due to blocks being discarded and later retrieved.
(Misses in Fully Associative Size X Cache)
– Conflict—If the block-placement strategy is set associative or
direct mapped, conflict misses (in addition to compulsory &
capacity misses) will occur because a block can be
discarded and later retrieved if too many blocks map to its
set. Also called collision misses or interference misses.
(Misses in N-way Associative, Size X Cache; see the classifier
sketch below)
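
The three categories can be checked mechanically. Below is a toy sketch
(assumed sizes and an assumed trace, not from the lecture) that classifies
each miss of a small direct-mapped cache: compulsory if the block was never
referenced before, capacity if a fully-associative LRU cache of equal size
also misses, and conflict otherwise:

#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS 1024    /* toy block-address space */
#define CACHE_BLKS 4    /* cache capacity, in blocks */

static bool seen[NBLOCKS];           /* block ever referenced? */
static int  fa_cache[CACHE_BLKS];    /* fully associative, LRU order: */
static int  fa_n;                    /*   fa_cache[0] is most recent  */
static bool dm_valid[CACHE_BLKS];    /* direct-mapped "real" cache */
static int  dm_block[CACHE_BLKS];

/* Fully-associative LRU cache of the same size; returns true on hit. */
static bool fa_access(int blk)
{
    int i, pos = -1;
    for (i = 0; i < fa_n; i++)
        if (fa_cache[i] == blk) { pos = i; break; }
    bool hit = (pos >= 0);
    if (!hit)                        /* miss: evict the LRU slot if full */
        pos = (fa_n < CACHE_BLKS) ? fa_n++ : CACHE_BLKS - 1;
    for (i = pos; i > 0; i--)        /* move block to the MRU position */
        fa_cache[i] = fa_cache[i - 1];
    fa_cache[0] = blk;
    return hit;
}

static bool dm_access(int blk)       /* the "real" direct-mapped cache */
{
    int idx = blk % CACHE_BLKS;
    bool hit = dm_valid[idx] && dm_block[idx] == blk;
    dm_valid[idx] = true;
    dm_block[idx] = blk;
    return hit;
}

int main(void)
{
    int trace[] = {0, 1, 2, 3, 0, 4, 0, 8, 0, 8};  /* assumed toy trace */
    int n = sizeof trace / sizeof trace[0];
    for (int i = 0; i < n; i++) {
        int blk = trace[i];
        bool fa_hit = fa_access(blk);
        if (!dm_access(blk)) {       /* classify only real-cache misses */
            if (!seen[blk])      printf("block %d: compulsory\n", blk);
            else if (!fa_hit)    printf("block %d: capacity\n", blk);
            else                 printf("block %d: conflict\n", blk);
        }
        seen[blk] = true;
    }
    return 0;
}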



3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type vs. cache size (1-128 KB) for 1-way, 2-way,
4-way, and 8-way associativity, with the miss rate split into Conflict,
Capacity, and Compulsory components. Note: the Compulsory component is
very small.]
2:1 Cache Rule

miss rate of a 1-way associative cache of size X
  = miss rate of a 2-way associative cache of size X/2

[Figure: the same miss-rate-per-type plot as above, annotated to
illustrate the 2:1 rule.]
How Can We Reduce Misses?
• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume total cache size not changed:
• What happens if:
1) Change Block Size:
Which of 3Cs is obviously affected?

2) Change Associativity:
Which of 3Cs is obviously affected?

3) Change Compiler:
Which of 3Cs is obviously affected?



1. Reduce Misses via Larger Block Size

[Figure: miss rate (0-25%) vs. block size (16-256 bytes) for cache sizes
of 1K, 4K, 16K, 64K, and 256K. Larger blocks reduce the miss rate up to a
point; for the smallest caches, very large blocks raise it again.]
2. Reduce Misses via Higher
Associativity
• 2:1 Cache Rule:
– Miss Rate of a direct-mapped cache of size N
≈ Miss Rate of a 2-way set-associative cache of size N/2
• Beware: execution time is the only final measure!
– Will clock cycle time increase?
– Hill [1988] suggested the hit time for 2-way vs. 1-way is
+10% for an external cache, +2% for an internal cache



Example: Avg. Memory Access
Time vs. Miss Rate
• Example: assume the clock cycle time (CCT) is 1.10 for 2-way, 1.12 for
4-way, and 1.14 for 8-way, relative to the CCT of a direct-mapped cache
(a numeric check follows the table)

Cache Size (KB)   1-way   2-way   4-way   8-way
        1         2.33    2.15    2.07    2.01
        2         1.98    1.86    1.76    1.68
        4         1.72    1.67    1.61    1.53
        8         1.46    1.48    1.47    1.43
       16         1.29    1.32    1.32    1.32
       32         1.20    1.24    1.25    1.27
       64         1.14    1.20    1.21    1.23
      128         1.10    1.17    1.18    1.20

(Red in the original slide marks entries where A.M.A.T. is not improved
by more associativity)
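
The 1 KB row can be reproduced from AMAT = Hit time + Miss rate x Miss
penalty. The miss rates and the 10-cycle miss penalty below are assumptions
chosen to match the table, since the slide does not list them:

#include <stdio.h>

/* Average memory access time, in units of the direct-mapped clock cycle. */
double amat(double hit_cct, double miss_rate, double miss_penalty)
{
    return hit_cct + miss_rate * miss_penalty;
}

int main(void)
{
    /* Assumed values: hit CCTs of 1.00 (1-way) and 1.10 (2-way) from the
       slide, a 10-cycle miss penalty, and miss rates for a 1 KB cache. */
    printf("1 KB, 1-way: %.2f\n", amat(1.00, 0.133, 10.0));  /* 2.33 */
    printf("1 KB, 2-way: %.2f\n", amat(1.10, 0.105, 10.0));  /* 2.15 */
    return 0;
}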
3. Reducing Misses via a
“Victim Cache”
• How can we combine the fast hit time of direct mapped and
still avoid conflict misses?
• Add a small buffer to hold data recently discarded from the cache
(see the sketch below)
• Jouppi [1990]: 4-entry victim cache removed 20% to
95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines
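
A minimal sketch of the victim-cache lookup path, assuming a direct-mapped
main cache backed by a small fully-associative buffer with FIFO replacement.
Sizes and names are illustrative, not from Jouppi's design:

#include <stdint.h>
#include <stdbool.h>

#define NSETS   64
#define NVICTIM 4

typedef struct { bool valid; uint32_t tag; } Line;

static Line cache[NSETS];     /* direct-mapped main cache */
static Line victim[NVICTIM];  /* victim entries store the full block address */
static int  next_victim;      /* simple FIFO replacement in the buffer */

bool access_cache(uint32_t block_addr)
{
    uint32_t idx = block_addr % NSETS;
    uint32_t tag = block_addr / NSETS;

    if (cache[idx].valid && cache[idx].tag == tag)
        return true;                      /* hit in the direct-mapped cache */

    for (int i = 0; i < NVICTIM; i++) {
        if (victim[i].valid && victim[i].tag == block_addr) {
            /* victim hit: swap the conflicting lines */
            victim[i] = (Line){ cache[idx].valid,
                                cache[idx].tag * NSETS + idx };
            cache[idx] = (Line){ true, tag };
            return true;
        }
    }
    /* full miss: the evicted line goes to the victim buffer */
    if (cache[idx].valid) {
        victim[next_victim] = (Line){ true, cache[idx].tag * NSETS + idx };
        next_victim = (next_victim + 1) % NVICTIM;
    }
    cache[idx] = (Line){ true, tag };
    return false;
}

Two blocks that conflict in the same direct-mapped set can now ping-pong
through the victim buffer instead of going to memory each time.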



5. Reducing Misses by Prefetching
Instructions & Data
• Instruction prefetching – sequentially prefetch
instructions from instruction memory (IM) to the
instruction queue (IQ), together with branch
prediction – all computers employ this.
• Data prefetching – it is difficult to predict which data
will be used in the future. The following questions
must be answered:
1. What to prefetch? – How do we know which data will
be used? Unnecessary prefetches waste memory/bus
bandwidth and replace useful data in the cache (the
cache pollution problem), hurting execution time.
2. When to prefetch? – It must be early enough for the
data to be useful, but prefetching too early also causes
cache pollution.
Data Prefetching
• Software Prefetching – explicit instructions to
prefetch data are inserted in the program. It is
difficult to decide where to place them; good
compiler analysis is needed. Some computers
already have prefetch instructions (see the sketch
below). Examples are:
– Load data into register (HP PA-RISC loads)
– Cache prefetch: load into cache
(MIPS IV, PowerPC, SPARC v. 9)
• Hardware Prefetching – difficult to predict and
design; gives different results for different
applications
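As a concrete illustration (not from the lecture), GCC and Clang expose the
__builtin_prefetch intrinsic, which compiles to the target's cache-prefetch
instruction where one exists. The prefetch distance of 16 elements is an
assumption that would need tuning per machine:

#include <stddef.h>

/* Sum an array while software-prefetching ahead of the loads. */
double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low locality */
        s += a[i];
    }
    return s;
}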



Reducing Cache Pollution
• E.g., Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer”
– On miss check stream buffer
• Works with data blocks too (sketch below):
– Jouppi [1990]: one data stream buffer caught 25% of the
misses from a 4KB cache; 4 stream buffers caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 stream
buffers caught 50% to 70% of the misses from
two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory
bandwidth that can be used without penalty
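
A minimal sketch of the sequential stream-buffer idea, assuming a single
buffer primed with the next few block addresses after a miss. This is
illustrative, not the Alpha 21064 design:

#include <stdint.h>
#include <stdbool.h>

#define DEPTH 4

/* One sequential stream buffer: after a miss on block B, it is primed
 * with blocks B+1 .. B+DEPTH. */
typedef struct {
    bool     valid[DEPTH];
    uint32_t block[DEPTH];
} StreamBuffer;

/* On a cache miss, check the stream buffer first. If it hits, the block
 * moves to the cache (not shown); a real buffer would also prefetch the
 * next sequential block to keep streaming. */
bool stream_lookup(StreamBuffer *sb, uint32_t miss_block)
{
    for (int i = 0; i < DEPTH; i++) {
        if (sb->valid[i] && sb->block[i] == miss_block) {
            sb->valid[i] = false;             /* consumed by the cache */
            return true;
        }
    }
    /* Stream-buffer miss: re-prime with the next sequential blocks. */
    for (int i = 0; i < DEPTH; i++) {
        sb->valid[i] = true;
        sb->block[i] = miss_block + 1 + i;
    }
    return false;
}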



Summary
CPUtime = IC x (CPIexecution + Memory accesses per instruction x Miss rate
          x Miss penalty) x Clock cycle time
• 3 Cs: Compulsory, Capacity, Conflict Misses


• Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reduce Misses via Victim Cache
4. Reduce Misses via Pseudo-Associativity
5. Reduce Misses by HW Prefetching of Instr, Data
6. Reduce Misses by SW Prefetching of Data
7. Reduce Misses by Compiler Optimizations
• Remember danger of concentrating on just one
parameter when evaluating performance
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.



1. Reducing Miss Penalty:
Read Priority over Write on Miss
• Write-through with write buffers can create RAW hazards
with main-memory reads on cache misses
• Simply waiting for the write buffer to empty can increase
the read miss penalty (by 50% on the old MIPS 1000)
• Check write buffer contents before the read (sketch below);
if there are no conflicts, let the memory access continue
• Write Back?
– Read miss replacing a dirty block
– Normal: write the dirty block to memory, then do the read
– Instead: copy the dirty block to a write buffer, then do the
read, and then do the write
– The CPU stalls less since it restarts as soon as the read is done
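
A minimal sketch of the write-buffer check, assuming a small buffer
searched by address. The structure is illustrative:

#include <stdint.h>
#include <stdbool.h>

#define WBUF_ENTRIES 4

typedef struct {
    bool     valid[WBUF_ENTRIES];
    uint32_t addr[WBUF_ENTRIES];
    uint32_t data[WBUF_ENTRIES];
} WriteBuffer;

/* Returns true (and the data) if the read can be serviced from the
 * write buffer, resolving the RAW hazard without draining the buffer. */
bool read_check(const WriteBuffer *wb, uint32_t addr, uint32_t *data)
{
    for (int i = 0; i < WBUF_ENTRIES; i++) {
        if (wb->valid[i] && wb->addr[i] == addr) {
            *data = wb->data[i];   /* forward the pending write */
            return true;
        }
    }
    return false;                  /* no conflict: read memory directly */
}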



4. Reduce Miss Penalty: Non-blocking
Caches to reduce stalls on misses
• Non-blocking cache or lockup-free cache allow data
cache to continue to supply cache hits during a miss
– requires out-of-order execution CPU
• “hit under multiple miss” or “miss under miss” may
further lower the effective miss penalty by overlapping
multiple misses
– Significantly increases the complexity of the cache controller
as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise multiple
outstanding misses cannot be serviced)
– Pentium Pro allows 4 outstanding memory misses
• The technique requires a few miss status holding
registers (MSHRs) to hold the outstanding memory
requests (see the sketch below)
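
A minimal sketch of an MSHR file, with illustrative fields:

#include <stdint.h>
#include <stdbool.h>

#define NMSHR 4   /* e.g., the Pentium Pro allowed 4 outstanding misses */

/* One miss status holding register per outstanding miss. */
typedef struct {
    bool     valid;
    uint32_t block_addr;   /* block being fetched from memory */
    uint8_t  dest_reg;     /* where the data goes when it returns */
} MSHR;

static MSHR mshr[NMSHR];

/* Allocate an MSHR for a new miss; returns -1 if all are busy, in which
 * case the cache must stall (no more misses can be outstanding). */
int mshr_alloc(uint32_t block_addr, uint8_t dest_reg)
{
    for (int i = 0; i < NMSHR; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){ true, block_addr, dest_reg };
            return i;
        }
    }
    return -1;
}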
5th Miss Penalty Reduction:
Second Level Cache
• L2 Equations
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1

Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2

AMAT = Hit TimeL1 + Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2)

• Definitions:
– Local miss rate— misses in this cache divided by the total
number of memory accesses to this cache (Miss rateL2)
– Global miss rate—misses in this cache divided by the total
number of memory accesses generated by the CPU
(Miss RateL1 x Miss RateL2)
– Global Miss Rate is what matters
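
A small numeric sketch of these definitions; the miss rates and times
below are assumed values, not from the lecture:

#include <stdio.h>

int main(void)
{
    /* Assumed: L1 misses 4% of CPU accesses; half of those also miss
       in L2, so the local L2 miss rate is 50%. */
    double miss_rate_l1 = 0.04;   /* local = global for L1 */
    double local_l2     = 0.50;
    double global_l2    = miss_rate_l1 * local_l2;   /* 2% of CPU accesses */

    double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;  /* cycles */
    double amat = hit_l1 + miss_rate_l1 * (hit_l2 + local_l2 * penalty_l2);

    printf("global L2 miss rate = %.3f\n", global_l2);  /* 0.020 */
    printf("AMAT = %.2f cycles\n", amat);               /* 3.40 */
    return 0;
}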
An Example (p. 576)
Q: Suppose we have a processor with a base CPI of 1.0, assuming all
references hit in the primary cache, and a clock rate of 500 MHz. The
main memory access time is 200 ns. Suppose the miss rate per instruction
is 5%. What is the revised CPI? How much faster will the machine run if
we add a secondary cache (with 20-ns access time) that reduces the miss
rate to memory to 2%? Assume the access time is the same for a hit or a
miss.
A: The miss penalty to main memory is 200 ns = 100 cycles. Total CPI =
Base CPI + memory-stall cycles per instruction. Hence, revised CPI =
1.0 + 5% x 100 = 6.0
When an L2 with a 20-ns (10-cycle) access time is added, the miss rate
to memory drops to 2%. So, of the 5% of instructions that miss in L1,
3% hit in L2 and 2% miss.
The CPI is reduced to 1.0 + 5% x (10 + 40% x 100) = 3.5. Thus, the
machine with the secondary cache is faster by 6.0/3.5 = 1.7
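
The arithmetic can be verified directly; all values below come from the
example itself:

#include <stdio.h>

int main(void)
{
    double base_cpi   = 1.0;
    double miss_rate  = 0.05;    /* L1 misses per instruction */
    double mem_cycles = 100.0;   /* 200 ns at 500 MHz (2 ns/cycle) */
    double l2_cycles  = 10.0;    /* 20 ns L2 access */
    double l2_local   = 0.40;    /* 2% of the 5% still miss in L2 */

    double cpi_l1only = base_cpi + miss_rate * mem_cycles;
    double cpi_l2     = base_cpi + miss_rate * (l2_cycles + l2_local * mem_cycles);

    printf("CPI, L1 only: %.1f\n", cpi_l1only);       /* 6.0 */
    printf("CPI, with L2: %.1f\n", cpi_l2);           /* 3.5 */
    printf("speedup: %.2f\n", cpi_l1only / cpi_l2);   /* 1.71 */
    return 0;
}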



Reducing Miss Penalty Summary
CPUtime = IC x (CPIexecution + Memory accesses per instruction x Miss rate
          x Miss penalty) x Clock cycle time
• Five techniques
– Read priority over write on miss
– Subblock placement
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit under Miss, Miss under Miss)
– Second Level Cache
• Can be applied recursively to Multilevel Caches
– Danger is that the time to DRAM will grow with multiple
levels in between
– First attempts at L2 caches can make things worse, since
the increased worst-case miss penalty hurts



Cache Optimization Summary

Technique                            MR    MP    HT    Complexity

Miss rate:
Larger Block Size                    +     –            0
Higher Associativity                 +           –      1
Victim Caches                        +                  2
Pseudo-Associative Caches            +                  2
HW Prefetching of Instr/Data         +                  2
Compiler Controlled Prefetching      +                  3
Compiler Reduce Misses               +                  0

Miss penalty:
Priority to Read Misses                    +            1
Subblock Placement                         +     +      1
Early Restart & Critical Word 1st          +            2
Non-Blocking Caches                        +            3
Second Level Caches                        +            2

(MR = miss rate, MP = miss penalty, HT = hit time; + means the
technique improves that factor, – means it hurts it)
