
EEL 5708

Caches
Lotzi Bölöni
Acknowledgements
All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright 1998-2002, University of California, Berkeley.




Question: Who Cares About the Memory Hierarchy?

[Figure: CPU-DRAM gap. Performance (log scale, 1 to 1000) vs. year, 1980-2000. Processor performance improves ~60%/yr ("Moore's Law") while DRAM improves only ~7%/yr ("Less' Law?"), so the processor-memory performance gap grows about 50% per year.]

1980: no cache in microprocessors; 1989: first Intel processor with an on-chip cache; 1995: two-level cache on chip.
Generations of Microprocessors
Time of a full cache miss, measured in instructions executed:
1st Alpha: 340 ns / 5.0 ns = 68 clks x 2 instr/clk = 136 instructions
2nd Alpha: 266 ns / 3.3 ns = 80 clks x 4 instr/clk = 320 instructions
3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6 instr/clk = 648 instructions
1/2x the latency x 3x the clock rate x 3x the instructions/clock => ~5x the miss cost
Processor-Memory
Performance Gap Tax
Processor          % Area (cost)   % Transistors (power)
Alpha 21164             37%                77%
StrongArm SA110         61%                94%
Pentium Pro             64%                88%
(Pentium Pro: 2 dies per package: Proc/I$/D$ + L2$)
Caches have no inherent value; they only try to close the performance gap.
What is a cache?
Small, fast storage used to improve average access
time to slow memory.
Exploits spatial and temporal locality
In computer architecture, almost everything is a cache!
Registers: a cache on variables
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)
TLB: a cache on the page table
Branch prediction: a cache on prediction information?

[Figure: the memory hierarchy, from smaller and faster at the top to bigger and slower at the bottom: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape, etc.]
Example: 1 KB Direct Mapped Cache
For a 2^N byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: 1 KB direct-mapped cache with 32-byte blocks (N = 10, M = 5). The block address splits into the Cache Tag (bits 31..10, ex: 0x50, stored as part of the cache state alongside a Valid bit) and the Cache Index (bits 9..5, ex: 0x01), which selects one of 32 cache lines; the Byte Select (bits 4..0, ex: 0x00) picks a byte within the block (Byte 0 ... Byte 1023 across the whole cache).]
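To make the address decomposition concrete, here is a minimal C sketch; the constants N = 10 and M = 5 and the example address are assumptions chosen to match the 1 KB, 32-byte-block cache above:

#include <stdint.h>
#include <stdio.h>

#define CACHE_BITS 10                              /* 2^10 = 1 KB cache    */
#define BLOCK_BITS  5                              /* 2^5 = 32-byte blocks */
#define INDEX_BITS (CACHE_BITS - BLOCK_BITS)       /* 32 cache lines       */

/* Split a 32-bit address into cache tag, cache index, and byte select. */
static void split_address(uint32_t addr, uint32_t *tag,
                          uint32_t *index, uint32_t *byte)
{
    *byte  = addr & ((1u << BLOCK_BITS) - 1);      /* lowest M bits        */
    *index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    *tag   = addr >> CACHE_BITS;                   /* uppermost 32-N bits  */
}

int main(void)
{
    uint32_t tag, index, byte;
    split_address(0x00014020u, &tag, &index, &byte);
    printf("tag=0x%02x index=0x%02x byte=0x%02x\n", tag, index, byte);
    /* Prints tag=0x50 index=0x01 byte=0x00, matching the example above. */
    return 0;
}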
Set Associative Cache
N-way set associative: N entries for each Cache Index
(N direct-mapped caches operating in parallel)
Example: two-way set-associative cache
The Cache Index selects a set from the cache
The two tags in the set are compared to the input in parallel
Data is selected based on the tag comparison result
[Figure: two-way set-associative cache. The Cache Index selects one Cache Block per way; each way's Cache Tag is compared with the address tag (Adr Tag) in parallel, the compare outputs are ORed to form Hit, and Sel1/Sel0 steer a mux that delivers the hitting way's Cache Block.]
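A minimal sketch of that lookup path in C; the set count, structure names, and block size are illustrative assumptions, not from the slides:

#include <stdbool.h>
#include <stdint.h>

#define SETS 64                        /* assumed number of sets */
#define WAYS 2

struct line  { bool valid; uint32_t tag; uint8_t data[32]; };
struct cache { struct line set[SETS][WAYS]; };

/* Two-way lookup: hardware compares both tags of the selected set in
   parallel; this loop models the two comparators, the OR of their
   outputs (the early return), and the mux that picks the hitting way. */
static const uint8_t *lookup(struct cache *c, uint32_t index, uint32_t adr_tag)
{
    for (int w = 0; w < WAYS; w++) {
        struct line *l = &c->set[index % SETS][w];
        if (l->valid && l->tag == adr_tag)
            return l->data;            /* hit: this way's cache block */
    }
    return 0;                          /* miss */
}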
Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped
Cache:
N comparators vs. 1
Extra MUX delay for the data
Data comes AFTER Hit/Miss decision and set selection
In a direct mapped cache, Cache Block is available
BEFORE Hit/Miss:
Possible to assume a hit and continue. Recover later if miss.
Review: Cache Performance

Miss-oriented approach to memory access (CPI_Execution includes ALU and memory instructions):

CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime

Separating out the memory component entirely, using AMAT = Average Memory Access Time (here CPI_AluOps does not include memory instructions):

CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
AMAT = HitTime + MissRate x MissPenalty
     = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
     + (HitTime_Data + MissRate_Data x MissPenalty_Data)
Impact on Performance
Suppose a processor executes at:
Clock Rate = 200 MHz (5 ns per cycle), ideal (no-miss) CPI = 1.1
50% arith/logic, 30% load/store, 20% control
Suppose that 10% of memory operations incur a 50-cycle miss penalty,
and that 1% of instructions incur the same miss penalty.
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/instr)
    + [0.30 (DataMops/instr) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
    + [1 (InstMop/instr) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/instr = 3.1
About 65% of the time (2.0 of 3.1 cycles) the processor is stalled waiting for memory!
AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54 cycles
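A minimal C sketch that reproduces this arithmetic; the rates and penalties are the slide's assumed parameters:

#include <stdio.h>

int main(void)
{
    const double ideal_cpi = 1.1;   /* no-miss CPI                      */
    const double data_ops  = 0.30;  /* loads/stores per instruction     */
    const double data_miss = 0.10;  /* miss rate of memory operations   */
    const double inst_miss = 0.01;  /* miss rate of instruction fetches */
    const double penalty   = 50.0;  /* miss penalty in cycles           */

    double cpi = ideal_cpi
               + data_ops * data_miss * penalty    /* data stalls: 1.5 */
               + 1.0 * inst_miss * penalty;        /* inst stalls: 0.5 */

    /* 1.3 memory accesses per instruction: 1 fetch + 0.3 data ops. */
    double amat = (1.0 / 1.3) * (1.0 + inst_miss * penalty)
                + (0.3 / 1.3) * (1.0 + data_miss * penalty);

    printf("CPI = %.1f, AMAT = %.2f cycles\n", cpi, amat);  /* 3.1, 2.54 */
    return 0;
}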
Example: Harvard Architecture
Unified vs. Separate I&D (Harvard)

Table on page 384:
16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
32KB unified: aggregate miss rate = 1.99%
Which is better (ignoring the L2 cache)?
Assume 33% of instructions are data ops => 75% of accesses are instruction fetches (1.0/1.33)
hit time = 1, miss time = 50
Note that a data hit incurs 1 extra stall in the unified cache (it has only one port)

AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

[Figure: Proc with separate I-Cache-1 and D-Cache-1 backed by Unified Cache-2, vs. Proc with Unified Cache-1 backed by Unified Cache-2.]
Review: Four Questions for
Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level?
(Block placement)
Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level?
(Block identification)
Tag/Block
Q3: Which block should be replaced on a miss?
(Block replacement)
Random, LRU
Q4: What happens on a write?
(Write strategy)
Write Back or Write Through (with Write Buffer)
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Misses
Classifying Misses: 3 Cs
Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
(Misses in even an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved.
(Misses in a fully associative cache of size X.)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
(Misses in an N-way associative cache of size X.)
More recently, a 4th C:
Coherence: misses caused by cache coherence.
3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, decomposed into conflict, capacity, and compulsory components. The compulsory component is vanishingly small.]
2:1 Cache Rule

[Figure: the same miss-rate-per-type plot as above; the gap between the conflict components illustrates the rule.]
miss rate of a 1-way associative cache of size X
≈ miss rate of a 2-way associative cache of size X/2
3Cs Relative Miss Rate

[Figure: the same data shown as relative miss rate per type (0% to 100%) vs. cache size (1 KB to 128 KB), 1-way to 8-way.]
Flaw: plotted for a fixed block size
Good: insight => invention
1. Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. Larger blocks first lower the miss rate, but very large blocks raise it again in the smallest caches.]
2. Reduce Misses via Higher
Associativity
2:1 Cache Rule:
miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
Beware: execution time is the only final measure!
Will the clock cycle time increase?
Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
Example: Avg. Memory Access
Time vs. Miss Rate
Example: assume clock cycle time (CCT) = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the CCT of direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
  1               2.33    2.15    2.07    2.01
  2               1.98    1.86    1.76    1.68
  4               1.72    1.67    1.61    1.53
  8               1.46    1.48    1.47    1.43
 16               1.29    1.32    1.32    1.32
 32               1.20    1.24    1.25    1.27
 64               1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(Red in the original slide marks entries where A.M.A.T. is not improved by more associativity.)
3. Reducing Misses via a
Victim Cache
How to combine the fast hit time of direct mapped yet still avoid conflict misses?
Add a buffer to hold data discarded from the cache
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache
Used in Alpha, HP machines

[Figure: a small fully associative victim cache (four cache lines, each with its own tag and comparator) sits between the cache and the next lower level of the hierarchy, holding lines discarded from the cache.]
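A small C sketch of the victim-cache lookup path; the sizes and names are illustrative assumptions, not the Alpha/HP design:

#include <stdbool.h>
#include <stdint.h>

struct line { bool valid; uint32_t blk; /* block address; data omitted */ };

static struct line cache[128];    /* direct-mapped cache             */
static struct line victim[4];     /* small, fully associative buffer */

/* On a direct-mapped miss, check the victim buffer; on a victim hit,
   swap the two lines instead of going to the next lower level. */
static bool lookup(uint32_t blk)  /* blk = block address (tag + index) */
{
    struct line *l = &cache[blk & 127];
    if (l->valid && l->blk == blk)
        return true;                       /* fast direct-mapped hit  */
    for (int i = 0; i < 4; i++) {
        if (victim[i].valid && victim[i].blk == blk) {
            struct line evictee = *l;      /* conflicting block moves */
            *l = victim[i];                /* into the victim buffer; */
            victim[i] = evictee;           /* hit block is promoted   */
            return true;                   /* slower victim hit       */
        }
    }
    return false;                          /* go to next lower level  */
}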
4. Reducing Misses via
Pseudo-Associativity
How to combine fast hit time of Direct Mapped and have the
lower conflict misses of 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)

Drawback: CPU pipelining is hard if a hit can take 1 or 2 cycles
Better for caches not tied directly to the processor (L2)
Used in the MIPS R10000 L2 cache; similar in UltraSPARC

[Figure: access timeline: Hit Time < Pseudo Hit Time < Miss Penalty.]
5. Reducing Misses by Hardware
Prefetching of Instructions & Data
E.g., Instruction Prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in stream buffer
On miss check stream buffer
Works with data blocks too:
Jouppi [1990]: a single data stream buffer caught 25% of the misses from a 4KB cache; 4 stream buffers caught 43%
Palacharla & Kessler [1994]: for scientific programs, 8 stream buffers caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
Prefetching relies on having extra memory bandwidth that can be used without penalty
6. Reducing Misses by
Software Prefetching Data
Data prefetch
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
Special prefetching instructions cannot cause faults; a form of
speculative execution
Prefetching comes in two flavors:
Binding prefetch: Requests load directly into register.
Must be correct address and register!
Non-Binding prefetch: Load into cache.
Can be incorrect. Frees HW/SW to guess!
Issuing prefetch instructions takes time
Is the cost of issuing prefetches < the savings in reduced misses?
Wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches
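As a concrete non-binding example, GCC and Clang expose a prefetch-hint intrinsic; this loop and its prefetch distance are an illustrative sketch, not from the slides:

#include <stddef.h>

/* Sum an array while hinting the cache about data needed soon.
   __builtin_prefetch is a non-binding hint: it may be dropped and
   cannot fault, matching the "non-binding prefetch" flavor above. */
long sum_with_prefetch(const long *a, size_t n)
{
    const size_t dist = 64;     /* prefetch distance in elements (tunable) */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/1);
        sum += a[i];
    }
    return sum;
}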
7. Reducing Misses by
Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
Instructions
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts (using tools they developed)
Data
Merging Arrays: improve spatial locality by single array of compound elements
vs. 2 arrays
Loop Interchange: change nesting of loops to access data in order stored in
memory
Loop Fusion: Combine 2 independent loops that have same looping and some
variables overlap
Blocking: Improve temporal locality by accessing blocks of data repeatedly
vs. going down whole columns or rows
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality.
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improves temporal locality.
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

Two inner loops:
Read all NxN elements of z[]
Read N elements of 1 row of y[] repeatedly
Write N elements of 1 row of x[]
Capacity misses are a function of N and cache size:
2N^3 + N^2 words accessed (assuming no conflicts; otherwise worse)
Idea: compute on a BxB submatrix that fits in the cache
Blocking Example
/* After (min(a,b) = ((a) < (b) ? (a) : (b))) */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

B is called the Blocking Factor.
Capacity misses drop from 2N^3 + N^2 to N^3/B + 2N^2.
Conflict misses, too?
Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative vs. blocking size:
Lam et al. [1991]: a blocking factor of 24 had a fifth the misses of a factor of 48, despite both fitting in the cache

[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache.]
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7).]
Summary: Miss Rate Reduction
3 Cs: Compulsory, Capacity, Conflict
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
Prefetching comes in two flavors:
Binding prefetch: Requests load directly into register.
Must be correct address and register!
Non-Binding prefetch: Load into cache.
Can be incorrect. Frees HW/SW to guess!
CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Write Policy:
Write-Through vs Write-Back
Write-through: all writes update cache and underlying
memory/cache
Can always discard cached data - most up-to-date data is in memory
Cache control bit: only a valid bit
Write-back: all writes simply update cache
Can't just discard cached data - may have to write it back to memory
Cache control bits: both valid and dirty bits
Other Advantages:
Write-through:
memory (or other processors) always have latest data
Simpler management of cache
Write-back:
much lower bandwidth, since data often overwritten multiple times
Better tolerance to long-latency memory?
Write Policy 2:
Write Allocate vs Non-Allocate
(What happens on write-miss)
Write allocate: allocate a new cache line in the cache
Usually means that you have to do a read miss to fill in the rest of the cache line!
Alternative: per-word valid bits
Write non-allocate (or write-around):
Simply send write data through to the underlying memory/cache; don't allocate a new cache line!
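A toy C sketch contrasting the two policy pairs on a one-line "cache"; all names and sizes are illustrative assumptions:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum policy { WRITE_THROUGH_NO_ALLOCATE, WRITE_BACK_ALLOCATE };

struct line { bool valid, dirty; uint32_t tag; uint8_t data[32]; };

static uint8_t memory[1 << 20];          /* backing store; addr < 1 MB */

static void write_byte(struct line *l, enum policy p,
                       uint32_t addr, uint8_t v)
{
    uint32_t tag = addr >> 5;            /* 32-byte blocks */
    bool hit = l->valid && l->tag == tag;

    if (p == WRITE_THROUGH_NO_ALLOCATE) {
        memory[addr] = v;                /* memory always has latest data */
        if (hit) l->data[addr & 31] = v; /* keep cached copy consistent   */
        return;                          /* write-around: no allocation   */
    }
    /* write-back with write-allocate */
    if (!hit) {
        if (l->valid && l->dirty)        /* displacing a dirty block:     */
            memcpy(&memory[l->tag << 5], l->data, 32);   /* write it back */
        memcpy(l->data, &memory[tag << 5], 32);  /* read miss fills line  */
        l->valid = true; l->tag = tag; l->dirty = false;
    }
    l->data[addr & 31] = v;
    l->dirty = true;                     /* memory copy is now stale      */
}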
1. Reducing Miss Penalty:
Read Priority over Write on Miss
[Figure: a write buffer sits between the CPU and DRAM (or lower memory); writes enter the buffer ("in") and drain to DRAM ("out").]
1. Reducing Miss Penalty:
Read Priority over Write on Miss
Write-through with write buffers creates RAW conflicts with main memory reads on cache misses
If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
Check write buffer contents before a read; if no conflicts, let the memory access continue
Write-back also wants a buffer, to hold displaced blocks
Read miss replacing a dirty block
Normal: write the dirty block to memory, then do the read
Instead: copy the dirty block to a write buffer, do the read, and then do the write
The CPU stalls less, since it restarts as soon as the read is done
2. Reduce Miss Penalty:
Early Restart and Critical Word
First
Don't wait for the full block to be loaded before restarting the CPU
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
Generally useful only with large blocks
Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear whether early restart helps
3. Reduce Miss Penalty: Non-
blocking Caches to reduce stalls on
misses
Non-blocking cache or lockup-free cache allow data
cache to continue to supply cache hits during a miss
requires F/E bits on registers or out-of-order execution
requires multi-bank memories
hit under miss reduces the effective miss penalty
by working during miss vs. ignoring CPU requests
hit under multiple miss or miss under miss may
further lower the effective miss penalty by
overlapping multiple misses
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise multiple outstanding accesses cannot be supported)
The Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC
FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
[Figure: average memory access time (0 to 2 cycles) under "hit under i misses" for i = 0 (base), 1, 2, and 64, across SPEC92 benchmarks: eqntott, espresso, xlisp, compress, mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, and ora. Each added level of hit-under-miss lowers AMAT, with the largest gains on the floating-point programs.]
4: Add a second-level cache
L2 Equations:

AMAT = HitTime_L1 + MissRate_L1 x MissPenalty_L1
MissPenalty_L1 = HitTime_L2 + MissRate_L2 x MissPenalty_L2
AMAT = HitTime_L1 + MissRate_L1 x (HitTime_L2 + MissRate_L2 x MissPenalty_L2)

Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (MissRate_L2)
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (MissRate_L1 x MissRate_L2)
Global miss rate is what matters
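For instance (numbers are hypothetical, not from the slides): if L1 misses on 4% of CPU accesses and L2's local miss rate is 50%, the global L2 miss rate is 0.04 x 0.5 = 2%. A minimal C sketch of the L2 equations:

#include <stdio.h>

int main(void)
{
    /* Hypothetical parameters, not from the slides. */
    double hit_l1 = 1,  miss_l1 = 0.04;   /* L1: 1-cycle hit, 4% miss */
    double hit_l2 = 10, miss_l2 = 0.50;   /* L2 local miss rate: 50%  */
    double penalty_l2 = 100;              /* main-memory access time  */

    double miss_penalty_l1 = hit_l2 + miss_l2 * penalty_l2;  /* = 60  */
    double amat      = hit_l1 + miss_l1 * miss_penalty_l1;   /* = 3.4 */
    double global_l2 = miss_l1 * miss_l2;                    /* = 2%  */

    printf("AMAT = %.1f cycles, global L2 miss rate = %.1f%%\n",
           amat, 100 * global_l2);
    return 0;
}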
Comparing Local and Global
Miss Rates
32 KByte 1st-level cache; increasing 2nd-level cache
Global miss rate stays close to the single-level cache rate, provided L2 >> L1
Don't use the local miss rate
L2 is not tied to the CPU clock cycle!
Cost & A.M.A.T.:
Generally fast hit times and fewer misses
Since hits are few, target miss reduction

[Figure: local vs. global miss rate plotted against L2 cache size, on linear and log scales.]
Reducing Misses:
Which apply to L2 Cache?
Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler Optimizations






L2 cache block size & A.M.A.T.

[Figure: relative CPU time (1.0 to 2.0) vs. L2 block size, for a 32KB L1 and an 8-byte path to memory: 16B -> 1.36, 32B -> 1.28, 64B -> 1.27, 128B -> 1.34, 256B -> 1.54, 512B -> 1.95; 64-byte blocks perform best.]
Reducing Miss Penalty Summary
Four techniques
Read priority over write on miss
Early Restart and Critical Word First on miss
Non-blocking Caches (Hit under Miss, Miss under Miss)
Second Level Cache
Can be applied recursively to Multilevel Caches
Danger is that time to DRAM will grow with multiple levels in
between
First attempts at L2 caches can make things worse, since
increased worst case is worse
CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

AMAT = HitTime + MissRate x MissPenalty
1. Fast Hit times
via Small and Simple Caches
Why does the Alpha 21164 have an 8KB instruction cache and an 8KB data cache plus a 96KB second-level cache?
A small data cache supports a fast clock rate
Direct mapped, on chip
2. Fast hits by Avoiding
Address Translation
[Figure: three organizations.
Conventional organization: CPU -> TLB (VA -> PA) -> physically addressed cache -> memory.
Virtually addressed cache: CPU -> cache (VA); translate only on a miss; synonym problem.
Overlapped: the cache is accessed with the VA while the TLB translates in parallel and physical tags are compared; requires the cache index to remain invariant across translation (the L2 cache uses PA tags).]
2. Fast hits by Avoiding Address
Translation
Send virtual address to cache? Called Virtually
Addressed Cache or just Virtual Cache vs. Physical
Cache
Every time a process is switched, the cache logically must be flushed; otherwise we get false hits
Cost: time to flush + compulsory misses from the empty cache
Dealing with aliases (sometimes called synonyms):
two different virtual addresses map to the same physical address
I/O must interact with the cache, so it needs the virtual address
Solution to aliases:
HW guarantees that aliases agree in the bits covering the index field; with direct mapping they then map to the same line and must be unique; called page coloring
Solution to cache flush:
Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit on the wrong process
2. Fast Cache Hits by Avoiding
Translation: Process ID impact
[Figure: miss rate (up to 20%) vs. cache size (2 KB to 1024 KB).
Black: uniprocess.
Light gray: multiprocess, flushing the cache on a process switch.
Dark gray: multiprocess, using a process-ID tag.]
2. Fast Cache Hits by Avoiding
Translation: Index with Physical
Portion of Address
If index is physical part of address, can start tag
access in parallel with translation so that can
compare to physical tag






This limits the cache size to the page size: what if we want bigger caches while using the same trick?
Higher associativity moves the barrier to the right
Page coloring

[Figure: the address splits into Page Address | Page Offset; the cache sees Address Tag | Index | Block Offset, with Index and Block Offset contained within the Page Offset.]
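As a worked example of this constraint (the 4 KB page size is an assumption, not given on the slide): the page offset supplies 12 untranslated bits, and the index plus block-offset bits must fit within them, so

    cache size <= associativity x page size

A direct-mapped cache is then limited to 4 KB; 2-way allows 8 KB; a 16 KB cache would need 4-way associativity (or page coloring).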
3: Fast Hits by pipelining Cache
Case Study: MIPS R4000
8 Stage Pipeline:
IF: first half of instruction fetch; PC selection happens here, as well as initiation of instruction cache access.
IS: second half of instruction cache access.
RF: instruction decode and register fetch, hazard checking, and instruction cache hit detection.
EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
DF: data fetch, first half of data cache access.
DS: second half of data cache access.
TC: tag check; determine whether the data cache access hit.
WB: write back for loads and register-register operations.
What is the impact on load delay?
Need 2 instructions between a load and its use!
Case Study: MIPS R4000
[Pipeline diagrams: two staircases of overlapping IF IS RF EX DF DS TC WB stages.
First: TWO-cycle load latency (data arrives after DS; the tag check in TC confirms it).
Second: THREE-cycle branch latency (conditions evaluated during the EX phase).]
Delay slot plus two stalls
Branch-likely cancels the delay slot if not taken
R4000 Performance
Not ideal CPI of 1:
Load stalls (1 or 2 clock cycles)
Branch stalls (2 cycles + unfilled slots)
FP result stalls: RAW data hazard (latency)
FP structural stalls: Not enough FP hardware (parallelism)
[Figure: CPI (0 to 4.5) broken into base, load stalls, branch stalls, FP result stalls, and FP structural stalls for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv.]
What is the Impact of What You've Learned About Caches?

1960-1985: Speed = f(no. of operations)
1990s: pipelined execution & fast clock rate; out-of-order execution; superscalar instruction issue
1998: Speed = f(non-cached memory accesses)
What does this mean for compilers? Operating systems? Algorithms? Data structures?

[Figure: the same processor vs. DRAM performance chart, 1980-2000.]
Alpha 21064
Separate Instr & Data
TLB & Caches
TLBs fully associative
TLB updates in SW
(Priv Arch Libr)
Caches 8KB direct
mapped, write thru
Critical 8 bytes first
Prefetch instr. stream
buffer
2 MB L2 cache, direct
mapped, WB (off-chip)
256 bit path to main
memory, 4 x 64-bit
modules
Victim Buffer: to give
read priority over
write
4 entry write buffer
between D$ & L2$
[Figure: Alpha 21064 memory system with separate instruction and data caches, an instruction stream buffer, a victim buffer, and a write buffer between the D$ and the L2$.]
Alpha Memory Performance: Miss Rates of SPEC92

[Figure: miss rates (log scale, 0.01% to 100%) of the 8K I$, 8K D$, and 2M L2 for AlphaSort, TPC-B (db1), Li, Sc, Compress, Ora, Ear, Doduc, Tomcatv, Mdljp2, Spice, and Su2cor. Callouts for three of the benchmarks:
I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%
I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%
I$ miss = 6%, D$ miss = 32%, L2 miss = 10%]
Alpha CPI Components

[Figure: CPI (0.00 to 5.00) broken into L2, I$, D$, instruction stall, and other components for AlphaSort, TPC-B (db1), Li, Sc, Compress, Ora, Ear, Doduc, Tomcatv, and Mdljp2.]
Instruction stall: branch mispredict (green); data cache (blue); instruction cache (yellow); L2$ (pink)
Other: compute + register conflicts, structural conflicts
Pitfall: Predicting Cache Performance
from Different Prog. (ISA, compiler,
...)
4KB data cache: miss rate 8%, 12%, or 28%?
1KB instruction cache: miss rate 0%, 3%, or 10%?
Alpha vs. MIPS for an 8KB data cache: 17% vs. 10%
Why 2X Alpha v. MIPS?

[Figure: miss rate (0% to 35%) vs. cache size (1 KB to 128 KB) for data and instruction caches on tomcatv, gcc, and espresso; the same cache sees very different miss rates on different programs.]
Cache Optimization Summary
Technique                           MR   MP   HT   Complexity
Larger Block Size                   +              0
Higher Associativity                +              1
Victim Caches                       +              2
Pseudo-Associative Caches           +              2
HW Prefetching of Instr/Data        +              2
Compiler-Controlled Prefetching     +              3
Compiler Reduce Misses              +              0
Priority to Read Misses                  +         1
Early Restart & Critical Word 1st        +         2
Non-Blocking Caches                      +         3
Second-Level Caches                      +         2
Better Memory System                     +         3
Small & Simple Caches                         +    0
Avoiding Address Translation                  +    2
Pipelining Caches                             +    2

(MR = miss rate, MP = miss penalty, HT = hit time)