Cache Memories
Slides source: Bryant (class11.ppt)
Topics: generic cache organization, direct-mapped and set-associative caches, multi-level caches, and the impact of caches on program performance (the memory mountain, matrix multiply, blocking).
Cache memories are small, fast SRAM-based memories managed automatically in hardware. The CPU looks for data first in L1, then in L2, then in main memory.
Typical bus structure:
[Figure: the CPU chip holds the register file, ALU, L1 cache, and bus interface; a cache bus connects it to the L2 cache, and the system bus, I/O bridge, and memory bus connect it to main memory.]
15-213, F02
[Figure: an L1 cache holds copies of selected main-memory blocks (e.g., line 1 "abcd", block 21 "pqrs", block 30 "wxyz").]
General organization of a cache:
[Figure: a cache is an array of S sets (set 0 ... set S-1); each set holds E lines; each line consists of a valid bit, a tag, and a cache block of B = 2^b bytes (bytes 0 ... B-1).]
Addressing Caches
The m bits of address A are divided into three fields: t tag bits (bits m-1 down to s+b), s set-index bits, and b block-offset bits (down to bit 0). The set-index bits select one of the S sets, the tag bits identify a line within that set, and the offset bits select a byte within the block.
Direct-Mapped Cache
Simplest kind of cache; characterized by exactly one line per set (E = 1).
[Figure: sets 0 through S-1, each containing a single valid bit, tag, and cache block.]
Accessing direct-mapped caches: set selection
Use the set-index bits to determine the set of interest.
[Figure: address <tag (t bits) | set index 00001 (s bits) | block offset (b bits)> selects set 1.]
Accessing direct-mapped caches: line matching and word selection
[Figure: selected set i holds a valid line with tag 0110 and block bytes w0 w1 w2 w3; the address's tag bits (0110) match the line's tag, and block offset 100 selects the word starting at byte 4.]
Direct-mapped cache simulation
Parameters: M = 16 byte addresses (m = 4 bits), B = 2 bytes/block (b = 1), S = 4 sets (s = 2), E = 1 line/set, so t = 1 tag bit. Trace of reads (v, tag, data shown per set):
(1) 0  [0000₂] miss — set 0 loads tag 0, M[0-1] (m[0], m[1])
(2) 1  [0001₂] hit
(3) 13 [1101₂] miss — set 2 loads tag 1, M[12-13] (m[12], m[13])
(4) 8  [1000₂] miss — set 0 evicts M[0-1] and loads tag 1, M[8-9] (m[8], m[9])
(5) 0  [0000₂] miss — set 0 evicts M[8-9] and reloads tag 0, M[0-1]
Why use middle bits as the set index? Consider a 4-line cache and 4-bit block addresses 0000-1111.
High-order bit indexing: adjacent memory blocks map to the same cache line, so a sequential scan uses only one quarter of the cache at a time.
Middle-order bit indexing: consecutive memory blocks map to different cache lines, so a sequential scan can use the entire cache.
[Figure: block addresses 0000-1111 grouped by the line (00, 01, 10, 11) they map to under each scheme.]
Set associative caches
Characterized by more than one line per set (E > 1).
[Figure: sets 0 through S-1, each holding E lines, each line with its own valid bit, tag, and cache block.]
Accessing set associative caches: set selection
Set selection is identical to that in a direct-mapped cache: the set-index bits choose the set.
[Figure: address <tag (t bits) | set index 00001 (s bits) | block offset (b bits)> selects set 1, which contains multiple lines.]
Accessing set associative caches: line matching and word selection
Line matching must compare the tag in each valid line in the selected set: (1) the valid bit must be set, and (2) the tag in one of the cache lines must match the tag bits in the address. On a match it is a hit, and word selection proceeds as in a direct-mapped cache, using the block offset.
[Figure: selected set i holds lines with tags 1001 and 0110; the address tag 0110 matches the second line, and block offset 100 selects the word.]
Multi-Level Caches
Options: separate data and instruction caches, or a unified cache.
[Figure: processor (regs ~200 B, ~3 ns; L1 d-cache and L1 i-cache, 8-64 KB, ~3 ns) → unified L2 cache → memory → disk, with size, speed, $/Mbyte, and line size varying at each level.]
Intel Pentium cache hierarchy (processor chip):
Regs → L1 Data (16 KB, 4-way assoc, write-through, 32 B lines, 1-cycle latency) and L1 Instruction (16 KB, 4-way assoc, 32 B lines) → L2 Unified (128 KB-2 MB, 4-way assoc, write-back, write-allocate, 32 B lines) → Main memory (up to 4 GB).
Cache performance metrics:
Miss rate: fraction of memory references not found in the cache; typically 3-10% for L1, and can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit time: time to deliver a cached word to the processor.
Miss penalty: additional time required because of a miss.
Memory mountain
[Figure: the memory mountain — read throughput (MB/s, roughly 200-1200) as a function of stride (s1-s15, in words) and working-set size (2 KB-8 MB), showing ridges of temporal locality (L1, L2, and main-memory regions) and slopes of spatial locality along the stride axis.]
[Figure: a ridge of temporal locality — read throughput (roughly 200-1000 MB/s) vs working-set size (1 KB-8 MB), with distinct L1 cache, L2 cache, and main memory regions.]
[Figure: a slope of spatial locality — read throughput (roughly 100-800 MB/s) vs stride (s1-s8, in words); throughput falls as stride grows until there is one access per cache line.]
Exploit temporal locality and keep the working set small (e.g., by using blocking); block size matters for spatial locality.

Matrix multiplication example
Description: multiply N x N matrices; the locality of accesses to a, b, and c depends on the loop order. Variable sum is held in a register.

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
Analysis method: look at the access pattern of the inner loop (which elements of each array it touches, and in what order) and count misses per inner-loop iteration.
Matrix multiply ijk — inner loop steps k: A is read row-wise as (i,*), B column-wise as (*,j), C fixed at (i,j).
Misses per inner-loop iteration: A = 0.25, B = 1.0, C = 0.0.
Matrix multiply jik — same inner loop as ijk: A row-wise (i,*), B column-wise (*,j), C fixed (i,j).
Misses per inner-loop iteration: A = 0.25, B = 1.0, C = 0.0.
Matrix multiply kij — inner loop steps j: A fixed at (i,k), B read row-wise as (k,*), C row-wise as (i,*).
Misses per inner-loop iteration: A = 0.0, B = 0.25, C = 0.25.
Matrix multiply ikj — same inner loop as kij: A fixed (i,k), B row-wise (k,*), C row-wise (i,*).
Misses per inner-loop iteration: A = 0.0, B = 0.25, C = 0.25.
Matrix multiply jki — inner loop steps i: A is read column-wise as (*,k), B fixed at (k,j), C column-wise as (*,j).
Misses per inner-loop iteration: A = 1.0, B = 0.0, C = 1.0.
Matrix multiply kji — same inner loop as jki: A column-wise (*,k), B fixed (k,j), C column-wise (*,j).
Misses per inner-loop iteration: A = 1.0, B = 0.0, C = 1.0.
Summary of matrix multiply:

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (& kji): 2 loads, 1 store; misses/iter = 2.0
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
[Figure: Pentium matrix multiply performance — cycles/iteration (0-50) vs array size n (25-400) for the six orderings kji, jki, kij, ikj, jik, ijk; pairs with the same inner loop behave alike, with kij/ikj fastest and jki/kji slowest.]
Blocked matrix multiply: partition the matrices into sub-blocks, e.g.
  [A11 A12]   [B11 B12]   [C11 C12]
  [A21 A22] X [B21 B22] = [C21 C22]
Key idea: sub-blocks (i.e., Axy) can be treated just like scalars, e.g. C11 = A11*B11 + A12*B21.
Blocked matrix multiply analysis (inner loops over a kk block):
- a row sliver of A is accessed bsize times;
- successive elements of a sliver of C are updated;
- a bsize x bsize block of B is reused n times in succession.
[Figure: Pentium blocked matrix multiply performance — cycles/iteration (0-60) vs array size for kji, jki, kij, ikj, jik, ijk plus blocked versions bijk and bikj (bsize = 25); the blocked versions remain fast and nearly flat as n grows.]
Concluding Observations
Programmer can optimize for cache performance: how data structures are organized, and how data are accessed (nested loop structure, blocking).