
Computer Architecture: Memory Management

Memory, Paging, Segmentation, Virtual Memory, Caches


Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Hierarchy
The farther away from the CPU, the larger and slower the memory. The hierarchy is a consequence of locality.

Memory hierarchy levels in typical desktop / server computers, figure from [HP06 p.288]


Locality Principle
Programs tend to reuse data and instructions. Rule of thumb [HP06 p.38]:

A program spends 90% of its execution time in only 10% of the code.

Temporal locality: recently accessed items are likely to be accessed in the near future.
Spatial locality: items whose addresses are near one another tend to be referenced close together in time.

Locality Principle

Example of a memory-access trace of a process

Figure from [Sil00 p.327]

Caches
Cache: "a safe place for hiding or storing things"
Webster's Dictionary [HP06 p. C-1]

Here: fast memory that stores copies of data from the most frequently used main memory locations. Used by the CPU to reduce the average time to access memory. Effect: instructions in execution can proceed more quickly:

Instruction fetch is quicker
Memory operands are accessed quicker

(from the CPU's point of view)

Result: faster program execution and improved system performance.



Cached Memory Access


Steps in accessing memory (here: reading from memory), simplified:

1. CPU requests content from a memory location
2. Cache is checked for this datum
3. When present, deliver datum from cache
4. When not, transfer datum from main memory to cache
5. Then deliver from cache to CPU
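The read sequence above can be sketched in a few lines of Python (a toy model, not from the lecture: the cache is a dict of whole addresses, main memory a list):

```python
main_memory = [f"datum@{addr}" for addr in range(32)]  # toy main memory
cache = {}                                             # toy cache

def read(addr):
    if addr in cache:                  # 2. cache is checked for the datum
        return cache[addr]             # 3. hit: deliver from cache
    cache[addr] = main_memory[addr]    # 4. miss: transfer memory -> cache
    return cache[addr]                 # 5. then deliver from cache to CPU

print(read(5))   # miss on the first access: fetched from memory, now cached
print(read(5))   # hit on the second access: delivered from the cache
```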


Caches
To take advantage of spatial locality, a cache contains blocks of data rather than individual bytes. A block is a contiguous run of processor words; it is also called a cache line. Common block sizes: 8 ... 128 bytes. Whole blocks are transferred between main memory and cache (block transfer); the CPU reads and writes individual words (word transfer).

Cache components: data area, tag area, attribute area.

Data Area
All blocks in the cache make up the data area.

With B blocks (cache lines, numbered 0 ... B − 1) of N bytes each:

Cache capacity = B · N bytes

Tag Area
The block addresses of the cached blocks make up the tags¹ of the cache lines. All tags form the tag area.

¹ The statement is slightly simplified. In real caches, often just a fraction of the block address is used as tag.

Attribute Area
The attribute area contains attribute bits for each cache line.

Validity bit V: indicates whether the cache line holds valid data.
V = 1: data is valid
V = 0: data is invalid

Dirty bit D: indicates whether the cache line data is modified with respect to main memory.
D = 1: data is modified
D = 0: data is not modified

Caches
Each cache line plus its tag plus its attributes forms a slot.

Caches
How to find a certain byte in the cache?

The address generated by the CPU (m bits) is divided into two fields:
High-order bits (m − n bits): the block address
Low-order bits (n bits): the offset within that block

The block address is compared against all tags simultaneously. In case of a match (cache hit), the offset selects the byte.

Remark: CPU address space = 2^m bytes, cache line size (block size) = 2^n bytes.

Block Address
Memory can be considered as an array of blocks. Example with 4 bytes per block:

block address   memory address (decimal)   memory address (binary)
0               0                          000000
1               4                          000100
2               8                          001000
3               12                         001100
4               16                         010000
5               20                         010100
6               24                         011000
7               28                         011100
8               32                         100000
9               36                         100100

The block address should not be confused with the memory address at which the block starts. The block address is a block number:

block address = memory address DIV block size
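The relationship above can be checked with a short Python sketch (constants chosen to match the 4-byte blocks of the table; not from the lecture):

```python
BLOCK_SIZE = 4  # bytes per block, as in the table above

def block_address(addr):
    return addr // BLOCK_SIZE   # DIV: the block number

def block_offset(addr):
    return addr % BLOCK_SIZE    # MOD: byte position inside the block

# Because 4 = 2**2, DIV and MOD reduce to a shift and a mask:
assert block_address(36) == 36 >> 2 == 9
assert block_offset(23) == 23 & 0b11 == 3
```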


Caches

Cache mechanism: the CPU memory address is divided into block address and offset. The block address is fed to a comparator that checks it against all tags (of valid lines); the comparator signals hit or miss. On a hit, the offset selects the requested datum from the matching cache line (data out).

Hit Rate
Cache capacity is smaller than the capacity of main memory. Consequently, not all memory locations can be mirrored in the cache. When a required datum is found in the cache, we have a cache hit, otherwise a cache miss.

The hit rate is the fraction of cache accesses that result in a hit:

hit rate = (number of hits) / (number of memory accesses)

The miss rate is the fraction of cache accesses that result in a miss (miss rate = 1 − hit rate).

Amdahl's Law
Used to find the maximum expected improvement to an overall system when a part of the system is improved.
The law is a general law, not restricted to caches or computers.

I = 1 / ((1 − P) + P / S)

I: maximum expected improvement, I > 0 (usually I > 1)
P: proportion of the system improved, 0 ≤ P ≤ 1
S: speedup of that proportion, S > 0, usually S > 1

Amdahl's Law
Example: 30% of the computations can be made twice as fast. P = 0.3, S = 2.

I = 1 / ((1 − 0.3) + 0.3 / 2) = 1 / (0.7 + 0.15) ≈ 1.176

Amdahl's Law in the special case of parallelization:

I = 1 / (F + (1 − F) / N)

F: proportion of sequential calculations (no speedup possible), 0 ≤ F ≤ 1
N: grade of parallelism (e.g. N processors), N > 0

See lecture Advanced Computer Architecture.
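Both forms of the law can be written as small Python helpers (a sketch, not from the lecture; the function names are mine):

```python
def amdahl(P, S):
    """Maximum overall improvement when a proportion P is sped up by S."""
    return 1.0 / ((1.0 - P) + P / S)

def amdahl_parallel(F, N):
    """Special case: proportion F stays sequential, the rest runs on N processors."""
    return 1.0 / (F + (1.0 - F) / N)

print(round(amdahl(0.3, 2), 3))    # the example above: 1.176
print(amdahl_parallel(0.0, 8))     # fully parallelizable work on 8 CPUs: 8.0
```

Note that even with S = 2, the overall gain stays below 18% because 70% of the work is untouched; this is the essential message of the law.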

Caches

Typical memory hierarchy parameters:

Level                  Memory space   Access time
CPU (registers)        500 Byte       250 ps
Cache (SRAM)           64 kB          1 ns
Main memory (DRAM)     1 GB           100 ns
I/O devices (disks)    1 TB           10 ms

Example: assume cache access = 1 ns, main memory access = 100 ns, 90% hit rate. What is the overall improvement?

P = 0.9, S = 100 ns / 1 ns = 100

I = 1 / ((1 − 0.9) + 0.9 / 100) = 1 / 0.109 ≈ 9.17

Memory accesses (as seen by the CPU) are now more than 9 times as fast as without a cache.


Read Access
Reading from memory (improved scheme):

CPU requests datum
Search the cache while fetching the block from memory
Cache hit: deliver datum, discard the fetched block
Cache miss: put the block in the cache and deliver the datum

In case of a hit, the datum is available quickly. In case of a miss there is no benefit from the cache, but also no harm. Things are not that easy when writing to memory. Let's look at the cases of a write hit and a write miss.

Write Hit Policy


Assume a write hit. How to keep cache and main memory consistent on write accesses?

Write through
The datum is written to both the block in the cache and the block in memory (a write buffer between cache and memory can hide part of the write latency).
Cache always clean (no dirty bit required)
CPU write stall (problem reduced through the write buffer)
Main memory always has the most current copy (cache coherency in multi-processor systems)

Write back
The datum is only written to the cache (dirty bit is set). The modified block is written to main memory once it is evicted from the cache.
Write speed = cache speed
Multiple writes to the same block still result in only one write to memory
Less memory bandwidth needed

Write Miss Policy


Assume a write miss. What to do?

Write allocate
The block containing the referenced datum is transferred from main memory to the cache. Then one of the write hit policies is applied. Normally used with write-back caches.

No-write allocate
Write misses do not affect the cache. Instead the datum is modified only in main memory. Write hits, however, do affect the cache. Normally used with write-through caches.


Write Miss Policy


Assume an empty cache and the following sequence of memory operations.
WriteMem[100] WriteMem[100] ReadMem[200] WriteMem[200] WriteMem[100]

How many hits and misses occur when using no-write allocate versus write allocate?
                 No-write allocate   Write allocate
WriteMem[100]    miss                miss
WriteMem[100]    miss                hit
ReadMem[200]     miss                miss
WriteMem[200]    hit                 hit
WriteMem[100]    miss                hit
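The trace above can be replayed with a small simulator (a sketch, not from the lecture: a fully associative, unbounded cache, with addresses 100 and 200 assumed to lie in different blocks):

```python
def simulate(trace, write_allocate):
    """Count (hits, misses) of a trace of ('read'/'write', address) pairs."""
    cached, hits = set(), 0
    for op, addr in trace:
        if addr in cached:
            hits += 1                    # read hit or write hit
        elif op == "read" or write_allocate:
            cached.add(addr)             # miss: the block is brought in
        # no-write allocate: a write miss leaves the cache untouched
    return hits, len(trace) - hits

trace = [("write", 100), ("write", 100), ("read", 200),
         ("write", 200), ("write", 100)]

print(simulate(trace, write_allocate=False))  # (1, 4)
print(simulate(trace, write_allocate=True))   # (3, 2)
```

The counts reproduce the table: one hit for no-write allocate, three for write allocate.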

Caches

Where exactly are the blocks placed in the cache? → Cache Organization
What if the cache is full? → Replacement Strategies

Cache Organization
Where can a block be placed in the cache?

Direct Mapped
With this mapping scheme a memory block can be placed in only one particular slot. The slot number is calculated from
((memory address) DIV (blocksize)) MOD (slots in cache).

Fully Associative
The block can be placed in any slot.

Set Associative
The block can be placed in a restricted set of slots. A set is a group of slots. The block is first mapped onto the set and can then be placed anywhere within the set. The set number is calculated from
((memory address) DIV (block size)) MOD (number of sets in cache).
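The index calculations for the three schemes can be sketched in Python (hypothetical parameters, not from the lecture: 4-byte blocks, 8 slots grouped into 4 sets of 2):

```python
BLOCK_SIZE, NUM_SLOTS, NUM_SETS = 4, 8, 4  # hypothetical cache geometry

def direct_mapped_slot(addr):
    """Direct mapped: the block can go into exactly one slot."""
    return (addr // BLOCK_SIZE) % NUM_SLOTS

def set_associative_set(addr):
    """Set associative: the block can go anywhere within one set."""
    return (addr // BLOCK_SIZE) % NUM_SETS

# Fully associative needs no index calculation: any slot may hold the block.

print(direct_mapped_slot(36))   # block 9 -> slot 1
print(set_associative_set(36))  # block 9 -> set 1
```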

Direct Mapped
Each memory block is mapped to exactly one slot in the cache (many-to-one mapping).
Example: block size = 4 bytes, cache capacity = 4 × 4 = 16 bytes (4 slots). The memory blocks starting at addresses 0, 16, 32, ... map to slot 0; those at 4, 20, 36, ... to slot 1; and so on. If the target slot is already occupied (V = 1), the resident cache line is evicted.

Direct Mapped
Slot = ((memory address) DIV (block size)) MOD (slots in cache)
Offset within slot = (memory address) MOD (block size)

Examples
In which slot goes the block located at address 12D? 12 DIV 4 = 3, 3 MOD 4 = 3 → slot 3
In which slot goes the block located at address 20D? 20 DIV 4 = 5, 5 MOD 4 = 1 → slot 1
Where goes the byte located at address 23D? 23 DIV 4 = 5, 5 MOD 4 = 1, 23 MOD 4 = 3 → cache line (slot) 1, offset 3
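The worked examples can be checked directly (a sketch assuming the 4-byte blocks and 4-slot cache used above):

```python
BLOCK_SIZE, NUM_SLOTS = 4, 4   # the example cache: 4-byte blocks, 4 slots

def slot(addr):
    return (addr // BLOCK_SIZE) % NUM_SLOTS  # DIV then MOD

def offset_in_slot(addr):
    return addr % BLOCK_SIZE

print(slot(12))                        # 3
print(slot(20))                        # 1
print(slot(23), offset_in_slot(23))    # 1 3
```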

Direct Mapped
Extracting slot number and offset directly from memory address
The m-bit address is divided into tag, slot and offset fields; tag and slot together form the block address (m − n bits), the offset is n bits.

The lower bits of the block address select the slot. The size of the slot field depends on the number of slots: size = ld(number of slots), where ld = logarithmus dualis (base 2).

Example
Where goes the byte located at address 23D?

23D = 1 01 11B → tag = 1B, slot = 01B, offset = 11B

Slot 1, offset 3.

Direct Mapped
[Figure: 64 kByte direct-mapped cache using four-word (16 Byte) blocks. The 32-bit address (bit positions 31 ... 0) is split into a 16-bit tag (bits 16-31), a 12-bit slot index (bits 4-15) selecting one of 4K lines, a word offset (bits 2-3) and a byte offset (bits 0-1). Each of the 4K entries holds a valid bit, a 16-bit tag and 128 bits of data; a comparator produces the hit signal and a MUX selects the addressed 32-bit word. Figure from lecture CA WS05/06.]

Direct Mapped
Explanations for previous slide

Logical address space of the CPU: 2^32 byte. Number of cache slots: 64 kB / 16 Byte = 4K = 4096 slots. Bits 0 and 1 determine the position of the selected byte within a word; however, as the CPU uses 4-byte words as its smallest entity, the byte offset is not used. Bits 2 and 3 determine the position of the word within a cache line. Bits 4 to 15 (12 bits) determine the slot: 2^12 = 4K = number of slots. Bits 16 to 31 are compared against the tags to see whether or not the block is in the cache.
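The field extraction for this 64 kB cache can be sketched with shifts and masks (not from the lecture; the function name is mine):

```python
# 32-bit address = tag (16 bits) | slot (12 bits) | word (2 bits) | byte (2 bits)

def decode(addr):
    byte = addr & 0x3            # bits 0-1: byte within the word (unused here)
    word = (addr >> 2) & 0x3     # bits 2-3: word within the 16-byte line
    slot = (addr >> 4) & 0xFFF   # bits 4-15: one of the 4096 slots
    tag  = addr >> 16            # bits 16-31: compared against the stored tag
    return tag, slot, word, byte

print(decode(0x0001_2340))   # (1, 564, 0, 0)
```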
