
3.1 tjwc - 12-Nov-11 ISE1 / EE2 Introduction to Computer Architecture


Part 3
Hardware
Part of Charles Babbage's never completed "difference engine". It would
have been the world's first computer, mechanical, driven by steam
Contents
Lecture 11 & 12 - Input & Output, Exceptions & Interrupts
  Interrupt-driven vs DMA vs polled I/O
  SWIs, exceptions & interrupts
  ARM modes, shadow registers
Lecture 13 - ARM Hardware
  Pipelining instruction fetch and execution
  Throughput vs Latency
  ARM Processor organisation
Lecture 14 - Cache Hierarchy
  Adding a cache to the memory hierarchy
  Cache purpose, hit & miss calculations
  Direct mapped caches: basics
  Cache lines
  Tag, index, select fields
  Startup issues
  Valid bit
Lecture 11 & 12 - I/O, Exceptions and Interrupts
Handling I/O efficiently is a key requirement
Exceptions (& interrupts) allow processors to handle events that occur which are not directly related to the user program
  Exceptions: typically caused by unexpected runtime errors in user code (e.g. division by 0)
  Interrupts: special case of exception caused by hardware conditions that require processor action, e.g. I/O service, completely independent of user code
"There is no useful rule without an exception" - Thomas Fuller
Hardware for I/O - Overview
[Figure: CPU, Memory, I/O devices, Direct Memory Access (DMA) Controller, Interrupt controller and other devices connected by the Address bus and Data bus, with IRQ and FIQ interrupt lines]
I/O devices communicate via memory-mapped registers
DMA & Interrupts are optional
Servicing an I/O device
Typical input
  PC serial port receives data at a fixed rate (say 9600 baud), independent of CPU instructions.
  A new data byte is available 9600/10 = 960 times per second (each byte occupies 10 bit times on the line)
  CPU must read the data register to input each new value, repeatedly and at the correct time.
  This operation is known as servicing the I/O device
The problem
  How can the CPU know when to service the I/O device?
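
The deadline implied by these figures can be worked out directly. This is a minimal Python sketch, assuming 10 bit times per byte (1 start bit, 8 data bits, 1 stop bit), which is what the slide's division by 10 suggests; the function name is illustrative only.

```python
def service_deadline_us(baud_rate, bits_per_frame=10):
    """Return (bytes per second, microseconds available to service each byte)."""
    bytes_per_second = baud_rate / bits_per_frame
    return bytes_per_second, 1_000_000 / bytes_per_second

rate, deadline = service_deadline_us(9600)
print(f"{rate:.0f} bytes/s, so the data register must be read about every {deadline:.0f} us")
```

At 9600 baud the CPU therefore has roughly one millisecond per byte, which is the constraint that polling and interrupts both have to meet.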
Polling and Interrupt
Both are methods to notify the processor that an I/O device needs attention
Polling
  simple, but slow
  processor checks status of I/O device regularly to see if it needs attention
  similar to checking a telephone without bells!
  responsibility of user code to check I/O device status soon enough to avoid buffer overrun.
Interrupt
  fast, but more complicated
  processor is notified by I/O device (interrupted) when device needs attention
  similar to a telephone with bells
Method 1: Polling
CPU repeatedly reads status bit from serial port (bit 31 of PORT_STATUS here) to determine if new data is available
When needed it reads data from PORT_DATA(7:0)
Polling is simple, requires no additional hardware
Requires 100% CPU use, inefficient, inflexible

IO_SERVICE_CODE
        ADRL  R10, SER_PORT            ; R10 := SERIAL_PORT
        ADR   R11, BUFFER              ; R11 := array of bytes read
POLL_LOOP                              ; example of polling I/O
        LDR   R0, [R10, #PORT_STATUS]  ; input status
        CMP   R0, #0
        BPL   POLL_LOOP                ; wait till port is ready: b31=1
        LDRB  R0, [R10, #PORT_DATA]    ; input data
        STRB  R0, [R11], #1            ; store data in memory
        B     POLL_LOOP                ; repeat
Method 2: Interrupt
CPU executes user code independently of I/O
I/O device raises interrupt line when it requires service
CPU receives interrupt, suspends current execution, handles device in an interrupt service routine, then returns to previous execution as though nothing had happened
Interrupt is invisible to executing code except for a small loss of speed
Interrupt Hardware for ARM
[Figure: CPU, Memory, I/O devices, Interrupt controller and other devices connected by the Address bus and Data bus, with IRQ and FIQ lines into the CPU]
I/O devices communicate via memory-mapped registers
DMA & Interrupts are optional
ARM has two interrupt inputs: IRQ & FIQ
Optional Interrupt controller allows more I/O devices to use the 2 interrupt lines
Anatomy of an interrupt exception
The interrupt causes user code to suspend and a branch to a fixed, hardware-determined memory location - the so-called interrupt "vector" (labelled IRQ_VEC here)
CPU enters IRQ mode
The vector contains a single branch to a software-determined interrupt service routine, IRQ_SUB
When the interrupt has been dealt with, the ISR makes an interrupt return
User code continues execution as though the interrupt had never happened - no register is disturbed

[Figure: User Mode code (MOV, ADD, MOV, BEQ, LDR, ADD, STR, ...) is suspended by the Exception; the Interrupt service routine runs in IRQ Mode, then returns to the user code]
IRQ_VEC   B     IRQ_SUB
IRQ_SUB                         ; handler code
          ; ...
          SUBS  pc, r14, #4     ; return
ARM Operating Modes
The ARM processor can work in one of many operating modes.
So far we have considered user mode, which is the "normal" mode of operation.
The processor can also enter "privileged" operating modes which are used to handle exceptions and SWIs
The Current Processor Status Register (CPSR) has 5 bits [bit4:0] to indicate which mode the processor is in:
[Table: Mode bits]
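
The slide's mode table did not survive extraction. The sketch below decodes the mode field using the standard ARM encodings, which are supplied here from the ARM architecture rather than from the slide (System mode exists only from ARMv4 onwards); the function name is illustrative.

```python
# Standard ARM CPSR mode encodings, bits [4:0] (assumed, not taken from the slide).
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor (SVC)",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}

def cpsr_mode(cpsr):
    """Name the operating mode held in bits [4:0] of a 32-bit CPSR value."""
    return MODES.get(cpsr & 0x1F, "invalid")

print(cpsr_mode(0x600000D2))  # low five bits are 10010
```

Only the bottom five bits matter for the mode; the rest of the CPSR holds the flags and interrupt-disable bits.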
Shadow Registers - Clever Feature of ARM ISA enables Exceptions
As the processor enters an exception mode, some new registers are automatically switched in to avoid overwriting user regs:
(FIQ Mode)  MOV R6, R8
(User Mode) MOV R6, R8
Shadow Registers - Key points
For example, an external event (such as movement of the mouse) occurs that generates a Fast Interrupt (on the FIQ pin), and the processor enters FIQ operating mode.
It sees the same r0 - r7 as before, and a new set of shadow r8 - r14
By swapping to the new registers, it is easier for the programmer to preserve the state of the processor.
  For example, during FIQ mode, the FIQ versions of r8 - r14 can be used freely. On returning back to user mode, the original values of r8 - r14 will never have changed.
The shadow r8 - r14 preserve their values across FIQ interrupts and can be used to store persistent FIQ state.
ARM Exception Handler Routines
r13, r14 are shadowed automatically
  In handler the alternative shadow registers are used. User r13, r14 not changed.
  R13 - stack pointer
  R14 - link register, holds return address
User R1 is saved on ISR mode stack indexed by shadow r13 (SP).
R1 used for temp data in ISR - holds value of TIME
Code increments TIME
R1 is restored at the end of the ISR
CPSR is saved/restored automatically

Exception_handler (not FIQ)
        STMED r13!, {r1}
        LDR   r1, TIME
        ADD   r1, r1, #1
        STR   r1, TIME
        LDMED r13!, {r1}
        SUBS  pc, r14, #4
ARM Exception Handler Routines

Exception_handler (not FIQ)        cycles
        STMED r13!, {r1}           4
        LDR   r1, TIME             4
        ADD   r1, r1, #1           1
        STR   r1, TIME             4
        LDMED r13!, {r1}           4
        SUBS  pc, r14, #4          4
                                   21 cycles

FIQ_handler (FIQ)                  cycles
        LDR   r8, TIME             4
        ADD   r8, r8, #1           1
        STR   r8, TIME             4
        SUBS  pc, r14, #4          4
                                   13 cycles

FIQ_handler_opt (FIQ)              cycles
        ADD   r8, r8, #1           1
        STR   r8, TIME             4
        SUBS  pc, r14, #4          4
                                   9 cycles
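
The totals can be checked by summing the per-instruction costs annotated on the slide. This is a small Python sketch; the dictionary and function names are illustrative, and the costs assume a single register in each STM/LDM.

```python
# Per-instruction cycle costs as annotated on the slide (one register per STM/LDM).
CYCLES = {"STMED": 4, "LDMED": 4, "LDR": 4, "STR": 4, "ADD": 1, "SUBS": 4}

def handler_cycles(mnemonics):
    """Total cycles for a handler given as a list of instruction mnemonics."""
    return sum(CYCLES[m] for m in mnemonics)

exception_handler = ["STMED", "LDR", "ADD", "STR", "LDMED", "SUBS"]
fiq_handler = ["LDR", "ADD", "STR", "SUBS"]
fiq_handler_opt = ["ADD", "STR", "SUBS"]  # TIME is kept permanently in shadow r8

for name, code in [("Exception_handler", exception_handler),
                   ("FIQ_handler", fiq_handler),
                   ("FIQ_handler_opt", fiq_handler_opt)]:
    print(name, handler_cycles(code), "cycles")
```

The saving comes entirely from the FIQ shadow registers: no stack save and restore is needed, and a persistent r8 removes the load as well.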
How are exceptions generated?
Classify exceptions according to cause:
1. As a direct result of executing an instruction, such as
     Software Interrupt Instruction (SWI)
     Undefined or illegal instruction
     Memory error during fetching an instruction
2. As a side-effect of an instruction, such as
     Memory fault during data read/write from memory
     Arithmetic error (e.g. divide by zero, if CPU has divide instruction)
3. As a result of external hardware signals, such as
     Reset (e.g. from reset switch)
     Fast Interrupt (FIQ)
     Normal Interrupt (IRQ)
IRQ & FIQ are typically connected to hardware devices requiring interrupt service
What does an exception do?
User code is executing and an exception happens
The handler runs like a subroutine and at the end returns to the next user instruction (just like a subroutine) but also restores all processor registers (including CPSR). This makes the handler completely transparent to the User-mode code
The job of the handler is to deal with the condition that caused the exception:
  Recover from memory fault
  Perform SWI operation
  Service the hardware device that caused the interrupt
What happens when an exception occurs?
ARM completes current instruction as best it can.
It departs from current instruction sequence to handle the exception by performing the following steps:
1. It changes the operating mode (see slide 3.11, ARM Operating Modes) corresponding to the particular exception
2. It saves the current PC in the R14 corresponding to the new mode. For example, if FIQ occurs, the PC value is stored in R14(FIQ). This will be the return address (but see slide 3.25, Exception Return)
3. It saves the old value of CPSR in a special register of the new mode
4. It disables exceptions of lower priority
5. It forces the PC to a new value corresponding to the exception. This is effectively a forced jump to the Exception Handler or Interrupt Service Routine
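
The five entry steps can be sketched as a small Python model. This is a simplified illustration, not the hardware's actual implementation: the `Cpu` class and its field names are hypothetical, and the vector addresses and mode names are the standard ARM values rather than values taken from the slide.

```python
from dataclasses import dataclass, field

# Standard ARM vector addresses and target modes (assumed, not from the slide).
VECTORS = {"SWI": (0x08, "SVC"), "IRQ": (0x18, "IRQ"), "FIQ": (0x1C, "FIQ")}

@dataclass
class Cpu:
    pc: int = 0
    cpsr_mode: str = "User"
    cpsr_flags: int = 0
    irq_disabled: bool = False
    fiq_disabled: bool = False
    r14: dict = field(default_factory=dict)   # banked link registers, one per mode
    spsr: dict = field(default_factory=dict)  # saved status registers, one per mode

def enter_exception(cpu, kind):
    vector, new_mode = VECTORS[kind]
    old_status = (cpu.cpsr_mode, cpu.cpsr_flags)
    cpu.cpsr_mode = new_mode          # step 1: change operating mode
    cpu.r14[new_mode] = cpu.pc        # step 2: save PC in R14 of the new mode
    cpu.spsr[new_mode] = old_status   # step 3: save old CPSR for the new mode
    cpu.irq_disabled = True           # step 4: mask lower-priority interrupts
    if kind == "FIQ":
        cpu.fiq_disabled = True
    cpu.pc = vector                   # step 5: force PC to the exception vector

cpu = Cpu(pc=0x8004)
enter_exception(cpu, "FIQ")
print(hex(cpu.pc), cpu.cpsr_mode, hex(cpu.r14["FIQ"]))
```

Because R14 and the saved status are held per mode, the user-mode values are never overwritten on entry.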
CPSR during interrupts
The CPSR must be preserved by interrupts
ARM architecture contains a mechanism to do this, but it is not a shadow register
On interrupt or exception the old CPSR (current PSR) contents are saved in an SPSR (saved PSR) register specific to the mode of the interrupt or exception; however the same CPSR continues to be used: its mode bits are changed to the exception mode
On return the saved value is written back to CPSR by hardware
This is subtly different from the shadowing mechanism, since the same register is used for CPSR in all modes
Where is the exception handler routine?
Exceptions can be viewed as forced subroutine calls
When and if an exception occurs is not predictable (unless it is an SWI exception)
The address to which the processor is forced to branch is called the exception (or interrupt) vector
It is fixed by hardware
Exception vector addresses
Each vector (except FIQ) is 4 bytes long (i.e. one instruction) at a fixed position in low memory
You put a branch instruction at this address, e.g.
        B svc_exception_handler
(see the vector table below; 0x introduces a hex constant)
FIQ is special in two ways:
1. You can put the actual FIQ handler (also called Fast Interrupt Service Routine) at 0x0000001C onwards, because the FIQ vector occupies the highest address
2. FIQ has many more shadow registers. So you don't have to save as many registers on the stack as other exceptions - faster

RST_VEC    B     RESET_SUB
UND_VEC    B     UND_SUB
SWI_VEC    B     SWI_SUB
PRE_VEC    B     PREFETCH_SUB
DATA_VEC   B     DATA_SUB
           NOP                   ; reserved vector slot at 0x14, so IRQ_VEC sits at 0x18
IRQ_VEC    B     IRQ_SUB         ; branch adds 4 cycles to handler time
FIQ_VEC    ADD   R8, R8, #1      ; don't need a B here
           STR   R8, TIME
           SUBS  pc, r14, #4     ; return from interrupt
TIME       DCD   0               ; memory location for TIME variable
NB - why B not BL?
Multiple Stacks
Exception modes replace user R13 by a shadow register which stores the SP of the exception stack
Separate shadow register for each mode => different stack for each mode
User registers can then be saved/restored safely on the exception stack
[Figure: User stack, IRQ stack, FIQ stack]
Exception Return
Once the exception has been handled (by the exception handler), the user task is resumed.
The handler program (or Interrupt Service Routine) must restore the user state exactly as it was before the exception occurred:
1. Any user registers saved on the exception handler's stack must be restored from it
2. The CPSR must be restored from the appropriate SPSR (done by processor automatically).
3. PC must be changed back to the instruction address in the user instruction stream
Steps 1 and 3 are done explicitly by exception handler code
Restoring registers from the stack (and saving them initially) would be the same as in the case of subroutines
Restoring PC value is more complicated. The exact way to do it depends on which exception you are returning from.
Exception Return (con't)
Remember that the return address was saved in r14 before entering the exception handler
To return from a SWI or undefined instruction trap use:
        MOVS pc, r14
To return from an IRQ, FIQ or prefetch abort use:
        SUBS pc, r14, #4
To return from a data abort, to retry the data access instruction that failed, use:
        SUBS pc, r14, #8
If the destination register is the PC, the 'S' modifier does NOT mean "set the flags" but "restore the CPSR from the SPSR"
The differences between these three methods of return are due to the pipeline architecture of the ARM processor
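
The three return forms differ only in the offset subtracted from r14. This is a minimal Python sketch of that mapping; the table and function names are illustrative.

```python
# Offset subtracted from r14 by the return instruction listed for each exception.
RETURN_OFFSET = {
    "SWI": 0, "undefined": 0,                   # MOVS pc, r14
    "IRQ": 4, "FIQ": 4, "prefetch abort": 4,    # SUBS pc, r14, #4
    "data abort": 8,                            # SUBS pc, r14, #8
}

def return_address(kind, r14):
    """Address loaded into the PC when returning from the given exception."""
    return r14 - RETURN_OFFSET[kind]

print(hex(return_address("IRQ", 0x8010)))
```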
Direct Memory Access vs Interrupt & Polling
DMA controller shares memory bus cycles with the processor - a technique known as cycle stealing. The CPU notices only a slightly slower memory system.
DMA controller manages transfers
[Figure: CPU, Memory, I/O devices, Direct Memory Access (DMA) Controller, Interrupt controller and other devices connected by the Address bus and Data bus, with IRQ and FIQ interrupt lines]
Direct Memory Access (con't)
DMA is initiated by the processor. The following information must be sent by the processor to the DMA module (AKA DMA controller):
  Whether an I/O device Read or Write is required
  Memory address of the I/O device involved in the transfer
  The starting location in memory to read from or write to
  The number of words to read or write
Once DMA is initiated, the processor can continue with other work.
Processor can work concurrently with transfer between I/O device and memory.
Data transfer without interrupt overhead
Techniques for Inputting a Block of Data: DMA
[Figure: Polling, Interrupt and DMA compared; labels "1 per block" and "1 per byte"]
Lecture 13 - ARM Processor Organization
First ARM processor developed on 3 micron technology in '83-'85
This course is mainly based on the ARM6/7 architecture developed between '90-'95 and now the market leader for low/medium performance embedded applications.
ARM has continued development. Recent faster compute engines:
  ARM9: 220 MIPS at 200 MHz clock, 0.2 mW/MHz
  ARM10: 400+ MIPS
  ARM11MP (4 processors, 2600 MIPS)!
Intel has developed an ARM-based XScale architecture for PDAs etc.
Later designs use 0.13 micron technology: 25X smaller than the first ARM!
[Figure: Recent ARM11-MP multiprocessing core]
"It's a poor bureaucrat who can't stall a good idea until even its sponsor is relieved to see it dead and officially buried" - Robert Townsend
What is a computer?
Microcontroller
  ROM, RAM, CPU, peripherals all on one chip
  Address, data, control bus inside chip
PC
  CPU on one chip (+ cache, see later)
  ROM (BIOS chip)
  RAM: DDR2 DRAM
  Address, data, control bus on motherboard
  Peripherals, on motherboard or in PCI slots
[Figure: CPU, RAM, I/O and ROM blocks]
Internal Organization of ARM CPU
Two main blocks: datapath and decoder
Register bank (r0 to r15)
  Two read ports to A-bus/B-bus
  One write port from ALU-bus
  Additional read/write ports for program counter r15
Barrel shifter: shift/rotate 2nd operand by any number of bits
ALU performs arithmetic/logic functions
Dedicated PC incrementer
Address register: either from PC or from ALU
Internal Organization of ARM (con't)
Data register holds read/write data from/to memory
Instruction decoder decodes machine code instructions to produce control signals to datapath
Data processing instructions take a single cycle: data values are read on the A-bus & B-bus, and the result from the ALU is written back into the register bank
Can you see why register valued shifts:
  r0 := r1 + (r2 lsl r3)
take an extra cycle?
ADD R0, R1, R2, lsl #4
[Figure: datapath for this instruction. Registers drive the A (R) and B (R) read buses; B passes through the Shift unit into the ALU; the ALU (W) bus writes back to Registers. The shift amount 4 comes from the instruction]
A = R1
B = R2
ALU = R0
Register bank can read A, B registers and write ALU register simultaneously
Register nos. A, B, ALU come from instruction
Whole execute path computed in one cycle
Speeding up CPU execution
All computers typically perform a sequence of operations in each instruction:
  Fetch instruction word: FETCH
  Decode it (work out what to do): DECODE
  Fetch register values, perform ALU operation, write back result to registers: EXECUTE (all one cycle on ARM)
Each of these operations is done by different hardware, and doing them in strict sequence means that each hardware block would spend most of its time waiting for others
This motivates the concept of pipelining used by all modern CPUs
Pipelining
The maximum processing rate is determined by the propagation delay of the computational logic. In abstract:

Without pipelining: In -> function 1 (time T1) -> function 2 (time T2) -> Out
We can process one input every T1+T2

Timestep (T1+T2)   In   Out
1                  A    f2(f1(A))
2                  B    f2(f1(B))
3                  C    f2(f1(C))

With pipelining: In -> function 1 (time T1) -> Reg1 -> function 2 (time T2) -> Reg2, with both registers driven by the same clock
We can process a new input every max(T1,T2), but each input takes 2*max(T1,T2) >= T1+T2 to be completely processed (equal only when T1 = T2)

Timestep (max(T1,T2))   In   Reg1    Reg2
1                       A
2                       B    f1(A)
3                       C    f1(B)   f2(f1(A))
4                            f1(C)   f2(f1(B))
5                                    f2(f1(C))
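
The trade-off between latency and throughput can be checked numerically. This is a minimal Python sketch for the two-stage case described here; the function names and the sample delays are illustrative.

```python
def unpipelined(t1, t2):
    """(latency, time between successive results) with both functions in one step."""
    return t1 + t2, t1 + t2

def pipelined(t1, t2):
    """Same pair with a register between the stages, clocked at the slower stage."""
    step = max(t1, t2)
    return 2 * step, step

t1, t2 = 3, 5  # example propagation delays in ns
print("unpipelined (latency, interval):", unpipelined(t1, t2))
print("pipelined   (latency, interval):", pipelined(t1, t2))
```

With these delays the pipelined version delivers a result every 5 ns instead of every 8 ns, at the cost of a latency of 10 ns rather than 8 ns.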
ARM Pipelining
ARM core uses a 3-stage instruction pipeline
  Fetch: fetch instruction code from memory into the instruction pipeline
  Decode: instruction decoded to obtain control signals for the datapath ready for the next stage
  Execute: instruction "owns" the datapath - register read; shifting; ALU results generated and write-back, all in one cycle.
Results for each stage stored in registers
The consequence is that the clock period is much shorter than without pipelining
[Figure: Inst. Fetch Logic, Decoder Logic and Datapath Execution in sequence, each followed by a REGISTER driven by a common clock; the instr. code passes from stage to stage]

[Figure: ARM CPU when ADD instruction is executing; PC = 1008]
Memory:
1000    ADD R0, PC, #4    ; in EXECUTE
1004    SUB R3, R4, R0    ; in DECODE
1008    BIC R3, R3, #1    ; in FETCH
ARM Pipelining (con't)
At any time, 3 different instructions may occupy the 3 stages of the pipeline
It may take three cycles to complete a single-cycle instruction. This is said to have a three cycle latency
Once the pipeline fills, the processor completes a single-cycle instruction every clock cycle. Therefore the throughput (T) is one instruction per cycle.
        ADD R0, R1, R2
        SUB R3, R4, R0
        BIC R3, R3, #1
Branches & pipeline stalls
Pipelining is a key design technique.
  For example the Intel Pentium 4 pipeline was 20 stages long!
  This is excessive, and this design was abandoned in the later Pentium Mobile core
  AMD's new 64 bit Hammer Athlon architecture uses 10/12 stages
Long pipelines mean that many future instructions must start processing before the current instruction has finished
Problems occur when branches happen, changing the future course of instruction execution
  Unconditional branch may be OK: extra fetch hardware can check for and follow branches - though ARM does not have this
  Conditional branches are a real problem: cannot tell which way to go!
Cost of wrong branch prediction is a pipeline stall, waiting for the correct instructions to fill up the pipeline
Pipeline stalls due to branches on ARM
ARM pipeline will stall whenever a branch is taken
Stall is detected at end of EXECUTE cycle of Branch instruction
Stall costs 3 cycles in addition to normal one cycle execution
  4. One cycle (memory wait) to load non-contiguous FETCH address
     This is extra cost due to random access to memory being one cycle slower than typical sequential access
  5. One cycle to fill FETCH pipeline stage
  6. One cycle to refill DECODE pipeline stage
  7. After which the EXECUTE cycle of the correct instruction can be performed

cycle   FETCH          DECODE    EXECUTE
1       B L1           CMP       ADD
2       XXX            B L1      CMP
3       YYY            XXX       B L1
4       memory wait    stalled   stalled
5       MOV (at L1)    stalled   stalled
6       SUB            MOV       stalled
7       ORR            SUB       MOV

        ADD
        CMP
        B    L1
        XXX
        YYY
L1      MOV
        SUB
        ORR
ARM Terminology
Instruction is fetched if it enters FETCH stage of pipeline (whether or not it ever reaches EXECUTE stage)
Instruction is executed only if it reaches EXECUTE stage of pipeline
  When instruction is executed but execution condition is false - so nothing happens - we say it is condition false executed
  When instruction is executed and execution condition is true we say it is condition true executed.
    B XXX is always condition true executed if it is executed
    BEQ XXX is condition true executed if Z=1
    ADDNE R0,R1,R2 is condition true executed if Z=0
Note that a condition false executed instruction is inspected and condition checked in EXECUTE stage - unlike an instruction which never reaches EXECUTE stage of pipeline, but may be FETCHED or DECODED
Example execution
; throughout Z=1, EQ true, NE false
        ADD   R0,R1,R2   ;Condition true executed
        ORRNE R5,R5,R5   ;Condition false executed
        RSBEQ R3,R3,R3   ;Condition true executed
        EOREQ R5,R5,R6   ;Condition true executed
        BNE   A          ;Condition false executed
        BEQ   A          ;Condition true executed
        SUB   R2,R3,R4   ;Fetched, decoded, not executed
        ADC   R1,R0,R0   ;Fetched, not decoded or executed
A       AND   R2,R3,R4   ;Condition true executed
Speed loss through ARM stalls
On ARM7 processor 3 cycles are lost whenever a branch is taken (condition true executed).
Assume one instruction = one cycle except for pipeline stalls
  Nearly correct
Suppose 20% of instructions are condition true executed branches (i.e. they actually happen):
  B
  BEQ (if Z=1)
  BL
  MOV pc,r14
Calculate loss due to stall:
  Every 5 instructions = 5 cycles without stall there will be one stall = 3 extra cycles lost.
  Throughput = 5/8 = 0.625 instructions/cycle.
  If clock = 60MHz, throughput = 37.5 MIPS (million instructions per second)
Stalls & Branch Prediction
Stalls happen when the wrong instruction is prefetched
This can be avoided by branch prediction: work out whether branch will happen and, if it will, fetch the branch target address
Stall only if branch instruction and branch is not correctly predicted
Three types of branches
ARM predicts correctly only condition false executed branches!

Type of branch         ARM examples   ARM prediction
Unconditional Branch   B XXX          100% Incorrect
                       BL XXX
Conditional Branch     BEQ XXX        Correct if branch not taken
                       BLEQ XXX       Incorrect if branch taken
Computed Branch        MOV pc, R14    100% Incorrect
Throughput Calculation
Assume
  1 instruction/cycle when pipeline full
  Branches happen with probability P_B per instruction
  Branches lead to stall only if not correctly predicted
    e.g. in ARM, branch condition false executed has no stall because this is predicted by ARM hardware, but branch condition true executed leads to stall
  Incorrect branch prediction has probability P_I
  Pipeline penalty from incorrect branch prediction is a pipeline stall of S cycles
    Typically assume S = length of pipeline (not exact but good guess)
Average number of cycles added per instruction is P_B * P_I * S
Throughput (per cycle) decreases to
  T = 1 / (1 + P_B * P_I * S) instructions/cycle
Multiply T by clock frequency for throughput in instructions/second
MIPS = millions of instructions/second
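
The throughput formula can be checked against the earlier worked example (20% taken branches, 3-cycle stall, 60 MHz clock). This is a minimal Python sketch; the parameter names are illustrative stand-ins for the slide's symbols, which were lost in extraction: `p_branch` is the probability that an instruction is a branch, `p_incorrect` the probability that a branch is mispredicted, and `stall_cycles` the stall length.

```python
def throughput(p_branch, p_incorrect, stall_cycles):
    """Instructions per cycle when each mispredicted branch adds stall_cycles."""
    return 1.0 / (1.0 + p_branch * p_incorrect * stall_cycles)

def mips(instructions_per_cycle, clock_hz):
    """Millions of instructions per second at the given clock frequency."""
    return instructions_per_cycle * clock_hz / 1e6

# 20% of instructions are taken branches, and ARM mispredicts every taken branch.
t = throughput(p_branch=0.2, p_incorrect=1.0, stall_cycles=3)
print(f"T = {t:.3f} instructions/cycle, {mips(t, 60e6):.1f} MIPS at 60 MHz")
```

This agrees with the 5/8 = 0.625 instructions/cycle and 37.5 MIPS obtained by counting cycles directly.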

Summary
ARM is pipelined
  throughput of up to 1 instruction per cycle
  latency of 3 cycles per instruction
Most instructions take one cycle
Exceptions relate to hardware

Instruction               Cycles   Notes
ADD R0, R1, R2, lsl R3    2        1 cycle added to read R3 since only three register file ports
B, BL, MOV pc, R14        4        Pipeline stall & memory wait adds 3 cycles
LDR/STR                   4        Memory read/write (& two memory waits) add 3 cycles
LDM/STM                   3+n      n memory read/write (& two memory waits) adds 2+n cycles
Lecture 14 - Cache Memory
"I suffer from short term memory loss. It runs in my family... At least I think it does... Where are they?" - Finding Memo
SRAM:
  value is stored on a pair of inverting gates
  very fast but takes up more space (4 to 6 transistors per bit) than DRAM
DRAM:
  value is stored as charge on a capacitor (must be refreshed)
  very dense (high capacity) but slower than SRAM (factor of 3 to 10)
Users want large and fast memories!
  SRAM access times are 2 - 20 ns at cost of $30 to $100 per Mbyte
  DRAM access times are 10 - 30 ns at cost of $0.03 to $0.2 per Mbyte
  Disk access times are 3 - 10 million ns at cost of $0.001 to $0.003 per Mbyte
Exploiting Memory Hierarchy
Question:
  how to organize memory to improve performance without the cost?
Answer:
  build a memory hierarchy
  Store copies of frequently used memory locations in smaller & faster memories close to the CPU
  Memory reference instructions use the closest memory containing the required item
  Can speed up average memory access manyfold.
Exploiting Memory Hierarchy
memory type            size                   access speed
Processor Registers    64 to 256 bytes        0.3 - 5 nsec
On-chip cache          8 - 1024 Kbytes        3 - 5 nsec
Second-level cache     256 - 16384 Kbytes     5 - 10 nsec
Main memory (DRAM)     16M - 16G bytes        ~ 20 nsec
Disk or other store    10's - 1000's Gbytes   10's msec
[Figure annotations: levels are ordered nearest to CPU first; transfers between levels are marked, top to bottom, CPU controlled transfer, Automatic (hardware-controlled) transfer, CPU controlled transfer]
Automatic memory transfer between levels of hierarchy
This diagram illustrates data stored in a cache memory and a larger memory further from the CPU.
Location A has value x stored in Cache etc.
The cache stores some, but not all, of the items in the larger memory.
Here addresses A & D, if requested by the CPU, will be HITS in the cache, and addresses B & C MISSes.
In a memory hierarchy every level looks like this: requests from the CPU move outwards till they HIT.
[Figure: CPU, "Cache" memory (A: x, D: w), Memory (A: x, B: y, C: z, D: w)]
Cache Terminology
Memory hierarchy can be multiple levels
Data is copied between two adjacent levels at a time
We will focus on two levels:
  Upper level (closer to the processor, smaller but faster)
  Lower level (further from the processor, larger but slower)
Some terms used in describing memory hierarchy:
  block: minimum unit of data to transfer between levels - also called a cache line
  hit: data requested is in the upper level
  miss: data requested is not in the upper level
The Basics of Caches
Two issues:
  How do we know if a data item is in the cache? (Is it a HIT or a MISS?)
  If it is there (a HIT), how do we find it?
Our first example:
  block size is one byte of data
  we will consider the "direct mapped" approach
For each item of data at the lower level, there is exactly one location in the cache memory where it might be.
Lots of items at the lower level share a given location in the upper level, which can contain at most one at any given time.
Direct Mapped Cache
Upper memory level: cache
Lower memory level: memory
Mapping: Bottom (LS) bits of the address determine position in cache
Items differing only by top (MS) bits map to same cache location
An item is replaced in the cache when a different item with the same bottom bits is requested
Assume data is written both to the cache and the lower level, preserving coherence
  Write-through cache
Cache Operation
Memory address:  | tag (bits 15:3) | index (bits 2:0) |
Cache address = index
Each cache line (block) consists of Data, Tag, Valid fields
Index part of memory address determines which cache line is used
Tag part of memory address is matched with stored tag in cache

Index   V   Tag   Data
000     0
001     0
010     0
011     0
100     0
101     0
110     0
111     0
Cache data line
Data field of cache contains cache line, can be more than one byte of memory
Select bits from memory address are used to select byte within the cache line
Memory address:  | tag (bits 23:15) | index (bits 14:4) | select (bits 3:0) |
2^4 = 16 bytes per line
2^11 = 2048 lines
2^15 = 32768 bytes cache memory
Organization of Direct-mapped cache
Memory address:  | Tag | Index (N bits) | Select (M bits) |
N bit Index selects 1 out of 2^N cache lines
M bit Select selects 1 out of 2^M data bytes within a line
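
Splitting an address into these three fields is a matter of shifting and masking. This is a minimal Python sketch using the 11-bit index and 4-bit select of the 32768-byte cache discussed here; the function name and the sample address are illustrative.

```python
def split_address(addr, index_bits, select_bits):
    """Return (tag, index, select) for a direct-mapped cache."""
    select = addr & ((1 << select_bits) - 1)
    index = (addr >> select_bits) & ((1 << index_bits) - 1)
    tag = addr >> (select_bits + index_bits)
    return tag, index, select

INDEX_BITS, SELECT_BITS = 11, 4
tag, index, select = split_address(0x123456, INDEX_BITS, SELECT_BITS)
cache_bytes = 1 << (INDEX_BITS + SELECT_BITS)
print(hex(tag), hex(index), hex(select), cache_bytes)
```

The tag is whatever remains above the index and select fields, so its width depends only on the address width.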
Cache Contents - A walk-through
Assume cache is 8 blocks (lines) each of 1 byte
Addresses are 5 bits: a 2-bit tag followed by a 3-bit index
1. Initial state on power-ON: all valid bits are 0
2. After handling read of address 00000: miss, mem read
     line 000: V=1, Tag=00, Data=MEM[00000]
3. After handling read of address 00001: miss, mem read
     line 001: V=1, Tag=00, Data=MEM[00001]
4. After handling write to address 01010: miss, mem write
     line 010: V=1, Tag=01, Data=Written (MEM[01010])
     NB - would write to MEM even if it was a hit
5. After handling read of address 01000: miss, mem read
     line 000: V=1, Tag=01, Data=MEM[01000] (replaces MEM[00000])
6. After handling read of address 01010: hit
     no change

Final cache contents:
Index   Valid bit (V)   Tag   Data
000     1               01    MEM[01000]
001     1               00    MEM[00001]
010     1               01    MEM[01010]
011     0
100     0
101     0
110     0
111     0
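
The walk-through can be replayed with a few lines of Python. This is a minimal sketch of a direct-mapped cache with 1-byte lines that tracks only valid bits and tags, not the data; the function and variable names are illustrative.

```python
def simulate(accesses, index_bits=3):
    """Replay (operation, address) pairs and report hit or miss for each."""
    lines = [None] * (1 << index_bits)  # None = valid bit clear, else the stored tag
    results = []
    for _operation, addr in accesses:
        index = addr & ((1 << index_bits) - 1)
        tag = addr >> index_bits
        results.append("hit" if lines[index] == tag else "miss")
        lines[index] = tag  # a miss fills the line; a hit leaves the tag unchanged
    return results, lines

trace = [("read", 0b00000), ("read", 0b00001), ("write", 0b01010),
         ("read", 0b01000), ("read", 0b01010)]
results, lines = simulate(trace)
print(results)
```

Reads and writes are treated alike here because, with 1-byte lines and write-through, both simply install the addressed byte in its line.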
A Further Example
We have considered a cache consisting of 8 lines each of 1 byte (Total 8 bytes)
An alternative arrangement for 8 bytes cache size is 4 lines each of 2 bytes
On each cache miss we must then fill the entire line (on both writes and reads)
The sel (select) field selects which byte in a line is addressed
Memory address:  | tag (bits 15:3) | index (bits 2:1) | sel (bit 0) |
Read hits
  This is what we want: cache passes data to processor
Read misses
  CPU stalls, fetch block from memory, deliver to cache, restart
Write hits
  Can replace addressed data in cache and memory (write-through)
Write misses
  Read the entire block into the cache line, then write the addressed data in cache and memory (write-through)
  Why does line size 1 not require the read?
Cont'd.
Each memory block is 2 bytes
Addresses are 5 bits: 2-bit tag, 2-bit index, 1-bit sel (the index bits were shown in bold on the slide)

After handling read of address 00000 (read block 0000): MISS
Index   Valid bit (V)   Tag   Data(0)      Data(1)
00      1               00    MEM[0]       MEM[1]
01      0
10      0
11      0

After handling read of address 00001: HIT
No change from above

After handling write to address 01010 (read block 0101, write block 0101): MISS
Index   Valid bit (V)   Tag   Data(0)      Data(1)
00      1               00    MEM[0]       MEM[1]
01      1               01    Write data   MEM[0x0b]
10      0
11      0

After handling read of address 01011: HIT
No change from above

After handling read of address 11010 (read block 1101): MISS
Index   Valid bit (V)   Tag   Data(0)      Data(1)
00      1               00    MEM[0]       MEM[1]
01      1               11    MEM[0x1a]    MEM[0x1b]
10      0
11      0
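
The same sequence can be replayed for the 4-line, 2-byte-per-line arrangement. This is a minimal Python sketch that tracks only tags per line; the function name is illustrative, and the field widths follow the 5-bit addresses used in this example.

```python
def simulate_lines(addresses, index_bits=2, select_bits=1):
    """Report HIT or MISS per access for a direct-mapped cache with multi-byte lines."""
    tags = {}  # index -> tag, present only for valid lines
    outcome = []
    for addr in addresses:
        index = (addr >> select_bits) & ((1 << index_bits) - 1)
        tag = addr >> (select_bits + index_bits)
        outcome.append("HIT" if tags.get(index) == tag else "MISS")
        tags[index] = tag  # on a miss the whole 2-byte block is loaded
    return outcome

outcome = simulate_lines([0b00000, 0b00001, 0b01010, 0b01011, 0b11010])
print(outcome)
```

The second access to each block hits because the neighbouring byte was brought in by the first access.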
Principle of Locality
The principle of locality makes having a memory hierarchy a good idea
If an item is referenced,
  temporal locality: it will tend to be referenced again soon
  spatial locality: nearby items will tend to be referenced soon (e.g. sequential access).
Why does code have locality?
All caches exploit temporal locality.
A cache with a larger line size will achieve a higher hit rate for spatial locality
  E.g. line size N items, sequential access => hit rate at least 1 - 1/N
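
The sequential-access bound can be illustrated by counting misses directly. This is a minimal Python sketch, assuming a cold cache and an access stream that starts on a line boundary; the function name is illustrative.

```python
def sequential_hit_rate(line_size, accesses):
    """Hit rate when reading `accesses` consecutive items from a cold cache."""
    # Only the first item of each line misses; the rest of the line is then present.
    misses = sum(1 for item in range(accesses) if item % line_size == 0)
    return (accesses - misses) / accesses

for n in (1, 4, 16):
    print(n, sequential_hit_rate(n, 160))
```

With a line size of 1 every sequential access misses, which is why larger lines are what turn spatial locality into hits.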

Direct Mapped Cache - Key points
Caches exploit temporal & spatial locality to get high hit rate
A particular memory item is stored in a unique location in the cache
To check if a particular memory item is in cache, the index bits of the memory address are used to address the cache entry
The top memory address bits are then compared with the stored tag: if they are equal, and the V bit is 1, we have got a hit
The bottom memory address bits (select) determine byte within cache line
Tag is all address bits except index and select
Size of cache with N bits index and M bits select is 2^(N+M) bytes
When a miss occurs, data cannot be read from the cache. A slower read from the next level of memory must take place, incurring a miss penalty
Note: V bit is only necessary to mark initial state where cache contains no valid data (hence must always be a miss)
