tjwc - 12-Nov-11 - ISE1 / EE2 Introduction to Computer Architecture. Contents: Lectures 11 and 12 - Input and Output, Exceptions and Interrupts: interrupt-driven vs DMA vs polled I/O; SWIs, exceptions and interrupts; ARM modes, shadow registers.
Copyright: Attribution Non-Commercial (BY-NC)
Part 3: Hardware

[Figure: part of Charles Babbage's never-completed "difference engine". It would have been the world's first computer: mechanical, driven by steam.]

3.2 Contents
Lectures 11 & 12 - Input & Output, Exceptions & Interrupts
- Interrupt-driven vs DMA vs polled I/O
- SWIs, exceptions & interrupts
- ARM modes, shadow registers
Lecture 13 - ARM Hardware
- Pipelining instruction fetch and execution
- Throughput vs latency
- ARM processor organisation
Lecture 14 - Cache Hierarchy
- Adding a cache to the memory hierarchy
- Cache purpose, hit & miss calculations
- Direct mapped caches: basics
- Cache lines
- Tag, index, select fields
- Startup issues
- Valid bit

3.3 Lectures 11 & 12: I/O, Exceptions and Interrupts
"There is no useful rule without an exception" - Thomas Fuller
- Handling I/O efficiently is a key requirement.
- Exceptions (and interrupts) allow processors to handle events that are not directly related to the user program.
- Exceptions: typically caused by unexpected run-time errors in user code (e.g. division by 0).
- Interrupts: a special case of exception, caused by hardware conditions that require processor action (e.g. I/O service), completely independent of user code.

Hardware for I/O - Overview
[Figure: CPU, memory, I/O devices, Direct Memory Access (DMA) controller and interrupt controller connected by the address bus and data bus; the interrupt controller drives the CPU's IRQ and FIQ inputs; other devices attach to the same buses.]
- I/O devices communicate via memory-mapped registers.
- DMA & interrupts are optional.

3.5 Servicing an I/O device
Typical input:
- A PC serial port receives data at a fixed rate (say 9600 baud), independent of CPU instructions.
- A new data byte is available 9600/10 = 960 times per second (each byte occupies 10 bit times on the line).
- The CPU must read the data register to input each new value, repeatedly and at the correct time.
- This operation is known as servicing the I/O device.
The problem:
- How can the CPU know when to service the I/O device?
3.6 Polling and Interrupt
Both are methods to notify the processor that an I/O device needs attention.
Polling: simple, but slow
- The processor checks the status of the I/O device regularly to see if it needs attention.
- Similar to checking a telephone without bells!
- It is the responsibility of user code to check the I/O device status soon enough to avoid buffer overrun.
Interrupt: fast, but more complicated
- The processor is notified by the I/O device (interrupted) when the device needs attention.
- Similar to a telephone with bells.

Method 1: Polling
- The CPU repeatedly reads a status bit from the serial port (bit 31 of PORT_STATUS here) to determine if new data is available.
- When needed it reads data from PORT_DATA(7:0).
- Polling is simple and requires no additional hardware.
- It requires 100% CPU use: inefficient and inflexible.

    IO_SERVICE_CODE
            ADRL  R10, SER_PORT            ; R10 = SERIAL_PORT
            ADR   R11, BUFFER              ; R11 = array of bytes read
    POLL_LOOP                              ; example of polling I/O
            LDR   R0, [R10, #PORT_STATUS]  ; input status
            CMP   R0, #0
            BPL   POLL_LOOP                ; wait till port is ready: b31=1
            LDRB  R0, [R10, #PORT_DATA]    ; input data
            STRB  R0, [R11], #1            ; store data in memory
            B     POLL_LOOP                ; repeat

3.8 Method 2: Interrupt
- The CPU executes user code independently of I/O.
- The I/O device raises an interrupt line when it requires service.
- The CPU receives the interrupt, suspends current execution, handles the device in an interrupt service routine, then returns to the previous execution as though nothing had happened.
- The interrupt is invisible to the executing code except for a small loss of speed.

Interrupt Hardware for ARM
[Figure: CPU, memory, I/O devices and interrupt controller connected by the address bus and data bus; the interrupt controller drives the CPU's IRQ and FIQ inputs.]
- I/O devices communicate via memory-mapped registers.
- DMA & interrupts are optional.
- ARM has two interrupt inputs: IRQ & FIQ.
- An optional interrupt controller allows more I/O devices to use the 2 interrupt lines.

3.10 Anatomy of an interrupt exception
- The interrupt causes user code to suspend
  and a branch to a fixed, hardware-determined memory location, the so-called interrupt "vector" (labelled IRQ_VEC here).
- The CPU enters IRQ mode.
- The vector contains a single branch to a software-determined interrupt service routine, IRQ_SUB.
- When the interrupt has been dealt with, the ISR makes an interrupt return.
- User code continues execution as though the interrupt had never happened: no register is disturbed.

    User mode:                IRQ mode (exception):
        MOV ...               IRQ_VEC  B IRQ_SUB
        ADD ...
        MOV ...               IRQ_SUB  ; handler code (interrupt service routine)
        BEQ ...                        ; ...
        LDR ...                        SUBS pc, r14, #4   ; return
        ADD ...
        STR ...

3.11 ARM Operating Modes
- The ARM processor can work in one of many operating modes. So far we have considered user mode, which is the "normal" mode of operation.
- The processor can also enter "privileged" operating modes, which are used to handle exceptions and SWIs.
- The Current Processor Status Register (CPSR) has 5 bits [bit4:0] to indicate which mode the processor is in.
[Table: mode bits]

3.12 Shadow Registers - Clever Feature of ARM ISA enables Exceptions
- As the processor enters an exception mode, some new registers are automatically switched in to avoid overwriting user registers.
[Figure: register banks; MOV R6, R8 shown in FIQ mode and in user mode.]

3.13 Shadow Registers - Key points
- For example, if an external event (such as movement of the mouse) generates a Fast Interrupt (on the FIQ pin), the processor enters FIQ operating mode. It sees the same r0 - r7 as before, and a new set of shadow r8 - r14.
- By swapping to the new registers, it is easier for the programmer to preserve the state of the processor. For example, during FIQ mode, the FIQ versions of r8 - r14 can be used freely. On returning to user mode, the original values of r8 - r14 will not have changed.
- The shadow r8 - r14 preserve their values across FIQ interrupts and can be used to store persistent FIQ state.

3.14 ARM Exception Handler Routines
- r13, r14 are shadowed automatically. In the handler the alternative shadow registers are used; user r13, r14 are not changed.
- R13 is the stack pointer; R14 is the link register and holds the return address.
- User R1 is saved on the ISR mode stack, indexed by the shadow r13 (SP).
- R1 is used for temporary data in the ISR: it holds the value of TIME.
- The code increments TIME.
- R1 is restored at the end of the ISR.
- The CPSR is saved/restored automatically.

    Exception_handler (not FIQ)
            STMED r13!, {r1}
            LDR   r1, TIME
            ADD   r1, r1, #1
            STR   r1, TIME
            LDMED r13!, {r1}
            SUBS  pc, r14, #4

3.15 ARM Exception Handler Routines

    Exception_handler (not FIQ)        cycles
            STMED r13!, {r1}           4
            LDR   r1, TIME             4
            ADD   r1, r1, #1           1
            STR   r1, TIME             4
            LDMED r13!, {r1}           4
            SUBS  pc, r14, #4          4
                                       21 cycles

    FIQ_handler (FIQ)                  cycles
            LDR   r8, TIME             4
            ADD   r8, r8, #1           1
            STR   r8, TIME             4
            SUBS  pc, r14, #4          4
                                       13 cycles

    FIQ_handler_opt (FIQ)              cycles
            ADD   r8, r8, #1           1
            STR   r8, TIME             4
            SUBS  pc, r14, #4          4
                                       9 cycles

3.16 How are exceptions generated?
Classify exceptions according to cause:
1. As a direct result of executing an instruction, such as:
   - Software Interrupt instruction (SWI)
   - Undefined or illegal instruction
   - Memory error during fetching an instruction
2. As a side-effect of an instruction, such as:
   - Memory fault during data read/write from memory
   - Arithmetic error (e.g. divide by zero, if the CPU has a divide instruction)
3. As a result of external hardware signals, such as:
   - Reset (e.g. from reset switch)
   - Fast Interrupt (FIQ)
   - Normal Interrupt (IRQ)
IRQ & FIQ are typically connected to hardware devices requiring interrupt service.

3.17 What does an exception do?
- User code is executing and an exception happens.
- The handler runs like a subroutine and at the end returns to the next user instruction (just like a subroutine), but it also restores all processor registers (including the CPSR).
- This makes the handler completely transparent to the user-mode code.
- The job of the handler is to deal with the condition that caused the exception:
  - Recover from a memory fault
  - Perform the SWI operation
  - Service the hardware device that caused the interrupt

3.18 What happens when an exception occurs?
- ARM completes the current instruction as best it can.
- It departs from the current instruction sequence to handle the exception by performing the following steps:
  1. It changes the operating mode (see 3.10) corresponding to the particular exception.
  2. It saves the current PC in the R14 corresponding to the new mode. For example, if FIQ occurs, the PC value is stored in R14(FIQ). This will be the return address (but see slide 3.25).
  3. It saves the old value of the CPSR in a special register of the new mode.
  4. It disables exceptions of lower priority.
  5. It forces the PC to a new value corresponding to the exception. This is effectively a forced jump to the Exception Handler or Interrupt Service Routine.

3.19 CPSR during interrupts (see 3.11)
- The CPSR must be preserved by interrupts. The ARM architecture contains a mechanism to do this, but it is not a shadow register.
- On interrupt or exception the old CPSR (current PSR) contents are saved in an SPSR (saved PSR) register specific to the mode of the interrupt or exception.
- However, the same CPSR continues to be used; its mode bits are changed to the exception mode.
- On return the saved value is written back to the CPSR by hardware.
- This is subtly different from the shadowing mechanism, since the same register is used for the CPSR in all modes.

3.20 Where is the exception handler routine?
- Exceptions can be viewed as forced subroutine calls.
- When and if an exception occurs is not predictable (unless it is an SWI exception).
- The address to which the processor is forced to branch is called the exception (or interrupt) vector. It is fixed by hardware.

3.21 Exception vector addresses
- Each vector (except FIQ) is 4 bytes long (i.e. one instruction) at a fixed position in low memory.
- You put a branch instruction at this address, e.g. B svc_exception_handler. See next slide. (0x introduces a hex constant.)
- FIQ is special in two ways:
  1. You can put the actual FIQ handler (also called the Fast Interrupt Service Routine) at 0x0000001C onwards, because the FIQ vector occupies the highest address.
  2. FIQ has many more shadow registers, so you don't have to save as many registers on the stack as for other exceptions: faster.

    RST_VEC   B     RESET_SUB
    UND_VEC   B     UND_SUB
    SWI_VEC   B     SWI_SUB
    PRE_VEC   B     PREFETCH_SUB
    DATA_VEC  B     DATA_SUB
    IRQ_VEC   B     IRQ_SUB        ; branch adds 4 cycles to handler time
    FIQ_VEC   ADD   R8, R8, #1     ; don't need a B here
              STR   R8, TIME
              SUBS  pc, r14, #4    ; return from interrupt
    TIME      DCD   0              ; memory location for TIME variable

NB - why B not BL?

3.23 Multiple Stacks
- Exception modes replace user R13 by a shadow register which stores the SP of the exception stack.
- Separate shadow register for each mode: different stack for each mode.
- User registers can then be saved/restored safely on the exception stack.
[Figure: User stack, IRQ stack, FIQ stack.]

3.24 Exception Return
- Once the exception has been handled (by the exception handler), the user task is resumed.
- The handler program (or Interrupt Service Routine) must restore the user state exactly as it was before the exception occurred:
  1. Any user registers saved on the exception handler's stack must be restored from it.
  2. The CPSR must be restored from the appropriate SPSR (done by the processor automatically).
  3.
     The PC must be changed back to the instruction address in the user instruction stream.
- Steps 1 and 3 are done explicitly by exception handler code.
- Restoring registers from the stack (and saving them initially) is the same as in the case of subroutines.
- Restoring the PC value is more complicated. The exact way to do it depends on which exception you are returning from.

Exception Return (con't)
- Remember that the return address was saved in r14 before entering the exception handler.
- To return from a SWI or undefined instruction trap use:
      MOVS pc, r14
- To return from an IRQ, FIQ or prefetch abort use:
      SUBS pc, r14, #4
- To return from a data abort, to retry the data access instruction that failed, use:
      SUBS pc, r14, #8
- If the destination register is the PC, the 'S' modifier does NOT mean "set the flags" but "restore the CPSR from the SPSR".
- The differences between these three methods of return are due to the pipeline architecture of the ARM processor.

Direct Memory Access vs Interrupt & Polling
- The DMA controller shares memory bus cycles with the processor, a technique known as cycle stealing.
- The CPU notices only a slightly slower memory system.
- The DMA controller manages transfers.
[Figure: CPU, memory, I/O devices, Direct Memory Access (DMA) controller and interrupt controller connected by the address bus and data bus; IRQ and FIQ lines to the CPU; other devices.]

3.27 Direct Memory Access (con't)
- DMA is initiated by a processor. The following information must be sent by the processor to the DMA module (AKA DMA controller):
  - Whether an I/O device read or write is required
  - The memory address of the I/O device involved in the transfer
  - The starting location in memory to read from or write to
  - The number of words to read or write
- Once DMA is initiated, the processor can continue with other work.
- The processor can work concurrently with the transfer between I/O device and memory.
- Data transfer without interrupt overhead.

3.28 Techniques for Inputting a Block of Data: DMA
[Figure: comparison of polling, interrupt and DMA input; labels "1 per block" and "1 per byte".]

3.29 Lecture 13: ARM Processor Organization
"It's a poor bureaucrat who can't stall a good idea until even its sponsor is relieved to see it dead and officially buried" - Robert Townsend
- The first ARM processor was developed on 3 micron technology in '83-'85.
- This course is mainly based on the ARM6/7 architecture, developed between '90-'95 and now the market leader for low/medium performance embedded applications.
- ARM has continued development. Recent faster compute engines:
  - ARM9: 220 MIPS at 200 MHz clock, 0.2 mW/MHz
  - ARM10: 400+ MIPS
  - ARM11MP (4 processors, 2600 MIPS)!
- Intel has developed an ARM-based XScale architecture for PDAs etc.
- Later designs use 0.13 micron technology: 25X smaller than the first ARM!
[Figure: recent ARM11-MP multiprocessing core.]

What is a computer?
Microcontroller
- ROM, RAM, CPU, peripherals all on one chip
- Address, data, control bus inside the chip
PC
- CPU on one chip (+ cache, see later)
- ROM (BIOS chip)
- RAM: DDR2 DRAM
- Address, data, control bus on the motherboard
- Peripherals, on the motherboard or in PC slots
[Figure: CPU, RAM, I/O, ROM.]

3.31 Internal Organization of ARM CPU
Two main blocks: datapath and decoder.
- Register bank (r0 to r15)
  - Two read ports to A-bus/B-bus
  - One write port from ALU-bus
  - Additional read/write ports for the program counter r15
- Barrel shifter: shift/rotate the 2nd operand by any number of bits
- ALU: performs arithmetic/logic functions
- Dedicated PC incrementer
- Address register: loaded either from the PC or from the ALU

3.32 Internal Organization of ARM (con't)
- The data register holds read/write data from/to memory.
- The instruction decoder decodes machine code instructions to produce control signals to the datapath.
- Data processing instructions take a single cycle: data values are read on the A-bus & B-bus, and the result from the ALU is written back into the register bank.
- Can you see why register-valued shifts, e.g. r0 := r1 + (r2 lsl r3), take an extra cycle?
3.33 ADD R0, R1, R2, lsl #4
[Figure: datapath with Registers, Shift and ALU. Read port A (R) = R1, read port B (R) = R2, write port ALU (W) = R0; the shift amount 4 comes from the instruction.]
- The register bank can read the A, B registers & write the ALU register simultaneously.
- Register numbers for A, B, ALU come from the instruction.
- The whole execute path is computed in one cycle.

3.34 Speeding up CPU execution
- All computers typically perform a sequence of operations in each instruction:
  - Fetch instruction word: FETCH
  - Decode it (work out what to do): DECODE
  - Fetch register values, perform ALU operation, write back result to registers: EXECUTE (all one cycle on ARM)
- Each of these operations is done by different hardware, and doing them in strict sequence means that each hardware block would spend most of its time waiting for the others.
- This motivates the concept of pipelining, used by all modern CPUs.

Pipelining
- The maximum processing rate is determined by the propagation delay of the computational logic. In abstract:
- Without pipelining (Function 1 taking time T1 feeds Function 2 taking time T2 directly), we can process one input every T1+T2.

    Timestep T1+T2
    In     Out
    A      f2(f1(A))
    B      f2(f1(B))
    C      f2(f1(C))

- With a register after each function, we can process a new input every max(T1,T2), but each input takes 2*max(T1,T2) >= T1+T2 to be completely processed.

    Timestep max(T1,T2)
    In     Reg1     Reg2
    A
    B      f1(A)
    C      f1(B)    f2(f1(A))
           f1(C)    f2(f1(B))
                    f2(f1(C))

3.36 ARM Pipelining
- The ARM core uses a 3-stage instruction pipeline:
  - Fetch: fetch instruction code from memory into the instruction pipeline
  - Decode: instruction decoded to obtain control signals for the datapath, ready for the next stage
  - Execute: instruction "owns" the datapath: register read, shifting, ALU results generated and write-back, all in one cycle
- Results for each stage are stored in registers.
- The consequence is that the clock period is much shorter than without pipelining.
[Figure: pipeline block diagram, including the Datapath Execution block.]
[Figure labels: Inst. Fetch Logic, Decoder Logic, a register after each block, clock, instr. code.]

[Figure: ARM CPU when the ADD instruction is executing. Memory holds 1000: ADD R0, PC, #4; 1004: SUB R3, R4, R0; 1008: BIC R3, R3, #1. FETCH stage: BIC R3,R3,#1 (address 1008, PC = 1008, incremented by +4); DECODE stage: SUB R3,R4,R0; EXECUTE stage: ADD R0,PC,#4.]

3.38 ARM Pipelining (con't)
- At any time, 3 different instructions may occupy the 3 stages of the pipeline.
- It may take three cycles to complete a single-cycle instruction. This is said to have a three cycle latency.
- Once the pipeline fills, the processor completes a single-cycle instruction every clock cycle. Therefore the throughput (T) is one instruction per cycle.
[Figure: ADD R0, R1, R2; SUB R3, R4, R0; BIC R3, R3, #1 moving through the pipeline in successive cycles.]

3.39 Branches & pipeline stalls
- Pipelining is a key design technique. For example, the Intel Pentium 4 pipeline was 20 stages long!
  - This is excessive, and this design was abandoned in the later Pentium Mobile core.
  - AMD's new 64-bit Hammer (Athlon) architecture uses 10/12 stages.
- Long pipelines mean that many future instructions must start processing before the current instruction has finished.
- Problems occur when branches happen, changing the future course of instruction execution.
  - An unconditional branch may be OK: extra fetch hardware can check for and follow branches, though ARM does not have this.
  - Conditional branches are a real problem: cannot tell which way to go!
  - The cost of a wrong branch prediction is a pipeline stall, waiting for the correct instructions to fill up the pipeline.

Pipeline stalls due to branches on ARM
- The ARM pipeline will stall whenever a branch is taken.
- The stall is detected at the end of the EXECUTE cycle of the branch instruction.
- The stall costs 3 cycles in addition to the normal one-cycle execution:
  - Cycle 4: one cycle (memory wait) to load the non-contiguous FETCH address. This is an extra cost due to random access to memory being one cycle slower than typical sequential access.
  - Cycle 5: one cycle to fill the FETCH pipeline stage.
  - Cycle 6: one cycle to refill the DECODE pipeline stage.
  - Cycle 7: after which the EXECUTE cycle of the correct instruction can be performed.

    Code:
            ADD
            CMP
            B   L1
            XXX
            YYY
        L1  MOV
            SUB
            ORR

    cycle  FETCH          DECODE    EXECUTE
    1      B L1           CMP       ADD
    2      XXX            B L1      CMP
    3      YYY            XXX       B L1
    4      memory wait    stalled   stalled
    5      MOV (at L1)    stalled   stalled
    6      SUB            MOV       stalled
    7      ORR            SUB       MOV

3.41 ARM Terminology
- An instruction is fetched if it enters the FETCH stage of the pipeline (whether or not it ever reaches the EXECUTE stage).
- An instruction is executed only if it reaches the EXECUTE stage of the pipeline.
- When an instruction is executed but its execution condition is false, so nothing happens, we say it is condition false executed.
- When an instruction is executed and its execution condition is true, we say it is condition true executed.
  - B XXX is always condition true executed if it is executed.
  - BEQ XXX is condition true executed if Z=1.
  - ADDNE R0,R1,R2 is condition true executed if Z=0.
- Note that a condition false executed instruction is inspected and its condition checked in the EXECUTE stage, unlike an instruction which never reaches the EXECUTE stage of the pipeline but may be FETCHED or DECODED.

3.42 Example execution

            ; throughout Z=1, EQ true, NE false
            ADD   R0,R1,R2    ;Condition true executed
            ORRNE R5,R5,R5    ;Condition false executed
            RSBEQ R3,R3,R3    ;Condition true executed
            EOREQ R5,R5,R6    ;Condition true executed
            BNE   A           ;Condition false executed
            BEQ   A           ;Condition true executed
            SUB   R2,R3,R4    ;Fetched, decoded, not executed
            ADC   R1,R0,R0    ;Fetched, not decoded or executed
    A       AND   R2,R3,R4    ;Condition true executed

3.43 Speed loss through ARM stalls
- On the ARM7 processor 3 cycles are lost whenever a branch is condition true executed.
- Assume one instruction = one cycle except for pipeline stalls (nearly correct).
- Suppose 20% of instructions are condition true executed branches (i.e. they actually happen):
  BEQ (if Z=1), BL, MOV pc,r14
- Calculate the loss due to stalls:
  - In every 5 instructions (= 5 cycles without stall) there will be one stall = 3 extra cycles lost.
  - Throughput = 5/8 = 0.625 instructions/cycle.
  - If clock = 60 MHz, throughput = 37.5 MIPS (million instructions per second).

Stalls & Branch Prediction
- Stalls happen when the wrong instruction is prefetched.
- This can be avoided by branch prediction: work out whether the branch will happen and, if it will, fetch from the branch target address.
- Stall only if the instruction is a branch and the branch is not correctly predicted.
- Three types of branches. ARM predicts correctly only condition false executed branches!

    Type of branch          ARM examples          ARM prediction
    Unconditional branch    B XXX, BL XXX         100% incorrect
    Conditional branch      BEQ XXX, BLEQ XXX     Correct if branch not taken; incorrect if branch taken
    Computed branch         MOV pc, R14           100% incorrect

3.45 Throughput Calculation
- Assume 1 instruction/cycle when the pipeline is full.
- Branches happen with probability $P_B$ per instruction.
- Branches lead to a stall only if not correctly predicted (e.g. in ARM a branch that is condition false executed has no stall, because this is predicted by the ARM hardware, but a branch that is condition true executed leads to a stall).
- Incorrect branch prediction has probability $P_I$.
- The pipeline penalty from an incorrect branch prediction is a pipeline stall of $S$ cycles. Typically assume $S$ = length of pipeline (not exact, but a good guess).
- The average number of cycles added per instruction is $P_B P_I S$.
- Throughput (per cycle) decreases to $T = \frac{1}{1 + P_B P_I S}$ instructions/cycle.
- Multiply $T$ by the clock frequency for throughput in instructions/second. MIPS = millions of instructions/second.
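The throughput calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function and parameter names (`throughput`, `mips`, `p_branch`, `p_mispredict`, `stall_cycles`) are assumptions introduced here, not names from the slides. It reproduces the worked ARM7 example, in which 20% of instructions are taken branches, every taken branch stalls, and a stall costs 3 cycles.

```python
def throughput(p_branch, p_mispredict, stall_cycles):
    """Instructions per cycle for a pipeline that completes one instruction
    per cycle except when an incorrectly predicted branch stalls it."""
    # Average number of stall cycles added to each instruction.
    extra_cycles_per_instruction = p_branch * p_mispredict * stall_cycles
    return 1.0 / (1.0 + extra_cycles_per_instruction)


def mips(instructions_per_cycle, clock_hz):
    """Convert instructions/cycle into millions of instructions per second."""
    return instructions_per_cycle * clock_hz / 1e6


# Worked example from the slides: 20% of instructions are condition true
# executed branches; on ARM each of these is mispredicted (probability 1.0)
# and costs 3 stall cycles.
t = throughput(p_branch=0.2, p_mispredict=1.0, stall_cycles=3)
print(f"throughput = {t:.3f} instructions/cycle")
print(f"at 60 MHz  = {mips(t, 60e6):.1f} MIPS")
```

Setting `p_branch` to the fraction of all branches and `p_mispredict` to the fraction of those that are taken gives the same result, since only the product enters the formula.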
Summary
- ARM is pipelined: throughput of up to 1 instruction per cycle, latency of 3 cycles per instruction.
- Most instructions take one cycle. The exceptions to this relate to hardware:

    Instruction               Cycles   Reason
    ADD R0, R1, R2, lsl R3    2        1 cycle added to read R3, since there are only three register file ports
    B, BL, MOV pc, R14        4        Pipeline stall & memory wait add 3 cycles
    LDR/STR                   4        Memory read/write (& two memory waits) add 3 cycles
    LDM/STM                   3+n      n memory reads/writes (& two memory waits) add 2+n cycles

3.47 Lecture 14 - Cache Memory
"I suffer from short term memory loss. It runs in my family... At least I think it does... Where are they?" - Finding Nemo
- SRAM: the value is stored on a pair of inverting gates. Very fast, but takes up more space (4 to 6 transistors per bit) than DRAM.
- DRAM: the value is stored as charge on a capacitor (must be refreshed). Very dense (high capacity), but slower than SRAM (factor of 3 to 10).
- Users want large and fast memories!
  - SRAM access times are 2 - 20 ns at a cost of $30 to $100 per Mbyte.
  - DRAM access times are 10 - 30 ns at a cost of $0.03 to $0.2 per Mbyte.
  - Disk access times are 3 - 10 million ns at a cost of $0.001 to $0.003 per Mbyte.

3.48 Exploiting Memory Hierarchy
- Question: how to organize memory to improve performance without the cost?
- Answer: build a memory hierarchy.
  - Store copies of frequently used memory locations in smaller & faster memories close to the CPU.
  - Memory reference instructions use the closest memory containing the required item.
  - This can speed up average memory access manyfold.
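As a rough illustration of why a hierarchy speeds up the average access, here is a minimal sketch of the standard textbook average-access-time model (hit time plus miss rate times miss penalty). The model, the function name `average_access_time` and the example numbers are assumptions chosen for illustration; they are not figures taken from the slides.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average time per memory access for a two-level hierarchy.

    hit_time_ns     : time to access the upper (faster) level
    miss_rate       : fraction of accesses not found in the upper level
    miss_penalty_ns : extra time to fetch from the lower (slower) level
    """
    return hit_time_ns + miss_rate * miss_penalty_ns


# Hypothetical numbers: a 3 ns cache in front of a memory that costs an
# extra 20 ns on a miss, with 95% of accesses hitting in the cache.
with_cache = average_access_time(hit_time_ns=3.0, miss_rate=0.05, miss_penalty_ns=20.0)
print(f"average access time with cache: {with_cache:.2f} ns")
```

With these assumed values the average access is close to the cache's own access time, even though most of the data lives in the slower memory.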
3.49 Exploiting Memory Hierarchy

    Memory type (nearest to CPU first)   Size                   Access speed
    Processor registers                  64 to 256 bytes        0.3 - 5 nsec
    On-chip cache                        8 - 1024 Kbytes        3 - 5 nsec
    Second-level cache                   256 - 16384 Kbytes     5 - 10 nsec
    Main memory (DRAM)                   16M - 16G bytes        ~ 20 nsec
    Disk or other store                  10's - 1000's Gbytes   10's msec

Transfers between levels, from the CPU outwards: CPU-controlled transfer; automatic (hardware-controlled) transfer; CPU-controlled transfer.

3.50 Automatic memory transfer between levels of hierarchy
[Figure: CPU; "Cache" memory holding A: x and D: w; Memory holding A: x, B: y, C: z, D: w.]
- This diagram illustrates data stored in a cache memory and a larger memory further from the CPU. Location A has value x stored in the cache, etc.
- The cache stores some, but not all, of the items in the larger memory. Here addresses A & D, if requested by the CPU, will be HITS in the cache, and addresses B & C MISSES.
- In a memory hierarchy every level looks like this; requests from the CPU move outwards till they HIT.

3.51 Cache Terminology
- A memory hierarchy can have multiple levels.
- Data is copied between two adjacent levels at a time.
- We will focus on two levels:
  - Upper level (closer to the processor, smaller but faster)
  - Lower level (further from the processor, larger but slower)
- Some terms used in describing a memory hierarchy:
  - block: minimum unit of data to transfer between levels, also called a cache line
  - hit: data requested is in the upper level
  - miss: data requested is not in the upper level

3.52 The Basics of Caches
- Two issues:
  - How do we know if a data item is in the cache? (Is it a HIT or a MISS?)
  - If it is there (a HIT), how do we find it?
- Our first example:
  - block size is one byte of data
  - we will consider the "direct mapped" approach
- For each item of data at the lower level, there is exactly one location in the cache memory where it might be. Lots of items at the lower level share a given location in the upper level, which can contain at most one at any given time.

3.53 Direct Mapped Cache
- Upper memory level: cache. Lower memory level: memory.
- Mapping:
  - The bottom (LS) bits of the address determine the position in the cache.
  - Items differing only by the top (MS) bits map to the same cache location.
- An item is replaced in the cache when a different item with the same bottom bits is requested.
- Assume data is written both to the cache and to the lower level, preserving coherence: a write-through cache.

3.54 Cache Operation
- Memory address fields: tag = bits 15:3, index = bits 2:0. Cache address = index.
- Each cache line (block) consists of Data, Tag and Valid fields.
- The index part of the memory address determines which cache line is used.
- The tag part of the memory address is matched with the stored tag in the cache.

    Index   Data   Tag   V
    000                  0
    001                  0
    010                  0
    011                  0
    100                  0
    101                  0
    110                  0
    111                  0

3.55 Cache data line
- The data field of the cache contains a cache line, which can be more than one byte of memory.
- Select bits from the memory address are used to select the byte within the cache line.
- Memory address fields: tag = bits 23:15, index = bits 14:4, select = bits 3:0.
  - $2^4 = 16$ bytes per line
  - $2^{11} = 2048$ lines
  - $2^{15} = 32768$ bytes of cache memory

Organization of Direct-mapped cache
- Address fields: Tag | Index (n bits) | Select (m bits)
- The n-bit Index selects 1 out of $2^n$ cache lines.
- The m-bit Select selects 1 out of $2^m$ data bytes within a line.

3.57 Cache Contents - A walk-through
Assume the cache is 8 blocks (lines) each of 1 byte.
- Initial state on power-ON
- After handling read of address 00000
- After handling read of address 00001
- After handling write to address 01010
- After handling read of address 01000
- After handling read of address 01010

    Cache state at power-ON:
    Index   Valid bit (V)   Tag   Data
    000     0
    001     0
    010     0
    011     0
    100     0
    101     0
    110     0
    111     0
3.58 Cache Contents - A walk-through (2)

    Cache state after the read of address 00000:
    Index   Valid bit (V)   Tag   Data
    000     1               00    MEM[00000]
    001     0
    010     0
    011     0
    100     0
    101     0
    110     0
    111     0
- Initial state on power-ON
- After handling read of address 00000: miss, mem read
- After handling read of address 00001
- After handling write to address 01010
- After handling read of address 01000
- After handling read of address 01010

3.59 Cache Contents - A walk-through (3)
- Initial state on power-ON
- After handling read of address 00000: miss, mem read
- After handling read of address 00001: miss, mem read
- After handling write to address 01010
- After handling read of address 01000
- After handling read of address 01010

    Cache state after the read of address 00001:
    Index   Valid bit (V)   Tag   Data
    000     1               00    MEM[00000]
    001     1               00    MEM[00001]
    010     0
    011     0
    100     0
    101     0
    110     0
    111     0
3.60 Cache Contents - A walk-through (4)

    Cache state after the write to address 01010:
    Index   Valid bit (V)   Tag   Data
    000     1               00    MEM[00000]
    001     1               00    MEM[00001]
    010     1               01    Written MEM[01010]
    011     0
    100     0
    101     0
    110     0
    111     0
- Initial state on power-ON
- After handling read of address 00000: miss, mem read
- After handling read of address 00001: miss, mem read
- After handling write to address 01010: miss, mem write
- After handling read of address 01000
- After handling read of address 01010
NB - would write to MEM even if it was a hit.

3.61 Cache Contents - A walk-through (5)

    Cache state after the read of address 01000:
    Index   Valid bit (V)   Tag   Data
    000     1               01    MEM[01000]
    001     1               00    MEM[00001]
    010     1               01    MEM[01010]
    011     0
    100     0
    101     0
    110     0
    111     0
- Initial state on power-ON
- After handling read of address 00000: miss, mem read
- After handling read of address 00001: miss, mem read
- After handling write to address 01010: miss, mem write
- After handling read of address 01000: miss, mem read
- After handling read of address 01010: hit

3.62 Cache Contents - A walk-through (6)

    Cache state after the final read of address 01010 (unchanged by the hit):
    Index   Valid bit (V)   Tag   Data
    000     1               01    MEM[01000]
    001     1               00    MEM[00001]
    010     1               01    MEM[01010]
    011     0
    100     0
    101     0
    110     0
    111     0
- Initial state on power-ON
- After handling read of address 00000: miss, mem read
- After handling read of address 00001: miss, mem read
- After handling write to address 01010: miss, mem write
- After handling read of address 01000: miss, mem read
- After handling read of address 01010: hit

A Further Example
- We have considered a cache consisting of 8 lines each of 1 byte (total 8 bytes).
- An alternative arrangement for an 8-byte cache is 4 lines each of 2 bytes.
- On each cache miss we must then fill the entire line (on both writes and reads).
- The sel (select) field selects which byte in a line is addressed.
- Memory address fields: tag = bits 15:3, index = bits 2:1, sel = bit 0.

- Read hits: this is what we want; the cache passes data to the processor.
- Read misses: the CPU stalls; fetch the block from memory, deliver it to the cache, restart.
- Write hits: can replace the addressed data in cache and memory (write-through).
- Write misses: read the entire block into the cache line, then write the addressed data in cache and memory (write-through).
  - Why does line size 1 not require the read?

3.64 Cont'd.
- After handling read of address 00000 (read block 0000): MISS
- After handling read of address 00001: HIT, no change from above
- After handling write to address 01010 (read block 0101, write block 0101): MISS
- After handling read of address 01011: HIT, no change from above
- After handling read of address 11010 (read block 1101): MISS

    Cache state after the reads of addresses 00000 and 00001:
    Index   Valid bit (V)   Tag   Data(0)       Data(1)
    00      1               00    MEM[00000]    MEM[00001]
    01      0
    10      0
    11      0
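    Cache state after the write to address 01010 and the read of address 01011:
    Index   Valid bit (V)   Tag   Data(0)       Data(1)
    00      1               00    MEM[00000]    MEM[00001]
    01      1               01    Write data    MEM[0xb]
    10      0
    11      0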
    Cache state after the read of address 11010:
    Index   Valid bit (V)   Tag   Data(0)       Data(1)
    00      1               00    MEM[00000]    MEM[00001]
    01      1               11    MEM[0x1a]     MEM[0x1b]
    10      0
    11      0
Each memory block is 2 bytes. The index bits are address bits 2:1.

3.65 Principle of Locality
- The principle of locality makes having a memory hierarchy a good idea.
- If an item is referenced:
  - temporal locality: it will tend to be referenced again soon
  - spatial locality: nearby items will tend to be referenced soon (e.g. sequential access)
- Why does code have locality?
- All caches exploit temporal locality.
- A cache with a larger line size will achieve a higher hit rate for spatial locality.
  - E.g. line size N items, sequential access => hit rate at least 1 - 1/N.
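The effect of line size on sequential access can be checked with a small counting sketch. It assumes a simplified model that is not spelled out in the slides: addresses are visited in order starting from a line boundary, each miss fills one whole line, and the run covers whole lines. Under those assumptions every line costs one miss followed by N-1 hits. The function name `sequential_hit_rate` is hypothetical.

```python
def sequential_hit_rate(line_size, num_accesses):
    """Hit rate for reading addresses 0, 1, 2, ... in order, when a miss
    loads the whole line containing the requested address."""
    cached_line = None  # number of the line currently held, if any
    hits = 0
    for address in range(num_accesses):
        line = address // line_size
        if line == cached_line:
            hits += 1           # same line as the previous access: hit
        else:
            cached_line = line  # miss: fill the entire line
    return hits / num_accesses


for n in (1, 2, 4, 8):
    # 64 accesses covers a whole number of lines for each of these sizes.
    print(n, sequential_hit_rate(line_size=n, num_accesses=64))
```

With a line size of 1 there is no spatial benefit at all, which is why only larger lines help purely sequential access.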
Direct Mapped Cache - Key points
- Caches exploit temporal & spatial locality to get a high hit rate.
- A particular memory item is stored in a unique location in the cache.
- To check if a particular memory item is in the cache, the index bits of the memory address are used to address the cache entry.
- The top memory address bits are then compared with the stored tag. If they are equal and the V bit is 1, we have got a hit.
- The bottom memory address bits (select) determine the byte within the cache line.
- The tag is all address bits except index and select.
- The size of a cache with N bits index and M bits select is $2^{N+M}$ bytes.
- When a miss occurs, data cannot be read from the cache. A slower read from the next level of memory must take place, incurring a miss penalty.
- Note: the V bit is only necessary to mark the initial state, where the cache contains no valid data (hence must always be a miss).
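The key points above can be exercised with a minimal Python model of a direct-mapped, write-through cache. It is a sketch under stated assumptions: only valid bits and tags are tracked (no data), a miss on either a read or a write fills the line, and the class and method names (`DirectMappedCache`, `split`, `access`, `size_bytes`) are hypothetical. The two address sequences are the ones used in the walk-throughs.

```python
class DirectMappedCache:
    """Direct-mapped cache model that tracks only valid bits and tags."""

    def __init__(self, index_bits, select_bits):
        self.index_bits = index_bits
        self.select_bits = select_bits
        lines = 1 << index_bits
        self.valid = [False] * lines  # V bit per line, all clear at power-on
        self.tags = [0] * lines

    def size_bytes(self):
        # 2^(N+M) bytes for N index bits and M select bits.
        return 1 << (self.index_bits + self.select_bits)

    def split(self, address):
        """Split an address into (tag, index, select) fields."""
        select = address & ((1 << self.select_bits) - 1)
        index = (address >> self.select_bits) & ((1 << self.index_bits) - 1)
        tag = address >> (self.select_bits + self.index_bits)
        return tag, index, select

    def access(self, address):
        """Return True on a hit. On a miss, fill the line and return False."""
        tag, index, _ = self.split(address)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


# 8 lines of 1 byte: index = 3 bits, no select bits.
cache_a = DirectMappedCache(index_bits=3, select_bits=0)
trace_a = [0b00000, 0b00001, 0b01010, 0b01000, 0b01010]
print([cache_a.access(a) for a in trace_a])

# 4 lines of 2 bytes: index = 2 bits, select = 1 bit.
cache_b = DirectMappedCache(index_bits=2, select_bits=1)
trace_b = [0b00000, 0b00001, 0b01010, 0b01011, 0b11010]
print([cache_b.access(a) for a in trace_b])
```

Both configurations hold 8 bytes in total; the second trades half the lines for a 2-byte line, which is what turns the accesses to 00001 and 01011 into hits.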