cs2071 New Notes 3

ANAND INSTITUTE OF HIGHER TECHNOLOGY Chennai-603 103
DEPARTMENT OF ELECTRONICS AND INSTRUMENTATION ENGINEERING

CS2071 COMPUTER ARCHITECTURE
Faculty Name: C.MAGESHKUMAR
Class: IV EIE A&B Semester: VII
UNIT III DATA PATH AND CONTROL

CONTENT
I. Instruction Execution Steps
1. A Small Set of Instructions
2. The Instruction Execution Unit
3. A Single-Cycle Data Path
4. Branching and Jumping
5. Deriving the Control Signals
6. Performance of the Single-Cycle Design
II. Control Unit Synthesis
7. A Multicycle Implementation
8. Choosing the Clock Cycle
9. The Control State Machine
10. Performance of the Multicycle Design
III. Microprogramming
IV. Pipelining
11. Pipelining Concepts
12. Pipeline Stalls or Bubbles
13. Pipeline Timing and Performance
14. Pipelined Data Path Design
15. Pipelined Control
16. Optimal Pipelining
V. Pipeline Performance
17. Data Dependencies and Hazards
18. Data Forwarding
19. Pipeline Branch Hazards
20. Delayed Branch and Branch Prediction
21. Advanced Pipelining
CMageshKumar_AP_AIHT
CS2071_Computer Architecture
I. INSTRUCTION EXECUTION STEPS

1. A SMALL SET OF INSTRUCTIONS
MiniMIPS instruction set 40 instructions
MicroMIPS instruction set 22 instructions
The 22 MicroMIPS instructions can be divided into 5 categories:
1. Seven (7) R-format ALU instructions (add, sub, slt, and, or, xor, nor)
2. Six (6) I-format ALU instructions (lui, addi, slti, andi, ori, xori)
3. Two (2) I-format memory access instructions (lw, sw)
4. Three (3) I-format conditional branch instructions (bltz, beq, bne)
5. Four (4) unconditional jump instructions (j, jr, jal, syscall)
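Fields of the 32-bit instruction word (inst):
Field  Bits   Width    Meaning
op     31-26  6 bits   Opcode
rs     25-21  5 bits   Source 1 or base
rt     20-16  5 bits   Source 2 or destination
rd     15-11  5 bits   Destination
sh     10-6   5 bits   Unused
fn     5-0    6 bits   Opcode extension
imm    15-0   16 bits  Operand / offset
jta    25-0   26 bits  Jump target address
R format: op, rs, rt, rd, sh, fn
I format: op, rs, rt, imm
J format: op, jta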
Fig.1. MICROMIPS INSTRUCTION FORMATS
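The field boundaries of Fig.1 can be illustrated with a short sketch that splits a 32-bit instruction word into its fields. The function name decode_fields is illustrative and not taken from the text book; the bit positions follow the standard MIPS layout (op in bits 31-26, fn in bits 5-0).

```python
def decode_fields(inst: int) -> dict:
    """Split a 32-bit instruction word into the fields of Fig.1."""
    if not 0 <= inst < 2**32:
        raise ValueError("instruction must fit in 32 bits")
    return {
        "op": (inst >> 26) & 0x3F,   # 6-bit opcode
        "rs": (inst >> 21) & 0x1F,   # source 1 or base
        "rt": (inst >> 16) & 0x1F,   # source 2 or destination
        "rd": (inst >> 11) & 0x1F,   # destination (R format)
        "sh": (inst >> 6) & 0x1F,    # unused in MicroMIPS
        "fn": inst & 0x3F,           # opcode extension (R format)
        "imm": inst & 0xFFFF,        # 16-bit operand / offset (I format)
        "jta": inst & 0x3FFFFFF,     # 26-bit jump target address (J format)
    }


# add $8, $9, $10 assembled by hand: op = 0, rs = 9, rt = 10, rd = 8, fn = 32
add_word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | 32
add_fields = decode_fields(add_word)
```

All fields are extracted for every word; which of them are meaningful depends on the format (R, I, or J) selected by op.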
Execution sequence of MicroMIPS instructions:

The seven R-format ALU instructions (add, sub, slt, and, or, xor, nor) have the following common execution sequence:
1. Read out the contents of source registers rs and rt and forward them to the ALU as inputs.
2. Inform the ALU of the desired operation by means of the appropriate control signals.
3. Write the output of the ALU into destination register rd.
Five of the six I-format ALU instructions (addi, slti, andi, ori, xori) have the following common execution sequence:
1. Read out the content of source register rs and forward it, together with the immediate value from the instruction, to the ALU as inputs.
2. Inform the ALU of the desired operation by means of the appropriate control signals.
3. Write the output of the ALU into destination register rt.
The remaining I-format ALU instruction (lui) has the following execution sequence:
1. Forward the immediate value from the instruction to the ALU as input; no source register value is needed.
2. Inform the ALU to place the immediate value in the upper half of the result, with zeros in the lower half.
3. Write the output of the ALU into destination register rt.
The two I-format memory access instructions (lw, sw) have the following common execution sequence:
1. Read out the content of rs.
2. Add the number read out from rs to the immediate value in the instruction to form a memory address.
3. Read from (lw) or write into (sw) memory at the specified address.
4. In the case of lw, place the word read out from memory into rt.
The three I-format conditional branch instructions (bltz, beq, bne) and the four unconditional jump instructions (j, jr, jal, syscall) modify the program counter as follows:
1. For the branch instructions, read out the content of rs (and of rt for beq and bne) to evaluate the branch condition.
2. The branch target address is specified by an offset relative to the incremented program counter value (PC)+4.
3. For example, to branch back to the previous instruction, the offset supplied in the immediate field of the instruction is -2, which gives the branch target address (PC)+4+(-2*4) = (PC)-4.
4. For beq and bne, the contents of rs and rt are compared to determine whether the branch condition is satisfied.
5. For bltz, the branch decision is based on the sign bit of the content of rs.
6. For the four jump instructions (j, jr, jal, syscall), the PC is unconditionally modified so that the next instruction is fetched from the jump target address. The jump target address comes from the instruction itself (j, jal), is read out from register rs (jr), or is a known constant associated with the location of an operating system routine (syscall).
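The address arithmetic described above can be checked with a small sketch. The function names are illustrative. The branch computation follows the notes; the way the 26-bit jta field is combined with the upper 4 bits of the incremented PC is the standard MIPS convention and is an assumption here, since the notes do not spell it out.

```python
def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm: int) -> int:
    """Branch target = (PC) + 4 + 4 * offset, wrapped to 32 bits."""
    return (pc + 4 + 4 * sign_extend16(imm)) & 0xFFFFFFFF


def jump_target(pc: int, jta: int) -> int:
    """Assumed convention: 4 MSBs of (PC)+4, then the 26-bit jta, then 00."""
    return ((pc + 4) & 0xF0000000) | ((jta & 0x3FFFFFF) << 2)


# Offset -2 (0xFFFE as a 16-bit field) branches back to the previous instruction.
previous = branch_target(0x1000, 0xFFFE)
```

With (PC) = 0x1000 the result is 0x0FFC, that is (PC)-4, in agreement with the worked example in the list above.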
2. THE INSTRUCTION EXECUTION UNIT
The step-by-step execution of all 22 MicroMIPS instructions can be followed on the block diagram below:
1. Beginning at the left end, the content of the program counter (PC) is supplied to the instruction cache and an instruction word is read out from the specified location.
2. With every clock tick, a new address is loaded into the program counter, causing a new instruction to appear at the output of the instruction cache after a short access delay.
3. The contents of the various fields of the instruction are sent to the relevant blocks, including the control unit (which decides the operation to be performed).
4. Once an instruction has been read out from the instruction cache, its fields are separated and dispatched to the appropriate places.
Example: the op and fn fields go to the control unit; rs, rt and rd go to the register file.
5. The upper input of the ALU always comes from register rs; the lower input of the ALU comes from rt or from the immediate value of the instruction.
6. As the data from the register file pass through the ALU, the specified operation is performed and the result appears at the ALU output.
7. In the case of arithmetic and logic instructions, the ALU output bypasses the data cache and runs through the feedback line to be stored in the destination register (rd or rt) of the register file.
8. In the case of memory access instructions, the ALU output is treated as the data address for writing into or reading from the data cache.
9. Data cache: for many instructions the output of the ALU is stored in a register, so the data cache is bypassed. For lw and sw instructions the data cache is accessed, with the content of rt written into the data cache for sw and the data cache output sent to the register file for lw.
10. In one clock cycle, the contents of any 2 of the 32 registers (rs and rt) are read out through the read ports. At the same time, the output of the ALU is stored in a register via the write port.
11. The flip-flops forming the registers are edge-triggered, so reading from and writing into the same register in a single clock cycle does not cause any problem.
12. For beq and bne instructions, the contents of rs and rt are compared to determine whether the branch condition is satisfied. The comparison is performed in the next-address block.
13. In the case of bltz, the branch decision is based on the sign bit of the content of rs rather than on a comparison of two register contents. This is also performed by the next-address block.
14. The next-address block also chooses the jump target address under the guidance of the control unit.
15. The jump target address comes from the instruction itself (j, jal instructions) or is read out from register rs (jr instruction).
16. The middle part, comprising the program counter, instruction cache, register file, ALU and data cache, is known as the data path.
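[Figure: PC, instruction cache, register file, ALU and data cache connected left to right, with a next-address block (inputs: jta for j and jal, register contents for beq, bne, bltz and jr, a constant address for syscall) and a control unit driven by the op and fn fields. The separate instruction and data caches make this a Harvard architecture.]
Fig.2. Abstract view of the instruction execution unit for MicroMIPS.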
3. A SINGLE-CYCLE DATA PATH

1. The middle part, comprising the program counter, instruction cache, register file, ALU and data cache, is known as the data path.
2. The data path shown above is capable of executing one instruction per clock cycle, hence the name single-cycle data path.
3. Single-cycle design: clock rate 125 MHz, CPI = 1.
4. Three multiplexers are used in the data path:
1. At the input side of the register file
2. At the lower input of the ALU
3. At the outputs of the ALU and the data cache
5. Multiplexer 1 (at the input side of the register file):
i. This multiplexer allows rt, rd or $31 to be used as the index of the destination register into which a result will be written.
ii. The control signal RegDst, supplied by the control unit, directs the selection of rt, rd or $31.
iii. RegDst control signal values and the corresponding selections:
S.no  RegDst  Selection
1     00      rt
2     01      rd
3     10      $31
iv. RegWrite is asserted by the control unit to write into the register file.
6. Registers rs and rt are read out for every instruction, even when they are not needed, so there is no read control signal.
7. The instruction cache likewise receives no read control signal, since an instruction is read out in every cycle.
8. Multiplexer 2 (at the lower input of the ALU):
i. By asserting or deasserting the ALUSrc control signal, the control unit chooses either the content of rt or the 32-bit sign-extended version of the 16-bit immediate operand as the second ALU input.
1. If ALUSrc = 0 (deasserted), the content of rt is used as the lower ALU input.
2. If ALUSrc = 1 (asserted), the 32-bit sign-extended version of the 16-bit immediate operand is used as the lower ALU input.
ii. Sign extension of the immediate operand is performed by the SE block.
9. Multiplexer 3 (at the outputs of the ALU and the data cache): the control signal used here is RegInSrc.
S.no  RegInSrc  Selection
1     00        Data cache output
2     01        ALU output
3     10        Incremented PC value coming from the next-address block
10. With every clock tick, a new address is loaded into the program counter, causing a new instruction to appear at the output of the instruction cache after a short access delay.
11. The contents of the various fields of the instruction are sent to the relevant blocks, including the control unit (which decides the operation to be performed).
12. As the data from the register file pass through the ALU, the operation specified by the ALUFunc signal is performed and the result appears at the ALU output.
13. In the case of arithmetic and logic instructions, the ALU output bypasses the data cache and runs through the feedback line to be stored in the destination register of the register file.
14. In the case of memory access instructions, the ALU output is treated as the data address for writing into (DataWrite signal) or reading from (DataRead signal) the data cache.
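The three multiplexers described above can be modelled as plain selection functions. This is a minimal sketch with illustrative function names; the control signal encodings (00, 01, 10) are the ones listed in the notes, written here as the integers 0, 1 and 2.

```python
def sign_extend16(imm: int) -> int:
    """SE block: 16-bit immediate to a 32-bit two's-complement pattern."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if imm & 0x8000 else imm


def select_dest_register(reg_dst: int, rt: int, rd: int) -> int:
    """Multiplexer 1: RegDst chooses rt (0), rd (1) or $31 (2)."""
    return {0: rt, 1: rd, 2: 31}[reg_dst]


def select_alu_lower_input(alu_src: int, rt_value: int, imm: int) -> int:
    """Multiplexer 2: ALUSrc chooses (rt) (0) or the sign-extended imm (1)."""
    return rt_value if alu_src == 0 else sign_extend16(imm)


def select_register_input(reg_in_src: int, data_cache_out: int,
                          alu_out: int, incr_pc: int) -> int:
    """Multiplexer 3: RegInSrc chooses data cache (0), ALU (1) or incremented PC (2)."""
    return (data_cache_out, alu_out, incr_pc)[reg_in_src]
```

For example, an R-format instruction would use RegDst = 1, ALUSrc = 0 and RegInSrc = 1, while lw would use RegDst = 0, ALUSrc = 1 and RegInSrc = 0.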
[Figure: single-cycle MicroMIPS data path. Blocks: PC, next-address block, instruction cache, register file, SE (sign extension, 16 to 32 bits), ALU, data cache. Control signals: Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RegInSrc. Status signal: ALUOvfl.]
4. BRANCHING AND JUMPING:
(Refer page no. 249,250 in text book B.Parhami)

5. DERIVING THE CONTROL SIGNALS:
(Refer page no. 250-253 in text book B.Parhami)

Control signals for the single-cycle MicroMIPS implementation.
6. PERFORMANCE OF THE SINGLE-CYCLE DESIGN
(Refer page no. 253-255 in text book B.Parhami)

II. CONTROL UNIT SYNTHESIS
7. MULTICYCLE IMPLEMENTATION:
[Figure: time needed versus time allotted for Instr 1 to Instr 4. In the single-cycle design every instruction is allotted one long clock cycle; in the multicycle design the four instructions take 3, 5, 3 and 4 short cycles, and the unused time is saved.]
Fig.3. Single-cycle versus multicycle instruction execution.
With a multicycle design, only a subset of the actions required by an instruction is performed in each clock cycle.
Hence the clock cycle can be made much shorter, with several cycles needed to execute a single instruction.
The advantages of a multicycle implementation are greater speed and economy.
MULTICYCLE DATA PATH:
[Figure: PC, cache, Inst Reg, Data Reg, register file, x Reg, y Reg, ALU and z Reg, with a control unit driven by the op and fn fields.]
Fig.4. Abstract view of a multicycle instruction execution unit for MicroMIPS.

1. The data path in the above block diagram executes one instruction in every 3 to 5 clock cycles, hence the name multicycle data path.
2. Multicycle design: clock rate 500 MHz, CPI approx. 4.
3. Cache block = instruction cache + data cache.
4. Every instruction completes within 5 cycles (refer to the control state machine).
5. When a word is read from the cache block, it must be held in a register for use in subsequent cycles.
6. Two registers, the instruction register and the data register, sit between the cache and the register file. The instruction register is needed because, once the instruction is read out, it must be kept for all the remaining cycles of its execution to generate the control signals appropriately.
7. A second register, the data register, is therefore needed for the data read out by lw.
8. Three other registers, namely x, y and z, serve the same purpose of holding information between cycles.
9. Note that all registers except the program counter and the instruction register are loaded in every clock cycle.
10. Instruction fetch cycle: the execution of every instruction starts the same way in the first cycle. The content of the PC is used to access the cache and the retrieved word is placed in the instruction register.
11. In the second clock cycle, the instruction is decoded and the registers rs and rt are accessed.
12. If the instruction is one of the four jump instructions (j, jr, jal, syscall), its execution terminates in the 3rd cycle by writing the appropriate address into the PC.
13. If it is a branch instruction (beq, bne, bltz), the branch condition is checked and the appropriate value is written into the PC in the 3rd cycle.
14. All other instructions proceed to the 4th cycle and, except for lw, are completed in it.
15. The lw instruction requires a 5th cycle to write the data retrieved from the cache into a register.
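The cycle counts in the list above (3 for jumps and branches, 4 for ALU instructions and sw, 5 for lw) determine the average CPI once an instruction mix is known. The sketch below uses a hypothetical mix chosen only for illustration; it is not taken from the notes or the text book. The clock rates are the ones quoted in these notes (125 MHz single-cycle, 500 MHz multicycle).

```python
# Cycles per instruction class, as stated in the notes.
CYCLES = {"jump_branch": 3, "alu": 4, "sw": 4, "lw": 5}

# Hypothetical instruction mix (fractions must sum to 1); an assumption.
ASSUMED_MIX = {"alu": 0.5, "lw": 0.2, "sw": 0.1, "jump_branch": 0.2}


def average_cpi(mix: dict, cycles: dict = CYCLES) -> float:
    """Weighted average of the per-class cycle counts."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles[cls] for cls, fraction in mix.items())


def mips_rating(clock_mhz: float, cpi: float) -> float:
    """Millions of instructions per second = clock rate in MHz / CPI."""
    return clock_mhz / cpi


multicycle_cpi = average_cpi(ASSUMED_MIX)
multicycle_mips = mips_rating(500, multicycle_cpi)
single_cycle_mips = mips_rating(125, 1)
```

Under this assumed mix the multicycle CPI comes out at about 4, so the 500 MHz multicycle design and the 125 MHz single-cycle design deliver roughly the same instruction rate; a different mix would shift the comparison.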
FOR A DETAILED EXPLANATION OF THE CONTROL SIGNALS AND MULTIPLEXERS, REFER PAGE NO. 260, 261 IN TEXT BOOK B.PARHAMI
8. CHOOSING THE CLOCK CYCLE
(Refer page no. 262 in text book B.Parhami)
9. THE CONTROL STATE MACHINE
CONTROL STATE MACHINE for MULTICYCLE MicroMIPS
The control unit must distinguish between the 5 cycles of the multicycle design and must also be able to perform different operations depending on the instruction.
The control state machine diagram depicts the control states and the state transitions.
The control state machine carries the required information along by moving from state to state. It is set to state 0 when program execution begins.
It then moves from state to state until one instruction has been completed, at which point it returns to state 0 to begin the execution of the next instruction.
The control state sequences for the various MicroMIPS instruction classes are as follows:
Instruction class  Control state sequence
ALU type           0, 1, 7, 8
Load word          0, 1, 2, 3, 4
Store word         0, 1, 2, 6
Jump / branch      0, 1, 5
In each state except states 5 and 7, the control signals are uniquely determined.
Information regarding the current control state and the instruction being executed is supplied by decoders.
The control signals can be derived from the control state machine diagram and the decoder diagram.
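The state sequences listed above can be reproduced by a small next-state function. This is a sketch only: the class labels ("alu", "lw", "sw", "jump_branch") and function names are illustrative, and the transitions are inferred from the four sequences given in the notes rather than copied from the state diagram.

```python
def next_state(state: int, instr_class: str) -> int:
    """Next control state for the multicycle MicroMIPS control state machine."""
    if state == 0:                      # fetch
        return 1
    if state == 1:                      # decode: multiway branch on class
        return {"alu": 7, "lw": 2, "sw": 2, "jump_branch": 5}[instr_class]
    if state == 2:                      # address computed: lw reads, sw writes
        return {"lw": 3, "sw": 6}[instr_class]
    if state == 3:                      # lw: memory read done, write back next
        return 4
    if state == 7:                      # ALU operation done, write back next
        return 8
    if state in (4, 5, 6, 8):           # final states return to fetch
        return 0
    raise ValueError(f"unknown control state {state}")


def state_sequence(instr_class: str) -> list:
    """States visited by one instruction, starting from state 0."""
    states = [0]
    while True:
        following = next_state(states[-1], instr_class)
        if following == 0:
            return states
        states.append(following)
```

The length of each sequence equals the number of clock cycles the instruction class needs: 3 for jumps and branches, 4 for ALU instructions and sw, 5 for lw.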
Examples of control signal expressions (∨ = OR, ∧ = AND, ' = complement):
Control signals that depend only on the control state:
ALUSrcX = ControlSt2 ∨ ControlSt5 ∨ ControlSt7
RegWrite = ControlSt4 ∨ ControlSt8
Auxiliary signals identifying instruction classes:
addsubInst = addInst ∨ subInst ∨ addiInst
logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
Logic expressions for ALU control signals:
AddSub = ControlSt5 ∨ (ControlSt7 ∧ subInst)
FnClass1 = ControlSt7' ∨ addsubInst ∨ logicInst
FnClass0 = ControlSt7 ∧ (logicInst ∨ sltInst ∨ sltiInst)
LogicFn1 = ControlSt7 ∧ (xorInst ∨ xoriInst ∨ norInst)
LogicFn0 = ControlSt7 ∧ (orInst ∨ oriInst ∨ norInst)
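The expressions above can be evaluated in software to check individual cases. The sketch below reads the listed terms as OR combinations and the parenthesized products as AND, and it assumes that ControlSt7 appears complemented in FnClass1 (so that the ALU adds in every state other than 7); both readings are assumptions about notation lost in these notes. The function name and the use of mnemonic strings are illustrative.

```python
def control_signals(state: int, inst: str) -> dict:
    """Evaluate the example control signals for a control state and mnemonic."""
    in_state = lambda n: state == n            # stands for ControlSt<n>
    addsub_inst = inst in {"add", "sub", "addi"}
    logic_inst = inst in {"and", "or", "xor", "nor", "andi", "ori", "xori"}
    return {
        "ALUSrcX": in_state(2) or in_state(5) or in_state(7),
        "RegWrite": in_state(4) or in_state(8),
        "AddSub": in_state(5) or (in_state(7) and inst == "sub"),
        # Assumption: complemented ControlSt7 in FnClass1.
        "FnClass1": (not in_state(7)) or addsub_inst or logic_inst,
        "FnClass0": in_state(7) and (logic_inst or inst in {"slt", "slti"}),
        "LogicFn1": in_state(7) and inst in {"xor", "xori", "nor"},
        "LogicFn0": in_state(7) and inst in {"or", "ori", "nor"},
    }
```

For instance, control_signals(7, "nor") sets both LogicFn bits, while control_signals(2, "lw") selects arithmetic (FnClass1 set, FnClass0 clear) for the address computation.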
[Figure: Decoders.
op decoder (6-bit op input): RtypeInst (0), bltzInst (1), jInst (2), jalInst (3), beqInst (4), bneInst (5), addiInst (8), sltiInst (10), andiInst (12), oriInst (13), xoriInst (14), luiInst (15), lwInst (35), swInst (43).
fn decoder (6-bit fn input): jrInst (8), syscallInst (12), addInst (32), subInst (34), andInst (36), orInst (37), xorInst (38), norInst (39), sltInst (42).
st decoder (4-bit state input): ControlSt0 to ControlSt8.]
10. PERFORMANCE OF THE MULTICYCLE DESIGN
(Refer page no. 266 in text book B.Parhami)
III. MICROPROGRAMMING
The control state machine resembles a program that has instructions (states), branching, and loops. Such a hardware program is called a microprogram and its basic steps are microinstructions.
A microinstruction is a single instruction in microcode. It is the most elementary instruction in the computer, such as moving the contents of a register to the arithmetic logic unit (ALU).
It takes several microinstructions to carry out one complex machine instruction (CISC).
Microinstructions, also called "micro-ops" or "µops", differ within the same computer family and even within the same vendor.
Microprogrammed control is a control mechanism that generates control signals by using a memory called control storage (CS), which contains the control signals.
Although microprogrammed control seems advantageous for CISC machines, since CISC requires systematic development of sophisticated control signals, there is no intrinsic difference between microprogrammed control and hardwired control.
Microprogramming is a method of control unit design in which the control signal selection and sequencing information is stored in a ROM or RAM called the control store or control memory.
The microprogrammed control unit is a general approach to implementing the control unit. Here the control signals are generated by a program similar to a machine language program.
Instead of implementing the control state machine in custom hardware, we can store microinstructions in locations of a control ROM, fetching and executing a sequence of microinstructions for each machine language instruction.
Each microinstruction defines a step in the execution of a machine language instruction.
Advantages of ROM-based implementation of control:
o Simpler hardware
o More regular structure
o Less dependence on instruction-set architecture details
o The same hardware can be used for different purposes by modifying the ROM contents
Microprogramming: designing a suitable sequence of microinstructions to realize a particular instruction set architecture is called microprogramming.
Microprogrammable machine: if the microprogram is easily modifiable, even by the user, the machine is called a microprogrammable machine.
Microinstruction format:
o The microinstruction is 23 bits wide. Except for the sequence control bits, each bit corresponds one-to-one to a control signal of the multicycle data path.
o The 2-bit sequence control field controls microinstruction sequencing in the same way that PC control affects the sequencing of machine language instructions.
Microprogrammed control unit: the diagram of the microprogrammed control unit for MicroMIPS shows 4 options (selected by a MUX) for choosing the next microinstruction.
o Option 0: advance to the next microinstruction in sequence by incrementing the microprogram counter.
o Options 1 and 2: allow branching that depends on the opcode field of the machine instruction being executed.
o Option 3: go to microinstruction 0, corresponding to state 0 (refer to the control state machine). This initiates the fetch phase for the next machine instruction.
Dispatch table 1: corresponds to the multiway branch in going from cycle 2 to cycle 3.
Dispatch table 2: implements the branch between cycles 3 and 4 (refer to the control state machine).
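The four sequencing options can be sketched as a function that computes the next microprogram counter value. The function name is illustrative, and the dispatch table contents below are hypothetical: they simply reuse the control state numbers as microinstruction addresses for a few opcodes (0 = R-type, 4 = beq, 35 = lw, 43 = sw), which is an assumption and not the text book's microprogram.

```python
def next_micro_pc(micro_pc: int, seq_ctrl: int, op: int,
                  dispatch1: dict, dispatch2: dict) -> int:
    """Select the next microinstruction address from the 2-bit sequence control."""
    if seq_ctrl == 0:
        return micro_pc + 1          # option 0: next microinstruction in sequence
    if seq_ctrl == 1:
        return dispatch1[op]         # option 1: multiway branch, cycle 2 to 3
    if seq_ctrl == 2:
        return dispatch2[op]         # option 2: branch between cycles 3 and 4
    if seq_ctrl == 3:
        return 0                     # option 3: back to fetch (state 0)
    raise ValueError("sequence control is a 2-bit field")


# Hypothetical dispatch tables, indexed by opcode.
DISPATCH1 = {0: 7, 4: 5, 35: 2, 43: 2}
DISPATCH2 = {35: 3, 43: 6}

# Microinstruction addresses visited by lw under these assumptions.
lw_path = [0]
for seq in (0, 1, 2, 0):
    lw_path.append(next_micro_pc(lw_path[-1], seq, 35, DISPATCH1, DISPATCH2))
```

With these tables, lw visits addresses 0, 1, 2, 3, 4, mirroring the load word state sequence of the control state machine.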
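[Figure: fields of the microinstruction.
PC control: JumpAddr, PCSrc, PCWrite
Cache control: InstData, MemRead, MemWrite, IRWrite
Register control: RegWrite, RegDst, RegInSrc
ALU inputs: ALUSrcX, ALUSrcY
ALU function: AddSub, LogicFn, FnType
Sequence control]
23-BIT MICROINSTRUCTION FORMAT FOR MICROMIPS.
[Figure: the MicroPC supplies the address to the microprogram memory or PLA, whose output is held in the microinstruction register and provides the control signals to the data path and the sequence control field. A multiplexer with inputs 0 to 3 loads the MicroPC from the incrementer, from dispatch table 1 or dispatch table 2 (both driven by op from the instruction register), or with the starting address.]
Microprogrammed control unit for MicroMIPS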
(For detailed explanation with microprogram example please Refer page no. 269 - 271 in text
book B.Parhami)
IV. PIPELINING
11. PIPELINING CONCEPTS
Two strategies for achieving greater performance:
Strategy 1: Multiple-instruction-issue or superscalar organization: use multiple independent data paths that can accept several instructions that are read out at once.
Strategy 2: Pipelined or super-pipelined organization: overlap the execution of several instructions in the single-cycle design, starting the next instruction before the previous instruction has finished executing.
Pipelining:
Pipelining is an implementation technique where multiple instructions are overlapped in execution. The
computer pipeline is divided in stages.
Each stage completes a part of an instruction in parallel. The stages are connected one to the next to form
a pipe - instructions enter at one end, progress through the stages, and exit at the other end.
Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction
throughput.
The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline.
5 instruction execution steps / stages in a pipelining of MicroMIPS:

Each step takes 1-2 ns.
1. Instruction Fetch
2. Instruction Decode and register access
3. ALU operation
4. Data memory access
5. Register writeback
[Figure: Instr 1 to Instr 5 (task dimension) enter the pipeline in successive cycles and each passes through instruction cache, register file, ALU, data cache and register file; the time dimension spans Cycle 1 to Cycle 9.]
Pipelined Instruction Execution (Pipelining in the MicroMIPS instruction execution process.)
In a task-time diagram, the stages of each task are horizontally aligned and their positions along the horizontal axis represent the timing of their execution.
In a space-time diagram, the vertical axis represents the stages in the pipeline (the space dimension) and the boxes representing the various stages of a task are diagonally aligned.
Ideally, a q-stage pipeline can increase instruction execution throughput by a factor of q. In practice this is not quite achieved because of the following:
o Effects of pipeline start-up and drainage
o Wastage due to unequal stage delays
o Time overhead of saving stage results in registers
o Safety margin in the clock period necessitated by clock skew
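The start-up and drainage effect alone can be quantified with a short sketch. It assumes a stall-free pipeline with equal stage delays, so that n tasks on q stages take q + n - 1 cycles, and it ignores the other three overheads in the list. The function names are illustrative.

```python
def pipeline_cycles(stages: int, tasks: int) -> int:
    """Cycles needed by a stall-free pipeline: fill the pipe, then one task per cycle."""
    if stages < 1 or tasks < 1:
        raise ValueError("stages and tasks must be positive")
    return stages + tasks - 1


def throughput_gain(stages: int, tasks: int) -> float:
    """Gain over running each task through all stages one after another."""
    return (stages * tasks) / pipeline_cycles(stages, tasks)
```

A 5-stage pipeline executing 7 instructions needs 11 cycles instead of 35, a gain of about 3.2 rather than the ideal 5; the gain approaches 5 only as the number of instructions grows.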
[Figure: (a) Task-time diagram with instructions 1 to 7 on the vertical axis and cycles 1 to 11 on the horizontal axis. (b) Space-time diagram with pipeline stages 1 to 5 on the vertical axis and cycles 1 to 11 on the horizontal axis, with the start-up and drainage regions marked. Legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback.]
Fig. Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions).
12. PIPELINE STALLS OR BUBBLES
Data dependency in a pipeline: the execution of one instruction depends on the completion of a previous instruction.
Data dependency in a pipeline can cause pipeline stalls, which diminish the performance.
Types of data dependency:
o Read-after-compute: register access after updating it with a computed value.
o Read-after-load: register access after updating it with data from memory.
An example of read-after-compute dependency is shown in the diagram below, where the 3rd instruction uses the value that the 2nd instruction writes into register $8, and the 4th instruction needs the result of the 3rd instruction in register $9. Note that the write operation into register $8 is completed in cycle 6. Hence, reading the new value from register $8 is possible beginning with cycle 7. The 3rd instruction, however, reads out registers $8 and $2 in cycle 4. The data dependency problem can be solved by bubble insertion or by data forwarding.
BUBBLE INSERTION:
First detect the type of data dependency.
Bubble insertion: inserting a redundant and harmless instruction (such as adding 0 to a register or shifting a register by 0 bits) before the dependent instruction. Such an instruction is called a no-op (no-operation) instruction. Since no-ops perform no useful task yet occupy memory and pipeline slots, they resemble bubbles in a water pipe; hence the name bubble insertion.
Insertion of bubbles in a pipeline implies
o reduced throughput
o degraded performance, especially when more than 2 bubbles are inserted.
So bubble insertion should be minimized. It can be minimized by relocating a useful instruction of the program between the data-dependent instructions instead of inserting bubbles.
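The number of bubbles needed without forwarding follows from the stage positions: a result is written in stage 5 and operands are read in stage 2. The sketch below is an illustration under that 5-stage assumption; the function name and the distance parameter (1 means the consumer immediately follows the producer) are not from the notes.

```python
def bubbles_needed(distance: int, same_cycle_write_read: bool = False) -> int:
    """Bubbles required between a producer and a consumer without forwarding.

    The producer, fetched in cycle c, writes its register in cycle c + 4.
    The consumer, fetched in cycle c + distance, would read in cycle
    c + distance + 1 if no bubbles were inserted.
    """
    if distance < 1:
        raise ValueError("the consumer must follow the producer")
    # Earliest cycle (relative to c) in which the new value can be read.
    earliest_read = 4 if same_cycle_write_read else 5
    return max(0, earliest_read - (distance + 1))
```

For adjacent instructions this gives 3 bubbles, which matches the example above (write completed in cycle 6, read wanted in cycle 4, read possible from cycle 7), or 2 bubbles if a register can be updated and read in the same cycle. An independent instruction moved in between reduces the count by one.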
DATA FORWARDING:
Data forwarding is the technique of passing the ALU output of the 1st instruction directly to the ALU input of the 2nd instruction that needs it, without waiting for the value to be written back into the register file. See the diagrams below.
Control dependency:
When a conditional branch is executed, the location of the next instruction depends on whether the branch condition is satisfied. Since branch conditions are based on testing register contents, the branch condition is resolved at the end of the 2nd pipeline stage. Therefore a bubble is required after every conditional branch instruction.
[Figure: pipelined execution of $5 = $6 + $7, $8 = $8 + $6, $9 = $8 + $2 and sw $9, 0($3) over cycles 1 to 8, with the data forwarding paths marked.]
Read-after-write data dependency and its possible resolution through data forwarding.
[Figure: Instr 1 to Instr 5 over cycles 1 to 9, where one instruction writes into $8 and a following instruction reads from $8, with bubbles inserted between them.]
Without data forwarding, three bubbles are needed to resolve a read-after-write data dependency; two bubbles suffice if we assume that a register can be updated and read from in one cycle.
[Figure: sw $6, . . ., lw $8, . . . and $9 = $8 + $2 over cycles 1 to 8, with the options "Reorder?" and "Insert bubble?" marked. Without data forwarding, three (two) bubbles are needed to resolve a read-after-load data dependency.]
Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding.
[Figure: $6 = $3 + $5, beq $1, $2, . . . and $9 = $8 + $2 over cycles 1 to 8, with the options "Insert bubble?" and "Reorder? (delayed branch)" marked. Annotations: "Assume branch resolved here" and "Here would need 1-2 more bubbles".]
Control dependency due to conditional branch.
13. PIPELINE TIMING AND PERFORMANCE (Refer page no. 284 in text book B.Parhami)
14. PIPELINED DATA PATH DESIGN (Refer page no. 285-286 in text book B.Parhami for a detailed description of each stage)
The pipelined datapath for MicroMIPS is obtained by inserting latches or registers in single-cycle data path.
The 5 pipeline stages are
1. Instruction Fetch
2. Instruction Decode and register access
3. ALU operation
4. Data memory access
5. Register writeback
[Figure: pipelined MicroMIPS data path divided into Stage 1 to Stage 5. Blocks: PC, instruction cache, incrementer, next-address block, register file, SE, ALU, data cache. Control signals: SeqInst, Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RetAddr, RegInSrc.]
15. PIPELINED CONTROL (Refer page no. 289 in text book B.Parhami)
16. OPTIMAL PIPELINING (Refer page no. 291 in text book B.Parhami)
V. PIPELINE PERFORMANCE
17. DATA DEPENDENCIES AND HAZARDS
Data dependency in a pipeline: the execution of one instruction depends on the completion of a previous instruction; in other words, one instruction requires data generated by a previous instruction.
The generated data may reside in a register or memory location where the subsequent instruction expects to find the value.
In the diagram below, each of the 2nd through 5th instructions reads a register written into by the 1st instruction.
o The 5th instruction needs the content of register $2 after the register writeback by the 1st instruction has been completed, so it causes no problem.
o The 4th instruction needs the new content of register $2 in the same cycle in which the 1st instruction writes it, which poses a minor problem.
o The 2nd and 3rd instructions need the new content of register $2 before the 1st instruction has written it back. This is the major data dependency problem.
Data dependency in a pipeline can cause pipeline stalls, which diminish the performance.
Types of data dependency:
o Read-after-compute: register access after updating it with a computed value. This dependency exists when one instruction updates a register with a computed value and a subsequent instruction uses the content of that register as an operand.
o Read-after-load: register access after updating it with data from memory. This dependency arises when one instruction loads a new value from memory into a register and a subsequent instruction uses the content of that register as an operand.
[Figure: $2 = $1 - $3 followed by four instructions that read register $2, shown over cycles 1 to 9.]
SINCE THE TOPICS BELOW ARE CLEAR AND READABLE IN THE BOOK, PLEASE REFER PAGE NO. 298-308 IN TEXT BOOK B.PARHAMI
18. DATA FORWARDING:
Resolving Data Dependencies via Forwarding: When a previous instruction writes back a value
computed by the ALU into a register, the data dependency can always be resolved through forwarding
[Figure: $2 = $1 - $3 followed by instructions that read register $2, shown over cycles 1 to 9, illustrating the case that forwarding can resolve.]
Certain Data Dependencies Lead to Bubbles: When the immediately preceding instruction writes a value
read out from the data memory into a register, the data dependency cannot be resolved through forwarding
(i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.
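The two cases above (ALU result: forward; load result needed by the very next instruction: bubble) can be summarized in a small decision function. This is a simplified sketch that looks only at the immediately preceding instruction; the function name, the string labels and the tuple of source register numbers are illustrative choices, not the text book's hazard detector.

```python
def resolve_dependency(producer: str, dest: int, sources: tuple) -> str:
    """Decide how a dependency on the immediately preceding instruction is handled.

    producer: mnemonic of the preceding instruction (for example "sub" or "lw")
    dest:     register number written by the preceding instruction
    sources:  register numbers read by the current instruction
    """
    if dest not in sources:
        return "none"       # no data dependency between the two instructions
    if producer == "lw":
        return "bubble"     # loaded value is not yet available: stall one cycle
    return "forward"        # ALU result can be forwarded to the ALU input
```

For example, an instruction reading $2 right after $2 = $1 - $3 is handled by forwarding, whereas the same instruction right after lw $2,4($12) requires a bubble.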
[Figure: lw $2,4($12) followed by instructions that read register $2, shown over cycles 1 to 9.]
19. PIPELINE BRANCH HAZARDS

Software-based solutions
Compiler inserts a no-op after every branch (simple, but wasteful)
Branch is redefined to take effect after the instruction that follows it
Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions
Mechanism similar to data hazard detector to flush the pipeline
Constitutes a rudimentary form of branch prediction:
o Always predict that the branch is not taken, flush if mistaken
o More elaborate branch prediction strategies possible
20. DELAYED BRANCH AND BRANCH PREDICTION
Predicting whether a branch will be taken
Always predict that the branch will not be taken
Use program context to decide (backward branch is likely taken, forward branch is likely not taken)
Allow programmer or compiler to supply clues
Decide based on past history (maintain a small history table); to be discussed later
Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines
Problem with this approach: each branch in a loop entails two mispredictions:
1. Once in the first iteration (the loop is repeated, but the history indicates exit from the loop)
2. Once in the last iteration (when the loop is terminated, but the history indicates repetition)
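The two-mispredictions effect can be reproduced with a short simulation. The sketch compares a predictor that remembers only the last outcome of the branch with a two-bit saturating counter; the latter is a commonly used scheme and is an assumption here, not necessarily the exact state diagram used in the text book. Function names and initial states are illustrative.

```python
def mispredictions_one_bit(outcomes, last_taken=True):
    """Predict that the branch repeats its previous outcome."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken
    return misses


def mispredictions_two_bit(outcomes, counter=3):
    """Two-bit saturating counter: predict taken when the counter is 2 or 3."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Loop-closing branch of a 4-iteration loop: taken 3 times, then not taken.
# The loop as a whole is executed 3 times.
trace = [True, True, True, False] * 3
one_bit_misses = mispredictions_one_bit(trace)
two_bit_misses = mispredictions_two_bit(trace)
```

With this trace the one-bit scheme mispredicts 5 times (once on the first loop exit, then twice per later execution of the loop), while the two-bit counter mispredicts only 3 times (once per loop exit).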
Other branch prediction algorithms:
[Figure: four-state branch prediction state diagrams with the states "Predict taken", "Predict taken again", "Predict not taken" and "Predict not taken again", and transitions labelled "Taken" and "Not taken".]
Hardware Implementation of Branch Prediction

The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches
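A direct-mapped lookup of this kind can be sketched as follows. The class name, the table size and the single history bit per entry are assumptions made for illustration; dropping the two byte-offset bits of the PC before indexing is also an assumption.

```python
class BranchPredictionTable:
    """Direct-mapped table of recent branches, indexed by low-order PC bits."""

    def __init__(self, index_bits: int = 4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Drop the 2 byte-offset bits, keep the low-order index bits.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def update(self, pc: int, target: int, taken: bool) -> None:
        """Record the branch address, its target and one history bit."""
        self.entries[self._index(pc)] = (pc, target, taken)

    def next_pc(self, pc: int) -> int:
        """Use the target if the stored address matches and history says taken."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]
        return pc + 4    # otherwise fall through to the incremented PC


table = BranchPredictionTable(index_bits=4)
table.update(0x40, 0x20, taken=True)
```

As in a direct-mapped cache, two branch addresses that share the same low-order bits compete for one entry; the stored branch address acts as the tag that detects the mismatch.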
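[Figure: branch prediction table. Low-order bits of the PC are used as the index; each entry holds the address of a recent branch instruction, its target address and history bit(s). The read-out table entry is compared with the PC, and logic selects either the target address or the incremented PC as the next PC.]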
The Three Hardware Designs for MicroMIPS
[Figure: the single-cycle, multicycle and pipelined MicroMIPS data paths shown together.]
Design        Clock rate  CPI
Single-cycle  125 MHz     1
Multicycle    500 MHz     approx. 4
Pipelined     500 MHz     approx. 1.1
21. ADVANCED PIPELINING (Refer page no. 306-308 in text book B.Parhami)