Processor Architecture
Pipelining
Computer Systems Architecture

http://cs.nott.ac.uk/txa/g51csa/
Thorsten Altenkirch and Liyang Hu

School of Computer Science
University of Nottingham
Lecture 13: Processor Architecture and Pipelining
Pipeline Hazards
Abstract View of MIPS Implementation
[Figure: abstract view of the MIPS implementation, showing the PC, instruction memory, register file, ALU, data memory and two adders (one with a constant input of 4)]
Datapath and Control

Most instructions have common initial operations
Fetch instruction from memory at address PC
Decode and select register(s) for subsequent operation
Use ALU for: address, arithmetic, logic or comparison
Remaining operations differ between instruction classes
Consider datapaths used in following instruction classes

Memory-reference: e.g. lw and sw
Arithmetic/logic: e.g. add, sub and slt
Branching: e.g. beq and j
Multiplexors select between multiple data sources

Another layer of control logic over previous diagram
Multiplexors and Control Logic

[Figure: the same datapath with three multiplexors and a control unit added; signals shown include Branch, ALU operation, RegWrite, MemRead and MemWrite, plus the Zero output of the ALU]
Functional Units and Their Timings

There are at least five functional units, or stages:
IF    Instruction Fetch    get instruction from memory
ID    Instruction Decode   get source register operands
EX    Execute              ALU operation
MEM   Memory Access        data memory read or write
WB    Write-Back           result to destination register
Some stages take longer to finish than others, e.g.

Type       Duration [1]   Stage
Memory     200 ps         IF, MEM
ALU        200 ps         EX
Register   100 ps         ID, WB

[1] 10^-12 s = 1 ps, one picosecond
Critical Paths and Instruction Timings

Each instruction uses different subset of functional units
Class          IF    ID    EX    MEM   WB    Total
R-Type         200   100   200         100   600
Load           200   100   200   200   100   800
Store          200   100   200   200         700
Cond. Branch   200   100   200               500
Jump           200                           200
(all times in ps)
Hence some instructions could run faster than others

But if every instruction must take exactly one cycle,
All instructions must take worst-case timing
Clock speed will be constrained by slowest instruction
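
The totals on this slide can be re-derived from the per-stage durations. The following is a minimal sketch, assuming the stage latencies from the functional-unit timings slide; the names `STAGE_PS`, `STAGES_USED`, `instruction_time_ps` and `single_cycle_clock_ps` are illustrative and not part of the lecture.

```python
# Stage latencies in picoseconds, from the functional-unit timings slide.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class passes through, from the table on this slide.
STAGES_USED = {
    "R-Type": ["IF", "ID", "EX", "WB"],
    "Load": ["IF", "ID", "EX", "MEM", "WB"],
    "Store": ["IF", "ID", "EX", "MEM"],
    "Cond. Branch": ["IF", "ID", "EX"],
    "Jump": ["IF"],
}


def instruction_time_ps(instruction_class):
    """Sum the latencies of the stages this class actually uses."""
    return sum(STAGE_PS[stage] for stage in STAGES_USED[instruction_class])


def single_cycle_clock_ps():
    """A single-cycle design must fit the slowest class into one cycle."""
    return max(instruction_time_ps(name) for name in STAGES_USED)


for name in STAGES_USED:
    print(name, instruction_time_ps(name), "ps")
print("single-cycle clock period:", single_cycle_clock_ps(), "ps")
```

Under these figures the load instruction sets the clock period, so even a jump that needs only 200 ps of work occupies a full 800 ps cycle.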
The Laundry Room Analogy

If it takes 2 hours to wash, dry, fold and store one set of
clothes, how long will it take for 20 sets?
[Figure: sequential laundry timeline, task order A to D against time from 6 PM to 2 AM]
Sequential: total of 2 × 20 = 40 hours?
[Figure: overlapped (pipelined) laundry timeline, task order A to D against time from 6 PM]
Pipelined: total of 0.5 × 20 + 3 × 0.5 = 11.5 hours
A Production Line for Instructions

Execute multiple instructions overlapped
Make each stage simple and fast; one cycle per stage
Start next instruction as soon as current stage is free
Same concept as a factory production line
Instruction latency is just as long as before

Maybe even a little longer due to pipelining overheads
But instruction throughput massively increased

Throughput is more important than latency
Ideal case: every instruction with a timing of t can be
divided into s stages. Executing n instructions takes,
Pipelined, at s/t Hz: (n + (s - 1)) t/s ≈ nt/s
Single-cycle, at 1/t Hz: nt
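
The ideal-case timings can be checked numerically. This is a small sketch under the slide's own assumptions (perfectly balanced stages, no hazards); the function names are made up for illustration.

```python
def single_cycle_time(n, t):
    """n instructions, each taking one cycle of length t."""
    return n * t


def pipelined_time(n, t, s):
    """Ideal s-stage pipeline with cycle length t/s.

    The first instruction needs s cycles to drain through the pipeline;
    every later instruction completes one cycle after its predecessor,
    giving n + (s - 1) cycles in total.
    """
    return (n + (s - 1)) * t / s


def speedup(n, t, s):
    """Ratio of single-cycle to pipelined execution time."""
    return single_cycle_time(n, t) / pipelined_time(n, t, s)


# Laundry analogy: 20 loads, 2 hours each, 4 half-hour stages.
print(single_cycle_time(20, 2))   # 40 hours
print(pipelined_time(20, 2, 4))   # 11.5 hours
```

As n grows, the s - 1 fill cycles matter less and the speedup approaches s, which is why the slide approximates the pipelined time by nt/s.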
Pipeline Overheads
[Figure: program execution order against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Single-cycle: each instruction (instruction fetch, Reg, ALU, data access, Reg) takes 800 ps and the next starts 800 ps later. Pipelined: every stage is given 200 ps and a new instruction starts every 200 ps; time axis 200 to 1400 ps]
Even though some stages take less time than others,
speed is still limited by the slowest component
Here, slowest stage rather than slowest instruction
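
To make the overhead concrete, the sketch below compares the two designs for the three-lw example, assuming the stage durations given earlier in the lecture (200 ps for IF, EX and MEM; 100 ps for ID and WB). The function names are illustrative only.

```python
# Stage latencies in picoseconds (assumed from the timings slide).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_total_ps(n):
    """Every instruction takes the worst case: all five stages in one cycle."""
    return n * sum(STAGE_PS.values())


def pipelined_total_ps(n):
    """Each stage gets one cycle, and the cycle must fit the slowest stage."""
    cycle = max(STAGE_PS.values())
    return (n + len(STAGE_PS) - 1) * cycle


print(single_cycle_total_ps(3))  # 2400 ps for three loads
print(pipelined_total_ps(3))     # 1400 ps for three loads
```

Because ID and WB are padded from 100 ps to 200 ps, a single lw now takes 1000 ps instead of 800 ps, and the steady-state gain is 800 / 200 = 4 rather than the ideal 5.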
Designing ISAs for Pipelining
Pipelining favours uniform timing and few special cases

MIPS architecture was designed with pipelining in mind
Fixed 32-bit instructions simplify instruction fetch
Few instruction formats, sharing common operand fields
Only lw/sw access memory; ALU calculates address
Aligned memory references for single-cycle access
Slow instructions like mult taken out of pipeline
Write to dedicated registers HI and LO (no WB stage)
Avoids slowing down the EX stage
Obstacles to Pipelining
Previously assumed no interaction between instructions

Can always issue one instruction every clock cycle
Reality: various hazards prevent smooth pipeline flow

Structural Hazards: hardware cannot support instruction
Data Hazards: ALU needs value not yet in register file
Control Hazards: IF from PC+4 after branch instruction?
Structural Hazards
Structural Hazard: hardware cannot support instruction
Suppose we want to add a new instruction:
xor dst, src0, n(src1)
Fetch second operand during MEM, two cycles after EX!
Requires an additional MEM (read) stage before EX
Requires ALU to calculate n+src1 as well as XOR
But each instruction only has one cycle in EX stage!
Can't simultaneously IF and MEM on same memory bus

Switch to a Harvard architecture
Dual-ported memory allows two operations each cycle
Cache memory often separate for instruction and data
Design ISA to avoid structural hazards!
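
The IF/MEM clash on a shared memory can be located mechanically. The following is a hypothetical sketch, assuming a five-stage pipeline with no stalls and a single-ported memory shared by instructions and data; `memory_port_conflicts` is an invented helper, not something from the lecture.

```python
def memory_port_conflicts(uses_data_memory):
    """Return the cycles in which IF and MEM both need the single memory port.

    uses_data_memory[i] is True when instruction i is a lw or sw.
    With no stalls, instruction i (0-based) is in IF during cycle i + 1
    and in MEM during cycle i + 4, so its MEM stage overlaps the fetch
    of instruction i + 3, if there is one.
    """
    conflicts = []
    for i, uses_mem in enumerate(uses_data_memory):
        mem_cycle = i + 4
        fetched_index = i + 3  # instruction in IF during mem_cycle
        if uses_mem and fetched_index < len(uses_data_memory):
            conflicts.append(mem_cycle)
    return conflicts


# One load followed by three other instructions: clash in cycle 4.
print(memory_port_conflicts([True, False, False, False]))
```

With separate instruction and data memories (or caches), as the slide suggests, the two accesses go to different ports and the list of conflicts is empty by construction.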
Data Hazards
Data Hazards: ALU needs value not yet in register file
Suppose we execute the following dependent instructions:
add $s0, $t0, $t1
sub $t2, $s0, $t3
Result of add not written to $s0 until WB
But sub requires $s0 = $t0+$t1 the very next cycle!
Time (ps)            200   400   600   800   1000
add $s0, $t0, $t1    IF    ID    EX    MEM   WB
Stall sub in ID for 3 cycles until result written to $s0?
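
The three-cycle figure follows from counting stages. A rough sketch, assuming no forwarding and assuming that a register written during WB can only be read in a later cycle (some designs write in the first half of a cycle and read in the second, which would save one stall); the names are illustrative.

```python
ID_STAGE = 2  # position of instruction decode in the five-stage pipeline
WB_STAGE = 5  # position of write-back


def stalls_without_forwarding(distance):
    """Stall cycles for a consumer issued `distance` instructions after
    the producer of one of its source registers (1 = the very next one).

    Assumption: the producer is fetched in cycle 1 and writes in cycle 5;
    the consumer would decode in cycle 2 + distance but must wait until
    the cycle after the write.
    """
    consumer_id_cycle = ID_STAGE + distance
    first_cycle_value_is_readable = WB_STAGE + 1
    return max(0, first_cycle_value_is_readable - consumer_id_cycle)


for d in range(1, 5):
    print(d, stalls_without_forwarding(d))
```

For the add/sub pair above the distance is 1, giving the three stall cycles mentioned on the slide; an instruction four places later needs none.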
EX Forwarding
Wasted cycles waiting for previous instruction to complete
Compiler could fill bubbles with independent instructions
Or even the hardware: out-of-order execution
Hard to find useful instructions; happens too often!
Better solution: forward result from EX output

Extra hardware to take result directly from ALU output
[Figure: pipeline diagram of add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM and WB; the result of add is forwarded from its EX output to the EX input of sub; time axis 200 to 1000 ps]
No stalls required
MEM Forwarding
What about load instructions?
Consider the following instruction sequence:
lw $s0, 20($t1)
sub $t2, $s0, $t3
Result from lw not available until after MEM stage
[Figure: pipeline diagram of lw $s0, 20($t1) followed by sub $t2, $s0, $t3; the loaded value is forwarded from the output of MEM and sub is delayed by one bubble; time axis 200 to 1400 ps]
Still requires one bubble to be inserted
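
Both forwarding cases reduce to the same stage arithmetic. This is a simplified sketch, assuming results can be forwarded from the end of EX (ALU instructions) or the end of MEM (loads) directly to the EX input of a later instruction; `stalls_with_forwarding` is a hypothetical helper.

```python
EX_STAGE = 3
MEM_STAGE = 4


def stalls_with_forwarding(producer_is_load, distance):
    """Stall cycles for a consumer `distance` instructions after the producer.

    The producer's result exists at the end of EX (ALU op) or the end of
    MEM (load). The consumer needs it at the start of its own EX stage,
    which would normally be cycle 3 + distance.
    """
    ready_stage = MEM_STAGE if producer_is_load else EX_STAGE
    consumer_ex_cycle = EX_STAGE + distance
    return max(0, ready_stage + 1 - consumer_ex_cycle)


print(stalls_with_forwarding(False, 1))  # add then sub: no stall
print(stalls_with_forwarding(True, 1))   # lw then sub: one bubble
print(stalls_with_forwarding(True, 2))   # one instruction in between: no stall
```

The last case is what makes instruction reordering worthwhile: any independent instruction placed between the load and its first use hides the bubble.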
Reordering Instructions
Reorder the following to eliminate pipeline stalls:

Before:                  After:
lw  $s0, 20($t1)         lw  $s0, 20($t1)
sub $t2, $s0, $t3        lw  $s1, 24($t1)
sw  $s0, 20($t1)         sub $t2, $s0, $t3
lw  $s1, 24($t1)         sw  $s0, 20($t1)
Now try the example on H&P p378
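
A reordering can be checked by counting load-use pairs. The sketch below assumes full forwarding, so the only stalls come from an instruction that reads a register loaded by the instruction immediately before it; as a conservative simplification, a store of the freshly loaded register is counted as a stall too. The tuple encoding and the name `load_use_stalls` are invented for this example.

```python
# Each instruction: (opcode, destination register or None, source registers).
BEFORE = [
    ("lw", "$s0", ["$t1"]),          # lw  $s0, 20($t1)
    ("sub", "$t2", ["$s0", "$t3"]),  # sub $t2, $s0, $t3
    ("sw", None, ["$s0", "$t1"]),    # sw  $s0, 20($t1)
    ("lw", "$s1", ["$t1"]),          # lw  $s1, 24($t1)
]

# Move the independent second load into the slot after the first load.
AFTER = [BEFORE[0], BEFORE[3], BEFORE[1], BEFORE[2]]


def load_use_stalls(program):
    """Count adjacent pairs where a lw result is used by the next instruction."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, destination, _ = previous
        if opcode == "lw" and destination in current[2]:
            stalls += 1
    return stalls


print(load_use_stalls(BEFORE))  # 1
print(load_use_stalls(AFTER))   # 0
```

Moving lw $s1 ahead of sw $s0 is safe here because the two instructions touch different addresses (24($t1) and 20($t1)) and different registers.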
Branch/Control Hazards
Control Hazards: IF from PC+4 after branch instruction?

Consider the following instruction sequence:
beq $s0, $s1, next
addi $s2, $s2, 1
next:
lw $s0, ($s2)
Which instruction do we fetch after beq?
Could stall for 2 cycles until the comparison of $s0 and $s1 is decided after EX
Fetch addi anyway; if wrong, flush/restart the pipeline
Solutions not as effective as forwarding for data hazards
Branch Prediction
Static Branch Prediction
If target before PC, predict taken: likely to be a loop
Otherwise could be if-then control: predict not taken
Unconditional branches (or jumps) always taken
Dynamic Branch Prediction

Keep track of recent branch decisions
If previous branches taken, fetch from branch target
Otherwise predict not taken; fetch from PC+4
Alternatively, employ delayed branches

Execute the instruction at PC+4 anyway
Instruction following branch called the branch delay slot
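
The static and dynamic schemes above can be compared on a toy branch history. This is an illustrative sketch only: the dynamic predictor is assumed to be the simplest possible one (a single bit remembering the last outcome of the branch), which is one way to "keep track of recent branch decisions" and not necessarily the scheme the lecture has in mind.

```python
def static_predict(branch_pc, target_pc):
    """Backward branch (target before PC): predict taken, likely a loop.
    Forward branch: predict not taken, likely if-then control."""
    return target_pc < branch_pc


def one_bit_accuracy(outcomes, initial_prediction=False):
    """Fraction of correct predictions when always predicting that the
    branch does whatever it did last time (assumed 1-bit scheme)."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember the most recent decision
    return correct / len(outcomes)


# A loop-closing branch: taken nine times, then falls through once.
loop_branch = [True] * 9 + [False]
print(static_predict(branch_pc=0x40, target_pc=0x20))  # True: backward branch
print(one_bit_accuracy(loop_branch))                   # 0.8
```

On this history the one-bit scheme mispredicts the first and last iterations, while the static backward-taken rule is wrong only on the final fall-through.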
Reading Material
H&P: 5.1, Introduction to The Processor: Datapath and Control
H&P: 6.1, An Overview of Pipelining
For more detailed information,
H&P: 6.5, Data Hazards and Forwarding
H&P: 6.6, Branch Hazards
