Pipeline

Pipelining
We’ve already covered the basics of pipelining in Lecture 1.
We saw that cars could be built on an assembly line, and that

instructions could be executed in much the same way.
[H&P §A.1] In the ideal situation, this could give a speedup equal to
the number of pipeline stages:
Time to execute instruction on unpipelined machine

Number of pipe stages
However, this assumes “perfectly balanced” stages—each stage

requires exactly the same amount of time.
This is rarely the case, and anyway, pipelining does involve some
extra overhead.
Three aspects of RISC architectures make them easy to pipeline:
• All operations on data apply to data in registers.

• Only load and store operations move data between
memory and registers.
• All instructions are the same size, and there are few
instruction formats.
An unpipelined RISC
For our examples, we’ll work with a simplified RISC instruction set. In
an unpipelined implementation, instructions take at most 5 clock
cycles. One cycle is devoted to each of—
• Instruction fetch (IF).

Fetch the current instruction (the one pointed to by PC).
IR ← Mem[PC]
Update the PC by adding
NPC ← PC +
© 2002 Edward F. Gehringer ECE 463/521 Lecture Notes, Fall 2002 1
Based on notes from Drs. Tom Conte & Eric Rotenberg of NCSU
Figures from CAQA used with permission of Morgan Kaufmann Publishers. © 2003 Elsevier Science (USA)
• Instruction decode/register fetch (ID).
Decode the instruction.
Read the source registers from the register file.
A ← Regs[IR6..10]; B = Regs[IR11..15]
Sign-extend the offset (displacement) field of the

instruction.
Imm ← sign-extend(IR16..31)
Check for a possible branch (by reading values from the
source registers).
Cond ← (A rel B)
Compute the branch target address by adding the
to the
ALU_Output ← NPC + Imm
If the branch is taken, store the branch-target address
into the PC.
If (cond) PC ← ALU_Output, else PC ← NPC
What feature of the ISA makes it possible to read the

registers in this stage?
• Execute/compute effective address (EX).

The ALU operates on the operands, performing one of
three types of functions, depending on the opcode
Ø Memory reference: ALU adds and
to form the effective address.
ALU_Output ←
Ø Register-register instruction: ALU performs
operation on the values read from the register file.
ALU_Output ← A op B
Ø Register-immediate instruction: ALU performs
operation on the
and the
Lecture 14 Advanced Microprocessor Design 2

ALU_Output ← A op Imm
In a load-store architecture, execution can be done at

the same time as effective-address computation
because
• Memory access (MEM).

Load_Mem_Data ← Mem[ALU_Output] /* Load */
Mem[ALU_Output] ← B /* Store */
• Write-back (WB). If the instruction is register-register or
, the result is written into the register file
at the address specified by the destination operand.
Reg-Reg ALU Operation: Regs[IR16..20] ← ALU_Output
Reg-Immediate ALU Operation: Regs[IR11..15] ← ALU_Output
Load instruction: Regs[IR11..15] ← Load_Mem_Data
In this implementation, some instructions require 2 cycles, some
require 4, and some require 5.
• 2 cycles:
• 4 cycles:
• 5 cycles:
Assuming the instruction frequencies from the integer benchmarks

mentioned in the last lecture, what’s the CPI of this architecture?
Pipelining our RISC

It’s easy to pipeline this architecture—just make each clock cycle into
a pipe stage.

Clock # 1 2 3 4 5 6 7 8 9
Instruction i IF ID EX MEM WB
Instr. i+1 IF ID EX MEM WB
Here is a diagram of our instruction pipeline.
Instruction Fetch (IF) Instruction Decode (ID) Execute (EX) Memory (MEM) Write-
back
(WB)
ALU
MUX
4
ALU
NPC
PC A
IR
ALU
Instruction cond
Data LMD
(inst.
MUX
cache Regs cache
reg.)
MUX
Sign- Imm
extend
In this pipeline, the major functional units are used in different cycles,
so overlapping the execution of instructions introduces few conflicts.
• Separating the instruction and data caches eliminates a

conflict that would arise in the IF and MEM stages.
Of course, we have to access these caches faster than we
would in an unpipelined processor.
• The register file is used in two stages:

Thus, we need to perform reads and writes
each clock cycle.
To handle reads and writes to the same register, we write in
the first half of the clock cycle and read in the second half.
• Something is incomplete about our diagram of the IF stage.
What?
We’ve omitted one thing from the diagram above: We need a place
to save values between pipeline stages. Otherwise, the different
instructions in the pipeline would interfere with each other.
So we insert latches, or pipeline registers, between stages. Of

course, we’d need latches even in an unpipelined multicycle
implementation.
What is our pipeline speedup, then …?
Of course, we have to allow for latch-delay time.
We also need to allow for clock skew—the maximum delay between

when the clock arrives at any two registers.
Let’s define To’head = Tlatch + Tskew.
Avg. unpipelined execution time

Speedup =
Avg. pipelined execution time
Tunpipe
=T
unpipe
n + To'head . n
= n (ideal case where To'head = 0)
Example: Consider the unpipelined processor in the previous

example. Assume—
• Clock cycle is 1 ns.

• Branch instructions, 20% of the total, take 2 cycles.
• Store instructions, 10% of the total, take 4 cycles.
• All other instructions take 5 cycles.
• Clock skew and latch delay add 0.2 ns. to the cycle time.
What is the speedup from pipelining?

How can pipelining help?
How can pipelining improve performance?
• If we keep CT constant, by improving CPI …
50n
IF/ID and MEM/WB
IF ID EX MEM WB
are unpipelined
50n
Pipeline IF ID EX MEM WB
• If we keep CPI constant, by improving CT …
50ns
Unpipelined IF ID EX MEM WB CPI_pipe=1
MEM MEM
IF1 IF2 ID1 ID2 EX1 EX2 1 2 WB1 WB2 CPI_pipe=1
Pipelined
25ns
• Usually we improve both CT and CPI.
Pipeline hazards
A hazard reduces the performance of the pipeline. Hazards arise
because of the program’s characteristics.
There are three kinds of hazards.
• Structural hazards—Not enough hardware resources exist

for all combinations of instructions.
• Data hazards—Dependences between instructions prevent
their overlapped execution.
• Control hazards—Branches change the PC, which results
in stalls while branch targets are fetched.

Structural hazards
Consider a pipeline with a unified instruction-data cache.
Clock # 1 2 3 4 5 6 7 8 9 10
Load instr. IF ID EX MEM WB
Instr. i+3 stall IF ID EX MEM WB
Instr. i+5 IF ID EX MEM
Instr. i+6 IF ID EX
Instruction i+3 has to stall, because the load instruction “steals” an

instruction-fetch cycle.
In this pipeline, what kind of instructions (what “opcodes”) cause

structural hazards?

Pipeline

Transféré par

Informations du document

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Pipeline

Transféré par

Droits d'auteur :

Formats disponibles

Pipelining

We’ve already covered the basics of pipelining in Lecture 1.

We saw that cars could be built on an assembly line, and that

Time to execute instruction on unpipelined machine

However, this assumes “perfectly balanced” stages—each stage

Three aspects of RISC architectures make them easy to pipeline:

• All operations on data apply to data in registers.

• Instruction fetch (IF).

Sign-extend the offset (displacement) field of the

What feature of the ISA makes it possible to read the

• Execute/compute effective address (EX).

Lecture 14 Advanced Microprocessor Design 2

In a load-store architecture, execution can be done at

• Memory access (MEM).

Assuming the instruction frequencies from the integer benchmarks

Pipelining our RISC

© 2002 Edward F. Gehringer ECE 463/521 Lecture Notes, Fall 2002 3

Here is a diagram of our instruction pipeline.

• Separating the instruction and data caches eliminates a

Lecture 14 Advanced Microprocessor Design 4

So we insert latches, or pipeline registers, between stages. Of

Of course, we have to allow for latch-delay time.

We also need to allow for clock skew—the maximum delay between

Let’s define To’head = Tlatch + Tskew.

Avg. unpipelined execution time

= n (ideal case where To'head = 0)

Example: Consider the unpipelined processor in the previous

• Clock cycle is 1 ns.

What is the speedup from pipelining?

Lecture 14 Advanced Microprocessor Design 6

• If we keep CT constant, by improving CPI …

• If we keep CPI constant, by improving CT …

Unpipelined IF ID EX MEM WB CPI_pipe=1

• Usually we improve both CT and CPI.

There are three kinds of hazards.

• Structural hazards—Not enough hardware resources exist

© 2002 Edward F. Gehringer ECE 463/521 Lecture Notes, Fall 2002 7

Instruction i+3 has to stall, because the load instruction “steals” an

In this pipeline, what kind of instructions (what “opcodes”) cause

Lecture 14 Advanced Microprocessor Design 8

Vous aimerez peut-être aussi