
Chapter 4

Pipeline and Vector Processing


Topics
- Parallel Processing
- Pipelining
- Arithmetic Pipeline
- Instruction Pipeline
- RISC Pipeline
- Vector Processing
- Array Processors

4-1 Parallel Processing
- Simultaneous data-processing tasks for the purpose of increasing the computational speed
- Performs concurrent data processing to achieve faster execution time
  E.g. while data are being processed in the ALU, the next instruction can be fetched from memory
- All of this is done to increase throughput and execution speed
Parallel processing by level of complexity
- Lowest level: shift registers operate in serial fashion, one bit at a time,
  while registers with parallel load operate on all the bits of the word
  simultaneously.
- Highest level: achieved by having a multiplicity of functional units that
  perform identical or different operations simultaneously.
- Multiple functional units:
  - Separate the execution unit into eight functional units operating in parallel
  - The units perform identical or different operations
  - Controlled by the control unit



Parallel Processing Example
[Figure: the processor registers feed eight functional units operating in
parallel — adder-subtractor, integer multiply, floating-point add-subtract,
incrementer, shift unit, logic unit, floating-point divide, floating-point
multiply — with a path to memory. Serial bit-at-a-time operation is the lowest
level of parallelism; multiple functional units are the highest.]
Classification of parallel processing
- From the internal organization of the processors
- From the interconnection structure between processors
- From the flow of information through the system
The three classifications are:
- Data-instruction stream: Flynn
- Serial versus parallel processing: Feng
- Parallelism and pipelining: Handler
- Flynn's classification: based on the number of instruction and data streams
  that are manipulated simultaneously.
1) SISD (Single Instruction stream - Single Data stream)
   For practical purposes: only one processor is used
   Example systems: Amdahl 470V/6, IBM 360/91
   [Figure: the CU sends an instruction stream (IS) to the PU, which exchanges
   a data stream (DS) with MM]
   Parallel processing may still be achieved by:
   - Multiple functional units
   - Pipeline processing
2) SIMD (Single Instruction stream - Multiple Data stream)
   Many processing units under the control of a common control unit
   All processors receive the same instruction from the CU
   and perform it on different data.
   Suited to vector or array operations:
   one vector operation includes many operations on a data stream
   Example systems: CRAY-1, ILLIAC-IV

3) MISD (Multiple Instruction stream - Single Data stream)
   The single data stream is a bottleneck.
   This organization is only theoretical; no practical system exists.



[Figure: SIMD organization — one CU broadcasts the instruction stream (IS) to
PU1, PU2, ..., PUn; each PUi exchanges its own data stream (DSi) with memory
module MMi through a shared memory]
[Figure: MISD organization — CU1, CU2, ..., CUn each issue their own
instruction stream (IS1 ... ISn) to PU1 ... PUn, which all operate on the same
data stream (DS) through shared memory modules MM1 ... MMn]
4) MIMD (Multiple Instruction stream - Multiple Data stream)
   Capable of processing several programs at the same time
   Example: a multiprocessor system

- One limitation of Flynn's classification is that it looks only at the
  performance of the control and data-processing units; it focuses on
  behavioral structure rather than operational structure.

- Main topics in this chapter
  Pipeline processing:
    Arithmetic pipeline
    Instruction pipeline
  Vector processing: adder/multiplier pipeline
  Array processing: array processors
    Attached array processor
    SIMD array processor
[Figure: MIMD organization — CU1 ... CUn each send their own instruction
stream (IS1 ... ISn) to PU1 ... PUn; each PUi exchanges its own data stream
with memory module MMi through shared memory]
4-2 Pipelining
- Pipelining: decomposing a sequential process into suboperations, with each
  subprocess executed in a special dedicated segment; all segments operate
  concurrently. Each segment performs partial processing.
  When do we get the final result? After the operands have passed through all
  the segments.
- Simple pipelining architecture: each segment has an input register followed
  by a combinational circuit.
  Multiply-and-add operation: Ai * Bi + Ci  (for i = 1, 2, ..., 7)
  3 suboperation segments:
  1) Input Ai and Bi
  2) Multiply, and input Ci
  3) Add Ci
  Contents of the registers in the pipeline example:
- The five registers are loaded with new data on every clock pulse
- General considerations
  Any operation that can be decomposed into suboperations of about the same
  complexity can be implemented with a pipeline.
  Pipelining is efficient for applications that repeat the same task for
  different data.
[Figure: three-segment pipeline for Ai * Bi + Ci —
 Segment 1: R1 <- Ai, R2 <- Bi
 Segment 2: R3 <- R1 * R2 (multiplier), R4 <- Ci
 Segment 3: R5 <- R3 + R4 (adder)]
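As an illustrative sketch (Python, not part of the text), the register transfers above can be simulated clock by clock: all right-hand sides are read before any register is updated, mimicking simultaneous transfers.

```python
# Three-segment pipeline computing Ai * Bi + Ci, one register transfer per
# segment per clock pulse:
#   Segment 1: R1 <- Ai, R2 <- Bi
#   Segment 2: R3 <- R1 * R2, R4 <- Ci
#   Segment 3: R5 <- R3 + R4

def pipeline_multiply_add(a, b, c):
    """Feed one operand triple per clock; collect R5 as results appear."""
    r1 = r2 = r3 = r4 = r5 = None
    results = []
    n = len(a)
    for clock in range(n + 2):            # n inputs + 2 cycles to drain the pipe
        # Evaluate every right-hand side from the OLD register values first,
        # so the transfers happen "simultaneously".
        new_r5 = r3 + r4 if r3 is not None else None          # segment 3
        new_r3 = r1 * r2 if r1 is not None else None          # segment 2
        new_r4 = c[clock - 1] if 1 <= clock <= n else None
        new_r1 = a[clock] if clock < n else None              # segment 1
        new_r2 = b[clock] if clock < n else None
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        if r5 is not None:
            results.append(r5)
    return results

print(pipeline_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # -> [11, 18, 27]
```

Note the latency: the first result appears after 3 clocks, then one result per clock, matching the space-time behavior described next.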
Space-time diagram: shows the behavior of the pipeline —
segment utilization as a function of time.
Task: T1, T2, T3, ..., T6 — the total operation performed going through all
the segments.
Task T1 completes after the fourth clock cycle;
the remaining tasks then complete one per clock cycle.

How does a k-segment pipeline with clock cycle time tp complete n tasks?
- The first task T1 needs time equal to k * tp.
- The remaining n - 1 tasks complete one per clock cycle, needing (n - 1) * tp.
- Total time to complete n tasks: (k + n - 1) clock cycles.

Space-time diagram (k = 4 segments, n = 6 tasks):

Clock cycle: 1   2   3   4   5   6   7   8   9
Segment 1:   T1  T2  T3  T4  T5  T6
Segment 2:       T1  T2  T3  T4  T5  T6
Segment 3:           T1  T2  T3  T4  T5  T6
Segment 4:               T1  T2  T3  T4  T5  T6
- Compare with a nonpipelined unit that performs the same operation and takes
  time tn to complete each task: for n tasks it needs n * tn.
- Speedup (S): the ratio of the nonpipelined time to the pipelined time.
  S = nonpipeline / pipeline = n * tn / ((k + n - 1) * tp)
  For the diagram above:
  S = 6 * 6tp / ((4 + (6 - 1)) * tp) = 36tp / 9tp = 4
  where
  n  : number of tasks (6)
  tn : time to complete each task without the pipeline (6 cycle times = 6 tp)
  tp : clock cycle time (1 clock cycle)
  k  : number of segments (4)
- As n -> infinity, S -> tn / tp.
  If the time to process a task is the same with and without the pipeline,
  tn = k * tp, so
  S = tn / tp = k * tp / tp = k
  k (the number of segments) is therefore the theoretical maximum speedup.
- Example: the time to complete each suboperation is 20 ns and there are 100
  tasks. What is the total time to complete the tasks with a 4-segment
  pipeline and without it? What is the speedup ratio?

The pipeline completes the 6 tasks in 9 clock cycles (k + n - 1 = 4 + 6 - 1 = 9).
For large n, k + n - 1 ~ n.
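The timing formulas above can be checked with a small helper (illustrative Python, not part of the text); the printed numbers answer the 20 ns / 100-task example, assuming a nonpipelined task needs all k suboperations in sequence.

```python
# Pipeline timing formulas:
#   pipeline time    = (k + n - 1) * tp
#   nonpipeline time = n * tn, with tn = k * tp when each task needs
#                      all k suboperations of tp each.

def pipeline_time(k, n, tp):
    return (k + n - 1) * tp

def speedup(k, n, tp, tn=None):
    tn = tn if tn is not None else k * tp   # assume tn = k * tp unless given
    return (n * tn) / pipeline_time(k, n, tp)

# Worked example from the text: tp = 20 ns, k = 4 segments, n = 100 tasks.
print(pipeline_time(4, 100, 20))   # -> 2060 (ns with the pipeline)
print(4 * 100 * 20)                # -> 8000 (ns without it)
print(round(speedup(4, 100, 20), 2))
```

Note the speedup (about 3.88) stays below the theoretical maximum of k = 4, as the formula predicts for finite n.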
4-3 Arithmetic Pipeline: found in high-speed computers. Used for
floating-point operations, multiplication of fixed-point numbers, and
scientific problems.
- Floating-point adder pipeline example:
  add/subtract two normalized floating-point binary numbers
  X = A x 2^a
  Y = B x 2^b
  4 segment suboperations:
  1) Compare exponents by subtraction: (a - b)
  2) Align the mantissas (A, B) by shifting
  3) Add or subtract the mantissas
  4) Normalize the result (overflow/underflow)
  Example:
  X = 0.9504 x 10^3
  Y = 0.8200 x 10^2
The shifter, adder-subtractor, incrementer and decrementer are implemented
with combinational circuits. The clock cycle of the pipeline is set by the
largest segment delay (plus the register delay), whereas the equivalent
nonpipelined circuit has a delay equal to the sum of all the segment delays.


R
Compare
exponents
by subtraction
R
Choose exponent Align mantissas
R
Add or subtract
mantissas
R
Normalize
result
R
R
Adjust
exponent
R
R
a b B A
Exponents Mantissas
Difference
Segment 1 :
Segment 4 :
Segment 3 :
Segment 2 :
Time delays of the 4 segments: t1 = 60 ns, t2 = 70 ns, t3 = 100 ns,
t4 = 80 ns, and the interface registers have a delay of tr = 10 ns. The clock
cycle is chosen to be tp = t3 + tr = 110 ns. An equivalent nonpipelined adder
has delay time tn = t1 + t2 + t3 + t4 + tr = 320 ns.
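The four suboperations can be sketched in Python on decimal mantissa/exponent pairs (an assumption for readability; the hardware works on binary values, and only mantissa overflow is normalized here):

```python
# Four-segment floating-point addition on (mantissa, exponent) pairs, base 10.

def fp_add(x, y):
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare exponents by subtraction; make (ma, ea) the larger.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    # Segment 2: align the smaller mantissa by shifting it right (ea - eb) digits.
    mb = mb / 10 ** (ea - eb)
    # Segment 3: add the mantissas.
    m, e = ma + mb, ea
    # Segment 4: normalize the result (mantissa overflow only, in this sketch).
    while abs(m) >= 1.0:
        m, e = m / 10, e + 1
    return m, e

# Text example: X = 0.9504 x 10^3, Y = 0.8200 x 10^2
print(fp_add((0.9504, 3), (0.8200, 2)))   # -> roughly (0.10324, 4)
```

The example reproduces the text's numbers: 0.9504 x 10^3 + 0.0820 x 10^3 = 1.0324 x 10^3, normalized to 0.10324 x 10^4.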
4-4 Instruction Pipeline: reads consecutive instructions from memory while
previous instructions are being executed.
The fetch and execute phases overlap and are performed simultaneously.
Example: a computer with a fetch unit and an execute unit —
a two-segment pipeline is needed:
1) Instruction fetch
2) Decode instruction & execute
Fetch segment: a first-in, first-out (FIFO) buffer.
- The instruction stream is placed in the buffer and waits for decoding and
  processing by the second segment.
- This reduces the memory access time to read instructions.
For computers with a complex instruction set, the instruction cycle is:
1) Fetch the instruction from memory
2) Decode the instruction
3) Calculate the effective address
4) Fetch the operands from memory
5) Execute the instruction
6) Store the result in the proper place

- Difficulties in performing at the maximum rate:
  - Different segments take different times to operate on an incoming task
  - Some segments skip some operations (example: register-mode instructions
    need no effective-address calculation)
  - Two or more segments may access memory at the same time, causing one to wait

4 Example : Four-segment Instruction Pipeline
Four-segment CPU pipeline :
1) FI : Instruction Fetch
2) DA : Decode Instruction & calculate EA
3) FO : Operand Fetch
4) EX : Execution & Store
Timing of the instruction pipeline (instruction 3 is a branch):
[Figure: instructions 1-7 move through FI, DA, FO, EX one step apart, so with
no branch one instruction completes on every clock cycle. When instruction 3
is a branch, the instruction fetched after it must be discarded, and fetching
restarts from the branch target only after the branch executes, wasting the
intervening cycles.]
[Figure: flowchart of four-segment pipeline operation —
 Segment 1: fetch instruction from memory (and update PC);
 Segment 2: decode instruction and calculate effective address;
 Segment 3: fetch operand from memory;
 Segment 4: execute instruction.
 If the instruction is a branch, the pipe is emptied and fetching continues
 from the branch target; if an interrupt is pending, the pipe is emptied and
 interrupt handling begins.]
- Pipeline conflicts: cause the instruction pipeline to deviate from its
  normal operation
  1) Resource conflicts
     memory access by two segments at the same time
  2) Data dependency
     an instruction depends on the result of a previous instruction, but this
     result is not yet available
  3) Branch difficulties
     branch and other instructions (interrupt, return, ...) that change the
     value of the PC
- Data dependency: causes degradation of performance due to data/address
  collision. Example: FO needs data that are generated at the same time in EX.
  Mechanisms to solve conflicts
  Hardware:
  - Hardware interlock: a circuit detects the conflict and delays the
    instruction whose source is not yet available.
  - Operand forwarding: special hardware detects the conflict and avoids it by
    routing the data through a special path — the ALU result of the previous
    instruction is transferred directly to the ALU input of the next
    instruction. Needs a multiplexer and detection circuitry.
  Software: give the responsibility of solving conflicts to the compiler.
  - Delayed load: insert a no-operation instruction after the offending
    previous instruction.
- Handling of branch instructions
  Branch: conditional / unconditional.
  Hardware mechanisms to minimize the branch penalty:
  - Prefetch the target instruction in addition to the instruction following
    the branch (for a conditional branch). If the condition is met, continue
    from the branch target instruction.

  - Branch target buffer (BTB): an associative memory included in the fetch
    segment. Each BTB entry contains the address of a previously executed
    branch, the target instruction for that branch, and a few instructions
    after the branch.
    When the pipeline decodes a branch instruction, it searches the BTB for
    its address:
    if it is in the BTB, prefetch from the stored entry;
    else fetch from the new target and store the target instructions in the BTB.
  - The associative memory acts like a cache memory.
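A toy sketch of the BTB lookup (illustrative Python; names and the dict-based "associative memory" are assumptions, not the hardware design):

```python
# The BTB maps a branch instruction's address to its target address plus the
# first few instructions at the target, so the fetch segment can continue
# without waiting for the branch to resolve.

btb = {}   # branch address -> (target address, prefetched instructions)

def fetch_after_branch(branch_addr, target_addr, memory, prefetch=2):
    if branch_addr in btb:                     # hit: reuse the stored entry
        return btb[branch_addr]
    entry = (target_addr, memory[target_addr:target_addr + prefetch])
    btb[branch_addr] = entry                   # miss: fill the BTB entry
    return entry

memory = ["i0", "i1", "i2", "i3", "i4", "i5"]
print(fetch_after_branch(1, 4, memory))   # miss -> (4, ['i4', 'i5'])
print(fetch_after_branch(1, 4, memory))   # hit  -> same entry, no memory read
```

The second call returns the cached entry without touching memory, which is exactly why the text compares the BTB to a cache.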




  - Loop buffer: a small, very-high-speed register file.
    When a program loop is detected, it is stored in the loop buffer in its
    entirety, including all branches, and executed directly without accessing
    memory.

  - Branch prediction
    Uses additional hardware logic to guess the outcome of a branch.
    Helps to eliminate the time wasted by branches.

- Delayed branch: employed in most RISC processors: the compiler detects the
  branch instructions and rearranges the machine-language code.

4-5 RISC Pipeline:
Implements an instruction pipeline with a small number of suboperations, each
executing in one clock cycle.
This works because a RISC uses:
- A fixed-length instruction format: decoding can occur while the registers
  are selected.
- Register-to-register operations
- Operands kept in registers: no need to calculate an effective address.
Because of this, the instruction pipeline can be implemented with two or three
segments:
1. Fetch the instruction from program memory
2. Execute the instruction in the ALU
3. Store the result of the ALU in the destination register.
To prevent conflicting memory accesses (instruction fetch vs. data
load/store), use two separate buses with separate memories.
Single-cycle instruction execution is one of the RISC characteristics. How?
Start an instruction on each clock cycle and use the pipeline to achieve this.

- Compiler support: another characteristic of RISC.
  To handle data-conflict difficulties and branch penalties, the compiler is
  used to detect them and minimize the delays.
- Example: three-segment instruction pipeline
  3 suboperations in the instruction cycle:
  1) I: Instruction fetch
  2) A: Instruction decode and ALU operation
  3) E: Transfer the output of the ALU to a register, memory, or the PC
Delayed load: 4 instructions
1. LOAD:  R1 <- M[address 1]
2. LOAD:  R2 <- M[address 2]
3. ADD:   R3 <- R1 + R2
4. STORE: M[address 3] <- R3
In the normal flow there will be a data conflict at instruction 3.
Why? What is the action of the compiler?
Delayed load with a no-operation:

(a) Pipeline timing with data conflict
Clock cycle:     1  2  3  4  5  6
1. Load R1:      I  A  E
2. Load R2:         I  A  E
3. Add R1+R2:          I  A  E      <- conflict: R2 is still being loaded
4. Store R3:              I  A  E

(b) Pipeline timing with delayed load
Clock cycle:     1  2  3  4  5  6  7
1. Load R1:      I  A  E
2. Load R2:         I  A  E
3. No-operation:       I  A  E
4. Add R1+R2:             I  A  E
5. Store R3:                 I  A  E
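The compiler's action above can be sketched in Python (an illustrative instruction format `(opcode, dest, sources)` is assumed): insert a no-op whenever an instruction reads a register that the immediately preceding LOAD writes.

```python
# Compiler-style delayed load: the value of a LOAD is not available to the
# very next instruction, so a NOP is inserted when that instruction would
# read the loaded register one cycle too early.

def insert_delayed_load_nops(program):
    out = []
    for instr in program:
        op, dest, srcs = instr
        if out:
            prev_op, prev_dest, _ = out[-1]
            if prev_op == "LOAD" and prev_dest in srcs:
                out.append(("NOP", None, ()))   # fill the load delay slot
        out.append(instr)
    return out

program = [
    ("LOAD", "R1", ("M1",)),
    ("LOAD", "R2", ("M2",)),
    ("ADD", "R3", ("R1", "R2")),   # reads R2 one cycle too early
    ("STORE", "M3", ("R3",)),
]
for instr in insert_delayed_load_nops(program):
    print(instr)
```

Run on the text's four-instruction example, this emits the NOP between the second LOAD and the ADD, reproducing timing (b).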
Delayed branch: a mechanism to minimize the delay caused by branches.
A compiler that supports delayed branch analyzes the instructions before and
after the branch and rearranges the program.
It is up to the compiler to find useful instructions to put after the branch
instruction; if it cannot, it inserts no-op instructions.
Example:
1. Load from memory to R1
2. Increment R2
3. Add R3 to R4
4. Subtract R5 from R6
5. Branch to address X

a) Insert no-op instructions after the branch:
   load, increment, add, subtract, branch to X, no-op, no-op

b) Rearrange the instructions (fill the delay slots with useful work):
   load, increment, branch to X, add, subtract
   (the add and subtract execute in the delay slots before the branch takes
   effect)

4-6 Vector Processing: why do we need it?
Some problems are beyond the capabilities of conventional computers.
In science and engineering, problems are formulated with vectors and matrices,
which suit vector processing.
- Application areas
  Long-range weather forecasting,
  Petroleum exploration,
  Seismic data analysis,
  Medical diagnosis,
  Aerodynamics and space flight simulations,
  Artificial intelligence and expert systems,
  Mapping the human genome,
  Image processing
Vector operations:
Many scientific problems require arithmetic operations on large arrays of
numbers. A vector is an ordered set of one-dimensional data items.
Conventional scalar processing vs. vector processing:

Fortran language
      DO 20 I = 1, 100
   20 C(I) = A(I) + B(I)

Machine language (scalar loop)
      Initialize I = 0
   20 Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 go to 20
      Continue
A vector processor replaces the whole loop with a single vector instruction in
a memory-address format: the operation, the base addresses of the two source
vectors and of the destination, and the vector length, e.g.

ADD A B C 100

The operation is carried out with a pipelined floating-point adder.
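What the single instruction `ADD A B C 100` asks the hardware to do can be sketched in Python (memory modeled as a flat list is an assumption for illustration):

```python
# One vector instruction: stream `length` element pairs from the two source
# base addresses through the (pipelined) adder and store the sums at the
# destination base address.

def vector_add(memory, base_a, base_b, base_c, length):
    for i in range(length):
        memory[base_c + i] = memory[base_a + i] + memory[base_b + i]

memory = [0] * 12
memory[0:4] = [1, 2, 3, 4]      # vector A at base address 0
memory[4:8] = [10, 20, 30, 40]  # vector B at base address 4
vector_add(memory, 0, 4, 8, 4)  # ADD A B C 4
print(memory[8:12])             # -> [11, 22, 33, 44]
```

The point of the instruction format is that the loop bookkeeping (read, add, store, increment, test) disappears from the instruction stream: one fetch and decode covers all 100 element operations.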
Matrix multiplication
A matrix is represented with row and column vectors (n x m or n x n).
The multiplication of two n x n matrices consists of n^2 inner products, or
n^3 multiply-add operations.
Example: multiplication of two 3 x 3 matrices A and B —
n^2 = 9 inner products.

The product matrix C is a 3 x 3 matrix whose elements are related to A and B
by inner products, each of 3 multiply-adds:
C11 = a11*b11 + a12*b21 + a13*b31
Cumulative multiply-add operations: n^3 = 27 multiply-adds in total.
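The operation counts can be made concrete with a short sketch (illustrative Python, not from the text):

```python
# Multiplying two n x n matrices: n^2 inner products, each of n multiply-add
# steps, so n^3 multiply-adds in total (27 for n = 3).

def matmul(a, b):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    multiply_adds = 0
    for i in range(n):
        for j in range(n):          # n^2 inner products
            for k in range(n):      # n multiply-adds per inner product
                c[i][j] += a[i][k] * b[k][j]
                multiply_adds += 1
    return c, multiply_adds

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
c, ops = matmul(a, identity)
print(c)     # -> same matrix as a
print(ops)   # -> 27 multiply-adds for n = 3
```

Each of the 9 inner products here is exactly the kind of computation the pipelined multiplier/adder arrangement below is built to stream.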





In vector notation: C(1:100) = A(1:100) + B(1:100)

Vector instruction format:
| Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length |

[Figure: the 3 x 3 matrix equation C = A * B written out with elements
c11..c33, a11..a33, b11..b33]

Each element is accumulated starting from C11 = 0:
C11 <- C11 + a11*b11 + a12*b21 + a13*b31
- Pipeline for calculating an inner product:
An inner product consists of a sum of k product terms:
C = A1*B1 + A2*B2 + A3*B3 + A4*B4 + ... + Ak*Bk
Use a floating-point multiplier pipeline of 4 segments feeding a
floating-point adder pipeline of 4 segments, with the adder output fed back
into one of its own inputs.

[Figure: source A and source B feed the multiplier pipeline; its output feeds
the adder pipeline, whose output is fed back to its own input.]
- After the 1st clock input: A1, B1 enter the multiplier pipeline.
- After the 4th clock input: A1*B1 leaves the multiplier and enters the adder;
  the pairs up to (A4, B4) fill the multiplier segments.
- After the 8th clock input: the adder segments hold the first four products
  A1*B1 ... A4*B4, and the multiplier holds A5*B5 ... A8*B8.
- After the 9th, 10th, 11th, ... clock inputs: each new product is added to
  the partial sum leaving the adder, forming A1*B1 + A5*B5, A2*B2 + A6*B6,
  and so on.

Four-section summation (for k = 16):
C = A1*B1 + A5*B5 + A9*B9   + A13*B13
  + A2*B2 + A6*B6 + A10*B10 + A14*B14
  + A3*B3 + A7*B7 + A11*B11 + A15*B15
  + A4*B4 + A8*B8 + A12*B12 + A16*B16
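The four-section summation above can be sketched functionally (illustrative Python; the clock-by-clock pipeline stages are abstracted into four running sums):

```python
# Because the adder pipeline is 4 segments deep and its output feeds back to
# its input, every 4th product lands in the same running sum. That yields four
# partial sums, added together at the end — the four-section summation.

def pipelined_inner_product(a, b, adder_depth=4):
    partial = [0.0] * adder_depth
    for i in range(len(a)):
        partial[i % adder_depth] += a[i] * b[i]   # product i joins section i mod 4
    return sum(partial)   # final four-section summation

a = list(range(1, 17))          # A1 .. A16
b = [2.0] * 16                  # B1 .. B16
print(pipelined_inner_product(a, b))   # -> 272.0, same as the direct dot product
```

The reordering changes nothing for exact arithmetic (addition is associative and commutative); in real floating-point hardware the grouping can change rounding slightly, which is one reason vector results may differ from a strict serial sum.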
- Memory interleaving:
  Simultaneous access to memory from two or more sources over one memory bus
  system causes conflicts.
  To avoid this, partition the memory into modules connected to a common
  memory address bus and data bus.
  Each module is a memory array with its own address register and data
  register.
  Assign different address sets to different modules.
  Interleaving can be used for pipeline and vector processing.
[Figure: four-way interleaved memory — four modules, each with an address
register (AR), a memory array, and a data register (DR), all connected to a
common address bus and data bus]
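A minimal sketch of the address assignment (choosing the module by address modulo 4 is an assumption made concrete here; the text only says different address sets go to different modules):

```python
# Four-way interleaving: consecutive addresses fall in different modules, so
# a pipeline or vector unit can keep several memory accesses in flight.

NUM_MODULES = 4

def module_for(address):
    return address % NUM_MODULES        # low-order address bits pick the module

def offset_in_module(address):
    return address // NUM_MODULES       # remaining bits pick the word inside it

print([module_for(a) for a in range(8)])   # -> [0, 1, 2, 3, 0, 1, 2, 3]
```

A sequential sweep through memory (the common case for vector operands) touches modules 0, 1, 2, 3, 0, 1, ... in turn, so each module gets a full module cycle to recover before its next access.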
- Supercomputer
  Supercomputer = vector instructions + pipelined floating-point arithmetic
  Has multiple functional units, and each unit has its own pipeline
  configuration.
  Performance evaluation indexes:
  MIPS: Millions of Instructions Per Second
  FLOPS: FLoating-point Operations Per Second
  (megaflops: 10^6 FLOPS, gigaflops: 10^9 FLOPS)

  Cray supercomputers: Cray Research, vector processing
  Cray-1: 12 functional units, 80 megaflops, 4 million 64-bit words of memory
  Cray-2: 12 times more powerful than the Cray-1
  VP supercomputers: Fujitsu, vector and scalar processing
  VP-200: 300 megaflops, 32 million words of memory, 83 vector instructions,
  195 scalar instructions
  VP-2600: 5 gigaflops
4-7 Array Processors
- Perform computations on large arrays of data.
  Vector processing: adder/multiplier pipelines
  Array processing: array processors
- Two types: the attached array processor and the SIMD array processor.
  Attached array processor:
  - An auxiliary processor attached to a general-purpose computer
  - Enhances performance by providing vector processing for scientific
    applications
  - Works by means of parallel processing with multiple functional units
  - Has an arithmetic unit with one or two pipelined floating-point adders
    and multipliers
[Figure: attached array processor — the general-purpose computer connects
through an input-output interface to the attached array processor; main
memory and the processor's local memory are linked by a high-speed
memory-to-memory bus]
SIMD array processor:
- A computer with multiple processing units operating in parallel.
  Components:
  - Master control unit: controls the operation of the PEs and decodes
    instructions
  - Main memory: stores the program
  - Processing elements (PEs), each with a local memory
- Each PE has an ALU and a floating-point arithmetic unit.
- For a vector addition, first the i-th components ai and bi are stored in
  local memory Mi (for i = 1, 2, ..., n); then the vector operation is
  broadcast to all PEs:
  ci = ai + bi
  and every PE computes its own component simultaneously.
  A masking scheme is used to control which PEs take part.
  Example: ILLIAC IV


[Figure: SIMD array processor — the master control unit, connected to main
memory, broadcasts to PE1 ... PEn, each with its own local memory M1 ... Mn]
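The broadcast-and-mask behavior can be sketched in Python (the per-PE mask bit and the dict-based local memories are illustrative assumptions):

```python
# SIMD vector add: the master control unit broadcasts one operation
# (ci = ai + bi); every enabled PE applies it to the operands in its own
# local memory Mi, and a mask bit per PE switches individual PEs off.

def simd_vector_add(local_mems, mask):
    """local_mems[i] = {'a': ai, 'b': bi}; mask[i] enables PE i."""
    for pe, mem in enumerate(local_mems):
        if mask[pe]:                       # masked-off PEs ignore the broadcast
            mem["c"] = mem["a"] + mem["b"]
    return [mem.get("c") for mem in local_mems]

mems = [{"a": i, "b": 10 * i} for i in range(4)]
print(simd_vector_add(mems, [True, True, False, True]))
# -> [0, 11, None, 33]
```

In real hardware all enabled PEs execute the broadcast instruction in the same clock period; the Python loop is sequential, but the result is the same because no PE depends on another's output.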
Class work (5%, time allowed: 15 min)
1. A nonpipelined system takes 40 ns to process a task. The same task can be
   processed in a seven-segment pipeline with a clock cycle of 10 ns.
   Determine the speedup ratio of the pipeline for 100 tasks. What is the
   maximum speedup that can be achieved?
2. The time delays of a four-segment pipeline are t1 = 50 ns, t2 = 30 ns,
   t3 = 85 ns, and t4 = 45 ns. The interface registers have a delay of
   tr = 5 ns.
   a. How long would it take to add 100 pairs of numbers in the pipeline?
   b. How can we reduce the total time to about one half of the time
      calculated in part (a)?
3. Draw a space-time diagram for a six-segment pipeline showing the time it
   takes to process nine tasks.
