Vector Processing and Pipelining

Chapter 9
Pipeline and Vector Processing
Dr. Bernard Chen Ph.D.

University of Central Arkansas
Spring 2009
Parallel processing
• A parallel processing system is able to perform concurrent data processing to achieve faster execution time
• The system may have two or more ALUs and be able to execute two or more instructions at the same time
• The goal is to increase the throughput – the amount of processing that can be accomplished during a given interval of time
Parallel processing classification
Single instruction stream, single data stream – SISD
Single instruction stream, multiple data stream – SIMD
Multiple instruction stream, single data stream – MISD
Multiple instruction stream, multiple data stream – MIMD
Single instruction stream, single data stream – SISD
• A single computer containing a control unit, a processor unit, and a memory unit
• Instructions are executed sequentially. Parallel processing may be achieved by means of multiple functional units or by pipeline processing
Single instruction stream, multiple data stream – SIMD
• Represents an organization that includes many processing units under the supervision of a common control unit.
• All processors receive the same instruction from the control unit, but operate on different data.
Multiple instruction stream, single data stream – MISD
• Of theoretical interest only
• Processors receive different instructions, but operate on the same data.
Multiple instruction stream, multiple data stream – MIMD
• A computer system capable of processing several programs at the same time.
• Most multiprocessor and multicomputer systems can be classified in this category
Pipelining: Laundry Example
• A small laundry has one washer, one dryer, and one operator. It takes 90 minutes to finish one load (the loads are labeled A, B, C, D):
• The washer takes 30 minutes
• The dryer takes 40 minutes
• “Operator folding” takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D (task order) run back to back; each load takes 30 + 40 + 20 = 90 min.]
• This operator schedules loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new task until he is done with the previous one.
• The process is sequential. Sequential laundry takes 6 hours for 4 loads.
Efficiently scheduled laundry: Pipelined Laundry
The operator starts each load as soon as possible
[Figure: timeline from 6 PM to 9:30 PM. Loads A, B, C, D (task order) overlap; the intervals along the time axis are 30, 40, 40, 40, 40, 20 min.]
• Another operator asks for loads to be delivered to the laundry every 40 minutes.
• Pipelined laundry takes 3.5 hours for 4 loads
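The 6-hour and 3.5-hour totals can be checked with a small scheduling sketch. The function name `finish_times` and its parameters are illustrative, not from the slides, and the sketch assumes a load may wait between stages until the next machine is free.

```python
def finish_times(stage_minutes, n_loads, pipelined):
    """Return the finish time (minutes after the start) of each load.

    Hypothetical helper: each stage (washer, dryer, folding) handles one
    load at a time, and loads pass through the stages in order.
    """
    stage_free = [0] * len(stage_minutes)  # when each stage is next available
    finishes = []
    prev_finish = 0
    for _ in range(n_loads):
        # Sequential: the next load starts only after the previous one is done.
        # Pipelined: the next load starts as soon as the washer is free.
        t = 0 if pipelined else prev_finish
        for s, duration in enumerate(stage_minutes):
            start = max(t, stage_free[s])  # wait for the stage if it is busy
            t = start + duration
            stage_free[s] = t
        finishes.append(t)
        prev_finish = t
    return finishes


stages = [30, 40, 20]  # washer, dryer, folding (minutes)
print(finish_times(stages, 4, pipelined=False))  # [90, 180, 270, 360]
print(finish_times(stages, 4, pipelined=True))   # [90, 130, 170, 210]
```

The last load finishes after 360 minutes (6 hours) sequentially and after 210 minutes (3.5 hours) when pipelined. After the first load, a load completes every 40 minutes, the time of the slowest stage.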
Pipelining Facts
• Multiple tasks operate simultaneously
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• The pipeline rate is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline from 6 PM to 9 PM for loads A, B, C, D, with intervals of 30, 40, 40, 40, 40, 20 min. The washer waits for the dryer for 10 minutes.]
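The effect of unbalanced stages can be made concrete with a short worked derivation. The symbols t_i (time of stage i), T_seq, and T_pipe are introduced here for illustration; the derivation assumes n identical tasks and stages that each handle one task at a time.

```latex
\begin{align*}
T_{\text{seq}}  &= n \sum_{i=1}^{k} t_i \\
T_{\text{pipe}} &= \sum_{i=1}^{k} t_i + (n - 1)\,\max_i t_i \\
S &= \frac{T_{\text{seq}}}{T_{\text{pipe}}}
   \;\longrightarrow\; \frac{\sum_{i} t_i}{\max_i t_i} \quad (n \to \infty)
\end{align*}
% Laundry: t = (30, 40, 20) min, n = 4
% T_seq  = 4 * 90 = 360 min
% T_pipe = 90 + 3 * 40 = 210 min
% S = 360 / 210 \approx 1.71
% Limit for many loads: 90 / 40 = 2.25, below the 3 that balanced stages would allow
```

With three balanced 30-minute stages the limit would be 90 / 30 = 3, the number of pipe stages.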
9.2 Pipelining
• Decomposes a sequential process into segments.
• Divides the processor into segment processors, each dedicated to a particular segment.
• Each segment is executed in a dedicated segment processor that operates concurrently with all other segments.
• Information flows through these multiple hardware segments.
9.2 Pipelining
• Instruction execution is divided into k segments or stages
• An instruction exits pipe stage k-1 and proceeds into pipe stage k
• All pipe stages take the same amount of time, called one processor cycle
• The length of the processor cycle is determined by the slowest pipe stage
[Figure: k segments]
9.2 Pipelining
• Suppose we want to perform the combined multiply and add operations with a stream of numbers:
• Ai * Bi + Ci for i = 1, 2, 3, …, 7
9.2 Pipelining
• The suboperations performed in each segment of the pipeline are as follows:
• Segment 1: R1 ← Ai, R2 ← Bi
• Segment 2: R3 ← R1 * R2, R4 ← Ci
• Segment 3: R5 ← R3 + R4
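The register transfers can be traced with a clock-by-clock simulation. This is a minimal sketch: the function name `multiply_add_pipeline` is illustrative, and the sketch assumes all five registers load on the same clock pulse, with `None` standing for an empty register.

```python
def multiply_add_pipeline(a, b, c):
    """Simulate a three-segment pipeline computing a[i] * b[i] + c[i]."""
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    results = []
    # Three segments need n + 2 clock pulses for n inputs (k + n - 1 with k = 3).
    for clock in range(n + 2):
        # Compute every new value from the OLD register contents first,
        # because all registers are loaded by the same clock pulse.
        new_r5 = r3 + r4 if r3 is not None else None        # segment 3: add
        new_r3 = r1 * r2 if r1 is not None else None        # segment 2: multiply
        new_r4 = c[clock - 1] if 1 <= clock <= n else None  # segment 2: load Ci
        new_r1 = a[clock] if clock < n else None            # segment 1: load Ai
        new_r2 = b[clock] if clock < n else None            # segment 1: load Bi
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        if r5 is not None:
            results.append(r5)
    return results


print(multiply_add_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # [11, 18, 27]
```

The first result appears in R5 on the third clock pulse; after that, one result appears per pulse. For the seven-element stream in the example, the loop runs for 9 pulses.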
Pipeline Performance
• n: number of instructions (equivalent to the number of loads in the laundry example)
• k: number of stages in the pipeline (in the laundry example, the stages are washing, drying, and folding)
• τ: clock cycle (the clock cycle is the slowest task time)
• Tk: total time

$T_k = (k + (n - 1))\,\tau$

$\text{Speedup} = \dfrac{T_1}{T_k} = \dfrac{nk}{k + (n - 1)}$
SPEEDUP
• Consider a k-segment pipeline operating on n data sets. (In the above example, k = 3 and n = 4.)
• It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline.
• After that, the remaining (n - 1) results come out one per clock cycle.
• It therefore takes (k + n - 1) clock cycles to complete the task.
SPEEDUP
• If we execute the same task sequentially in a single processing unit, it takes (k * n) clock cycles.
• The speedup gained by using the pipeline is:
• S = k * n / (k + n - 1)
SPEEDUP
• S = k * n / (k + n - 1)
• For n >> k (such as 1 million data sets on a 3-stage pipeline), S ≈ k
• So for large data sets, the speedup approaches the number of functional units. This is because the functional units work in parallel except during the filling and draining cycles.
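The formula is easy to evaluate for both small and large n. A minimal sketch follows; the function name `pipeline_speedup` is illustrative, not from the slides.

```python
def pipeline_speedup(k, n):
    """Speedup S = k * n / (k + n - 1) of a k-segment pipeline over n data sets."""
    return k * n / (k + n - 1)


print(pipeline_speedup(3, 4))          # 12 / 6 = 2.0
print(pipeline_speedup(3, 1_000_000))  # just under 3, the number of segments
```

With only 4 data sets, the fill and drain cycles hold the speedup to 2. With 1 million data sets, those cycles are negligible and the speedup is almost exactly k = 3.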
Example: 6 tasks, divided into 4 segments

Clock cycle:  1   2   3   4   5   6   7   8   9
Segment 1:    T1  T2  T3  T4  T5  T6
Segment 2:        T1  T2  T3  T4  T5  T6
Segment 3:            T1  T2  T3  T4  T5  T6
Segment 4:                T1  T2  T3  T4  T5  T6
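A space-time diagram like this one can be generated for any k and n. This is a minimal sketch with an illustrative function name, `space_time_diagram`; it assumes every segment takes exactly one clock cycle.

```python
def space_time_diagram(k, n):
    """Return rows where rows[s][c] names the task in segment s+1 at cycle c+1."""
    cycles = k + n - 1  # total clock cycles needed for n tasks
    rows = []
    for s in range(k):
        row = []
        for c in range(cycles):
            task = c - s  # 0-based index of the task in segment s during cycle c
            row.append(f"T{task + 1}" if 0 <= task < n else "")
        rows.append(row)
    return rows


for s, row in enumerate(space_time_diagram(4, 6), start=1):
    print(f"Segment {s}: " + " ".join(f"{cell:>3}" for cell in row))
```

For k = 4 and n = 6 each row has 9 columns, matching k + n - 1 = 9 clock cycles. Each task moves one segment to the right per cycle.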
