
NetPro Certification Courseware for NetPro Certified Systems Engineer – N.C.S.

PIPELINING AND SUPERSCALAR EXECUTION

Intel packed over 3 million transistors into the Pentium chip, over 5 million into the
Pentium Pro chip, over 7.5 million into the Pentium II, and 9.5 million into the
Pentium III. The Pentium II chips have more than twice the number of transistors of
the Pentium processor chips, yet they are much smaller, a result of an improved
manufacturing process that packs more transistors into a smaller area.

Early transistor radios needed only tens of transistors to receive radio broadcasts, a
completely insignificant number against the millions of transistors in these processors.
It's clear from the numbers that all those transistors are doing more than just fetching
instructions and adding numbers. The Pentium Pro, Pentium II, Celeron, and Pentium III
processors all share a similar architecture, so we generally call them P6 processors.

 Superscalar instruction execution: Superscalar execution is a way to get
more work done at once by having more than one instruction in progress at
any time. The Pentium and P6 processors implement superscalar execution
using a "pipeline" in the chip. The Pentium processor has five stages in the
pipeline, while the P6 processors have 12. The work the complete pipeline
does is the same in the Pentium and the Pentium Pro, because they execute the
same instructions. This means that having more stages in the Pentium Pro
allows each stage to do less work. You can think of it as each stage doing
one-twelfth of the total instead of one-fifth.
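The throughput benefit of overlapping stages can be sketched with a little cycle-count arithmetic (an idealized model with no stalls, not Intel's actual timing): a non-pipelined design finishes one instruction before starting the next, while a pipeline accepts a new instruction every cycle once it is full.

```python
# Idealized sketch (assumes no stalls): cycle counts for executing
# n_instructions on a design with n_stages pipeline stages.

def cycles_unpipelined(n_instructions, n_stages):
    # Each instruction passes through every stage before the next starts.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    # Fill the pipeline once (n_stages cycles), then retire one
    # instruction per cycle for the remaining instructions.
    return n_stages + n_instructions - 1

if __name__ == "__main__":
    for stages in (5, 12):   # Pentium-style vs. P6-style pipeline depth
        plain = cycles_unpipelined(1000, stages)
        piped = cycles_pipelined(1000, stages)
        print(f"{stages}-stage: {plain} vs {piped} cycles "
              f"({plain / piped:.1f}x speedup)")
```

With long instruction streams the speedup approaches the number of stages, which is why deeper pipelines (at the cost of more handoffs) can pay off.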

 Separate code and data caches: The Pentium and P6 processors all maintain
physically separate L1 caches for instructions and data. Splitting the caches
creates parallelism, allowing the processor to retrieve instructions and data
from the cache at the same time, and prevents the flow of large volumes of
data through the cache from flushing out the instructions being executed.

[Figure: The five-stage Pentium pipeline. Three simultaneous memory paths give
three times the memory access due to caching. The stages, with a handoff between
each pair, are:

1. Get the instruction (instruction fetch from the instruction cache, or I-cache)
2. Figure out what to do
3. Get the operands (operand fetch from the data cache, or D-cache)
4. Do the work
5. Put the answers where they belong (operand write back to the D-cache)]

Splitting the L1 cache into Instruction and Data caches

The split cache allows all three operations (instruction read, operand read,
and operand write) to happen in the same clock cycle, enabling the processor
to get the full benefit of the pipeline parallelism. At least three cache accesses
are possible every clock tick from this five-stage pipeline: one reading an
instruction, one reading data, and one writing data. Splitting the L1 cache into
an instruction cache and a data cache makes these accesses possible
simultaneously, because a cache can do a read and a write at the same time.
The split cache provides the increased memory bandwidth required to support the
pipeline doing multiple things at once without having to wait.
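A toy model can make the bandwidth argument concrete (this is an assumption-laden sketch, not Pentium timing): if every pipeline step needs one instruction read, one operand read, and one operand write, a single unified cache port must serialize them, while the split arrangement overlaps all three.

```python
# Toy sketch (invented model, not real cache timing): clock ticks needed
# when each pipeline step performs one instruction read, one operand
# read, and one operand write.

def ticks_unified_cache(steps):
    # One unified cache port: the three accesses are serialized.
    return steps * 3

def ticks_split_cache(steps):
    # Split I-cache/D-cache, with the D-cache handling one read and one
    # write per tick: all three accesses happen in the same tick.
    return steps

print(ticks_unified_cache(100), ticks_split_cache(100))   # 300 100
```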

 Dynamic branch prediction: As the pipeline diagram above shows, the processor
has to know the address of the instruction that comes third after the one being
executed; that is, while the box "Do the work" is running, the box "Get the
instruction" is fetching three instructions ahead.

Suppose the processor comes across an instruction sequence like this:

1. Load the value of COLOR.
2. Test if COLOR equals GREEN.
3. Store a new value in COLOR.
4. If the old value of COLOR was GREEN, the next instruction is
   number 1; otherwise, the next instruction is number 5. (An instruction
   like this is called a branch.)

At the time the processor is doing the work for instruction number one in the
example, the pipeline is loading instruction number four. The processor next does the
work for instruction number two. At that time the first pipeline stage needs to know
what instruction to load. Because the processor won't know what instruction to execute
after instruction four until it executes instruction four and makes the branch decision,
the pipeline has a problem: the "Get the instruction" box doesn't know what to do.

Some older processors solved this problem by doing nothing. In those chips, the
pipeline empties out after loading a branch until that instruction executes. In our
example, three cycles would pass with no instructions being executed (a pipeline stall)
while the instruction after number four loads, is looked at, and gets its operands. More
complex processor designs do a simple form of branch prediction by assuming that
the next instruction is always the one immediately after the branch. If this assumption
is true, the processor loses no time. If it's wrong, the processor stalls for a number of
cycles while it loads the right instruction.

A more sophisticated approach to branch prediction is to recognize that many
branches are there to make the code loop, so they will be executed over and over. This
approach suggests that the most likely next instruction the second time the branch is
seen is the instruction that followed the branch last time. The Pentium and P6
processors all use this strategy and improve on it by fetching both
the instruction immediately after the branch and the one that followed the branch the
last time through the loop.
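The "do what the branch did last time" strategy can be sketched as a small predictor keyed by branch address (a 1-bit, last-target scheme; the real Pentium branch target buffer is more elaborate, and the addresses here are invented for illustration):

```python
# Sketch of last-time branch prediction (a simplified model, not the
# actual Pentium branch target buffer).

class LastTimePredictor:
    def __init__(self):
        self.history = {}   # branch address -> target taken last time

    def predict(self, branch_addr, fallthrough_addr):
        # First encounter: guess the instruction right after the branch.
        return self.history.get(branch_addr, fallthrough_addr)

    def update(self, branch_addr, actual_target):
        self.history[branch_addr] = actual_target

# A loop branch at (hypothetical) address 0x40 that jumps back to 0x10:
# mispredicted once, then predicted correctly on every later iteration.
p = LastTimePredictor()
hits = 0
for _ in range(10):
    if p.predict(0x40, 0x44) == 0x10:
        hits += 1
    p.update(0x40, 0x10)
print(hits)   # 9 of 10 iterations predicted correctly
```

For loops that iterate many times, the one initial misprediction becomes negligible, which is exactly why loop-closing branches reward this strategy.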

 Floating-point execution pipeline: The Pentium and P6 processors include a
separate pipeline for computations using floating-point numbers. The Pentium
floating-point pipeline is the same as the pipeline given above for the first four
stages but then has four more floating-point-specific stages. The eight-stage
pipeline is shown below. Overall, the Pentium floating-point pipeline can
execute two floating-point instructions in one clock cycle.

[Figure: The eight-stage Pentium floating-point pipeline:

1. Get the instruction
2. Figure out what to do
3. Get the operands
4. Do the work
5. Do the 1st part of the floating-point work
6. Do the 2nd part of the floating-point work
7. Clean up and round the answer
8. Put the answer where it belongs]

 Enhanced 64-bit data bus: The Pentium and P6 processors all use a 64-bit
data bus, twice the width of the bus on the old 486. These processors transfer
twice as much data to and from memory in one cycle as the 486 did.
They also cycle the cache faster than the 486, leading to a net access rate five
times faster than the 486. The combination of twice the bus width, less than
one half the cycle time, and over twice the number of memory cycles per
second nets out to the factor-of-five increase in maximum host bus data rate.

Comparison of 486 and Pentium Processing Rates

Processor   Maximum Host Bus    Bus Width   Memory Cycles per   Bus Cycle
            Data Rate (MBps)    (bytes)     Second (millions)   Time (ns)
Pentium     528                 8           66                  16
486         105                 4           26.25               39
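The table's figures can be checked with a line of arithmetic: peak data rate in MBps is bus width in bytes times millions of memory cycles per second. A quick sketch:

```python
# Sketch verifying the table's arithmetic: peak host bus data rate
# (MBps) = bus width in bytes * millions of memory cycles per second.

def peak_rate_mbps(width_bytes, mcycles_per_sec):
    return width_bytes * mcycles_per_sec

print(peak_rate_mbps(8, 66))        # Pentium: 528 MBps
print(peak_rate_mbps(4, 26.25))     # 486: 105.0 MBps
print(round(peak_rate_mbps(8, 66) / peak_rate_mbps(4, 26.25), 1))  # 5.0
```

The ratio comes out at almost exactly five, matching the factor-of-five claim in the text.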

Dynamic Execution

The biggest difference between the Pentium and the P6 processors is the increased
sophistication of the analysis of the instruction stream and the use of the results, a
technology Intel calls Dynamic Execution. In the discussion of dynamic branch
prediction in the preceding section, you saw that the uncertainty following a branch
instruction could cause the processor pipelines to stall. Other problems cause
pipeline stalls too. For example, look at this sequence:

1. Load the value of COLOR.
2. Load the value of SATURATION.
3. Multiply COLOR times SATURATION.
4. Store the multiplication result in COLOR.

The first two instructions can be executed in parallel by pipelines, but the third
instruction has to wait for the first two, and the fourth has to wait for the third. Adding
more pipelines to the basic Pentium architecture won’t make this sequence faster
because of the dependencies among the instructions that cause a conventional pipeline
to stall.

The P6 processors are vastly more complex than the Pentium. Intel's engineers
devoted a great deal of the additional complexity to solving the pipeline stall problem,
because eliminating the stalls increases the sustained instruction issue rate. The P6
processors break up the linear pipeline structure of the Pentium. The P6 processor
execution structure executes instructions a pipeline stage at a time but returns the results
to the execution pool between stages. Each stage takes the next instruction it
can work on, even if it's out of the linear order. This means that the hand-off between
the pipeline stages doesn't have to be in rigid, linear lock step. Instead, each pipeline
stage can look into what Intel calls the "instruction pool" for the next instruction it
can work on. The control circuits for the instruction pool ensure that the necessary
dependencies between instructions are observed, but they otherwise allow for out-of-
linear-order instruction execution.

[Figure: The P6 execution structure. The stages (Get the instruction, Figure out
what to do, Get the operands, Do the work, Put the answers where they belong)
all connect to a central Instruction Pool instead of handing off in a fixed
linear order.]

Extending the preceding example lets us illustrate how the P6 processors work:

1. Load the value of COLOR.
2. Load the value of SATURATION.
3. Multiply COLOR times SATURATION.
4. Store the multiplication result in COLOR.
5. Load the value of CHANNEL.
6. Add one to CHANNEL.
7. Store the updated CHANNEL in SURFCHANNEL.

The advantage the P6 processors have over the Pentium and earlier processors is that
even though the instructions at steps three and four may stall, the fetch/decode unit
will continue to fill the instruction pool with steps five through seven. At the point
where the dispatch/execute unit stalls at step three, the instruction at step five will be
available. The dispatch/execute unit picks up that instruction and continues working.
No cycles are wasted on pipeline stalls, so your programs run faster.

Intel also improved the memory access performance of the P6 processors. In both the
Pentium and the P6, the split L1 cache ties to a bus interface unit, an engine
that works to get information from the L2 cache or the host bus. New in the P6
processors is a direct interface to the L2 cache from the bus interface unit. This isn't
possible in the Pentium because the L2 cache is external to the Pentium package, and
there aren't enough pins in the package to add separate connections to the L2 cache.
The Pentium Pro includes the L2 cache as a second silicon chip within the package, so
it doesn't need additional pins for connection to the cache. The Pentium II includes
the L2 cache in the cartridge, within which Intel can maintain a controlled
environment for propagation of the necessary high-speed signals. The initial Celeron
processors omitted the Pentium II L2 cache; later Celerons use a version of the
Pentium II cache design.

Parallel access paths to the system bus and the L2 cache give the later processors a
huge increase in performance. While the system bus makes a memory or I/O access
taking 60 ns or more, the processor can continue to pull instructions and data from the
L2 cache with an access time of 8 ns to 16 ns. In the time the system bus makes one
cycle, the processor can get nearly 8 cycles from the L2 cache. The increased access
rate helps the fetch/decode unit, the dispatch/execute unit, and the retire unit avoid
stalling, boosting performance.
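The "nearly 8 cycles" claim follows directly from the access times given above; a quick sketch of the arithmetic:

```python
# Arithmetic behind the claim above: L2 cache cycles that fit into one
# system-bus access, using the access times stated in the text.

bus_access_ns = 60
for l2_ns in (8, 16):
    print(f"L2 at {l2_ns} ns: {bus_access_ns / l2_ns:.2f} "
          "L2 cycles per bus access")
```

With the fastest L2 timing, 60 / 8 gives 7.5 cache cycles per bus access, which the text rounds up to "nearly 8."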
