Intel packed over 3 million transistors into the Pentium chip, over 5 million into the
Pentium Pro chip, over 7.5 million into the Pentium II, and 9.5 million into the
Pentium III. The Pentium II chips have twice the number of transistors of the Pentium
processor chips, but they are much smaller, the result of an improved manufacturing
process that packs more transistors into a smaller area.
Early transistor radios needed only tens of transistors to receive radio broadcasts, an
insignificant number compared with the millions of transistors in these processors.
It's clear from the numbers that all those transistors are doing more than just fetching
instructions and adding numbers. The Pentium Pro, Pentium II, Celeron, and Pentium III
processors all share a similar architecture, so we generally call them P6 processors.
Separate code and data caches: The Pentium and P6 processors all maintain
physically separate L1 caches for instructions and data. Splitting the caches
creates parallelism, allowing the processor to retrieve instructions and data
from the cache at the same time, and prevents the flow of large volumes of
data through the cache from flushing out the instructions being executed.
[Figure: Three simultaneous memory paths give three times the memory access rate.]
The split cache allows all three operations (instruction read, operand read,
and operand write) to happen in the same clock cycle, enabling the processor
to get the full benefit of pipeline parallelism. At least three cache accesses
are possible every clock tick in this five-stage pipeline: one reading an
instruction, one reading data, and one writing data. Splitting the L1 cache into
an instruction cache and a data cache makes these accesses possible
simultaneously, because each cache can do a read and a write at the same time.
The split cache provides the increased memory access rate required to support a
pipeline doing multiple things at once without having to wait.
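The benefit of the split cache can be sketched with a toy throughput model. This is not a model of the actual Pentium cache circuits; it simply counts the clock cycles needed to satisfy three cache accesses per pipeline tick when the cache can serve one access per cycle versus three.

```python
# Toy model (an illustrative assumption, not Intel's implementation):
# compare how many clock cycles a stream of pipeline memory requests
# needs with a unified single-port L1 cache versus a split
# instruction/data L1 that can serve all three accesses at once.

def cycles_needed(accesses_per_tick, ports_per_cycle):
    """Each pipeline tick generates a fixed bundle of cache accesses;
    the cache can satisfy only ports_per_cycle of them per clock."""
    ticks = 100  # simulate 100 pipeline ticks
    total_accesses = ticks * accesses_per_tick
    # ceiling division: leftover accesses spill into an extra cycle
    return -(-total_accesses // ports_per_cycle)

# Unified single-ported cache: instruction read, operand read, and
# operand write all compete for one port.
unified = cycles_needed(accesses_per_tick=3, ports_per_cycle=1)

# Split cache: the instruction cache serves the instruction read while
# the data cache serves an operand read and an operand write, so all
# three accesses complete in the same clock cycle.
split = cycles_needed(accesses_per_tick=3, ports_per_cycle=3)

print(unified, split)  # 300 cycles versus 100 cycles
```

The threefold difference mirrors the "three simultaneous memory paths" point above: with one port, the pipeline waits two extra cycles out of every three.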
Dynamic branch prediction: As the diagram shows, the processor has to
know the address of the instruction that comes three after the one being
executed; that is, while the box "do the work" is running, the box "get the
instruction" is fetching three instructions ahead.
At the time the processor is doing the work for instruction number one in the
example, the pipeline is loading instruction number four. The processor next does the
work for instruction number two. At that time the first pipeline stage needs to know what
instruction to load. Because the processor won't know what instruction to execute
after instruction four until it executes instruction four and makes the branch decision,
the pipeline has a problem: the "get the instruction" box doesn't know what to do.
Some older processors solved this problem by doing nothing. In those chips, the
pipeline empties out after loading a branch until that instruction executes. In our
example, three cycles would pass with no instructions being executed (a pipeline stall)
while the instruction after number four loads, is decoded, and gets its operands. More
complex processor designs do a simple form of branch prediction by assuming that
the next instruction is always the one immediately after the branch. If this assumption
is true, the processor loses no time. If it's wrong, the processor stalls for a number of
cycles while it loads the right instruction.
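The cost of this simple "assume the fall-through instruction" scheme can be worked through with a few lines of arithmetic. The instruction and branch counts below are illustrative assumptions; the three-cycle penalty matches the three wasted cycles described in the example above.

```python
# Sketch of the cost of static predict-not-taken branch handling in a
# five-stage pipeline: every branch that actually jumps elsewhere
# forces the pipeline to discard three partially loaded instructions.

MISPREDICT_PENALTY = 3  # cycles to refill the pipeline after a wrong guess

def execution_cycles(n_instructions, branches_guessed_wrong):
    """One instruction completes per cycle, plus a flush penalty for
    every branch whose fall-through guess turns out to be wrong."""
    return n_instructions + branches_guessed_wrong * MISPREDICT_PENALTY

# Assumed workload: 1000 instructions, 100 taken branches guessed wrong
print(execution_cycles(1000, 100))  # 1300 cycles
# A perfect predictor would need only the base 1000 cycles
print(execution_cycles(1000, 0))    # 1000 cycles
```

Even a modest misprediction rate costs 30% extra cycles here, which is why the P6 family invests transistors in dynamic branch prediction instead of a fixed guess.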
NetPro Certification Courseware for NetPro Certified Systems Engineer – N.C.S.E
Enhanced 64-bit Data Bus: The Pentium and P6 processors all use a 64-bit
data bus, twice the width of the bus on the old 486. These processors transfer
twice as much data to and from memory in one cycle as the 486 did. They also
cycle the bus faster than the 486, leading to a net access rate five times
faster. The combination of twice the bus width, less than half the cycle time,
and more than twice the number of memory cycles per second nets out to a
factor-of-five increase in maximum host bus data rate.
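The factor-of-five claim can be checked with the bus arithmetic itself. The 486 figures below are an assumption (a 4-byte bus at 25 million memory cycles per second); the Pentium figures follow from the 64-bit (8-byte) bus and 66 MHz host bus described in this section.

```python
# Worked check of the factor-of-five host bus claim.
# Assumed baseline: a 486 with a 32-bit (4-byte) bus at 25 million
# memory cycles per second; Pentium: 64-bit (8-byte) bus at 66 million.

def host_bus_rate_mbps(bus_width_bytes, mcycles_per_second):
    """Peak host bus data rate in MBps: bytes per cycle times cycles."""
    return bus_width_bytes * mcycles_per_second

i486 = host_bus_rate_mbps(4, 25)     # 100 MBps
pentium = host_bus_rate_mbps(8, 66)  # 528 MBps

print(pentium, pentium / i486)  # 528 MBps, roughly 5.3x the 486
```

Twice the width times more than twice the cycle rate gives the "nets out to a factor of five" result quoted above.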
Processor   Maximum Host Bus    Bus Width   Bus Cycle   Millions of Memory
            Data Rate (MBps)    (bytes)     Time (ns)   Cycles per Second
Pentium     528                 8           15          66
Dynamic Execution
The biggest difference between the Pentium and the P6 processors is the increased
sophistication of the analysis of the instruction stream and the use of the results, a
technology Intel calls Dynamic Execution. In the discussion of dynamic branch
prediction in the preceding section, you saw that the uncertainty following a branch
instruction could cause the processor pipeline to stall. Other problems cause
pipeline stalls too. For example, look at this sequence:
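The sequence itself did not survive in this copy. A representative four-instruction sequence with the dependency pattern the following paragraph describes might look like this; the memory addresses and values are illustrative assumptions:

```python
# Illustrative four-instruction sequence (assumed addresses/values),
# modeling memory as a simple Python dict.
mem = {0x1000: 7, 0x1004: 35, 0x1008: 0}

a = mem[0x1000]   # instruction 1: load, no dependencies
b = mem[0x1004]   # instruction 2: load, independent of instruction 1
c = a + b         # instruction 3: must wait for instructions 1 and 2
mem[0x1008] = c   # instruction 4: must wait for instruction 3

print(mem[0x1008])  # 42
```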
The first two instructions can be executed in parallel by the pipelines, but the third
instruction has to wait for the first two, and the fourth has to wait for the third. Adding
more pipelines to the basic Pentium architecture won't make this sequence faster,
because the dependencies among the instructions cause a conventional pipeline
to stall.
The P6 processors are vastly more complex than the Pentium. Intel's engineers
devoted a great deal of that additional complexity to solving the pipeline stall problem,
because eliminating stalls increases the sustained instruction issue rate. The P6
processors break up the linear pipeline structure of the Pentium. The P6 processor
execution structure executes an instruction one pipeline stage at a time but returns the
result to the execution pool between stages. Each stage takes the next instruction it
can work on, even if it's out of the linear order. This means that the hand-off between
the pipeline stages doesn't have to proceed in rigid, linear lock step. Instead, each
pipeline stage can look into what Intel calls the "instruction pool" for the next
instruction it can work on. The control circuits for the instruction pool ensure that the
necessary dependencies between instructions are observed, but they otherwise allow
out-of-linear-order instruction execution.
[Diagram: the "get the instruction" and "put the answers where they belong" units
connected through the Instruction Pool.]
The advantage the P6 processors have over the Pentium and earlier processors is that
even though the instructions at steps three and four may stall, the fetch/decode unit
continues to fill the instruction pool from steps five to seven. At the point the
dispatch/execute unit stalls at three, the instruction at five will be available. The
dispatch/execute unit picks up that instruction and continues working. No cycles are
wasted on pipeline stalls, so your programs run faster.
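The instruction-pool idea can be sketched with a toy single-issue scheduler. This is a deliberately simplified model, not the P6 microarchitecture: each instruction has an assumed latency, and the only difference between the two modes is whether the dispatch unit must take instructions in program order or may pick any instruction in the pool whose inputs are ready.

```python
# Toy model (illustrative assumptions, not the P6 circuits): compare
# strict in-order dispatch with dispatch from an instruction pool.

def simulate(latency, deps, out_of_order):
    """Single-issue dispatch: each cycle, issue one instruction whose
    dependencies have all finished. In-order mode must issue the next
    instruction in program order; pool mode may pick any ready one.
    Returns (total cycles, issue order)."""
    n = len(latency)
    finish = {}  # instruction -> cycle its result becomes usable
    issued, order, cycle = set(), [], 0
    while len(issued) < n:
        cycle += 1
        ready = [i for i in range(n) if i not in issued
                 and all(finish.get(d, 10**9) <= cycle for d in deps[i])]
        if not out_of_order:
            # must be exactly the next instruction in program order
            nxt = min(set(range(n)) - issued)
            ready = [i for i in ready if i == nxt]
        if ready:  # otherwise this cycle is a stall
            i = ready[0]
            issued.add(i)
            order.append(i)
            finish[i] = cycle + latency[i]
    return cycle, order

# Instructions 0 and 1 are slow loads (3 cycles), 2 adds their results,
# 3 stores the sum; 4 and 5 are independent later instructions that the
# fetch/decode unit has already placed in the pool.
latency = [3, 3, 1, 1, 1, 1]
deps    = [[], [], [0, 1], [2], [], []]

print(simulate(latency, deps, out_of_order=False))  # (8, [0, 1, 2, 3, 4, 5])
print(simulate(latency, deps, out_of_order=True))   # (6, [0, 1, 4, 5, 2, 3])
```

In-order dispatch wastes two cycles stalled behind the loads; the pool-based dispatcher fills those cycles with instructions 4 and 5, exactly the behavior the paragraph above describes.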
Intel also improved the memory access performance of the P6 processors. In both the
Pentium and the P6, the split L1 cache ties to a bus interface unit, an engine that
works to get information from the L2 cache or the host bus. New in the P6
processors is a direct interface from the bus interface unit to the L2 cache. This isn't
possible in the Pentium, because the L2 cache is external to the Pentium package, and
there aren't enough pins in the package to add separate connections to the L2 cache.
The Pentium Pro includes the L2 cache as a second silicon chip within the package, so
it doesn't need additional pins to connect to the cache. The Pentium II includes
the L2 cache in the cartridge, within which Intel can maintain a controlled
environment for propagation of the necessary high-speed signals. The initial Celeron
processors omitted the Pentium II L2 cache; later Celerons use a version of the
Pentium II cache design.
Parallel access paths to the system bus and the L2 cache give the later processors a
huge increase in performance. While the system bus makes a memory or I/O access
taking 60 ns or more, the processor can continue to pull instructions and data from the
L2 cache with an access time of 8 ns to 16 ns. In the time the system bus makes one
cycle, the processor can get nearly 8 cycles from the L2 cache. The increased access
rate helps the fetch/decode unit, the dispatch/execute unit, and the retire unit avoid
stalling, boosting performance.
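The "nearly 8 cycles" figure follows directly from the access times quoted above, and is easy to verify:

```python
# Worked check of the "nearly 8 cycles" claim, using the access times
# given in the text: a 60 ns system bus access versus an L2 cache
# access of 8 ns (best case) to 16 ns (worst case).

bus_access_ns = 60
l2_fast_ns, l2_slow_ns = 8, 16

print(bus_access_ns / l2_fast_ns)  # 7.5 L2 accesses per bus access
print(bus_access_ns / l2_slow_ns)  # 3.75 in the slower case
```

Even in the worst case the processor gets several L2 accesses per system bus cycle, which is what keeps the fetch/decode, dispatch/execute, and retire units fed.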