Basic concepts
Speed of execution of programs can be improved in two ways:
Use faster circuit technology to build the processor and the memory.
Arrange the hardware so that a number of operations can be performed simultaneously; the number of operations performed per second is increased, although the elapsed time needed to perform any one operation is not changed.
Pipelining is an effective way of organizing concurrent activity in a computer system to improve the speed of execution of programs.
What if the execution of one instruction is overlapped with the fetching of the next one?
The fetch and execute units can be kept busy all the time.
If this pattern of fetch and execute can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by the sequential operation. Fetch and execute units constitute a two-stage pipeline.
Each stage performs one step in the processing of an instruction. An interstage storage buffer holds the information that needs to be passed from the fetch stage to the execute stage. New information is loaded into the buffer every clock cycle.
[Figure: four-stage instruction pipeline — F: Fetch instruction, D: Decode instruction and fetch operands, E: Execute operation, W: Write results — with interstage buffers B1, B2, and B3 between successive stages.]
Clock cycle 1: F1
Clock cycle 2: D1, F2
Clock cycle 3: E1, D2, F3
Clock cycle 4: W1, E2, D3, F4
Clock cycle 5: W2, E3, D4
Clock cycle 6: W3, E4
Clock cycle 7: W4
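The cycle-by-cycle schedule above can be generated mechanically; the short Python sketch below (added for illustration, not part of the original material) computes which step of which instruction is active in each clock cycle of an ideal four-stage pipeline with no stalls.

```python
# Minimal sketch of an ideal 4-stage pipeline schedule (no hazards assumed).
STAGES = ["F", "D", "E", "W"]

def ideal_schedule(num_instructions):
    """Map each clock cycle to the list of active steps, e.g. 'E2' for
    the Execute step of instruction I2."""
    schedule = {}
    for i in range(1, num_instructions + 1):      # instruction Ii enters Fetch in cycle i
        for offset, stage in enumerate(STAGES):
            cycle = i + offset                    # one stage per cycle, no stalls
            schedule.setdefault(cycle, []).append(f"{stage}{i}")
    return schedule

for cycle, steps in sorted(ideal_schedule(4).items()):
    print(f"Clock cycle {cycle}: {', '.join(steps)}")
```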
During clock cycle 4: Buffer B1 holds instruction I3, which is being decoded by the instruction-decoding unit; instruction I3 was fetched in cycle 3. Buffer B2 holds the source and destination operands for instruction I2, along with the information needed for the Write step (W2) of I2; this information will be passed to stage W in the following clock cycle. Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.
If instructions are fetched from the main memory, the instruction fetch stage can take as much as ten times longer than the other stage operations inside the processor. However, if instructions are fetched from a cache memory on the processor chip, the time required to fetch an instruction is roughly similar to the time required for the other basic operations.
The pipeline completes an instruction each clock cycle, and is therefore four times as fast as operation without a pipeline, as long as no step takes more than one cycle. But sometimes steps take longer: for example, most Execute operations such as ADD take one clock cycle, but suppose DIVIDE takes three.
While the divide is in progress, the other stages sit idle: Write has nothing to write, Decode cannot use its output buffer, and Fetch cannot use its output buffer.
Pipeline performance
Potential increase in performance achieved by using pipelining is proportional to the number of pipeline stages.
For example, if the number of pipeline stages is 4, then the rate of instruction processing is 4 times that of sequential execution of instructions. Pipelining does not cause a single instruction to be executed faster, it is the throughput that increases.
This rate can be achieved only if the pipelined operation can be sustained without interruption throughout program execution. If a pipelined operation cannot be sustained without interruption, the pipeline is said to stall. A condition that causes the pipeline to stall is called a hazard.
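As a back-of-the-envelope check of the proportionality claim above (a sketch added here, assuming an ideal pipeline with equal stage delays and no stalls), N instructions take k + (N − 1) cycles on a k-stage pipeline versus N × k cycles sequentially, so the speedup approaches k as N grows.

```python
def ideal_speedup(num_stages, num_instructions):
    """Speedup of an ideal k-stage pipeline over purely sequential execution,
    assuming equal stage delays and no stalls or hazards."""
    sequential_cycles = num_stages * num_instructions
    pipelined_cycles = num_stages + (num_instructions - 1)   # fill time + 1 cycle per instruction
    return sequential_cycles / pipelined_cycles

for n in (4, 100, 10_000):
    print(f"4-stage pipeline, {n} instructions: speedup = {ideal_speedup(4, n):.2f}")
```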
Data hazard
A data hazard is a condition in which either the source or the destination operand of an instruction is not available at the time expected in the pipeline. Execution of the instruction occurs in the E stage of the pipeline.
Execution of most arithmetic and logic operations would take only one clock cycle. However, some operations such as division would take more time to complete. For example, the operation specified in instruction I2 takes three cycles to complete from cycle 4 to cycle 6.
[Figure: pipeline timing for instructions I1–I4 over clock cycles 1–7; the Execute step E2 of instruction I2 extends over cycles 4–6, delaying the steps of the following instructions.]
In cycles 5 and 6, the Write stage is idle, because it has no data to work with. The information in buffer B2 must be retained until the execution of instruction I2 is complete. Stage 2, and by extension stage 1, cannot accept new instructions because the information in B1 cannot be overwritten; steps D4 and F5 must be postponed. A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline.
Data Hazards
Situations that cause the pipeline to stall because data to be operated on is delayed
An instruction hazard (or control hazard) causes the pipeline to stall when the supply of instructions is interrupted; here, instruction I2 is not in the cache and requires a main memory access.
[Figure: pipeline timing when instruction I2 misses in the cache — the Decode, Execute, and Write stages sit idle for several cycles while I2 is fetched from main memory.]
Structural hazard
Two instructions require the use of a hardware resource at the same time. The most common case is access to memory:
One instruction needs to access memory as part of its Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one of them can proceed and the other is delayed.
Many processors have separate data and instruction caches to avoid this delay. In general, structural hazards can be avoided by providing sufficient resources on the processor chip.
The memory address X + [R1] is computed in step E2 in cycle 4, the memory access takes place in cycle 5, and the operand read from memory is written into register R2 in cycle 6. Execution of instruction I2 thus takes two clock cycles, 4 and 5. In cycle 6, both instructions I2 and I3 require access to the register file; the pipeline is stalled because the register file cannot handle two operations at once.
The fetch of I5 is delayed because I2 takes an extra cycle for the cache access that is part of its execution.
When a hazard occurs, one of the stages in the pipeline cannot complete its operation in one clock cycle.
Performance level of one instruction completion in each clock cycle is the upper limit for the throughput that can be achieved in a pipelined processor.
Data hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed. Consider two instructions:
I1: A = 3 + A
I2: B = 4 × A
If A = 5 and I1 and I2 are executed sequentially, B = 32. In a pipelined processor, the execution of I2 can begin before the execution of I1 completes, so the value of A used in the execution of I2 would be the original value of 5, leading to an incorrect result. Thus instructions I1 and I2 depend on each other, because the data used by I2 depend on the result generated by I1. The results obtained using sequential execution of instructions should be the same as the results obtained from pipelined execution; when two instructions depend on each other, they must be performed in the correct order.
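The arithmetic of this example can be checked directly; the tiny Python sketch below (illustrative only) shows that using the stale value of A for I2 yields B = 20 instead of the correct B = 32.

```python
A = 5

# Sequential order: I2 sees the value produced by I1 (A = 3 + A).
a_after_i1 = 3 + A            # 8
b_correct = 4 * a_after_i1    # 32

# Hazard scenario: I2 reads A before I1 has written its result back.
b_stale = 4 * A               # 20 -- wrong, based on the old value of A

print(f"correct B = {b_correct}, B computed from stale A = {b_stale}")
```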
Concurrency
A ← 3 + A and B ← 4 × A cannot be performed concurrently: the result is incorrect if the new value of A is not used, because the second operation depends on completion of the first.
A ← 5 × C and B ← 20 + C can be performed concurrently, since neither operation depends on the result of the other.
Mul R2, R3, R4
Add R5, R4, R6   (dependent on the result in R4 from the previous instruction)
The Mul instruction places the result of the multiply operation in register R4 at the end of clock cycle 4. Register R4 is used as a source operand in the Add instruction; hence the Decode unit decoding the Add instruction cannot proceed until the Write step of the first instruction is complete. This data dependency arises because the destination of one instruction is used as a source in the next instruction.
Operand forwarding
A data hazard occurs because the destination of one instruction is used as the source in the next instruction. Hence, instruction I2 has to wait for the data to be written into the register file by the Write stage at the end of step W1. However, these data are available at the output of the ALU once the Execute stage completes step E1. The delay can be reduced, or even eliminated, if the result of instruction I1 is forwarded directly for use in step E2. This is called operand forwarding.
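The forwarding decision itself is simple to express; the Python sketch below is a rough model (the function and field names are invented for illustration): if the register a new instruction wants to read is the destination of the result sitting at the ALU output, use that value instead of reading the register file.

```python
def read_operand(reg, register_file, alu_result):
    """Return the value of 'reg', bypassing the register file when the
    previous instruction has just produced it at the ALU output.
    alu_result is a (dest_reg, value) pair from the result buffer, or None."""
    if alu_result is not None and alu_result[0] == reg:
        return alu_result[1]        # operand forwarding: take the value straight from the ALU
    return register_file[reg]       # no dependency: read the register file as usual

# Example: Mul R2, R3, R4 has just finished its Execute step;
# Add R5, R4, R6 (R6 <- R5 + R4) is about to enter Execute.
regs = {"R2": 6, "R3": 7, "R4": 0, "R5": 10, "R6": 0}   # R4 not yet written back
pending = ("R4", regs["R2"] * regs["R3"])                # 42, waiting in the result buffer
regs["R6"] = regs["R5"] + read_operand("R4", regs, pending)
print(regs["R6"])                                        # 52, as if R4 were already 42
```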
[Figure: pipeline timing for the Mul/Add pair — with a pipeline stall while R4 is written back, and with data forwarding, where the R2 × R3 result is passed directly to the Add.]
If solved by software:
Mul R2, R3, R4
NOP
NOP
Add R5, R4, R6
I1: Mul R2, R3, R4
I2: Add R5, R4, R6
Clock cycle 3: Instruction I2 is decoded, and a data dependency is detected. The operand not involved in the dependency, register R5, is loaded into register SRC1.
Clock cycle 4: The product produced by I1 is available in register RSLT. The forwarding connection allows this result to be used in step E2, so instruction I2 proceeds without interruption.
[Figure: datapath used for operand forwarding — source registers SRC1 and SRC2 feed the ALU, the result is captured in RSLT along with the destination information, and a forwarding path connects RSLT back to the ALU inputs.]
Detecting data dependencies and handling them can also be accomplished in software.
Control hardware can delay the reading of a register by an appropriate number of clock cycles until its contents become available; the pipeline stalls for that many clock cycles.
Alternatively, the compiler can introduce the necessary delay by inserting an appropriate number of NOP instructions. For example, if a two-cycle delay is needed between two instructions, two NOP instructions can be placed between them:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
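As a toy model of this software solution (a sketch; the tuple encoding and the fixed two-cycle delay are assumptions, not the slides' notation), the pass below inserts NOPs whenever an instruction reads the register written by the instruction immediately before it.

```python
NOP = ("NOP", (), None)

def insert_nops(program, delay=2):
    """Insert 'delay' NOP instructions between an instruction and an
    immediately following instruction that reads its destination register.
    Each instruction is a tuple: (opcode, (source_regs...), dest_reg)."""
    out = []
    for instr in program:
        if out:
            prev_dest = out[-1][2]
            if prev_dest is not None and prev_dest in instr[1]:
                out.extend([NOP] * delay)   # give the result time to reach the register file
        out.append(instr)
    return out

program = [
    ("Mul", ("R2", "R3"), "R4"),
    ("Add", ("R5", "R4"), "R6"),            # reads R4, written by the Mul
]
for opcode, sources, dest in insert_nops(program):
    operands = sources + ((dest,) if dest else ())
    print(opcode, ",".join(operands))
```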
Side effects
Data dependencies are explicit and easy to detect when a register specified as the destination in one instruction is used as a source in a subsequent instruction. However, some instructions also modify registers that are not specified as the destination.
For example, in the autoincrement and autodecrement addressing mode, the source register is modified as well.
When a location other than the one explicitly specified in the instruction as a destination location is affected, the instruction is said to have a side effect. Another example of a side effect is condition code flags which implicitly record the results of the previous instruction, and these results may be used in the subsequent instruction.
Instructions with side effects can lead to multiple data dependencies, which results in a significant increase in the complexity of the hardware or software needed to handle the dependencies. Side effects should therefore be kept to a minimum in instruction sets designed for execution on pipelined hardware.
Instruction hazards
The instruction fetch unit fetches instructions and supplies the execution units with a steady stream of instructions. If the stream is interrupted, the pipeline stalls. The stream of instructions may be interrupted because of a cache miss or a branch instruction.
The pipeline stalls for one clock cycle. The time lost as a result of a branch instruction is called the branch penalty; here the branch penalty is one clock cycle.
The fetch unit fetches instructions before they are needed and stores them in an instruction queue. A dispatch unit takes instructions from the front of the queue, decodes them, and dispatches them to the execution unit (E: Execute instruction, W: Write results).
[Figure: pipeline timing with an instruction queue for instructions I1–I6, where I5 is a branch whose target is Ik.]
The initial length of the queue is 1. Each fetch adds one instruction to the queue and each dispatch removes one, so the queue length remains the same for the first 4 clock cycles. Instruction I1 stalls the pipeline for 2 cycles; since the queue has space, the fetch unit continues and the queue length rises to 3 in clock cycle 6.
I5 is a branch instruction whose target is instruction Ik. Ik is fetched in cycle 7, and I6 is discarded. However, this does not stall the pipeline, since I4 is dispatched from the queue in that cycle; I2, I3, I4, and Ik are executed in successive clock cycles. The fetch unit computes the branch address concurrently with the execution of other instructions. This is called branch folding.
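The interplay between fetching into the queue and dispatching from it can be mimicked with a few lines of Python; the sketch below is illustrative only (the choice of which cycles stall dispatch is an assumption), but it shows the queue length staying constant while dispatch keeps up and growing only while dispatch is blocked.

```python
from collections import deque

def queue_lengths(stall_cycles, total_cycles=8, capacity=8):
    """Return the instruction-queue length at the end of each cycle.
    stall_cycles: cycles in which the dispatch unit cannot issue an instruction."""
    queue = deque(["I1"])                  # initial queue length is 1
    next_to_fetch = 2
    lengths = []
    for cycle in range(1, total_cycles + 1):
        if len(queue) < capacity:          # fetch unit keeps fetching while there is room
            queue.append(f"I{next_to_fetch}")
            next_to_fetch += 1
        if cycle not in stall_cycles and queue:
            queue.popleft()                # dispatch one instruction to the execution unit
        lengths.append(len(queue))
    return lengths

# Dispatch blocked for two cycles, e.g. while a long instruction occupies Execute:
print(queue_lengths(stall_cycles={5, 6}))   # [1, 1, 1, 1, 2, 3, 3, 3]
```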
The queue should ideally be full most of the time. This requires increasing the rate at which the fetch unit reads instructions from the cache; most processors allow more than one instruction to be fetched from the cache in one clock cycle. The fetch unit must also be able to replenish the queue quickly after a branch has occurred.
The decision on whether or not to branch cannot be made until the execution of the preceding instruction is complete.
Branch instructions represent about 20% of the dynamic instruction count of most programs. The dynamic instruction count takes into consideration that some instructions are executed repeatedly.
Branch instructions may incur a branch penalty that reduces the performance gains expected from pipelining. Several techniques exist to mitigate the negative impact of the branch penalty on performance.
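To get a feel for the cost, the sketch below combines the 20% figure with an assumed branch penalty, and assumes (pessimistically) that every branch pays the full penalty; the numbers are illustrative, not measurements.

```python
def effective_cpi(branch_fraction=0.20, branch_penalty=2, base_cpi=1.0):
    """Average cycles per instruction when a fraction of the instructions are
    branches and every branch costs 'branch_penalty' extra cycles."""
    return base_cpi + branch_fraction * branch_penalty

for penalty in (1, 2):
    cpi = effective_cpi(branch_penalty=penalty)
    print(f"penalty = {penalty} cycle(s): effective CPI = {cpi:.2f}, "
          f"throughput = {1 / cpi:.0%} of the ideal")
```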
Delayed branch
The branch target address is computed in stage E2. Instructions I3 and I4 have to be discarded. The location following a branch instruction is called a branch delay slot; there may be more than one branch delay slot, depending on the time it takes to determine whether the instruction is a branch. In this case, there are two branch delay slots. The instructions in the delay slots are always fetched and at least partially executed before the branch decision is made and the branch address is computed.
[Figure: pipeline timing for a branch I2 whose target address is computed in stage E2 — the delay-slot instructions I3 and I4 are fetched and then discarded (X), and execution continues with Ik and Ik+1.]
If we are able to place useful instructions in these slots, then they will always be executed whether or not the branch is taken.
If we cannot place useful instructions in the branch delay slots, then we can fill these slots with NOP instructions.
[Figure: a program loop that shifts R1 left a number of times determined by the counter in R2, with a conditional branch back to LOOP and an Add R1,R3 instruction at label NEXT following the loop.]
Register R2 is used as a counter to determine how many times R1 is to be shifted. The processor has a two-stage pipeline, and therefore one delay slot. The instructions can be reordered so that the shift-left instruction appears in the delay slot; the shift-left instruction is then always executed, whether the branch condition is true or false.
[Figure: execution timing for the reordered loop — Branch, then Shift in the delay slot, followed by the Decrement of the next iteration when the branch is taken, or by the Add when the branch is not taken.]
Branch prediction
To reduce the branch penalty associated with conditional branches, we can predict whether the branch will be taken. Simplest form of branch prediction:
Assume that the branch will not take place. Continue to fetch instructions in sequential execution order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis.
Speculative execution implies that the processor is executing instructions before it is certain that they are in the correct sequence.
Processor registers and memory locations should not be updated unless the sequence is confirmed. If the branch prediction turns out to be wrong, then instructions that were executed on a speculative basis and their data must be purged. Correct sequence of instructions must be fetched and executed.
I1 is a compare instruction and I2 is a branch instruction. Branch prediction takes place in cycle 3 when I2 is being decoded. I3 is being fetched at that time. Fetch unit predicts that the branch will not be taken and continues to fetch I4 in cycle 4 when I3 is being decoded.
The results of I1 are available in cycle 3, and the fetch unit evaluates the branch condition in cycle 4. If the branch prediction turns out to be incorrect, the fetch unit realizes it at this point: I3 and I4 are discarded and Ik is fetched from the branch target address.
[Figure: a two-state branch prediction algorithm with states LNT (likely not taken) and LT (likely taken); the transitions are labeled BT (branch taken) and BNT (branch not taken).]
[Figure: a four-state branch prediction algorithm; the transitions between states are labeled BT (branch taken) and BNT (branch not taken).]
ST: strongly likely to be taken
LT: likely to be taken
LNT: likely not to be taken
SNT: strongly likely not to be taken
The initial state of the algorithm is LNT. After the branch instruction is executed, if the branch is taken, the state is changed to ST. For a branch instruction, the fetch unit predicts that the branch will be taken if the state is ST or LT; otherwise it predicts that the branch will not be taken.
In state SNT, the prediction is that the branch is not taken. If the branch is actually taken, the state changes to LNT, and the next time the branch is encountered the prediction is again that it is not taken. If the prediction is wrong this second time, the state changes to ST, and after that the branch is predicted as taken.
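This four-state machine is easy to write down in code. The sketch below follows the transitions described in this section (initial state LNT, LNT jumping to ST after a taken branch, SNT moving to LNT after a taken branch, ST dropping to LT after a not-taken branch); the transitions marked "assumed" are not spelled out in the text, so treat the table as one plausible reading rather than the definitive diagram.

```python
# Four-state branch predictor.
# Each entry: state -> (next state if branch taken, next state if branch not taken)
TRANSITIONS = {
    "ST":  ("ST",  "LT"),    # stays ST while taken; drops to LT on a not-taken branch
    "LT":  ("ST",  "LNT"),   # taken -> ST (implied by the loop example); not taken -> LNT (assumed)
    "LNT": ("ST",  "SNT"),   # taken -> ST (stated); not taken -> SNT (assumed)
    "SNT": ("LNT", "SNT"),   # moves up to LNT only after an actual taken branch
}

class BranchPredictor:
    def __init__(self, initial_state="LNT"):
        self.state = initial_state

    def predict_taken(self):
        return self.state in ("ST", "LT")   # predict taken only in these two states

    def update(self, taken):
        self.state = TRANSITIONS[self.state][0 if taken else 1]

# A loop branch: taken on 4 passes, not taken on loop exit; the loop is entered twice.
predictor = BranchPredictor()
mispredictions = 0
for taken in [True, True, True, True, False] * 2:
    if predictor.predict_taken() != taken:
        mispredictions += 1
    predictor.update(taken)
print(mispredictions)   # 3: the first pass of the first entry, plus the two loop exits
```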
On the first pass through the loop, the prediction (not taken) is wrong and the state changes from LNT to ST. In the subsequent passes, the algorithm predicts that the branch is taken; this prediction is correct except for the last pass, where the state changes from ST to LT.
When the loop is entered the second time, the algorithm again predicts that the branch is taken. With this, the only misprediction is on the final pass through the loop, and that misprediction is unavoidable.
The information necessary to set the initial state of the branch prediction algorithm can be provided by static prediction schemes: comparing the branch target address with the address of the branch instruction, or checking a branch prediction bit set by the compiler. For a branch instruction at the end of a loop, the initial state is set to LT; for a branch instruction at the start of a loop, the initial state is set to LNT.
Overview
Some instructions are much better suited to pipelined execution than others. Two aspects of the instruction set are particularly relevant: addressing modes and condition code flags.
Addressing Modes
Addressing modes include simple ones and complex ones. In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline:
The extent to which complex addressing modes cause the pipeline to stall
Whether a given mode is likely to be used by compilers
Side effects
Recall
Load X(R1), R2
[Figure 8.5: Effect of a Load instruction (e.g., Load (R1), R2) on pipeline timing — the Load requires an extra memory-access step M2 in addition to its Execute step.]
[Figure: equivalent operations using complex and simple addressing modes for Load (X(R1)), R2.
(a) Complex addressing mode: the single Load computes X + [R1], then fetches [X + [R1]] and finally [[X + [R1]]], occupying the pipeline for several cycles before the next instruction can proceed.
(b) Simple addressing mode: an equivalent sequence of three instructions — an Add that computes X + [R1], a Load that fetches [X + [R1]], and a Load that fetches [[X + [R1]]] — followed by the next instruction.]
Addressing Modes
In a pipelined processor, complex addressing modes do not necessarily lead to faster execution.
Advantage: they reduce the number of instructions and the program space.
Disadvantages: they cause the pipeline to stall, require more hardware to decode, and are not convenient for compilers to work with.
Conclusion: complex addressing modes are not well suited to pipelined execution.
Addressing Modes
Good addressing modes should have the following features:
Access to an operand does not require more than one access to memory.
Only load and store instructions access memory operands.
The addressing modes used do not have side effects.
Register, register indirect, and index modes satisfy these conditions.
Condition Codes
If an optimizing compiler attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur, it must ensure that reordering does not cause a change in the outcome of a computation. The dependency introduced by the condition-code flags reduces the flexibility available to the compiler to reorder instructions.
Condition Codes
[Figure 8.17: Instruction reordering — (a) a program fragment: Add R1,R2; Compare R3,R4; Branch=0 ...; (b) the same instructions reordered.]
Condition Codes
Two conclusions:
To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.
The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.
[Figure 7.8: Three-bus organization of the datapath — buses A, B, and C connect the PC, the incrementer, the register file, the constant 4 and its MUX, the ALU (inputs A and B, output R), the instruction decoder, IR, MDR, and MAR, with MDR and MAR attached to the memory bus data and address lines.]
Pipelined Design
Modifications to the datapath for pipelined execution:
- Separate instruction and data caches
- PC connected to IMAR
- DMAR
- Separate MDR
- Buffers for the ALU
- Instruction queue
- Instruction decoder output
- Control signal pipeline
With these changes, the following operations can be performed independently:
- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two registers
- Writing into one register in the register file
- Performing an ALU operation
Superscalar Operation
Overview
The maximum throughput of a pipelined processor is one instruction per clock cycle. If we equip the processor with multiple processing units so that several instructions can be handled in parallel in each processing stage, several instructions start execution in the same clock cycle; this is called multiple issue. Processors capable of achieving an instruction execution throughput of more than one instruction per cycle are known as superscalar processors. Multiple issue requires a wider path to the cache and multiple execution units.
Superscalar
[Figure 8.19: A processor with two execution units — F: instruction fetch unit feeding an instruction queue.]
Timing
[Figure: timing of instruction execution in a processor that issues two instructions per cycle — I1 and I2 are fetched together in cycle 1, I3 and I4 in cycle 2; I1 executes in three steps E1A, E1B, E1C and writes its result (W1) in cycle 6, while I2 completes earlier (W2 in cycle 4), so instructions complete out of program order.]
Out-of-Order Execution
(a) Delayed write
Execution Completion
It is desirable to use out-of-order execution, so that an execution unit is freed to execute other instructions as soon as possible. At the same time, instructions must be completed in program order to allow precise exceptions. Two mechanisms support this: the use of temporary registers, and a commitment unit that writes results to the register file in program order.
[Figure (b): Using temporary registers — I1 (Fadd), I2 (Add), I3 (Fsub), I4 (Sub); I2 and I4 write their results into temporary registers (steps TW2 and TW4) as soon as they finish executing, and the final writes W1–W4 into the register file take place in program order in cycles 6 and 7.]
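A rough Python sketch of the temporary-register idea (the class and method names are invented for illustration): execution units may finish in any order and deposit results in temporary slots, but the commitment unit copies results into the architectural register file strictly in program order.

```python
class CommitmentUnit:
    """Toy model: results arrive out of order, but are committed in order."""
    def __init__(self, num_instructions):
        self.temp = [None] * num_instructions   # one temporary slot per instruction
        self.next_to_commit = 0
        self.register_file = {}

    def write_temp(self, index, dest_reg, value):
        self.temp[index] = (dest_reg, value)    # out-of-order completion is allowed here

    def commit(self):
        # Retire instructions strictly in program order; stop at the first unfinished one.
        while (self.next_to_commit < len(self.temp)
               and self.temp[self.next_to_commit] is not None):
            reg, value = self.temp[self.next_to_commit]
            self.register_file[reg] = value
            self.next_to_commit += 1

cu = CommitmentUnit(2)
cu.write_temp(1, "R2", 7)     # I2 (a simple Add) finishes first
cu.commit()                   # nothing commits yet: I1 is still executing
cu.write_temp(0, "F2", 1.5)   # I1 (a long Fadd) finally completes
cu.commit()                   # now I1 and I2 commit, in program order
print(cu.register_file)       # {'F2': 1.5, 'R2': 7}
```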
Performance Considerations
Overview
The execution time T of a program that has a dynamic instruction count N is given by:

T = (N × S) / R

where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate. Instruction throughput Ps is defined as the number of instructions executed per second:

Ps = R / S
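A small worked example (the numbers are chosen purely for illustration): with N = 1,000,000 instructions, S = 1.2 cycles per instruction, and a 500 MHz clock, the two formulas give the execution time and the instruction throughput directly.

```python
N = 1_000_000    # dynamic instruction count (assumed)
S = 1.2          # average clock cycles per instruction (assumed)
R = 500e6        # clock rate in cycles per second (500 MHz, assumed)

T = N * S / R    # execution time in seconds
Ps = R / S       # instruction throughput in instructions per second

print(f"T  = {T * 1e3:.2f} ms")                                  # 2.40 ms
print(f"Ps = {Ps / 1e6:.1f} million instructions per second")    # 416.7
```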
Overview
An n-stage pipeline has the potential to increase the throughput by n times. However, the only real measure of performance is the total execution time of a program; higher instruction throughput will not necessarily lead to higher performance. Two questions arise regarding pipelining:
How much of this potential increase in instruction throughput can be realized in practice? What is good value of n?
Since an n-stage pipeline has the potential to increase the throughput by n times, why not use a 10,000-stage pipeline? As the number of stages increases, the probability of the pipeline being stalled increases, the inherent delay of the basic operations increases, and hardware considerations (area, power, complexity) come into play.