
A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand. Our example hazards have all been with register operands, but it is also possible to create a dependence by writing and reading the same memory location. In the DLX pipeline, however, memory references are always kept in order, preventing this type of hazard from arising.

All the data hazards discussed here involve registers within the CPU. By convention, the hazards are named by the ordering in the program that must be preserved by the pipeline:
RAW (read after write)

WAW (write after write)
WAR (write after read)

Consider two instructions i and j, with i occurring before j. The possible data hazards are:

RAW (read after write) - j tries to read a source before i writes it, so j incorrectly gets
the old value.

This is the most common type of hazard, and the kind that we use forwarding to overcome.

WAW (write after write) - j tries to write an operand before it is written by i. The
writes end up being performed in the wrong order, leaving the value written by i rather
than the value written by j in the destination.

This hazard is present only in pipelines that write in more than one pipe stage or allow an
instruction to proceed even when a previous instruction is stalled. The DLX integer
pipeline writes a register only in WB and avoids this class of hazards.

WAW hazards would be possible if we made the following two changes to the DLX pipeline:

Move the write back for an ALU operation into the MEM stage, since the data value is
available by then.
Suppose that the data memory access took two pipe stages.

Here is a sequence of two instructions showing the execution in this revised pipeline,
highlighting the pipe stage that writes the result:

LW  R1, 0(R2)    IF ID EX MEM1 MEM2 WB
ADD R1, R2, R3      IF ID EX   WB

Unless this hazard is avoided, execution of this sequence on this revised pipeline will
leave the result of the first write (the LW) in R1, rather than the result of the ADD.

Allowing writes in different pipe stages introduces other problems, since two instructions
can try to write during the same clock cycle. The DLX FP pipeline, which has both
writes in different stages and different pipeline lengths, will deal with both write conflicts
and WAW hazards in detail.

WAR (write after read) - j tries to write a destination before it is read by i, so i
incorrectly gets the new value.

This cannot happen in our example pipeline because all reads are early (in ID) and all
writes are late (in WB). This hazard occurs when there are some instructions that write
results early in the instruction pipeline, and other instructions that read a source late in the
pipeline.

Because of the natural structure of a pipeline, which typically reads values before it
writes results, such hazards are rare. Pipelines for complex instruction sets that support
autoincrement addressing and require operands to be read late in the pipeline could create
a WAR hazard.

If we modified the DLX pipeline as in the above example and also read some operands
late, such as the source value for a store instruction, a WAR hazard could occur. Here is
the pipeline timing for such a potential hazard, highlighting the stage where the conflict
occurs:

SW  0(R1), R2    IF ID EX MEM1 MEM2 WB
ADD R2, R3, R4      IF ID EX   WB

If the SW reads R2 during the second half of its MEM2 stage and the Add writes R2
during the first half of its WB stage, the SW will incorrectly read and store the value
produced by the ADD.

RAR (read after read) - this case is not a hazard :).

Unfortunately, not all potential hazards can be handled by forwarding.

Consider the following sequence of instructions:

                  1  2  3     4     5     6    7   8
LW  R1, 0(R2)     IF ID EX    MEM   WB
SUB R4, R1, R5       IF ID    EXsub MEM   WB
AND R6, R1, R7          IF    ID    EXand MEM  WB
OR  R8, R1, R9                IF    ID    EXor MEM WB

The LW instruction does not have the data until the end of clock cycle 4 (MEM) , while
the SUB instruction needs to have the data by the beginning of that clock cycle (EXsub).

For the AND instruction, we can forward the result immediately to the ALU (EXand) from
the MEM/WB register (MEM).

The OR instruction has no problem, since it receives the value through the register file (ID).
In clock cycle 5, the WB of the LW instruction occurs "early," in the first half of the cycle,
and the register read of the OR instruction occurs "late," in the second half of the cycle.

For the SUB instruction, the forwarded result would arrive too late - at the end of a clock
cycle, when it is needed at the beginning.

The load instruction has a delay or latency that cannot be eliminated by forwarding alone.
Instead, we need to add hardware, called a pipeline interlock, to preserve the correct
execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline
until the hazard is cleared.
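In a simple in-order pipeline, the interlock reduces to a comparison between the load ahead and the instruction behind it. The following is a minimal sketch in Python, not the actual DLX control logic; the dict-based instruction encoding is an assumption for illustration:

```python
# Minimal sketch of a load interlock (an illustration, not the actual
# DLX hardware; the dict encoding of instructions is assumed).
def needs_stall(ex_instr, id_instr):
    """Stall if the instruction in EX is a load whose destination is a
    source of the instruction in ID: the loaded value only exists at
    the end of MEM, too late even for forwarding."""
    if ex_instr is None or ex_instr["op"] != "LW":
        return False
    return ex_instr["dest"] in id_instr["srcs"]

lw  = {"op": "LW",  "dest": "R1", "srcs": ["R2"]}
sub = {"op": "SUB", "dest": "R4", "srcs": ["R1", "R5"]}
orr = {"op": "OR",  "dest": "R8", "srcs": ["R6", "R9"]}

print(needs_stall(lw, sub))   # True: SUB uses R1 right after the load
print(needs_stall(sub, orr))  # False: SUB is not a load; forwarding works
```

When the predicate is true, the interlock freezes IF and ID for one cycle and lets a bubble flow down the pipeline.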

The pipeline with a stall and the legal forwarding is:

                  1  2  3  4     5     6   7   8   9
LW  R1, 0(R2)     IF ID EX MEM   WB
SUB R4, R1, R5       IF ID stall EXsub MEM WB
AND R6, R1, R7          IF stall ID    EX  MEM WB
OR  R8, R1, R9             stall IF    ID  EX  MEM WB

The only necessary forwarding is done for R1 from MEM to EXsub.

Notice that there is no need to forward R1 for the AND instruction, because now it is getting
the value through the register file in ID (as the OR above).

There are techniques to reduce the number of stalls even in this case, which we consider next.

Generate DLX code that avoids pipeline stalls for the following sequence of statements:

a = b + c
d = a - f
e = g - h

Assume that all variables are 32-bit integers. Wherever necessary, explicitly explain the
actions that are needed to avoid pipeline stalls in your scheduled code.

The DLX assembly code for the given sequence of statements is:

                  1  2  3  4  5  ...  18
LW  Rb, b         IF ID EX M  WB
LW  Rc, c            IF ID EX M  WB
Add Ra, Rb, Rc          IF ID stall EX M WB
SW  Ra, a                  IF stall ID EX M WB
LW  Rf, f                     stall IF ID EX M WB
Sub Rd, Ra, Rf                      IF ID stall EX M WB
SW  Rd, d                              IF stall ID EX M WB
LW  Rg, g                                 stall IF ID EX M WB
LW  Rh, h                                       IF ID EX M WB
Sub Re, Rg, Rh                                     IF ID stall EX M WB
SW  Re, e                                             IF stall ID EX M WB

Running this code segment will need some forwarding. But instructions LW and
ALU (Add or Sub), when put in sequence, generate hazards for the pipeline that cannot
be resolved by forwarding. So the pipeline will stall. Observe that in time steps 4, 5,
and 6, there are two forwards from the data memory unit to the ALU in the EX stage of
the Add instruction. The same is true in time steps 13, 14, and 15. The hardware to
implement this forwarding will need two Load Memory Data registers to store the output
of data memory. Note that for the SW instructions, the register value is needed at the
input of data memory. A better solution, with compiler assist, is given below.

Rather than just allowing the pipeline to stall, the compiler could try to schedule the pipeline
to avoid these stalls by rearranging the code sequence to eliminate the hazards.

The suggested version is (the problem actually has more than one solution):

                  1  2  3  ...  15    Explanation
LW  Rb, b         IF ID EX M  WB
LW  Rc, c            IF ID EX M  WB
LW  Rf, f               IF ID EX M WB
Add Ra, Rb, Rc                        Rb read in second half of ID
LW  Rg, g
Sub Rd, Ra, Rf                        Rf read in second half of ID
LW  Rh, h
SW  Ra, a
Sub Re, Rg, Rh                        Rg read in second half of ID
SW  Rd, d         IF ID EX M WB       Rd read in second half of ID
SW  Re, e

The same color is used to outline the source and destination of forwarding.
The blue color is used to indicate the technique to perform the register file reads in the
second half of a cycle, and the writes in the first half.
Note that the use of different registers for the first, second, and third statements
was critical for this schedule to be legal! In general, pipeline scheduling can increase the
required register count.
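The payoff of this scheduling can be checked with a small stall counter. This is an illustration under a simplified model, one stall whenever a load's result is used by the very next instruction, using the register names from the example above:

```python
# Simplified stall model: one stall per load whose result is used by
# the immediately following instruction (a sketch, not a full DLX model).
def count_stalls(seq):
    stalls = 0
    for a, b in zip(seq, seq[1:]):
        if a[0] == "LW" and a[1] in b[2]:  # load result used immediately
            stalls += 1
    return stalls

# (op, dest, sources)
naive = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]),
    ("LW", "Rf", []), ("SUB", "Rd", ["Ra", "Rf"]), ("SW", None, ["Rd"]),
    ("LW", "Rg", []), ("LW", "Rh", []), ("SUB", "Re", ["Rg", "Rh"]),
    ("SW", None, ["Re"]),
]
scheduled = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Rf", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rg", []),
    ("SUB", "Rd", ["Ra", "Rf"]), ("LW", "Rh", []),
    ("SW", None, ["Ra"]), ("SUB", "Re", ["Rg", "Rh"]),
    ("SW", None, ["Rd"]), ("SW", None, ["Re"]),
]
print(count_stalls(naive), count_stalls(scheduled))  # 3 0
```

Three load-use stalls in the naive order, none after scheduling, which matches the 18-cycle versus 15-cycle timings.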

Control hazards can cause a greater performance loss for the DLX pipeline than data
hazards. When a branch is executed, it may or may not change the PC (program counter)
to something other than its current value plus 4. If a branch changes the PC to its target
address, it is a taken branch; if it falls through, it is not taken.

If instruction i is a taken branch, then the PC is normally not changed until the end of the
MEM stage, after the completion of the address calculation and comparison.
The simplest method of dealing with branches is to stall the pipeline as soon as the
branch is detected until we reach the MEM stage, which determines the new PC. The
pipeline behavior looks like:

Branch instruction   IF ID EX MEM WB
Branch successor        IF(stall) stall stall IF ID EX MEM WB
Branch successor+1                            IF ID EX MEM WB

The stall does not occur until after the ID stage (where we know that the instruction is a
branch).

This control hazard stall must be implemented differently from a data hazard stall, since the
IF cycle of the instruction following the branch must be repeated as soon as we know
the branch outcome. Thus, the first IF cycle is essentially a stall (because it never
performs useful work), which brings the total to three stall cycles.

Three clock cycles wasted for every branch is a significant loss. With a 30% branch
frequency and an ideal CPI of 1, the machine with branch stalls achieves only half the
ideal speedup from pipelining!
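The claim is easy to verify with back-of-the-envelope arithmetic, using the numbers assumed above (ideal CPI of 1, 30% branch frequency, 3 stall cycles per branch):

```python
# CPI with branch stalls = ideal CPI + branch frequency * stall cycles.
ideal_cpi = 1.0
branch_freq = 0.30
stalls_per_branch = 3
cpi = ideal_cpi + branch_freq * stalls_per_branch
print(cpi)              # 1.9
print(ideal_cpi / cpi)  # ~0.53: roughly half the ideal throughput
```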

The number of clock cycles can be reduced by two steps:

Find out whether the branch is taken or not taken earlier in the pipeline;
Compute the taken PC (i.e., the address of the branch target) earlier.
Both steps should be taken as early in the pipeline as possible.

By moving the zero test into the ID stage, it is possible to know if the branch is taken at
the end of the ID cycle. Computing the branch target address during ID requires an
additional adder, because the main ALU, which has been used for this function so far, is
not usable until EX.
The revised datapath:

Data Hazards
A major effect of pipelining is to change the relative timing of instructions by
overlapping their execution. This introduces data and control hazards. Data hazards
occur when the pipeline changes the order of read/write accesses to operands so that the
order differs from the order seen by sequentially executing instructions on the
unpipelined machine.

Consider the pipelined execution of these instructions:

                   1  2  3     4     5     6    7   8   9
ADD R1, R2, R3     IF ID EX    MEM   WB
SUB R4, R1, R5        IF IDsub EX    MEM   WB
AND R6, R1, R7           IF    IDand EX    MEM  WB
OR  R8, R1, R9                 IF    IDor  EX   MEM WB
XOR R10, R1, R11                     IF    IDxor EX MEM WB

All the instructions after the ADD use the result of the ADD instruction (in R1). The
ADD instruction writes the value of R1 in the WB stage (shown black), and the SUB
instruction reads the value during ID stage (IDsub). This problem is called a data hazard.
Unless precautions are taken to prevent it, the SUB instruction will read the wrong value
and try to use it.

The AND instruction is also affected by this data hazard. The write of R1 does not
complete until the end of cycle 5 (shown black). Thus, the AND instruction that reads the
registers during cycle 4 (IDand) will receive the wrong result.

The OR instruction can be made to operate without incurring a hazard by a simple
implementation technique. The technique is to perform register file reads in the second
half of the cycle, and writes in the first half. Because both the WB for ADD and IDor for OR
are performed in cycle 5, the write to the register file by ADD is performed in the first
half of the cycle, and the read of the registers by OR is performed in the second half of the
cycle.
The XOR instruction operates properly, because its register read occurs in cycle 6, after
the register write by ADD.
The next page discusses forwarding, a technique to eliminate the stalls for the hazard
involving the SUB and AND instructions.

We will also classify the data hazards and consider the cases when stalls cannot be
eliminated. We will see what the compiler can do to schedule the pipeline to avoid stalls.
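The hazard window for the ADD/SUB/AND/OR/XOR sequence above can be sketched in a few lines. The cycle numbering is an assumption for illustration: ADD issues in cycle 1 and writes R1 in WB during cycle 5, and an instruction issued d slots later reads its sources in ID during cycle d + 2:

```python
# Sketch of the hazard window: does an instruction d slots after the
# ADD read a stale R1? With a split-cycle register file (write in the
# first half, read in the second), a read in the write cycle is safe.
def reads_stale_value(d, split_cycle=True):
    write_cycle = 5       # ADD's WB stage
    read_cycle = d + 2    # the later instruction's ID stage
    if split_cycle:
        return read_cycle < write_cycle
    return read_cycle <= write_cycle

for d, name in [(1, "SUB"), (2, "AND"), (3, "OR"), (4, "XOR")]:
    print(name, reads_stale_value(d))
```

SUB and AND are hazarded, OR is saved by the split-cycle register file, and XOR reads strictly after the write, matching the discussion above.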

Hazard (computer architecture)

From Wikipedia, the free encyclopedia

Hazards are problems with the instruction pipeline in central processing unit (CPU)
microarchitectures that potentially result in incorrect computation. There are typically
three types of hazards:

• data hazards
• structural hazards
• control hazards (branching hazards)

There are several methods used to deal with hazards, including pipeline stalls (pipeline
bubbling), register forwarding, and in the case of out-of-order execution, the
scoreboarding method and the Tomasulo algorithm.

• 1 Background
• 2 Types
o 2.1 Data hazards
 - 2.1.1 Read After Write (RAW)
 - Example
 - 2.1.2 Write After Read (WAR)
 - Example
 - 2.1.3 Write After Write (WAW)
 - Example
o 2.2 Structural hazards
o 2.3 Control hazards (branch hazards)
• 3 Eliminating hazards
o 3.1 Generic
 3.1.1 Pipeline bubbling
o 3.2 Data hazards
 3.2.1 Register forwarding
 Example
o 3.3 Control hazards (branch hazards)
• 4 References

• 5 See also

Background
Further information: instruction pipeline

Instructions in a pipelined processor are performed in several stages, so that at any given
time several instructions are being processed in the various stages of the pipeline, such as
fetch and execute. There are many different instruction pipeline microarchitectures, and
instructions may be executed out-of-order. A hazard occurs when two or more of these
simultaneous (possibly out of order) instructions conflict.

Types

Data hazards

Data hazards occur when instructions that exhibit data dependence modify data in
different stages of a pipeline. Ignoring potential data hazards can result in race conditions
(sometimes known as race hazards). There are three situations in which a data hazard can
occur:
1. read after write (RAW), a true dependency

2. write after read (WAR)
3. write after write (WAW)

Consider two instructions i and j, with i occurring before j in program order.

Read After Write (RAW)

(j tries to read a source before i writes to it) A read after write (RAW) data hazard refers
to a situation where an instruction refers to a result that has not yet been calculated or
retrieved. This can occur because, even though an instruction is executed after a previous
instruction, the previous instruction has not been completely processed through the
pipeline.
Example

For example:

i1. R2 <- R1 + R3
i2. R4 <- R2 + R3

The first instruction is calculating a value to be saved in register 2, and the second is
going to use this value to compute a result for register 4. However, in a pipeline, when we
fetch the operands for the 2nd operation, the results from the first will not yet have been
saved, and hence we have a data dependency.

We say that there is a data dependency with instruction 2, as it is dependent on the
completion of instruction 1.

Write After Read (WAR)

(j tries to write a destination before it is read by i) A write after read (WAR) data hazard
represents a problem with concurrent execution.

Example

For example:

i1. R4 <- R1 + R3
i2. R3 <- R1 + R2

If there is a chance that i2 may be completed before i1 (i.e., with concurrent execution),
we must ensure that we do not store the result in register 3 before i1 has had a chance to
fetch the operands.

Write After Write (WAW)

(j tries to write an operand before it is written by i) A write after write (WAW) data
hazard may occur in a concurrent execution environment.

Example

For example:

i1. R2 <- R1 + R2
i2. R2 <- R4 + R7

We must delay the WB (Write Back) of i2 until the execution of i1.
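The three definitions above can be collected into one small classifier. This is a sketch in which destination and source registers are given explicitly rather than decoded from instructions:

```python
# Classify the dependences between instruction i and a later
# instruction j, given each instruction's destination and sources.
def classify(i_dest, i_srcs, j_dest, j_srcs):
    kinds = []
    if i_dest is not None and i_dest in j_srcs:
        kinds.append("RAW")   # j reads what i writes
    if j_dest is not None and j_dest in i_srcs:
        kinds.append("WAR")   # j overwrites what i reads
    if i_dest is not None and i_dest == j_dest:
        kinds.append("WAW")   # both write the same destination
    return kinds

print(classify("R2", ["R1", "R3"], "R4", ["R2", "R3"]))  # ['RAW']
print(classify("R4", ["R1", "R3"], "R3", ["R1", "R2"]))  # ['WAR']
# The WAW example above also reads R2, so both dependences appear:
print(classify("R2", ["R1", "R2"], "R2", ["R4", "R7"]))  # ['WAR', 'WAW']
```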

Structural hazards

A structural hazard occurs when a part of the processor's hardware is needed by two or
more instructions at the same time. A canonical example is a single memory unit that is
accessed both in the fetch stage where an instruction is retrieved from memory, and the
memory stage where data is written and/or read from memory.[1] They can often be
resolved by separating the component into orthogonal units (such as separate caches) or
bubbling the pipeline.

Control hazards (branch hazards)

Further information: branch (computer science)

Branching hazards (also known as control hazards) occur with branches. On many
instruction pipeline microarchitectures, the processor will not know the outcome of the
branch when it needs to insert a new instruction into the pipeline (normally in the fetch
stage).
Eliminating hazards

Generic

Pipeline bubbling

Bubbling the pipeline, also known as a pipeline break or a pipeline stall, is a method for
preventing data, structural, and branch hazards from occurring. As instructions are
fetched, control logic determines whether a hazard could/will occur. If this is true, then
the control logic inserts NOPs into the pipeline. Thus, before the next instruction (which
would cause the hazard) is executed, the previous one will have had sufficient time to
complete and prevent the hazard. If the number of NOPs is equal to the number of stages
in the pipeline, the processor has been cleared of all instructions and can proceed free
from hazards. This is called flushing the pipeline. All forms of stalling introduce a delay
before the processor can resume execution.

Data hazards

There are several main solutions and algorithms used to resolve data hazards:

• insert a pipeline bubble whenever a read after write (RAW) dependency is
encountered, guaranteed to increase latency, or
• utilize out-of-order execution to potentially prevent the need for pipeline bubbles
• utilize register forwarding to use data from later stages in the pipeline

In the case of out-of-order execution, the algorithm used can be:

• scoreboarding, in which case a pipeline bubble will only be needed when there is
no functional unit available
• the Tomasulo algorithm, which utilizes register renaming allowing the continual
issuing of instructions

We can delegate the task of removing data dependencies to the compiler, which can fill in
an appropriate number of NOP instructions between dependent instructions to ensure
correct operation, or re-order instructions where possible.
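The compiler-side NOP-filling pass just described can be sketched as follows. The assumptions are a fixed result latency of two instruction slots, no forwarding, and a tuple encoding of (op, dest, sources); real compilers use precise per-instruction latencies:

```python
# Sketch of compiler NOP insertion (assumed latency model, no forwarding).
def insert_nops(seq, latency=2):
    """Pad with NOPs so every result is at least `latency` instruction
    slots old before it is used."""
    out = []
    for op, dest, srcs in seq:
        # look back for the most recent producer of any source operand
        for back, prev in enumerate(reversed(out), start=1):
            if prev[0] != "NOP" and prev[1] in srcs and back <= latency:
                out.extend(("NOP", None, []) for _ in range(latency - back + 1))
                break
        out.append((op, dest, srcs))
    return out

prog = [("LW", "R1", []), ("ADD", "R3", ["R1", "R2"])]
print([op for op, _, _ in insert_nops(prog)])  # ['LW', 'NOP', 'NOP', 'ADD']
```

An independent pair, say a load followed by an ADD that uses other registers, passes through unpadded.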

Register forwarding

Forwarding involves feeding output data into a previous stage of the pipeline. Forwarding
is implemented by feeding back the output of an instruction into the previous stage(s) of
the pipeline as soon as the output of that instruction is available.

Example
NOTE: In the following examples, computed values are in bold, while Register
numbers are not.

For instance, let's say we want to write the value 3 to register 1, (which already contains a
6), and then add 7 to register 1 and store the result in register 2, i.e.:

Instruction 0: Register 1 = 6
Instruction 1: Register 1 = 3
Instruction 2: Register 2 = Register 1 + 7 = 10

Following execution, register 2 should contain the value 10. However, if Instruction 1
(write 3 to register 1) does not completely exit the pipeline before Instruction 2 starts
execution, it means that Register 1 does not contain the value 3 when Instruction 2
performs its addition. In such an event, Instruction 2 adds 7 to the old value of register 1
(6), and so register 2 would contain 13 instead, i.e.:

Instruction 0: Register 1 = 6
Instruction 2: Register 2 = Register 1 + 7 = 13
Instruction 1: Register 1 = 3
This error occurs because Instruction 2 reads Register 1 before Instruction 1 has
committed/stored the result of its write operation to Register 1. So when Instruction 2 is
reading the contents of Register 1, register 1 still contains 6, not 3.

Forwarding (described below) helps correct such errors by depending on the fact that the
output of Instruction 1 (which is 3) can be used by subsequent instructions before the
value 3 is committed to/stored in Register 1.

Forwarding applied to our example means that we do not wait to commit/store the output
of Instruction 1 in Register 1 (in this example, the output is 3) before making that output
available to the subsequent instruction (in this case, Instruction 2). The effect is that
Instruction 2 uses the correct (the more recent) value of Register 1: the commit/store was
made immediately and not pipelined.

With forwarding enabled, the ID/EX or Instruction Decode/Execution stage of the
pipeline now has two inputs: the value read from the register specified (in this example,
the value 6 from Register 1), and the new value of Register 1 (in this example, this value
is 3) which is sent from the next stage (EX/MEM) or Instruction Execute/Memory
Access. Additional control logic is used to determine which input to use.
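That selection logic amounts to a comparator and a multiplexer. The following is a sketch with a hypothetical encoding in which `ex_mem` is the (destination, value) pair produced by the instruction one stage ahead:

```python
# Sketch of the forwarding mux: prefer the newer, not-yet-committed
# value from EX/MEM over the value read from the register file in ID.
def operand_value(reg, regfile_value, ex_mem):
    if ex_mem is not None and ex_mem[0] == reg:
        return ex_mem[1]      # forwarded result of the instruction ahead
    return regfile_value      # otherwise the register file read stands

# Register file still holds 6; Instruction 1's result 3 sits in EX/MEM.
print(operand_value("R1", 6, ("R1", 3)) + 7)  # 10, not 13
```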

Control hazards (branch hazards)

To avoid control hazards, microarchitectures can:

• insert a pipeline bubble (discussed above), guaranteed to increase latency, or
• use branch prediction and essentially guesstimate which instructions to insert, in
which case a pipeline bubble will only be needed in the case of an incorrect
prediction
In the event that a branch causes a pipeline bubble after incorrect instructions have
entered the pipeline, care must be taken to prevent any of the wrongly loaded instructions
from having any effect on the processor state (beyond the energy wasted processing them
before they were discovered to have been loaded incorrectly).

Tomasulo algorithm

The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert
Tomasulo from IBM. It allows sequential instructions that would normally be stalled due
to certain dependencies to execute non-sequentially (out-of-order execution). It was first
implemented for the IBM System/360 Model 91's floating point unit.
This algorithm differs from scoreboarding in that it utilizes register renaming. Where
scoreboarding resolves Write-after-Write (WAW) and Write-after-Read (WAR) hazards
by stalling, register renaming allows the continual issuing of instructions. The Tomasulo
algorithm also uses a common data bus (CDB) on which computed values are broadcast
to all the reservation stations that may need it. This allows for improved parallel
execution of instructions which may otherwise stall under the use of scoreboarding.

Robert Tomasulo received the Eckert-Mauchly Award in 1997 for this algorithm.


• 1 Implementation concepts
• 2 Instruction lifecycle
o 2.1 Stage 1: issue
o 2.2 Stage 2: execute
o 2.3 Stage 3: write result
• 3 See also
• 4 External links

• 5 Bibliography

Implementation concepts

The following are the concepts necessary to the implementation of Tomasulo's
algorithm:

• Instructions are issued sequentially so that the effects of a sequence of instructions,
such as exceptions raised by these instructions, occur in the same order as they
would in a non-pipelined processor, regardless of the fact that they are being
executed non-sequentially.

• All general-purpose and reservation station registers hold either real or virtual
values. If a real value is unavailable to a destination register during the issue
stage, a virtual value is initially used. The functional unit that is computing the
real value is assigned as the virtual value. The virtual register values are converted
to real values as soon as the designated functional unit completes its computation.

• Functional units use reservation stations with multiple slots. Each slot holds
information needed to execute a single instruction, including the operation and the
operands. The functional unit begins processing when it is free and when all
source operands needed for an instruction are real.

Instruction lifecycle

The three stages listed below are the stages through which each instruction passes from
the time it is issued to the time its execution is complete.

Stage 1: issue

In the issue stage, instructions are issued for execution if all operands and reservation
stations are ready or else they are stalled. Registers are renamed in this step, eliminating
WAR and WAW hazards.

• Retrieve the next instruction from the head of the instruction queue. If the
instruction operands are currently in the registers
o If there is a matching empty reservation station (i.e., functional unit is
available) then: issue the instruction
o Else, there is not a matching empty reservation station (i.e., functional unit
is not available) then: stall the instruction until a station or buffer is free
• Else, the operands are not in the registers, then: use virtual values, the functional
unit calculating the real value, to keep track of the functional units that will
produce the operand

Stage 2: execute

In the execute stage, the instruction operations are carried out. Instructions are delayed in
this step until all of their operands are available, eliminating RAW hazards. Program
correctness is maintained through effective address calculation to prevent hazards
through memory.

1. If one or more of the operands is not yet available, then: wait for the operand to
become available on the CDB.
2. When all operands are available, then:
   o If the instruction is a load or store: compute the effective address when the
     base register is available, and place it in the load/store buffer
     - If the instruction is a load, then: execute as soon as the memory unit
       is available
     - Else, if the instruction is a store, then: wait for the value to be
       stored before sending it to the memory unit
   o Else, the instruction is an ALU operation, then: execute the
     instruction at the corresponding functional unit

Stage 3: write result

In the write result stage, ALU operation results are written back to registers and store
operations are written back to memory.
• If the instruction was an ALU operation
o If the result is available, then: write it on the CDB and from there into the
registers and any reservation stations waiting for this result
• Else, if the instruction was a store then: write the data to memory during this step
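The three stages can be condensed into a much-simplified sketch. The assumptions: execute is folded into write-result, the station names and two-instruction program are invented for illustration, and dispatch order is fixed by hand:

```python
# Much-simplified Tomasulo sketch: a register-status table maps each
# register to the reservation station that will produce it; write-result
# broadcasts the computed value on the CDB to all waiting stations.
regs = {"R1": 6, "R2": 2, "R3": 0, "R4": 0}
reg_status = {}   # architectural register -> name of producing station
stations = {}     # station name -> operation and operand/tag slots

def issue(name, op, dest, src1, src2):
    rs = {"op": op, "dest": dest, "Vj": None, "Vk": None,
          "Qj": reg_status.get(src1), "Qk": reg_status.get(src2)}
    if rs["Qj"] is None:
        rs["Vj"] = regs[src1]        # real value available at issue
    if rs["Qk"] is None:
        rs["Vk"] = regs[src2]
    stations[name] = rs
    reg_status[dest] = name          # rename: dest will come from `name`

def write_result(name):
    rs = stations.pop(name)
    value = rs["Vj"] + rs["Vk"] if rs["op"] == "ADD" else rs["Vj"] - rs["Vk"]
    for other in stations.values():  # CDB broadcast to waiting stations
        if other["Qj"] == name:
            other["Qj"], other["Vj"] = None, value
        if other["Qk"] == name:
            other["Qk"], other["Vk"] = None, value
    if reg_status.get(rs["dest"]) == name:   # not superseded by a WAW
        regs[rs["dest"]] = value
        del reg_status[rs["dest"]]

issue("add1", "ADD", "R3", "R1", "R2")  # R3 = R1 + R2
issue("sub1", "SUB", "R4", "R3", "R2")  # RAW on R3 becomes a Qj tag
write_result("add1")                    # broadcasts 8 on the CDB
write_result("sub1")
print(regs["R3"], regs["R4"])           # 8 6
```

The SUB issues before the ADD completes; its dependence is carried as a tag rather than a stall, which is the essence of the algorithm.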


Register renaming

In computer architecture, register renaming refers to a technique used to avoid
unnecessary serialization of program operations imposed by the reuse of registers by
those operations.


• 1 Problem definition
• 2 Data hazards
• 3 Architectural vs physical registers
• 4 Details: tag-indexed register file
• 5 Details: reservation stations
• 6 Comparison between the schemes
• 7 History

• 8 References

Problem definition

Programs are composed of instructions which operate on values. The instructions must
name these values in order to distinguish them from one another. A typical instruction
might say, add X and Y and put the result in Z. In this instruction, X, Y, and Z are the
names of storage locations.

In order to have a compact instruction encoding, most processor instruction sets have a
small set of special locations which can be directly named. For example, the x86
instruction set architecture has 8 integer registers, x86-64 has 16, many RISCs have 32,
and IA-64 has 128. In smaller processors, the names of these locations correspond
directly to elements of a register file.
Different instructions may take different amounts of time (e.g., CISC architecture). For
instance, a processor may be able to execute hundreds of instructions while a single load
from main memory is in progress. Shorter instructions executed while the load is
outstanding will finish first, thus the instructions are finishing out of the original program
order. Out-of-order execution has been used in most recent high-performance CPUs to
achieve some of their speed gains.

Consider this piece of code running on an out-of-order CPU:

1. R1=M[1024]
2. R1=R1+2
3. M[1032]=R1
4. R1=M[2048]
5. R1=R1+4
6. M[2056]=R1

Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but the processor
cannot finish 4 until 3 is done, because 3 would then write the wrong value.

We can eliminate this restriction by changing the names of some of the registers:

1. R1=M[1024] 4. R2=M[2048]
2. R1=R1+2 5. R2=R2+4
3. M[1032]=R1 6. M[2056]=R2

Now instructions 4, 5, and 6 can be executed in parallel with instructions 1, 2, and 3, so
that the program can be executed faster.

When possible, the compiler performs this renaming. The compiler is constrained in
many ways, primarily by the finite number of register names in the instruction set. Many
high performance CPUs have more physical registers than may be named directly in the
instruction set, so they rename registers in hardware to achieve additional parallelism.
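The hardware version of the renaming shown above can be sketched in a few lines. The encoding is an assumption for illustration: each instruction is a (destination, sources) pair, and the stores have no register destination:

```python
# Sketch of hardware register renaming: every write allocates a fresh
# physical register; reads look up the current mapping.
def rename(instrs, n_arch):
    mapping = {f"R{i}": f"P{i}" for i in range(n_arch)}
    next_phys = n_arch
    out = []
    for dest, srcs in instrs:
        srcs = [mapping[s] for s in srcs]     # read current names first
        if dest is not None:
            mapping[dest] = f"P{next_phys}"   # fresh name breaks WAR/WAW
            next_phys += 1
        out.append((mapping[dest] if dest else None, srcs))
    return out

# The six-instruction example: every instruction touches R1.
prog = [("R1", []), ("R1", ["R1"]), (None, ["R1"]),
        ("R1", []), ("R1", ["R1"]), (None, ["R1"])]
for dest, srcs in rename(prog, 2):
    print(dest, srcs)
```

After renaming, instruction 4 writes P4 while instruction 3 reads P3, so the second triple no longer conflicts with the first and the two can proceed in parallel.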

Data hazards

Main article: Data hazard

When more than one instruction references a particular location for an operand, either
reading it (as an input) or writing it (as an output), executing those instructions in an
order different from the original program order can lead to three kinds of data hazards:

Read-after-write (RAW)
A read from a register or memory location must return the value placed there by
the last write in program order, not some other write. This is referred to as a true
dependency or flow dependency, and requires the instructions to execute in
program order.
Write-after-write (WAW)
Successive writes to a particular register or memory location must leave that
location containing the result of the second write. This can be resolved by
squashing (synonyms: cancelling, annulling, mooting) the first write if necessary.
WAW dependencies are also known as output dependencies.
Write-after-read (WAR)
A read from a register or memory location must return the last prior value written to that
location, and not one written programmatically after the read. This is the sort of false
dependency that can be resolved by renaming. WAR dependencies are also known as
anti-dependencies.
Instead of delaying the write until all reads are completed, two copies of the location can
be maintained, the old value and the new value. Reads that precede, in program order, the
write of the new value can be provided with the old value, even while other reads that
follow the write are provided with the new value. The false dependency is broken and
additional opportunities for out-of-order execution are created. When all reads needing
the old value have been satisfied, it can be discarded. This is the essential concept behind
register renaming.

Anything that is read and written can be renamed. While the general-purpose and
floating-point registers are discussed the most, flag and status registers or even individual
status bits are commonly renamed as well.

Memory locations can also be renamed, although it is not commonly done to the extent
practised in register renaming. The Transmeta Crusoe processor's gated store buffer is a
form of memory renaming.

If programs refrained from reusing registers immediately, there would be no need for
register renaming. Some instruction sets (e.g., IA-64) specify very large numbers of
registers for specifically this reason. There are limitations to this approach:

• It is very difficult for the compiler to avoid reusing registers without large code
size increases. In loops, for instance, successive iterations would have to use
different registers, which requires replicating the code in a process called loop
unrolling (but see register rotation).
• Large numbers of registers require lots of bits to specify those registers, making
the code size increase.
• Many instruction sets historically specified smaller numbers of registers and
cannot be changed now.

Code size increases are important because when the program code is larger, the
instruction cache misses more often and the processor stalls waiting for new instructions.

Architectural vs physical registers

Machine language programs specify reads and writes to a limited set of registers
specified by the instruction set architecture (ISA). For instance, the Alpha ISA specifies
32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 bits wide.
These are the architectural registers. Programs written for processors running the Alpha
instruction set will specify operations reading and writing those 64 registers. If a
programmer stops the program in a debugger, she or he can observe the contents of these
64 registers (and a few status registers) to determine the progress of the machine.

One particular processor which implements this ISA, the Alpha 21264, has 80 integer and
72 floating-point physical registers. There are, on an Alpha 21264 chip, 80 physically
separate locations which can store the results of integer operations, and 72 locations
which can store the results of floating-point operations. (In fact, there are even more
locations than that, but those extra locations are not germane to the register renaming
schemes described here.)
Below are described two styles of register renaming, distinguished by the circuit which
holds data ready for an execution unit.

In all renaming schemes, the machine converts the architectural registers referenced in
the instruction stream into tags. Where the architectural registers might be specified by 3
to 5 bits, the tags are usually a 6 to 8 bit number. The rename file must have a read port
for every input of every instruction renamed every cycle, and a write port for every
output of every instruction renamed every cycle. Because the size of a register file
generally grows as the square of the number of ports, the rename file is usually physically
large and consumes significant power.

In the tag-indexed register file style, there is one large register file for data values,
containing one register for every tag. For example, if the machine has 80 physical
registers, then it would use 7-bit tags. 48 of the 128 possible tag values in this case are unused.

In this style, when an instruction is issued to an execution unit, the tags of the source
registers are sent to the physical register file, where the values corresponding to those
tags are read and sent to the execution unit.

In the reservation station style, there are many small associative register files, usually
one at the inputs to each execution unit. Each operand of each instruction in an issue
queue has a place for a value in one of these register files.

In this style, when an instruction is issued to an execution unit, the register file entries
corresponding to the issue queue entry are read and forwarded to the execution unit.

Architectural Register File or Retirement Register File (RRF)

The committed register state of the machine. RAM indexed by logical register
number. Typically written into as results are retired or committed out of a reorder buffer.
Future File
The most speculative register state of the machine. RAM indexed by logical
register number.
Active Register File
The Intel P6 group's term for Future File.
History Buffer
Typically used in combination with a future file. Contains the "old" values of
registers that have been overwritten. If the producer is still in flight, it may be
RAM indexed by history buffer number. After a branch misprediction, the machine
must use results from the history buffer: either they are copied to the future file, or
the future file lookup is disabled and the history buffer is CAM indexed by logical
register number.
Reorder Buffer (ROB)

Pretty much any structure that is sequentially (circularly) indexed on a per-operation
basis for instructions in flight. It differs from a history buffer in that the reorder
buffer typically comes after the future file (if it exists) and before the architectural
register file.

Reorder buffers come in data-less and data-ful versions.

In Willamette's ROB, the ROB entries point to registers in the physical register file
(PRF), and also contain other bookkeeping. This was also the first OOO design done by
Andy Glew, at Illinois with HaRRM.

In P6's ROB, the ROB entries contain data; there is no separate PRF. Data values from
the ROB are copied from the ROB to the RRF at retirement.

One small detail: if there is temporal locality in ROB entries (i.e., if instructions close
together in the von Neumann instruction sequence write back close together in time), it
may be possible to perform write combining on ROB entries and so have fewer ports than
a separate ROB/PRF would need. It is not clear whether this makes a difference in
practice.
ROBs usually don't have associative logic, and certainly none of the ROBs designed by
Andy Glew have CAMs. For many years Keith Diefendorff insisted that ROBs have
complex associative logic. The first ROB proposal may have had CAMs.

Details: tag-indexed register file


This is the renaming style used in the MIPS R10000, the Alpha 21264, and in the FP
section of the AMD Athlon.
In the renaming stage, every architectural register referenced (for read or write) is looked
up in an architecturally-indexed remap file. This file returns a tag and a ready bit. The
tag is non-ready if there is a queued instruction which will write to it that has not yet
executed. For read operands, this tag takes the place of the architectural register in the
instruction. For every register write, a new tag is pulled from a free tag FIFO, and a new
mapping is written into the remap file, so that future instructions reading the architectural
register will refer to this new tag. The tag is marked as unready, because the instruction
has not yet executed. The previous physical register allocated for that architectural
register is saved with the instruction in the reorder buffer, which is a FIFO that holds
the instructions in program order between the decode and graduation stages.
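The rename step just described can be sketched in a few lines of Python. This is an illustrative toy model, not any real design: the register counts, the `RenameFile` name, and the instruction format are all invented for the example.

```python
from collections import deque

class RenameFile:
    """Toy sketch of the tag-indexed rename stage (illustrative only)."""
    def __init__(self, num_arch=32, num_phys=80):
        # Each architectural register initially maps to a distinct tag.
        self.remap = {r: r for r in range(num_arch)}       # arch reg -> tag
        self.ready = {t: True for t in range(num_phys)}    # tag -> ready bit
        self.free_tags = deque(range(num_arch, num_phys))  # free tag FIFO

    def rename(self, dst, srcs):
        # Source operands: replace architectural numbers with current tags.
        src_tags = [(self.remap[s], self.ready[self.remap[s]]) for s in srcs]
        # Destination: pull a fresh tag from the free FIFO, and remember the
        # previous mapping so it can be freed at graduation (or restored on
        # a misprediction).
        prev_tag = self.remap[dst]
        new_tag = self.free_tags.popleft()
        self.remap[dst] = new_tag
        self.ready[new_tag] = False   # producer has not executed yet
        return new_tag, src_tags, prev_tag

rf = RenameFile()
# rename "r1 = r2 + r3" followed by "r4 = r1 + r1"
t1, srcs1, prev1 = rf.rename(1, [2, 3])
t2, srcs2, prev2 = rf.rename(4, [1, 1])
assert srcs2[0][0] == t1   # r1's read now refers to the new, non-ready tag
```

Note how the second instruction's reads of r1 automatically pick up the freshly allocated tag, which is exactly what removes WAR and WAW hazards.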

The instructions are then placed in various issue queues.

As instructions are executed, the tags for their results are broadcast, and the issue queues
match these tags against the tags of their non-ready source operands. A match means that
the operand is ready. The remap file also matches these tags, so that it can mark the
corresponding physical registers as ready.

When all the operands to an instruction in an issue queue are ready, that instruction is
ready to issue. The issue queues pick ready instructions to send to the various functional
units each cycle. Non-ready instructions stay in the issue queues. This unordered removal
of instructions from the issue queues is one of the things that makes them large and use
lots of power.
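The broadcast-and-match (wakeup) behaviour of the issue queues can be sketched as follows; the queue entry format and the tag numbers are invented for illustration:

```python
# Toy wakeup logic: each issue-queue entry holds an opcode and a list of
# (tag, ready) source operands.
def broadcast(queue, result_tag):
    """Mark any source operand waiting on result_tag as ready."""
    for entry in queue:
        entry["srcs"] = [(t, True if t == result_tag else r)
                         for (t, r) in entry["srcs"]]

def select_ready(queue):
    """Remove and return entries whose operands are all ready
    (the unordered removal described above)."""
    ready = [e for e in queue if all(r for _, r in e["srcs"])]
    for e in ready:
        queue.remove(e)
    return ready

q = [{"op": "add", "srcs": [(32, False), (3, True)]},
     {"op": "mul", "srcs": [(33, False), (32, False)]}]
broadcast(q, 32)          # the producer of tag 32 finishes executing
issued = select_ready(q)  # the add wakes up; the mul still waits on tag 33
```

A real issue queue does this matching with comparators on every entry every cycle, which is why unordered removal makes the structure large and power-hungry.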

Issued instructions read from a tag-indexed physical register file (bypassing just-
broadcast operands), then execute.

Execution results are written to tag-indexed physical register file, as well as broadcast to
the bypass network preceding each functional unit.

Graduation puts the previous tag for the written architectural register into the free queue
so that it can be reused for a newly decoded instruction.

An exception or branch misprediction causes the remap file to back up to the remap state
at the last valid instruction via a combination of state snapshots and cycling through the
previous tags in the in-order pre-graduation queue. Since this mechanism is required, and
since it can recover any remap state (not just the state before the instruction currently
being graduated), branch mispredictions can be handled before the branch reaches
graduation, potentially hiding the branch misprediction latency.

Details: reservation stations


This is the style used in the integer section of the AMD K7 and K8 designs.
In the renaming stage, every architectural register referenced for reads is looked up in
both the architecturally-indexed future file and the rename file. The future file read gives
the value of that register, if there is no outstanding instruction yet to write to it (i.e., it's
ready). When the instruction is placed in an issue queue, the values read from the future
file are written into the corresponding entries in the reservation stations. Register writes
in the instruction cause a new, non-ready tag to be written into the rename file. The tag
number is usually serially allocated in instruction order—no free tag FIFO is necessary.
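A minimal sketch of this rename stage, assuming a combined future file and rename file and serial tag allocation (the class and field names are invented):

```python
class RSRename:
    """Toy model of reservation-station-style renaming (illustrative only)."""
    def __init__(self, num_arch=16):
        self.future = {r: 0 for r in range(num_arch)}      # most-speculative values
        self.pending = {r: None for r in range(num_arch)}  # arch reg -> tag of in-flight producer
        self.next_tag = 0   # tags allocated serially in instruction order

    def rename_inst(self, dst, srcs):
        # Each source is either a value (future file hit, no in-flight
        # producer) or the tag of the instruction that will produce it.
        operands = [("value", self.future[s]) if self.pending[s] is None
                    else ("tag", self.pending[s]) for s in srcs]
        tag = self.next_tag        # serial allocation: no free-tag FIFO needed
        self.next_tag += 1
        self.pending[dst] = tag    # dst now has an outstanding producer
        return tag, operands

rr = RSRename()
rr.future[2] = 5                            # r2 is ready with value 5
t0, ops0 = rr.rename_inst(dst=1, srcs=[2])  # reads r2's value from the future file
t1, ops1 = rr.rename_inst(dst=3, srcs=[1])  # r1 is in flight: records t0 instead
```

The key contrast with the tag-indexed scheme is that ready operands carry their values into the reservation station at rename time, rather than just a physical register number.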

Just as with the tag-indexed scheme, the issue queues wait for non-ready operands to see
matching tag broadcasts. Unlike the tag-indexed scheme, matching tags cause the
corresponding broadcast value to be written into the issue queue entry's reservation station.

Issued instructions read their arguments from the reservation station, bypass just-
broadcast operands, and then execute. As mentioned earlier, the reservation station
register files are usually small, with perhaps eight entries.

Execution results are written to the reorder buffer, to the reservation stations (if the issue
queue entry has a matching tag), and to the future file if this is the last instruction to
target that architectural register (in which case the register is marked ready).

Graduation copies the value from the reorder buffer into the architectural register file.
The sole use of the architectural register file is to recover from exceptions and branch
mispredictions.

Exceptions and branch mispredictions, recognised at graduation, cause the architectural
register file to be copied to the future file, and all registers to be marked as ready in the
rename file.
There is usually no way to reconstruct the state of the future file for some instruction
intermediate between decode and graduation, so there is usually no way to do early
recovery from branch mispredictions.

Comparison between the schemes

In both schemes, instructions are inserted in-order into the issue queues, but are removed
out-of-order. If the queues do not collapse empty slots, then they will either have many
unused entries, or require some sort of variable priority encoding for when multiple
instructions are simultaneously ready to go. Queues that collapse holes have simpler
priority encoding, but require simple but large circuitry to advance instructions through
the queue.

Reservation stations have better latency from rename to execute, because the rename
stage finds the register values directly, rather than finding the physical register number,
and then using that to find the value. This latency shows up as a component of the branch
mispredict latency.
Reservation stations also have better latency from instruction issue to execution, because
each local register file is smaller than the large central file of the tag-indexed scheme.
Tag generation and exception processing are also simpler in the reservation station
scheme, as discussed below.

The physical register files used by reservation stations usually collapse unused entries in
parallel with the issue queue they serve, which makes these register files larger in
aggregate, more power-hungry, and more complicated than the simpler register files
used in a tag-indexed scheme. Worse yet, every entry in each reservation station can be
written by every result bus, so that a reservation-station machine with, e.g., 8 issue queue
entries per functional unit will typically have 9 times as many bypass networks as an
equivalent tag-indexed machine. Result forwarding thus takes much more power and area
than in a tag-indexed design.

Furthermore, the reservation station scheme has four places (Future File, Reservation
Station, Reorder Buffer and Architectural File) where a result value can be stored, where
the tag-indexed scheme has just one (the physical register file). Because the results from
the functional units, broadcast to all these storage locations, must reach a much larger
number of locations in the machine than in the tag-indexed scheme, this function
consumes more power, area, and time. Still, in machines equipped with very accurate
branch prediction schemes and if execute latencies are a major concern, reservation
stations can work remarkably well.

History

The IBM System/360 Model 91 was an early machine that supported out-of-order
execution of instructions; it used the Tomasulo algorithm, which uses register renaming.

In 1990, the POWER1 became the first microprocessor to use both register renaming and
out-of-order execution.

The original R10000 design had neither collapsing issue queues nor variable priority
encoding, and suffered starvation problems as a result—the oldest instruction in the
queue would sometimes not be issued until both instruction decode stopped completely
for lack of rename registers, and every other instruction had been issued. Later revisions
of the design starting with the R12000 used a partially variable priority encoder to
mitigate this problem.

Early out-of-order machines did not separate the renaming and ROB/PRF storage
functions. For that matter, some of the earliest, such as Sohi's RUU or the Metaflow
DCAF, combined scheduling, renaming, and storage all in the same structure.

Most modern machines do renaming by RAM indexing a map table with the logical
register number. E.g., P6 did this; future files do this, and have data storage in the same structure.
However, earlier machines used content-addressable memory (a type of hardware which
provides the functionality of an associative array) in the renamer. E.g., the HPSM RAT,
or Register Alias Table, essentially used a CAM on the logical register number in
combination with different versions of the register.

In many ways, the story of out-of-order microarchitecture has been how these CAMs
have been progressively eliminated. Small CAMs are useful; large CAMs are impractical.

The P6 microarchitecture was the first Intel microarchitecture to implement both out-
of-order execution and register renaming. The P6 microarchitecture manifested in
Pentium Pro, Pentium II, Pentium III, Pentium M, Core, and Core 2 microprocessors.


Basic concept
In-order processors

In earlier processors, the processing of instructions is normally done in these steps:

1. Instruction fetch.
2. If input operands are available (in registers for instance), the instruction is
dispatched to the appropriate functional unit. If one or more operands are
unavailable during the current clock cycle (generally because they are being
fetched from memory), the processor stalls until they are available.
3. The instruction is executed by the appropriate functional unit.
4. The functional unit writes the results back to the register file.

Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps:

1. Instruction fetch.
2. Instruction dispatch to an instruction queue (also called instruction buffer or
reservation stations).
3. The instruction waits in the queue until its input operands are available. The
instruction is then allowed to leave the queue before earlier, older instructions.
4. The instruction is issued to the appropriate functional unit and executed by that unit.
5. The results are queued.
6. The result is written back to the register file only after all older instructions
have had their results written back. This is called the graduation or
retire stage.
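The contrast between out-of-order completion and in-order retirement in the steps above can be illustrated with a toy simulation (the instruction names and latencies are invented):

```python
def run(program):
    """program: list of (name, latency). Returns (completion order, retire order)."""
    remaining = {i: lat for i, (_, lat) in enumerate(program)}
    completed_order, completed, retired = [], set(), []
    while len(retired) < len(program):
        for i in remaining:                      # one cycle of execution
            remaining[i] -= 1
        for i in [i for i, c in remaining.items() if c == 0]:
            completed_order.append(i)            # results finish out of order
            completed.add(i)
            del remaining[i]
        while len(retired) in completed:         # retire strictly in program order
            retired.append(len(retired))
    return completed_order, retired

# A long-latency load followed by two short ALU ops: the ALU ops finish
# first, but their results are not retired until the load retires.
done, order = run([("load", 5), ("add", 1), ("sub", 1)])
assert done == [1, 2, 0] and order == [0, 1, 2]
```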
The key concept of OoO processing is to allow the processor to avoid a class of stalls that
occur when the data needed to perform an operation are unavailable. In the outline above,
the OoO processor avoids the stall that occurs in step (2) of the in-order processor when
the instruction is not completely ready to be processed due to missing data.

OoO processors fill these "slots" in time with other instructions that are ready, then re-
order the results at the end to make it appear that the instructions were processed as
normal. The order of the instructions in the original computer code is known as
program order; in the processor they are handled in data order, the order in which the
data (operands) become available in the processor's registers. Fairly complex circuitry is
needed to convert from one ordering to the other and maintain a logical ordering of the
output; the processor itself runs the instructions in seemingly random order.

The benefit of OoO processing grows as the instruction pipeline deepens and the speed
difference between main memory (or cache memory) and the processor widens. On
modern machines, the processor runs many times faster than the memory, so during the
time an in-order processor spends waiting for data to arrive, it could have processed a
large number of instructions.

Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues which
allow the dispatch step to be decoupled from the issue step, and the graduation stage to
be decoupled from the execute stage. An early name for the paradigm was decoupled
architecture. In the earlier in-order processors, these stages operated in a fairly lock-step,
pipelined fashion.

To avoid false operand dependencies, which would reduce how often instructions could
be issued out of order, a technique called register renaming is used. In
this scheme, there are more physical registers than defined by the architecture. The
physical registers are tagged so that multiple versions of the same architectural register
can exist at the same time.

Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and
exceptions/traps. The results queue allows programs to be restarted after an exception,
which requires the instructions to be completed in program order. The queue allows
results to be discarded due to mispredictions on older branch instructions and exceptions
taken on older instructions.

The ability to issue instructions past branches which have yet to resolve is known as
speculative execution.
Micro-architectural choices

• Are the instructions dispatched to a centralized queue or to multiple distributed
queues?

IBM PowerPC processors use queues which are distributed among the different
functional units while other Out-of-Order processors use a centralized queue.
IBM uses the term reservation stations for their distributed queues.

• Is there an actual results queue or are the results written directly into a register
file? For the latter, the queueing function is handled by register maps which hold
the register renaming information for each instruction in flight.

Early Intel out-of-order processors use a results queue called a re-order buffer,
while most later Out-of-Order processors use register maps.
More precisely: Intel P6 family microprocessors have both a ROB re-order buffer
and a RAT register map mechanism. The ROB was motivated mainly by branch
misprediction recovery.
The Intel P6 family was among the earliest OoO processors. It was supplanted by
the Intel Pentium 4 (Willamette) microarchitecture, but returned after the
"right hand turn" and, at the time of writing (2009), was still Intel's flagship
microprocessor family.


Data dependency
From Wikipedia, the free encyclopedia

A data dependency in computer science is a situation in which a program statement
(instruction) refers to the data of a preceding statement. In compiler theory, the technique
used to discover data dependencies among statements (or instructions) is called
dependence analysis.

There are two types of dependencies: data and control.



Data dependencies

Assuming statements S1 and S2, S2 depends on S1 if:

[I(S1) ∩ O(S2)] ∪ [O(S1) ∩ I(S2)] ∪ [O(S1) ∩ O(S2)] ≠ Ø


• I(Si) is the set of memory locations read by Si and

• O(Sj) is the set of memory locations written by Sj
• and there is a feasible run-time execution path from S1 to S2

This condition is called the Bernstein condition, named after A. J. Bernstein.

Three cases exist:

• True (data) dependence: O(S1) ∩ I(S2) ≠ Ø, S1 → S2, and S1 writes something read
by S2
• Anti-dependence: I(S1) ∩ O(S2) ≠ Ø, the mirror relationship of true dependence
• Output dependence: O(S1) ∩ O(S2) ≠ Ø, S1 → S2, and both write the same memory
location
True dependency

A true dependency, also known as a data dependency, occurs when an instruction
depends on the result of a previous instruction:

1. A = 3
2. B = A
3. C = B

Instruction 3 is truly dependent on instruction 2, as the final value of C depends on the
instruction updating B. Instruction 2 is truly dependent on instruction 1, as the final value
of B depends on the instruction updating A. Since instruction 3 is truly dependent upon
instruction 2 and instruction 2 is truly dependent on instruction 1, instruction 3 is also
truly dependent on instruction 1. Instruction level parallelism is therefore not an option in
this example. [1]

Anti-dependency

An anti-dependency occurs when an instruction requires a value that is later updated. In
the following example, instruction 3 anti-depends on instruction 2 — the ordering of
these instructions cannot be changed, nor can they be executed in parallel (possibly
changing the instruction ordering), as this would affect the final value of A.

1. B = 3
2. A = B + 1
3. B = 7

An anti-dependency is an example of a name dependency. That is, renaming of variables
could remove the dependency, as in the next example:

1. B = 3
N. B2 = B
2. A = B2 + 1
3. B = 7

A new variable, B2, has been declared as a copy of B in a new instruction, instruction N.
The anti-dependency between 2 and 3 has been removed, meaning that these instructions
may now be executed in parallel. However, the modification has introduced a new
dependency: instruction 2 is now truly dependent on instruction N, which is truly
dependent upon instruction 1. As true dependencies, these new dependencies are
impossible to safely remove. [1]

Output dependency

An output dependency occurs when the ordering of instructions will affect the final
output value of a variable. In the example below, there is an output dependency between
instructions 3 and 1 — changing the ordering of instructions in this example will change
the final value of B, thus these instructions cannot be executed in parallel.

1 A = 2 * X
2 B = A / 3
3 A = 9 * Y

As with anti-dependencies, output dependencies are name dependencies. That is, they
may be removed through renaming of variables, as in the below modification of the
above example:

1 A2 = 2 * X
2 B = A2 / 3
3 A = 9 * Y

A commonly used naming convention for data dependencies is the following: Read-after-
Write (true dependency), Write-after-Write (output dependency), and Write-After-Read
(anti-dependency). [1]

Control Dependency

An instruction B is control dependent on a preceding instruction A if the outcome of A
determines whether B should execute or not. In the following example, instruction 2 is control
dependent on instruction 1.

1. if a == b goto AFTER
2. A = 2 * X

Intuitively, there is control dependence between two statements S1 and S2 if

• S1 could possibly be executed before S2
• The outcome of S1's execution will determine whether S2 will be executed.

A typical example is the control dependence between an if statement's condition
part and the statements in the corresponding true/false bodies.

A formal definition of control dependence can be presented as follows:

A statement S2 is said to be control dependent on another statement S1 iff

• there exists a path P from S1 to S2 such that every statement Si ≠ S1 within P will
be followed by S2 in each possible path to the end of the program and
• S1 will not necessarily be followed by S2, i.e. there is an execution path from S1
to the end of the program that does not go through S2.

Expressed with the help of (post-)dominance the two conditions are equivalent to

• S2 post-dominates all Si
• S2 does not post-dominate S1

Implications

Conventional programs are written assuming the sequential execution model. Under this
model, instructions execute one after the other, atomically (i.e., at any given point of time
only one instruction is executed) and in the order specified by the program.

However, dependencies among statements or instructions may hinder parallelism —
parallel execution of multiple instructions, either by a parallelizing compiler or by a
processor exploiting instruction-level parallelism. Recklessly executing multiple
instructions without considering related dependences risks producing wrong
results, namely hazards.

Reservation Stations are decentralized features of the microarchitecture of a CPU that
allow for register renaming, and are used by the Tomasulo algorithm for dynamic
instruction scheduling.

Reservation stations permit the CPU to fetch and re-use a data value as soon as it has
been computed, rather than waiting for it to be stored in a register and re-read. When
instructions are issued, they can designate the reservation station from which they want
their input to read. When multiple instructions need to write to the same register, all can
proceed, and only the (logically) last one need actually be written. The reservation
station checks whether the operands are available (RAW hazard) and whether the
execution unit is free (structural hazard) before starting execution.

Instructions are stored with available parameters, and executed when ready. Results are
identified by the unit that will execute the corresponding instruction. Register
renaming implicitly resolves WAR and WAW hazards. Since this is a fully-associative structure, it
has a very high cost in comparators (need to compare all results returned from processing
units with all stored addresses).

In Tomasulo's algorithm, instructions are issued in sequence to Reservation Stations,
which buffer the instruction as well as the operands of the instruction. If the operand is
not available, the Reservation Station listens on a Common Data Bus for the operand to
become available. When the operand becomes available, the Reservation Station buffers
it, and the execution of the instruction can begin.

Functional Units (such as an adder or a multiplier), each have their own corresponding
Reservation Station. The output of the Functional Unit connects to the Common Data
Bus, where Reservation Stations are listening for the operands they need.
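A reservation-station entry listening on the common data bus can be sketched as follows; the field names (`qj`/`vj` for the first operand's pending tag and captured value) and the tag strings are illustrative, not from any real implementation:

```python
# Toy reservation-station entry snooping a common data bus (CDB).
class RSEntry:
    def __init__(self, op, qj=None, vj=None, qk=None, vk=None):
        self.op = op
        self.qj, self.vj = qj, vj   # pending producer tag, or captured value
        self.qk, self.vk = qk, vk

    def snoop(self, tag, value):
        # Capture an operand the moment its producer broadcasts on the CDB.
        if self.qj == tag:
            self.qj, self.vj = None, value
        if self.qk == tag:
            self.qk, self.vk = None, value

    def ready(self):
        return self.qj is None and self.qk is None

rs = RSEntry("add", qj="mul1", vk=7)   # waits on the multiplier's result
rs.snoop("mul1", 42)                   # result broadcast on the CDB
assert rs.ready() and rs.vj == 42
```

This capture-on-broadcast behaviour is what lets the instruction begin executing without ever re-reading the register file.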


Scoreboarding is a centralized method, used in the CDC 6600 computer, for
dynamically scheduling a pipeline so that the instructions can execute out of order when
there are no conflicts and the hardware is available. In a scoreboard, the data
dependencies of every instruction are logged. Instructions are released only when the
scoreboard determines that there are no conflicts with previously issued and incomplete
instructions. If an instruction is stalled because it is unsafe to continue, the scoreboard
monitors the flow of executing instructions until all dependencies have been resolved
before the stalled instruction is issued.


Stages

Instructions are decoded in order and go through the following four stages.

1. Issue: The system checks which registers will be read and written by this
instruction. This information is remembered as it will be needed in the following
stages. In order to avoid output dependencies (WAW - Write after Write) the
instruction is stalled until instructions intending to write to the same register are
completed. The instruction is also stalled when required functional units are
currently busy.
2. Read operands: After an instruction has been issued and correctly allocated to
the required hardware module, the instruction waits until all operands become
available. This procedure resolves read dependencies (RAW - Read after Write)
because registers which are intended to be written by another instruction are not
considered available until they are actually written.
3. Execution: When all operands have been fetched, the functional unit starts its
execution. After the result is ready, the scoreboard is notified.
4. Write Result: In this stage the result is about to be written to its destination
register. However, this operation is delayed until earlier instructions—which
intend to read registers this instruction wants to write to—have completed their
read operands stage. This way, so-called anti-dependencies (WAR - Write after
Read) can be addressed.

Data structure

To control the execution of the instructions, the scoreboard maintains three status tables:

• Instruction Status: Indicates, for each instruction being executed, which of the
four stages it is in.
• Functional Unit Status: Indicates the state of each functional unit. Each functional
unit maintains nine fields in the table:
o Busy: Indicates whether the unit is being used or not
o Op: Operation to perform in the unit (e.g. MUL, DIV or MOD)
o Fi: Destination register
o Fj,Fk: Source-register numbers
o Qj,Qk: Functional units that will produce the source registers Fj, Fk
o Rj,Rk: Flags that indicate when Fj, Fk are ready
• Register Status: Indicates, for each register, which function unit will write results
into it.

The algorithm

The detailed algorithm for the scoreboard control is described below:

function issue(op, dst, src1, src2)
wait until (!Busy[FU] AND !Result[dst]); // FU can be any
functional unit that can execute operation op
Busy[FU] ← Yes;
Op[FU] ← op;
Fi[FU] ← dst;
Fj[FU] ← src1;
Fk[FU] ← src2;
Qj[FU] ← Result[src1];
Qk[FU] ← Result[src2];
Rj[FU] ← not Qj;
Rk[FU] ← not Qk;
Result[dst] ← FU;

function read_operands(FU)
wait until (Rj[FU] AND Rk[FU]);
Rj[FU] ← No;
Rk[FU] ← No;

function execute(FU)
// Execute whatever FU must do

function write_back(FU)
wait until (∀f {(Fj[f]≠Fi[FU] OR Rj[f]=No) AND (Fk[f]≠Fi[FU] OR Rk[f]=No)});
foreach f do
if Qj[f]=FU then Rj[f] ← Yes;
if Qk[f]=FU then Rk[f] ← Yes;
Result[Fi[FU]] ← 0;
Busy[FU] ← No;
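The control functions above translate fairly directly into an executable sketch. The following Python is illustrative only (the FU names, register names, and instruction format are invented), and it models just the status tables and hazard checks, not timing or the execute stage:

```python
class Scoreboard:
    """Toy model of the scoreboard status tables and hazard checks."""
    def __init__(self, fus):
        self.fu = {f: {"Busy": False, "Op": None, "Fi": None, "Fj": None,
                       "Fk": None, "Qj": None, "Qk": None,
                       "Rj": False, "Rk": False} for f in fus}
        self.result = {}   # register -> FU that will write it

    def can_issue(self, fu, dst):
        # Stall on a structural hazard (FU busy) or a WAW hazard
        # (another in-flight instruction will write dst).
        return not self.fu[fu]["Busy"] and dst not in self.result

    def issue(self, fu, op, dst, src1, src2):
        e = self.fu[fu]
        e.update(Busy=True, Op=op, Fi=dst, Fj=src1, Fk=src2,
                 Qj=self.result.get(src1), Qk=self.result.get(src2))
        e["Rj"] = e["Qj"] is None   # ready iff no pending producer (RAW)
        e["Rk"] = e["Qk"] is None
        self.result[dst] = fu

    def can_write_back(self, fu):
        # WAR check: wait while any other FU still holds a ready,
        # not-yet-read operand in the register we are about to write.
        dst = self.fu[fu]["Fi"]
        return all(not (e["Busy"] and ((e["Fj"] == dst and e["Rj"]) or
                                       (e["Fk"] == dst and e["Rk"])))
                   for f, e in self.fu.items() if f != fu)

sb = Scoreboard(["add1", "mul1"])
sb.issue("add1", "ADD", "F2", "F0", "F1")
assert not sb.can_issue("mul1", "F2")   # WAW: F2 is already claimed
sb.issue("mul1", "MUL", "F4", "F1", "F2")
assert sb.fu["mul1"]["Qk"] == "add1"    # RAW: mul1 waits on add1 for F2
```

The result broadcast at write back (clearing the Qj/Qk fields of waiting units, as in the foreach loop of the pseudocode) is omitted for brevity.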

Remarks

The scoreboarding method must stall the issue stage when there is no functional unit
available. In this case, future instructions that could potentially be executed will wait
until the structural hazard is resolved. Other techniques, such as the Tomasulo
algorithm, can avoid this structural hazard and also resolve WAR and WAW dependencies
with register renaming.