Presence of a dependence indicates the potential for a hazard, but the actual hazard and the length
of any stall are properties of the pipeline
HW/SW goal:
exploit parallelism by preserving program order only where it affects the
outcome of the program
Data Dependences
Instruction J is data dependent on instruction I if J uses a result produced by I.
e.g.1
I: add r1,r2,r3
J: sub r4,r1,r3 ;J reads r1, which I writes
e.g.2
Loop: L.D F0,0(R1) ;F0=array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer 8 bytes
BNEZ R1,Loop ;branch R1!=zero
The floating-point instructions are data dependent through F0 and F4; DADDUI and BNEZ are data dependent through the integer data in R1.
Name Dependences
Two instructions are name dependent when they use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name
To Overcome
Instructions involved in a name dependence can execute simultaneously if the name used
in the instructions is changed so that the instructions do not conflict (i.e., register renaming, either by
the compiler or by HW); a C-level sketch follows.
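A minimal C-level sketch of renaming (variable names are illustrative): the second write to t creates an antidependence with the earlier read; renaming removes the conflict.

int war_before(int a, int b, int c, int d) {
    int t, u, v;
    t = a + b;      /* I: writes the name t */
    u = t * 2;      /* J: reads t */
    t = c + d;      /* K: rewrites t, but no data flows from I or J */
    v = t * 3;
    return u + v;
}

int war_after(int a, int b, int c, int d) {
    int t1, t2, u, v;
    t1 = a + b;     /* I writes t1 */
    u  = t1 * 2;    /* J reads t1 */
    t2 = c + d;     /* K writes t2: the name conflict is gone, so K can
                       be scheduled before or alongside I and J */
    v  = t2 * 3;
    return u + v;
}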
Control Dependencies
Every instruction is control dependent on some set of branches, and, in general, these
control dependencies must be preserved to preserve program order.
e.g.
if p1 {
S1;
};
if p2 {
S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1
Data Hazards
A WAR (write-after-read) hazard corresponds to an antidependence
Tournament Predictors
1) Branch Predictors
The branch-history table is effectively a cache of recent branch outcomes.
1-Bit Predictor
For each branch, keep track of what
happened last time and use that outcome as
the prediction
Problem: in a loop, a 1-bit BHT will cause two
mispredictions:
At the end of the loop, when it exits instead of looping
as before
On the first iteration the next time through the loop,
when it predicts exit instead of looping
Performance = f(accuracy, cost of misprediction)
2-Bit Predictor
For each branch,
maintain a 2-bit saturating counter:
if the branch is taken:
counter = min(3,counter+1)
if the branch is not taken:
counter = max(0,counter-1)
If (counter >= 2), predict taken, else predict not taken
Advantage:
a few atypical branches will not influence the prediction
Especially useful when multiple branches share the same counter
Can be easily extended to N-bits (in most processors, N=2)
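A minimal C sketch of this update rule for one counter (the BHT would hold one such counter per entry):

typedef unsigned char counter2;   /* holds values 0..3 */

int predict_taken(counter2 c) {
    return c >= 2;                /* predict taken in states 2 and 3 */
}

counter2 update2bit(counter2 c, int taken) {
    if (taken) return c < 3 ? c + 1 : 3;   /* counter = min(3, counter+1) */
    else       return c > 0 ? c - 1 : 0;   /* counter = max(0, counter-1) */
}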
DSUBUI R3,R1,#2
BNEZ R3,L1 ;branch b1 (aa!=2)
DADD R1,R0,R0 ;aa=0
L1: DSUBUI R3,R2,#2
BNEZ R3,L2 ;branch b2(bb!=2)
DADD R2,R0,R0 ; bb=0
L2: DSUBU R3,R1,R2 ;R3=aa-bb
BEQZ R3,L3 ;branch b3 (aa==bb)
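The assembly above corresponds to source like the following (a reconstruction of the classic correlating-branch example). Note the correlation: if b1 and b2 are both not taken, then aa and bb are both set to 0, so b3 will be taken.

void example(int aa, int bb) {
    if (aa == 2) aa = 0;   /* b1 (BNEZ): taken when aa != 2 */
    if (bb == 2) bb = 0;   /* b2 (BNEZ): taken when bb != 2 */
    if (aa != bb) {        /* b3 (BEQZ): taken, skipping the body, when aa == bb */
        aa = bb;           /* placeholder body */
    }
}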
A predictor that uses only the behavior of a single branch to predict the outcome of that branch
can never capture this behavior.
2) Two-Level Predictors
Behavior of recent branches selects between four predictions of next branch, updating just
that prediction
Total memory bits required = 2^m × n × 2^p
2^m banks of memory selected by the global branch
history (which is just a shift register)
Use p bits of the branch address to select row
Get the n predictor bits in the entry to make the
decision
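A C sketch of such a predictor with m = 2 bits of global history, n = 2 predictor bits, and an assumed p = 10 bits of branch address (all names illustrative):

#define P 10                            /* branch-address index bits (assumed) */
static unsigned char table[1 << P][4];  /* 2^p rows x 2^m columns of 2-bit counters */
static unsigned ghr;                    /* global history shift register (m = 2 bits) */

int gp_predict(unsigned pc) {
    unsigned row = (pc >> 2) & ((1 << P) - 1);  /* p bits of the branch address */
    return table[row][ghr & 3] >= 2;            /* n predictor bits make the decision */
}

void gp_train(unsigned pc, int taken) {
    unsigned row = (pc >> 2) & ((1 << P) - 1);
    unsigned char *c = &table[row][ghr & 3];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    ghr = ((ghr << 1) | (unsigned)taken) & 3;   /* shift in the outcome */
}

Total storage here is 2^m × n × 2^p = 4 × 2 × 1024 = 8192 bits, matching the formula above.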
3) Tournament Predictors
Adaptively combine two predictors (typically one based on global history and one based on local, per-branch history), with a selector that tracks which predictor has been more accurate for each branch.
Q2 a) Advanced compiler support for detecting and enhancing loop-level parallelism
It is also possible to have a loop-carried dependence that does not prevent parallelism
Finding Dependences
Finding the dependences is important for three tasks:
(1)good scheduling of code,
(2) determining which loops might contain parallelism, and
(3) eliminating name dependences.
The GCD test: suppose a loop stores into an array element indexed by a*i + b and later loads from an element indexed by c*i + d. The test is based on the observation that if a loop-carried dependence exists, then GCD(c,a) must divide (d - b).
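As a sketch, consider the classic example in C:

void scale(double *x) {
    /* x[2*i + 3] is written: a = 2, b = 3; x[2*i] is read: c = 2, d = 0.
       GCD(c,a) = 2 would have to divide d - b = -3; it does not,
       so no loop-carried dependence is possible. */
    for (int i = 1; i <= 100; i++)
        x[2*i + 3] = x[2*i] * 5.0;
}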
Copy Propagation
Within a basic block, algebraic simplifications of expressions and an optimization called copy propagation, which eliminates operations that copy values, can be used to simplify sequences.
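A minimal C-level sketch of copy propagation (names are illustrative):

int cp_before(int x) {
    int y = x;        /* a copy operation */
    return y + 4;     /* uses the copy */
}

int cp_after(int x) {
    return x + 4;     /* the copy is propagated and becomes dead code */
}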
Software Pipelining
Loop Unrolling:
1. Unroll the loop
2. Schedule
The scheduler essentially interleaves instructions from different loop
iterations, so as to separate the dependent instructions that occur within a single
loop iteration and to eliminate stalls.
[Figure: iterations 0-4 of the original loop, with a software-pipelined iteration formed by taking one instruction from each]
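A C-level sketch of the resulting schedule for the loop x[i] = x[i] + s (illustrative, not the MIPS example the assumptions below refer to; assumes n >= 2). In the steady state, each iteration stores the result of iteration i, computes the add for iteration i+1, and loads for iteration i+2:

void swp(double *x, double s, int n) {
    double loaded = x[0];          /* prologue: load for iteration 0 */
    double summed = loaded + s;    /* add for iteration 0 */
    loaded = x[1];                 /* load for iteration 1 */
    for (int i = 0; i < n - 2; i++) {
        x[i] = summed;             /* store result of iteration i */
        summed = loaded + s;       /* add for iteration i+1 */
        loaded = x[i + 2];         /* load for iteration i+2 */
    }
    x[n - 2] = summed;             /* epilogue */
    x[n - 1] = loaded + s;
}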
Assumptions:
DADDUI is scheduled before the ADD.D, and the L.D instruction, with an adjusted
offset, is placed in the branch delay slot.
Loads are therefore speculative: they are done on the speculation that the results are
needed later.
Original code:
I1: add r3, r1, r2
I2: sub r2, r3, r5
I3: P1, P2 = cmp(r2 == 3)
I4: <P1> sub r7, r3, r4
I5: <P1> rori r7, 3
I6: <P1> j I10
I7: <P2> lw r4, 0(r6)
I8: <P2> muli r4, r3, r6
Modified Code:
I1: add r3, r1, r2
I2: sub r2, r3, r5
I3: lw r4, 0(r6) ; speculative load
I4: P1, P2 = cmp(r2 == 3)
I5: <P1> sub r7, r3, r4
I6: <P1> rori r7, 3
I7: <P1> j I10
I9: <P2> chk.s r4 ; load check
I10: <P2> muli r4, r3, r6 ; load used here
Write Invalidate
All processors continuously snoop on the bus, watching addresses. A processor invalidates
its copy of a block when it observes a write to that block's address by another processor.
States of a block:
Invalid: another core has modified the block
Shared: potentially shared with other caches
Modified: updated in the private cache; implies that the block is exclusive
Requires an extra state bit (shared/exclusive) associated with the valid bit and dirty bit of
each block
Requests to the cache can come from a core or from the bus.
Implements a finite-state machine
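A simplified C sketch of the per-block next-state function for such a write-invalidate protocol (event names are illustrative; a real controller also generates bus transactions and write-backs):

typedef enum { INVALID, SHARED, MODIFIED } block_state;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } event_t;

block_state next_state(block_state s, event_t e) {
    switch (e) {
    case CPU_READ:        /* read miss fetches the block as shared */
        return s == INVALID ? SHARED : s;
    case CPU_WRITE:       /* write gains ownership, invalidating other copies */
        return MODIFIED;
    case BUS_READ_MISS:   /* another core reads: supply data, demote to shared */
        return s == MODIFIED ? SHARED : s;
    case BUS_WRITE_MISS:  /* another core writes: invalidate our copy */
        return INVALID;
    }
    return s;
}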
Extensions:
MESI protocol
Adds an Exclusive state to indicate when a clean block is resident in only one cache
Prevents the need to send an invalidate on a write to an exclusive block
Core i7 uses MESIF: F = Forward
A read miss to a block in the E state changes that block to the S state
MOESI protocol
Owned state: indicates that the block is owned by that cache and is out-of-date in
memory.
A block changes from Modified to Owned, rather than to Shared, without being written
back to memory; other caches keep the block in the Shared state.
Processor 1:
A=0
A=1
if (B==0) ...
Processor 2:
B=0
B=1
if (A==0) ...
Relax W→R: total store ordering; no additional synchronization
Relax W→W: partial store order
Relax R→W and R→R: weak ordering and release consistency
Q4 a)
i) Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
HW advantages:
Dynamic disambiguation of memory addresses at runtime
Dynamic branch prediction, which works even when branch behavior is unknown at compile time
Precise exception model, even for speculated instructions
No compensation or bookkeeping code required
Code runs across implementations without recompilation
Q4 b)
i) Cache Performance
Length of memory latency: what to consider as the start and the end of a memory
operation in an out-of-order processor
Length of latency overlap: what is the start of overlap with the processor
Which has the lower miss rate: a 16 KB instruction cache plus a 16 KB data cache, or a 32
KB unified cache?
Split caches offer two memory ports per clock cycle, thereby avoiding the structural
hazard; they have a better average memory access time than the single-ported unified
cache despite having a worse effective miss rate.
Average memory access time = Hit time + Miss rate × Miss penalty
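For example, with an assumed 1-cycle hit time, 2% miss rate, and 100-cycle miss penalty (illustrative numbers): average memory access time = 1 + 0.02 × 100 = 3 cycles.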
Cache Optimization
Reducing the miss rate: larger block size, larger cache size, and higher associativity
Reducing the miss penalty: multilevel caches and giving reads priority over
writes
Reducing the time to hit in the cache: avoiding address translation when
indexing the cache
ii) Synchronization
Atomic exchange
Swaps register with memory location
Test-and-set
Sets under a condition:
tests whether a memory location is 0 and sets it to 1 if so.
Fetch-and-increment
Reads original value from memory and increments it in memory
Requires the memory read and write to occur as one uninterruptible instruction
Atomic Exchange
LOCK = 0 => FREE
LOCK = 1 => UNAVAILABLE
Lock value in memory
EXCH R,M
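A minimal C11 sketch of a spin lock built on atomic exchange, following the 0 = FREE / 1 = UNAVAILABLE convention above:

#include <stdatomic.h>

typedef atomic_int lock_t;   /* 0 = free, 1 = unavailable */

void lock_acquire(lock_t *lock) {
    /* Atomically swap in 1; if the old value was 1,
       another thread holds the lock, so keep spinning. */
    while (atomic_exchange(lock, 1) == 1)
        ;                    /* spin */
}

void lock_release(lock_t *lock) {
    atomic_store(lock, 0);   /* mark the lock free */
}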
Q5 a) Multiprocessor Architecture
Shared memory:
Communication among threads through a shared address space
A memory reference can be made by any processor to any memory
location.
Types
Centralized shared-memory multiprocessor
Distributed shared memory (DSM)
Q5 b)
i) Centralized Shared Memory Architectures
Q6 a) Intel Montecito
Features
Dual core Itanium 2, each core dual threaded
1.7 billion transistors, 21.5 mm x 27.7 mm die
27 MB of on-chip cache in three levels
Not shared among cores
1.8+ GHz, 100 W
Single-thread enhancements
Extra shifter improves performance of crypto codes by 100%
Improved branch prediction
Improved data and control speculation recovery
Separate L2 instruction and data caches buy a 7% improvement over Itanium 2; the L2I is
four times bigger (1 MB)
Asynchronous 12 MB L3 cache
Dual threads
SMT only for cache, not for core resources
Simulations showed high resource utilization at core level, but low utilization of cache
Branch predictor is still shared but uses thread ID tags
Thread switch is implemented by flushing the pipeline
More like coarse-grained multithreading
Cell Processor: Power Processing Element (PPE)
Compatibility processor: runs legacy code, compilers, libraries, etc.
Performs all system-level functions,
including starting the other cores
Synergistic Processing Element (SPE)
Offload processor
Dual issue, in-order, VLIW inspired
SIMD processor
128 x 128b register file
256KB local store, no I$, no D$
Runs a small program (placed in LS)
Offloads system functions to the PPE
MFC (memory flow controller)
a programmable DMA engine and on-chip network interface
Performance
1.83 GFlop/s (double precision)
25.6 GFlop/s (single precision)
51.2 GB/s access to the local store
25.6 GB/s access to the network
Threading Model
For each SPE program, the PPE must:
Create an SPE context
Load the SPE program (embedded in the binary)
Create a pthread to run it
Typically, the work is split into two phases: code that runs on the SPEs and code that runs
on the PPE, with a barrier in between (see the sketch below).
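A minimal sketch of those steps, assuming the libspe2 interface (spe_context_create, spe_program_load, spe_context_run; exact signatures may differ across SDK versions) and an embedded SPE binary named spu_program:

#include <pthread.h>
#include <libspe2.h>

extern spe_program_handle_t spu_program;   /* embedded SPE binary (assumed name) */

static void *run_spe(void *arg) {
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);   /* run to completion */
    return NULL;
}

int main(void) {
    spe_context_ptr_t ctx = spe_context_create(0, NULL); /* 1. create an SPE context */
    spe_program_load(ctx, &spu_program);                 /* 2. load the SPE program */
    pthread_t tid;
    pthread_create(&tid, NULL, run_spe, ctx);            /* 3. a pthread runs it */
    pthread_join(tid, NULL);                             /* barrier-like join on the PPE */
    spe_context_destroy(ctx);
    return 0;
}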
Thread Communication
The PPE may communicate with the SPEs via:
Individual mailboxes (FIFO)
HW signals
DRAM
Direct PPE access to the SPEs' local stores