
Q1 a) Dependences & Hazards

Dependences are a property of programs

Presence of a dependence indicates the potential for a hazard, but the actual hazard and the length
of any stall are properties of the pipeline

Dependencies that flow through memory locations are difficult to detect

To avoid hazards, HW/SW must preserve program order, i.e. the sequential execution order determined by the original source program

HW/SW goal:
exploit parallelism by preserving program order only where it affects the
outcome of the program

Data Dependences

Data Dependent

Instruction j is data dependent on instruction i if

Instruction i produces a result that may be used by instruction j, or

Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a chain of dependences)

e.g.1

I: add r1,r2,r3
J: sub r4, r1,r3

e.g.2

Loop: L.D    F0,0(R1)   ; F0=array element
      ADD.D  F4,F0,F2   ; add scalar in F2
      S.D    F4,0(R1)   ; store result
      DADDUI R1,R1,#-8  ; decrement pointer 8 bytes
      BNE    R1,R2,LOOP ; branch R1!=zero

Importance of the data dependencies


1. indicates the possibility of a hazard
2. determines order in which results must be calculated
3. sets an upper bound on how much parallelism can possibly be exploited
To overcome the data dependences:
1. Maintain the dependence while avoiding the hazard
2. Eliminate the dependence by code transformation
Loop: L.D    F0,0(R1)   ; F0=array element                    (floating-point data)
      ADD.D  F4,F0,F2   ; add scalar in F2                    (floating-point data)
      S.D    F4,0(R1)   ; store result                        (floating-point data)
      DADDIU R1,R1,#-8  ; decrement pointer, 8 bytes (per DW) (integer data)
      BNE    R1,R2,Loop ; branch R1!=zero                     (integer data)

Name Dependences

Name Dependent

occurs when 2 instructions use the same register or memory location (called a name), but there is no flow of data
between the instructions associated with that name

Antidependence : Instr-J writes operand before Instr-I reads it


I: sub r4, r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Output dependence : Instr-J writes operand before Instr-I writes it


I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

To overcome:
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict, i.e. register renaming, either by the compiler or by HW (e.g. in the examples above, letting J write a fresh register and having K read that register removes the conflict on r1).

Control Dependencies

Every instruction is control dependent on some set of branches, and, in general, these
control dependencies must be preserved to preserve program order.

e.g.
if p1 {
S1;
};
if p2 {
S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1

Control dependence need not always be preserved:

we are willing to execute instructions that should not have been executed (thereby violating the
control dependences) if we can do so without affecting the correctness of the program

Instead, 2 properties critical to program correctness are


1) Exception behavior
2) Data flow

Preserving Exception Behavior


any changes in the instruction execution order must not change how exceptions are raised in the program
(no new exceptions)
e.g.
      DADDU R2,R3,R4
      BEQZ  R2,L1
      LW    R1,0(R2)
L1:
If the LW is moved before the BEQZ, it could raise a memory-protection exception that the original program (which skips the load when R2 is zero) never raises.

Preserving Data flow:


the actual flow of data values among instructions that produce results and those that consume them
branches make the flow dynamic: they determine which instruction is the supplier of data

e.g. 1
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R5,R6
L:    OR    R7,R1,R8
The OR depends on either the DADDU or the DSUBU; the branch decides which one supplies R1, so data flow must be preserved on execution.

e.g. 2
      DADDU R1,R2,R3
      BEQZ  R12,Skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
Skip: OR    R7,R1,R8
Here, violating the control dependence (e.g. moving the DSUBU above the branch) won't affect exception behavior or data flow, provided R4 is not used after Skip and the DSUBU cannot raise an exception.

Data Hazards

Read after write (RAW)

Instr-J reads operand before Instr-I writes it

Corresponds to data dependence

Write after write (WAW)

Instr-J writes operand before Instr-I writes it

Corresponds to output Dependence

Occurs in pipelines that:

write results in more than one pipe stage, or

allow an instruction to proceed even when a previous instruction is stalled

Write after read (WAR)

Instr-J writes operand before Instr-I read it

Corresponds to anti-dependence

Read after read (RAR) : not a Hazard

Q1 b) Dynamic branch prediction

Branch prediction buffer

Correlating branch predictors

Tournament Predictors

1) Branch Predictors

Basic Branch prediction buffer

A branch-prediction buffer (branch history table) is a small memory indexed by the lower portion of the address of the branch instruction.
The memory records whether the branch was recently taken or not.
Any branch with the same low-order address bits can modify the entry.
If the hint turns out to be wrong, the prediction bit is inverted and stored back.
It is effectively a cache.

1-bit predictor
For each branch, keep track of what happened last time and use that outcome as the prediction.
Problem: in a loop, a 1-bit BHT will cause two mispredictions:
at the end of the loop, when it exits instead of looping as before
the first time through the loop on the next pass through the code, when it predicts exit instead of looping
Performance = f(accuracy, cost of misprediction)

2-bit predictor
For each branch, maintain a 2-bit saturating counter:
if the branch is taken:     counter = min(3, counter+1)
if the branch is not taken: counter = max(0, counter-1)
If (counter >= 2), predict taken, else predict not taken
Advantage:
a few atypical branches will not influence the prediction
especially useful when multiple branches share the same counter
can easily be extended to N bits (in most processors, N=2)
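A minimal C sketch of the 2-bit scheme above; the table size NUM_ENTRIES and the names bht, predict_taken, and update are illustrative assumptions, not part of the original notes:

#include <stdint.h>
#include <stdbool.h>

#define NUM_ENTRIES 4096                 /* assumed table size (power of two) */
static uint8_t bht[NUM_ENTRIES];         /* each entry: saturating counter 0..3 */

/* Index by the low-order bits of the branch address, as described above. */
static unsigned index_of(uint32_t pc) { return pc & (NUM_ENTRIES - 1); }

bool predict_taken(uint32_t pc) {
    return bht[index_of(pc)] >= 2;       /* counter >= 2 => predict taken */
}

void update(uint32_t pc, bool taken) {
    uint8_t *c = &bht[index_of(pc)];
    if (taken) { if (*c < 3) (*c)++; }   /* counter = min(3, counter+1) */
    else       { if (*c > 0) (*c)--; }   /* counter = max(0, counter-1) */
}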

2) Correlating Branch Predictors


The previous schemes use only the recent behavior of a single branch to predict the future
behavior of that branch.
Correlating branch predictor:
to improve prediction accuracy, use the behavior of other branches to make a prediction.
Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches
affects the prediction of the current branch
Idea: record the m most recently executed branches as taken or not taken, and use that pattern
to select the proper branch history table
An (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit
counters
The old 2-bit BHT is then a (0,2) predictor
if(aa==2)
aa=0;
if(bb==2)
bb=0;
if(aa!=bb)
{
}

DSUBUI R3,R1,#2
BNEZ R3,L1 ;branch b1 (aa!=2)
DADD R1,R0,R0 ;aa=0
L1: DSUBUI R3,R2,#2
BNEZ R3,L2 ;branch b2(bb!=2)
DADD R2,R0,R0 ; bb=0
L2: DSUBU R3,R1,R2 ;R3=aa-bb
BEQZ R3,L3 ;branch b3 (aa==bb)

A predictor that uses only the behavior of a single branch to predict the outcome of that branch
can never capture this behavior.
Two-level predictor
The behavior of recent branches selects between four predictions of the next branch, updating just that prediction.
Total memory bits required = 2^m x n x 2^p
2^m banks of memory are selected by the global branch history (which is just a shift register)
Use p bits of the branch address to select the row
Get the n predictor bits in the entry to make the decision
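A minimal C sketch of the (m,n) indexing described above, for the common n=2 case; the sizes M and P and the variable names (tables, ghist) are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define M 2                                /* bits of global branch history     */
#define P 10                               /* branch-address bits used as row   */
static uint8_t tables[1 << M][1 << P];     /* 2^m banks of 2^p 2-bit counters   */
static unsigned ghist;                     /* global history shift register     */

static unsigned row(uint32_t pc) { return (pc >> 2) & ((1u << P) - 1); }

bool predict_taken(uint32_t pc) {
    return tables[ghist][row(pc)] >= 2;    /* 2-bit counter decision, as before */
}

void update(uint32_t pc, bool taken) {
    uint8_t *c = &tables[ghist][row(pc)];
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);  /* shift in outcome */
}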

3) Tournament Predictors

multilevel branch predictors

ability to select the right predictor for the right branch

the 2-bit predictor failed on important branches; by adding global information, performance improved

Adaptively combining local and global predictors:
use 2 predictors, 1 based on global information and 1 based on local information, and combine them with a selector

e.g. the tournament predictor in the Alpha 21264

Q2 a) Advanced compiler support for Detecting and enhancing Loop level parallelism

Loop Carried dependence

It is also possible to have a loop-carried dependence that does not prevent parallelism

Recurrence: a type of loop-carried dependence in which a variable is defined based on the value of that variable in an earlier iteration

Detecting a recurrence can be important for two reasons:


1. Some architectures (especially vector computers) have special support for executing
recurrences, and
2. some recurrences can be the source of a reasonable amount of parallelism

Finding dependences is important for three tasks:
(1) good scheduling of code,
(2) determining which loops might contain parallelism, and
(3) eliminating name dependences.

Test for dependence

GCD (greatest common divisor) test:
Suppose the loop stores into an array element indexed by a*i + b and later loads the element indexed by c*i + d.
The test is based on the observation that if a loop-carried dependence exists, then GCD(c,a) must divide (d - b).
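As an illustration (a worked example in the spirit of the test above; the loop and array are hypothetical):

enum { N = 100 };
double x[2*N + 4];                       /* large enough for index 2*i + 3 */

void scale(void) {
    /* Store index: 2*i + 3  (a = 2, b = 3); load index: 2*i  (c = 2, d = 0). */
    for (int i = 0; i < N; i++)
        x[2*i + 3] = x[2*i] * 5.0;
    /* GCD(c,a) = GCD(2,2) = 2 and d - b = -3; since 2 does not divide 3, the
       test reports no loop-carried dependence, so iterations may run in parallel. */
}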

Eliminating dependent computations


to achieve more instruction-level parallelism (ILP)
the technique is to eliminate or reduce a dependent computation by back substitution

Copy propagation
Within a basic block, algebraic simplifications of expressions and an optimization called copy propagation (which eliminates operations that copy values) can be used to simplify sequences of dependent computations, e.g. DADDUI R1,R2,#4 followed by DADDUI R1,R1,#4 becomes DADDUI R1,R2,#8.

Tree height reduction


Optimizations that may increase the parallelism of the code,
possibly even increasing the number of operations.
They reduce the height of the tree structure representing a computation,
making it wider but shorter.
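For instance, a minimal illustration in C (the two functions are illustrative, not from the notes):

/* Tree height reduction: the same sum computed with a shorter dependence chain. */
double sum_chain(double a, double b, double c, double d) {
    return ((a + b) + c) + d;      /* three dependent adds: tree height 3 */
}

double sum_balanced(double a, double b, double c, double d) {
    return (a + b) + (c + d);      /* the two inner adds are independent: height 2 */
}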

Scheduling and Structuring Code for Parallelism


Software Pipelining

Symbolic Loop Unrolling

Loop Unrolling :
1. Unroll the loop
2. Schedule
The scheduler essentially interleaves instructions from different loop
iterations, so as to separate the dependent instructions that occur within a single
loop iteration and to eliminate stalls.

A software-pipelined loop interleaves instructions from different iterations


without unrolling the loop

This technique is the software counterpart to what Tomasulo's algorithm does in hardware.

(Figure: instructions selected from overlapping iterations 0-4 form a single software-pipelined iteration.)

Assumptions:
DADDUI is scheduled before the ADD.D, and the L.D instruction, with an adjusted offset, is placed in the branch delay slot.

For start-up, we must execute any instructions that correspond to iterations 1 and 2 and that will not be executed by the software-pipelined loop. These instructions are the L.D for iterations 1 and 2 and the ADD.D for iteration 1.
For the finish-up code, we must execute any instructions that will not be executed in the final two iterations. These include the ADD.D for the last iteration and the S.D for the last two iterations.
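A minimal C sketch of the same idea, assuming a loop that adds a scalar s to each element of x[] with n >= 2 (the variable names are illustrative):

/* Original loop: for (i = 0; i < n; i++) x[i] = x[i] + s; */
void sw_pipelined(double *x, int n, double s) {
    double t_load, t_add;
    t_load = x[0];                 /* start-up: load for iteration 0 */
    t_add  = t_load + s;           /* start-up: add  for iteration 0 */
    t_load = x[1];                 /* start-up: load for iteration 1 */
    for (int i = 2; i < n; i++) {
        x[i - 2] = t_add;          /* store for iteration i-2 */
        t_add    = t_load + s;     /* add   for iteration i-1 */
        t_load   = x[i];           /* load  for iteration i   */
    }
    x[n - 2] = t_add;              /* finish-up: store for iteration n-2        */
    x[n - 1] = t_load + s;         /* finish-up: add + store for iteration n-1  */
}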

Q2 b) IA-64 Control Hazard Solution


IA64 solves this by predication.
Idea:
Compiler tags each side of a branch with a predicate.
Bundle tagged instructions and set template bits to allow parallel execution of predicated
instructions.
Both sides of the branch are executed simultaneously.
When the outcome of the branch is known, the effects of the correct side are committed
(registers modified, etc.), while the effects (and remainder) of the wrong side are discarded.
Benefits:
No need to undo effects (they are committed only after we know which is the correct side of the branch)
Time taken to execute the wrong side is at least partially amortized by the execution of the correct side, assuming sufficient functional units.

Memory Latencies IA-64 Solution

Idea: Move all load instructions to the start of the program.


This is called hoisting.
Loads will be executed concurrently with the rest of the program.
Hopefully data will be ready in register when it is read.
Loads may belong to decision paths that are never executed

Hoisting effectively causes these loads to be executed anyway, even if their contents aren't actually required.

Loads are therefore speculative: they are done on the speculation that the results will be needed later.

A check instruction is placed just before the load results are needed.
It checks for exceptions in the speculative load and commits the effect of the load to the target register.

Original code:
I1:        add r3, r1, r2
I2:        sub r2, r3, r5
I3:        P1, P2 = cmp(r2 == 3)
I4:  <P1>  sub r7, r3, r4
I5:  <P1>  rori r7, 3
I6:  <P1>  j I10
I7:  <P2>  lw r4, 0(r6)
I8:  <P2>  muli r4, r3, r6

Modified code:
I1:        add r3, r1, r2
I2:        sub r2, r3, r5
I3:        lw r4, 0(r6)          ; speculative load (hoisted)
I4:        P1, P2 = cmp(r2 == 3)
I5:  <P1>  sub r7, r3, r4
I6:  <P1>  rori r7, 3
I7:  <P1>  j I10
I9:  <P2>  chk.s r4              ; load check
I10: <P2>  muli r4, r3, r6       ; load used here

Q3 a) Snooping Coherence Protocol

Write Invalidate

On write, invalidate all other copies (most common)


Use bus to serialize
Write cannot complete until bus access is obtained
The same item can appear in multiple caches
Two caches will never have different values for the same block.
Most commonly used in MPs

Write Update (a.k.a. write broadcast)


On write, update all copies
Consumes considerable bandwidth

Basic Snooping Cache Implementation

All processors continuously snoop on the bus, watching addresses; they invalidate their cached copy if a write to that address is observed.
States of a block:
Invalid: the copy is not valid (e.g. another core has modified the block)
Shared: potentially shared with other caches
Modified: updated in the private cache; implies that the block is exclusive
An extra state bit (shared/exclusive) is associated with the valid bit and the dirty bit of each block.
Requests to the cache can come from a core or from the bus.
Implements a finite-state machine
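A minimal C sketch of such a per-block finite-state machine, covering only the basic MSI transitions described above; the event names and handler structure are illustrative assumptions, not a full protocol:

typedef enum { INVALID, SHARED_S, MODIFIED_S } line_state_t;

typedef enum {
    CPU_READ, CPU_WRITE,                           /* requests from the local core */
    BUS_READ_MISS, BUS_WRITE_MISS, BUS_INVALIDATE  /* snooped bus requests         */
} event_t;

/* Returns the next state of one cache block; side effects such as placing a
   miss/invalidate on the bus or writing back dirty data are noted in comments. */
line_state_t next_state(line_state_t s, event_t e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED_S;    /* place read miss on the bus     */
        if (e == CPU_WRITE) return MODIFIED_S;  /* place write miss on the bus    */
        return INVALID;
    case SHARED_S:
        if (e == CPU_WRITE) return MODIFIED_S;  /* place invalidate on the bus    */
        if (e == BUS_WRITE_MISS || e == BUS_INVALIDATE) return INVALID;
        return SHARED_S;                        /* CPU_READ, BUS_READ_MISS        */
    case MODIFIED_S:
        if (e == BUS_READ_MISS)  return SHARED_S;  /* write back the block        */
        if (e == BUS_WRITE_MISS) return INVALID;   /* write back the block        */
        return MODIFIED_S;                         /* local reads and writes hit  */
    }
    return INVALID;
}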

Complications for the basic MSI protocol:

Operations are not atomic

E.g. detect miss, acquire bus, receive a response

Creates possibility of deadlock and races

One solution: processor that sends invalidate can hold bus


until other processors receive the invalidate

Extensions:

MESI protocol

adds an Exclusive state to indicate when a clean block is resident in only one cache
prevents the need to send an invalidate on a write to such a block
a read miss by another core to a block in the E state changes the block to the S state
the Core i7 uses MESIF (F = forward)

MOESI protocol
Owned state: indicates that the block is owned by that cache and that the copy in memory is out of date.
A block changes from Modified to Owned (without being written back to memory) instead of to Shared; other caches keep the block in the Shared state.

Q3 b) Memory consistency models


Processor 1:          Processor 2:
  A = 0;                B = 0;
  ...                   ...
  A = 1;                B = 1;
  if (B == 0) ...       if (A == 0) ...

It should be impossible for both if-statements to be evaluated as true.
A delayed write invalidate, however, could allow exactly that.
Sequential consistency model:
Result of execution should be the same as long as:
Accesses on each processor were kept in order
Accesses on different processors were arbitrarily interleaved
Relaxed consistency models

X -> Y : X must complete before Y

Sequential consistency requires:
R -> W, R -> R, W -> R, W -> W

Relax W -> R
Total store ordering (no additional synchronization)

Relax W -> W
Partial store order

Relax R -> W and R -> R
Weak ordering and release consistency
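The difference can be seen with C11 atomics (a minimal sketch; the seq_cst version forbids the "both true" outcome above, while the relaxed version does not — function names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int A, B;                 /* both initially 0 */

/* Sequentially consistent: the W -> R order is enforced, so the two
   threads can never both observe 0.                                 */
bool thread1_sc(void) {
    atomic_store_explicit(&A, 1, memory_order_seq_cst);
    return atomic_load_explicit(&B, memory_order_seq_cst) == 0;
}
bool thread2_sc(void) {
    atomic_store_explicit(&B, 1, memory_order_seq_cst);
    return atomic_load_explicit(&A, memory_order_seq_cst) == 0;
}

/* Relaxed: the store may be reordered after the load (W -> R relaxed),
   so both functions may return true.                                  */
bool thread1_relaxed(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    return atomic_load_explicit(&B, memory_order_relaxed) == 0;
}
bool thread2_relaxed(void) {
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    return atomic_load_explicit(&A, memory_order_relaxed) == 0;
}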

Q4 a)
i) Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation

HW advantages:

HW better at memory disambiguation since knows actual addresses


HW better at branch prediction since lower overhead
HW maintains precise exception model
HW does not execute bookkeeping instructions
Same software works across multiple implementations
Smaller code size (not as many nops filling blank instructions)
SW advantages:

The window of instructions examined for parallelism is much larger

Much less hardware involved in VLIW (unless you are Intel!)
More involved types of speculation can be done more easily
Speculation can be based on large-scale program behavior, not just local information

ii) Comparison between CISC, RISC, VLIW

Q4 b)
i) Cache Performance

Memory stall cycles = Number of misses x Miss penalty
                    = IC x (Misses / Instruction) x Miss penalty

IC: instruction count; Miss penalty: cost per miss
Miss rate: number of accesses that miss divided by the total number of accesses

Miss Penalty and Out-of-Order Execution Processors


miss penalty considered as nonoverlapped latency

Length of memory latency: what to consider as the start and the end of a memory operation in an out-of-order processor
Length of latency overlap: what is the start of overlap with the processor

Which has the lower miss rate: a 16 KB instruction cache with a 16 KB data cache or a 32
KB unified cache?
Split caches offer two memory ports per clock cycle, thereby avoiding the structural hazard; they have a better average memory access time than the single-ported unified cache despite having a worse effective miss rate.

Better measure of memory performance is

Average memory access time = Hit time + Miss rate X Miss penalty
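For example (illustrative numbers, not from the notes): with a 1-cycle hit time, a 2% miss rate, and a 100-cycle miss penalty, Average memory access time = 1 + 0.02 x 100 = 3 cycles.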

Cache Optimization
Reducing the miss rate: larger block size, larger cache size, and higher associativity
Reducing the miss penalty: multilevel caches and giving reads priority over
writes
Reducing the time to hit in the cache: avoiding address translation when
indexing the cache

ii) Synchronization

Basic building blocks:

Atomic exchange
Swaps register with memory location

Test-and-set
Sets a value under a condition: test if the memory location is 0 and set it to 1 if it is 0

Fetch-and-increment
Reads the original value from memory and increments it in memory
Requires a memory read and a write in one uninterruptible instruction

Pair of Load linked/Store conditional instructions


If the contents of the memory location specified by the load linked
are changed before the store conditional to the same address, the
store conditional fails

Atomic Exchange
LOCK = 0 => FREE
LOCK = 1 => UNAVAILABLE
Lock value in memory
EXCH R,M
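A minimal C11 sketch of a spin lock built on atomic exchange, following the 0 = FREE / 1 = UNAVAILABLE convention above; the lock variable and function names are illustrative:

#include <stdatomic.h>

static atomic_int lock;                       /* 0 = FREE, 1 = UNAVAILABLE */

void acquire(void) {
    /* EXCH: atomically swap 1 into the lock and get the old value back;
       keep spinning while the old value was 1 (someone else holds it).  */
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 1)
        ;
}

void release(void) {
    atomic_store_explicit(&lock, 0, memory_order_release);  /* set back to FREE */
}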

Load Linked/Store Conditional

Goal: atomically exchange R4 and content of memory location specified by R1


try: MOV  R3,R4     ; move exchange value
     LL   R2,0(R1)  ; load linked
     SC   R3,0(R1)  ; store conditional
     BEQZ R3,try    ; branch if the store fails
     MOV  R4,R2     ; put loaded value in R4
Before the exchange, the memory location pointed to by R1 holds A and R4 holds B; afterwards R4 holds A and the memory location holds B.

Q5 a) Multiprocessor Architecture
Shared memory:
Communication among threads through a shared address space
A memory reference can be made by any processor to any memory location.
Types
Centralized shared-memory multiprocessor
Distributed shared memory (DSM)

Centralized shared-memory multiprocessor

Also called as Symmetric multiprocessors (SMP)


Share single memory with uniform memory access/latency (UMA)
Small number of cores
A symmetric relationship to all processors
A uniform memory access time from any processor
scalability problem: less attractive for large-scale multiprocessors

Distributed shared memory (DSM)

Memory distributed among processors.


Non-uniform memory access/latency (NUMA)
Processors connected via direct (switched) and non-direct (multihop)
interconnection networks

Q5 b)
i) Centralized Shared Memory Architectures

SMPs: both shared and private data can be cached.


Shared data provides a mechanism for processors to communicate through reads and
writes to shared memory.
The effect of caching private data on program behavior is the same as that of a uniprocessor because no other processor accesses these data.
The value of shared data may be replicated in the multiple caches:
Adv : reduction in cache contention
Disadv : cache coherence!
Cache Coherence

Processors may see different values of the same location through their caches, e.g. after one processor writes a location, another processor's cache may still hold the stale old value.

ii) Distributed Shared-Memory Architectures


Distributed shared-memory architectures
Separate memory per processor
Local or remote access via memory controller
The physical address space is statically distributed
Coherence Problems
Simple approach: uncacheable
shared data are marked as uncacheable and only private data are kept in caches
very long latency to access memory for shared data
Alternative: directory for memory blocks
Directory Protocols

Directory keeps track of every block


Which caches have each block
Dirty status of each block

Implement in a distributed fashion:

Implement in shared L3 cache


Keep bit vector of size = # cores for each block in L3
Not scalable beyond shared L3
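A minimal C sketch of such a directory entry, one per memory block; the field names and the fixed-size sharer bit vector are illustrative assumptions:

#include <stdint.h>

#define NUM_CORES 8                    /* assumed core count for the bit vector */

typedef enum { UNCACHED, SHARED_ST, MODIFIED_ST } dir_state_t;

typedef struct {
    dir_state_t state;                 /* uncached / shared / modified (dirty) */
    uint8_t     sharers;               /* bit i set => core i has a copy       */
} dir_entry_t;

/* Record that 'core' obtained a read-only copy of the block. */
void dir_add_sharer(dir_entry_t *e, int core) {
    e->sharers |= (uint8_t)(1u << core);
    e->state = SHARED_ST;
}

/* Record that 'core' obtained exclusive (dirty) ownership of the block;
   in a real protocol the previous sharers would first be sent invalidates. */
void dir_set_owner(dir_entry_t *e, int core) {
    e->sharers = (uint8_t)(1u << core);
    e->state = MODIFIED_ST;
}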

Q6 a) Intel Montecito
Features
Dual core Itanium 2, each core dual threaded
1.7 billion transistors, 21.5 mm x 27.7 mm die
27 MB of on-chip cache organized in three levels
Not shared among cores
1.8+ GHz, 100 W
Single-thread enhancements
Extra shifter improves performance of crypto codes by 100%
Improved branch prediction
Improved data and control speculation recovery
Separate L2 instruction and data caches buys 7% improvement over Itanium2; four times
bigger L2I (1 MB)
Asynchronous 12 MB L3 cache

Dual threads
SMT only for the cache hierarchy, not for core resources
Simulations showed high resource utilization at the core level, but low utilization of the cache
The branch predictor is still shared but uses thread id tags
A thread switch is implemented by flushing the pipe
More like coarse-grained multithreading

Five thread switch events :


L3 cache miss (immense impact on in-order pipe)/ L3 cache refill
Quantum expiry
Spin lock/ ALAT invalidation
Software-directed switch
Execution in low power mode
Thread urgency
Each thread has eight urgency levels
Every L3 miss decrements urgency by one
Every L3 refill increments urgency by one until urgency reaches 5
A switch due to time quantum expiry sets the urgency of the switched thread to 7
Arrival of asynchronous interrupt for a background thread sets the urgency level of that
thread to 6
A switch due to an L3 miss also requires the urgency levels to be compared

Q6 b) IBM Cell Processor


It is a heterogeneous multicore processor with 1 PowerPC core and 8 SPE cores.
Each Cell chip has:
One PowerPC core
8 compute cores (SPEs)
On-chip Memory controller
On-chip I/O
On-chip network to
connect them all

PowerPC Core (PPE)


3.2GHz
Dual issue, in-order
2-way multithreaded
512KB L2 cache
No hardware prefetching
SIMD (altivec) + FMA
6.4 GFlop/s (double precision)
25.6 GFlop/s (single precision)
Serves 2 purposes:

Compatibility processor;
runs legacy code, compilers, libraries, etc.
Performs all system level functions
including starting the other cores
Synergistic Processing Element (SPE)

Offload processor
Dual issue, in-order, VLIW inspired
SIMD processor
128 x 128b register file
256KB local store, no I$, no D$
Runs a small program (placed in LS)
Offloads system functions to the PPE
MFC (memory flow controller)
a programmable DMA engine and on-chip network interface
Performance
1.83 GFlop/s (double precision)
25.6 GFlop/s (single precision)
51.2GB/s access the local store
25.6GB/s access to the network
Threading-Model
For each SPE program the PPE must:
Create an SPE context
Load the SPE program (embedded in the binary)
Create a pthread to run it (a sketch of this pattern is shown below)
Typically, the work is split into 2 phases:
code that runs on the SPEs, and
code that runs on the PPE, with a barrier in between
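A minimal sketch of that pattern using pthreads; spe_job_t, run_phase_on_spes, and the commented SPE runtime steps are hypothetical placeholders standing in for the actual SPE runtime calls (e.g. those of libspe2):

#include <pthread.h>

typedef struct { int spe_id; } spe_job_t;      /* hypothetical per-SPE work descriptor */

static void *spe_thread(void *arg) {
    spe_job_t *job = (spe_job_t *)arg;
    /* 1. create an SPE context for job->spe_id         */
    /* 2. load the SPE program (embedded in the binary) */
    /* 3. run the SPE program until it stops            */
    (void)job;
    return NULL;
}

void run_phase_on_spes(int nspes) {
    pthread_t tid[8];
    spe_job_t job[8];
    for (int i = 0; i < nspes && i < 8; i++) {   /* one pthread per SPE */
        job[i].spe_id = i;
        pthread_create(&tid[i], NULL, spe_thread, &job[i]);
    }
    for (int i = 0; i < nspes && i < 8; i++)     /* barrier between the two phases */
        pthread_join(tid[i], NULL);
    /* phase 2: code that runs on the PPE executes here, after the barrier */
}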
Threads-Communication
The PPE may communicate with the SPEs via :
individual mailboxes (FIFO)
HW signals
DRAM
Direct PPE access to the SPEs' local stores
