Presence of a dependence indicates the potential for a hazard, but the actual hazard and the length
of any stall are properties of the pipeline
HW/SW goal:
exploit parallelism by preserving program order only where it affects the
outcome of the program
Data Dependences
Instruction J is data dependent on instruction I if J uses a result produced by I.
e.g.1
I: add r1,r2,r3
J: sub r4,r1,r3 ;J reads r1, which I writes
e.g.2
Loop: L.D F0,0(R1) ;F0=array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer 8 bytes
BNEZ R1,Loop ;branch R1!=zero
The floating-point instructions are data dependent through F0 and F4; DADDUI and BNEZ are data dependent through the integer data in R1.
Name Dependences
Two instructions are name dependent when they use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name
To Overcome
Instructions involved in a name dependence can execute simultaneously if the name used
in the instructions is changed so that the instructions do not conflict (i.e., register renaming, either by
the compiler or by HW); a C-level sketch follows.
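A minimal C-level sketch of renaming (variable names are illustrative): the second write to t creates an antidependence with the earlier read; renaming removes the conflict.

int war_before(int a, int b, int c, int d) {
    int t, u, v;
    t = a + b;      /* I: writes the name t */
    u = t * 2;      /* J: reads t */
    t = c + d;      /* K: rewrites t, but no data flows from I or J */
    v = t * 3;
    return u + v;
}

int war_after(int a, int b, int c, int d) {
    int t1, t2, u, v;
    t1 = a + b;     /* I writes t1 */
    u  = t1 * 2;    /* J reads t1 */
    t2 = c + d;     /* K writes t2: the name conflict is gone, so K can
                       be scheduled before or alongside I and J */
    v  = t2 * 3;
    return u + v;
}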
Control Dependencies
Every instruction is control dependent on some set of branches, and, in general, these
control dependencies must be preserved to preserve program order.
e.g.
if p1 {
S1;
};
if p2 {
S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1
Data Hazards
A WAR (write-after-read) hazard corresponds to an antidependence
Tournament Predictors
1) Branch Predictors
The branch-history table is effectively a cache of recent branch outcomes.
1-Bit Predictor
For each branch, keep track of what
happened last time and use that outcome as
the prediction
Problem: in a loop, a 1-bit BHT will cause two
mispredictions:
At the end of the loop, when it exits instead of looping
as before
On the first iteration the next time through the loop,
when it predicts exit instead of looping
Performance = f(accuracy, cost of misprediction)
2-Bit Predictor
For each branch,
maintain a 2-bit saturating counter:
if the branch is taken:
counter = min(3,counter+1)
if the branch is not taken:
counter = max(0,counter-1)
If (counter >= 2), predict taken, else predict not taken
Advantage:
a few atypical branches will not influence the prediction
Especially useful when multiple branches share the same counter
Can be easily extended to N-bits (in most processors, N=2)
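A minimal C sketch of this update rule for one counter (the BHT would hold one such counter per entry):

typedef unsigned char counter2;   /* holds values 0..3 */

int predict_taken(counter2 c) {
    return c >= 2;                /* predict taken in states 2 and 3 */
}

counter2 update2bit(counter2 c, int taken) {
    if (taken) return c < 3 ? c + 1 : 3;   /* counter = min(3, counter+1) */
    else       return c > 0 ? c - 1 : 0;   /* counter = max(0, counter-1) */
}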
DSUBUI R3,R1,#2
BNEZ R3,L1 ;branch b1 (aa!=2)
DADD R1,R0,R0 ;aa=0
L1: DSUBUI R3,R2,#2
BNEZ R3,L2 ;branch b2(bb!=2)
DADD R2,R0,R0 ; bb=0
L2: DSUBU R3,R1,R2 ;R3=aa-bb
BEQZ R3,L3 ;branch b3 (aa==bb)
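The assembly above corresponds to source like the following (a reconstruction of the classic correlating-branch example). Note the correlation: if b1 and b2 are both not taken, then aa and bb are both set to 0, so b3 will be taken.

void example(int aa, int bb) {
    if (aa == 2) aa = 0;   /* b1 (BNEZ): taken when aa != 2 */
    if (bb == 2) bb = 0;   /* b2 (BNEZ): taken when bb != 2 */
    if (aa != bb) {        /* b3 (BEQZ): taken, skipping the body, when aa == bb */
        aa = bb;           /* placeholder body */
    }
}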
A predictor that uses only the behavior of a single branch to predict the outcome of that branch
can never capture this behavior.
2) Two-Level Predictors
Behavior of recent branches selects between four predictions of next branch, updating just
that prediction
Total memory bits required = 2^m × n × 2^p
2^m banks of memory selected by the global branch
history (which is just a shift register)
Use p bits of the branch address to select row
Get the n predictor bits in the entry to make the
decision
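A C sketch of such a predictor with m = 2 bits of global history, n = 2 predictor bits, and an assumed p = 10 bits of branch address (all names illustrative):

#define P 10                            /* branch-address index bits (assumed) */
static unsigned char table[1 << P][4];  /* 2^p rows x 2^m columns of 2-bit counters */
static unsigned ghr;                    /* global history shift register (m = 2 bits) */

int gp_predict(unsigned pc) {
    unsigned row = (pc >> 2) & ((1 << P) - 1);  /* p bits of the branch address */
    return table[row][ghr & 3] >= 2;            /* n predictor bits make the decision */
}

void gp_train(unsigned pc, int taken) {
    unsigned row = (pc >> 2) & ((1 << P) - 1);
    unsigned char *c = &table[row][ghr & 3];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    ghr = ((ghr << 1) | (unsigned)taken) & 3;   /* shift in the outcome */
}

Total storage here is 2^m × n × 2^p = 4 × 2 × 1024 = 8192 bits, matching the formula above.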
3) Tournament Predictors
Adaptively combine two predictors (typically one based on global history and one based on local, per-branch history), with a selector that tracks which predictor has been more accurate for each branch.
Q2 a) Advanced compiler support for detecting and enhancing loop-level parallelism
It is also possible to have a loop-carried dependence that does not prevent parallelism
Finding Dependences
Finding the dependences is important for three tasks:
(1)good scheduling of code,
(2) determining which loops might contain parallelism, and
(3) eliminating name dependences.
The GCD test: suppose a loop stores into an array element indexed by a*i + b and later loads from an element indexed by c*i + d. The test is based on the observation that if a loop-carried dependence exists, then GCD(c,a) must divide (d - b).
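As a sketch, consider the classic example in C:

void scale(double *x) {
    /* x[2*i + 3] is written: a = 2, b = 3; x[2*i] is read: c = 2, d = 0.
       GCD(c,a) = 2 would have to divide d - b = -3; it does not,
       so no loop-carried dependence is possible. */
    for (int i = 1; i <= 100; i++)
        x[2*i + 3] = x[2*i] * 5.0;
}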
Copy Propagation
Within a basic block, algebraic simplifications of expressions and an optimization called copy propagation, which eliminates operations that copy values, can be used to simplify sequences.
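A minimal C-level sketch of copy propagation (names are illustrative):

int cp_before(int x) {
    int y = x;        /* a copy operation */
    return y + 4;     /* uses the copy */
}

int cp_after(int x) {
    return x + 4;     /* the copy is propagated and becomes dead code */
}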
Software Pipelining
Loop Unrolling:
1. Unroll the loop
2. Schedule
The scheduler essentially interleaves instructions from different loop
iterations, so as to separate the dependent instructions that occur within a single
loop iteration and to eliminate stalls.
[Figure: iterations 0-4 of the original loop, with a software-pipelined iteration formed by taking one instruction from each]
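A C-level sketch of the resulting schedule for the loop x[i] = x[i] + s (illustrative, not the MIPS example the assumptions below refer to; assumes n >= 2). In the steady state, each iteration stores the result of iteration i, computes the add for iteration i+1, and loads for iteration i+2:

void swp(double *x, double s, int n) {
    double loaded = x[0];          /* prologue: load for iteration 0 */
    double summed = loaded + s;    /* add for iteration 0 */
    loaded = x[1];                 /* load for iteration 1 */
    for (int i = 0; i < n - 2; i++) {
        x[i] = summed;             /* store result of iteration i */
        summed = loaded + s;       /* add for iteration i+1 */
        loaded = x[i + 2];         /* load for iteration i+2 */
    }
    x[n - 2] = summed;             /* epilogue */
    x[n - 1] = loaded + s;
}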
Assumptions:
DADDUI is scheduled before the ADD.D, and the L.D instruction, with an adjusted
offset, is placed in the branch delay slot.
Loads are therefore speculative: they are done on the speculation that the results are
needed later.
Original code:
I1: add r3, r1, r2
I2: sub r2, r3, r5
I3: P1, P2 = cmp(r2 == 3)
I4: <P1> sub r7, r3, r4
I5: <P1> rori r7, 3
I6: <P1> j I10
I7: <P2> lw r4, 0(r6)
I8: <P2> muli r4, r3, r6
Modified Code:
I1: add r3, r1, r2
I2: sub r2, r3, r5
I3: lw r4, 0(r6) ; speculative load
I4: P1, P2 = cmp(r2 == 3)
I5: <P1> sub r7, r3, r4
I6: <P1> rori r7, 3
I7: <P1> j I10
I9: <P2> chk.s r4 ; load check
I10: <P2> muli r4, r3, r6 ; load used here
Write Invalidate
All processors continuously snoop on the bus, watching addresses. A processor invalidates
its copy of a block when it observes a write to that block's address by another processor.
States of a block:
Invalid: another core has modified the block
Shared: potentially shared with other caches
Modified: updated in the private cache; implies that the block is exclusive
Requires an extra state bit (shared/exclusive) associated with the valid bit and dirty bit of
each block
Requests to the cache can come from a core or from the bus.
Implements a finite-state machine
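A simplified C sketch of the per-block next-state function for such a write-invalidate protocol (event names are illustrative; a real controller also generates bus transactions and write-backs):

typedef enum { INVALID, SHARED, MODIFIED } block_state;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } event_t;

block_state next_state(block_state s, event_t e) {
    switch (e) {
    case CPU_READ:        /* read miss fetches the block as shared */
        return s == INVALID ? SHARED : s;
    case CPU_WRITE:       /* write gains ownership, invalidating other copies */
        return MODIFIED;
    case BUS_READ_MISS:   /* another core reads: supply data, demote to shared */
        return s == MODIFIED ? SHARED : s;
    case BUS_WRITE_MISS:  /* another core writes: invalidate our copy */
        return INVALID;
    }
    return s;
}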
Extensions:
MESI protocol
Adds an Exclusive state to indicate when a clean block is resident in only one cache
Prevents the need to send an invalidate on a write to an exclusive block
Core i7 uses MESIF: F = Forward
A read miss to a block in the E state changes that block to the S state
MOESI protocol
Owned state: indicates that the block is owned by that cache and is out-of-date in
memory.
A block changes from Modified to Owned, rather than to Shared, without being written
back to memory; other caches keep the block in the Shared state.
Processor 1:
A=0
A=1
if (B==0) ...
Processor 2:
B=0
B=1
if (A==0) ...
Relax W→R: total store ordering; no additional synchronization
Relax W→W: partial store order
Relax R→W and R→R: weak ordering and release consistency
Q4 a)
i) Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
HW advantages:
Dynamic disambiguation of memory addresses at runtime
Dynamic branch prediction, which works even when branch behavior is unknown at compile time
Precise exception model, even for speculated instructions
No compensation or bookkeeping code required
Code runs across implementations without recompilation
Q4 b)
i) Cache Performance
Length of memory latency: what to consider as the start and the end of a memory
operation in an out-of-order processor
Length of latency overlap: what is the start of overlap with the processor
Which has the lower miss rate: a 16 KB instruction cache plus a 16 KB data cache, or a 32
KB unified cache?
Split caches offer two memory ports per clock cycle, thereby avoiding the structural
hazard; they have a better average memory access time than the single-ported unified
cache despite having a worse effective miss rate.
Average memory access time = Hit time + Miss rate × Miss penalty
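For example, with an assumed 1-cycle hit time, 2% miss rate, and 100-cycle miss penalty (illustrative numbers): average memory access time = 1 + 0.02 × 100 = 3 cycles.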
Cache Optimization
Reducing the miss rate: larger block size, larger cache size, and higher associativity
Reducing the miss penalty: multilevel caches and giving reads priority over
writes
Reducing the time to hit in the cache: avoiding address translation when
indexing the cache
ii) Synchronization
Atomic exchange
Swaps register with memory location
Test-and-set
Sets under a condition:
tests whether a memory location is 0 and sets it to 1 if so.
Fetch-and-increment
Reads original value from memory and increments it in memory
Requires the memory read and write to occur as one uninterruptible instruction
Atomic Exchange
LOCK = 0 => FREE
LOCK = 1 => UNAVAILABLE
Lock value in memory
EXCH R,M
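A minimal C11 sketch of a spin lock built on atomic exchange, following the 0 = FREE / 1 = UNAVAILABLE convention above:

#include <stdatomic.h>

typedef atomic_int lock_t;   /* 0 = free, 1 = unavailable */

void lock_acquire(lock_t *lock) {
    /* Atomically swap in 1; if the old value was 1,
       another thread holds the lock, so keep spinning. */
    while (atomic_exchange(lock, 1) == 1)
        ;                    /* spin */
}

void lock_release(lock_t *lock) {
    atomic_store(lock, 0);   /* mark the lock free */
}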
Q5 a) Multiprocessor Architecture
Shared memory:
Communication among threads through a shared address space
A memory reference can be made by any processor to any memory
location.
Types
Centralized shared-memory multiprocessor
Distributed shared memory (DSM)
Q5 b)
i) Centralized Shared Memory Architectures
Q6 a) Intel Montecito
Features
Dual core Itanium 2, each core dual threaded
1.7 billion transistors, 21.5 mm x 27.7 mm die
27 MB of on-chip cache in three levels
Not shared among cores
1.8+ GHz, 100 W
Single-thread enhancements
Extra shifter improves performance of crypto codes by 100%
Improved branch prediction
Improved data and control speculation recovery
Separate L2 instruction and data caches buy a 7% improvement over Itanium 2; the L2I is
four times bigger (1 MB)
Asynchronous 12 MB L3 cache
Dual threads
SMT only for cache, not for core resources
Simulations showed high resource utilization at core level, but low utilization of cache
Branch predictor is still shared but uses thread ID tags
Thread switch is implemented by flushing the pipeline
More like coarse-grained multithreading
Cell Processor: Power Processing Element (PPE)
Compatibility processor: runs legacy code, compilers, libraries, etc.
Performs all system-level functions,
including starting the other cores
Synergistic Processing Element (SPE)
Offload processor
Dual issue, in-order, VLIW inspired
SIMD processor
128 x 128b register file
256KB local store, no I$, no D$
Runs a small program (placed in LS)
Offloads system functions to the PPE
MFC (memory flow controller)
a programmable DMA engine and on-chip network interface
Performance
1.83 GFlop/s (double precision)
25.6 GFlop/s (single precision)
51.2 GB/s access to the local store
25.6 GB/s access to the network
Threading Model
For each SPE program, the PPE must:
Create an SPE context
Load the SPE program (embedded in the binary)
Create a pthread to run it
Typically, the work is split into two phases: code that runs on the SPEs and code that runs
on the PPE, with a barrier in between (see the sketch below).
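A minimal sketch of those steps, assuming the libspe2 interface (spe_context_create, spe_program_load, spe_context_run; exact signatures may differ across SDK versions) and an embedded SPE binary named spu_program:

#include <pthread.h>
#include <libspe2.h>

extern spe_program_handle_t spu_program;   /* embedded SPE binary (assumed name) */

static void *run_spe(void *arg) {
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);   /* run to completion */
    return NULL;
}

int main(void) {
    spe_context_ptr_t ctx = spe_context_create(0, NULL); /* 1. create an SPE context */
    spe_program_load(ctx, &spu_program);                 /* 2. load the SPE program */
    pthread_t tid;
    pthread_create(&tid, NULL, run_spe, ctx);            /* 3. a pthread runs it */
    pthread_join(tid, NULL);                             /* barrier-like join on the PPE */
    spe_context_destroy(ctx);
    return 0;
}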
Thread Communication
The PPE may communicate with the SPEs via:
Individual mailboxes (FIFO)
HW signals
DRAM
Direct PPE access to the SPEs' local stores