
Shared Counters and Parallelism

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

A Shared Pool
public interface Pool {
  public void put(Object x);
  public Object remove();
}

Unordered set of objects


Put
Inserts an object; blocks if the pool is full

Remove
Removes & returns an object; blocks if the pool is empty
Art of Multiprocessor Programming 2

A Shared Pool
Put
Insert an item; block if full

Remove
Remove & return an item; block if empty

public interface Pool<T> {
  public void put(T x);
  public T remove();
}

Art of Multiprocessor Programming

Simple Locking Implementation


put

put

Art of Multiprocessor Programming

Simple Locking Implementation


put

put

Problem: hotspot contention


5

Simple Locking Implementation


put
Problem: sequential bottleneck

put

Problem: hotspot contention


6

Simple Locking Implementation


put
Problem: sequential bottleneck

put

Problem: hotspot contention → Solution: Queue Lock

Art of Multiprocessor Programming

Simple Locking Implementation


put
Problem: sequential bottleneck → Solution: ???

put

Problem: hotspot contention → Solution: Queue Lock

Art of Multiprocessor Programming
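For concreteness, here is a minimal sketch of such a single-lock pool (my own illustration, not the book's code; LockedPool is a hypothetical name, and it assumes the Pool<T> interface above): every put and remove synchronizes on the same object, which is exactly the sequential bottleneck and hot spot these slides complain about.

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch only: one monitor guards everything, so all threads serialize here.
class LockedPool<T> implements Pool<T> {
  private final Queue<T> items = new ArrayDeque<>();
  private final int capacity;

  LockedPool(int capacity) { this.capacity = capacity; }

  public synchronized void put(T x) {
    while (items.size() == capacity) {      // block if full
      try { wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
    }
    items.add(x);
    notifyAll();                            // wake blocked removers
  }

  public synchronized T remove() {
    while (items.isEmpty()) {               // block if empty
      try { wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); return null; }
    }
    T x = items.remove();
    notifyAll();                            // wake blocked putters
    return x;
  }
}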

Counting Implementation
put
(figure: put and remove counters advancing over a cyclic array of slots: 19, 20, 21)

remove

Art of Multiprocessor Programming

Counting Implementation
put
(figure: put and remove counters advancing over a cyclic array of slots: 19, 20, 21)

remove

Only the counters are sequential


Art of Multiprocessor Programming 10

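The slide's idea can be sketched like this (my own illustration, not the book's code; it assumes a bounded pool that is never over-filled or over-drained, so each slot is matched to exactly one put and one remove): the only shared, sequential steps are the two getAndIncrement calls.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch: tickets from the put/remove counters pick slots in a cyclic array.
class CountedPool<T> {
  private final AtomicReferenceArray<T> slots;
  private final AtomicInteger putTicket = new AtomicInteger();
  private final AtomicInteger removeTicket = new AtomicInteger();

  CountedPool(int capacity) { slots = new AtomicReferenceArray<>(capacity); }

  public void put(T x) {
    int i = putTicket.getAndIncrement() % slots.length();   // the only sequential step
    while (!slots.compareAndSet(i, null, x)) {}             // spin on my own slot, not on a global lock
  }

  public T remove() {
    int i = removeTicket.getAndIncrement() % slots.length();
    T x;
    while ((x = slots.getAndSet(i, null)) == null) {}       // spin until a producer fills my slot
    return x;
  }
}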

Shared Counter
(figure: threads concurrently taking values 0, 1, 2, 3 from a shared counter)

Art of Multiprocessor Programming

11

Shared Counter
No duplication

Art of Multiprocessor Programming

12

Shared Counter
No duplication, no omission

Art of Multiprocessor Programming

13

Shared Counter
No duplication, no omission

Art of Multiprocessor Programming

Not necessarily linearizable

14

Shared Counters
Can we build a shared counter with
Low memory contention, and Real parallelism?

Locking
Can use queue locks to reduce contention, but that's no help with the parallelism issue

Art of Multiprocessor Programming

15

Software Combining Tree


Contention: all spinning is local

Parallelism: Potential n/log n speedup

Art of Multiprocessor Programming

16

Combining Trees
0

Art of Multiprocessor Programming

17

Combining Trees
0

+3

Art of Multiprocessor Programming

18

Combining Trees
0

+3

+2

Art of Multiprocessor Programming

19

Combining Trees
0

+3

+2

Two threads meet, combine sums

Art of Multiprocessor Programming

20

Combining Trees
0
+5
+3 +2

Two threads meet, combine sums

Art of Multiprocessor Programming

21

Combining Trees
5
+5
+3 +2

Combined sum added to root

Art of Multiprocessor Programming

22

Combining Trees
5
0
+3 +2

Result returned to children

Art of Multiprocessor Programming

23

Combining Trees
5
0
0 3

Results returned to threads

Art of Multiprocessor Programming

24

What if?
Threads don't arrive together?
Should I stay or should I go?

How long to wait?


Waiting times add up

Idea:
Use a multi-phase algorithm where threads wait in parallel
Art of Multiprocessor Programming 25

Combining Status
enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT };

Art of Multiprocessor Programming

26

Combining Status
enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT };

Nothing going on

Art of Multiprocessor Programming

27

Combining Status
enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT };

1st thread is a partner for combining, will return to check for 2nd thread
Art of Multiprocessor Programming 28

Combining Status
enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT };

2nd thread has arrived with value for combining


Art of Multiprocessor Programming 29

Combining Status
enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT };

1st thread has deposited result for 2nd thread


Art of Multiprocessor Programming 30

Combining Status
enum CStatus{ IDLE, FIRST, SECOND, RESULT, ROOT };

Special case: root node


Art of Multiprocessor Programming 31

Node Synchronization
Short-term
Synchronized methods: consistency during a method call

Long-term
Boolean locked field: consistency across calls

Art of Multiprocessor Programming

32

Phases
Precombining
Set up combining rendez-vous

Art of Multiprocessor Programming

33

Phases
Precombining
Set up combining rendez-vous

Combining
Collect and combine operations

Art of Multiprocessor Programming

34

Phases
Precombining
Set up combining rendez-vous

Combining
Collect and combine operations

Operation
Hand off to higher thread

Art of Multiprocessor Programming

35

Phases
Precombining: set up the combining rendez-vous
Combining: collect and combine operations
Operation: hand off to the higher thread
Distribution: distribute results to waiting threads

Art of Multiprocessor Programming 36

Precombining Phase
0
IDLE

Examine status

Art of Multiprocessor Programming

37

Precombining Phase
0
FIRST

If IDLE, promise to return to look for a partner

Art of Multiprocessor Programming

38

Precombining Phase
0
FIRST

At ROOT, turn back

Art of Multiprocessor Programming

39

Precombining Phase
0
FIRST

Art of Multiprocessor Programming

40

Precombining Phase
0
SECOND

If FIRST, I'm willing to combine, but lock the node for now

Art of Multiprocessor Programming

41

Code
Tree class
In charge of navigation

Node class
Combining state, synchronization state, bookkeeping

Art of Multiprocessor Programming

42
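As a reading aid for the fragments that follow, a Node holds roughly this state (a sketch assembled from the fields those fragments use; the constructors and exact layout are my assumption, not a copy of the book's class):

class Node {
  CStatus cStatus;     // combining status (IDLE, FIRST, SECOND, RESULT, ROOT)
  boolean locked;      // long-term lock: consistency across phases
  int firstValue;      // 1st thread's (possibly already combined) contribution
  int secondValue;     // 2nd thread's contribution
  int result;          // result passed back down; at the root, the counter itself
  Node parent;         // null at the root

  Node() { cStatus = CStatus.ROOT; }                        // root node
  Node(Node myParent) { parent = myParent; cStatus = CStatus.IDLE; }
}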

Precombining Navigation
Node node = myLeaf;
while (node.precombine()) {
  node = node.parent;
}
Node stop = node;

Art of Multiprocessor Programming

43

Precombining Navigation

Start at leaf
Art of Multiprocessor Programming 44

Precombining Navigation

Move up while instructed to do so


Art of Multiprocessor Programming 45

Precombining Navigation

Remember where we stopped


Art of Multiprocessor Programming 46

Precombining Node
synchronized boolean precombine() {
  while (locked) wait();
  switch (cStatus) {
    case IDLE:
      cStatus = CStatus.FIRST;
      return true;
    case FIRST:
      locked = true;
      cStatus = CStatus.SECOND;
      return false;
    case ROOT:
      return false;
    default:
      throw new PanicException();
  }
}
Art of Multiprocessor Programming 47

Precombining Node
The method is synchronized: this provides the short-term synchronization.
Art of Multiprocessor Programming 48

Synchronization
while (locked) wait(); waits while the node is locked (in use by an earlier combining phase).
Art of Multiprocessor Programming 49

Precombining Node
The switch on cStatus checks the combining status.

Art of Multiprocessor Programming 50

Node was IDLE


case IDLE: become FIRST and return true; I will return to look for the 2nd thread's input value.
Art of Multiprocessor Programming 51

Precombining Node
Returning true means: continue up the tree.

Art of Multiprocessor Programming 52

I'm the 2nd Thread

case FIRST: the 1st thread has promised to return, so lock the node so it won't leave without me.
Art of Multiprocessor Programming 53

Precombining Node
Setting cStatus to SECOND prepares to deposit the 2nd thread's input value.
Art of Multiprocessor Programming 54

Precombining Node
Returning false ends the precombining phase: don't continue up the tree.
Art of Multiprocessor Programming 55

Node is the Root


case ROOT: at the root the precombining phase ends; don't continue up the tree.
Art of Multiprocessor Programming 56

Precombining Node
default: throw new PanicException(); always check for unexpected values!
Art of Multiprocessor Programming 57

Combining Phase
0
SECOND

+3

1st thread is locked out until the 2nd thread provides its value

Art of Multiprocessor Programming

58

Combining Phase
0
SECOND

+3

2nd thread deposits its value to be combined, unlocks the node, & waits


zzz

Art of Multiprocessor Programming

59

Combining Phase
0
+5
SECOND
2

+3

+2

1st thread moves up the tree with combined value


zzz

Art of Multiprocessor Programming

60

Combining (reloaded)
0
FIRST

2nd thread has not yet deposited its value

Art of Multiprocessor Programming

61

Combining (reloaded)
0

FIRST

+3

1st thread is alone, locks out late partner

Art of Multiprocessor Programming

62

Combining (reloaded)
0 +3
FIRST

Stop at root

+3

Art of Multiprocessor Programming

63

Combining (reloaded)
0 +3
FIRST

+3

The 2nd thread's late precombining-phase visit is locked out

Art of Multiprocessor Programming

64

Combining Navigation
node = myLeaf;
int combined = 1;
while (node != stop) {
  combined = node.combine(combined);
  stack.push(node);
  node = node.parent;
}

Art of Multiprocessor Programming

65

Combining Navigation

Start at leaf

Art of Multiprocessor Programming

66

Combining Navigation

Add 1
Art of Multiprocessor Programming 67

Combining Navigation
The loop revisits the nodes visited during precombining.

Art of Multiprocessor Programming

68

Combining Navigation

Accumulate combined values, if any


Art of Multiprocessor Programming 69

Combining Navigation
stack.push(node): we will retraverse the path in reverse order.

Art of Multiprocessor Programming

70

Combining Navigation
node = node.parent: move up the tree.

Art of Multiprocessor Programming

71

Combining Phase Node


synchronized int combine(int combined) {
  while (locked) wait();
  locked = true;
  firstValue = combined;
  switch (cStatus) {
    case FIRST:
      return firstValue;
    case SECOND:
      return firstValue + secondValue;
    default:
      throw new PanicException();  // always check for unexpected values
  }
}
Art of Multiprocessor Programming 72

Combining Phase Node


while (locked) wait(); waits until the node is unlocked. It is locked by the 2nd thread until it deposits its value.
Art of Multiprocessor Programming 73

Combining Phase Node


Why is it that no thread acquires the lock between the two lines (the wait loop and locked = true)? Both run inside the same synchronized method, so the monitor is never given up between them.
Art of Multiprocessor Programming 74

Combining Phase Node


locked = true; locks out late attempts to combine (by threads still in precombining).
Art of Multiprocessor Programming 75

Combining Phase Node


firstValue = combined; remembers my (1st thread) contribution.
Art of Multiprocessor Programming 76

Combining Phase Node


The switch on cStatus checks the status.
Art of Multiprocessor Programming 77

Combining Phase Node


case FIRST: am I (the 1st thread) alone?
Art of Multiprocessor Programming 78

Combining Node
case SECOND: not alone, so combine with the 2nd thread's value.
Art of Multiprocessor Programming 79

Operation Phase
5 +5

+3

Add combined value to root, start back down


+2

zzz

Art of Multiprocessor Programming

80

Operation Phase (reloaded)


5

SECOND

Leave value to be combined

Art of Multiprocessor Programming

81

Operation Phase (reloaded)


5

SECOND

Unlock, and wait


+2

zzz

Art of Multiprocessor Programming

82

Operation Phase Navigation


prior = stop.op(combined);

Art of Multiprocessor Programming

83

Operation Phase Navigation


prior = stop.op(combined);

The node where we stopped. Provide collected sum and wait for combining result
Art of Multiprocessor Programming 84

Operation on Stopped Node


synchronized int op(int combined) {
  switch (cStatus) {
    case ROOT:
      int prior = result;
      result += combined;
      return prior;
    case SECOND:
      secondValue = combined;
      locked = false;
      notifyAll();
      while (cStatus != CStatus.RESULT) wait();
      locked = false;
      notifyAll();
      cStatus = CStatus.IDLE;
      return result;
    default:
      throw new PanicException();  // always check for unexpected values
  }
}
Art of Multiprocessor Programming 85

Op States of Stop Node


Only ROOT and SECOND are possible here. Why? (The stop node is either the root or a node whose status this thread set to SECOND during precombining.)
Art of Multiprocessor Programming 86

At Root
case ROOT: add the combined sum to the root's result and return the prior value.
Art of Multiprocessor Programming 87

Intermediate Node
case SECOND: deposit the value for later combining.
Art of Multiprocessor Programming 88

Intermediate Node
locked = false; notifyAll(); unlocks the node (which I locked in precombining), then notifies the 1st thread.
Art of Multiprocessor Programming 89

Intermediate Node
while (cStatus != CStatus.RESULT) wait(); waits for the 1st thread to deliver the result.
Art of Multiprocessor Programming 90

Intermediate Node
Then unlock the node (locked by the 1st thread in the combining phase) and return the result.
Art of Multiprocessor Programming 91

Distribution Phase
5 0
SECOND

Move down with result

zzz

Art of Multiprocessor Programming

92

Distribution Phase
5

SECOND

Leave result for 2nd thread & lock node


zzz

Art of Multiprocessor Programming

93

Distribution Phase
5

Push result down tree

SECOND

zzz

Art of Multiprocessor Programming

94

Distribution Phase
5

IDLE
3

2nd thread awakens, unlocks, takes value

Art of Multiprocessor Programming

95

Distribution Phase Navigation


while (!stack.empty()) {
  node = stack.pop();
  node.distribute(prior);
}
return prior;

Art of Multiprocessor Programming

96

Distribution Phase Navigation



Traverse path in reverse order

Art of Multiprocessor Programming

97

Distribution Phase Navigation



Distribute results to waiting 2nd threads

Art of Multiprocessor Programming

98

Distribution Phase Navigation



Return result to caller


Art of Multiprocessor Programming 99

Distribution Phase
synchronized void distribute(int prior) {
  switch (cStatus) {
    case FIRST:
      cStatus = CStatus.IDLE;
      locked = false;
      notifyAll();
      return;
    case SECOND:
      result = prior + firstValue;
      cStatus = CStatus.RESULT;
      notifyAll();
      return;
    default:
      throw new PanicException();  // always check for unexpected values
  }
}

Art of Multiprocessor Programming

100

Distribution Phase
case FIRST: no 2nd thread combined with me, so unlock the node and reset it to IDLE.

Art of Multiprocessor Programming

101

Distribution Phase

case SECOND: notify the 2nd thread that the result is available (the 2nd thread will release the lock).

Art of Multiprocessor Programming

102
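Putting the four navigation fragments and the node methods together, one increment reads roughly as follows (a sketch of how the pieces compose, not the book's exact code; leaf[], ThreadID.get(), and the InterruptedException handling are assumptions):

public int getAndIncrement() throws InterruptedException {
  java.util.Stack<Node> stack = new java.util.Stack<>();
  Node myLeaf = leaf[ThreadID.get() / 2];       // assumed: two threads share each leaf

  // Precombining: reserve rendez-vous points on the way up.
  Node node = myLeaf;
  while (node.precombine()) node = node.parent;
  Node stop = node;

  // Combining: revisit those nodes, accumulating values.
  node = myLeaf;
  int combined = 1;
  while (node != stop) {
    combined = node.combine(combined);
    stack.push(node);
    node = node.parent;
  }

  // Operation: hand the combined sum to the stop node (root or 2nd-thread node).
  int prior = stop.op(combined);

  // Distribution: deliver results back down the path.
  while (!stack.empty()) {
    node = stack.pop();
    node.distribute(prior);
  }
  return prior;
}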

Bad News: High Latency


+5

Log n
+2 +3

Art of Multiprocessor Programming

103

Good News: Real Parallelism


+5

1 thread
+3

+2

2 threads
104

Art of Multiprocessor Programming

Throughput Puzzles
Ideal circumstances
All n threads move together, combine n increments in O(log n) time

Worst circumstances
All n threads slightly skewed, locked out: n increments in O(n log n) time

Art of Multiprocessor Programming

105

Index Distribution Benchmark


void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = r.getAndIncrement();
    Thread.sleep(random() % work);
  }
}

Art of Multiprocessor Programming

106

Index Distribution Benchmark



How many iterations

Art of Multiprocessor Programming

107

Index Distribution Benchmark



Expected time between incrementing counter


Art of Multiprocessor Programming 108

Index Distribution Benchmark



Take a number

Art of Multiprocessor Programming

109

Index Distribution Benchmark



Pretend to work (more work, less concurrency)


Art of Multiprocessor Programming 110
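To make the benchmark runnable, a driver along these lines would do (hypothetical glue code, not from the slides; here r is a plain AtomicInteger standing in for whichever shared counter is under test):

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

class IndexBench {
  final AtomicInteger r = new AtomicInteger();    // the shared counter under test

  void indexBench(int iters, int work) throws InterruptedException {
    int i = 0;
    while (i < iters) {
      i = r.getAndIncrement();                                 // take a number
      Thread.sleep(ThreadLocalRandom.current().nextInt(work)); // pretend to work
    }
  }

  void run(int nThreads, int iters, int work) throws InterruptedException {
    Thread[] workers = new Thread[nThreads];
    for (int t = 0; t < nThreads; t++) {
      workers[t] = new Thread(() -> {
        try { indexBench(iters, work); } catch (InterruptedException ignored) { }
      });
      workers[t].start();
    }
    for (Thread w : workers) w.join();            // elapsed time / iters gives throughput
  }
}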

Performance
Here are some fake graphs
Distilled from real ones

Performance
Here are some fake graphs
Distilled from real ones

Your performance will probably vary


But not by much?

Performance
Here are some fake graphs
Distilled from real ones

Your performance will probably vary


But not by much?

Throughput
Average incs in 1 million cycles

Performance
Here are some fake graphs, distilled from real ones
Your performance will probably vary, but not by much?
Throughput: average increments in 1 million cycles
Latency: average cycles per increment

Latency
(graph: latency vs. number of processors; the spin lock is bad, the combining tree is good)
115

Throughput
(graph: throughput vs. number of processors; the combining tree is good, the spin lock is bad)
116

Load Fluctuations
Combining is sensitive:
if arrival rates drop, so do combining rates, and performance deteriorates!

Test
Vary work: the duration between accesses
117

Combining Rate vs Work


(graph: combining rate (%) vs. number of processors, for work W=100, W=1000, W=5000)

118

Better to Wait Longer


(graph: latency vs. processors for short, medium, and indefinite waiting times at combining nodes)
119

Conclusions
Combining Trees
Linearizable counters
Work well under high contention
Sensitive to load fluctuations
Can be used for getAndMumble() ops

And now for something completely different


Art of Multiprocessor Programming 120

A Balancer

Input wires

Output wires

Art of Multiprocessor Programming

121

Tokens Traverse Balancers

Token i enters on any wire and leaves on wire i (mod 2)


Art of Multiprocessor Programming 122

Tokens Traverse Balancers

Art of Multiprocessor Programming

123


Tokens Traverse Balancers
Quiescent state: all tokens have exited

Arbitrary input distribution


Art of Multiprocessor Programming

Balanced output distribution


127

Smoothing Network

1-smooth property

Art of Multiprocessor Programming 128

Counting Network

Art of Multiprocessor Programming

step property

129

Counting Networks Count!
The step property guarantees no duplication and no omission. How?

0, 4, 8, ...
1, 5, 9, ...
2, 6, ...
3, 7, ...

Multiple counters distribute the load.

Art of Multiprocessor Programming 130
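Concretely, a width-w counting network becomes a getAndIncrement by attaching a local counter to each output wire: wire i hands out i, i + w, i + 2w, and so on. A sketch (my own illustration; takeFromWire would be called by a token after it exits the network on that wire):

class NetworkCounter {
  private final int width;        // w = number of output wires
  private final int[] nextValue;  // one local counter per wire

  NetworkCounter(int width) {
    this.width = width;
    nextValue = new int[width];
    for (int i = 0; i < width; i++) nextValue[i] = i;
  }

  // Per-wire locking would suffice; a single synchronized method keeps the sketch short.
  public synchronized int takeFromWire(int wire) {
    int v = nextValue[wire];
    nextValue[wire] += width;     // this wire's next value is w further on
    return v;
  }
}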

Counting Networks Count!
The step property guarantees that in-flight tokens will take the missing values.

0, 4, 8, ...
1, 5, 9, ...
2, 6, ...
3, 7, ...

If 5 and 9 are taken before 4 and 8, the tokens still in the network must take 4 and 8.

131

Counting Networks
Good for counting the number of tokens
low contention
no sequential bottleneck
high throughput
two practical networks of depth log² n

Art of Multiprocessor Programming

132

Counting Network
1

Art of Multiprocessor Programming

133

Counting Network
1 2

Art of Multiprocessor Programming

134

Counting Network
1 2 3

Art of Multiprocessor Programming

135

Counting Network
1 2 3

Art of Multiprocessor Programming

136

Counting Network
15 2 3 4

Art of Multiprocessor Programming

137

Counting Network
1 2 3 4 5

Art of Multiprocessor Programming

138

Bitonic[k] Counting Network

Art of Multiprocessor Programming

139

Bitonic[k] Counting Network

140

Bitonic[k] is not Linearizable

Art of Multiprocessor Programming

141

Bitonic[k] is not Linearizable

Art of Multiprocessor Programming

142

Bitonic[k] is not Linearizable


2

Art of Multiprocessor Programming

143

Bitonic[k] is not Linearizable 0


2

Art of Multiprocessor Programming

144

Bitonic[k] is not Linearizable 0


Problem: Red finished before Yellow started, yet Red took 2 and Yellow took 0

Art of Multiprocessor Programming

145

But it is Quiescently Consistent

Has Step Property in Any Quiescent State (one in which all tokens have exited)
Art of Multiprocessor Programming 146

Shared Memory Implementation


class Balancer {
  boolean toggle;
  Balancer[] next;

  synchronized boolean flip() {
    boolean oldValue = this.toggle;
    this.toggle = !this.toggle;
    return oldValue;
  }
}
Art of Multiprocessor Programming 147

Shared Memory Implementation


The toggle field is the balancer's state.
Art of Multiprocessor Programming 148

Shared Memory Implementation


next[] holds the output connections to the following balancers.
Art of Multiprocessor Programming 149

Shared Memory Implementation


flip() is a getAndComplement on the toggle bit.
Art of Multiprocessor Programming 150

Shared Memory Implementation


Balancer traverse(Balancer b) {
  while (!b.isLeaf()) {
    boolean toggle = b.flip();
    if (toggle)
      b = b.next[0];
    else
      b = b.next[1];
  }
  return b;
}
Art of Multiprocessor Programming 151

Shared Memory Implementation


The loop stops when we exit the network (b.isLeaf()).
Art of Multiprocessor Programming 152

Shared Memory Implementation


b.flip() flips the balancer's state.
Art of Multiprocessor Programming 153

Shared Memory Implementation


The toggle value chooses which output wire we exit on.
Art of Multiprocessor Programming 154

Bitonic[2k] Inductive Structure


Bitonic[k] Merger[2k]

Bitonic[k]
Art of Multiprocessor Programming 156

Bitonic[4] Counting Network


Bitonic[2] Merger[4] Bitonic[2]

Art of Multiprocessor Programming

157

Bitonic[8] Layout

Bitonic[4]
Merger[8]
Bitonic[4]

Art of Multiprocessor Programming

158

Unfolded Bitonic[8] Network

Merger[8]

Art of Multiprocessor Programming

159

Unfolded Bitonic[8] Network


Merger[4]

Merger[4]

Art of Multiprocessor Programming

160

Unfolded Bitonic[8] Network


Merger[2]

Merger[2]
Merger[2] Merger[2]

Art of Multiprocessor Programming

161

Bitonic[k] Depth
Width k; depth is (log2 k)(log2 k + 1)/2

Art of Multiprocessor Programming

162
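A quick sanity check of the formula, restating the construction as a recurrence (my restatement, not from the slides):

d(2) = 1 and d(2k) = d(k) + log2(2k), since Merger[2k] contributes log2(2k) layers.
Unrolling: d(k) = 1 + 2 + ... + log2 k = (log2 k)(log2 k + 1)/2.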

Proof by Induction
Base:
Bitonic[2] is a single balancer, which has the step property by definition

Step:
If Bitonic[k] has the step property, so does Bitonic[2k]

Bitonic[2k] Schematic
Bitonic[k] Merger[2k]

Bitonic[k]
Art of Multiprocessor Programming 164

Bitonic[2k] Counts
Induction Hypothesis Need to prove

Merger[2k]

165

Merger[2k] Schematic
Merger[k]

Merger[k]
Art of Multiprocessor Programming 166

Merger[2k] Layout

Art of Multiprocessor Programming

167

Proof: Lemma 1
If a sequence has the step property

Art of Multiprocessor Programming

168

Lemma 1
So does its even subsequence

Art of Multiprocessor Programming

169

Lemma 1
Also its odd subsequence

Art of Multiprocessor Programming

170

Lemma 2
even

Even + odd Odd + even

Diff at most 1

even

Art of Multiprocessor Programming

171

Bitonic[2k] Layout Details


Merger[2k]
Bitonic[k]
even

Merger[k]

Bitonic[k]

even

Merger[k]
172

Art of Multiprocessor Programming

By induction hypothesis

Outputs have step property

Bitonic[k]

Merger[k]

Bitonic[k]

Merger[k]
Art of Multiprocessor Programming 173

By Lemma 1
even

All subsequences have step property

Merger[k]

even

Merger[k]
Art of Multiprocessor Programming 174

By Lemma 2
even

Diff at most 1

Merger[k]

even

Merger[k]
Art of Multiprocessor Programming 175

By Induction Hypothesis
Outputs have step property

Merger[k]

Merger[k]
Art of Multiprocessor Programming 176

By Lemma 2
At most one diff

Merger[k]

Merger[k]
Art of Multiprocessor Programming 177

Last Row of Balancers


Merger[k] Merger[k]
Outputs of Merger[k]
Art of Multiprocessor Programming

Outputs of last layer


178

Last Row of Balancers


Wire i from one merger

Merger[k] Merger[k]
Wire i from other merger

Art of Multiprocessor Programming

179

Last Row of Balancers


Merger[k] Merger[k]
Outputs of Merger[k]
Art of Multiprocessor Programming

Outputs of last layer


180

Last Row of Balancers


Merger[k] Merger[k]

Art of Multiprocessor Programming

181

So Counting Networks Count


Merger[k] Merger[k]

Art of Multiprocessor Programming

182

Periodic Network Block

Art of Multiprocessor Programming

183


Block[2k] Schematic
Block[k]

Block[k]
Art of Multiprocessor Programming 187

Block[2k] Layout

Art of Multiprocessor Programming

188

Periodic[8]

Art of Multiprocessor Programming

189

Network Depth
Each Block[k] has depth log2 k; we need log2 k blocks; grand total of (log2 k)²

Art of Multiprocessor Programming

190

Lower Bound on Depth


Theorem: the depth of any width-w counting network is Ω(log w). Theorem: there exists a counting network of O(log w) depth. Unfortunately, the proof is non-constructive and the constants are in the 1000s.

Art of Multiprocessor Programming

191

Sequential Theorem
If a balancing network counts
Sequentially, meaning that Tokens traverse one at a time

Then it counts
Even if tokens traverse concurrently

Art of Multiprocessor Programming

192

Red First, Blue Second

Art of Multiprocessor Programming

193 (2)

Blue First, Red Second

Art of Multiprocessor Programming

194 (2)

Either Way
Same balancer states

Art of Multiprocessor Programming

195

Order Doesn't Matter


Same balancer states

Same output distribution

Art of Multiprocessor Programming

196

Index Distribution Benchmark


void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetch&inc();
    Thread.sleep(random() % work);
  }
}

Art of Multiprocessor Programming

197

Performance (Simulated)

Throughput

Higher is better!

MCS queue lock Spin lock Number processors


* All graphs taken from Herlihy,Lim,Shavit, copyright ACM.

Art of Multiprocessor Programming

198

Performance (Simulated)
64-leaf combining tree 80-balancer counting network

Throughput

Higher is better!
MCS queue lock Spin lock Number processors

Art of Multiprocessor Programming

199

Performance (Simulated)
64-leaf combining tree 80-balancer counting network

Throughput

Combining and counting are pretty close


MCS queue lock Spin lock Number processors

Art of Multiprocessor Programming

200

Performance (Simulated)
64-leaf combining tree 80-balancer counting network

Throughput

But they beat the hell out of the competition!

MCS queue lock Spin lock


Number processors

Art of Multiprocessor Programming

201

Saturation and Performance


Undersaturated: P < w log w
Saturated: P = w log w (optimal performance)
Oversaturated: P > w log w

Art of Multiprocessor Programming 202

Throughput vs. Size


Bitonic[16]

Throughput

Bitonic[8]

Bitonic[4]

Number processors
Art of Multiprocessor Programming 203

Shared Pool
put
(figure: pool slots indexed by put and remove counters: 19, 20, 21)

remove

Art of Multiprocessor Programming

204

Shared Pool
put

remove

Depth log2w

Art of Multiprocessor Programming

239

Counting Trees
A Tree Balancer:

Single input wire; step property in quiescent state

Counting Trees

Interleaving of output wires

Inductive Construction
Tree[2k] = a root balancer b whose output y0 feeds Tree0[k], giving the k even outputs, and whose output y1 feeds Tree1[k], giving the k odd outputs.

Lemma: Tree[2k] has step property in quiescent state.

At most 1 more token exits on the top wire than on the bottom wire

Inductive Construction

The top step sequence has at most one extra token, on the last wire of its step

Implementing Counting Trees


(figure: a binary tree of balancers, each holding a 0/1 toggle bit)

Example
inc: follow the results of getAndComplement on the toggle bits down the tree

(figure: the toggle bits along the traversed path)
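A sketch of that traversal in code (my own illustration; ToggleNode, child[], and leafIndex are hypothetical names, and the convention that an old toggle value of false sends a token to child[0] is an assumption):

import java.util.concurrent.atomic.AtomicBoolean;

class CountingTree {
  static class ToggleNode {
    final AtomicBoolean toggle = new AtomicBoolean(false);
    ToggleNode[] child;          // two children, or null at a leaf
    int leafIndex;               // meaningful only at leaves

    boolean getAndComplement() { // atomically complement the toggle, return the old value
      boolean old;
      do { old = toggle.get(); } while (!toggle.compareAndSet(old, !old));
      return old;
    }
  }

  // Walk from the root to a leaf, flipping toggle bits on the way down; the leaf
  // then hands out leafIndex, leafIndex + #leaves, leafIndex + 2*#leaves, ...
  int traverse(ToggleNode root) {
    ToggleNode node = root;
    while (node.child != null) {
      node = node.getAndComplement() ? node.child[1] : node.child[0];
    }
    return node.leafIndex;
  }
}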

Implementing Counting Trees

Problem: the toggle bit in the root balancer is a hot spot, and to a lesser extent so are the toggle bits in the next balancers.

Contention and a sequential bottleneck: so what have we achieved?

Diffraction Balancing
Idea (as in the elimination stack): if an even number of tokens pass a balancer, the toggle bit remains unchanged!
Prism Array
(figure: a prism array placed in front of the 0/1 toggle bit)
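A much simplified sketch of a diffracting balancer (my own illustration using java.util.concurrent.Exchanger as the prism slots; the real Shavit-Zemach prism uses its own exchange protocol and tuned timeouts): two tokens that collide in a prism slot leave on opposite wires without touching the toggle bit, and a token that finds no partner falls back to the toggle.

import java.util.concurrent.Exchanger;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class DiffractingBalancer {
  private final Exchanger<Long>[] prism;
  private boolean toggle;                       // the ordinary balancer state

  @SuppressWarnings("unchecked")
  DiffractingBalancer(int prismSize) {
    prism = new Exchanger[prismSize];
    for (int i = 0; i < prismSize; i++) prism[i] = new Exchanger<>();
  }

  /** Returns the output wire, 0 (top) or 1 (bottom). */
  int traverse() throws InterruptedException {
    int slot = ThreadLocalRandom.current().nextInt(prism.length);
    long me = Thread.currentThread().getId();
    try {
      // Diffraction: if a partner shows up in time, the pair splits over the two
      // wires deterministically (distinct thread ids), never touching the toggle bit.
      long partner = prism[slot].exchange(me, 100, TimeUnit.MICROSECONDS);
      return me < partner ? 0 : 1;
    } catch (TimeoutException e) {
      synchronized (this) {                     // no partner: behave like a plain balancer
        boolean old = toggle;
        toggle = !toggle;
        return old ? 0 : 1;
      }
    }
  }
}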

Diffracting Tree
(figure: a tree of diffracting balancers B1, B2, B3; each diffracting balancer is a prism array in front of a 0/1 toggle bit)

Lemma: a diffracting balancer behaves the same as a balancer.

Diffracting Tree
(figure as before: each balancer in the tree is a prism array plus a 0/1 toggle bit)

High load: lots of diffraction, few toggles.
Low load: little diffraction, few toggles.
Either way: high throughput with low contention.

Performance
(graphs: throughput and latency vs. concurrency P, comparing the MCS lock, the combining tree (Ctree), and the diffracting tree (Dtree))

Summary
We can build a linearizable parallel shared counter.
By relaxing our coherence requirements, we can build a shared counter with
low memory contention and real parallelism.
Art of Multiprocessor Programming 251

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.


You are free:
to Share: to copy, distribute and transmit the work
to Remix: to adapt the work
Under the following conditions:
Attribution. You must attribute the work to "The Art of Multiprocessor Programming" (but not in any way that suggests that the authors endorse you or your use of the work).
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/.
Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.

Art of Multiprocessor Programming

252
