Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
A Shared Pool
public interface Pool {
  public void put(Object x);
  public Object remove();
}

Remove: removes and returns an object; blocks if the pool is empty.
A Shared Pool
Put: insert an item; block if the pool is full.
Remove: remove and return an item; block if the pool is empty.
Counting Implementation
[Figure: an array of slots; put operations draw successive tickets (…, 19, 20, 21, …) from one shared counter and remove operations from another, and each ticket picks the slot to use]
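The counting implementation above can be sketched in Java. This is a hypothetical sketch, not the book's code: two `AtomicInteger` tickets pick slots, and two semaphores supply the block-if-full / block-if-empty behavior.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the counting implementation: each put draws a
// ticket from one shared counter, each remove from another, and the
// ticket (mod capacity) picks the slot. Semaphores supply the blocking
// behavior. A production version would also need per-slot full/empty
// flags, omitted here, so racing threads never touch a slot too early.
class CountingPool<T> {
    private final T[] slots;
    private final AtomicInteger putTicket = new AtomicInteger(0);
    private final AtomicInteger removeTicket = new AtomicInteger(0);
    private final Semaphore free;  // put blocks here when the pool is full
    private final Semaphore busy;  // remove blocks here when it is empty

    @SuppressWarnings("unchecked")
    CountingPool(int capacity) {
        slots = (T[]) new Object[capacity];
        free = new Semaphore(capacity);
        busy = new Semaphore(0);
    }

    public void put(T x) {
        free.acquireUninterruptibly();
        int i = putTicket.getAndIncrement() % slots.length;
        slots[i] = x;
        busy.release();
    }

    public T remove() {
        busy.acquireUninterruptibly();
        int i = removeTicket.getAndIncrement() % slots.length;
        T x = slots[i];
        free.release();
        return x;
    }
}
```

The point of the sketch is that all coordination goes through the two shared counters, which is exactly the bottleneck the following slides examine.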
Shared Counter
[Figure: threads draw successive values 0, 1, 2, 3 from a shared counter]
No duplication: no two threads receive the same value.
No omission: no value is skipped.
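These two guarantees are exactly what an atomic getAndIncrement provides; a minimal sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch: a shared counter whose getAndIncrement hands each
// value out exactly once, so there is no duplication and no omission.
class SharedCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int getAndIncrement() {
        return value.getAndIncrement();
    }
}
```

The single central `AtomicInteger` is itself the memory-contention hot spot that the rest of the lecture sets out to eliminate.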
Shared Counters
Can we build a shared counter with low memory contention and real parallelism?
Locking
Queue locks can reduce contention, but they are no help with the parallelism issue.
Combining Trees
[Figure: increments +3 and +2 meet at a shared node and combine into +5; the +5 is applied once at the root (0 becomes 5), and the root's prior value is split back down (0 and 3) so each thread receives a distinct result]
What If?
What if threads don't arrive together? Should I stay or should I go?
Idea: use a multi-phase algorithm where threads wait in parallel.
Combining Status

enum CStatus { IDLE, FIRST, SECOND, RESULT, ROOT };

IDLE: nothing going on.
FIRST: 1st thread is a partner for combining; it will return to check for a 2nd thread.
SECOND: 2nd thread has deposited an input value to combine.
RESULT: the combined result is ready for the 2nd thread.
ROOT: this node is the root.
Node Synchronization
Short-term: synchronized methods; consistency during a method call.
Long-term: boolean locked field; consistency across calls.
Phases
Precombining: set up the combining rendezvous.
Combining: collect and combine operations.
Operation: hand off the combined operation to the higher thread.
Distribution: distribute results back down the tree.
Precombining Phase
[Figure: a thread moves up from its leaf, examining each node's status; IDLE nodes are set to FIRST; at the ROOT it turns back; a later thread that finds a FIRST node sets it to SECOND]
Code
Tree class: in charge of navigation.
Node class: combining state, synchronization state, bookkeeping.
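The Node state the slides describe can be sketched as follows (field names follow the slides; the wait/notify plumbing and method bodies appear on later slides):

```java
// Sketch of the combining-tree Node state described above: combining
// status, the long-term lock, the two partners' values, and the parent
// pointer used for navigation.
class Node {
    enum CStatus { IDLE, FIRST, SECOND, RESULT, ROOT }

    boolean locked = false;       // long-term lock, held across method calls
    CStatus cStatus;              // combining status
    int firstValue, secondValue;  // the two partners' deposited inputs
    int result;                   // value handed back down the tree
    final Node parent;            // tree navigation

    Node() {                      // root node
        cStatus = CStatus.ROOT;
        parent = null;
    }

    Node(Node myParent) {         // interior or leaf node
        cStatus = CStatus.IDLE;
        parent = myParent;
    }
}
```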
Precombining Navigation

Node node = myLeaf;          // start at leaf
while (node.precombine()) {
  node = node.parent;
}
Node stop = node;            // remember where precombining stopped
Precombining Node

synchronized boolean precombine() {
  while (locked) wait();        // short-term synchronization: wait while the
                                // node is locked (in use by an earlier
                                // combining phase)
  switch (cStatus) {            // check combining status
    case IDLE:
      cStatus = CStatus.FIRST;  // continue up the tree
      return true;
    case FIRST:                 // the 1st thread has promised to return, so
      locked = true;            // lock the node so it won't leave without me
      cStatus = CStatus.SECOND; // prepare to deposit the 2nd thread's input
      return false;             // end of precombining phase; don't continue up
    case ROOT:
      return false;             // at the root, turn back
    default:
      throw new PanicException(); // always check for unexpected values!
  }
}
Combining Phase
[Figure: the 2nd thread deposits its +2 at a SECOND node and waits; the 1st thread combines it with its own +3 into +5 and carries +5 upward]

Combining (reloaded)
[Figure: a thread alone at a FIRST node carries its +3 up and stops at the root]
Combining Navigation

node = myLeaf;                 // start at leaf
int combined = 1;              // add 1
while (node != stop) {         // revisit the nodes visited in precombining
  combined = node.combine(combined);
  stack.push(node);            // we will retraverse the path in reverse order
  node = node.parent;          // move up the tree
}
Combining Node

synchronized int combine(int combined) {
  while (locked) wait();   // wait while the node is locked
  locked = true;           // lock out a late-arriving 2nd thread
  firstValue = combined;   // deposit the 1st thread's contribution
  switch (cStatus) {
    case FIRST:            // alone: nothing to combine with
      return firstValue;
    case SECOND:           // not alone: combine with the 2nd thread
      return firstValue + secondValue;
    default:
      throw new PanicException();
  }
}
Operation Phase
[Figure: the combined +5 travels to the root while the 2nd thread waits (zzz) at its SECOND node]

The node where we stopped: provide the collected sum and wait for the combining result.
At the Root and at an Intermediate Node

synchronized int op(int combined) {
  switch (cStatus) {
    case ROOT:                    // at the root: add the sum to the
      int prior = result;         // result, return the prior value
      result += combined;
      return prior;
    case SECOND:                  // intermediate node:
      secondValue = combined;     // deposit value for later combining
      locked = false;             // unlock the node (which I locked in
      notifyAll();                // precombining), then notify the 1st thread
      while (cStatus != CStatus.RESULT)
        wait();                   // wait for the 1st thread to deliver results
      locked = false;             // unlock the node (locked by the 1st thread
      notifyAll();                // in the combining phase) and return
      cStatus = CStatus.IDLE;
      return result;
    default:
      throw new PanicException();
  }
}
Distribution Phase
[Figure: the root's prior value flows back down; the waiting 2nd thread wakes, picks up the node's result (3), and the node returns to IDLE]
Distribution Phase

synchronized void distribute(int prior) {
  switch (cStatus) {
    case FIRST:                    // no 2nd thread to combine with me:
      cStatus = CStatus.IDLE;      // unlock the node and reset
      locked = false;
      notifyAll();
      return;
    case SECOND:                   // notify the 2nd thread that the result is
      result = prior + firstValue; // available (the 2nd thread will release
      cStatus = CStatus.RESULT;    // the lock)
      notifyAll();
      return;
    default:
      throw new PanicException();
  }
}
[Figure: the tree has depth log n; the +3 and +2 increments traverse it whether 1 thread or 2 threads arrive]
Throughput Puzzles
Ideal circumstances: all n threads move together and combine; n increments take O(log n) time.
Worst circumstances: all n threads slightly skewed, so they lock each other out; n increments take O(n log n) time.
Take a number
Performance
Here are some fake graphs, distilled from real ones. Your performance will probably vary, but not by much.
Throughput: average increments in 1 million cycles.
Latency: average cycles per increment.

Latency
[Graph: average latency vs. number of processors, spin lock vs. combining tree]

Throughput
[Graph: throughput vs. number of processors, spin lock vs. combining tree]
Load Fluctuations
Combining is sensitive: if arrival rates drop, so do combining rates, and performance deteriorates!
Test: vary the amount of work (the duration between accesses).
[Graph: throughput vs. number of processors (1 to 64) for work interval W=5000]

Latency
[Graph: latency vs. processors under a medium wait and an indefinite wait]
Conclusions
Combining trees give linearizable counters; they work well under high contention but are sensitive to load fluctuations, and they can be used for getAndMumble() ops.
A Balancer
[Figure: a balancer with input wires and output wires]
Smoothing Network
Counting Network
The step property.

Counting
[Figure: output wire 0 hands out 1, 5, 9, …; wire 1 hands out 2, 6, …; wire 2 hands out 3, 7, …]

Counting Networks Count!
The step property guarantees that in-flight tokens will take the missing values.

Counting Networks
Good for counting the number of tokens: low contention, no sequential bottleneck, high throughput. Two practical networks of depth log² n.
Counting Network
[Figure: tokens 1, 2, 3, 4, 5 traverse the network one at a time; after each traversal the outputs satisfy the step property]

Has the step property in any quiescent state (one in which all tokens have exited).
synchronized boolean flip() {
  boolean oldValue = this.toggle;
  this.toggle = !this.toggle;
  return oldValue;
}
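The same toggle can also be written lock-free with an AtomicBoolean; a sketch (not the book's code), where traverse() returns the output wire taken by the next token:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of a lock-free balancer: a compare-and-set loop flips the
// toggle bit, and alternate tokens exit on wires 0 and 1.
class Balancer {
    private final AtomicBoolean toggle = new AtomicBoolean(true);

    // Returns the output wire (0 = top, 1 = bottom) for the next token.
    public int traverse() {
        for (;;) {
            boolean t = toggle.get();
            if (toggle.compareAndSet(t, !t)) {
                return t ? 0 : 1;   // true sends the token to the top wire
            }
        }
    }
}
```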
Bitonic[k]
[Figure: Bitonic[8] layout: two Bitonic[4] networks feed a Merger[8]; Merger[8] is built from Merger[4]s, which are built from Merger[2]s]

Bitonic[k] Depth
Width k; depth is (log₂ k)(log₂ k + 1)/2.
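The depth formula follows from the recurrence d(2) = 1 and d(2k) = d(k) + log₂(2k), since Bitonic[2k] appends a Merger[2k] of depth log₂(2k). A small helper (hypothetical, for illustration) evaluates it:

```java
// Depth of Bitonic[k] for k a power of two: the recurrence
// d(2) = 1, d(2k) = d(k) + log2(2k) solves to (log2 k)(log2 k + 1)/2.
class BitonicDepth {
    static int depth(int k) {
        int log = Integer.numberOfTrailingZeros(k); // log2 k, k a power of 2
        return log * (log + 1) / 2;
    }
}
```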
Proof by Induction
Base: Bitonic[2] is a single balancer; it has the step property by definition.
Step: if Bitonic[k] has the step property, so does Bitonic[2k].

Bitonic[2k] Schematic
[Figure: two Bitonic[k] networks feed a Merger[2k]]

Bitonic[2k] Counts
[Figure: the induction hypothesis covers the Bitonic[k] halves; we need to prove the step property for the Merger[2k] outputs]

Merger[2k] Schematic
[Figure: Merger[2k] is built from two Merger[k] networks followed by a final layer of balancers]

Merger[2k] Layout
[Figure]
Proof: Lemma 1
If a sequence has the step property, so does its even subsequence, and also its odd subsequence.

Lemma 2
[Figure: the even subsequences of the two halves' outputs differ by at most 1]
[Figure sequence: the Bitonic[k] outputs have the step property by the induction hypothesis; by Lemma 1 their even and odd subsequences do too; by Lemma 2 the inputs to the two Merger[k] networks differ by at most 1; by the induction hypothesis each Merger[k]'s outputs have the step property; by Lemma 2 the two output sequences differ by at most one token, and the final layer of balancers (wire i paired with wire i from the other merger) smooths out that difference]
Block[2k] Schematic
[Figure: Block[2k] is built from two Block[k] networks]

Block[2k] Layout
[Figure]

Periodic[8]
[Figure]

Network Depth
Each Block[k] has depth log₂ k; we need log₂ k blocks, for a grand total of (log₂ k)².
Sequential Theorem
If a balancing network counts sequentially, meaning that tokens traverse it one at a time, then it counts even when tokens traverse it concurrently.

Either Way
[Figure: whether two tokens traverse a balancer one at a time or concurrently, the resulting balancer states are the same]
Performance (Simulated)
[Graph: throughput (higher is better) vs. number of processors for a 64-leaf combining tree, an 80-balancer counting network, an MCS queue lock, and a spin lock]

Saturated (optimal performance): P = w log w.
Oversaturated: P > w log w.
Throughput
[Graph: throughput vs. number of processors for Bitonic[8] and Bitonic[4]]
Shared Pool
[Figure: put and remove operations feed through a network of depth log₂ w to pick slots]
Counting Trees
A Tree Balancer: [Figure]

Inductive Construction
Tree[2k] = a balancer b whose output y0 feeds Tree0[k], producing the k even outputs, and whose output y1 feeds Tree1[k], producing the k odd outputs.

The top step sequence has at most one extra token, on the last wire of the step.

Example
inc = follow getAndComplement of the toggle bits
[Figure: a token follows the 0/1 toggle bits at each balancer from the root to a leaf]
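The example above can be sketched as a tree of toggle bits. This is an illustrative sketch (the names are made up, and the per-leaf counters are omitted): an inc follows getAndComplement at each node, going one way on 0 and the other on 1, and the leaf reached determines which value the token takes.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of a counting tree: each node holds a toggle bit,
// and a token follows the getAndComplement results from the root down
// to a leaf. With k leaves, leaf i then hands out i, i + k, i + 2k, ...
// (the per-leaf counters are omitted here).
class ToggleNode {
    final AtomicBoolean toggle = new AtomicBoolean(false);
    ToggleNode left, right;   // null at the leaves
    int leafIndex;            // meaningful only at the leaves

    // Atomically flip the toggle bit, returning its prior value.
    boolean getAndComplement() {
        for (;;) {
            boolean t = toggle.get();
            if (toggle.compareAndSet(t, !t)) return t;
        }
    }

    // Follow the toggle bits from this node down to a leaf.
    static int traverse(ToggleNode node) {
        while (node.left != null) {
            node = node.getAndComplement() ? node.right : node.left;
        }
        return node.leafIndex;
    }
}
```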
Diffraction Balancing
Idea (as in the elimination stack): if an even number of tokens pass a balancer, its toggle bit remains unchanged!
Prism Array
[Figure: tokens meet in a prism array in front of the toggle bit]

Diffracting Tree
[Figure: a tree of diffracting balancers (Diff-Bal) B1, B2, B3; each balancer has a prism of size k/2 in front of its toggle bit]
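One way to sketch a diffracting balancer (an assumption on my part, not the slides' code) is with java.util.concurrent.Exchanger as the prism: two tokens that meet exchange unique stamps and leave on opposite wires without touching the toggle, so the bit is unchanged by the pair; a token that finds no partner falls through to the toggle bit.

```java
import java.util.concurrent.Exchanger;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of a diffracting balancer using an Exchanger as a
// one-slot prism: a matched pair of tokens exchanges unique stamps and
// splits across the two wires without touching the toggle bit. A lone
// token times out and falls back to the toggle.
class DiffractingBalancer {
    private final Exchanger<Integer> prism = new Exchanger<>();
    private final AtomicInteger stamper = new AtomicInteger(0);
    private final AtomicBoolean toggle = new AtomicBoolean(true);

    // Returns the output wire (0 or 1) for the calling token.
    public int traverse() {
        int myStamp = stamper.getAndIncrement();
        try {
            int otherStamp = prism.exchange(myStamp, 100, TimeUnit.MICROSECONDS);
            return myStamp < otherStamp ? 0 : 1; // the pair splits across wires
        } catch (InterruptedException | TimeoutException e) {
            // No partner arrived (interrupt handling is simplified for this
            // sketch): fall back to the shared toggle bit.
            for (;;) {
                boolean t = toggle.get();
                if (toggle.compareAndSet(t, !t)) return t ? 0 : 1;
            }
        }
    }
}
```

In a real prism, the array size and the timeout are tuned to the load; pairs diffract away from the toggle, so the hot spot sees far fewer tokens.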
Performance
[Graphs: throughput and latency vs. concurrency P for the MCS lock, the combining tree (Ctree), and the diffracting tree (Dtree)]
Summary
We can build a linearizable parallel shared counter. By relaxing our coherence requirements, we can build a shared counter with low memory contention and real parallelism.