John Mellor-Crummey
Department of Computer Science Rice University johnmc@rice.edu
COMP 422
Comparison-based or not
comparison sort
- basic operation: compare elements and exchange as necessary
- Θ(n log n) comparisons needed to sort n numbers
non-comparison-based sort
- e.g., radix sort, based on the binary representation of the data
- Θ(n) operations to sort n numbers
Network topology
- hypercube
- mesh
Communication mechanism
- shared address space
- message passing
[Figure: compare-exchange operation: a communication step followed by a comparison step]
1. Send block of size n/p to partner
2. Each partner now has both blocks
3. Merge received block with own block
4. Retain only the appropriate half of the merged block
Process Pi retains the smaller values; process Pj retains the larger values
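The four steps above can be simulated serially; a minimal Python sketch, where passing both blocks to one function stands in for the message exchange:

```python
def compare_split(block_i, block_j):
    """One compare-split step between partner processes Pi and Pj.

    Both blocks of size n/p are assumed locally sorted. Each partner
    merges the received block with its own; Pi retains the smaller
    half and Pj the larger half.
    """
    merged = sorted(block_i + block_j)   # steps 2-3: both blocks merged
    half = len(block_i)
    return merged[:half], merged[half:]  # step 4: keep appropriate halves

# example: Pi holds [1, 6, 8, 11], Pj holds [2, 9, 10, 13]
lo, hi = compare_split([1, 6, 8, 11], [2, 9, 10, 13])
# lo == [1, 2, 6, 8]; hi == [9, 10, 11, 13]
```

Note both partners redundantly compute the full merge; each simply keeps its own half, so no second exchange is needed.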
Basic Analysis
Assumptions
- Pi and Pj are neighbors
- communication channels are bi-directional
Sorting Network
Network of comparators designed for sorting
Comparator: two inputs x and y; two outputs x' and y'
types
- increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)
- decreasing (denoted ⊖): x' = max(x,y) and y' = min(x,y)
Sorting Networks
Network structure: a series of columns
Each column consists of a vector of comparators (in parallel)
[Figure: sorting network organization]
Bitonic sequence
two parts: one increasing and one decreasing
example: 1,2,4,7,6,0 first increases and then decreases (or vice versa)
Bitonic Split
s1 = min(a0,an/2), min(a1,an/2+1), ..., min(an/2-1,an-1)
s2 = max(a0,an/2), max(a1,an/2+1), ..., max(an/2-1,an-1)
Sequence properties
- s1 and s2 are both bitonic
- for every x in s1 and y in s2, x ≤ y
Apply recursively on s1 and s2 to produce a sorted sequence
Works for any bitonic sequence, even if |s1| ≠ |s2|
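The split and its properties can be checked with a short sketch (Python, assuming an even-length bitonic input):

```python
def bitonic_split(a):
    """Bitonic split: pair element i with element i + n/2.

    For a bitonic input, s1 (the mins) and s2 (the maxes) are each
    bitonic, and every element of s1 is <= every element of s2.
    """
    half = len(a) // 2
    s1 = [min(a[i], a[i + half]) for i in range(half)]
    s2 = [max(a[i], a[i + half]) for i in range(half)]
    return s1, s2

s1, s2 = bitonic_split([1, 2, 4, 7, 6, 0])   # the bitonic example above
# s1 == [1, 2, 0], s2 == [7, 6, 4]; max(s1) <= min(s2)
```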
Bitonic Merge
Sort a bitonic sequence through a series of bitonic splits
Example: use bitonic merge to sort a 16-element bitonic sequence
How: perform a series of log2 16 = 4 bitonic splits
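The series of splits can be written as a recursion; a serial sketch assuming the length is a power of two:

```python
def bitonic_merge(a):
    """Sort a bitonic sequence by recursive bitonic splits.

    len(a) is assumed a power of two, so log2(len(a)) levels of
    splits suffice, matching the series of splits described above.
    """
    if len(a) <= 1:
        return a
    half = len(a) // 2
    s1 = [min(a[i], a[i + half]) for i in range(half)]  # one bitonic split
    s2 = [max(a[i], a[i + half]) for i in range(half)]
    return bitonic_merge(s1) + bitonic_merge(s2)

bitonic_merge([3, 5, 8, 9, 7, 4, 2, 1])  # → [1, 2, 3, 4, 5, 7, 8, 9]
```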
Network structure
- log2 n columns
- each column has n/2 comparators and performs one step of the bitonic merge
Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
Two steps
1. build a bitonic sequence from the unsorted input
2. sort it with a bitonic merge
Repeatedly merge to generate larger bitonic sequences
⊕BM[k] and ⊖BM[k]: increasing and decreasing bitonic merging networks of size k
Bitonic Sort, n = 16
First 3 stages create bitonic sequence input to stage 4 Last stage (BM[16]) yields sorted sequence
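The whole network can be simulated serially: sort the two halves in opposite directions (their concatenation is bitonic), then run a directed bitonic merge. A sketch, assuming the length is a power of two:

```python
def bitonic_sort(a, ascending=True):
    """Bitonic sort, len(a) a power of two: sort the two halves in
    opposite directions (yielding a bitonic sequence), then merge."""
    if len(a) <= 1:
        return a
    half = len(a) // 2
    bitonic = bitonic_sort(a[:half], True) + bitonic_sort(a[half:], False)
    return directed_merge(bitonic, ascending)

def directed_merge(a, ascending):
    """Bitonic merge into the requested direction via recursive splits."""
    if len(a) <= 1:
        return a
    half = len(a) // 2
    lo = [min(a[i], a[i + half]) for i in range(half)]
    hi = [max(a[i], a[i + half]) for i in range(half)]
    if ascending:
        return directed_merge(lo, True) + directed_merge(hi, True)
    return directed_merge(hi, False) + directed_merge(lo, False)

bitonic_sort([10, 20, 5, 9, 3, 8, 12, 14])
# → [3, 5, 8, 9, 10, 12, 14, 20]
```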
depth = Σ_{i=1}^{log n} i = (log n + 1)(log n)/2 = Θ(log² n)
Each stage of the network contains n/2 comparators
Complexity of serial implementation = Θ(n log² n)
n iterations
in each iteration, each processor does one compare-exchange
Parallel run time of this formulation is Θ(n)
Cost optimal with respect to the base serial algorithm, but not with respect to the optimal serial algorithm!
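The n-iteration compare-exchange scheme above corresponds to odd-even transposition sort on a linear array, one element per processor; a serial simulation:

```python
def odd_even_transposition_sort(a):
    """Odd-even transposition sort: n phases; in phase p, the neighbor
    pairs starting at index p % 2 compare-exchange. The inner loop's
    exchanges touch disjoint pairs, so each phase is one parallel step."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        for i in range(phase % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

odd_even_transposition_sort([5, 3, 8, 1, 9, 2])  # → [1, 2, 3, 5, 8, 9]
```

With n processors doing one compare-exchange per phase, the n phases give the Θ(n) parallel run time quoted above.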
Quicksort
Popular sequential sorting algorithm
simplicity, low overhead, optimal average complexity
Operation
- select an entry in the sequence to be the pivot
- divide the sequence into two parts: one with all elements less than the pivot, the other with all elements greater
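The pivot-and-partition operation can be sketched as follows (the first element is chosen as pivot here purely for illustration; the slide leaves the pivot choice open):

```python
def quicksort(a):
    """Sequential quicksort: pick a pivot, partition into smaller and
    larger parts, and recurse on each part."""
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    smaller = [x for x in rest if x < pivot]   # elements less than pivot
    larger = [x for x in rest if x >= pivot]   # the rest (ties go right)
    return quicksort(smaller) + [pivot] + quicksort(larger)

quicksort([3, 9, 1, 7, 2, 8])  # → [1, 2, 3, 7, 8, 9]
```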
Parallelizing Quicksort
First, recursive decomposition
- partition the list serially
- handle each subproblem on a different processor
Time for this algorithm is lower-bounded by Ω(n)!
Can we parallelize the partitioning step?
can we use n processors to partition a list of length n around a pivot in O(1) time?
continue recursively on each part of the partition
Total work is O(n log n)
Recursion depth is O(log n)
Depth of each operation is constant
Key idea: long-distance moves of the first stage reduce the number of rounds necessary in the second stage
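One standard way to partition with constant-depth operations (treating prefix sums as a primitive) is flag vectors plus scans; a serial simulation of this idea, with names chosen for illustration:

```python
from itertools import accumulate

def parallel_partition(a, pivot):
    """Simulated data-parallel partition around a pivot.

    Each (virtual) processor flags its element as < pivot or >= pivot;
    prefix sums over the flags give every element its destination index,
    so all moves happen in one scatter step. Each phase is a constant
    number of scans and element-wise operations.
    """
    less = [1 if x < pivot else 0 for x in a]   # flag phase (parallel)
    less_pos = list(accumulate(less))           # prefix sum over flags
    geq_pos = list(accumulate(1 - f for f in less))
    n_less = less_pos[-1]
    out = [None] * len(a)
    for i, x in enumerate(a):                   # scatter phase (parallel)
        out[less_pos[i] - 1 if less[i] else n_less + geq_pos[i] - 1] = x
    return out, n_less

parallel_partition([3, 9, 1, 7, 2, 8], 5)
# → ([3, 1, 2, 9, 7, 8], 3)
```

Note the partition is stable within each part, which is what makes the recursion well behaved.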
Radix sort: in a series of rounds, sort elements into buckets by digit
Bucket and sample sort
- assumes evenly distributed items in an interval
- buckets represent evenly-sized subintervals
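The radix-sort rounds mentioned above can be sketched serially; the digit width and key range here are illustrative assumptions:

```python
def radix_sort(a, key_bits=16, digit_bits=4):
    """LSD radix sort: one stable bucketing round per digit of the
    binary representation, least-significant digit first.
    Assumes non-negative integer keys < 2**key_bits."""
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        buckets = [[] for _ in range(1 << digit_bits)]
        for x in a:                       # stable: preserves earlier rounds
            buckets[(x >> shift) & mask].append(x)
        a = [x for b in buckets for x in b]
    return a

radix_sort([170, 45, 75, 90, 802, 24, 2, 66])
# → [2, 24, 45, 66, 75, 90, 170, 802]
```

Stability of each round is what makes the per-digit passes compose into a total order.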
Enumeration sort:
- determine rank of each element
- place it in the correct position
CRCW PRAM algorithm: n² processors, sort in Θ(1) time
- assumes that all concurrent writes to a location deposit their sum
- the n processors in column j test element j against the rest; each whose element should precede element j writes 1 into C[j]
- place A[j] into A[C[j]]
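A serial simulation of the scheme above; the doubly nested loop stands in for the n² processors, and the sum plays the role of the summing concurrent write:

```python
def enumeration_sort(a):
    """Enumeration sort: element j's rank is the number of elements
    that must precede it (ties broken by index, so ranks are distinct);
    each element is then placed directly at its rank."""
    n = len(a)
    out = [None] * n
    for j in range(n):                   # column j: one processor per i
        rank = sum(1 for i in range(n)
                   if a[i] < a[j] or (a[i] == a[j] and i < j))
        out[rank] = a[j]                 # place A[j] at position C[j]
    return out

enumeration_sort([4, 2, 4, 1])  # → [1, 2, 4, 4]
```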
Histogram Sorting
goal: divide keys into p evenly sized pieces, using an iterative approach
1. initiating processor broadcasts k > p-1 splitter guesses
2. each processor determines how many keys fall in each bin
3. sum histograms with a global reduction
4. one processor examines the guesses to see which are satisfactory
5. broadcast finalized splitters and the number of keys for each processor
6. each processor sends local data to the appropriate processors using all-to-all communication
7. each processor merges the chunks it receives
Solomonik and Kale improved this scheme (IPDPS 2010)
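The histogramming of steps 2-3 can be sketched serially; `local_keys_per_proc` and the splitter guesses here are illustrative:

```python
import bisect

def histogram_counts(local_keys_per_proc, splitters):
    """One histogram round: each processor counts its keys per bin,
    where bins are the intervals between sorted splitter guesses;
    summing the local counts models the global reduction. Balanced
    guesses put roughly n/p keys in each bin."""
    splitters = sorted(splitters)
    counts = [0] * (len(splitters) + 1)
    for keys in local_keys_per_proc:           # local histogram per processor
        for k in keys:
            counts[bisect.bisect_right(splitters, k)] += 1
    return counts

histogram_counts([[1, 5, 9], [2, 6, 10]], [4, 8])
# → [2, 2, 2]  (each bin gets 2 of the 6 keys: balanced guesses)
```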
References
- Adapted from slides "Sorting" by Ananth Grama
- Based on Chapter 9 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003
- Programming Parallel Algorithms. Guy Blelloch. Communications of the ACM, volume 39, number 3, March 1996. http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#sort
- Edgar Solomonik and Laxmikant V. Kale. Highly Scalable Parallel Sorting. Proceedings of IPDPS 2010.