
Parallel Sorting

John Mellor-Crummey
Department of Computer Science
Rice University
johnmc@rice.edu

COMP 422

Lecture 19, 5 April 2011

Topics for Today



Introduction
Issues in parallel sorting
Sorting networks and Batcher's bitonic sort
Bubble sort and odd-even transposition sort
Parallel quicksort

Why Study Parallel Sorting?



One of the most common operations performed
Close relation to the task of routing on parallel computers
e.g. HPC Challenge RandomAccess benchmark

Sorting Algorithm Attributes



Internal vs. external
internal: data fits in memory
external: uses tape or disk

Comparison-based or not
comparison sort
basic operation: compare elements and exchange as necessary
Θ(n log n) comparisons to sort n numbers

non-comparison-based sort
e.g. radix sort based on the binary representation of data
Θ(n) operations to sort n numbers

Parallel vs. sequential

Parallel Sorting Is Intrinsically Interesting


Different algorithms for different architecture variants

Abstract parallel architecture


PRAM

Network topology
hypercube
mesh

Communication mechanism
shared address space
message passing

Today's focus: parallel comparison-based sorting

Parallel Sorting Basics



Where are the input and output lists stored?
we assume that both input and output lists are distributed

What is a parallel sorted sequence?


sequence partitioned among the processors
each processor's sub-sequence is sorted
all in Pj's sub-sequence < all in Pk's sub-sequence if j < k
the best process numbering can depend on network topology

Element-wise Parallel Compare-Exchange


When partitioning is one element per process
1. Processes Pj and Pk send their elements to each other [communication step]
before: Pj holds aj; Pk holds ak
after: Pj holds aj, ak; Pk holds ak, aj
2. Process Pj keeps min(aj, ak), and Pk keeps max(aj, ak) [comparison step]
result: Pj holds min(aj, ak); Pk holds max(aj, ak)

Bulk Parallel Compare-Split

1. Send block of size n/p to partner
2. Each partner now has both blocks
3. Merge received block with own block
4. Retain only the appropriate half of the merged block
Pi retains smaller values; process Pj retains larger values
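The four compare-split steps above can be sketched in Python as follows. This is a minimal single-machine sketch, not the lecture's code: the function name `compare_split` is hypothetical, both blocks are assumed to be already sorted and of equal length n/p, and the exchange of blocks between partners is modeled by simply passing both lists to one function.

```python
from heapq import merge


def compare_split(block_i, block_j):
    """Simulate compare-split between partners Pi and Pj (i < j).

    Assumes both blocks are sorted and have the same length n/p.
    Returns (new block for Pi, new block for Pj).
    """
    # Steps 1-2: after the exchange, each partner holds both blocks.
    # Step 3: merge the two sorted blocks in O(n/p) time.
    merged = list(merge(block_i, block_j))
    # Step 4: Pi retains the smaller half, Pj retains the larger half.
    half = len(block_i)
    return merged[:half], merged[half:]
```

With one element per block, this reduces to the element-wise compare-exchange: Pi keeps the minimum and Pj keeps the maximum.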

Basic Analysis

Assumptions
Pi and Pj are neighbors
communication channels are bi-directional
t_s: message startup time; t_w: per-word transfer time

Elementwise compare-exchange: 1 element per processor
time = t_s + t_w

Bulk compare-split: n/p elements per processor
after compare-split on pair of processors Pi and Pj, i < j
smaller n/p elements are at processor Pi
larger n/p elements are at Pj
time = t_s + t_w n/p
merge in O(n/p) time, as long as partial lists are sorted

Sorting Network

Network of comparators designed for sorting
Comparator: two inputs x and y; two outputs x' and y'
types
increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)
decreasing (denoted ⊖): x' = max(x,y) and y' = min(x,y)
Sorting network speed is proportional to its depth

Sorting Networks

Network structure: a series of columns
Each column consists of a vector of comparators (in parallel)
Sorting network organization:

Example: Bitonic Sorting Network

Bitonic sequence
two parts: increasing and decreasing
1,2,4,7,6,0: first increases and then decreases (or vice versa)

cyclic rotation of a bitonic sequence is also considered bitonic


8,9,2,1,0,4: cyclic rotation of 0,4,8,9,2,1

Bitonic sorting network


sorts n elements in Θ(log² n) time
network kernel: rearranges a bitonic sequence into a sorted one

Bitonic Split

Let s = ⟨a_0, a_1, …, a_{n-1}⟩ be a bitonic sequence such that
a_0 ≤ a_1 ≤ … ≤ a_{n/2-1}, and a_{n/2} ≥ a_{n/2+1} ≥ … ≥ a_{n-1}

Consider the following subsequences of s
s1 = ⟨min(a_0, a_{n/2}), min(a_1, a_{n/2+1}), …, min(a_{n/2-1}, a_{n-1})⟩
s2 = ⟨max(a_0, a_{n/2}), max(a_1, a_{n/2+1}), …, max(a_{n/2-1}, a_{n-1})⟩

Sequence properties
s1 and s2 are both bitonic
∀x ∀y, x ∈ s1, y ∈ s2: x < y

Apply recursively on s1 and s2 to produce a sorted sequence
Works for any bitonic sequence, even if |s1| ≠ |s2|
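A bitonic split can be sketched in a few lines of Python. This is an illustrative sketch assuming an even-length bitonic input; the function name `bitonic_split` is hypothetical.

```python
def bitonic_split(s):
    """Split a bitonic sequence of even length into (s1, s2).

    s1 takes the element-wise minima of the two halves and s2 the maxima,
    so every element of s1 is no larger than every element of s2.
    """
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2
```

Each min/max pair corresponds to one comparator, so all n/2 comparisons of a split are independent and can run in parallel.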

Splitting Bitonic Sequences - I


Sequence properties
s1 and s2 are both bitonic
∀x ∀y, x ∈ s1, y ∈ s2: x < y
[Figure: bitonic split into min and max halves]

Splitting Bitonic Sequences - II


Sequence properties
s1 and s2 are both bitonic
∀x ∀y, x ∈ s1, y ∈ s2: x < y
[Figure: bitonic split into min and max halves]

Bitonic Merge
Sort a bitonic sequence through a series of bitonic splits
Example: use bitonic merge to sort 16-element bitonic sequence
How: perform a series of log2 16 = 4 bitonic splits
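A recursive bitonic merge might look like the following Python sketch. It assumes the input is bitonic and its length is a power of two; the name `bitonic_merge` and the `ascending` flag are illustrative choices, not taken from the slides.

```python
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    # One bitonic split: element-wise minima and maxima of the two halves.
    lows = [min(s[i], s[i + half]) for i in range(half)]
    highs = [max(s[i], s[i + half]) for i in range(half)]
    # Both halves are bitonic again, so recurse; order them by direction.
    first, second = (lows, highs) if ascending else (highs, lows)
    return bitonic_merge(first, ascending) + bitonic_merge(second, ascending)
```

For a 16-element input the recursion is 4 levels deep, matching the log2 16 = 4 splits.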

Sorting via Bitonic Merging Network



Sorting network can implement bitonic merge algorithm
bitonic merging network

Network structure
log2 n columns
each column
n/2 comparators
performs one step of the bitonic merge

Bitonic merging network with n inputs: ⊕BM[n]
yields increasing output sequence

Replacing ⊕ comparators by ⊖ comparators: ⊖BM[n]
yields decreasing output sequence

Bitonic Merging Network, BM[16]

Input: bitonic sequence


input wires are numbered 0, 1, …, n - 1 (shown in binary)

Output: sequence in sorted order
Each column of comparators is drawn separately

Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
Two steps
Build a bitonic sequence
Sort it using a bitonic merging network

Building a Bitonic Sequence

Build a single bitonic sequence from the given sequence


any sequence of length 2 is a bitonic sequence
build bitonic sequence of length 4
sort first two elements using ⊕BM[2]
sort next two using ⊖BM[2]

Repeatedly merge to generate larger bitonic sequences
⊕BM[k] & ⊖BM[k]: bitonic merging networks of size k

Building a Bitonic Sequence

Input: sequence of 16 unordered numbers
Output: a bitonic sequence of 16 numbers

Bitonic Sort, n = 16

First 3 stages create bitonic sequence input to stage 4
Last stage (⊕BM[16]) yields sorted sequence
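The two-step scheme (build a bitonic sequence, then merge it) can be sketched recursively in Python. This is a sketch under the assumption that the input length is a power of two; the function names are hypothetical, and the `ascending` flag plays the role of choosing between ⊕BM and ⊖BM.

```python
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence (length a power of two) in the given direction."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    lows = [min(s[i], s[i + half]) for i in range(half)]
    highs = [max(s[i], s[i + half]) for i in range(half)]
    first, second = (lows, highs) if ascending else (highs, lows)
    return bitonic_merge(first, ascending) + bitonic_merge(second, ascending)


def bitonic_sort(s, ascending=True):
    """Sort an arbitrary sequence whose length is a power of two."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    # Step 1: an increasing half followed by a decreasing half is bitonic.
    increasing = bitonic_sort(s[:half], ascending=True)
    decreasing = bitonic_sort(s[half:], ascending=False)
    # Step 2: merge the bitonic sequence into sorted order.
    return bitonic_merge(increasing + decreasing, ascending)
```

For n = 16 the recursion produces merges of size 2, 4, 8, and 16, which correspond to the four stages of the network.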

Complexity of Bitonic Sorting Networks

Depth of the network is Θ(log² n)
log2 n merge stages
depth of the jth merge stage is log2 2^j = j

depth = \sum_{j=1}^{\log_2 n} \log_2 2^j = \sum_{j=1}^{\log_2 n} j = \frac{(\log_2 n + 1)\log_2 n}{2} = \Theta(\log^2 n)

Each column of the network contains n/2 comparators
Complexity of serial implementation = Θ(n log² n)
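The depth count can be checked by generating the network column by column. The following Python sketch assumes n is a power of two and uses the common iterative formulation in which wire i is paired with wire i XOR j; the function name and the tuple layout `(wire, partner, increasing)` are illustrative assumptions.

```python
def bitonic_network(n):
    """Return the columns of a bitonic sorting network for n wires.

    Each column is a list of (wire, partner, increasing) comparators,
    where `increasing` is True for a min-on-top comparator.
    """
    columns = []
    k = 2  # size of the bitonic merges in the current stage
    while k <= n:
        j = k // 2  # distance between compared wires in this column
        while j >= 1:
            column = [(i, i ^ j, (i & k) == 0)
                      for i in range(n) if (i ^ j) > i]
            columns.append(column)
            j //= 2
        k *= 2
    return columns
```

For n = 16 this yields 1 + 2 + 3 + 4 = 10 columns of 8 comparators each, in agreement with the formula (log2 n + 1)(log2 n)/2.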

Mapping Bitonic Sort to a Hypercube


Consider one item per processor

How do we map wires in bitonic network onto a hypercube?
In earlier examples
compare-exchange occurs only between wires whose labels differ in exactly 1 bit

Direct mapping of wires to processors


all communication is nearest neighbor


Mapping Bitonic Merge to a Hypercube

Communication during the last merge stage of bitonic sort

Each number is mapped to a hypercube node
Each connection represents a compare-exchange

Mapping Bitonic Sort to Hypercubes

Communication in bitonic sort on a hypercube

Processes communicate along dims shown in each stage


Algorithm is cost optimal w.r.t. its serial counterpart
Not cost optimal w.r.t. the best sorting algorithm
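The hypercube mapping can be simulated on one machine. The Python sketch below assumes one item per node and n a power of two; the function name, the synchronous round structure, and the rule used to decide which partner keeps the minimum are a common formulation, stated here as assumptions rather than as the lecture's code.

```python
def hypercube_bitonic_sort(values):
    """Simulate bitonic sort with one item per hypercube node.

    Returns (sorted values, list of dimensions used in each round).
    """
    n = len(values)
    d = n.bit_length() - 1
    assert n == 1 << d, "number of nodes must be a power of two"
    held = list(values)
    schedule = []
    for stage in range(1, d + 1):
        # Stage `stage` communicates along dimensions stage-1, ..., 0.
        for dim in range(stage - 1, -1, -1):
            schedule.append(dim)
            nxt = list(held)
            for node in range(n):
                partner = node ^ (1 << dim)  # neighbor across dimension dim
                # Keep the minimum when bit `stage` equals bit `dim` of the label.
                keep_min = ((node >> stage) & 1) == ((node >> dim) & 1)
                pair = (held[node], held[partner])
                nxt[node] = min(pair) if keep_min else max(pair)
            held = nxt
    return held, schedule
```

Every partner is a direct hypercube neighbor because the two labels differ in exactly one bit, so each round is a nearest-neighbor exchange.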

Batcher's Bitonic Sort in NESL


function merge(a) =
  if (#a == 1) then a
  else
    let
      halves = bottop(a);
      mins = {min(x, y) : x in halves[0]; y in halves[1]};
      maxs = {max(x, y) : x in halves[0]; y in halves[1]};
    in flatten({merge(x) : x in [mins,maxs]});

function bitonic_sort(a) =
  if (#a == 1) then a
  else
    let b = {bitonic_sort(x) : x in bottop(a)};
    in merge(b[0]++reverse(b[1]));

bitonic_sort([2, 3, -7, 6, 5, 22, -8, 12]);

Try it at: http://www.cs.rice.edu/~johnmc/nesl.html

Bubble Sort and Variants


Sequential bubble sort algorithm
Compares and exchanges adjacent elements in the sequence

Bubble Sort and Variants



Bubble sort complexity: Θ(n²)
Difficult to parallelize
algorithm has no concurrency

A simple variant uncovers concurrency

Sequential Odd-Even Transposition Sort


Odd-Even Transposition Sort, n = 8

In each phase, n = 8 elements are compared



Odd-Even Transposition Sort



After n phases of odd-even exchanges, sequence is sorted
Each phase of algorithm requires Θ(n) comparisons
Serial complexity is Θ(n²)
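A sequential odd-even transposition sort can be sketched in Python as below. The function name is hypothetical, and the choice of which pairing goes first (indices (0,1), (2,3), … in the first phase) is an assumption; n phases suffice either way.

```python
def odd_even_transposition_sort(values):
    """Sort by n alternating phases of adjacent compare-exchanges."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Alternate between pairs (0,1),(2,3),... and pairs (1,2),(3,4),...
        start = phase % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

Within one phase the compared pairs are disjoint, which is the concurrency that plain bubble sort lacks.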

Parallel Odd-Even Transposition


Consider one item per processor

n iterations
in each iteration, each processor does one compare-exchange

Parallel run time of this formulation is Θ(n)
Cost optimal with respect to the base serial algorithm
but not the optimal one!

Parallel Odd-Even Transposition Sort

note: if partner id < 1 or > n, then skip compare
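The one-item-per-processor formulation, including the skip rule in the note, can be simulated as follows. This Python sketch uses 1-based process ids as in the note; the function name and the dictionary used to hold each process's element are illustrative assumptions, and real processes would exchange elements by messages rather than by reading a shared dictionary.

```python
def parallel_odd_even_sort(values):
    """Simulate n processes (ids 1..n), each holding one element."""
    n = len(values)
    held = {pid: v for pid, v in enumerate(values, start=1)}
    for phase in range(1, n + 1):
        nxt = dict(held)
        for pid in range(1, n + 1):
            # Ids with the same parity as the phase pair with the right
            # neighbor; the others pair with the left neighbor.
            partner = pid + 1 if pid % 2 == phase % 2 else pid - 1
            if partner < 1 or partner > n:
                continue  # no partner in this phase: skip the compare
            if pid < partner:
                nxt[pid] = min(held[pid], held[partner])
            else:
                nxt[pid] = max(held[pid], held[partner])
        held = nxt  # all processes finish the phase together
    return [held[pid] for pid in range(1, n + 1)]
```

Each process performs one compare-exchange per phase, so the n phases give the Θ(n) parallel run time.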



Quicksort

Popular sequential sorting algorithm
simplicity, low overhead, optimal average complexity

Operation
select an entry in the sequence to be the pivot
divide the sequence into two parts
one with all elements less than the pivot
the other with all elements greater

Apply process recursively to each sublist

Parallelizing Quicksort

First, recursive decomposition
partition the list serially
handle each subproblem on a different processor

Time for this algorithm is lower-bounded by Ω(n)!
Can we parallelize the partitioning step?
can we use n processors to partition a list of length n around a pivot in O(1) time?

Tricky on real machines

Practical Parallel Quicksort


Each processor initially responsible for n/p elements

Shared memory formulation


select first pivot & broadcast
each processor partitions own data
globally rearrange data into smaller and larger parts (in place)
recurse with proportional # processors on each part

Message passing formulation


partitioning
each processor first partitions local portion of array
determine which processes will be responsible for each partition (based on size of smaller than pivot and larger than pivot groups)
divide up the data among the processor subsets responsible for each part

continue recursively
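The message passing formulation can be simulated on one machine as in the Python sketch below. Everything here is an assumption made for illustration: the names `deal` and `parallel_quicksort`, the pivot choice (first element of the first non-empty block), the rounding rule for splitting the processes, and the use of list concatenation in place of real message exchange.

```python
def deal(items, k):
    """Split items into k contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), k)
    chunks, start = [], 0
    for rank in range(k):
        end = start + size + (1 if rank < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_quicksort(blocks):
    """blocks[i] is the local data of simulated process i.

    Returns one sorted block per process, ordered across processes.
    """
    p = len(blocks)
    if p == 1:
        return [sorted(blocks[0])]  # a single process sorts locally
    nonempty = [b for b in blocks if b]
    if not nonempty:
        return [[] for _ in blocks]
    pivot = nonempty[0][0]  # one process picks a pivot and broadcasts it
    # Each process partitions its own block; concatenation stands in
    # for gathering the local partitions.
    smaller = [x for b in blocks for x in b if x < pivot]
    larger = [x for b in blocks for x in b if x >= pivot]
    # Assign processes in proportion to partition sizes, at least one each.
    share = round(p * len(smaller) / (len(smaller) + len(larger)))
    p_small = min(p - 1, max(1, share))
    # Redistribute each partition over its process subset and recurse.
    return (parallel_quicksort(deal(smaller, p_small))
            + parallel_quicksort(deal(larger, p - p_small)))
```

Because each subset always has fewer processes than its parent, the recursion ends with single processes sorting their local data; load balance still depends on pivot quality.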


Data Parallel Quicksort in NESL


[Code listing: data-parallel quicksort in NESL, lines 1-8]

Total work is O(n log n)
Recursion depth is O(log n)
Depth of each operation is constant

Total depth is O(log n) as well
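A data-parallel quicksort in this style can be approximated in Python with comprehensions standing in for parallel element-wise operations. This is a sequential sketch, not the NESL program from the slide; the middle-element pivot and the three-way partition are assumptions, and the work and depth bounds quoted above are expected-case bounds for an actual data-parallel execution.

```python
def quicksort(a):
    """Quicksort written as whole-sequence (data-parallel style) operations."""
    if len(a) < 2:
        return list(a)
    pivot = a[len(a) // 2]
    # Each filter is one element-wise operation over the whole sequence;
    # a data-parallel language would evaluate it in constant depth.
    lesser = [e for e in a if e < pivot]
    equal = [e for e in a if e == pivot]
    greater = [e for e in a if e > pivot]
    # The two recursive calls are independent and could run in parallel.
    sorted_lesser, sorted_greater = (quicksort(v) for v in (lesser, greater))
    return sorted_lesser + equal + sorted_greater
```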

Other Sorting Algorithms

Shellsort - another variant of bubble sort


two stage process
log p rounds of long distance exchanges
followed by rounds of odd-even transposition sort until done

key idea: long distance moves of first stage reduce number of rounds necessary in second stage

Radix sort: in a series of rounds, sort elements into buckets by digit

Bucket and sample sort
assumes evenly distributed items in an interval
buckets represent evenly-sized subintervals

Enumeration sort:
determine rank of each element
place it in the correct position
CRCW PRAM algorithm: n² processors, sort in Θ(1) time
assumes that all concurrent writes to a location deposit sum
n processes in column j test element j against the rest; write 1 into C[j]
place A[j] into A[C[j]]
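Enumeration sort can be sketched sequentially in Python, with the double loop standing in for the n² PRAM processes and `+=` standing in for sum-combining concurrent writes. The function name is hypothetical, and the tie-breaking rule on indices is an added assumption so that equal keys receive distinct ranks.

```python
def enumeration_sort(a):
    """Place each element at the position given by its rank."""
    n = len(a)
    c = [0] * n  # c[j] accumulates the rank of a[j]
    for i in range(n):
        for j in range(n):
            # "Process (i, j)" writes 1 into c[j] if a[i] must precede a[j];
            # ties are broken by index (an assumption for duplicate keys).
            if a[i] < a[j] or (a[i] == a[j] and i < j):
                c[j] += 1
    out = [None] * n
    for j in range(n):
        out[c[j]] = a[j]  # place A[j] at position C[j]
    return out
```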

Other Sorting Algorithms (Cont)

Histogram Sorting
goal: divide keys into p evenly sized pieces
use an iterative approach to do so
initiating processor broadcasts k > p-1 splitter guesses
each processor determines how many keys fall in each bin
sum histogram with global reduction
one processor examines guesses to see which are satisfactory
broadcast finalized splitters and number of keys for each processor
each processor sends local data to appropriate processors using all-to-all communication
each processor merges chunks it receives
Kale and Solomonik improved this (IPDPS 2010)
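The histogramming step (local counts followed by a global sum) can be sketched in Python as below. Only that one step is covered; splitter refinement and the all-to-all exchange are omitted. The function names and the convention that a key equal to a splitter falls into the bin to its right are assumptions.

```python
from bisect import bisect_right


def local_histogram(keys, splitters):
    """Count how many local keys fall into each bin defined by sorted splitters."""
    counts = [0] * (len(splitters) + 1)
    for key in keys:
        # Bin index = number of splitters that are <= key.
        counts[bisect_right(splitters, key)] += 1
    return counts


def global_histogram(blocks, splitters):
    """Sum the per-process histograms, as a global reduction would."""
    histograms = [local_histogram(block, splitters) for block in blocks]
    return [sum(column) for column in zip(*histograms)]
```

A splitter guess is satisfactory when the cumulative count up to it is close enough to the ideal share of keys for the corresponding processor.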

References

Adapted from slides "Sorting" by Ananth Grama
Based on Chapter 9 of "Introduction to Parallel Computing" by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003
"Programming Parallel Algorithms." Guy Blelloch. Communications of the ACM, volume 39, number 3, March 1996.
http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#sort
Edgar Solomonik and Laxmikant V. Kale. "Highly Scalable Parallel Sorting." Proceedings of IPDPS 2010.
