Comp422 2011 Lecture19 Sorting

Parallel Sorting
John Mellor-Crummey
Department of Computer Science Rice University johnmc@rice.edu
COMP 422
Lecture 19 5 April 2011
Topics for Today

Introduction
Issues in parallel sorting
Sorting networks and Batcher's bitonic sort
Bubble sort and odd-even transposition sort
Parallel quicksort
Why Study Parallel Sorting?

One of the most common operations performed
Close relation to task of routing on parallel computers
e.g. HPC Challenge RandomAccess benchmark
Sorting Algorithm Attributes

Internal vs. external
internal: data fits in memory
external: uses tape or disk
Comparison-based or not
comparison sort
basic operation: compare elements and exchange as necessary
Θ(n log n) comparisons to sort n numbers
non-comparison-based sort
e.g. radix sort based on the binary representation of data
Θ(n) operations to sort n numbers
Parallel vs. sequential
Parallel Sorting Is Intrinsically Interesting

Different algorithms for different architecture variants
Abstract parallel architecture

PRAM
Network topology
hypercube
mesh
Communication mechanism
shared address space
message passing
Today's focus: parallel comparison-based sorting
Parallel Sorting Basics

Where are the input and output lists stored?
we assume that both input and output lists are distributed
What is a parallel sorted sequence?

sequence partitioned among the processors
each processor's sub-sequence is sorted
all elements in Pj's sub-sequence < all elements in Pk's sub-sequence if j < k
the best process numbering can depend on network topology
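The definition above can be checked mechanically. The following is a minimal Python sketch, assuming each process's local sub-sequence is modeled as one list in a list of lists (the name `is_parallel_sorted` and this representation are illustrative, not from the lecture). It accepts equal keys across a process boundary, which is slightly more permissive than the strict `<` stated above.

```python
from typing import Optional, Sequence


def is_parallel_sorted(blocks: Sequence[Sequence[int]]) -> bool:
    """blocks[j] models the local sub-sequence held by process Pj (assumed layout)."""
    prev_max: Optional[int] = None
    for block in blocks:
        # Condition 1: each local sub-sequence is itself sorted.
        if any(block[i] > block[i + 1] for i in range(len(block) - 1)):
            return False
        if block:
            # Condition 2: nothing held by an earlier process exceeds
            # the smallest element of this process.
            if prev_max is not None and prev_max > block[0]:
                return False
            prev_max = block[-1]
    return True
```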
Element-wise Parallel Compare-Exchange

When partitioning is one element per process
1. Processes Pj and Pk send their elements to each other [communication step]
before: Pj holds aj; Pk holds ak
after: each process has both elements, aj and ak
2. Process Pj keeps min(aj, ak), and Pk keeps max(aj, ak) [comparison step]
result: Pj holds min(aj, ak); Pk holds max(aj, ak)
Bulk Parallel Compare-Split
1. Send block of size n/p to partner
2. Each partner now has both blocks
3. Merge received block with own block
4. Retain only the appropriate half of the merged block
Pi retains smaller values; process Pj retains larger values
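The four steps above can be sketched in Python as follows. This is a sequential simulation under stated assumptions: the two blocks are plain sorted lists, the exchange is modeled by passing both lists to one function, and the names `merge` and `compare_split` are illustrative. A real implementation would exchange the blocks over the network first.

```python
from typing import List, Tuple


def merge(a: List[int], b: List[int]) -> List[int]:
    """Merge two sorted lists in O(len(a) + len(b)) time."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def compare_split(block_i: List[int], block_j: List[int]) -> Tuple[List[int], List[int]]:
    """Return (new block for Pi, new block for Pj); both inputs must be sorted."""
    merged = merge(block_i, block_j)
    k = len(block_i)
    # Pi retains the smaller values, Pj the larger ones.
    return merged[:k], merged[k:]
```

For example, `compare_split([1, 6, 8, 11], [2, 9, 12, 13])` leaves the four smallest keys on Pi and the four largest on Pj, and each retained block is still sorted.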
Basic Analysis

Assumptions
Pi and Pj are neighbors
communication channels are bi-directional
Elementwise compare-exchange: 1 element per processor

time = ts + tw (ts: message startup time; tw: per-word transfer time)
Bulk compare-split: n/p elements per processor

after compare-split on pair of processors Pi and Pj, i < j
smaller n/p elements are at processor Pi
larger n/p elements are at processor Pj
time = ts + tw n/p

merge in O(n/p) time, as long as partial lists are sorted
Sorting Network

Network of comparators designed for sorting
Comparator: two inputs x and y; two outputs x' and y'
types
increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)
decreasing (denoted ⊖): x' = max(x,y) and y' = min(x,y)
Sorting network speed is proportional to its depth
Sorting Networks

Network structure: a series of columns
Each column consists of a vector of comparators (in parallel)
Sorting network organization: [figure]
Example: Bitonic Sorting Network
Bitonic sequence
two parts: increasing and decreasing
1,2,4,7,6,0: first increases and then decreases (or vice versa)
cyclic rotation of a bitonic sequence is also considered bitonic

8,9,2,1,0,4: cyclic rotation of 0,4,8,9,2,1
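One way to test the definition, including cyclic rotations, is to count direction changes around the sequence treated as a cycle: a sequence is bitonic exactly when its direction changes at most twice. The Python sketch below uses that observation; the function name `is_bitonic` is illustrative and the approach is one choice among several.

```python
from typing import List, Sequence


def is_bitonic(seq: Sequence[int]) -> bool:
    """True if seq, or some cyclic rotation of it, rises and then falls."""
    n = len(seq)
    signs: List[int] = []
    for i in range(n):
        # Difference to the cyclic successor; equal neighbors carry no direction.
        d = seq[(i + 1) % n] - seq[i]
        if d != 0:
            signs.append(1 if d > 0 else -1)
    # signs[i - 1] wraps to the last entry when i == 0, so the count is cyclic.
    changes = sum(1 for i in range(len(signs)) if signs[i] != signs[i - 1])
    return changes <= 2
```

Both examples above pass this test, while a sequence such as 1,3,2,4 does not.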
Bitonic sorting network

sorts n elements in Θ(log² n) time
network kernel: rearranges a bitonic sequence into a sorted one
Bitonic Split
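Let $s = \langle a_0, a_1, \ldots, a_{n-1} \rangle$ be a bitonic sequence such that

$a_0 \le a_1 \le \cdots \le a_{n/2-1}$, and $a_{n/2} \ge a_{n/2+1} \ge \cdots \ge a_{n-1}$
Consider the following subsequences of s
$s_1 = \langle \min(a_0, a_{n/2}), \min(a_1, a_{n/2+1}), \ldots, \min(a_{n/2-1}, a_{n-1}) \rangle$
$s_2 = \langle \max(a_0, a_{n/2}), \max(a_1, a_{n/2+1}), \ldots, \max(a_{n/2-1}, a_{n-1}) \rangle$
Sequence properties
$s_1$ and $s_2$ are both bitonic
$\forall x\, \forall y:\ x \in s_1,\ y \in s_2 \Rightarrow x < y$
Apply recursively on s1 and s2 to produce a sorted sequence
Works for any bitonic sequence, even if $|s_1| \ne |s_2|$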
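A single bitonic split is short enough to write out directly. The sketch below is a minimal Python version, assuming an even-length input held in a list; the name `bitonic_split` is illustrative.

```python
from typing import List, Sequence, Tuple


def bitonic_split(s: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split a bitonic sequence of even length into (s1, s2)."""
    half = len(s) // 2
    # Element i is compared with element i + n/2.
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2
```

With the input 3,5,8,9,7,4,2,1 the split produces s1 = 3,4,2,1 and s2 = 7,5,8,9. Both halves are again bitonic, and no element of s1 exceeds any element of s2.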
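Splitting Bitonic Sequences - I

[Figure: sequence properties of a bitonic split, showing the min and max outputs]
Splitting Bitonic Sequences - II

[Figure: sequence properties of a bitonic split, showing the min and max outputs]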
Bitonic Merge
Sort a bitonic sequence through a series of bitonic splits
Example: use bitonic merge to sort 16-element bitonic sequence
How: perform a series of log2 16 = 4 bitonic splits
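The repeated splitting can be sketched recursively in Python. This is a minimal illustration, assuming the input is bitonic and its length is a power of two; the `ascending` flag and the sample 16-element sequence in the usage note are assumptions added here.

```python
from typing import List, Sequence


def bitonic_merge(s: Sequence[int], ascending: bool = True) -> List[int]:
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(s)
    if n <= 1:
        return list(s)
    half = n // 2
    # One bitonic split: all of lo is no larger than all of hi.
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    if ascending:
        return bitonic_merge(lo, True) + bitonic_merge(hi, True)
    return bitonic_merge(hi, False) + bitonic_merge(lo, False)
```

Calling `bitonic_merge([3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0])` recurses through four levels of splits, matching log2 16 = 4.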
Sorting via Bitonic Merging Network

Sorting network can implement bitonic merge algorithm
bitonic merging network
Network structure
log2 n columns
each column
has n/2 comparators
performs one step of the bitonic merge
Bitonic merging network with n inputs: ⊕BM[n]

yields increasing output sequence
Replacing ⊕ comparators by ⊖ comparators: ⊖BM[n]

yields decreasing output sequence
Bitonic Merging Network, BM[16]
Input: bitonic sequence

input wires are numbered 0, 1, ..., n - 1 (shown in binary)
Output: sequence in sorted order
Each column of comparators is drawn separately
Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
Two steps
Build a bitonic sequence
Sort it using a bitonic merging network
Building a Bitonic Sequence
Build a single bitonic sequence from the given sequence

any sequence of length 2 is a bitonic sequence
build bitonic sequence of length 4
sort first two elements using ⊕BM[2]
sort next two using ⊖BM[2]
Repeatedly merge to generate larger bitonic sequences
⊕BM[k] & ⊖BM[k]: bitonic merging networks of size k
Building a Bitonic Sequence
Input: sequence of 16 unordered numbers
Output: a bitonic sequence of 16 numbers
Bitonic Sort, n = 16
First 3 stages create bitonic sequence input to stage 4
Last stage (BM[16]) yields sorted sequence
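The two steps (build a bitonic sequence, then merge it) can be sketched recursively in Python. This is a minimal illustration, assuming the input length is a power of two; sorting the first half upward and the second half downward plays the role of the increasing and decreasing merging networks. The function names are illustrative.

```python
from typing import List, Sequence


def bitonic_merge(s: Sequence[int], ascending: bool) -> List[int]:
    """Sort a bitonic sequence (power-of-two length) via repeated splits."""
    n = len(s)
    if n <= 1:
        return list(s)
    half = n // 2
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    if ascending:
        return bitonic_merge(lo, True) + bitonic_merge(hi, True)
    return bitonic_merge(hi, False) + bitonic_merge(lo, False)


def bitonic_sort(s: Sequence[int], ascending: bool = True) -> List[int]:
    """Sort any sequence whose length is a power of two."""
    n = len(s)
    if n <= 1:
        return list(s)
    half = n // 2
    # Step 1: an increasing run followed by a decreasing run is bitonic.
    first = bitonic_sort(s[:half], True)
    second = bitonic_sort(s[half:], False)
    # Step 2: one bitonic merge sorts the combined sequence.
    return bitonic_merge(first + second, ascending)
```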
Complexity of Bitonic Sorting Networks
Depth of the network is Θ(log² n)

log2 n merge stages
depth of the jth merge stage is log2 2^j = j
$$\text{depth} = \sum_{j=1}^{\log_2 n} \log_2 2^j = \sum_{j=1}^{\log_2 n} j = \frac{(\log_2 n + 1)(\log_2 n)}{2} = \Theta(\log^2 n)$$
Each stage of the network contains n/2 comparators
Complexity of serial implementation = Θ(n log² n)
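The depth formula can be checked by simulating the network column by column and counting columns. The Python sketch below uses the common iterative formulation in which wire i is paired with wire i XOR j; this is an assumed formulation chosen for illustration, and it requires n to be a power of two.

```python
from typing import List, Sequence, Tuple


def bitonic_sort_network(values: Sequence[int]) -> Tuple[List[int], int]:
    """Return (sorted list, number of comparator columns executed)."""
    a = list(values)
    n = len(a)
    depth = 0
    k = 2
    while k <= n:            # merge stage that produces sorted runs of length k
        j = k // 2
        while j >= 1:        # one column of n/2 comparators
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Runs alternate direction so neighbors form bitonic sequences.
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            depth += 1
            j //= 2
        k *= 2
    return a, depth
```

For n = 16 the count is 1 + 2 + 3 + 4 = 10 columns, which agrees with (log2 n + 1)(log2 n)/2 = 5 · 4 / 2.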
Mapping Bitonic Sort to a Hypercube

Consider one item per processor
How do we map wires in bitonic network onto a hypercube?
In earlier examples

compare-exchange occurs only between two wires whose labels differ in exactly 1 bit
Direct mapping of wires to processors

all communication is nearest neighbor
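Because partner wires differ in exactly one bit, a partner label is obtained by flipping that bit, and on a hypercube that is a direct neighbor. The Python sketch below illustrates this under the assumption that dimensions are numbered from 0 at the least significant bit; `partner` and `hypercube_schedule` are hypothetical helper names.

```python
from typing import List


def partner(node: int, dim: int) -> int:
    """Neighbor of `node` across hypercube dimension `dim` (bit `dim` flipped)."""
    return node ^ (1 << dim)


def hypercube_schedule(d: int) -> List[List[int]]:
    """Dimensions used by each merge stage of bitonic sort on a d-dimensional cube.

    Stage s (1-based) performs s compare-exchange steps, starting at
    dimension s - 1 and working down to dimension 0.
    """
    return [list(range(stage - 1, -1, -1)) for stage in range(1, d + 1)]
```

For 16 processes (d = 4) the schedule is `[[0], [1, 0], [2, 1, 0], [3, 2, 1, 0]]`, ten communication steps in total, each with a nearest neighbor.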
Mapping Bitonic Merge to a Hypercube
Communication during the last merge stage of bitonic sort
Each number is mapped to a hypercube node
Each connection represents a compare-exchange
Mapping Bitonic Sort to Hypercubes
Communication in bitonic sort on a hypercube
Processes communicate along dims shown in each stage

Algorithm is cost optimal w.r.t. its serial counterpart
Not cost optimal w.r.t. the best sorting algorithm
Batcher's Bitonic Sort in NESL

function merge(a) =
  if (#a == 1) then a
  else
    let halves = bottop(a);
        mins = {min(x, y) : x in halves[0]; y in halves[1]};
        maxs = {max(x, y) : x in halves[0]; y in halves[1]};
    in flatten({merge(x) : x in [mins,maxs]});

function bitonic_sort(a) =
  if (#a == 1) then a
  else
    let b = {bitonic_sort(x) : x in bottop(a)};
    in merge(b[0]++reverse(b[1]));

bitonic_sort([2, 3, -7, 6, 5, 22, -8, 12]);

Try it at: http://www.cs.rice.edu/~johnmc/nesl.html
Bubble Sort and Variants

Sequential bubble sort algorithm
Compares and exchanges adjacent elements in the sequence
Bubble Sort and Variants

Bubble sort complexity: Θ(n²)
Difficult to parallelize
algorithm has no concurrency
A simple variant uncovers concurrency
Sequential Odd-Even Transposition Sort
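A minimal Python sketch of the sequential algorithm follows. It assumes 0-based indexing and that the first phase compares the pairs starting at index 0; the phase naming and the function name are choices made for this illustration.

```python
from typing import List, Sequence


def odd_even_transposition_sort(values: Sequence[int]) -> List[int]:
    """Sort by n alternating phases of adjacent compare-exchanges."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Phases alternate between pairs (0,1),(2,3),... and (1,2),(3,4),...
        start = phase % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

The pairs compared within one phase are disjoint, so the inner loop carries no dependence between iterations. That independence is the concurrency bubble sort lacks.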
Odd-Even Transposition Sort, n = 8
In each phase, n = 8 elements are compared

Odd-Even Transposition Sort

After n phases of odd-even exchanges, sequence is sorted
Each phase of algorithm requires Θ(n) comparisons
Serial complexity is Θ(n²)
Parallel Odd-Even Transposition

Consider one item per processor
n iterations
in each iteration, each processor does one compare-exchange
Parallel run time of this formulation is Θ(n)
Cost optimal with respect to the base serial algorithm
but not the optimal one!
Parallel Odd-Even Transposition Sort
note: if partner id < 1 or > n, then skip compare
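The per-process view, including the boundary rule in the note above, can be sketched as a sequential Python simulation. Assumptions made for this illustration: process ids run from 1 to n, each process holds one element, a snapshot dictionary stands in for the simultaneous exchange of messages, and the helper names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def partner_of(pid: int, phase: int, n: int) -> Optional[int]:
    """Partner of process `pid` (1-based) in `phase` (1-based), or None to skip."""
    if phase % 2 == 1:   # odd phase: odd ids pair with the next higher id
        partner = pid + 1 if pid % 2 == 1 else pid - 1
    else:                # even phase: even ids pair with the next higher id
        partner = pid + 1 if pid % 2 == 0 else pid - 1
    return partner if 1 <= partner <= n else None


def parallel_odd_even_sort(values: Sequence[int]) -> List[int]:
    n = len(values)
    local: Dict[int, int] = {pid: v for pid, v in enumerate(values, start=1)}
    for phase in range(1, n + 1):
        snapshot = dict(local)   # every process exchanges at the same time
        for pid in range(1, n + 1):
            partner = partner_of(pid, phase, n)
            if partner is None:
                continue         # no partner on this side: skip the compare
            pair = (snapshot[pid], snapshot[partner])
            # The lower-numbered process keeps the min, the higher keeps the max.
            local[pid] = min(pair) if pid < partner else max(pair)
    return [local[pid] for pid in range(1, n + 1)]
```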

Quicksort

Popular sequential sorting algorithm
simplicity, low overhead, optimal average complexity
Operation
select an entry in the sequence to be the pivot
divide the sequence into two parts
one with all elements less than the pivot
the other with all elements greater
Apply the process recursively to each sublist
Parallelizing Quicksort

First, recursive decomposition
partition the list serially
handle each subproblem on a different processor
Time for this algorithm is lower-bounded by Ω(n)!
Can we parallelize the partitioning step?
can we use n processors to partition a list of length n around a pivot in O(1) time?
Tricky on real machines
Practical Parallel Quicksort

Each processor initially responsible for n/p elements
Shared memory formulation

select first pivot & broadcast
each processor partitions own data
globally rearrange data into smaller and larger parts (in place)
recurse with proportional # processors on each part
Message passing formulation

partitioning
each processor first partitions local portion of array
determine which processes will be responsible for each partition (based on size of smaller than pivot and larger than pivot groups)
divide up the data among the processor subsets responsible for each part
continue recursively
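The overall shape of both formulations (partition locally, assign processors in proportion to part sizes, redistribute, recurse) can be sketched as a sequential Python simulation. Several details here are assumptions made only for illustration: the pivot is the first element, processor counts are rounded and clamped so each side keeps at least one processor, data is redistributed in even contiguous chunks, and a single processor sorts its block with the built-in sort.

```python
from typing import List, Sequence


def redistribute(items: Sequence[int], p: int) -> List[List[int]]:
    """Spread items over p blocks in contiguous chunks (assumed policy)."""
    size = -(-len(items) // p) if items else 0   # ceiling division
    return [list(items[i * size:(i + 1) * size]) for i in range(p)]


def parallel_quicksort(blocks: Sequence[Sequence[int]]) -> List[List[int]]:
    """blocks[i] models processor i's data; returns p blocks in sorted order."""
    p = len(blocks)
    data = [x for b in blocks for x in b]
    if p == 1 or len(data) <= 1:
        return [sorted(data)] + [[] for _ in range(p - 1)]
    pivot = data[0]   # assumed pivot choice; it would be broadcast to all
    # Each processor partitions its own block; the parts are gathered here.
    smaller = [x for b in blocks for x in b if x <= pivot]
    larger = [x for b in blocks for x in b if x > pivot]
    # Processors are assigned in proportion to part sizes, at least one per side.
    p_small = max(1, min(p - 1, round(p * len(smaller) / len(data))))
    left = parallel_quicksort(redistribute(smaller, p_small))
    right = parallel_quicksort(redistribute(larger, p - p_small))
    return left + right
```

The result is a parallel sorted sequence in the sense defined earlier, although a poor pivot can leave some processors with little or no data.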
Data Parallel Quicksort in NESL

[NESL quicksort code listing, lines 1-8]
Total work is O(n log n)
Recursion depth is O(log n)
Depth of each operation is constant
Total depth is O(log n) as well
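As a rough Python analog of a data-parallel quicksort in the NESL style, the sketch below uses list comprehensions where NESL would use parallel apply-to-each. It is patterned on the well-known formulation in Blelloch's "Programming Parallel Algorithms" and is not a transcription of the listing on this slide. Python evaluates the comprehensions sequentially, so the sketch shows the structure only, and the stated bounds hold in expectation over pivot choices.

```python
from typing import List, Sequence


def quicksort(a: Sequence[int]) -> List[int]:
    if len(a) < 2:
        return list(a)
    pivot = a[len(a) // 2]
    # Three independent filters; each is a constant-depth data-parallel step.
    lesser = [e for e in a if e < pivot]
    equal = [e for e in a if e == pivot]
    greater = [e for e in a if e > pivot]
    # The two recursive calls are independent and could run in parallel.
    result = [quicksort(v) for v in (lesser, greater)]
    return result[0] + equal + result[1]
```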
Other Sorting Algorithms
Shellsort - another variant of bubble sort

two stage process
log p rounds of long distance exchanges
followed by rounds of odd-even transposition sort until done
key idea: long distance moves of first stage reduce number of rounds necessary in second stage
Radix sort: in a series of rounds, sort elements into buckets by digit
Bucket and sample sort
assumes evenly distributed items in an interval
buckets represent evenly-sized subintervals
Enumeration sort:
determine rank of each element
place it in the correct position
CRCW PRAM algorithm: n² processors, sort in Θ(1) time
assumes that all concurrent writes to a location deposit their sum
n processes in column j test element j against the rest; each writes 1 into C[j] if its element precedes A[j]
place A[j] into A[C[j]]
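The rank computation behind enumeration sort can be sketched sequentially in Python. Each (i, j) iteration of the double loop stands in for one of the n² PRAM processors, and incrementing `rank[j]` stands in for the summing concurrent write into C[j]. Breaking ties by index and writing into a separate output list are assumptions added so that duplicate keys and in-place overwrites cause no trouble.

```python
from typing import List, Sequence


def enumeration_sort(a: Sequence[int]) -> List[int]:
    n = len(a)
    rank = [0] * n   # plays the role of C[j]
    for j in range(n):
        for i in range(n):
            # "Processor" (i, j) contributes 1 when a[i] must precede a[j].
            if a[i] < a[j] or (a[i] == a[j] and i < j):
                rank[j] += 1
    out: List[int] = [0] * n
    for j in range(n):
        out[rank[j]] = a[j]   # place each element at its rank
    return out
```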
Other Sorting Algorithms (Cont)
Histogram Sorting
goal: divide keys into p evenly sized pieces
use an iterative approach to do so
initiating processor broadcasts k > p-1 splitter guesses
each processor determines how many keys fall in each bin
sum histogram with global reduction
one processor examines guesses to see which are satisfactory
broadcast finalized splitters and number of keys for each processor
each processor sends local data to appropriate processors using all-to-all communication
each processor merges chunks it receives
Kale and Solomonik improved this (IPDPS 2010)
References

Adapted from slides "Sorting" by Ananth Grama
Based on Chapter 9 of "Introduction to Parallel Computing" by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003
"Programming Parallel Algorithms". Guy Blelloch. Communications of the ACM, volume 39, number 3, March 1996.
http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#sort
Edgar Solomonik and Laxmikant V. Kale. "Highly Scalable Parallel Sorting". Proceedings of IPDPS 2010.
