
CS 102
File Structures & File Organizations

Chapter 05
External Sorting Algorithms
CJD

Sorting

• Sorting – arranging the items in a list in ascending or descending order by a key value.
• Applicable to all file organizations, not just sequential
• Why sort ?
  to make a report,
  to merge files in queries,
  to merge files in master file maintenance,
  to make searches easier,
  to prioritize,
  etc.

Internal vs External Sorts

• Internal Sort – sorting items entirely in main memory
  ICS 2, ICS 3, CS 101
• External Sort – sorting files in secondary storage using main memory
  CS 102
Why external sort ?
  Some files may be too large to fit in main memory

Some Terminologies

• A Pass – an iteration that goes through the items (or records) of a list (or file) once, including reading it from file, processing it in main memory and writing it to file.
• A Run – a grouping of some items of a list. Usually a run starts as a block of records but eventually increases in size.
• Size of a Run – the number of items in a run. Usually no less than the blocking factor.
• A Merge – combining lists into one

The Algorithms

External Sort Algorithms
• 2-way Sort Merge
• Balanced 2-way Sort Merge
• Balanced k-way Sort Merge
• Polyphase Sort Merge

2-Way Sort Merge

• Overview :
  A simple 2-way Sort Merge repeatedly merges 2 smaller sorted components of a file into a bigger sorted component of the file.
• The algorithm
  Phase 1 : The Sort Phase
  Phase 2 : The Merge Phase
The Sort Phase

• Phase 1 : The Sort Phase
  Divide the records of a file into several runs,
  internal sort the records in a run, and
  distribute the runs “evenly” to two external files file_1 and file_2

The Merge Phase

• Phase 2 : The Merge Phase
  For each pair of runs, one from file_1 and another from file_2,
    merge the pair resulting in a longer run.
    Store the new resulting run in a third external file file_3
  Redistribute the runs in file_3 evenly to file_1 and file_2
  Repeat Phase 2 until all records are in one long run.
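The two phases of the simple 2-way Sort Merge can be sketched in Python as follows. This is only an illustration under stated assumptions: in-memory lists stand in for the external files file_1, file_2 and file_3, runs are dealt alternately as the "even" distribution, and the function names (make_runs, distribute, two_way_sort_merge) are hypothetical, not part of the course material.

```python
from heapq import merge  # standard-library merge of already-sorted iterables


def make_runs(records, run_size):
    """Sort phase: cut the file into runs and internally sort each run."""
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]


def distribute(runs):
    """Deal the runs alternately to file_1 and file_2 (assumed 'even' split)."""
    return runs[0::2], runs[1::2]


def two_way_sort_merge(records, run_size):
    """Return (sorted records, number of passes). Lists stand in for files."""
    file_1, file_2 = distribute(make_runs(records, run_size))
    passes = 1                                  # the sort phase is one pass
    while True:
        # Merge step: pair the i-th run of file_1 with the i-th run of file_2.
        file_3 = []
        for i, run in enumerate(file_1):
            partner = file_2[i] if i < len(file_2) else []  # trivial merge
            file_3.append(list(merge(run, partner)))
        passes += 1
        if len(file_3) <= 1:
            return (file_3[0] if file_3 else []), passes
        # Redistribution step: costs one more pass over the data.
        file_1, file_2 = distribute(file_3)
        passes += 1
```

Called with the 15-record file used in the simulation slides and a run size of 3, the sketch goes through one sort pass and five merge-phase passes, the same count the simulation arrives at.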

Tips for Efficiency

• As much of the sorting as possible must be done in main memory using an internal sort, because file accesses are slower than main memory accesses.
• The size of a run must be as large as the available space in main memory, limited by other data that must also be in main memory.
• Each file must be on a separate device (such as tapes or disks) to allow easy access during the merge phase.
• The original file and file_3 may be assigned to the same device. The output will be in file_3.

Algorithm Simulation (1)

File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
File Size = 15 records
Size of Run = 3 initially (Usually a large number ≥ blocking factor)
Number of Runs = 5 initially

Sort Phase :
Pass 1:
  Group records into size of run
  50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80
  3 records are in main memory at a time
  Do an internal sort and distribute
  File 1: 50 95 110 – 40 120 153 – 22 80 140
  File 2: 10 36 100 – 60 70 130
  File 3: empty

Algorithm Simulation (2)

File 1: 50 95 110 – 40 120 153 – 22 80 140
File 2: 10 36 100 – 60 70 130
File 3: empty
Merge Phase :
Pass 2:
  Merge : 2 sets of 3 records are in main memory at a time
  File 1: empty
  File 2: empty
  File 3: 10 36 50 95 100 110 – 40 60 70 120 130 153 – 22 80 140
Pass 3:
  Distribute: 3 records are in main memory at a time
  File 1: 10 36 50 95 100 110 – 22 80 140
  File 2: 40 60 70 120 130 153
  File 3: empty

Algorithm Simulation (3)

File 1: 10 36 50 95 100 110 – 22 80 140
File 2: 40 60 70 120 130 153
File 3: empty
Pass 4:
  Merge : 2 sets of 3 records are in main memory at a time
  File 1: empty
  File 2: empty
  File 3: 10 36 40 50 60 70 95 100 110 120 130 153 – 22 80 140
Pass 5:
  Distribute: 3 records are in main memory at a time
  File 1: 10 36 40 50 60 70 95 100 110 120 130 153
  File 2: 22 80 140
  File 3: empty
Algorithm Simulation (4)

File 1: 10 36 40 50 60 70 95 100 110 120 130 153
File 2: 22 80 140
File 3: empty
Pass 6:
  2 sets of 3 records are in main memory at a time
  File 1: empty
  File 2: empty
  File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153
Number of passes = 1 Sort Pass + 5 Merge Passes
Total Number of passes = 6

Algorithm Analysis (1)

• Amount of space occupied
  Requires 3 files
• Ease in the implementation of the algorithm
  Straightforward
  Awkward in having separate merge and redistribution steps in merge phase
  The redistribution increases the number of passes

Algorithm Analysis (2)

• Speed of the algorithm
  Let NR be the number of runs initially.
  If NR = 1,
    Total Passes = 2 (one sort phase and one trivial merge phase)
  Suppose NR > 1.
    Total Passes = ⌈log2 NR⌉ * 2

Algorithm Analysis (3)

  Assume NR is a power of 2. That is NR = 2^j
  Each iteration is composed of a distribution step and a merge step.
  It divides the number of runs by 2
  Until there is only 1 run.
  The number of divisions to go from 2^j to 1 is j.
  So the number of iterations is j = log2 NR
  If NR is not a power of 2, the number of iterations is ⌈log2 NR⌉

Algorithm Analysis (4)

[Figure: merge tree for 8 initial runs (1 to 8), combined level by level in Merge 1, Merge 2 and Merge 3]

  Each iteration requires 2 passes : to distribute and to merge
  So Total Passes = ⌈log2 NR⌉ * 2
  When NR=5, algorithm requires 6 passes

Balanced 2-Way Sort Merge

• Overview :
  The Balanced 2-Way Sort Merge improves the 2-Way Sort Merge.
  It combines the merge and distribution steps.
  It requires 4 files.
• The algorithm
  Phase 1 : The Sort Phase
  Phase 2 : The Merge Phase
The Sort Phase

• Phase 1 : The Sort Phase
  Divide the records of a file into several runs,
  internal sort the records in a run, and
  distribute the runs “evenly” to two external files file_1 and file_2

The Merge Phase

• Phase 2 : The Merge Phase
  For each pair of runs, one from file_1 and another from file_2,
    merge the pairs resulting in longer runs.
    alternately store the resulting runs in external files file_3 and file_4
  Repeat Phase 2 until all records are in one long run. Alternate the roles of file_1 and file_2 with file_3 and file_4 depending on which files need to be merged and which would hold the redistributed resultant longer runs.

End of Algorithm

• Copy or assign the file that contains the one long run to the desired output file.
• It is called “balanced” because in each iteration, the number of input files used is equal to the number of output files used.

Algorithm Simulation (1)

File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
File Size = 15 records
Size of Run = 3 initially
Number of Runs = 5 initially

Sort Phase :
Pass 1
  50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80
  3 records are in main memory at a time
  do an internal sort and distribute
  File 1: 50 95 110 – 40 120 153 – 22 80 140
  File 2: 10 36 100 – 60 70 130
  File 3: empty
  File 4: empty

Algorithm Simulation (2)

Merge Phase :
  2 sets of 3 records are in main memory at a time
  merge and distribute
Pass 2
  File 1, File 2: empty
  File 3: 10 36 50 95 100 110 – 22 80 140
  File 4: 40 60 70 120 130 153
Pass 3
  File 1: 10 36 40 50 60 70 95 100 110 120 130 153
  File 2: 22 80 140
  File 3, File 4: empty
Pass 4
  File 1, File 2, File 4: empty
  File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153

Algorithm Analysis (1)

• Amount of space occupied
  Requires 4 files (instead of 3)
• Ease in the implementation of the algorithm
  Straightforward implementation
  Requires an additional file
  Easy to combine merge and redistribution steps in merge phase
  Reduces the number of passes
Algorithm Analysis (2)

• Speed of the algorithm
  Let NR be the number of runs initially
  If NR = 1,
    Total Passes = 1 (Sort Phase only)
  Suppose NR > 1.
    Total Passes = ⌈log2 NR⌉ + 1

Algorithm Analysis (3)

  Assume NR is a power of 2. That is NR = 2^j
  The sort phase takes 1 pass: sorts each run, but does not reduce the number of runs.
  Each execution of merge phase is composed of a merge step and a distribution step.
  It divides the number of runs by 2
  Until there is only 1 run.
  The number of divisions to go from 2^j to 1 is j.
  So the number of merges is j = log2 NR
  And the total number of passes is j + 1 = (log2 NR) + 1, including the one for sort phase
  If NR is not a power of 2, the number of passes is ⌈log2 NR⌉ + 1
  When NR=5, requires 4 passes instead of 6

Balanced k-Way Sort Merge

• Overview :
  The Balanced k-Way Sort Merge is an improvement of Balanced 2-way sort merge.
  Instead of merging 2 files at a time, merge k files at a time, k ≥ 2
  Requires 2k files
  Each pass leaves fewer runs than the corresponding pass of balanced 2-way sort merge.

The Sort Phase

• Phase 1 : The Sort Phase
  Divide the records of a file into several runs,
  internal sort the records in a run, and
  distribute the runs alternately to k external files

The Merge Phase

• Phase 2 : The Merge Phase
  • For each k-tuple of runs, one from each file with the distribution of runs,
      merge the k-tuples resulting in longer runs.
      alternately distribute the resulting runs in the other k external files
  • Repeat Phase 2 until all records are in one long run. Alternate the roles of the first k files and the second k files depending on which files need to be merged and which would hold the redistributed resultant longer runs.
  • Copy or assign the file that contains the one long run to the desired output file.

Algorithm Simulation (1)

File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
File Size = 15 records; Size of Run = 3 initially
Number of Runs = 5 initially
k=3

Sort Phase :
Pass 1
  50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80
  3 records are in main memory at a time
  do an internal sort and distribute
  File 1: 50 95 110 – 60 70 130
  File 2: 10 36 100 – 22 80 140
  File 3: 40 120 153
  File 4, File 5, File 6: empty
Algorithm Simulation (2)

File 1: 50 95 110 – 60 70 130
File 2: 10 36 100 – 22 80 140
File 3: 40 120 153
File 4, File 5, File 6: empty
Merge Phase :
  3 sets of 3 records are in main memory at a time
Pass 2
  File 1, File 2, File 3: empty
  File 4: 10 36 40 50 95 100 110 120 153
  File 5: 22 60 70 80 130 140
  File 6: empty
Pass 3
  File 1: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153
  File 2, File 3, File 4, File 5, File 6: empty

Algorithm Analysis (1)

• Amount of space occupied
  k input files (File_1, File_2, …, File_k)
  k output files (File_k+1, File_k+2, …, File_2k)
  2k = total no. of files needed
• Ease in the implementation of the algorithm
  More complicated than a (balanced) 2-way sort merge
  Requires more files
  Merging becomes more complicated
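The balanced k-way procedure simulated above can be sketched as follows. It is a minimal sketch, assuming in-memory lists in place of the 2k external files and a round-robin ("alternate") distribution of runs; the function name balanced_k_way_sort_merge is hypothetical. With k = 2 it behaves like the Balanced 2-Way Sort Merge.

```python
from heapq import merge  # merges any number of sorted iterables


def balanced_k_way_sort_merge(records, run_size, k):
    """Return (sorted records, number of passes) using k input + k output 'files'."""
    # Sort phase: build sorted runs and deal them alternately to k input files.
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    inputs = [runs[f::k] for f in range(k)]
    passes = 1
    # Merge phase: each pass merges k-tuples of runs and distributes the
    # results alternately over the other k files; then the roles swap.
    while sum(len(f) for f in inputs) > 1:
        outputs = [[] for _ in range(k)]
        for t in range(len(inputs[0])):
            group = [f[t] for f in inputs if t < len(f)]  # may hold < k runs
            outputs[t % k].append(list(merge(*group)))
        inputs = outputs
        passes += 1
    return (inputs[0][0] if inputs[0] else []), passes
```

On the 15-record example with a run size of 3, this sketch takes 3 passes for k = 3 and 4 passes for k = 2, in line with the simulations.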

Algorithm Analysis (2)

• Speed of the algorithm
  Let NR be the number of runs initially
  If NR = 1,
    Total Passes = 1 (Sort Phase only)
  Suppose NR > 1.
    Total Passes = ⌈logk NR⌉ + 1

Algorithm Analysis (3)

  Assume NR is a power of k. That is NR = k^j
  The sort phase takes 1 pass: sorts each run, but does not reduce the number of runs.
  Each execution of merge phase is composed of a merge step and a distribution step.
  It divides the number of runs by k
  Until there is only 1 run.
  The number of divisions to go from k^j to 1 is j.
  So the number of merges is j = logk NR

Algorithm Analysis (4)

[Figure: merge tree for 9 initial runs (1 to 9), combined in Merge 1 and Merge 2]

  And the total number of passes is j + 1 = (logk NR) + 1, including the one for sort phase
  If NR is not a power of k, the number of passes is ⌈logk NR⌉ + 1
  When NR=5 and k=3, requires 3 passes instead of 4 (Balanced 2-way)

Exercise 1

Fill in the following table with NR = 100

NR = 100              2-Way        Balanced   Balanced   Balanced
                      Sort Merge   2-Way      3-Way      4-Way
No. of Files Used     3            4          6          8
Total No. of Passes   14           8          6          5

Question: What conclusion/s can you draw based on the above table?
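The pass-count formulas from the analysis slides can be checked with a few lines of Python. This is a sketch only; the helper names (ceil_log, passes_two_way, passes_balanced) are made up for illustration, and the ceiling of the logarithm is computed with integers to avoid floating-point rounding.

```python
def ceil_log(n, base):
    """Smallest j such that base**j >= n (integer arithmetic only)."""
    j, power = 0, 1
    while power < n:
        power *= base
        j += 1
    return j


def passes_two_way(nr):
    """Simple 2-way sort merge: 2 passes if NR = 1, else 2 * ceil(log2 NR)."""
    return 2 if nr == 1 else 2 * ceil_log(nr, 2)


def passes_balanced(nr, k):
    """Balanced k-way sort merge: 1 sort pass + ceil(logk NR) merge passes."""
    return 1 + ceil_log(nr, k)


if __name__ == "__main__":
    nr = 100
    print("2-way:", passes_two_way(nr))
    for k in (2, 3, 4):
        print(f"balanced {k}-way:", passes_balanced(nr, k))
```

For NR = 100 the script reports 14, 8, 6 and 5 passes, and for NR = 5 it gives 6, 4 and 3 passes for the 2-way, balanced 2-way and balanced 3-way cases.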
More Exercises

Exercise 2: Using Balanced 3-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 3. Determine the total number of passes.
File : 28 17 79 38 5 70 24 91 37 3 19 63 15 44 8

Exercise 3: Using Balanced 3-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 4. Determine the total number of passes.
File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80

Exercise 4: Using Balanced 4-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 3. Determine the total number of passes.
File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80

Challenges

• What if each sorted run from the sort phase is distributed to a separate file and all such files are merged into one output file?
  • What are the implications ? What factors make this approach possible? impossible?
  • There are main memory and number of file devices limitations
• How do you implement a k-way merge efficiently if k > 2 ?
  • If k is large, use priority queues (see CS101 or an advanced CS101 course)
• The realistic sort/merge situation is somewhere between the basic balanced two-way sort merge, and the idealistic balanced k-way sort/merge, which uses k input files for k runs and merges to one output file.
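As one possible answer to the k-way merge challenge, the sketch below uses Python's standard heapq module as the priority queue: the heap holds the current smallest record of each run, so picking the next output record costs O(log k) instead of a scan over all k runs. The function name k_way_merge is hypothetical, and in-memory lists stand in for runs read from files.

```python
import heapq


def k_way_merge(runs):
    """Merge k sorted runs into one sorted list using a min-heap."""
    # Each heap entry is (record, run index, position inside that run).
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, idx, pos = heapq.heappop(heap)   # smallest record overall
        merged.append(value)
        if pos + 1 < len(runs[idx]):            # refill from the same run
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return merged
```

For example, merging the first run of each of File 1, File 2 and File 3 from the k = 3 simulation, k_way_merge([[50, 95, 110], [10, 36, 100], [40, 120, 153]]), gives the first run written to File 4 in Pass 2.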

Polyphase Sort Merge

• Overview :
  Balanced 2-way sort/merge closer analysis :
    Suppose the number of runs in a pass does not divide evenly among the 2 files.
    The last pair of runs to be merged is trivial (i.e. a copy) as one of the files is already empty.
    Possible improvement : reduce this trivial merge
    Another possible improvement : reduce the number of files required
  For a balanced k-way sort/merge, an uneven distribution causes merges of fewer than k runs.

Polyphase Improvements

  In the merge phase, do not distribute the merges into several files – just send them to a (k+1)st file.
  When an input file becomes empty, discontinue the previous merge phase.
  Instead, merge the (k-1) possibly non-empty file(s) with the (k+1)st file into the empty file.
  Perform this repeatedly each time a file becomes empty, until there is only one non-empty file containing one long sorted run.
  The name “Polyphase” is attributed to the many phases of the merging process to sort the records.

The Sort Phase

• Phase 1 : The Sort Phase
  Divide the records of a file into several runs,
  internal sort the records in a run, and
  distribute the runs to k external files using a distribution special to Polyphase Sort Merge.

The Merge Phase

• Phase 2 : The Merge Phase
  while not all records are in one long run
    repeat
      merge the next k-tuple of runs, one from each of the k input files,
      into the output (k+1)st file
    until an input file becomes empty
    reassign the empty file as the next output file and the other k files as input files.
  Copy or assign the file that contains the one long run to the desired output file.
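The merge-phase loop above can be sketched in Python as follows. This is an illustration under assumptions: lists of runs stand in for the k+1 files, exactly one file starts empty, and the name polyphase_merge and the returned record counter are hypothetical additions. The sketch raises an error rather than handling a distribution that strands several runs on a single file.

```python
from heapq import merge


def polyphase_merge(files):
    """files: k+1 lists of sorted runs, one of them empty.

    Returns (the single sorted run, number of records written while merging).
    """
    files = [list(f) for f in files]
    written = 0
    while sum(len(f) for f in files) > 1:
        out = next(i for i, f in enumerate(files) if not f)   # empty file
        ins = [i for i, f in enumerate(files) if f]           # input files
        if len(ins) < 2:
            raise ValueError("runs stranded on one file: distribution not perfect")
        # Merge one run from each input file until an input file runs dry.
        while all(files[i] for i in ins):
            new_run = list(merge(*(files[i].pop(0) for i in ins)))
            files[out].append(new_run)
            written += len(new_run)
    result = next((f[0] for f in files if f), [])
    return result, written
```

Starting from File 1 with runs (50 95 110), (40 120 153), (22 80 140), File 2 with runs (10 36 100), (60 70 130) and an empty File 3, the sketch writes 36 records (12 blocks of 3) during merging; adding the 5 blocks of the sort phase gives 17 blocks in total.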
Algorithm Simulation (1)

File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
File Size = 15 records
Size of Run = 3 initially
Number of Runs = 5 initially

Recall : With a balanced 2-way sort merge :
Sort Phase :
Pass 1
  50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80
  File 1: 50 95 110 – 40 120 153 – 22 80 140
  File 2: 10 36 100 – 60 70 130
  File 3, File 4: empty

Algorithm Simulation (2)

Merge Phase :
Pass 2
  File 1, File 2: empty
  File 3: 10 36 50 95 100 110 – 22 80 140 (trivial merge : a copy)
  File 4: 40 60 70 120 130 153
Pass 3
  File 1: 10 36 40 50 60 70 95 100 110 120 130 153
  File 2: 22 80 140 (another trivial merge : a copy)
  File 3, File 4: empty
Pass 4
  File 1, File 2: empty
  File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153
  File 4: empty

Algorithm Simulation (3)

With Polyphase sort merge :
Sort Phase :
  50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80
  (Assume 1 block = 3 records)
  (after 5 blocks passed)
  File 1: 50 95 110 – 40 120 153 – 22 80 140
  File 2: 10 36 100 – 60 70 130
  File 3: empty

Algorithm Simulation (4)

Merge Phase :
  (after 2 run merges, 4 more blocks passed = 9 total blocks passed)
  File 1: 22 80 140
  File 2: empty
  File 3: 10 36 50 95 100 110 – 40 60 70 120 130 153
  (after 1 run merge, 3 more blocks passed = 12 total blocks passed)
  File 1: empty
  File 2: 10 22 36 50 80 95 100 110 140
  File 3: 40 60 70 120 130 153
  (after 1 run merge, 5 more blocks passed = 17 total blocks passed)
  File 1: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153
  File 2: empty
  File 3: empty

Algorithm Simulation Summary

• 17 total blocks / 5 blocks per file pass = 3.4 passes with Polyphase 2-way Sort Merge
• compared to 4 passes with balanced 2-way sort merge

Summary          Number of Runs in …
                 File 1   File 2   File 3
Sort Phase       3        2        0
Merge Phase 1    1        0        2
Merge Phase 2    0        1        1
Merge Phase 3    1        0        0

Another Run Distribution

Consider Polyphase 3-way sort merge with NR = 17 and distribute the initial runs 7-6-4 into files 1 to 3.

Summary          Level   Number of   Number of Runs in …
                         merges      File 1   File 2   File 3   File 4
Sort Phase       4       0           7        6        4        0
Merge Phase 1    3       4           3        2        0        4
Merge Phase 2    2       2           1        0        2        2
Merge Phase 3    1       1           0        1        1        1
Merge Phase 4    0       1           1        0        0        0

• Observe that by distributing the initial runs as 17 = 7 + 6 + 4, at most only one file becomes empty after each merge, except the last.
• This is because 17 has a perfect 3rd order Fibonacci distribution of 7-6-4.
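The run-count tables above can be reproduced by tracking only how many runs each file holds. The sketch below is an illustration; the function name polyphase_run_counts is hypothetical, and it assumes one file is empty at the start of every merge phase, which holds for the perfect distributions shown.

```python
def polyphase_run_counts(counts):
    """Return the list of run counts per file after each merge phase.

    counts: initial number of runs on each file, with one file empty (0).
    """
    counts = list(counts)
    history = [tuple(counts)]
    while sum(counts) > 1:
        out = counts.index(0)                          # the empty output file
        ins = [i for i, c in enumerate(counts) if c > 0]
        if len(ins) < 2:
            raise ValueError("runs stranded on one file: distribution not perfect")
        merges = min(counts[i] for i in ins)           # stop when one input empties
        for i in ins:
            counts[i] -= merges
        counts[out] = merges
        history.append(tuple(counts))
    return history
```

polyphase_run_counts([7, 6, 4, 0]) steps through (3, 2, 0, 4), (1, 0, 2, 2), (0, 1, 1, 1) and ends at (1, 0, 0, 0), matching the rows of the table for NR = 17.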
Fibonacci Sequence

• 2nd order Fibonacci sequence
  $F^{(2)}_0 = 0,\ F^{(2)}_1 = 1,\ F^{(2)}_n = F^{(2)}_{n-1} + F^{(2)}_{n-2}$, $\{F^{(2)}_i\} = \{0, 1, 1, 2, 3, 5, 8, 13, 21, \ldots\}$
• 3rd order Fibonacci sequence
  $F^{(3)}_0 = 0,\ F^{(3)}_1 = 0,\ F^{(3)}_2 = 1,\ F^{(3)}_n = F^{(3)}_{n-1} + F^{(3)}_{n-2} + F^{(3)}_{n-3}$, $\{F^{(3)}_i\} = \{0, 0, 1, 1, 2, 4, 7, 13, 24, 44, \ldots\}$
• kth order Fibonacci sequence
  $F^{(k)}_i = 0\ \forall i < k-1,\ F^{(k)}_{k-1} = 1,\ F^{(k)}_n = F^{(k)}_{n-1} + F^{(k)}_{n-2} + \cdots + F^{(k)}_{n-k}$
• The ith largest file on the nth level (n>0) initially contains the following number of runs :
  $F^{(k)}_{n+k-2} + F^{(k)}_{n+k-3} + \cdots + F^{(k)}_{n+i-2}$

Polyphase Fibonacci Distrib (ex1)

In first example, NR = 5
• For k=2, using 3 files = 2 input files + 1 output file
• The recommended distribution of NR=5 over two input files is to use 2nd order Fibonacci distribution.

Summary          Level   Number of   Number of Runs in …
                 (n)     merges      File 1   File 2   File 3
Sort Phase       3       0           3        2        0
Merge Phase 1    2       2           1        0        2
Merge Phase 2    1       1           0        1        1
Merge Phase 3    0       1           1        0        0

the number of runs on ith largest file on the nth level is :
  $F^{(k)}_{n+k-2} + F^{(k)}_{n+k-3} + \cdots + F^{(k)}_{n+i-2} = F^{(2)}_n + \cdots + F^{(2)}_{n+i-2}$
which means largest is $F^{(2)}_n + F^{(2)}_{n-1}$ and 2nd largest is $F^{(2)}_n$.
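The kth order Fibonacci sequence and the run-count formula for the ith largest file can be evaluated directly. The sketch below is illustrative only; fib_k and runs_on_file are hypothetical helper names.

```python
def fib_k(k, count):
    """First `count` terms of the kth order Fibonacci sequence (count >= k)."""
    seq = [0] * (k - 1) + [1]          # F_i = 0 for i < k-1, F_{k-1} = 1
    while len(seq) < count:
        seq.append(sum(seq[-k:]))      # each term is the sum of the previous k
    return seq[:count]


def runs_on_file(k, n, i):
    """Runs on the ith largest file at level n (n > 0, 1 <= i <= k):
    F_{n+k-2} + F_{n+k-3} + ... + F_{n+i-2}."""
    f = fib_k(k, n + k - 1)            # indices 0 .. n+k-2
    return sum(f[n + i - 2:n + k - 1])
```

For k = 3 and n = 4 the three files get 7, 6 and 4 runs, and for k = 2 and n = 3 the two files get 3 and 2 runs, the same distributions used in the two examples.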

Polyphase Fibonacci Distrib (ex2)

In second example, NR=17 and k=3, the number of runs on the ith largest file on the nth level (n>0) initially contains
  $F^{(k)}_{n+k-2} + F^{(k)}_{n+k-3} + \cdots + F^{(k)}_{n+i-2}$

level   Largest File   2nd Largest File   3rd Largest File   Runs (perfect run
(n)     (i=1)          (i=2)              (i=3)              sizes for k=3)
5       13=7+4+2       11=7+4             7=7                31=13+11+7
4       7=4+2+1        6=4+2              4=4                17=7+6+4
3       4=2+1+1        3=2+1              2=2                9=4+3+2
2       2=1+1+0        2=1+1              1=1                5=2+2+1
1       1=1+0+0        1=1+0              1=1                3=1+1+1
0       1              0                  0                  1

Exercise : build the table for k=4 from level 1 to level 5

Imperfect NR

• If NR is not perfect, add dummy runs to make it perfect.
• This is done during the Sort Phase.
• Where?
  Some say distribute them either at the end or beginning of each file

Perfect Fibonacci Numbers

• Let n = level number
• a_n, b_n, c_n = decreasing order of sizes of non-empty files at level n
• for k=3,
  $c_n = F^{(3)}_{n+1}$
  $b_n = F^{(3)}_{n+1} + F^{(3)}_n$
  $a_n = F^{(3)}_{n+1} + F^{(3)}_n + F^{(3)}_{n-1} = F^{(3)}_{n+2}$
  $t_n = a_n + b_n + c_n$
  $c_{n+1} = F^{(3)}_{n+2} = a_n + 0$
  $b_{n+1} = F^{(3)}_{n+2} + F^{(3)}_{n+1} = a_n + c_n$
  $a_{n+1} = F^{(3)}_{n+2} + F^{(3)}_{n+1} + F^{(3)}_n = a_n + b_n$
  $t_{n+1} = a_{n+1} + b_{n+1} + c_{n+1} = 3a_n + b_n + c_n = (a_n + b_n + c_n) + 2a_n = t_n + 2a_n$

Perfect Polyphase Distributions

level   Largest   2nd Largest   3rd Largest   4th File       Runs (perfect run
(n)     File      File          File          (Empty File)   sizes for k=3)
        (i=1)     (i=2)         (i=3)
n       an        bn            cn            dn             tn
0       1         0             0             0              1
1       1         1             1             0              3
2       2         2             1             0              5
3       4         3             2             0              9
4       7         6             4             0              17
5       13        11            7             0              31
n+1     an+bn     an+cn         an+dn (= an)  0              tn+2an
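The recurrences for k = 3 generate the table of perfect distributions row by row. A minimal sketch, with the hypothetical function name perfect_distributions, starting from level 0 where a single run sits on one file:

```python
def perfect_distributions(levels):
    """Rows (n, a_n, b_n, c_n, t_n) of perfect 3-way polyphase distributions."""
    a, b, c = 1, 0, 0                       # level 0: one run on one file
    rows = [(0, a, b, c, a + b + c)]
    for n in range(1, levels + 1):
        # a_{n+1} = a_n + b_n, b_{n+1} = a_n + c_n, c_{n+1} = a_n
        a, b, c = a + b, a + c, a
        rows.append((n, a, b, c, a + b + c))
    return rows


if __name__ == "__main__":
    for row in perfect_distributions(5):
        print(*row)
```

Each printed total also satisfies t_{n+1} = t_n + 2a_n, which gives a quick consistency check on the table.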
Algorithm Analysis (1)

• Amount of space occupied
  • k + 1 files for polyphase k-way sort merge
  • vs 2k files for balanced k-way sort merge
• Ease in the implementation of the algorithm
  • Polyphase adds more complication to the balanced k-way algorithm.
  • The sort phase must distribute the runs according to the Fibonacci perfect distribution, adding dummy runs when necessary.
  • Each merge phase iteration may not run its full course due to some files becoming empty, thereby making it a little more difficult to trace the algorithm.

Algorithm Analysis (2)

• Speed of the algorithm
  • According to theoretical computations by Knuth,
  • Note : logk x = ln x / ln k

     Number of Passes                          for NR = 100
k    Polyphase k-way      Balanced k-way       Polyphase k-way   Balanced k-way
2    1.50 ln NR + 0.99    ⌈log2 NR⌉ + 1        7.90              8
3    1.02 ln NR + 0.96    ⌈log3 NR⌉ + 1        5.66              6
4    0.86 ln NR + 0.92    ⌈log4 NR⌉ + 1        4.88              5
5    0.80 ln NR + 0.86    ⌈log5 NR⌉ + 1        4.54              4
8    0.73 ln NR + 0.65    ⌈log8 NR⌉ + 1        4.01              4
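The polyphase column of the table is plain arithmetic on the quoted coefficients, which a short script can reproduce. The dictionary below simply restates the coefficients from the table; the names POLYPHASE_COEFFS and polyphase_passes are hypothetical.

```python
import math

# (coefficient of ln NR, constant term) for each k, as quoted in the table
POLYPHASE_COEFFS = {
    2: (1.50, 0.99),
    3: (1.02, 0.96),
    4: (0.86, 0.92),
    5: (0.80, 0.86),
    8: (0.73, 0.65),
}


def polyphase_passes(k, nr):
    """Estimated number of passes for polyphase k-way sort merge."""
    slope, constant = POLYPHASE_COEFFS[k]
    return slope * math.log(nr) + constant   # math.log is the natural log


if __name__ == "__main__":
    for k in POLYPHASE_COEFFS:
        print(k, f"{polyphase_passes(k, 100):.2f}")
```

With NR = 100 the script prints 7.90, 5.66, 4.88, 4.54 and 4.01 for k = 2, 3, 4, 5 and 8.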

Algorithm Analysis (3)

• Speed of the algorithm
  • Runs best when NR has a perfect kth order Fibonacci distribution
  • For small k, Polyphase Sort Merge performs better than Balanced k-Way Sort Merge
    Why ? Because trivial copies are minimized.
  • For large k, Balanced k-Way Sort Merge beats Polyphase Sort Merge.
    Why ? Because with more files, the merged runs are distributed to more files.

Summary of Analyses

Comparison :

Algorithm          Space          Time                 # of Passes
                   (# of Files)   (# of Passes)        (NR=100)
2-way              3              2 * ⌈log2 NR⌉        14
Balanced 2-way     4              1 + ⌈log2 NR⌉        8
Balanced k-way     2k             1 + ⌈logk NR⌉        (6 if k=3; 5 if k=4)
Polyphase 2-way    3              1.50 ln NR + 0.99    7.90
Polyphase 3-way    4              1.02 ln NR + 0.96    5.66
Polyphase 4-way    5              0.86 ln NR + 0.92    4.88

Impact of Devices

• Device Impact on External Sorts
  • The sort time is highly influenced by the secondary storage device being used.
  • Tapes must be rewound between passes.
  • On disk, all files may reside on the same disk, but this incurs more overhead because of seek time and latency time as the head(s) switch from file to file.
  • If possible, store the files on separate disks. This allows I/O to overlap and run in parallel. Dedicating a disk to a file reduces seek time and latency time.
  • Further complications arise in a multi-user environment.

End