
A Comparison of Sorting Algorithms

Methods for sorting have been intensively studied and optimized. Nevertheless, many poor algorithms (or poor versions of good algorithms) are still in circulation. VB2TheMax's friend David B. Ring has translated a variety of sorting routines into Visual Basic and compared their performance. The collection includes some of the faster algorithms known (e.g., QuickSort, RadixSort and MergeSort) and incorporates many optimizations known to increase speed and/or consistency. In this article he discusses and compares the various sorting methods, and provides code to use them in your own applications.

by David B. Ring

Methods for sorting have been intensively studied and optimized. Nevertheless, many poor
algorithms (or poor versions of good algorithms) are still in circulation. Recently, I have translated
a variety of sorting routines into Visual Basic and compared their performance. The collection
includes some of the faster algorithms known (e.g., QuickSort, RadixSort and MergeSort) and
incorporates many optimizations known to increase speed and/or consistency. On an 800 MHz
PowerBook running the VBA routines in Excel 2001, I observe sorting speeds as high as a million
random strings or doubles per minute (Table 1). I hope you will find the code for these sorts
useful and interesting.

What makes a good sorting algorithm? Speed is probably the top consideration, but other factors
of interest include versatility in handling various data types, consistency of performance, memory
requirements, length and complexity of code, and the property of stability (preserving the original
order of records that have equal keys). As you may guess, no single sort is the winner in all
categories simultaneously (Table 2).
Let's start with speed, which breaks down into "order" and "overhead". When we talk about the
order of a sort, we mean the relationship between the number of keys to be sorted and the time
required. The best case is O(N); time is linearly proportional to the number of items. We can't do
this with any sort that works by comparing keys; the best such sorts can do is O(N log N), but we
can do it with a RadixSort, which doesn't use comparisons. Many simple sorts (Bubble, Insertion,
Selection) have O(N^2) behavior, and should never be used for sorting long lists. But what about
short lists? The other part of the speed equation is overhead resulting from complex code, and
the sorts that are good for long lists tend to have more of it. For short lists of 5 to 50 keys or for
long lists that are almost sorted, InsertionSort is extremely efficient and can be faster than
finishing the same job with QuickSort or a RadixSort. Many of the routines in my collection are
"hybrids", with a version of InsertionSort finishing up after a fast algorithm has done most of the
job.
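To make the hybrid idea concrete, here is a minimal sketch of my own (not one of the routines in the collection): a QuickSort for longs that hands short sublists to InsertionSort. The names and the cutoff of 20 are assumptions you would tune for your own data.

    Const CUTOFF As Long = 20   ' assumed threshold; tune for your data

    Sub HybridQuickSortL(A() As Long, ByVal lo As Long, ByVal hi As Long)
        Dim i As Long, j As Long, pivot As Long, tmp As Long
        If hi - lo < CUTOFF Then
            InsertionSortL A, lo, hi    ' short sublist: low-overhead InsertionSort finishes
            Exit Sub
        End If
        pivot = A((lo + hi) \ 2)        ' simplest pivot choice: the middle key
        i = lo: j = hi
        Do While i <= j
            Do While A(i) < pivot
                i = i + 1
            Loop
            Do While A(j) > pivot
                j = j - 1
            Loop
            If i <= j Then
                tmp = A(i): A(i) = A(j): A(j) = tmp
                i = i + 1: j = j - 1
            End If
        Loop
        HybridQuickSortL A, lo, j       ' recurse on the two partitions
        HybridQuickSortL A, i, hi
    End Sub

    Sub InsertionSortL(A() As Long, ByVal lo As Long, ByVal hi As Long)
        Dim i As Long, j As Long, key As Long
        For i = lo + 1 To hi
            key = A(i)
            j = i - 1
            Do While j >= lo
                If A(j) <= key Then Exit Do
                A(j + 1) = A(j)         ' shift larger keys one slot right
                j = j - 1
            Loop
            A(j + 1) = key
        Next i
    End Sub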

The third aspect of speed is consistency. Some sorts always take the same amount of time, but
many have "best case" and "worst case" performance for particular input orders of keys. A
famous example is QuickSort, generally the fastest of the O(N log N) sorts, but it always has an
O(N^2) worst case. It can be tweaked to make this worst case very unlikely to occur, but other
O(N log N) sorts like HeapSort and MergeSort remain O(N log N) in their worst cases. QuickSort
will almost always beat them, but occasionally they will leave it in the dust.
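As an illustration of the kind of tweak involved (my own sketch, not the tuned code included with the article), choosing the pivot as the median of the first, middle and last keys makes the O(N^2) case for already-sorted or reverse-sorted input very unlikely:

    Sub SwapL(A() As Long, ByVal i As Long, ByVal j As Long)
        Dim tmp As Long
        tmp = A(i): A(i) = A(j): A(j) = tmp
    End Sub

    Function MedianOfThreeL(A() As Long, ByVal lo As Long, ByVal hi As Long) As Long
        Dim mid As Long
        mid = (lo + hi) \ 2
        ' Order the three sampled keys in place; the middle one is the pivot.
        If A(lo) > A(mid) Then SwapL A, lo, mid
        If A(lo) > A(hi) Then SwapL A, lo, hi
        If A(mid) > A(hi) Then SwapL A, mid, hi
        MedianOfThreeL = A(mid)
    End Function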

Along with worst cases, the nature of keys can also dramatically affect the speed (and the relative
speed) of sorts. This is especially true for string keys, which can vary significantly in length and
relatedness. Longer keys take longer to copy or compare, and highly related keys take longer to
compare because we have to examine more digits before finding a difference. Two sorts may run
neck and neck on short keys, but if one is more efficient in doing fewer comparisons or moves,
the difference on longer keys may be dramatic. Likewise, a sort that manipulates pointers to long
keys can be far faster than one that moves the keys themselves.

Speed is usually the largest issue, but certainly not the only one. If you're sorting a very large
number of keys, having enough memory may still be an issue, and methods that sort "in place"
(i.e., without extra room) will be more desirable than those that use a duplicate array (e.g.,
MergeSort) or those that use a significant amount of stack space for recursive calls. The length
of the sorting algorithms themselves is not much of an issue today, but there is considerable
variation, and short, simple algorithms are appealing and easy to maintain. Similarly, algorithms
that can easily be applied to all data types (strings, integers, longs, doubles, etc.) are convenient,
although they may not be as fast as more specialized algorithms focused on a single data type.
A final issue is "stability", or the ability to preserve the original order of records that have equal
keys. Some sorts have it (Insertion, Selection, Bubble, Merge, Radix, Ternary Quick) and others
don't (Shell, Comb, Heap, Quick). If you want to sort records sequentially on multiple fields, you'll
need it: for example, sorting by first name and then stable-sorting by last name leaves records
with the same last name still ordered by first name.

This article comes with nine sorting algorithms in VB: Selection, Insertion, Shell, Comb,
Heap, Merge, Quick, Multi-Key Quick and MSD Radix. If I could choose only one of
these, it would be MergeSort, because it is quite fast, stable, readily adapts to all data
types and never behaves worse than O(N log N). QuickSort is often faster and just as
adaptable, but is not stable; the well-tuned version described by Robert Sedgewick and
included here is highly unlikely to actually show O(N^2) worst case behavior. Next on
my list would be MSD RadixSort; this one is specialized for strings, but its O(N) linear
behavior will make it faster than anything else if you have millions of strings to sort (and
it's stable). Ternary QuickSort can also be very fast, but may shine better in languages
like C that offer faster byte-level access to strings. Finally, HeapSort is compact in both
code length and memory demands, adaptable to all data types and never worse than O(N
log N). Although I include Comb, Shell and SelectionSort, I think they are outclassed by
the sorts mentioned above, and I include InsertionSort as an auxiliary to Merge, Quick
and Radix, and not for independent usage.

Except for MSD RadixSort and Ternary QuickSort (which are specialized for strings), I
provide two versions of each sort—a "pointerized" version set up for strings and a
"direct" version set up for longs. For example, pMergeSortS is a pointerized sort for
strings and MergeSortL is its direct counterpart for longs. The direct versions rearrange
the keys themselves, while the pointerized versions leave the keys in their original
positions and only rearrange pointers. The pointer approach saves time for strings and
doubles—data types that are longer than 4 byte pointers. The direct approach is just as
fast or faster for short data types like integers, longs or singles. To adapt a string version
to handle doubles or a long version to handle integers or singles, just change the
declaration of the array that holds the keys.
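As a minimal sketch of the pointer approach (my own illustration, not the pMergeSortS code itself), here is an insertion sort over an index array that compares the strings the indices refer to but only ever moves the 4-byte indices:

    Sub pInsertionSortS(Keys() As String, Ptr() As Long, ByVal lo As Long, ByVal hi As Long)
        ' Ptr(lo..hi) holds indices into Keys(); on exit, Keys(Ptr(lo)), Keys(Ptr(lo+1)), ...
        ' read the strings in ascending order while Keys() itself is untouched.
        Dim i As Long, j As Long, p As Long
        For i = lo + 1 To hi
            p = Ptr(i)                   ' the pointer being inserted
            j = i - 1
            Do While j >= lo
                If Keys(Ptr(j)) <= Keys(p) Then Exit Do
                Ptr(j + 1) = Ptr(j)      ' shift pointers, never the strings
                j = j - 1
            Loop
            Ptr(j + 1) = p
        Next i
    End Sub

Before calling, Ptr(i) would simply be initialized to i for each position from lo to hi.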

Along with the sorts, I include a SortBase file containing a number of support routines for
generating, listing or loading sets of random strings, longs and doubles. SortBase also
contains global constants and arrays used by the sorts, and examples of timing routines
for testing them. More details, references, usage hints and commented VB code are
found in the files for each sort. Enjoy.

In computer science, especially the analysis of algorithms, amortized analysis finds the
average running time per operation over a worst-case sequence of operations. Amortized
analysis differs from average-case performance in that probability is not involved;
amortized analysis gives a guaranteed bound on the average time per operation in the
worst case.

The method requires knowledge of which series of operations are possible. This is most
commonly the case with data structures, which have state that persists between
operations. The basic idea is that a worst case operation can alter the state in such a way
that the worst case cannot occur again for a long time, thus "amortizing" its cost.

As a simple example, consider a dynamic array implementation that doubles the size of the
array each time it fills up. An insertion that triggers this reallocation must copy every
existing element, so in the worst case a single insertion requires O(n) time. However, a
sequence of n insertions can always be done in O(n) total time, so the amortized time per
operation is O(n) / n = O(1).
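A minimal VB sketch of such a doubling array (my own illustration; the names and the growth policy are assumptions, not part of the original text):

    Private buf() As Long       ' storage
    Private count As Long       ' number of elements in use
    Private capacity As Long    ' current allocated size

    Sub Append(ByVal value As Long)
        If count = capacity Then
            ' Grow by doubling. The copy performed here costs O(count), but it
            ' happens so rarely that n appends still cost O(n) in total.
            If capacity = 0 Then
                capacity = 1
                ReDim buf(0 To 0)
            Else
                capacity = capacity * 2
                ReDim Preserve buf(0 To capacity - 1)
            End If
        End If
        buf(count) = value
        count = count + 1
    End Sub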

Notice that average-case analysis and probabilistic analysis are not the same thing as
amortized analysis. In average-case analysis, we are averaging over all possible inputs; in
probabilistic analysis, we are averaging over all possible random choices; in amortized
analysis, we are averaging over a sequence of operations. Amortized analysis assumes
worst-case input and typically does not allow random choices.

There are several techniques used in amortized analysis:

• Aggregate analysis determines the upper bound T(n) on the total cost of a
sequence of n operations, then calculates the average cost to be T(n) / n (a worked
example for the doubling array follows this list).

• The accounting method determines the individual cost of each operation,
combining its immediate execution time and its influence on the running time of
future operations. Usually, many short-running operations accumulate a "debt" of
unfavorable state in small increments, while rare long-running operations
decrease it drastically.

• The potential method is like the accounting method, but overcharges operations
early to compensate for undercharges later.
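To make the aggregate method concrete with the doubling array above (a worked example I have added, not part of the original text): over n insertions, writing each new element costs 1, for a total of n, and a copy of the whole array happens only when it fills at sizes 1, 2, 4, and so on, so the total copying cost is at most 1 + 2 + 4 + ... up to the largest power of two not exceeding n, which sums to less than 2n. Hence T(n) <= n + 2n = 3n, and the amortized cost per insertion is T(n) / n <= 3, i.e., O(1), matching the dynamic array example given earlier.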
