Students: Daniel Intskirveli and Gurpreet Singh CS220, The City College of New York April 25 th, 2014 INTRODUCTION
It is impossible to obtain a deep understanding of the implementation and runtime complexity of classical algorithms without temporarily abandoning the theoretical domain of textbooks and lectures in favor of a more practical approach. That approach is familiar to any computer science student and involves several recurring themes: choosing a programming language, writing the code, testing the algorithms to make sure they actually work, devising a way to count the work done by an algorithm, devising a way to measure its execution time, comparing those statistics across algorithms, and so on.
To mitigate the pains brought on by these recurring themes, we chose to avoid the typical ad hoc approach of writing an algorithm, testing it, and then repeating that process for every algorithm encountered. Instead, we created a testing framework for all the algorithms we have encountered so far. We called the program simply Fun With Sorts, a suitable name because most of the algorithms are sorts, and because fun was had. This paper discusses the design of the framework we wrote to explore the performance and implementation of various algorithms, the results we gathered using it, and the insight those results offer.
1. DESIGN
C++ was chosen as the language for the majority of the application because it has features that aid object-oriented programming and because it is low-level enough for our needs: we wanted to avoid the overhead, as well as the unwanted optimizations, of higher-level languages. The framework is also designed to be modular, which allows it to be extended to include more algorithms. In short, the program can be thought of as having the following components:
A. Main - the entry point for the application, where the user decides which sorts to test.
B. Stats - modules that facilitate measuring performance. At the time of writing there are two: the Counter class and the Timer class.
C. Utilities - a module containing functionality common to all algorithms for testing purposes, e.g. functions that create random arrays, functions that output data from Stats modules to files, functions that check whether a file is sorted, helper functions for running algorithms, etc.
D. Algorithms - modules that group algorithms sharing the same characteristics, which should be tested together. Each of these modules defines the counters and timers it will use, the implementations of the algorithms being tested, and other settings specific to that group, such as the file sizes of the data consumed by the algorithms and the type of data they consume (e.g. random array, reverse-sorted array). At the time of writing there are two groups: project_1 and project_2.
E. Grapher - a post-processing tool for the data output by the rest of the framework. As of writing, it generates plots and bar graphs for the Stats data, and aggregates it to obtain the sum (total work) and the average. Because, in its current iteration, the Grapher code needs to be changed often to be useful, we chose Python as its language. Python has the advantage of being an interpreted language with extensive library support; it is slower, but producing graphs is not performance sensitive.
1.1 WALKTHROUGH OF A RUN
Each algorithm in a group is run against the same data set. This data set is taken from the samples vector[1], a fundamental part of Fun With Sorts. The samples vector is a vector that stores pairs of arrays and their respective sizes (arrays are primitives in C, and the size of one must be known to do any useful computation). Each sample, a pair of an array and its size, is a list to be processed by an algorithm. An algorithm group module defines the types of input arrays to be generated, and their sizes. Every algorithm in the group can then be run against each sample in the samples vector. For each algorithm being run, a copy of the samples vector is made in order to sandbox the side effects of the algorithm (we do not want to run a sort on an array that has already been sorted by the algorithm that ran before it, for example). This copy is then discarded. As the algorithms run, they use their respective Stats objects (either counters or timers) to record statistics. When the algorithms finish, the data from these Stats objects is exported to CSV[2] files in the working directory of the application. Afterwards, the Grapher tool reads the CSV files to aggregate the data and create visualizations.
Note: For simplicity, all sorting algorithms sort integer primitives as opposed to more complex objects, and they sort to non-decreasing order.
1.2 COUNTERS
[1] Throughout this document, the word vector refers to the vector class in C++, a standard-library class for storing a one-dimensional list of objects.
[2] A file format that contains data as comma-separated values. This format is supported by our Grapher as well as various popular software such as Microsoft Excel.

The Counter class was designed with one goal in mind: to measure work done in a way that is reproducible on any machine, regardless of the age of the hardware. It is backed by a vector of 64-bit unsigned integers for compatibility with large data sets: the vector can grow dynamically, and the 64-bit integers prevent the overflows that would occur with 32-bit integers once counter values exceed 2^32.
Each of the algorithms is associated with one or more counters. Before the algorithm begins, the user must call the next() function for all counters. This moves the counter cursor to the next run, and subsequent calls to increment() will increase the value of that run.
The Counter class comes with one important gotcha. Because it doesn't count total work done in the traditional sense, the user must increment the counter correctly in order to get a useful value. While a traditional work count might include all machine instructions, or one increment per line of code in the algorithm, the Counter class counts only the work relevant to the performance of the algorithm, such as the number of comparisons and exchanges. The advantage is that, for a particular set of input data, the resulting counter values will be the same on any machine. The downside is that the values are meaningless unless the counter is used correctly.
1.3 TIMERS
The workflow of the Timer class is similar to that of the Counter class. Timer values are inherently non-reproducible: because a timer measures the execution time of an algorithm, different machines will yield different results, depending on how busy they are, the speed of their processors, and the speed of their hard drives. However, Timer values are useful for comparing the runtimes of several algorithms on large file sizes on the same machine. To use a Timer, a user must call next() and start() on the Timer associated with the algorithm before the algorithm begins, and stop() when it finishes. Unlike the Counter class, the Timer is easy to integrate and use. However, its results differ from machine to machine and even from run to run, and it cannot discriminate between comparisons and exchanges, for instance.
2. PROJECT 1
This is our first algorithm group: a collection of O(n^2) sorts, as well as a plain linear search and two variations on it. The algorithms were tested with the following file sizes:
n = 500
n = 2,500
n = 12,500
n = 62,500
For each file size, the algorithm was expected to perform work on the following types of input data:
Random[3]
Reverse-ordered[4]
20% sorted[5]
2.1. PROJECT 1 ALGORITHMS AND RESULT DISCUSSION
Note: All results are available in the appendix.
Bubble Sort
The pure bubble sort gave us results that are typical for a bubble sort. To illustrate this, included on the following page is a histogram generated by our Grapher tool. The table is organized by the 3 input array types (random, reverse-ordered, 20% sorted) and the 4 file sizes, arranged in increasing order from left to right. As is typical of a bubble sort, our histogram shows that the number of exchanges the algorithm must make is much higher for the reverse-ordered list than for a random or partially sorted list.

[3] An array of size N containing randomly generated integers in the range 0 to a maximum defined in the algorithm group module.
[4] A reverse-ordered array: if the size is N, the array is [N-1, N-2, ..., 0].
[5] A randomly generated array in which 20% of the keys (1/5 of the file size) were randomly chosen to be taken out, sorted, and put back in the array.
Adaptive Bubble Sort
Adaptive bubble sort differs from pure bubble sort in that it knows it is finished when it examines the entire array and no swaps are needed, meaning the list is completely sorted. It keeps track of whether any swaps occurred by using a flag. The data in the appendix shows that the adaptive version of the bubble sort performs better than the regular version.
Insertion Sort
Insertion sort is a comparison-based algorithm, and it is one of the better O(n^2) algorithms: it makes a single left-to-right pass through the array, shifting elements rather than swapping them. Insertion sort can be very fast and efficient on smaller arrays, but it loses this efficiency on large amounts of data. Our data shows that for random and partially sorted lists, insertion sort performs slightly better than selection sort; when the input file is reverse-ordered, selection sort performs much better. In either case, insertion sort is substantially faster than either of the bubble sort variations.
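The shifting behavior described above can be sketched as follows (an illustrative version, not the framework's exact code):

```cpp
#include <cstddef>

// Insertion sort: one left-to-right pass; larger elements are shifted
// right to open a slot for the key, which is cheaper than full swaps.
void insertionSort(int* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {  // comparisons
            a[j] = a[j - 1];               // shift right
            --j;
        }
        a[j] = key;                        // drop the key into place
    }
}
```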
Selection Sort
In selection sort, each pass moves the unsorted element with the smallest value into its proper position in the array. Selection sort is the fastest of the Project 1 sorts for reverse-ordered files.
Sequential Search
This is an unmodified linear search. It simply goes through the array until it finds an element that matches the key, then returns the index of that element. Because the key to be sought is selected randomly and the array is not modified after a key is found, the plot we generated is scattered and shows no improvement over time. Most of the data points lie near the top of the graph, where counter values are larger than N/2, the average case for a sequential search. This is because the elements to be sought were chosen with a bias toward the back of the input file, as explained in the following section about the adaptive version of this search.
Adaptive Sequential Search (1)
The first adaptive version of the sequential search uses the move-to-front approach to organizing lists in order to optimize search time. This search is unique among the algorithms we wrote because it takes a linked list, not an array, as the list of data to search: moving an element to the front of a linked list is more efficient than doing so in an array. We also experimented with an approach in which the found element was swapped with the first element in the list; this gave us no noticeable improvement over time. To run this test, we randomly selected N elements (the same number as the size of the list) from the list, with a bias[6] toward the end of the list that artificially lengthens the runtime of the search, and then searched the list for those elements using the adaptive sequential search. Each list was searched N*5 times (2,500 algorithm runs for a size of 500, for example). Our Grapher tool then generated a graph of all the runs (on the following page). The improvement here is clearly visible. At first, the search struggles with the keys we tell it to find, because they are mostly at the end of the list. Over time, the algorithm organizes the list: the sought keys, originally clustered near the end, are gradually moved to the front, which improves and stabilizes (i.e. lowers the variance of) the search performance.
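The move-to-front step is the whole trick here. A sketch using std::list (a doubly linked list, so relinking a node to the front is O(1)) might look like this; the function name is our illustration.

```cpp
#include <list>

// Sequential search with the move-to-front heuristic: on a hit, the
// found node is spliced to the head of the list so that frequently
// sought keys migrate toward the front.
bool moveToFrontSearch(std::list<int>& xs, int key) {
    for (auto it = xs.begin(); it != xs.end(); ++it) {
        if (*it == key) {
            xs.splice(xs.begin(), xs, it);  // relink the node, no copying
            return true;
        }
    }
    return false;  // key not present; list order unchanged
}
```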
[6] We wrote a random function with a non-uniform distribution that selects keys which are more likely to be near the end of the list, using the following logic: if rand() is a function that returns a random floating-point number between 0 and 1, then rand()*rand() has a distribution heavily skewed toward 0, and 1 - rand()*rand() is a vertical flip of that distribution, which is what we want. The distributions of those functions are plotted below (the area on the right is more relevant to our use case):
Adaptive Sequential Search (2)
The second adaptive version of the sequential search uses a different approach to organizing the list for better search performance: every time a key is found, it is swapped with the element directly before it. Over time, this should bubble the more popular sought keys toward the front of the list, increasing performance. On large file sizes such as the ones we tested, we saw no noticeable improvement in search time over successive runs. However, when we tested the performance on a smaller list of 100 random numbers, we clearly saw an improvement in search time when running the search 10,000 times (top right). Note that this is a much higher number of runs than the program performs by default, which is 5N runs for a file size of N. We hypothesize that the lack of improvement on larger file sizes occurs because the list takes longer to organize, so the potential improvement fell outside the range of our tests.
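The swap-with-predecessor step can be sketched as follows. The name and the choice to return the post-swap index are our illustration, under the assumption that the caller only needs to know whether and roughly where the key was found.

```cpp
#include <cstddef>
#include <utility>

// Transposition search: on a hit, swap the key one position toward the
// front, so popular keys slowly migrate forward over many searches.
// Returns the index of the key after any swap, or -1 if absent.
long transposeSearch(int* a, std::size_t n, int key) {
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] == key) {
            if (i > 0) {
                std::swap(a[i - 1], a[i]);  // one step toward the front
                return static_cast<long>(i - 1);
            }
            return 0;  // already at the front
        }
    }
    return -1;
}
```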
3. PROJECT 2
A collection of more complex sorts, each of which usually has a faster run time than the sorts in project 1. These sorts were tested on the following file sizes:
n = 2,500
n = 12,500
n = 62,500
n = 312,500
With the following input file types:
Random
20% sorted
50% sorted[7]
3.1. PROJECT 2 ALGORITHMS AND RESULT DISCUSSION
Note: All results are available in the appendix.
Merge Sort
This is a straightforward merge sort. Merge sort uses the divide-and-conquer paradigm to sort a list. In our results, we observed that a partially sorted array (20% sorted) performs better than a half-sorted array (50% sorted). We also observed that an array with random values takes longer to sort than a partially sorted array.
Merge To Insertion Sort (20)
Merge To Insertion Sort (20) is similar to the straightforward merge sort, but sorts a sub-list using insertion sort when the sub-list has fewer than 20 elements. Our results show that this hybrid performs faster than straight merge sort in all cases (random, 20% sorted, and 50% sorted lists). It performs better because insertion sort is fast at sorting very small lists.

[7] A randomly generated array in which 50% of the keys (1/2 of the file size) were randomly chosen to be taken out, sorted, and put back in the array.
Merge To Insertion Sort (100)
Merge To Insertion Sort (100) is the same hybrid, but switches to insertion sort when the sub-list has fewer than 100 elements. It was observed in our results that this variant performs faster than the 20-element variant in all cases (random, 20% sorted, and 50% sorted lists).
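Both hybrids above differ only in the cutoff. A sketch with the cutoff as a parameter (names and structure are our illustration, not the framework's exact code):

```cpp
#include <cstddef>
#include <vector>

// Insertion sort on the closed sub-range [lo, hi].
static void insertion(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i <= hi; ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
        a[j] = key;
    }
}

// Merge the sorted halves [lo, mid] and [mid+1, hi].
static void merge(std::vector<int>& a, std::size_t lo, std::size_t mid,
                  std::size_t hi) {
    std::vector<int> tmp(a.begin() + lo, a.begin() + hi + 1);
    std::size_t i = 0, j = mid - lo + 1, k = lo;
    while (i <= mid - lo && j <= hi - lo)
        a[k++] = (tmp[i] <= tmp[j]) ? tmp[i++] : tmp[j++];
    while (i <= mid - lo) a[k++] = tmp[i++];
    while (j <= hi - lo)  a[k++] = tmp[j++];
}

// Merge sort that hands sub-lists below `cutoff` (20 or 100 in our
// tests) to insertion sort.
void mergeToInsertionSort(std::vector<int>& a, std::size_t lo,
                          std::size_t hi, std::size_t cutoff) {
    if (hi <= lo) return;
    if (hi - lo + 1 < cutoff) { insertion(a, lo, hi); return; }
    std::size_t mid = lo + (hi - lo) / 2;
    mergeToInsertionSort(a, lo, mid, cutoff);
    mergeToInsertionSort(a, mid + 1, hi, cutoff);
    merge(a, lo, mid, hi);
}
```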
Quick Sort
This is a plain quick sort that uses the last element in the list as the pivot. Quick sort also uses the divide-and-conquer paradigm to sort a list. Our results show that quick sort performs fastest on a 20% sorted list, while a half-sorted (50% sorted) list takes more time to sort than a randomized one.
Quick To Insertion Sort
This is the same as the quick sort above, but when a sub-list becomes small enough (fewer than 50 elements), it is sorted using insertion sort. Our results show that the total work done by this sort is less than that of plain quick sort, because insertion sort has a better constant than quick sort when sorting smaller lists.
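A sketch of this hybrid, with a last-element (Lomuto-style) partition and the 50-element cutoff; names are our illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Insertion sort on the closed sub-range [lo, hi].
static void insertionRange(std::vector<int>& a, long lo, long hi) {
    for (long i = lo + 1; i <= hi; ++i) {
        int key = a[i];
        long j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
        a[j] = key;
    }
}

// Partition around the last element; returns the pivot's final index.
static long partition(std::vector<int>& a, long lo, long hi) {
    int pivot = a[hi];  // last element as the pivot
    long i = lo;
    for (long j = lo; j < hi; ++j)
        if (a[j] < pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    return i;
}

// Quick sort that hands sub-lists of fewer than 50 elements to
// insertion sort.
void quickToInsertionSort(std::vector<int>& a, long lo, long hi) {
    if (hi <= lo) return;
    if (hi - lo + 1 < 50) { insertionRange(a, lo, hi); return; }
    long p = partition(a, lo, hi);
    quickToInsertionSort(a, lo, p - 1);
    quickToInsertionSort(a, p + 1, hi);
}
```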
Quick Median-Of-Three Sort
A quick sort that, instead of choosing the last element as the pivot, uses the median-of-three strategy: it randomly chooses three elements from the list, then uses the median of those three as the pivot. With distinct keys, this guarantees that the pivot is never the largest or smallest element in the list, avoiding quick sort's worst case. It was observed in our results that this sort performs faster than both plain quick sort and the quick sort that applies insertion sort to sub-lists of fewer than 50 elements.
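The pivot-selection step can be sketched on its own; the function name is our illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>

// Median-of-three pivot selection: sample three random positions and
// return the median of their values.
int medianOfThreePivot(const int* a, std::size_t n) {
    int x = a[std::rand() % n];
    int y = a[std::rand() % n];
    int z = a[std::rand() % n];
    // median(x, y, z) without sorting all three:
    return std::max(std::min(x, y), std::min(std::max(x, y), z));
}
```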
Quick Median-Of-Three To Insertion Sort
A median-of-three quick sort (same as above) that uses insertion sort on sub-lists smaller than 50 elements. It was observed in our results that this sort does perform faster than a plain median-of-three quick sort: it decreases the number of exchanges needed to sort, although the insertion sort adds more work in comparisons. Overall, this sort performs the best of all the quick sort variations used in this project.
Heap Sort
A sort in which all of the elements are inserted into a heap that is continually updated, while the largest values are taken out of it and inserted back into the array. The total work done by heap sort was the worst of all the Project 2 sorts. It was observed in our results that heap sort performed faster when the array was 50% sorted than when it was 20% sorted; when the array was random, more work was done in both comparisons and exchanges.
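The heap-sort idea described above can be sketched with the standard library's heap primitives; this illustrates the technique, not the framework's hand-rolled heap.

```cpp
#include <algorithm>
#include <vector>

// Heap sort: build a max-heap over the whole array, then repeatedly
// pop the current maximum to the back of the shrinking heap range.
void heapSort(std::vector<int>& a) {
    std::make_heap(a.begin(), a.end());     // sift elements into a max-heap
    for (auto end = a.end(); end != a.begin(); --end)
        std::pop_heap(a.begin(), end);      // move the max to position end-1
}
```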
Shell Sort
Shell sort is a generalization of insertion sort. It starts out by sorting elements that are far apart, then decreases that gap with subsequent iterations. With the increments we provided, shell sort performed faster than all the other Project 2 sorts. It was observed in our results that shell sort 1 performed faster than shell sort 2 because shell sort 1's increments were better.
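A sketch of the gapped passes described above. The halving gap sequence here is only an illustration; our shell sort 1 and shell sort 2 used different increment sequences.

```cpp
#include <cstddef>

// Shell sort: insertion-style passes over elements `gap` apart, with
// the gap shrinking each round until a final gap-1 pass.
void shellSort(int* a, std::size_t n) {
    for (std::size_t gap = n / 2; gap > 0; gap /= 2) {
        for (std::size_t i = gap; i < n; ++i) {
            int key = a[i];
            std::size_t j = i;
            while (j >= gap && a[j - gap] > key) {
                a[j] = a[j - gap];  // shift within the gapped sub-list
                j -= gap;
            }
            a[j] = key;
        }
    }
}
```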
4. FUTURE CONSIDERATIONS
We plan to expand the functionality of Fun With Sorts in the future for even easier testing of various algorithms. For one thing, there are several design improvements that can be made.
More Robust Stats
At the time of writing, there are two Stats modules: Timer and Counter. While these are already easy to use, they are not scalable and need to be improved for two basic reasons:
1. They do not support grouping their results (into bins of file size and input type, for example), because they are backed by a one-dimensional data structure. At the time of writing, separation of Stats data into so-called bins is handled by the output function in the Utilities module.
2. They do not inherit from a parent class. As a result, having two Stats classes (we started with just one, the Counter) forces us to keep copied code that duplicates functionality.
An Algorithm Group Interface
The two modules, Project 1 and Project 2, must also be made to extend a base Algorithm Group class that provides virtual function prototypes for all the settings you would expect in one of these groups. Each algorithm group needs to have a function that returns the file sizes and the input file types, for example.
Configurable Grapher
At the time of writing, the Grapher reads the output data from the framework and creates a bar graph for each Stat associated with a sort, and a line plot for each Stat associated with a search. As mentioned before, it also does some simple math to calculate total and average work. Generating proper, useful graphs is currently ad hoc: if you insist on straying from the default behavior of the Grapher, you must change its code. A future iteration of the Grapher will be configurable. This would be useful, for example, for grouping several Counters into a single bar graph.
A Test Suite
The program should automatically run all of the algorithms on small sets of data to make sure that the algorithms continue to work as the code is modified and more algorithms are added.
Existing Algorithms Improved
The code for some of the algorithms we implemented and tested needs work. Most notably, the shell sort should be modified to use different intervals. Also, the adaptive searches should be tested with sought elements drawn from different distributions[8].
Performance Improvements
Fun with Sorts should be thoroughly checked for memory leaks before its next use in more complex applications. The application should also support the use of multiple threads for running algorithms concurrently.
5. CONCLUSION
For the most part, the results we obtained using our testing application aren't particularly useful in themselves, because they simply confirm well-known runtime complexities for several classic algorithms; detailed analyses of these algorithms are readily available. However, the application has the potential to improve students' understanding of how various algorithms work, and it can provide visual representations of the statistics collected while running them.
We found that the most insightful data came from the counters for the sequential searches in Project 1, which show a performance improvement over time for the adaptive versions of the search. In fact, as we were testing our code and going through iterations of the searches, we saw the plots change shape as the search code improved.
[8] A distribution biased toward the end of the list was chosen because it shows the most dramatic improvement. In reality, if a computer scientist knows that the keys being searched for are predominantly at the end of the file, he would write an algorithm that starts the search from the end of the file, not the front.