External Sorting
Problem: Sort 1 GB of data with 1 MB of RAM. When a file doesn't fit in memory, there are two stages in sorting:
1. The file is divided into several segments, each of which is sorted separately.
2. The sorted segments are merged.
(Each stage involves reading and writing the file at least once.)
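The two stages can be sketched in a few lines of Python. This is only an in-memory illustration: lists stand in for the segment files, the built-in sorted() stands in for the segment sort, and the names external_sort and capacity are assumptions introduced here, not part of the course material.

```python
import heapq
from itertools import islice


def external_sort(records, capacity):
    """Two-stage sort: build sorted segments, then merge them.

    capacity is the (assumed) number of records that fit in memory.
    """
    it = iter(records)
    segments = []
    # Stage 1: read at most `capacity` records at a time and sort each segment.
    while True:
        segment = sorted(islice(it, capacity))
        if not segment:
            break
        segments.append(segment)  # on disk this would be a separate file
    # Stage 2: merge all sorted segments into one ordered output.
    return list(heapq.merge(*segments))
```

For example, external_sort([5, 3, 9, 1, 7, 2, 8], 3) builds the segments [3, 5, 9], [1, 2, 7] and [8] before merging them.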
CENG 351 Data Management and File Structures 2
Sorting Segments
Two possibilities, depending on the number of disks:
1. Heapsort:
   Optimal routine if only one disk drive is available.
   It can be executed by overlapping the input/output with processing.
   Each sorted segment will be the size of the available memory.
2. Replacement selection:
   Optimal for two or more disk drives.
   Sorted segments are, on average, about twice the size of memory.
   Reading in and writing out can be overlapped.
Heapsort
What is a heap? A heap is a binary tree with the following properties:
1. Each node has a single key, and that key is greater than or equal to the key at its parent node (so the smallest key is at the root).
2. It is a complete binary tree, i.e. all leaves are on at most 2 levels, and the leaves on the lowest level are at the leftmost positions.
3. It can be stored in an array: the root is at index 1, and the children of node i are at indexes 2*i and 2*i+1. Conversely, the parent of node j is stored at index floor(j/2). (Very compact: no need to store pointers.)
Example
Heap as a binary tree (one level per line, top to bottom):
10
35 20
45 40 25 30
60 50 55
Height = floor(log2 n)
Heap as an array:
10 35 20 45 40 25 30 60 50 55
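The index arithmetic of the array layout can be checked with a short Python sketch. The helper names children, parent and is_min_heap are hypothetical; a dummy element at position 0 keeps the 1-based indexing used on the slide.

```python
def children(i, n):
    """1-based indexes of the children of node i in a heap of n nodes."""
    return [c for c in (2 * i, 2 * i + 1) if c <= n]


def parent(j):
    """1-based index of the parent of node j (integer division)."""
    return j // 2


def is_min_heap(keys):
    """True if every key is >= the key at its parent node."""
    heap = [None] + list(keys)  # slot 0 unused so the root sits at index 1
    n = len(keys)
    return all(heap[parent(j)] <= heap[j] for j in range(2, n + 1))


example = [10, 35, 20, 45, 40, 25, 30, 60, 50, 55]
```

Here is_min_heap(example) holds, and children(4, 10) gives indexes 8 and 9, the positions of keys 60 and 50 below key 45.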
Heapsort Algorithm
First Stage: Building the heap while reading the file:
While there is available space
  Get the next record from the current input buffer.
  Put the new record at the end of the heap.
  Reestablish the heap: if the new node is smaller than its parent, exchange the two; otherwise leave it where it is. Repeat this step as long as the heap property is violated.
Second stage: Sorting while writing the heap out to the file:
While there are records in heap
  Put the root record in the current output buffer.
  Replace the root by the last record in the heap, and shrink the heap by one.
  Restore the heap by exchanging the new root with its smaller child as long as the heap property is violated; this step has a complexity of O(log n).
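The two stages might look as follows in Python. This is a minimal sketch rather than the course's reference code: a list replaces the input and output buffers, the array is 1-based with slot 0 unused, and the names sift_up, sift_down and heapsort are assumptions introduced for illustration.

```python
def sift_up(heap, i):
    """Exchange node i with its parent while it is smaller than the parent."""
    while i > 1 and heap[i] < heap[i // 2]:
        heap[i], heap[i // 2] = heap[i // 2], heap[i]
        i //= 2


def sift_down(heap, i, n):
    """Exchange node i with its smaller child while the heap property is violated."""
    while 2 * i <= n:
        child = 2 * i
        if child + 1 <= n and heap[child + 1] < heap[child]:
            child += 1  # pick the smaller of the two children
        if heap[i] <= heap[child]:
            break
        heap[i], heap[child] = heap[child], heap[i]
        i = child


def heapsort(records):
    heap = [None]  # slot 0 unused: the root is at index 1
    # First stage: build the heap while "reading" the records.
    for rec in records:
        heap.append(rec)
        sift_up(heap, len(heap) - 1)
    # Second stage: repeatedly output the root and restore the heap.
    out = []
    n = len(heap) - 1
    while n > 0:
        out.append(heap[1])
        heap[1] = heap[n]  # move the last record to the root
        n -= 1
        sift_down(heap, 1, n)
    return out
```

Each insertion and each removal touches at most one node per level, which is where the O(log n) cost per record comes from.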
Example
Trace the algorithm with: 48 70 30 19 50 45 100 15
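One way to check a hand trace of the first stage is to record the heap array after every insertion. The sketch below uses Python's heapq module, whose heappush performs the same exchange-with-parent step; note that heapq stores the root at index 0, not index 1 as on the slides, and that the function name trace_build is an assumption.

```python
import heapq


def trace_build(keys):
    """Return a snapshot of the heap array after each insertion."""
    heap = []
    snapshots = []
    for key in keys:
        heapq.heappush(heap, key)  # append, then sift up toward the root
        snapshots.append(list(heap))
    return snapshots


for step in trace_build([48, 70, 30, 19, 50, 45, 100, 15]):
    print(step)
```

The last snapshot is the heap from which the second stage starts removing roots.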
Heapsort
How big is a heap?
As big as the available memory.
Multiway Merging
K-way merge: we want to merge K input lists to create a single sequentially ordered output list. (K is the order of a K-way merge.) We will adapt the 2-way merge algorithm:
Instead of two lists, keep an array of lists: list[0], list[1], ..., list[k-1]
Keep an array of the items that are being used from each list: item[0], item[1], ..., item[k-1]
The merge processing requires a call to a function (say MinIndex) to find the index of the item with the minimum value.
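A minimal Python sketch of this adaptation follows. The slide only names MinIndex; the function bodies, the name k_way_merge and the use of None to mark an exhausted list are assumptions made here (so the records themselves must not be None).

```python
def min_index(items):
    """Index of the smallest current item, or None if every list is exhausted."""
    best = None
    for i, item in enumerate(items):
        if item is not None and (best is None or item < items[best]):
            best = i
    return best


def k_way_merge(lists):
    """Merge K sorted lists into a single sorted list."""
    iters = [iter(lst) for lst in lists]
    items = [next(it, None) for it in iters]  # item[i]: current item of list[i]
    out = []
    while True:
        i = min_index(items)
        if i is None:
            break  # all input lists are used up
        out.append(items[i])
        items[i] = next(iters[i], None)  # advance only the list we took from
    return out
```

The linear scan in min_index costs K comparisons per output record, which is acceptable for small K; for large K a heap or selection tree over the current items is the usual replacement.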
Memory available as a work area: 10 MB (not counting memory used to hold the program, O.S., I/O buffers, etc.)
Total file size = 800 MB
Total number of bytes for all keys = 80 MB
So we can do neither internal sorting nor keysorting.
Basic idea
1. Forming runs (i.e. sorted subfiles):
Bring as many records as possible into main memory, sort them using heapsort, and save the sorted records as a run in a small file. Repeat this until all records of the original file have been read.
3. Reading runs into memory for merging. Read one chunk of each run, so 80 chunks. Since the available memory is 10 MB, each chunk can have 10,000,000/80 bytes = 125,000 bytes = 1250 records.
How many chunks must be read for each run? Size of run / size of chunk = 10,000,000/125,000 = 80.
Total number of basic seeks = total number of chunks (counting all runs) = 80 runs * 80 chunks/run = 80² chunks = 6400 seeks, since reading each chunk involves an average seek.
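The arithmetic above can be reproduced with a small Python sketch. The function name merge_read_seeks is hypothetical, 1 MB is taken as 1,000,000 bytes as in the slide's own figures, and the 100-byte record size is inferred from "125,000 bytes = 1250 records".

```python
def merge_read_seeks(file_bytes, memory_bytes, record_bytes):
    """Seek count for reading all runs during a one-step merge."""
    runs = file_bytes // memory_bytes           # each run fills memory once
    chunk_bytes = memory_bytes // runs          # memory is split among all runs
    chunk_records = chunk_bytes // record_bytes
    chunks_per_run = memory_bytes // chunk_bytes
    total_seeks = runs * chunks_per_run         # one seek per chunk read
    return runs, chunk_bytes, chunk_records, chunks_per_run, total_seeks


result = merge_read_seeks(800_000_000, 10_000_000, 100)
print(result)
```

Doubling the file size doubles both the number of runs and the chunks per run, so the seek count grows by a factor of four.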
4. Writing sorted file to disk: after the first pass, the number of separate writes closely approximates the number of reads. We estimate two seeks, one for reading and one for writing, for each piece; with 80 * 80 pieces, writing therefore also takes about 6400 seeks.
So K seeks are required to read all of the records in each run. Since there are K runs, the merge requires K² seeks. Because K is directly proportional to N, it also follows that the sort merge is an O(N²) operation.
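Written out as a short derivation, under the assumptions that every run fills memory and that the merge gives each of the K runs a buffer of 1/K of memory. The symbol M (the number of records that fit in memory) is introduced here for illustration and does not appear on the slide.

```latex
\begin{align*}
K &= \frac{N}{M} && \text{(number of runs)} \\
\text{seeks per run} &= \frac{M}{M/K} = K && \text{(run size divided by buffer size)} \\
\text{total seeks} &= K \cdot K = K^{2} = \left(\frac{N}{M}\right)^{2} = O(N^{2}) && \text{(for fixed } M\text{)}
\end{align*}
```

With the 80 runs of the example, this gives 80² = 6400 seeks.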
Improvements
There are several ways to reduce the time:
1. Allocate more hardware (e.g. disk drives, memory).
2. Perform the merge in more than one step.
3. Algorithmically increase the lengths of the initial sorted runs.
4. Find ways to overlap I/O operations.
Multiple-step merges
Instead of merging all runs at once, we break the original set of runs into small groups and merge the runs in these groups separately.
In each of these smaller merges, more buffer space is available for each run; hence fewer seeks are required per run.
When all of the smaller merges are completed, a second pass merges the new set of merged runs.
Total number of seeks for reading in two steps: 25,600 + 20,000 = 45,600.
What about the total time for the merge?
We now have to transmit all of the records 4 times instead of twice. We also write the records twice, requiring an extra 45,600 seeks.
Still, the trade is profitable (see sections 8.5.1-8.5.5 for actual times).
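The figures 25,600 and 20,000 are consistent with a two-step merge of 800 memory-sized runs, done as 25 separate 32-way merges followed by one 25-way merge. That configuration is not stated in the text that survives here, so it is an assumption, as are the function names in the sketch below.

```python
def one_step_seeks(runs):
    """K-way merge of K memory-sized runs: K seeks per run, K runs."""
    return runs * runs


def two_step_seeks(groups, runs_per_group):
    """Read seeks for merging in two steps (assumed memory-sized initial runs)."""
    # Step 1: each group is a runs_per_group-way merge, so every run is read
    # in runs_per_group chunks.
    step1 = groups * runs_per_group * runs_per_group
    # Step 2: each merged run is runs_per_group memory loads long and is read
    # through a buffer of 1/groups of memory.
    seeks_per_merged_run = runs_per_group * groups
    step2 = groups * seeks_per_merged_run
    return step1, step2


step1, step2 = two_step_seeks(groups=25, runs_per_group=32)
print(step1, step2, step1 + step2)
print(one_step_seeks(25 * 32))
```

Under these assumptions, the single 800-way merge would need 640,000 read seeks, against 45,600 for the two-step version.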
How can we create initial runs that are twice as large as the number of records that we can hold in memory? => Replacement selection
Replacement Selection
Idea
Always select the key from memory that has the lowest value.
Output the key.
Replace it with a new key from the input list.
Memory (P=3), one row per step; at each step the lowest key is output and replaced by the key at the front of the input:
5   47  16
12  47  16
67  47  16
67  47  21
67  47  _
67  _   _
_   _   _
What about a key arriving in memory too late to be output into its proper position? => use of second heap
3. Repeat step 2 as long as there are records left in the primary heap and there are records to be read.
4. When the primary heap is empty, make the secondary heap into the primary heap and repeat steps 1-3.
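A minimal Python sketch of replacement selection with two heaps is given below. Since only steps 3 and 4 survive in the text above, the details are assumptions based on the usual formulation: a key smaller than the one just output cannot join the current run and is set aside in the secondary heap. The names replacement_selection and capacity are hypothetical, and lists stand in for the input file and the run files.

```python
import heapq

_END = object()  # marks the end of the input


def replacement_selection(records, capacity):
    """Split records into sorted runs using a primary and a secondary heap."""
    it = iter(records)
    primary = []
    for rec in it:  # fill memory with the first `capacity` records
        primary.append(rec)
        if len(primary) == capacity:
            break
    heapq.heapify(primary)
    secondary = []  # keys that arrived too late for the current run
    runs, current = [], []
    while primary:
        smallest = heapq.heappop(primary)
        current.append(smallest)  # output the lowest key
        nxt = next(it, _END)
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(primary, nxt)  # still fits in the current run
            else:
                secondary.append(nxt)  # must wait for the next run
        if not primary:  # current run is complete
            runs.append(current)
            current = []
            primary, secondary = secondary, []
            heapq.heapify(primary)
    return runs
```

With capacity 3, the input 16, 47, 5, 12, 67, 21 from the example yields a single run of six keys, twice the memory size. Continuing that input with 7, 17, 14, 58, 24, 18, 33 starts a second run at key 7, because 7 arrives after 21 has already been output.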