Sorting is one of the most common things we do in programming. We are given a bunch of numbers and we want to arrange them according to some rule. Let’s say we want to arrange them in ascending order. To sort these numbers, people tend to use a sorting algorithm that takes place entirely within the memory of a computer. The memory we are talking about is the RAM. Here, we take all the numbers and store them in the memory so that we can sort them. This is possible only when the amount of data is small enough to be stored in the memory. What if we have a hundred trillion numbers to be sorted? It’s too big to be stored in the computer’s memory. How do we do it?
How do we sort things in the first place?
Some of the most common sorting algorithms include bubble sort, insertion sort, selection sort, quick sort, merge sort, heap sort, etc. This is called internal sorting. Now why is that? Let’s take the example of bubble sort to understand this. Here, the adjacent numbers are swapped in order to get them into the right order. This way, the numbers “bubble” up and down through the space. We have to go through all the numbers, and for each number we have to go through the remainder of the array. The complexity is O(n²) for bubble sort. We can do better by using a more efficient algorithm. For example, if we consider merge sort, we break up the data into smaller chunks and then merge them in the right order. We keep doing it recursively until the entire array is sorted. The complexity is O(n*logn) for merge sort. Pretty simple stuff!
Why do we need external sorting?
All the algorithms we talked about earlier assume that we have all the data available. We need to parse the entire dataset to build a sorted version of that data. But what do we do when we are sorting larger datasets? We can only hold a certain amount of data in memory at a time. The rest of the data is usually placed on some larger medium like a hard disk. The problem with this is that the hard disk is slow. If we keep accessing data from this hard disk, it will considerably slow down the whole sorting process.
Let’s consider bubble sorting again and see if we can easily extend it to solve our problem. As we know, sorting has to be done in chunks here. If we divide our input data into ‘n’ chunks, then we can fit each chunk into memory and carry on with the sorting process. We can sort the first chunk, and then we move on to the second chunk, and so on. But now, we realize that some of the records in the first chunk need to bubble through the second chunk, and vice versa. As in, there might be numbers in the second chunk that can go into the first chunk, or may be the chunks that are going to come later. Now this will require us to read and write the chunks into the hard disk, and our performance ends up taking a huge hit.
This is the reason we need an efficient external sorting algorithm that avoids these pitfalls. Some algorithms lend themselves better to handle external sorting. For example, merge sort breaks the data into smaller chunks, sorts the chunks by some other algorithm (maybe bubble sort or quick sort) and then recombines the chunks so that each recombined chunk is in order. This approach minimizes the number or reads and writes of data-chunks from the hard disk. Let’s see how to actually do it.
What exactly is external sorting?
All those algorithms we discussed earlier handle small amounts of data efficiently. As mentioned earlier, they belong to a class of sorting algorithms called Internal Sorting Algorithms. This is called internal sorting because all the data remains “internal” during the sorting process. The internal memory is usually RAM and it’s really fast. External sorting refers to a class of sorting algorithms that can handle massive amounts of data where the data can remain “external” during the sorting process. The external memory is usually the hard disk, and it’s slow. In external sorting, we usually use a strategy that efficiently combines the sorting and merging processes. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file.
Show me an example
Let’s consider a specific example and see how we can use external sorting to sort it. One of the popular techniques of external sorting is the external merge sort algorithm. In this procedure, we break down the data into smaller chunks that can fit into the RAM. We then merge the sorted chunks together. Let’s say we have only 2 GB of RAM and we want to sort 8 GB of data. We have 8 GB of data in main memory and we want to sort it using just 2 GB of RAM. We use the following procedure:
- Read the first 2 GB of data from the main memory into the RAM
- Sort the data in the RAM using quick sort
- Write the sorted data back to the main memory
- Repeat steps 1-3 until all of the 2 GB chunks are sorted. Now, we have 4 chunks (8GB / 2GB = 4 chunks) and we need to merge them into one single sorted output file.
- Read the first 0.4 GB of each sorted chunk and put it into RAM. Since we have 4 chunks, 1.6 GB (4 x 0.4 GB = 1.6 GB) is occupied in the RAM. We need the remaining 0.4 GB in the RAM as an output buffer.
- We now need to perform a 4-way merge and store the result in the output buffer. We just keep adding data into the output buffer and whenever the output buffer fills, we will write it to the final sorted file. Once the data is written out, we will purge the output buffer.
- As we keep merging and writing into the output buffer, our input chunks (4 chunks, 0.4 GB each) in the RAM can get empty. If that happens, we need to fill it with the next 0.4 GB of its associated 2 GB sorted chunk from the main memory.
We need to keep doing this until no more data from the chunks is available. This is an important step in this external sorting procedure and this is the step that makes external merge sort work “externally”. The good thing about this merge algorithm is that it makes only one pass sequentially through each of the chunks. As we can see here, each chunk does not have to be loaded completely. We just load them sequentially as and when needed.