Optimizing C++ - the WWW version

ISBN: 0-13-977430-0

Copyright 1999 by Prentice-Hall PTR

Copyright 2000 by Chrysalis Software Corporation

Return to the table of contents

Return to my main page

Zensort: A Sorting Algorithm for Limited Memory


This chapter will explain how to get around the major limitation of the otherwise very efficient distribution counting sort algorithm: its poor performance with limited available memory.

Algorithms Discussed

Zensort: A version of the distribution counting sort for use with limited available memory

Virtual Impossibility

For many years, I've been a very big fan of the "distribution counting sort", which is described in Chapter mail.htm. However, it does have one fairly serious drawback: it doesn't work very well in virtual memory. The problem with the distribution counting sort in virtual memory is that it has very poor locality of reference: that is, rather than stepping through memory in a predictable and linear way, it jumps all over the place. Although this is not optimal even when we are dealing with programs that access data solely in memory, because it makes poor use of the processor cache, it is disastrous when we are dealing with virtual memory. The difficulty is that random accesses to various areas of the disk are much, much slower than sequential accesses: in some cases, the difference may be a factor of 1,000 or more. I discovered this fact (not that I should have had to discover it by experimentation) when I ran some comparison tests between Quicksort and the distribution counting sort for very large files, where the data would not even remotely fit in memory. However, I didn't believe that this was an insuperable obstacle, and I have made a number of attempts to do something about it. Finally, after some years of on-and-off experimentation, I have found the solution.

The Urge to Merge

The solution to this problem is really very simple, when you look at it the right way. If the problem is that we're moving data in a random fashion all over the disk, and thereby incurring massive numbers of positioning operations, perhaps we would do much better if we were to write the data out to intermediate buffers in memory and write those buffers out to disk only when they became full. In this way, we would reduce the number of random disk accesses by a large factor.

Up to this point, I hadn't invented anything new. It has been known for many years that it is possible to sort very large files -- by dividing them into blocks each of which will fit into memory, sorting each block, and then merging the results into the final output file. The difficulty with this method of sorting is the merge phase, which requires a great deal of disk I/O in itself and can be a major contributor to the entire time taken to sort the file. However, I discovered that there was a way to avoid this merge phase.

The key (no pun intended) is that by accumulating the data for the keys over the entire file, we could determine exactly where each output record should go in the entire file, even though we were buffering only a small percentage of the data.

The Initial Implementation

This is probably easier to show in code than it is to explain in English, although of course I'll do both before we are finished. However, let's start with the code, which is shown in Figure zen01.cpp.

Initial implementation of Zensort (Zensort\zen01.cpp) (Figure zen01.cpp)


We start out reasonably enough by extracting the size of the keys, the input file name, and the output file name from the command line. After initializing the timing routine, we allocate space for the buffers that will be used to hold the data on its way to the disk. As you may recall from Chapter mail.htm, the distribution counting sort moves data from the input file to the output file based on the value of the current character of the key being sorted. Therefore, we need one buffer for each possible character in the key being sorted, so that we can keep the data for each character value separate from the data for every other character value.

Now we are ready to start the main body of the algorithm. For each pass through the data, we have to read from one file and write to the other. On the first pass, of course, we will read from the original input file and write to the original output file. However, on the second pass we'll read from the original output file and write back to the original input file, because after the first pass, the output file is sorted on only the last character of the key. In the distribution counting sort algorithm, we sort on each character of the key separately, in reverse order of their positions in the key. After the first pass, the original output file contains exactly the information we need as input for the second pass.

That's the reason we use the modulus operator in the if statement that determines which filenames to use for which files: on every odd numbered pass, we use the original input file name for the input file, and on every even numbered pass, we use the original output file name for the input file. Of course, the reverse is true for the output file.

Once we figure out which file will be used for input and which will be used for output, we open them. Then we initialize all of the buffers by clearing them to 0, initialize the displacement values for each buffer to 0, and initialize the total displacement values for each buffer to 0.

The displacement array keeps track of the amount of data in the input file for each possible value of the current key character. That is, the entry in the displacement array that has the index 65 (the ASCII value for 'A') represents the total size of all the records seen so far in the current pass that have the letter A in the current position in their key. The total displacement array, on the other hand, is used to accumulate the total amount of data in the input file that has a key character less than the ASCII value of the index in the total displacement array. For example, the entry in the total displacement array that has the index 65 represents the total size of all the records whose key character was less than the letter A, which is the same as the displacement into the output file of the first record whose key character is A. Of course, we cannot fill in the values of this array until we are finished with the pass through the file, because until then we do not know the total sizes of all records with a given key character.1

But we are getting a little ahead of ourselves here. Before we can calculate the total size of all the records for a given key character, we have to read all the records in the file, so let's continue by looking at the loop that does that. This "endless" loop starts by reading a line from the input file and checking whether it was successful. If the status of the input file indicates that the read did not work, then we break out of the loop. Otherwise, we increment the total number of keys read (for statistical purposes), calculate the length of the record, and increment the total amount of data read (also for statistical purposes). Next, we determine whether the record is long enough for us to extract the character of the key that we need; if it is, we do so, and otherwise we treat as though it were 0 so that such a record will sort to the beginning of the file.2 Once we have found (or substituted for) the character on which we are sorting, we add the length of the line (plus one for the new-line character that the getline function discards) to the displacement for that character. Then we continue with the next iteration of the loop.

Once we get to the end of the input file, we close it. Then we compute the total displacement values for each character, by adding the total displacement value for the previous character to the displacement value for the previous character. At this point, having read all of the data from the input file, we can display the statistics on the total number of keys and total amount of data in the file, if this is the first pass. This is also the point where we display the time taken to do the counting pass.

Now we're ready for the second, distribution, pass for this character position. This is another "endless" loop, very similar to the previous one. As before, we read a line from the input file and break out of the loop if the read fails. Next, we concatenate a new-line character to the end of the input line. This is necessary because the getline function discards that character from lines that it reads; therefore, if we did not take this step, our output file would have no new-line characters in it, which would undoubtedly be disconcerting to our users.

Next, we extract the current key character from the line, or substitute a null byte for it if it is not present. The next operation is to calculate the current amount of data in the buffer used to store data for this key character. Then we add the length of the current line to the amount of existing data in the buffer. If adding the new line to the buffer would cause it to overflow its bounds, then we have to write out the current data and clear the buffer before storing our new data in it.

To do this, we seek to the position in the output file corresponding to the current value of the total displacement array for the current character value. As we have already seen, the initial value of the total displacement array entry for each character is equal to the number of characters in the file for all records whose key character precedes this character. For example, if the current key character is a capital 'A', then element 65 in the total displacement array starts out at the beginning of the distribution loop with the offset into the output file where we want to write the first record whose key character is a capital 'A'. If this is the first time that we are writing the buffer corresponding to the letter 'A', we need to position the output file to the first place where records whose keys contain the key character 'A' should be written, so the initial value of the total displacement array element is what we need in this situation.

However, once we have written that first batch of records whose keys contain the letter 'A', we have to update the total displacement element for that character so that the next batch of records whose keys contain the letter 'A' will be written immediately after the first batch. That's the purpose of the next statement in the source code.

Now we have positioned the file properly and have updated the next output position, so we write the data in the buffer to the file. Then we update the total number of writes for statistical purposes, and clear the buffer in preparation for its next use to hold more records with the corresponding key character.

At this point, we are ready to rejoin the regular flow of the program, where we append the input line we have just read to the buffer that corresponds to its key character. That's the end of the second "endless" loop, so we return to the top of that loop to continue processing the rest of the lines in the file.

Once we've processed all the lines in the input file, there's one more task we have to handle before finishing this pass through the data: writing out whatever remains in the various buffers. This is the task of the for loop that follows the second "endless" loop. Of course, there's no reason to write out data from a buffer that doesn't contain anything, so we check whether the current length of the buffer is greater than 0 before writing it out to the file.

After displaying a message telling the user how long it took to do this distribution pass, we return to the top of the outer loop and begin again with the next pass through the file to handle the next character position in the key. When we get through with all the characters in the key, we are finished with the processing, so we display a final message indicating the total number of writes that we have performed, free the memory for the buffers, terminate the timing routines, and exit.

Performance: Baseline

So how does this initial version of the program actually perform? While I was working on the answer to this question, it occurred to me that perhaps it would be a good idea to run tests on machines of various amounts of physical memory. After all, even if this algorithm works well with limited physical memory, that doesn't mean that having additional memory wouldn't help its performance. In particular, when we are reading and writing a lot of data, the availability of memory to use as a disk cache can make a lot of difference. Therefore, I ran the tests twice, once with 64 MB of RAM in my machine and once with 192 MB. The amount of available memory did make a substantial difference, as you'll see when we discuss the various performance results.

Figure timings.01 illustrates how this initial version works with files of various sizes, starting with 100,000 records of approximately 60 bytes apiece and ending with one million similar records.3

Performance of Zensort version 1 (Figure timings.01)


According to these figures, this is in fact a linear sort, or close enough to make no difference, at least on the larger machine. An n log n sort would take exactly 1.2 times as long per element when sorting one million records as when sorting 100,000 records. While this sort takes almost exactly that much longer per element for the one million record file on the smaller machine, the difference is only three percent on the larger machine, so obviously the algorithm itself has the capability of achieving linear scaling.

But linear performance only matters if the performance is good enough in the region in which we are interested. Since this is a book on optimization, let's see if we can speed this up significantly.

The Initial Improvements

One of the most obvious areas where we could improve the efficiency of this algorithm is in the use of the buffer space. The particular input file that we are sorting has keys that consist entirely of digits, which means that allocating 256 buffers of equal size, one for each possible ASCII character, is extremely wasteful, because only 10 of those buffers will ever be used. Although not all keys consist only of digits, that is a very common key composition; similarly, many keys consist solely of alphabetic characters, and of course there are keys that combine both. In any of these cases, we would do much better to allocate more memory to buffers that are actually going to be used; in fact, we should not bother to allocate any memory for buffers that are not used at all. Luckily, we can determine this on the counting pass with very little additional effort, as you can see in Figure zen02.cpp.

Zensort version 2 (Zensort\zen02.cpp) (Figure zen02.cpp)


The first changes of any significance in this program are the addition of two new arrays that we will use to keep track of the buffer size for each possible key character and the total number of characters stored in each buffer. Of course, because we are assigning memory to buffers in a dynamic fashion, we can't allocate those buffers until we know how much memory we want to devote to each buffer. Therefore, the allocation has to be inside the main loop rather than preceding it. By the same token, we have to delete each buffer before the end of the main loop so that they can be re-allocated for the next pass.

The next question, of course, is how we decide how much space to devote to each buffer. It seemed to me that the best way to approach this would be to calculate the proportion of the entire file that the records for each key character correspond to, and allocate that proportion of the entire buffer space to the buffer for that key character, so that's how I did it.

First, we add up all of the record length totals; then we compute the ratio of the total amount of space available for buffers to the total amount of data in the file. Then we step through all the different key characters and compute the appropriate size of the buffer for each key character. If the result comes out to be zero, then we don't allocate any space for that buffer; instead, we assign a null pointer to that buffer address, as that is much more efficient than allocating a zero-length buffer. However, if the buffer size comes out to be greater than zero, we allocate the computed amount of space for that buffer, then clear it to zeros. Finally, we clear the buffer character count for that buffer, as we haven't stored anything in it yet.

When it's time to store some data in the buffer, we use the buffer character count array rather than calling strlen to find out how much data is currently in the buffer. I decided to track the count myself because when I first changed the buffer allocation strategy from fixed to variable, the program ran much more slowly than it had previously. This didn't make much sense to me at first, but upon reflection I realized that the longer the buffers are, the longer it would take strlen to find the end of each buffer. To prevent this undesirable effect, I decided to keep track of the size of buffers myself rather than relying on strlen to do it. Of course, that means that we have to add the length of each record to the total count for each buffer as we add the record to the buffer, so I added a line to handle this task.

The Second Version

So how does this second version of the program actually perform? Figure timings.02 illustrates how it works with files of various sizes.4

Performance of Zensort version 2 (Zensort\timings.02) (Figure timings.02)


If you compare the performance of this second version of the program to the previous version on the small file, you'll notice that it is almost 7.5 times as fast on both the 64 MB machine and the 192 MB machine as was the previous version. However, what is more important is how well it performs when we have a lot of data to sort. While we haven't achieved quite as much of a speed-up on larger files as on the smallest one, we have still sped up the one million record sort by a factor of almost 4.5 to one when running the sort on a machine with "only" 64 MB of RAM and more than 6.5 to 1 on the more generously equipped machine with 192 MB of RAM, which is not an insignificant improvement.5 Is this the best we can do? Not at all, as you'll see in the analysis of the other versions of the program. Let's continue with a very simple change that provided some improvement without any particular effort.

The Third Version

At this point in the development of the sorting algorithm, I decided that although saving memory is nice, we don't have to go overboard. On the assumption that anyone who wants to sort gigantic files has a reasonably capable computer, I decided to increase the amount of memory allocated to the buffers from 4 MB to 16 MB. As you might imagine, this improved performance significantly on larger files, although not nearly as much proportionally as our previous change did.

Originally, I wasn't planning to make any changes to the program from the previous version to this one other than increasing the buffer size. However, when running tests with the 64 MB memory configuration, I discovered that making just that one change caused the program to fail with a message telling me I was out of memory. This was hard to understand at first, because I was allocating only 16 MB at any one time; surely a 64 MB machine, even one running Windows 95, should be able to handle that without difficulty!

However, the program was crashing at the same place every time I ran it with the same data, after a number of passes through the main loop, so I had to figure out what the cause might be. At first, I didn't see anything questionable about the program. On further examination, however, I did notice something that was cause for concern: I was allocating and freeing those memory buffers every time through the main loop. While it seems reasonable to me that allocating a number of buffers and then freeing all of them should return the memory allocation map to its original state, apparently this was not the case. At least, that's the only explanation I can find for why the available memory displayed by the debugger should drop suddenly after a number of passes through the main loop in which it remained nearly constant.

Actually, even if allocating and freeing the buffers every time through the loop did work properly, it really isn't the right way to handle the memory allocation task. It's much more efficient to allocate one large buffer and just keep pointers to the places in that buffer where our smaller, logically distinct, buffers reside. Once I made those changes to the program, the crashes went away, so I apparently identified the problem correctly. The new, improved version is shown in Figure zen03.cpp.

Zensort version 3 (Zensort\zen03.cpp) (Figure zen03.cpp)


I think the changes in the program are relatively self-explanatory. Basically, the only changes are the allocation of a new variable called BigBuffer which is used to hold all the data for the records being sorted, and the change of the previously existing Buffer variable to an array of char* rather than an array of char. Rather than allocating and deleting the individual buffers on every pass through the main loop, we merely recalculate the position in the large buffer where the logical buffer for each character begins. The performance results for this version of the program are shown in Figure timings.03.

Performance of Zensort version 3 (Figure timings.03)


While we didn't get as much of an increase in performance from making more room available for the buffers as we did from improving the algorithm in the previous stage, we did get about a 13 percent increase in throughput on the largest file with a 64 MB system, and about 17 percent on the 192 MB system, which isn't negligible.6 Now let's take a look at another way of speeding up this algorithm that will have considerably more effect: sorting on two characters at a time.

The Fourth Version

Every pass we make through the file requires a significant amount of disk activity, both reading and writing. Therefore, anything that reduces the number of passes should help speed the program up noticeably. The simplest way of accomplishing this goal in a general way is to sort on two characters at a time rather than one as we have been doing previously.

This requires a number of changes to the program, none of which is particularly complicated. The new version is shown in Figure zen04.cpp.

Zensort version 4 (Zensort\zen04.cpp) (Figure zen04.cpp)


We'll start by examining a new function called CalculateKeySegment, which, as its name suggests, calculates the segment of the key that we're going to use for sorting. In this case, because we're going to be sorting on two characters at a time, this function calculates a key segment value by combining two characters of the input key, with the more significant character contributing more to the resulting value than the less significant character.

A simple way to think of this optimization is that we're going to sort on an alphabet consisting of 65536 characters, each of which is composed of two characters from the regular ASCII set. Because the maximum possible value of a character is 256, we can calculate the buffer in which we will store a particular record according to two characters of its key by multiplying the first character of the key by 256 and adding the second character of the key. This value will never be more than 65535 or less than 0, so we will allocate 65536 buffers, one for each possible combination of two characters.

Besides the substitution of the key segment for the individual character of the key, the other major change to the program is in the handling of a full buffer. In the old program, whenever a new record would cause the output buffer to overflow, we would write out the previous contents of the buffer and then store the new record at the beginning of the buffer. However, this approach has the drawback that it is possible in some cases to have a record that is larger than the allocated buffer, in which case the program will fail if we attempt to store the record in that buffer.

This wasn't too much of a problem in the previous version of the program, because with only 256 buffers, each of them would be big enough to hold any reasonably-sized record. However, now that we have 65536 buffers, this is a real possibility. With the current implementation, as long as the record isn't more than twice the size of the buffer, the program will work correctly. If we're worried about records that are larger than that, we can change the code to handle the record in any number of segments by using a while loop that will continue to store segments of the record in the buffer and write them out until the remaining segment will fit in the buffer.

So how does this fourth version of the program actually perform? Figure timings.04 answers that question.

Performance of Zensort version 4 (Figure timings.04)


If you compare the performance of this fourth version of the program to the previous version on the small file, you'll notice that it has nearly doubled on both memory configurations. As usual, however, what is more important is how well it performs when we have a lot of data to sort. As you can see from the performance results, the throughput when sorting the large file has improved by over 50 percent on both the small and large memory configurations.

We've just about reached the end of the line with incremental changes to the implementation. To get any further significant increases in performance, we'll need a radically different approach, and that's what the next version of this program provides.

The Fifth Version

Before we get to this change in the program, though, it might be instructive if I explain how I arrived at the conclusion that such a change was either necessary or even possible.

Unlike some well-known authors who shall remain nameless, I take technical writing very seriously. I test my code before publishing it, I typeset my books myself to reduce the likelihood of typesetting errors, and I even create my master CDs myself, to minimize the chance that errors will creep in somewhere in between my development system and your computer. Of course, this doesn't guarantee that there aren't any bugs in my programs; if major software development companies can't guarantee that, my one-man development and quality assurance organization certainly can't! However, I do a pretty good job, and when I miss something, I usually hear from readers right away and can get the correction into the next printing of the book in question.

You may be wondering what this has to do with the changes to the implementation of this sorting algorithm. The answer is that, although I thought I had discovered something important when I broke through the limited memory problem with distribution sorting, I decided it would be a good idea to see how its performance compares with other available sorts. Therefore, I asked a friend if he knew of any resources on sorting performance. After he found a page about sorting on the Internet, I followed up and found a page referring to a sorting contest.

Before I could tell how my implementation would compare to those in the contest, I had to generate some performance figures. Although the page about the contest was somewhat out of date, it gave me enough information so that I was able to generate a test file similar to the one described in the contest. The description was "one million 100-byte records, with a 10-byte random key". I wasn't sure what they meant by "random": was it a string of ten random digits, or 10 random binary bytes, or 10 random ASCII values? I decided to assume that 10 random decimal digits would be close enough to start with, so that's how I created an initial version of a test file. When I ran my latest, greatest version on this file, I was pretty happy when I discovered I could sort about 500,000 records in a minute, because the figures on the contest page indicated that this was quite cost-competitive in the "minute sort" category, which was based on the number of records sorted in a minute; although I was certainly not breaking any speed records as such, my system was much cheaper than the one that had set the record, so on a cost-performance basis I was doing quite well. However, I did need some more recent information to see how the latest competition was going.

So I contacted Jim Gray, who was listed on that page as a member of the contest committee, and heard back from him the next day. Imagine my surprise when I discovered that my "fast" sorting algorithm wasn't even in the ball park. My best throughput of approximately 800 KB/sec or so was less than one third of the leading competitors. Obviously, I had a lot more work to do if I wanted to compete in any serious way.

The first priority was to find out exactly why these other programs were so much faster than my program was. My discussion with Jim Gray gave me the clue when he told me that all of the best programs were limited by their disk I/O throughput. Obviously, if we have to make five passes through the file, reading and writing all of the data on each pass, we aren't going to be competitive with programs that do much less disk I/O, if that is the limiting factor on sorting speed.

Obviously, any possible sorting algorithm must read the entire input file at least once and write an output file of the same size. Is there any way to reduce the amount of I/O that our sorting algorithm uses so that it can approach that ideal?

Although we can't get to that limiting case, it is possible to do much better than we have done. However, doing so requires more attention to the makeup of the keys that we are sorting. Until now, we haven't cared very much about the distribution of the keys, except that we would get larger buffers if there were fewer different characters in the keys, which would reduce the number of disk write operations needed to create the output file and thereby improve performance.

However, if the keys were composed of reasonably uniformly distributed characters (or sets of characters) that we could use to divide up the input file into a number of segments of similar size based on their key values, then we could use a "divide and conquer" approach to sorting that can improve performance significantly. That's what the next version of this program, shown in Figure zen05.cpp, does.

Zensort version 5 (Zensort\zen05.cpp) (Figure zen05.cpp)


This new version of the algorithm works in a different way from the ones we've seen before. Instead of moving from right to left through the keys, sorting on the less significant positions to prepare the way for the more significant positions, we use the leftmost part of the key to decide which logical buffer the record will be put in. Once we have decided which logical buffer the record will go into, we use an insertion sort to stick it into the appropriate place in the buffer, i.e., after any record in that buffer with a lower key and ahead of any record in that buffer with a higher key.

Our use of the insertion sort to arrange the records in each buffer is the reason that we need a reasonably uniform distribution of keys to get good performance. As we have seen in previous versions of the program, if the keys are very non-uniform in their distribution, then each logical buffer will be very large. However, in contrast to our previous experience, with this version of the algorithm big logical buffers are a major impediment to good performance. The problem is that insertion sort is extremely time-consuming when used with large numbers of keys. Therefore, if we have a few big logical buffers, the insertion sort will dominate the time required to execute the algorithm. However, if the keys are reasonably uniformly distributed, then each logical buffer will be small and the insertion sort will not take very long to execute.

If we had enough memory to hold the entire input file, this version of the algorithm would be very simple to implement: we would examine the key for each record, decide which buffer it goes into, then put it into that buffer in the appropriate place via the insertion sort.

However, we may not have enough memory to hold a 100 MB file, and even if we do, using that much memory for our buffer is likely to be counterproductive because it will prevent that memory from being used for disk caching by the operating system, thus reducing the speed of our disk I/O. In addition, this solution is not scalable to larger files. Therefore, we need another approach.

Luckily, there is one that is not too difficult to implement: we make several passes through the input file, selecting the records that will fit in the buffer according to their keys. At the end of each pass, we write the entire buffer out to disk with one disk I/O operation. On each pass, we select successively higher key values, until we have handled all the records in the input file.

Let's see exactly how this works with our usual variable-length record input file. One fairly obvious change to the program is that we no longer are copying data from the input file to the output file and back again, which means we don't have to switch the filenames for the input and output files. Another change I've made, to reduce the amount of memory the program requires, is to eliminate several of the auxiliary arrays: Buffer, BufferSize, and TotalDisplacement. Instead of the Buffer array, I'm using a new array called BufferOffset that keeps track of the position in the one large buffer where each logical buffer starts. The TotalDisplacement array isn't really necessary, because it was only used during the calculations of the positions of the logical buffers and of the total data for all the records in the file; both of these functions have been replaced in the new implementation by a loop that calculates exactly how many records will fit in the big buffer and where they will go.

As for the BufferSize array, that turns out to be unnecessary in the new implementation of the algorithm. Because we will be precalculating the exact amount of space that will be occupied by the records in that buffer, we don't have to worry about whether any particular record will fit in the buffer: if it belongs in that buffer, it will fit.

Before we get to the major changes in the algorithm, I should explain the reason for the new version of CalculateKeySegment. As I've already mentioned, with this new version of the algorithm, it is extremely important to make each logical buffer as small as possible, to reduce the time needed to insert each record into its logical buffer. Therefore, because we want to be able to handle a one million record input file in an efficient manner, we will allocate one million logical buffers within our large physical buffer.

But how do we decide which logical buffer each record should be stored in? Because the keys in this file are composed of numeric digits, we can use the first six digits of the key to determine the appropriate logical buffer for each record. Although this is not theoretically optimal, because the keys in the file are not random, the performance results indicate that it is sufficient to give us a significant increase in speed over the previous version of the program, at least for large files.

Now let's get to the more complicated changes in this version of the algorithm, which are in the calculation of which data we will actually store in the big buffer. That's the responsibility of the code that starts with the line int Split[MAXPASSCOUNT] and ends with the line PassCount = j;. Let's go over this in some detail.

First, we have the declaration of the variable Split, which is used to keep track of which keys we are handling on this pass. This variable is an array of a number of elements sufficient to handle any file of reasonable size: to be exact, we need one element in this array for every possible pass through the input file, so 100 elements would suffice for a file of about 1.6 GB if we're using a 16 MB buffer for each pass.

Next, we have an array called SplitTotalSize, of the same number of elements, which is used to keep track of the total amount of data to be handled in each pass through the input file. We need this information to determine exactly how many bytes we're going to write from the big buffer at the end of each pass.

After declaring a couple of auxiliary variables, we get to the actual code. First, we initialize the value of the first element of the Split array to 0, because we know that on the first pass through the file, we will start handling records from the lowest possible key value, which of course is 0.

Now we're ready to start counting the keys and records that will be stored in the big buffer during each pass through the input file. To calculate the splits, we start by initializing the data count for each split to 0 and the offset of the first logical buffer to zero as well. Then we step through the array of buffer capacities, adding up the capacity of all the logical buffers that we encounter.

As long as we've not yet exceeded the size of the big buffer, we add the size of each logical buffer to the previous total size and set the offset of the next logical buffer to that total size. Once we get to a buffer whose size would cause an overflow of the big buffer capacity, we break out of the loop without updating the total size or the next logical buffer's offset.

Once we've exited from that inner loop, we know what segment of keys we're going to handle on this pass, so we set the next element of the Split array to the number of the last buffer that will fit in the big buffer on this pass. We also know the total size of the data that will be handled on this pass, so we set the value of the SplitTotalSize array element for this pass to that value. Next, we add the amount of data for this pass to the total data in the file so that we can report this information to the user. Finally, if we've reached the end of all the buffers, we break out of the outer loop, delete the BufferCapacity array, and set the PassCount variable to the number of passes that we will need to handle all the data.

Once we know how we are going to split up the input file, the rest of the changes are pretty simple. After opening the output file, we allocate the big buffer, then start the main loop that will execute once for each pass through the input file.

On each pass through the main loop, we handle the records whose keys fall in the range allocated to that pass. This requires changes in several areas of the code. First, at the beginning of the main loop, we clear the buffer character counts for each of the buffers that will be active during this pass. Then we reset the status of the input file, reposition it to its beginning, and clear the big buffer to all zeros.

The next set of changes occur when we have found the key segment for the current record from the input file. If that key segment falls within the current pass, then we have to put the record into the appropriate buffer. However, unlike our previous implementations, we have to be careful exactly where we put the record in the buffer, rather than just appending it at the end. This is because we're going to be writing the records into their final position in the file rather than merely copying them to an output file in partial order by a segment of the key. Therefore, we have to insert each record into the buffer in the precise relative position where it should go in the file. To do this, we compare the key of this record to the key of each record already in the buffer. When we find a record whose key is greater than the key of the record that we're trying to put in the buffer, we shift that record and all the following records toward the end of the buffer to make room for the new record. If the key of the record that we're looking at in the buffer is less than or equal to the key of the record that we want to put in the buffer, we have to locate the next record in the buffer to see if its key is greater than the new record. To do this, we increment our pointer into the buffer until we find a new-line character, which signals the end of the record whose key we have just examined, then continue our key comparison with the next record's key, which follows immediately after. Of course, if we don't find a record with a greater key by the time we get to the end of the buffer, then the new record goes at the end of the buffer.

Each time we reach the end of the outer loop, we have a big buffer filled with data that needs to be written to the file. Therefore, we call the write function to write the exact number of bytes that that pass has handled, which is stored in the SplitTotalSize variable for the particular pass that we are executing.

So how does this fifth version of the program actually perform? Figure timings.05 answers that question.

Performance of Zensort version 5 (Figure timings.05)


As in our previous comparison, we have improved performance radically on the smallest file, producing a 2.5 to 1 increase in speed with the smaller memory configuration and more than doubling it with the larger configuration. As usual, however, what is more important is how well it performs when we have a lot of data to sort. As you can see from the performance results, the throughput when sorting the large file has more than doubled on both the small and large memory configurations.

The Key Requirement

However, unlike our previous improvements, this one carries a fairly hefty price tag: if the keys are not reasonably evenly distributed, the program will run extremely slowly, and in some extreme cases may fail to work at all. The former problem results from the fact that as the number of keys in each logical buffer increases, the time taken to insert a key in that logical buffer increases as well. If the file is too big to fit in memory and a very large proportion of the keys are identical, then the program as currently written may fail to execute: this will happen if the size of one logical buffer exceeds the amount of memory allocated for all buffers.

Does this mean that this new version of the program is useless? Not at all: it means that we have to be prepared for the eventuality that this version of the algorithm may behave badly on certain data and handle that eventuality should it occur. This is the subject of a problem at the end of this chapter.

The Sixth Version

The previous version marks the end of our performance improvements on the original input file, mostly because I couldn't think of any other improvements to implement. However, even after increasing the performance significantly on that file, I was still very interested in seeing how well an adaptation of the same algorithm could be made to perform on an input file like the ones specified in the sorting contest.

It turned out that the changes needed to sort one of those files reasonably efficiently were not terribly complex, as you can see by looking at the next version of the program, shown in Figure zen06.cpp.

Zensort version 6 (Zensort\zen06.cpp) (Figure zen06.cpp)


In fact, all I had to do was change the calculation of the key segment to combine values from the first three ASCII characters in the input line rather than the first six digits of the numeric key in the previous input file, and change the number of logical buffers to correspond to this modification.

So how does this sixth version of the program actually perform? Figure timings.06 answers that question.

Performance of Zensort version 6 (Figure timings.06)


Unfortunately, we can't directly compare the results of this series of tests with the previous one because the file sizes are different. However, we can establish a lower bound on our improvements if we compare the throughput on the one million record file: because throughput generally decreases as the size of the file increases, we can be fairly sure that the throughput figures on the 100 MB file would be lower with the old algorithm than the throughput figures on the 63 MB file with that same algorithm.

As usual, the relative improvement is greatest on the smallest file, being almost twice as fast as the previous program on the smaller memory configuration and over twice as fast on the larger configuration. However, the improvement in throughput on the larger files with the larger memory configuration is also quite substantial. The better key distribution has enabled us to improve performance on the large file under the large memory configuration by 80 percent over our previous best with minimal code changes.

The Final Version

There's one more change we can make that will improve performance significantly on a file with fixed-length records, such as the one used in the sorting contest. Of course, if that were the only "application" using such records, they wouldn't be worth our attention. However, this is far from the case; in fact, most traditional data processing systems rely heavily on files with fixed-length records, which is a main reason they were chosen for the sorting contest in the first place. Therefore, optimizing the code for such records is worthwhile.

It turned out that the changes needed to handle fixed-length records more efficiently were not very difficult, as you can see by looking at the next version of the program, shown in Figure zen07.cpp.

Zensort version 7 (Zensort\zen07.cpp) (Figure zen07.cpp)


Basically these changes consisted of: using the read function rather than the getline function to read a fixed-length record into an input buffer; changing the calculation of the record length to use the fixed length; and stepping through the buffer in units of the fixed record length when searching for the next record rather than searching for the next new-line character as the end-of-record marker. I also took advantage of the fixed-length records to avoid clearing the big buffer on all passes after the first one: since I know exactly where each record (and therefore key) starts in the buffer, as well as how many records are in the buffer, records left over from the previous pass can safely be ignored.

So how does this final version of the program actually perform? Figure timings.07 answers that question.

Performance of Zensort version 7 (Figure timings.07)


I'm pretty happy with these figures, especially the ones for the 10 and 25 million byte files with the large memory configuration! However, even at the 100 million byte mark, the large memory configuration results are quite good, exceeding our previous mark by more than 40%.

As for the contest: Although the commercial sorting programs that have been entered in the sorting contest are noticeably faster than this one on the largest file, somehow I doubt that their authors will give you the source code. Therefore, if you need an algorithm that you can tinker with yourself, and whose inner workings are explained in detail, I don't think you'll find anything faster than this program.


In this chapter, we have developed an algorithm for sorting large files from rather humble beginnings to quite respectable performance by a series of small, incremental steps. This is a good example of how such algorithm development occurs in the real world, in addition to being potentially useful in its own right.

In all, we have improved performance on large files by more than 65 to 1 in the large memory configuration, and by a factor of over 20 even in a relatively small 64 MB machine, compared to our first attempt.


  1. How would you modify the "divide-and-conquer" version of the sorting program so that it would handle uneven key distributions in a reasonable manner rather than running extremely slowly or failing to execute at all?

(You can find suggested approaches to problems in Chapter artopt.htm).


  1. The displacement and total displacement arrays play the same roles in this program as the bucket count and bucket position arrays did in the distribution counting program in Chapter
  2. mail.htm.
  3. This special handling for short records is required so that we don't accidentally pick up garbage characters past the end of the key if the record we are handling is shorter than the number of characters on which we wish to sort.
  4. Notes on performance tables:

    1. All times are in seconds.

    2. All tests were run on a machine with a Pentium II processor running at 233 MHz with either 64 or 192 MB of RAM and a Western Digital model 35100 5.1 GB Ultra DMA hard drive. The disk partition on which the tests were run was defragmented before each test was run.

    3. The programs were run in a DOS session under Windows 95.

    4. I know that the entries in some of the figures look suspicious, as the timing for the small and large memory configurations are sometimes identical on the 100000 record case. However, the entries are correct; I guess it's just a fluke of testing.

  5. All times are in seconds.
  6. You may be wondering how much of the improvement in performance was due to the better buffer allocation strategy and how much to the other minor improvements such as keeping track of the buffer contents ourselves rather than calling strlen. Unfortunately, I can't give you that information because I did not run tests with each of those factors isolated; there are just too many combinations for me to test in a reasonable amount of time. However, because you have the source code for all the different versions of this program, you can find that out for yourself. I would be interested in receiving the results of any comparative tests you might run.
  7. In case you were wondering how much of the performance improvement was due to using one large buffer rather than many small buffers, there was little or no performance improvement from that change. However, according to the first law of optimization, that doesn't make any difference because the program didn't work properly until I made that change.

Comment on this book!

Return to the table of contents

Return to my main page