In this chapter we will use a selective mailing list system to illustrate rapid access to and rearrangement of information selected by criteria specified at run time. Our example will allow us to select certain customers of a mail-order business whose total purchases this year have been within a particular range and whose last order was within a certain time period. This would be very useful for a retailer who wants to send coupons to lure back the (possibly lost) customers who have spent more than $100 this year but who haven't been in for 30 days. The labels for the letters should be produced in ZIP code order, to take advantage of the discount for presorted mail.
The Distribution Counting Sort, Bitmaps
To begin, let us assume that the information we have about each customer can be described by the structure definition in Figure custinfo.
A straightforward approach would be to store all
of this information in a disk file, with one
record for each customer. In order to construct a selective mailing
list, we read through the file, testing each record to see whether it
meets our criteria. We extract the sort key (ZIP code) for each record
that passes the tests of amount spent and last transaction date,
keeping these keys and their associated record numbers in memory. After
reaching the end of the file, we sort the keys (possibly with a
heapsort or Quicksort) and rearrange the record numbers according to
the sort order, then reread the selected records in the order of their
ZIP codes and print the address labels.
It may seem simpler just to collect records in memory as they meet our criteria. However, the memory requirements of this approach might be excessive if a large percentage of the records are selected. The customer file of a mail-order business is often fairly large; files of 250000 or more 100-byte records are not unheard of. In such a situation, a one-pass approach might require as much as 25 megabytes of memory just to store the selected records.
However, if we keep only the keys of the selected records in memory and print the label for each customer as his record is reread on the second pass, we never have to have more than one record in memory at a time. In our example, the length of the key (ZIP code) is nine bytes and the record number is two bytes long, so that 250000 selected records would require only 2.75 megabytes. This is a much more reasonable memory budget, considering that our program might not be the only one running, especially under an operating system like Windows™.
Even so, there is no reason to allocate all the storage we might ever need in advance, and every reason not to. We'd like the program to run in the minimum memory possible, so that it would be useful even on a machine with limited memory or one that is loaded up with network drivers and memory-resident utilities.
A linked list is a fairly simple way to allocate memory as needed, but we must be careful to use this method efficiently. Allocating a new block of storage for each record that matches our criteria would be very wasteful, as extra control storage is needed to keep track of each allocation. In MS-DOS, each allocation requires at least 16 bytes of control storage, so 250000 allocations would use about 4 megabytes just for the control storage! This problem is easy to solve; we will allocate storage for a number of records at once, 1000 in our example, which reduces that 4 megabyte overhead to 4 kilobytes.1
Let's start our optimization by trying to reduce the memory we need for each selected record during this first pass through the file. One possibility is to convert the ZIP codes to a more compact representation to save memory; unfortunately, such data is not suitable for our purposes here.
However, there is another way for us to save memory during the first pass: using a bitmap to keep track of the records that match our criteria. A bitmap is an array of bits that can be used to record one characteristic of a number of items, as long as that characteristic can be expressed as yes/no, true/false, present/absent or some similar pair of opposites. In this case, all we want to remember is whether each record was selected. So if we allocate a bitmap with one bit for each possible record number, we can clear the whole bitmap at the start (indicating we haven't selected any records yet) and then set the bit for each record number we select. When we get done with the selection, we will know how many records were selected, so we can allocate storage for the record numbers we need to sort. Then we can go through the bitmap, and every time we find a bit that is set, we will add the corresponding record number to the end of the list to be passed to the sort.
Of course, we could use an array of bytes, one per record, instead of a bitmap, which would not require us to write bit access functions. However, the bitmap requires only one-eighth as much storage as an array of bytes, which can result in considerable savings with a big array. In our example, with a 250000-record file, a bitmap with one bit per record would occupy only about 31 KB, whereas an equivalent byte array would occupy 250 KB. While this difference may not be significant in the present case, larger files produce proportionally larger savings.
As with almost everything, there is a drawback to the use of bitmaps; they usually require more time to access than byte arrays. This is especially true for those machines (such as the 80x86) that do not have an instruction set designed for easy or efficient access to individual bits. Even on machines like the 68020 and its successors, which do have good bit manipulation instructions, compiler support for these functions may be poor. However, since CPU time is unlikely to be the limiting factor while we are reading records from our input file, this extra processing time is probably unimportant in our case.
Now that we have reduced the memory requirements for the first pass, is there any way to speed up the program? In my tests with large numbers of selected records, it turns out that the disk reading and writing time are a small part of the entire elapsed time required to run the program, at least when we are using a standard sorting function. In such a case, by far the greatest proportion of the time is spent in sorting the keys, so that's where we'll focus our attention next.
Which sort should we use? Surely we can't improve on the standard sort supplied with the C++ compiler we're using (typically Quicksort) or one of its relatives such as heapsort. After all, the conventional wisdom is that even the fastest sorting algorithm takes time proportional to n*log(n). That is, if sorting 1000 items takes 1 second, sorting 1,000,000 items would take at least 2000 seconds, using an optimal sorting algorithm. Since these commonly used algorithms all have average times proportional to this best case, it seems that all we have to do is select the one best suited for our particular problem.
This is a common misconception: in fact, only sorts that require comparisons among keys have a lower bound proportional to n*log(n). The distribution counting sort we are going to use takes time proportional to n, not n*log(n), so that if it takes 1 second to sort 1000 items, it would take 1000 seconds to sort 1,000,000 items, not 2000 seconds.2 Moreover, this sort is quite competitive with the commonly used sorts even for a few hundred items, and it is easy to code as well. For some applications, however, its most valuable attribute is that its timing is data independent. That is, the sort takes the same amount of time to execute whether the data items are all the same, all different, already sorted, in reverse order, or randomly shuffled. This is particularly valuable in real-time applications, where knowing the average time is not sufficient.3

Actually, this sort takes time proportional to the number of keys multiplied by the length of each key. The reason that the length of the keys is important is that a distribution counting sort actually treats each character position of the sort keys as a separate "key"; these keys are used in order from the least to the most significant. Therefore, this method actually takes time proportional to n*m, where n is the number of keys and m is the length of each key. However, in most real-world applications of sorting, the number of items to be sorted far exceeds the length of each item, and additions to the list take the form of more items, not lengthening of each one. If we had a few very long items to sort, this sort would not be as appropriate.
You're probably wondering how fast this distribution sort really
is, compared to Quicksort. According to the results of my tests, which
sorted several different numbers of records between 23480 and 234801 on
5-digit ZIP codes using both Microsoft's implementation of the
Quicksort algorithm (
qsort) in version 5.0 of their Visual
C++ compiler and my distribution counting sort, which I call
"Megasort", there's no contest.4 The
difference in performance ranged from approximately 43 to 1 at the low
end up to an astonishing 316 to 1 at the high end! Figures promailx.00 and promail.00, near the end of this
chapter, show the times in seconds when processing these varying sets of records.
Now let's see how such increases in performance can be achieved with a simple algorithm.
The basic method used is to make one pass through the keys for each character position in the key, in order to discover how many keys have each possible ASCII character in the character position that we are currently considering, and another pass to actually rearrange the keys. As a simplified example, suppose that we have ten keys to be sorted and we want to sort only on the first letter of each key.
The first pass consists of counting the number of keys that begin with each letter. In the example in Figure unsorted, we have three keys that begin with the letter 'A', five that begin with the letter 'B', and two that begin with the letter 'C'. Since 'A' is the lowest character we have seen, the first key we encounter that starts with an 'A' should be the first key in the result array, the second key that begins with an 'A' should be the second key in the result array, and the third 'A' key should be the third key in the result array, since all of the 'A' keys should precede all of the 'B' keys.
The next keys in sorted order will be the ones that start with 'B'; therefore, the first key that we encounter that starts with a 'B' should be the fourth key in the result array, the second through fifth 'B' keys should be the fifth through eighth keys in the result array, and the 'C' keys should be numbers nine and ten in the result array, since all of the 'B' keys should precede the 'C' keys. Figure counts.and.pointers illustrates these relationships among the keys.
Counts          Starting indexes
A  B  C         A  B  C
3  5  2         1  4  9

Key          Old Index   New Index   Explanation
Bicycle           1           4      The first B goes to position 4.
Airplane          2           1      The first A goes to position 1.
Anonymous         3           2      The second A goes after the first.
Cashier           4           9      The first C goes to position 9.
Bottle            5           5      The second B goes after the first.
Bongos            6           6      The third B goes after the second.
Antacid           7           3      The third A goes after the second.
Competent         8          10      The second C goes after the first.
Bingo             9           7      The fourth B goes after the third.
Bombardier       10           8      The fifth B goes after the fourth.
If we rearrange the keys in the order indicated by their new indexes, we arrive at the situation shown in Figure afterfirst:
Index   Unsorted Keys   Sorted Keys
1       Bicycle         Airplane
2       Airplane        Anonymous
3       Anonymous       Antacid
4       Cashier         Bicycle
5       Bottle          Bottle
6       Bongos          Bongos
7       Antacid         Bingo
8       Competent       Bombardier
9       Bingo           Cashier
10      Bombardier      Competent
Of course, we usually want to sort on more than the first character. As we noted earlier, each character position is actually treated as a separate key; that is, the pointers to the keys are rearranged to be in order by whatever character position we are currently sorting on. With this in mind, let's take the case of a two-character string; once we can handle that situation, the algorithm doesn't change significantly when we add more character positions.
We know that the final result we want is for all the A's to come before all the B's, which must precede all the C's, etc., but within the A's, the AA's must come before the AB's, which have to precede the AC's, etc. Of course, the same applies to keys starting with B and C: the BA's must come before the BB's, etc. How can we manage this?
We already know that we are going to have to sort the data separately on each character position, so let's work backward from what the second sort needs as input. When we are getting ready to do the second (and final) sort, we need to know that all the AA's precede all the AB's, which precede all the AC's, etc., and that all the BA's precede the BB's, which precede the BC's. The same must be true for the keys starting with C, D, and any other letter. The reason that the second sort will preserve this organization of its input data is that it moves keys with the same character at the current character position from input to output in order of their previous position in the input data. Therefore, any two keys that have the same character in the current position (both A's, B's, C's, etc.) will maintain their relative order in the output. For example, if all of the AA's precede all of the AB's in the input, they will also do so in the output, and similarly for the BA's and BB's. This is exactly what we want. Notice that we don't care at this point whether the BA's are behind or ahead of the AB's, as arranging the data according to the first character is the job of the second sort (which we haven't done yet). But how can we ensure that all the AA's precede the AB's, which precede the AC's, etc. in the input? By sorting on the second character position first!
For example, suppose we are sorting the following keys: AB, CB, BA, BC, CA, BA, BB, CC. We start the sort by counting the number of occurrences of each character in the second position of each key (the less significant position). There are three A's, three B's, and two C's. Since A is the character closest to the beginning of the alphabet, the first key that has an A in the second position goes in the first slot of the output. The second and third keys that have an A in the second position follow the first one. Those that have a B in the second position are next, in output positions 4, 5, and 6. The C's bring up the rear, producing the situation in Figure lesser.char.
After this first sort (on the second character position), all of the keys that have an A in the second position are ahead of all of the keys that have a B in the second position, which precede all those that have a C in the second position. Therefore, all AA keys precede all AB keys, which precede all AC keys, and the same is true for BA, BB, BC and CA, CB, and CC as well. This is the exact arrangement of input needed for the second sort.
Index   Unsorted Keys   Sorted Keys
1       AB              BA *
2       CB              CA
3       BA              BA *
4       BC              AB
5       CA              CB
6       BA              BB *
7       BB              BC *
8       CC              CC
Now we are ready for the second, and final, sorting operation. We start by counting the number of occurrences of each character in the first, or more significant, position. There is one A in this position, along with four B's, and three C's. Starting with the lowest character value, the key that has an A in this position ends up in the first slot of the output. The B's follow, starting with the second slot. Remember that the keys are already in order by their second, or less significant, character, because of our previous sort. This is important when we consider the order of keys that have the same first character. For example, in Figure lesser.char, the asterisks mark the keys that start with B.5 They are in order by their second character, since we have just sorted on that character. Therefore, when we rearrange the keys by their first character position, those that have the same first character will be in order by their second character as well, since we always move keys from the input to the output in order of their position in the input. That means that the B records have the same order in the output that they had in the input, as you can see in Figure greater.char.
This may seem like a lot of work to sort a few strings. However, the advantage of this method when we have many keys to be sorted is that the processing for each pass is extremely simple, requiring only a few machine instructions per byte handled (less than 30 on the 80x86 family).
Index   Keys After First Pass   Sorted Keys
1       BA                      AB
2       CA                      BA
3       BA                      BA
4       AB                      BB
5       CB                      BC
6       BB                      CA
7       BC                      CB
8       CC                      CC
Having finished sorting the keys, we have only to retrieve the records we want from the input file in ZIP code order and write them to the output file. But how do we know which records we need to read, and in what order?
The sort function requires more than just the keys to be sorted. We also have to give it a list of the record numbers to which those keys correspond, so that it can rearrange that list in the same way that it rearranges the keys. Then we can use the rearranged list of record numbers to retrieve the records in the correct order, which completes the task of our program.
Let's start our examination of the mailing list program by looking at the header file that defines its main constants and structures, which is shown in Figure mail.00a.
Now we're ready to look at the implementation of these algorithms,
starting with function
main (Figure mail.00).
main function definition (from mail\mail.cpp) (Figure mail.00)
This function begins by checking the number of arguments with which
it was called, and exits with an informative message if there aren't
enough. Otherwise, it constructs the output file name and opens the
file for binary input. Then it calls the initialize function (Figure mail.01), which sets
up the selection criteria according to input arguments 3 through 6
(minimum spent, maximum spent, earliest last-here date, and latest
last-here date). Now we are ready to call the process function (Figure mail.02) to select the records that meet those criteria.
initialize function definition (from mail\mail.cpp) (Figure mail.01)
process function definition (from mail\mail.cpp) (Figure mail.02)
The first order of business in
process is to set up
the buffering for the list (output) and data files. It is important to
note that we are using a large buffer for the list file and for the
first pass through the data file, but are changing the buffer size to
the size of 1 record for the second pass through the data file. What is
the reason for this change?
On the first pass through the data file, we are going to read every record in physical order, so a large buffer is useful in reducing the number of physical disk accesses needed.
This analysis, however, does not apply to the second pass through the data file. In this case, using a bigger buffer for the data file would actually reduce performance, since reading a large amount of data at once is helpful only if you are going to use the data that you are reading.6 On the second pass, we will read the data records in order of their ZIP codes, forcing us to move to a different position in the data file for each record rather than reading them consecutively. Using a big buffer in this situation would mean that most of the data in the buffer would be irrelevant.
Continuing in process, we calculate the number of records in the data file, which determines how large our record selection bitmap should be. Then we call the allocation macro defined in bitfunc.h (Figure bitfunc.00a) to allocate storage for the bitmap.
Of course, each byte of a bitmap can store eight bits, so the macro divides the number of bits we need by eight and adds one byte to the result. The extra byte is to accommodate any remainder after the division by eight.
Now that we have allocated our bitmap, we can read through the data
file and select the records that meet our criteria. After initializing
our counts of "items read" and "found" to zero, we are ready to start
reading records. Of course, we could calculate the number of times through the loop rather than continue until we run off the end of the input file, since we know how many records there are in the file. However, since we are processing records in batches, the last of which is likely to be smaller than the rest, we might as well take advantage of the fact that when we get a short count of items read, the operating system is telling us that we have reached the end of the file.
The first thing we do in the "infinite" loop is to read a set of
records (to avoid the overhead of calling the operating system to read
each record). Now we are ready to process one record at a time in the
inner loop. Of course, we want to know whether the record we are examining meets our selection criteria: whether the customer has spent at least min_spent but no more than the maximum specified, and has last been here between the earliest and latest dates allowed (inclusive). If the record fails to meet any of these criteria, we skip the remainder of the processing for this record via "continue".
However, let's suppose that a record passes these four tests.
In that case, we increment
items_found. Then we want
to set the bit in the
found bitmap that corresponds to
this record. To do this, we need to calculate the current record
number, by adding the number of records read before the current
processing batch (
total_items_read) and the entry number
in the current batch (
i). Now we are ready to call setbit.
setbit function definition (from mail\bitfunc.cpp) (Figure bitfunc.00)
The setbit function is quite simple. Since there are
eight bits in a byte, we have to calculate which byte we need to access
and which bit within that byte. Once we have calculated these two
values, we can retrieve the appropriate byte from the bitmap.
In order to set the bit we are interested in, we need to create a
"mask" to isolate that bit from the others in the same byte. The
statement that does this,
mask = 1 << bitnumber;,
may seem mysterious, but all we are doing is generating a value that
has a 1 in the same position as the bit we are interested in and 0 in
all other positions. Therefore, after we perform a "logical or"
operation of the mask and the byte from the bitmap, the resulting
value, stored back into the bitmap, will have the desired bit set.
The setbit function also returns a value indicating the value of the bit before we set it. Thus, if we want to know whether we have actually changed the bit from off to on, we don't have to make a call to testbit before the call to setbit; we can use the return value from setbit to determine whether the bit was set before we called it. This would be useful, for example, in an application where the bitmap was being used to allocate some resource, such as a printer, which cannot be used by more than one process at a time. The function would call setbit and, if that bit had already been set, would return an error indicating that the resource was not available.
Now we have a means of keeping track of which records have been selected. However, we also need to save the ZIP code for each selected record for our sort. Unfortunately, we don't know how many records are going to be selected until we select them. This is easily dealt with in the case of the bitmap, which is so economical of storage that we can comfortably allocate a bit for every record in the file; ZIP codes, which take ten bytes apiece, pose a more difficult problem. We need a method of allocation which can provide storage for an unknown number of ZIP codes.
Of course, we could use a simple linked list. In that approach, every time we found a record that matches our criteria, we would allocate storage for a ZIP code and a pointer to the next ZIP code. However, this consumes storage very rapidly, as additional memory is required to keep track of every allocation of storage. When very small blocks of ten bytes or so are involved, the overhead can easily exceed the amount of storage actually used for our purposes, so that allocating 250000 14-byte blocks can easily take 7.5 megabytes or more, rather than the 3.5 megabytes that we might expect.
To avoid this inefficiency, we can allocate larger blocks that can
accommodate a number of ZIP codes each, and keep track of the addresses
of each of these larger blocks so that we can retrieve individual ZIP
codes later. That is the responsibility of the code inside the "if" statement that compares current_zip_entry to ZIP_BLOCK_ENTRIES.7 To understand how this works, let's look
back at the lines in
process that set the block counter to -1 and current_zip_entry to ZIP_BLOCK_ENTRIES. This initialization ensures that the code in the "if" will be executed for the first selected record. We start by incrementing the block counter (to zero, in this case) and setting current_zip_entry back to zero, to start a new block. Then we allocate storage for a new block of ZIP codes (zip_block) and set up pointers to each one, so that we can copy the ZIP codes from the keys as we select each record. Once that bookkeeping is out of the way, we can copy the ZIP code for the current record into the ZIP block via the pointer array and increment the index of the pointer that will be used next.
Now we're at the end of the processing for one record in a
processing batch. Once all the records in the batch have been
processed, we fall off the end of the inner loop. At this point, we can
add the number of items we have read in this batch to total_items_read.
If we have read less than a full batch, we must have reached the end of
the file, so we are finished with this phase of processing.
Now we have all the information needed to sort the keys so that we will be able to read the records in ZIP code order on the second pass. However, the sort function, Megasort (Figure megac.00), requires that same information in a different form, so we have some preparation to do before we can call it.
Looking at the parameters for this function, we see that the data
to be sorted are described by an array of string pointers (
char **PtrArray), whereas what we have is a number of blocks of
ZIP codes. The record numbers, which the sort function will rearrange
according to the order of the ZIP codes, must be passed as an array of
unsigned values (unsigned *RecNums), rather than as the bitmap in
which they are stored by our search function.
To set up the first argument to
Megasort, we need to
produce an array of pointers to each of the ZIP codes we have stored in
the blocks, which we do in the portion of Figure mail.02 between the function call that
reports the "Pass 1 time" and the call to the Megasort function.
We start by allocating a block of ZIP code pointers large enough to hold all the pointers we will need, one for each selected record. Then, for each block of ZIP codes we have created, we calculate the address of each ZIP code in the block and store it in the next entry of our pointer array.
That takes care of the pointers to the ZIP codes. Now we have to turn the bitmap into an array of record numbers to be rearranged. First we allocate the array of unsigned values, with items_found entries. Then we call testbit (Figure bitfunc.01).
testbit function definition (from mail\bitfunc.cpp) (Figure bitfunc.01)
We call testbit once for every record number in the
file, and every time we find a bit set, we store that record number in
the next available space in the
record_number array. The testbit function is very similar to
setbit, described above. The
difference is that we don't want to set the bit, but to find out
whether it is already set; therefore, we don't modify the bitmap array.
Now we are finally ready to call
Megasort (Figure megac.00).
The main variables involved are the BucketCount array, which keeps track of the number of strings that have each character in the current sorting position (the number of A's, B's, etc.), and the BucketPosition array, which maintains the current position in the output array for each possible character (where the next A string should go, the next B string, etc.). We also have to allocate storage for a temporary copy of the pointers to the strings (TempPtrArray) and for a temporary copy of the record numbers to be rearranged. All of this setup code having been executed, the main sorting loop is ready to go.
You may be surprised to see how few lines of code are involved in the main loop. If we disregard blank lines and those that contain only a starting or ending brace, only 17 lines contain any executable code. Yet in many common applications this simple algorithm will outperform all of the complex methods favored by the computer science textbooks.
The outer loop is executed once for each character position in the
sort. That is, if we are sorting a number of ten-character strings, it
will be executed ten times. Its role is to select which character of
the strings to be sorted will be used in the current pass through the
strings. The reason that the loop index
i decreases for
each pass through the loop was discussed above: each character position
is treated as a separate "key", so we have to sort on the least
significant "key" first and work our way up to the most significant
one. Each pass of this sort leaves strings in the same order as they
were in before the current pass if they have the same character in the current sorting position.
The first operation in the main loop is to clear the BucketCount array, which maintains
the counts for each possible ASCII character code. In the first pass through the input data, we step through the pointer array, extracting the ith character of each element as the current character to be sorted. The ASCII value of that character, m, is used to select the element of BucketCount to be incremented, so that when we reach the end of the pointer array, we will know the number of strings that have each character in the current position being sorted.
Once all the counts have been accumulated, we can use that
information to fill in the
BucketPosition array, which
indicates the next available slot for each possible character position.
This serves the function of directing traffic from the input pointer array, PtrArray, to the output one, TempPtrArray.
Each element in the
BucketPosition array corresponds to
the next output position to be used in
TempPtrArray for a
given character. For example, suppose that the lowest character in the
current position of the array to be sorted is 'A', and there are five
strings that have an 'A' in the current position. In that case, the
initial value of
BucketPosition['A'] would be 0, and the
initial value of
BucketPosition['B'] would be 5, since
the first 'A' key should go into output slot 0 and the first 'B' key
should go into output slot 5.
After filling in the
BucketPosition array, we are
ready to rearrange the pointers to the keys by copying these pointers
into the temporary array
TempPtrArray. That is done in
the second inner loop through all the keys. First, we calculate the
appropriate index into the
BucketPosition array by
retrieving the character from the input string, as in the first inner
loop above. Then we use that index to retrieve the appropriate element of the BucketPosition array, which tells us where to copy both the key pointer and the record number of the current record in the output arrays.
Then we increment the element of BucketPosition that we have used, because the next key that has that particular character
in the current position should go in the next position of the output.
In our example, after the first 'A' key has been copied into the output array, BucketPosition['A'] would be incremented to 1,
meaning that the next 'A' key should go into output slot 1. Similarly,
after the first 'B' key has been copied into the output array, BucketPosition['B'] would be incremented to 6, meaning that the next 'B' key should go into
output slot 6.
After copying all of the pointers and record numbers into the temporary arrays in sorted order, we copy them back into the original arrays for use in the next sorting pass, if there is one, or as the final result if we are finished.
When we are done with the sort, we can return the sorted keys and
record numbers to
process (Figure mail.02), so that it can retrieve and
print the first and last name and the ZIP code for each customer in
order by their ZIP codes. We then free the dynamic storage used in the
function and return to the main program, which calls
to close the files, then exits.
You may be wondering how much performance improvement we have
obtained by using distribution sorting in our example program.
In order to answer this question, I have created another program called
mailx.cpp that is identical to the original except that it uses the
standard library function qsort rather
than the distribution counting sort. I've included calls to timing
functions in both versions and have run each of them four times with
different numbers of keys to see how they perform; Figures promailx.00 and promail.00 summarize the results of these tests.
While the improvement in performance increases with the number of
elements, even at the low end, we've multiplied the performance of the
sorting part of the task by about 43 times. At the high end, the
difference in the sorting speed is a phenomenal 316 to 1, and the
overall performance in that case is more than 50 times what it is using
qsort. Is optimization always that simple?
This would be a good time for me to mention an attempted optimization that actually did not help performance in the final analysis: the idea of "strip mining", which can most simply be described as dividing a file into two parts, one of which is the "key" portion and the other the "data" portion. Theoretically, it should be faster to read just the keys on the first pass of our mailing list program, then read the data on the second pass only for those records that have been selected. Unfortunately, when I tried this "optimization", it actually slowed the program down. I found this quite surprising, because in earlier tests on other machines, it did help somewhat. However, it doesn't help on my current machine, so I've omitted it.
Is there a moral to this real-life optimization story that summarizes both its successes and failures?
I think there are several. First, we've seen that a very simple algorithm can grossly out-perform a standard library function from a well-known compiler vendor. Second, and more important, optimization is a tricky business. The common thread of these lessons is that no matter how carefully you plan a programming task, the unexpected is very likely to occur. You have to be willing to examine all your preconceptions and discard them if necessary in the face of new evidence. Perhaps this willingness, along with the ability to handle the unexpected when it occurs, is the difference between a novice and an expert.
Before we leave this chapter, I should mention that the programs described here do not work properly when compiled with the DJGPP compiler, but produce general protection faults at apparently random times during their execution. This is quite surprising, because they work perfectly with the Microsoft C++ version 5 compiler. Unfortunately, I haven't been able to track down and eliminate the source of this problem; however, at least I can tell you about the problem before you discover it for yourself. As to the cause of the problem: I can only guess that it is caused by bugs in the most recent release of DJGPP.
In this chapter, we have covered the use of a bitmap to keep track of one attribute of a number of records, sorting via the distribution counting sort, and how to gain rapid access to data according to criteria specified at run time. In the next chapter we will examine the arithmetic coding method of data compression.
What modifications to the distribution counting sort would:
(You can find suggested approaches to problems in Chapter artopt.htm).
qsort requires the user to supply the address of a function that will compare the items to be sorted. While this additional overhead biases the comparison slightly against
qsort, this small disadvantage is not of the same order of magnitude as the difference in inherent efficiency of the two algorithms.
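For reference, a comparison function of the kind qsort requires might look like the following sketch. The name CompareKeys is hypothetical, and the actual function in mailx.cpp may differ; this just shows the calling convention that imposes the per-comparison overhead:

```cpp
#include <cstring>
#include <cstdlib>

// A qsort-style comparison function. qsort passes pointers to the
// elements being compared; here each element is assumed to be a
// char* pointing to a null-terminated key string.
int CompareKeys(const void *a, const void *b)
{
    const char *KeyA = *(const char *const *)a;
    const char *KeyB = *(const char *const *)b;
    return std::strcmp(KeyA, KeyB);
}
```

Every comparison the sort performs goes through this indirect function call, which is the small extra overhead mentioned above; the distribution counting sort makes no comparisons at all.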