Optimizing C++ - the WWW version

ISBN: 0-13-977430-0

Copyright 1999 by Prentice-Hall PTR

Copyright 2000 by Chrysalis Software Corporation




Heavenly Hash: A Dynamic Hashing Algorithm

Introduction

It's time to revisit the topic of hash coding, or "hashing" for short, which we previously discussed in Chapter superm.htm.

Algorithm Discussed

Dynamic Hashing

Problems with Standard Hashing

Hashing is a way to store and retrieve large numbers of records rapidly, by calculating an address close to the one where the record is actually stored. However, traditional disk-based methods of hashing have several deficiencies:

  1. Variable-length records cannot be stored in the hash table.
  2. Deleting records and reusing the space is difficult.
  3. The time needed to store and retrieve a record in a file increases greatly as the file fills up, due to overflows between previously separated areas of the file.
  4. Most serious of all, the maximum capacity of the file cannot be changed after its creation.

The first of these problems can be handled by having fixed-length hash records pointing to variable-length records in another file, although this increases the access time. The second problem can be overcome with some extra work in the implementation to mark deleted records for later reuse. The third problem can be alleviated significantly by the use of "last-come, first-served" hashing, as I mentioned in a footnote in Chapter superm.htm. However, until relatively recently, the fourth problem, a fixed maximum capacity for a given file, seemed intractable.

Of course, nothing stops us from creating a new, bigger, hash table and copying the records from the old file into it, rehashing them as we go along. In fact, one of the questions at the end of Chapter superm.htm asks how we could handle a file that becomes full, as an off-line process, and the "suggested approaches" section in Chapter artopt.htm proposes that exact answer.

However, this solution to the problem of a full hash table has two serious drawbacks. For one thing, we have to have two copies of the file (one old and one new) while we're doing this update, which can take up a large amount of disk space temporarily. More important in many applications, this "big bang" file maintenance approach won't work for a file that has to be available all of the time. Without doing a lot of extra work, we can't update the data during this operation, which might take a long time (perhaps several hours or even days) if the file is big. Is it possible to increase the size of the file after its creation without incurring these penalties, and still gain the traditional advantages of hashing, i.e., extremely fast access to large amounts of data by key?

Until shortly before I wrote the second edition of this book, I would have said that it wasn't possible. Therefore, I was quite intrigued when I ran across an article describing a hashing algorithm that can dynamically resize the hash table as needed during execution, without any long pauses in execution.1 Unfortunately, the implementation described in the article relies on the use of linked lists to hold the records pointed to by the hash slots. While this is acceptable for in-memory hash tables, such an implementation is not appropriate for disk-based hash tables, where minimizing the number of disk seeks is critical to achieving good performance.2 Nonetheless, it was obviously a fascinating algorithm, so I decided to investigate it to see whether it could be adapted to disk-based uses.

Just Another Link in the Chain

According to the Griswold and Townsend article, Per-Ake Larson adapted an algorithm called "linear dynamic hashing", previously limited to accessing external files, to make it practical for use with in-memory tables.3 The result, appropriately enough, is called "Larson's algorithm". How does it work?

First, let's look at the "slot" method of handling overflow used by many in-memory hash table algorithms, including Larson's algorithm. To store an element in the hash table, we might proceed as follows, assuming the element to be stored contains a "next" pointer to point to another such element:

  1. The key of each element to be stored is translated into a slot number to which the element should be attached, via the hashing algorithm.
  2. The "next" pointer of the element to be attached to the slot is set to the current contents of the slot's "first" pointer, which is initialized to a null pointer when the table is created.
  3. The slot's "first" pointer is set to the address of the element being attached to the slot.

The result of this procedure is that the new element is linked into the chain attached to the slot, as its first element.

To retrieve an element, given its key, we simply use the hashing algorithm to get the slot number, and then follow the chain from that slot until we find the element. If the element isn't found by the time we get to a null pointer, it's not in the table.
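
To make the procedure concrete, here is a minimal sketch of this chaining scheme for an in-memory table. The Element structure, the table size, and the trivial hash function are illustrative choices of mine, not code from the Griswold and Townsend article:

struct Element
{
    unsigned m_Key;
    Element *m_Next;                    // the "next" pointer used to chain elements together
};

const unsigned SlotCount = 8;
Element *FirstPointer[SlotCount];       // one "first" pointer per slot; starts out as null pointers

unsigned SlotNumber(unsigned key)
{
    return key % SlotCount;             // step 1: translate the key into a slot number
}

void Store(Element *element)
{
    unsigned slot = SlotNumber(element->m_Key);
    element->m_Next = FirstPointer[slot];   // step 2: point to the slot's current first element
    FirstPointer[slot] = element;           // step 3: the new element becomes the first in the chain
}

Element *Retrieve(unsigned key)
{
    for (Element *element = FirstPointer[SlotNumber(key)]; element != 0; element = element->m_Next)
        if (element->m_Key == key)
            return element;
    return 0;                               // reaching a null pointer means the element isn't in the table
}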

This algorithm is easy to implement, but as written it has a performance problem: namely, with a fixed number of slots, the chains attached to each slot get longer in proportion to the number of elements in the table, which means that the retrieval time increases proportionately to the number of elements added. However, if we could increase the number of slots in proportion to the number of elements added, then the average time to find an element wouldn't change, no matter how many elements were added to the table. The problem is how we can find elements we've already added to the table if we increase the number of slots. It may not be obvious why this is a problem; for example, can't we just leave them where they are and add the new elements into the new slots?

Unfortunately, this won't work. The reason is that when we want to look up an element, we don't generally know when the element was added, so as to tell what "section" of the hash file it resides in. The only piece of information we're guaranteed to have available is the key of the element, so we had better be able to find an element given only its key. Of course, we could always rehash every element in the table to fit the new number of slots, but we've already seen that this can cause serious delays in file availability. Larson's contribution was to find an efficient way to increase the number of slots in the table without having to relocate every element when we do it.

Relocation Assistance

The basic idea is reasonably simple, as are most great ideas, once you understand them.4 For our example, let's assume that the key is an unsigned value; we start out with 8 slots, all marked "active"; and we want to limit the average number of elements per slot to 6.

This means that in order to store an element in the table, we have to generate a number between 0 and 7, representing the number of the slot to which we will chain the new element being stored. For purposes of illustration, we'll simply divide the key by the number of slots and take the remainder.5 For reasons which will become apparent shortly, the actual algorithm looks like this:

  1. Divide the key by the number of slots (8), taking the remainder. This will result in a number from 0 to the number of slots - 1, i.e., 0-7 in this case.
  2. If the remainder is greater than or equal to the number of "active" slots (8), subtract one-half the number of slots (4).

You will notice that the way the parameters are currently set, the second rule will never be activated, since the remainder can't be greater than or equal to 8; we'll see the reason for the second rule shortly.

Now we're set up to handle up to 48 elements. On the 49th element, we should add another slot to keep the average number per slot from exceeding 6. What do we do to the hash function to allow us to access those elements we have already stored?

We double the number of slots, making it 16, and record that as the new number of slots allocated; then we increase the number of "active" slots, bringing that number to 9. The hash calculation now acts like this:

  1. Divide the key by the number of slots (16), taking the remainder; the result will be between 0 and 15 inclusive.
  2. If the remainder is greater than or equal to the number of active slots (9), subtract one-half the number of slots (8).

This modified hashing algorithm will spread newly added elements over nine slots rather than eight, since the possible slot values now range from 0 to 8 rather than 0 to 7. This keeps the number of elements per slot from increasing as we add new elements, but how do we retrieve elements that were already in the table?

Here is where we see the stroke of genius in this new algorithm. Only one slot has elements that might have to be moved under the new hashing algorithm, and that is the first slot in the table! Why is this so?

The analysis isn't really very difficult. Let's look at the elements in the second slot (i.e., slot number 1). We know that their hash codes under the old algorithm were all equal to 1, or they wouldn't be in that slot. That is, the remainder after dividing their keys by 8 was always 1, which means that the binary value of the remainder always ended in 001. Now let's see what the possible hash values are under the new algorithm. We divide the key by 16 and take the remainder, which is equivalent to taking the last four bits of the key. However, the last three bits of the remainder after dividing by 16 and the last three bits of the remainder after dividing by 8 have to be the same.6 Therefore, we have to examine only the fourth bit (the 8's place) of the remainder to calculate the new hash code. There are only two possible values of that bit, 0 and 1. If the value of that bit is 0, we have the same hash code as we had previously, so we obviously can look up the element successfully even with the additional slot in the table. However, if the value is 1, the second rule tells us to compare the hash code, which is 1001 binary or 9 decimal, against the number of active slots (9). Since the first is greater than or equal to the second, we subtract 8, getting 1 again; therefore, the lookup will always succeed. The same considerations apply to all the other slots previously occupied in the range 2-7. There's only one slot where this analysis breaks down, and that's slot 0. Why is this one different?

Because slot 8 is actually available; therefore, the second rule won't apply. In this case, we do have to know whether an element is in slot 0 or 8. While this will be taken care of automatically for newly added elements using the new hash code parameters, we still have the old elements in slot 0 to contend with; on the average, half of them will need to be moved to slot 8. This is easy to fix; all we have to do is to retrieve each element from slot 0 and recalculate its hash value with the new parameters. If the result is 0, we put it back in slot 0; if the result is 8, we move it to slot 8.

Now we're okay until we have to add the 55th element to the hash table, at which point we need to add another slot to keep the average chain length from exceeding 6. Happily, all we have to do is to increase the number of active slots by 1 (to 10) and rehash the elements in slot 1. It works just like the example above; none of the slots 2-7 is affected by the change, because the second rule "folds" all of their hash calculations into their current slots.

When we have added a total of 96 elements, the number of "active" slots and the total number of slots will both be 16, so we will be back to a situation similar to the one we were in before we added our first expansion slot. What do we do when we have to add the 97th element? We double the total number of slots again and start over as before.7 This can go on until the number of slots gets to be too big to fit in the size of integer in which we maintain it, or until we run out of memory, whichever comes first.

Of course, if we want to handle a lot of data, we will probably run out of memory before we run out of slot numbers. Since the ability to retrieve any record in any size table in constant time, on the average, would certainly be very valuable when dealing with very large databases, I considered that it was worth trying to adapt this algorithm to disk-based hash tables. All that was missing was a way to store the chains of records using a storage method appropriate for disk files.
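
Before we adapt the algorithm to disk, here is a minimal sketch of the in-memory slot calculation and expansion step just described. The variable names and the Rehash stub are mine, not Larson's code, and as footnote 5 notes, a real implementation would scramble the key before taking the remainder:

unsigned long ActiveSlotCount = 8;        // slots currently in use
unsigned long CurrentMaxSlotCount = 8;    // slots currently allocated; always a power of two
unsigned long ElementCount = 0;
const unsigned long AverageChainLength = 6;

unsigned long CalculateSlot(unsigned long key)
{
    unsigned long slot = key % CurrentMaxSlotCount;   // rule 1: take the remainder
    if (slot >= ActiveSlotCount)
        slot -= CurrentMaxSlotCount / 2;              // rule 2: "fold" back into an active slot
    return slot;
}

void Rehash(unsigned long slot)
{
    // Re-run CalculateSlot on each element chained to this slot and move the ones
    // whose new slot number differs (the chain manipulation itself is omitted here).
    (void)slot;
}

void ExpandIfNeeded()                     // called each time an element is stored
{
    if (++ElementCount <= ActiveSlotCount * AverageChainLength)
        return;                                       // the average chain length is still acceptable
    if (ActiveSlotCount == CurrentMaxSlotCount)
        CurrentMaxSlotCount *= 2;                     // every slot is active, so double the table
    unsigned long slot_to_split = ActiveSlotCount - CurrentMaxSlotCount / 2;
    ++ActiveSlotCount;                                // the new slot's number is the old ActiveSlotCount
    Rehash(slot_to_split);
}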

Making a Quantum Leap

Upon further consideration, I realized that the quantum file access method could be used to store a variable-length "storage element" pointed to by each hash slot, with the rest of the algorithm implemented pretty much as it was in the article. This would make it possible to store a very large number of strings by key and get back any of them with an average of a little more than one disk access, without having to know how big the file would be when we created it. This algorithm also has the pleasant effect of making deletions fairly simple to implement, with the file storage of deleted elements automatically reclaimed as they are removed.

Contrary to my usual practice elsewhere in this book, I have not developed a sample "application" program, but have instead opted to write a test program to validate the algorithm. This was very useful to me during the development of the algorithm; after all, this implementation of dynamic hashing is supposed to be able to handle hundreds of thousands of records while maintaining rapid access, so it is very helpful to be able to demonstrate that it can indeed do that. The test program is hashtest.cpp (Figure hashtest.00).

The test program for the dynamic hashing algorithm (quantum\hashtest.cpp) (Figure hashtest.00)

codelist/hashtest.00

This program stores and retrieves strings by a nine-digit key, stored as a String value. To reduce the overhead of storing each entry as a separate object in the quantum file, all strings having the same hash code are combined into one "storage element" in the quantum file system; each storage element is addressed by one "hash slot". The current version of the dynamic hashing algorithm used by hashtest.cpp allocates one hash slot for every six strings in the file; since the average individual string length is about 60 characters, and there are 18 bytes of overhead for each string in a storage element (four bytes for the key length, four bytes for the data length, and 10 bytes for the key value including its null terminator), this means that the average storage element will be a bit under 500 bytes long. A larger element packing factor than six strings per element would produce a smaller hash table and would therefore be more space efficient. However, the choice of this value is not critical with the current implementation of this algorithm, because any storage elements that become too long to fit into a quantum will be broken up and stored separately by a mechanism which I will get to later.

Of course, in order to run meaningful tests, we have to do more than store records in the hash file; we also have to retrieve what has been stored, which means that we have to store the keys someplace so that we can use them again to retrieve records corresponding to those keys. In the current version of this algorithm, I use a FlexArray (i.e., a persistent array of strings, such as we examined in the previous chapter) to store the values of the keys. However, in the original version of this algorithm, I was storing the key as an unsigned long value, so I decided to use the quantum file storage to implement a persistent array of unsigned long values, and store the keys in such an array.

Persistence Pays Off

It was surprisingly easy to implement a persistent array of unsigned long values,8 for which I defined a typedef of Ulong, mostly to save typing. The header file for this data type is persist.h (Figure persist.00).

The interface for the PersistentArrayUlong class (quantum\persist.h) (Figure persist.00)

codelist/persist.00

As you can see, this class isn't very complex, and most of the implementation code is also fairly straightforward. However, we get a lot out of those relatively few lines of code; these arrays are not only persistent, but they also automatically expand to any size up to and including the maximum size of a quantum file; with the current maximum of 10000 16K blocks, a maximum size PersistentArrayUlong could contain approximately 40,000,000 elements! Of course, we don't store each element directly in a separate addressable entry within a main object, as this would be inappropriate because the space overhead per item would be larger than the Ulongs we want to store! Instead, we employ a two-level system similar to the one used in the dynamic hashing algorithm; the quantum file system stores "segments" of data, each one containing as many Ulongs as will fit in a quantum. To store or retrieve an element, we determine which segment the element belongs to and access the element by its offset in that segment.
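
The address arithmetic behind this two-level scheme is simple. Here is an illustrative version (ignoring the quantum's own per-item overhead, so the exact capacity figure is approximate):

typedef unsigned long Ulong;

const Ulong BlockSize = 16384;                            // bytes per quantum
const Ulong SegmentCapacity = BlockSize / sizeof(Ulong);  // Ulongs that fit in one segment

void LocateElement(Ulong p_Index, Ulong &segment_number, Ulong &element_number)
{
    segment_number = p_Index / SegmentCapacity;   // which segment the element lives in
    element_number = p_Index % SegmentCapacity;   // its offset within that segment
}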

However, before we can use a PersistentArrayUlong, we have to construct it, which we can do via the default constructor (Figure persist.01).

The default constructor for the PersistentArrayUlong class (from quantum\persist.cpp) (Figure persist.01)

codelist/persist.01

This constructor doesn't actually create a usable array; it is only there to allow us to declare a PersistentArrayUlong before we want to use it. When we really want to construct a usable array, we use the normal constructor shown in Figure persist.02.

The normal constructor for the PersistentArrayUlong class (from quantum\persist.cpp) (Figure persist.02)

codelist/persist.02

As you can see, to construct a real usable array, we provide a pointer to the quantum file in which it is to be stored, along with a name for the array. The object directory of that quantum file is searched for a main object with the name specified in the constructor; if it is found, the construction is complete. Otherwise, a new object is created with one element, expandable to fill up the entire file system if necessary.

To store an element in the array we have created, we can use StoreElement (Figure persist.03).

The PersistentArrayUlong::StoreElement function (from quantum\persist.cpp) (Figure persist.03)

codelist/persist.03

This routine first calculates which segment of the array contains the element we want to store, and the element number within that segment. Then, if we are running the "debugging" version (i.e., asserts are enabled), it checks whether the segment number is within the maximum range we set up when we created the array. This test should never fail unless there is something wrong with the calling routine (or its callers), such that the element number passed in is absurdly large. As discussed above, with all such conditional checks, we have to try to make sure that our testing is good enough to find any errors that might cause this to happen; with a "release" version of the program, this would be a fatal error.

Next, we check whether the segment number we need is already allocated to the array; if not, we increase the number of segments as needed by calling GrowMainObject, but don't actually initialize any new segments until they're accessed, so that "sparse" arrays won't take up as much room as ones that are filled in completely. Next, we get a copy of the segment containing the element to be updated; if it's of zero length, that means we haven't initialized it yet, so we have to allocate memory for the new segment and fill it with zeros. At this point, we are ready to create an AccessVector called TempUlongVector of type Ulong and use it as a "template" (no pun intended) to allow access to the element we want to modify. Since AccessVector has the semantics of an array, we can simply set the ElementNumberth element of TempUlongVector to the value of the input argument p_Element; the result of this is to place the new element value into the correct place in the segment. Finally, we store TempUlongVector back into the main object, replacing the old copy of the segment.
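
To make that control flow easier to follow, here is an in-memory analogue of the logic just described, with a vector of segments standing in for the quantum file's main object; it is a paraphrase of the steps above, not the listing in Figure persist.03:

#include <cassert>
#include <vector>

typedef unsigned long Ulong;

const Ulong SegmentCapacity = 4096;   // Ulongs per segment; the real value depends on the quantum size

class PersistentArrayUlongModel
{
public:
    void StoreElement(Ulong p_Index, Ulong p_Element)
    {
        Ulong segment_number = p_Index / SegmentCapacity;
        Ulong element_number = p_Index % SegmentCapacity;
        assert(segment_number < MaxSegmentCount);       // stands in for the qfassert range check

        if (segment_number >= m_Segments.size())
            m_Segments.resize(segment_number + 1);      // stands in for GrowMainObject; new segments stay empty

        std::vector<Ulong> &segment = m_Segments[segment_number];
        if (segment.empty())
            segment.assign(SegmentCapacity, 0);         // first touch: allocate the segment and fill it with zeros

        segment[element_number] = p_Element;            // stands in for the AccessVector update and store-back
    }

private:
    static const Ulong MaxSegmentCount = 10000;         // an illustrative bound, like the one set at creation time
    std::vector<std::vector<Ulong> > m_Segments;
};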

To retrieve an element from an array, we can use GetElement (Figure persist.04).

The PersistentArrayUlong::GetElement function (from quantum\persist.cpp) (Figure persist.04)

codelist/persist.04

First, we calculate the segment number and element number, and check (via qfassert) whether the segment number is within the range of allocated segments; if it isn't, we have committed the programming error of accessing an uninitialized value. Assuming this test is passed, we retrieve the segment, set up the temporary Vector TempUlongVector to allow access to the segment as an SVector of Ulongs, and return the value from the ElementNumberth element of the array. All this is very well if we want to write things like "Y.StoreElement(100,100000L);" or "X = Y.GetElement(100);", to store or retrieve the 100th element of the Y "array", respectively. But wouldn't it be much nicer to be able to write "Y[100] = 100000L;" or "X = Y[100];" instead?

In Resplendent Array

Clearly, that would be a big improvement in the syntax; as it happens, it's not hard to make such references possible, with the addition of only a few lines of code.9 Unfortunately, this code is not the most straightforward, but the syntactic improvement that it provides is worth the trouble. The key is operator[ ] (Figure persist.05).

The PersistentArrayUlong::operator[ ] function (from quantum\persist.cpp) (Figure persist.05)

codelist/persist.05

This function returns a temporary value of a type that behaves differently in the context of an "lvalue" reference (i.e., a "write") than it does when referenced as an "rvalue" (i.e., a "read"). In order to follow how this process works, let's use the example in Figure persist1.

Persistent array example (Figure persist1)

codelist/perexam.00

The first question to be answered is how the compiler decodes the following line:

Save[1000000L] = 1234567L;

According to the definition of PersistentArrayUlong::operator[ ], this operator returns a PersistentArrayUlongRef that is constructed with the two parameters *this and p_Index, where the former is the PersistentArrayUlong object for which operator[ ] was called (i.e., Save), and the latter is the value inside the [ ], which in this case is 1000000L. What is this return value? To answer this question, we have to look at the normal constructor for the PersistentArrayUlongRef class (Figure persist.06).

The normal constructor for the PersistentArrayUlongRef class (from quantum\persist.cpp) (Figure persist.06)

codelist/persist.06

This constructor sets m_PAU to a reference to p_PAU and m_Index to the value of p_Index, so the object returned from the operator[ ] call will be a PersistentArrayUlongRef with those values. Therefore, the compiler-generated code looks something like this, so far:

PersistentArrayUlongRef T(Save,1000000L);

T = 1234567L;

where T is an arbitrary name for a temporary object.

The second of these lines is then translated into a call to PersistentArrayUlongRef::operator = (Figure persist.07), to do the assignment.

The PersistentArrayUlongRef::operator = function (from quantum\persist.cpp) (Figure persist.07)

codelist/persist.07

The generated code now looks something like this:

PersistentArrayUlongRef T(Save,1000000L);

T.operator =(1234567L);

The operator = code, as you can see from the figure, calls the StoreElement operation for the object m_PAU, which as we noted above is a reference to the original object Save; the arguments to that call are m_Index, a copy of the index supplied in the original operator[ ] call, and p_Element, which is the value specified in the operator = call. Thus, the result is the same as that of the statement Save.StoreElement(1000000L,1234567L);, while the notation is that of a "normal" array access.

However, we've only handled the case where we're updating an element of the array. We also need to be able to retrieve values once they've been stored. To see how that works, let's follow the translation of the following line:

TestValue = Save[1000000L];

The process is fairly similar to what we've already done. The definition of PersistentArrayUlong::operator[] causes the compiler to generate code somewhat like the following:

PersistentArrayUlongRef T(Save,1000000L);

TestValue = T;

This time, however, rather than translating the second line into a call to PersistentArrayUlongRef::operator=, the compiler translates it into a call to PersistentArrayUlongRef::operator Ulong, a "conversion function" that allows a PersistentArrayUlongRef to be used where a Ulong is expected (Figure persist.08).

The PersistentArrayUlongRef::operator Ulong function (from quantum\persist.cpp) (Figure persist.08)

codelist/persist.08

Therefore, the final generated code comes out something like this:

PersistentArrayUlongRef T(Save,1000000L);

TestValue = Ulong(T);

As should be evident by looking at the code for that conversion function, it merely calls the GetElement operation of the object m_PAU, which as we noted above is a reference to the original object Save, with the argument m_Index, which is a copy of the original element index. Thus, the result is the same as the statement TestValue = Save.GetElement(1000000L);, while the notation is that of a "normal" array access.
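
The essential shape of this proxy technique, stripped of the quantum file machinery, is sketched below. The class names follow those in the text, but the bodies of StoreElement and GetElement are simplified stand-ins, not the code in Figures persist.05 through persist.08:

#include <cstdio>

typedef unsigned long Ulong;

class PersistentArrayUlong;   // forward declaration

class PersistentArrayUlongRef
{
public:
    PersistentArrayUlongRef(PersistentArrayUlong &p_PAU, Ulong p_Index);
    PersistentArrayUlongRef &operator=(Ulong p_Element);   // handles "lvalue" (write) uses
    operator Ulong();                                        // handles "rvalue" (read) uses
private:
    PersistentArrayUlong &m_PAU;
    Ulong m_Index;
};

class PersistentArrayUlong
{
public:
    void StoreElement(Ulong p_Index, Ulong p_Element) { m_Data[p_Index] = p_Element; } // stand-in for the quantum file version
    Ulong GetElement(Ulong p_Index) { return m_Data[p_Index]; }                        // stand-in for the quantum file version
    PersistentArrayUlongRef operator[](Ulong p_Index)
        { return PersistentArrayUlongRef(*this, p_Index); }
private:
    Ulong m_Data[128];   // stand-in storage; the real class keeps its data in the quantum file
};

PersistentArrayUlongRef::PersistentArrayUlongRef(PersistentArrayUlong &p_PAU, Ulong p_Index)
    : m_PAU(p_PAU), m_Index(p_Index) {}

PersistentArrayUlongRef &PersistentArrayUlongRef::operator=(Ulong p_Element)
{
    m_PAU.StoreElement(m_Index, p_Element);   // an "lvalue" use turns into a StoreElement call
    return *this;
}

PersistentArrayUlongRef::operator Ulong()
{
    return m_PAU.GetElement(m_Index);          // an "rvalue" use turns into a GetElement call
}

int main()
{
    PersistentArrayUlong Save;
    Save[100] = 1234567L;             // equivalent to Save.StoreElement(100, 1234567L)
    Ulong TestValue = Save[100];      // equivalent to TestValue = Save.GetElement(100)
    printf("%lu\n", TestValue);
    return 0;
}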

Before we move on to our next topic, I have a word of warning for you. If you use these "synthetic arrays" frequently, you may be tempted to inline the definitions of these auxiliary functions that make the array notation possible. I recommend that you don't do it, at least not without a lot of testing to make sure it works correctly. When I tried this with Borland C++ 3.1, the result appeared to work but generated terrible memory leaks; as far as I could determine, the memory that wasn't being freed was that belonging to the temporary objects that were created during the operation of these functions.

Some Fine Details

Now let's look at some of the other implementation details and problems I encountered in the process of getting the dynamic hashing algorithm and the hashtest.cpp test program to work.

During development of the test program, I discovered that running the tests on large numbers of records took a fairly large amount of time; to generate a quantum file with 500,000 records in it takes about 45 minutes, doing almost continual disk accesses the whole time.10 Although this is actually very fast when one considers the amount of data being processed, I still didn't want to have to execute such a program more often than I had to. Therefore, the test program needed the capability of incrementally adding records to an existing quantum file. However, this meant that the test program had to be able to start somewhere in the middle of the input file; adding the same records again and again wouldn't work because the algorithm isn't designed to handle records with duplicate keys.

As the test program is implemented, only two pieces of information need to be saved between runs in order to allow the addition of previously unused records to the quantum file: the index number in the SaveKeys array where the next key should be saved, and the offset into the input file where the next record begins. We don't have to save the name of the input file or the ASCII output file, since those are compiled into the program; of course, this would not be appropriate for a real application program, but in our test program we don't want to change the name of the input file between runs. Clearly, if the input file could change from one run to the next, the information as to where we stopped when reading records wouldn't be of much use. Since both of the data items that need to be preserved happen to be representable by an unsigned long value, I decided to use a PersistentArrayUlong named Save to save them between runs. As a result, the only difference between starting a new quantum file and adding records to an existing file is that when we start a new quantum file, the offset in the input file and the starting position for keys to be added to the SaveKeys array have to be reset to their initial values.

This ability to build on a previously created quantum file turned out to be quite useful in fixing an interesting bug that I ran into during capacity testing of the 16-bit version of this program. The theoretical capacity of a quantum file with the original size parameters, when used to support the dynamic hashing algorithm, could be calculated as 64K (maximum hash slots in the 16-bit implementation) * 6 (average records per hash slot), or 393,216. Although I didn't necessarily need to demonstrate that this exact number of records could actually be stored and retrieved successfully, especially in view of the number of hours that my machine would be doing continual random disk accesses, I felt that it was important to check that a very large number of records could be accommodated. I selected 250,000 as a nice round number that wouldn't take quite so long to test, and started to run tests for every multiple of 50,000 records up to 250,000.11 As each test was finished, I copied the resulting quantum file to a new name to continue the testing to the next higher multiple of 50,000 records.

Everything went along very nicely through the test that created a 150,000 record quantum file. Since adding 50,000 records to a file that already contained 150,000 records should have taken about 45 minutes, I started the next test, which should have generated a 200,000 record file, and went away to do some chores. Imagine my surprise when I came back and found that the program had aborted somewhere between elements 196,000 and 197,000 due to the failure of an assert that checks that all the records in a hash slot that is being split have the correct hash code to be put into either the old slot or the new slot. Upon investigating the reason for this bug, I discovered that the problem was that m_CurrentMaxSlotCount, which is used in DynamicHashArray::CalculateHash to calculate the hash code according to Larson's algorithm, was an unsigned rather than an unsigned long. As a result, when 32K slots were occupied and it was time to double the value of m_CurrentMaxSlotCount, its new value, which should have been 64K, was 0. This caused the hashing function to generate incorrect values and thus caused the assert to fail. Changing the type of m_CurrentMaxSlotCount to unsigned long solved the problem.
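
The failure mode itself is easy to reproduce in isolation. On a modern compiler, where a plain unsigned is normally 32 bits, an unsigned short plays the role of the 16-bit unsigned in the original code:

#include <cstdio>

int main()
{
    unsigned short current_max_slot_count = 32768u;   // the 32K value that was about to be doubled
    current_max_slot_count *= 2;                      // 65536 doesn't fit in 16 bits, so the value wraps to 0
    unsigned long wide_count = 32768ul;
    wide_count *= 2;                                  // with unsigned long, the doubling yields 65536 as intended
    printf("16-bit result: %u, unsigned long result: %lu\n",
           (unsigned)current_max_slot_count, wide_count);
    return 0;
}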

However, there was one other place where the code had to change to implement this solution, and that was in the destructor for DynamicHashArray. The reason for this change is that the dynamic hashing algorithm needs to store some state information in a persistent form. To be specific, we need to keep track of the number of active slots, the current maximum slot number, and the number of elements remaining before the next slot is to be activated. With these data items saved in the file, the next time we open the file, we're ready to add more records while keeping the previously stored ones accessible.

Unfortunately, we don't have a convenient persistent numeric array to save these values in, and it doesn't make much sense to create one just for three values. However, we do have a persistent string array, which we're using to store the hashed records, and we can use the first element of that array to store ASCII representations of the three values that must be maintained in the file.12 This is handled during the normal constructor for the DynamicHashArray type (Figure dynhash.00), which calls the Open function to do the actual work (Figure dynhash.01).13

The normal constructor for the DynamicHashArray class (from quantum\dynhash.cpp) (Figure dynhash.00)

codelist/dynhash.00

The DynamicHashArray::Open function (from quantum\dynhash.cpp) (Figure dynhash.01)

codelist/dynhash.01

The Open function retrieves these values from the first element of the array if it has already been created, and the destructor (Figure dynhash.02) stores them into the first element of the array.

The destructor for the DynamicHashArray class (from quantum\dynhash.cpp) (Figure dynhash.02)

codelist/dynhash.02

After the change to the type of m_CurrentMaxSlotCount, the sprintf format parameter for that type also had to change, to "%6lu" from "%6u", so that the whole value of the variable would be used. If I hadn't made this change, the data would have been written to the parameter string incorrectly, and the parameters wouldn't have been read back in correctly the next time the file was opened.
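
As an illustration of the kind of parameter string involved, the following sketch packs and unpacks three such values with sprintf and sscanf; the field order and widths here are my own choices, not necessarily those of the real destructor and Open function:

#include <cstdio>

int main()
{
    unsigned long active_slot_count = 9;
    unsigned long current_max_slot_count = 65536ul;   // needs %lu; with %6u on a 16-bit compiler only part of it would be written
    unsigned long elements_until_next_slot = 6;

    char parameters[40];
    sprintf(parameters, "%6lu %6lu %6lu",
            active_slot_count, current_max_slot_count, elements_until_next_slot);

    unsigned long active, max_slots, remaining;
    sscanf(parameters, "%lu %lu %lu", &active, &max_slots, &remaining);
    printf("restored: %lu %lu %lu\n", active, max_slots, remaining);
    return 0;
}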

Overflow Handling

The previous version of this program, in the second edition of this book, did not have any means of handling the situation where the total size of the records with the same hash code is so large that the element used to store those records will no longer fit into a quantum. Therefore, in the event that too many records were put into a particular element, the program would fail. While this is acceptable in a program that is intended only to demonstrate the dynamic hashing algorithm, it is unacceptable for commercial use and therefore should really be addressed in a book intended for use in the real world.

I've thought about this problem off and on for some time without coming up with a really good solution. However, a few months ago I did figure out how to solve it in a reasonably easy and efficient way, which is illustrated in the code for DynamicHashArray::StoreElement (Figure dynhash.04).

The DynamicHashArray::StoreElement function (from quantum\dynhash.cpp) (Figure dynhash.04)

codelist/dynhash.04

The basic idea is that whenever a particular element that we are trying to store in the hash array is about to exceed the size of a quantum, we create a new element in which we'll store the records that wouldn't fit in the previous element. The question, of course, is how we find the records stored in this new element, when they will not be in the location in the hash array where we expect to find them.

The answer is that we modify the keys of the stored records in a way that we can do again when we're trying to find a particular record. Then we set a "chain" flag in the original element in the hash array indicating that it has overflowed, so that we don't mistakenly tell the user that the record in question is not in the file.

Exactly how do we modify the key of the record? By appending a character sequence to it that consists of a form feed character followed by an ASCII representation of the number of times that we've had an overflow in this particular chain. For example, if the original key was "ABC", the first overflow key would be "ABC\f0", where "\f" represents the form feed character, and "0" is the ASCII digit 0.

Of course, it's also possible that even the second element in which we're storing records with a particular hash key will overflow. The same algorithm works in that case as well: the second element will have its "chain" flag set to indicate that the program should continue looking for the record in question if it is not found in the current element, and the key will be modified again to make it unique. To continue our previous example, if the original key was "ABC", the second overflow key would be "ABC\f1", where "\f" represents the form feed character and "1" is the ASCII digit 1.

How did I choose this particular way of modifying the keys? Because it would not limit the users' choice of keys in a way that they would object to. Of course, I did have to inform the users that they should use only printable ASCII characters in their keys, but they did not consider that a serious limitation. If your users object to this limitation, then you'll have to come up with another way of constructing unique keys that won't collide with any keys that the users actually want to use.

As this suggests, there were a few tricky parts to this solution, but overall it really wasn't that difficult to implement. I haven't done any serious performance testing on its effects, but I don't expect them to be significant; after all, assuming that we select our original parameters properly, overflows should be a rare event.

I should also explain how we find a record that has overflowed. That is the job of DynamicHashArray::FindElement (Figure dynhash.05).

The DynamicHashArray::FindElement function (from quantum\dynhash.cpp) (Figure dynhash.05)

codelist/dynhash.05

As you can see, the code to look up a record is not complicated very much by the possibility of an overflow. First, we calculate a slot number based on the key that we are given by the user. Then we check whether that key is found in the element for that slot. If it is, we are done, so we break out of the loop. However, if we haven't found the record we're looking for yet, we check to see whether the particular element that we are looking in has its chain flag set. If not, the record must not be in the file, so we break out of the loop.

On the other hand, if the chain flag is set in the element that we were looking at, then we have to keep looking. Therefore, we calculate what the key for that record would be if it were in the next element and continue processing at the top of the loop.

On the next pass through the loop, we'll retrieve the next element that the record might be in, based on its modified key. We'll continue through this loop until we either find the record we're looking for or get to an element whose chain flag is not set; the latter situation, of course, means that the desired record is not in the file.
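
The following in-memory model shows the shape of that loop. The toy hash function, the StorageElement structure, and the container are stand-ins of mine for the quantum file machinery, but the chain-following logic mirrors the description above:

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// An in-memory stand-in for one storage element of the hash array.
struct StorageElement
{
    std::map<std::string, std::string> records;   // (possibly modified) key -> record data
    bool chain_flag;                               // set when this element has overflowed
    StorageElement() : chain_flag(false) {}
};

const unsigned SlotCount = 8;                      // a fixed table size, just for the model

unsigned SlotNumber(const std::string &key)        // a toy hash, not the book's CalculateHash
{
    unsigned sum = 0;
    for (std::string::size_type i = 0; i < key.size(); ++i)
        sum += (unsigned char)key[i];
    return sum % SlotCount;
}

std::string MakeOverflowKey(const std::string &key, unsigned overflow_count)
{
    char suffix[16];
    sprintf(suffix, "\f%u", overflow_count);       // "ABC" becomes "ABC\f0", then "ABC\f1", and so on
    return key + suffix;
}

// Follow the overflow chain until the record is found or an element's chain flag is clear.
bool FindElement(const std::vector<StorageElement> &hash_array,
                 const std::string &user_key, std::string &data)
{
    std::string lookup_key = user_key;
    for (unsigned overflow = 0; ; ++overflow)
    {
        const StorageElement &element = hash_array[SlotNumber(lookup_key)];
        std::map<std::string, std::string>::const_iterator it = element.records.find(lookup_key);
        if (it != element.records.end())
        {
            data = it->second;                     // found the record: we're done
            return true;
        }
        if (!element.chain_flag)
            return false;                          // no overflow from this element: not in the file
        lookup_key = MakeOverflowKey(user_key, overflow);   // look in the next element of the chain
    }
}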

Settling Our Hash

I've recently had the opportunity to use this algorithm in a commercial setting, and have discovered (once again) that how the hash code is calculated is critical to its performance. In this case, the keys were very poorly suited to the simple (if not simple-minded) calculation used in the version of DynamicHashArray::CalculateHash in Figure oldhash.00.

An early implementation of the DynamicHashArray::CalculateHash function (Figure oldhash.00)

codelist/oldhash.00

The problem was that the keys used in the application were all very similar to one another. In particular, the last seven or eight characters in each key were likely to differ from the other keys in only one or two places. This caused tremendous overloading of the hash slots devoted to those keys, with corresponding performance deterioration.

Luckily, I was able to provide a solution to this problem in short order by using an algorithm that produced much better hash codes. Interestingly enough, the algorithm that I substituted for my poorly performing hash code wasn't even designed to be a hash code algorithm at all. Instead, it was a cyclic redundancy check (CRC) function whose purpose was to calculate a value based on a block of data so that when the data was later read again, the reading program could determine whether it had been corrupted.

On second thought, perhaps it wasn't so strange that a CRC algorithm would serve as a good hash code. After all, for it to do its job properly, any change to the data should be very likely to produce a different CRC. Therefore, even if two keys differed by only one or two bytes, their CRCs would almost certainly be different, which of course is exactly what we want for a hash code. As it happens, this substitution greatly improved the performance of the program, so apparently my choice of a new hash algorithm was appropriate.

Fortunately, I came up with this solution just in time to include the new code on the CD-ROM in the back of this book. It is shown in Figure dynhash.06.

The DynamicHashArray::CalculateHash function (from quantum\dynhash.cpp) (Figure dynhash.06)

codelist/dynhash.06

As for the CRC class, its interface is shown in Figure crc32.00.

The interface for the CRC32 class (from quantum\crc32.h) (Figure crc32.00)

codelist/crc32.00

And its implementation, which also includes a brief description of the algorithm, is shown in Figure crc32.01.

The implementation for the CRC32 class (from quantum\crc32.cpp) (Figure crc32.01)

codelist/crc32.01
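
Since those listings aren't reproduced in the text, the following sketch shows the general shape of the idea: run the key through a table-driven CRC-32, then reduce the result to a slot number with the two rules described earlier. This is a generic illustration, not the code in Figures dynhash.06 and crc32.01:

#include <cstddef>

typedef unsigned long Ulong;

Ulong Crc32(const char *data, std::size_t length)   // a standard table-driven CRC-32 (polynomial 0xEDB88320)
{
    static Ulong table[256];
    static bool table_built = false;
    if (!table_built)
    {
        for (Ulong i = 0; i < 256; ++i)
        {
            Ulong crc = i;
            for (int j = 0; j < 8; ++j)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320UL : crc >> 1;
            table[i] = crc;
        }
        table_built = true;
    }
    Ulong crc = 0xFFFFFFFFUL;
    for (std::size_t i = 0; i < length; ++i)
        crc = (crc >> 8) ^ table[(crc ^ (unsigned char)data[i]) & 0xFF];
    return crc ^ 0xFFFFFFFFUL;
}

// Reduce the CRC to a hash slot number using the two rules from Larson's algorithm.
Ulong CalculateSlot(const char *key, std::size_t key_length,
                    Ulong current_max_slot_count, Ulong active_slot_count)
{
    Ulong slot = Crc32(key, key_length) % current_max_slot_count;
    if (slot >= active_slot_count)
        slot -= current_max_slot_count / 2;
    return slot;
}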

Bigger and Better

What are the limits on the maximum capacity of a dynamic hashing array? As it happens, there are two capacity parameters of the quantum file access method that we can adjust without affecting the implementation very much: BlockSize, which specifies how large each quantum is, and the maximum number of blocks allowed in the file, set by the MaxFileQuantumCount const in blocki.h. In the current implementation, BlockSize is set to 16K, which means we need 14 bits to specify the location of an item in a quantum. Since an ItemIndex (Figure blocki.08) uses a 16-bit word to hold the offset, we could increase the BlockSize to as much as 64K bytes. Let's take a look at the advantages and disadvantages of increasing the quantum size.

Suppose we increase the size of each quantum, for example, to 32K bytes from 16K. It's easy to see that this would double the maximum capacity of the file. What may not be so obvious is that this change would also decrease the memory requirements for efficient file access via dynamic hashing, for a given number of records in the file. To see why this is so, we have to look at the typical usage of disk buffers when looking up a string by key in the dynamic hashing algorithm.

Suppose we want to find the record that has the key 609643342. The algorithm calculates which hash slot points to the storage element in which a record with that key would be stored. It then calls the quantum file access routine GetModifiableElement to retrieve that storage element. GetModifiableElement retrieves the big pointer array block for the array that the storage elements are kept in; then it retrieves the little pointer array block for the correct storage element; and finally it gets the block where the storage element is stored, retrieves it from the block, and returns it. The dynamic hashing algorithm then searches the storage element to find the key we specified and, if it is found, extracts the record we want.14 So a total of three blocks are accessed for each retrieval of a string: the big pointer array block, a little pointer array block, and the final "leaf" block. The first of these blocks is referenced on every string retrieval, so it is almost certain to be in memory. The "leaf" block, on the other hand, is not very likely to be in memory, since the purpose of the hashing algorithm is to distribute the data as evenly as possible over the file: with a reasonably large file, most "leaf" accesses aren't going to be to one of the relatively few blocks we can keep in memory.

Fortunately, this pessimistic outlook does not apply to the second block retrieved, the little pointer array block. If we have 500,000 strings in our file, there are 83,333 storage elements that the quantum file algorithm has to deal with. With a 16K-byte block, approximately 4000 little pointer elements fit in each little pointer block, so the little pointer blocks would occupy 21 buffers.

Let's look at the situation if we go to 32K-byte blocks. If we double the number of strings in the average storage element to 12 so that the likelihood of an overflow would remain approximately the same as before, then the number of little pointer array elements needed to access a given number of records is halved; in addition, the number of little pointer array elements that fit in each little pointer block is doubled.15 This means that to be fairly sure that the little pointer block we need will be present in memory, instead of 21 blocks taking up about 340K of memory, we need only 6 blocks taking up about 200K of memory.
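
The arithmetic behind these figures is easy to check, assuming four-byte little pointer elements (consistent with the four-byte ItemReference size given later in the chapter):

#include <cstdio>

int main()
{
    const unsigned long strings = 500000;
    const unsigned long pointer_size = 4;       // bytes per little pointer array element

    unsigned long elements_16k = strings / 6;                        // storage elements at 6 strings each
    unsigned long pointers_per_16k_block = 16384 / pointer_size;
    unsigned long blocks_16k = (elements_16k + pointers_per_16k_block - 1) / pointers_per_16k_block;

    unsigned long elements_32k = strings / 12;                       // storage elements at 12 strings each
    unsigned long pointers_per_32k_block = 32768 / pointer_size;
    unsigned long blocks_32k = (elements_32k + pointers_per_32k_block - 1) / pointers_per_32k_block;

    printf("16K blocks: %lu little pointer blocks (%lu KB of buffers)\n", blocks_16k, blocks_16k * 16);
    printf("32K blocks: %lu little pointer blocks (%lu KB of buffers)\n", blocks_32k, blocks_32k * 32);
    return 0;
}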

In the case of dynamic hashing, this effect greatly alleviates the primary drawback of increasing the block size, which is that the number of blocks that can be held in a given amount of memory is inversely proportional to the size of each block. In the general case, however, this reduction in the number of buffers can hurt the performance of the program; the system can "thrash" if the working set of data needed at any given time needs more buffers than are available.

The only other apparent drawback of increasing the size of the quanta is that the free space codes become less accurate, since the code remains fixed in size at one byte; with a 32K block, each increment in the size code represents 128 bytes. However, I doubt that this will cause any significant problems with space utilization.

The More, the Merrier

The other fairly simple way to increase the capacity of the system is to increase the number of blocks that can be addressed. The ItemReference class (Figure blocki.12) defines objects that take up four bytes each, 20 bits for the m_QuantumNumber field and 12 bits for the m_RelativeItemNumber field.

The ItemReference class (from quantum\blocki.h) (Figure blocki.12)

codelist/blocki.12

If we wanted to increase the number of quanta from its current maximum of 10,000 (set by the MaxFileQuantumCount const in blocki.h), we could increase it up to 1024K without changing the size of the m_QuantumNumber field.

The main drawback of increasing the maximum block count is that the free space list gets bigger; in fact, that's the reason that I've set the maximum file quantum count to 10,000 despite the fact that the quantum number field in the item reference class can handle a file with far more quanta. However, if our application needs so much data that a 160-MB maximum file size is too small, the extra space taken up by a larger free space list probably isn't an obstacle.

Summary

In this chapter, we have used the quantum file access method as the base for a disk-based variant of Larson's dynamic hashing. This algorithm provides efficient hash-coded access by key to a very large amount of variable-length textual data, while eliminating the traditional drawbacks of hashing, especially the need to specify the maximum size of the file in advance.

In the final chapter, we will summarize the algorithms we have covered in this book and discuss some other resources we can use to improve the efficiency of our programs.

Problems

  1. What modifications to the dynamic hashing implementation would be needed to add the following capabilities?
    1. Shrinking the file when deleting records;
    2. Storing and retrieving records with duplicate keys.
  2. How could the PersistentArrayUlong class be generalized to other data types?

(You can find suggested approaches to problems in Chapter artopt.htm).

Footnotes

  1. William G. Griswold and Gregg M. Townsend, "The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon". Software Practice and Experience, 23(April 1993), 351-367.
  2. In case you're thinking that the availability of virtual memory will solve this problem by allowing us to pretend that we have enough memory to hold the lists no matter how large they get, you may be surprised to discover that the performance of such an algorithm in virtual memory is likely to be extremely poor. I provide an example of this phenomenon in my previously cited article "Galloping Algorithms".
  3. P.-A. Larson, "Dynamic Hash Tables". Communications of the ACM, 31(1988).
  4. Of course, thinking them up in the first place is a little harder!
  5. This step alone will probably not provide a good distribution of records with a number of slots that is a power of two; in such cases, it effectively discards all the high-order bits of the key. In our actual test program, in order to improve the distribution, we precede this step by one that uses all of the bits in the input value. However, the principle is the same.
  6. This may be obvious, but if it's not, the following demonstration might make it so:

    1. Let's call the value to be hashed x.

    2. y = x >> 4 [divide by 16]

    3. z = y << 4 [multiply result by 16; result must have low four bits = 0]

    4. x % 16 = x - z [definition of "remainder"; result is low four bits of x]

    Obviously, an analogous result holds true for the remainder after dividing by 8; therefore, the low three bits of these two values must be the same.

  7. As it turns out, there's no reason to actually allocate storage for all the new slots when we double the current maximum count, because the only one that's immediately needed is the first new one created. Accordingly, Larson's algorithm allocates them in blocks of a fixed size as they're needed, which is effectively the way my adaptation works as well.
  8. Given the quantum file access method, that is!
  9. This is also from James Coplien's book, mentioned earlier.
  10. As an illustration of the immense improvements in computer speeds in a few years, creating a new file containing 250,000 records took about four hours with the computer I had in 1995 and about 15 minutes with the computer I have in 1998.
  11. You may wonder where I got such a large number of records to test with. They are extracted from a telephone directory on CD-ROM; unfortunately, I can't distribute them for your use, since I don't have permission to do so.
  12. In order to hide this implementation detail from the rest of the dynamic hashing algorithm, the GetString and PutString functions of the DynamicHashArray class increment the element index before using it. Thus, the request to read or update element 0 is translated into an operation on element 1, and so forth. To access element 0, the index is specified as -1.
  13. In case you're wondering why we need two functions here instead of one, it's because sometimes we want to create a dynamic hash array before we know what file it's going to be attached to. In such a case, we can use the default constructor and then call the Open function once we know the actual parameters; without separating construction and initialization, we would not have that capability.
  14. This is actually somewhat of an over-simplification, as it ignores the possibility of overflow of a storage element; if that occurs, then the whole process of looking up the storage element by key has to be repeated, doubling the number of block accesses. However, if we've chosen the parameters properly, overflows will be rare and therefore will not affect this analysis significantly.
  15. Of course, having a larger block size also makes the algorithm more suitable to other applications with larger data items.
