It's time to revisit the topic of hash coding, or "hashing" for short, which we previously discussed in Chapter superm.htm.
Dynamic Hashing
Hashing is a way to store and retrieve large numbers of records rapidly, by calculating an address close to the one where the record is actually stored. However, traditional disk-based methods of hashing have several deficiencies:
1. The records must be of fixed length.
2. Deleting records and reusing their space is difficult.
3. The time needed to find a record varies considerably from one record to another; some records may require many accesses to locate.
4. The maximum capacity of the file is fixed when it is created.
The first of these problems can be handled by having fixed-length hash records pointing to variable-length records in another file, although this increases the access time. The second problem can be overcome with some extra work in the implementation to mark deleted records for later reuse. The third problem can be alleviated significantly by the use of "last-come, first-served" hashing, as I mentioned in a footnote in Chapter superm.htm. However, until relatively recently, the fourth problem, a fixed maximum capacity for a given file, seemed intractable.
Of course, nothing stops us from creating a new, bigger, hash table and copying the records from the old file into it, rehashing them as we go along. In fact, one of the questions at the end of Chapter superm.htm asks how we could handle a file that becomes full, as an off-line process, and the "suggested approaches" section in Chapter artopt.htm proposes that exact answer.
However, this solution to the problem of a full hash table has two serious drawbacks. For one thing, we have to have two copies of the file (one old and one new) while we're doing this update, which can take up a large amount of disk space temporarily. More important in many applications, this "big bang" file maintenance approach won't work for a file that has to be available all of the time. Without doing a lot of extra work, we can't update the data during this operation, which might take a long time (perhaps several hours or even days) if the file is big. Is it possible to increase the size of the file after its creation without incurring these penalties, and still gain the traditional advantages of hashing, i.e., extremely fast access to large amounts of data by key?
Until shortly before I wrote the second edition of this book, I would have said that it wasn't possible. Therefore, I was quite intrigued when I ran across an article describing a hashing algorithm that can dynamically resize the hash table as needed during execution, without any long pauses in execution.1 Unfortunately, the implementation described in the article relies on the use of linked lists to hold the records pointed to by the hash slots. While this is acceptable for in-memory hash tables, such an implementation is not appropriate for disk-based hash tables, where minimizing the number of disk seeks is critical to achieving good performance.2 Nonetheless, it was obviously a fascinating algorithm, so I decided to investigate it to see whether it could be adapted to disk-based uses.
According to the Griswold and Townsend article, Per-Åke Larson adapted an algorithm called "linear dynamic hashing", previously limited to accessing external files, to make it practical for use with in-memory tables.3 The result, appropriately enough, is called "Larson's algorithm". How does it work?
First, let's look at the "slot" method of handling overflow used by many in-memory hash table algorithms, including Larson's algorithm. To store an element in the hash table, we might proceed as follows, assuming the element to be stored contains a "next" pointer to point to another such element:
1. Calculate the hash code of the new element's key and take the remainder after dividing it by the number of slots; this gives us the slot number.
2. Set the new element's "next" pointer to the current contents of that slot.
3. Set the slot to point to the new element.
The result of this procedure is that the new element is linked into the chain attached to the slot, as its first element.
To retrieve an element, given its key, we simply use the hashing algorithm to get the slot number, and then follow the chain from that slot until we find the element. If the element isn't found by the time we get to a null pointer, it's not in the table.
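To make the procedure concrete, here is a minimal in-memory sketch of the slot-and-chain scheme just described; the names (Element, HashTable, Store, Find) are illustrative, not taken from any code in this book.

    #include <cstddef>

    // A toy chained hash table illustrating the slot method described above.
    struct Element
    {
        unsigned long m_Key;
        Element *m_Next;     // the "next" pointer mentioned in the text
        // ... the record's data would go here ...
    };

    class HashTable
    {
    public:
        HashTable(size_t p_SlotCount)
          : m_SlotCount(p_SlotCount), m_Slots(new Element *[p_SlotCount])
        {
            for (size_t i = 0; i < m_SlotCount; i++)
                m_Slots[i] = 0;
        }
        void Store(Element *p_Element)
        {
            size_t Slot = p_Element->m_Key % m_SlotCount;  // hash the key to a slot number
            p_Element->m_Next = m_Slots[Slot];             // link the new element in ...
            m_Slots[Slot] = p_Element;                     // ... as the first element of the chain
        }
        Element *Find(unsigned long p_Key) const
        {
            size_t Slot = p_Key % m_SlotCount;
            for (Element *e = m_Slots[Slot]; e != 0; e = e->m_Next)
                if (e->m_Key == p_Key)
                    return e;
            return 0;                                      // hit a null pointer: not in the table
        }
    private:
        size_t m_SlotCount;
        Element **m_Slots;
    };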
This algorithm is easy to implement, but as written it has a performance problem: namely, with a fixed number of slots, the chains attached to each slot get longer in proportion to the number of elements in the table, which means that the retrieval time increases proportionately to the number of elements added. However, if we could increase the number of slots in proportion to the number of elements added, then the average time to find an element wouldn't change, no matter how many elements were added to the table. The problem is how we can find elements we've already added to the table if we increase the number of slots. It may not be obvious why this is a problem; for example, can't we just leave them where they are and add the new elements into the new slots?
Unfortunately, this won't work. The reason is that when we want to look up an element, we don't generally know when the element was added, so as to tell what "section" of the hash file it resides in. The only piece of information we're guaranteed to have available is the key of the element, so we had better be able to find an element given only its key. Of course, we could always rehash every element in the table to fit the new number of slots, but we've already seen that this can cause serious delays in file availability. Larson's contribution was to find an efficient way to increase the number of slots in the table without having to relocate every element when we do it.
The basic idea is reasonably simple, as are most great ideas, once you understand them.4 For our example, let's assume that the key is an unsigned value; we start out with 8 slots, all marked "active"; and we want to limit the average number of elements per slot to 6.
This means that in order to store an element in the table, we have to generate a number between 0 and 7, representing the number of the slot to which we will chain the new element being stored. For purposes of illustration, we'll simply divide the key by the number of slots and take the remainder.5 For reasons which will become apparent shortly, the actual algorithm looks like this:
1. Divide the key by the number of slots allocated and take the remainder.
2. If the remainder is greater than or equal to the number of active slots, subtract half the number of slots allocated.
You will notice that, with the parameters as they are currently set, the second rule will never be activated, since the remainder can't be greater than or equal to 8; we'll see the reason for the second rule shortly.
Now we're set up to handle up to 48 elements. On the 49th element, we should add another slot to keep the average number per slot from exceeding 6. What do we do to the hash function to allow us to access those elements we have already stored?
We double the number of slots, making it 16, recording that new number of slots allocated; then we increase the number of "active" slots, bringing that number to 9. The hash calculation now acts like this:
1. Divide the key by 16 (the number of slots allocated) and take the remainder.
2. If the remainder is greater than or equal to 9 (the number of active slots), subtract 8 (half the number of slots allocated).
This modified hashing algorithm will spread newly added elements over nine slots rather than eight, since the possible slot values now range from 0 to 8 rather than 0 to 7. This keeps the number of elements per slot from increasing as we add new elements, but how do we retrieve elements that were already in the table?
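The two rules can be captured in a few lines of code. The following sketch uses names of my own choosing and an unsigned long key; the corresponding calculation in the book's code is done by DynamicHashArray::CalculateHash.

    // A sketch of Larson's slot calculation: rule 1 takes the remainder by the
    // number of allocated slots; rule 2 folds not-yet-activated slots back down.
    unsigned long CalculateSlot(unsigned long p_Key,
                                unsigned long p_ActiveSlotCount,
                                unsigned long p_AllocatedSlotCount)
    {
        unsigned long Slot = p_Key % p_AllocatedSlotCount;   // rule 1
        if (Slot >= p_ActiveSlotCount)                       // rule 2
            Slot -= p_AllocatedSlotCount / 2;
        return Slot;
    }

With 16 slots allocated and 9 active, a key whose remainder is 9 is folded back to slot 1, while a remainder of 8 maps to the newly activated slot 8, exactly as described above.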
Here is where we see the stroke of genius in this new algorithm. Only one slot has elements that might have to be moved under the new hashing algorithm, and that is the first slot in the table! Why is this so?
The analysis isn't really very difficult. Let's look at the elements in the second slot (i.e., slot number 1). We know that their hash codes under the old algorithm were all equal to 1, or they wouldn't be in that slot. That is, the remainder after dividing their keys by 8 was always 1, which means that the binary value of the remainder always ended in 001. Now let's see what the possible hash values are under the new algorithm. We divide the key by 16 and take the remainder. This is equivalent to taking the last four bits of the key. However, the last three bits of the remainder after dividing by 16 and the last three bits of the remainder after dividing by 8 have to be the same.6 Therefore, we have to examine only the fourth bit (the 8's place) of the remainder to calculate the new hash code. There are only two possible values of that bit, 0 and 1. If the value of that bit is 0, we have the same hash code as we had previously, so we obviously can look up the element successfully even with the additional slot in the table. However, if the value is 1, the second rule tells us to compare the hash code, which is 1001 binary or 9 decimal, against the number of active slots (9). Since the first is greater than or equal to the second, we subtract 8, getting 1 again; therefore, the lookup will always succeed. The same considerations apply to all the other slots previously occupied in the range 2-7. There's only one where this analysis breaks down, and that's slot 0. Why is this one different?
Because slot 8 is actually available; therefore, the second rule won't apply. In this case, we do have to know whether an element is in slot 0 or 8. While this will be taken care of automatically for newly added elements using the new hash code parameters, we still have the old elements in slot 0 to contend with; on the average, half of them will need to be moved to slot 8. This is easy to fix; all we have to do is to retrieve each element from slot 0 and recalculate its hash value with the new parameters. If the result is 0, we put it back in slot 0; if the result is 8, we move it to slot 8.
Now we're okay until we have to add the 57th element to the hash table, at which point we need to add another slot to keep the average chain length at 6. Happily, all we have to do is to increase the number of active slots by 1 (to 10) and rehash the elements in slot 1. It works just like the example above; none of the slots 2-7 is affected by the change, because the second rule "folds" all of their hash calculations into their current slots.
When we have added a total of 96 elements, the number of "active" slots and the total number of slots will both be 16, so we will be back to a situation similar to the one we were in before we added our first expansion slot. What do we do when we have to add the 97th element? We double the total number of slots again and start over as before.7 This can go on until the number of slots becomes too big to fit in the integer type in which we maintain it, or until we run out of memory, whichever comes first.
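Here is a sketch of the expansion step itself, assuming the chained table from the first sketch has been given m_ActiveSlotCount and m_AllocatedSlotCount members, a GrowSlotArray helper (not shown), and the CalculateSlot function above; again, the names are mine, not those of the book's code.

    // Activate one more slot, doubling the allocation first if every allocated
    // slot is already active, and rehash only the one slot that is being split.
    void HashTable::Expand()
    {
        if (m_ActiveSlotCount == m_AllocatedSlotCount)
        {
            m_AllocatedSlotCount *= 2;   // e.g., 16 becomes 32 when the 97th element arrives
            GrowSlotArray();             // make room for the new slots (not shown)
        }

        unsigned long NewSlot = m_ActiveSlotCount;
        unsigned long SlotToSplit = NewSlot - m_AllocatedSlotCount / 2;
        m_ActiveSlotCount++;

        // Each element in the slot being split either stays where it is or
        // moves to the newly activated slot; no other slot is touched.
        Element *Chain = m_Slots[SlotToSplit];
        m_Slots[SlotToSplit] = 0;
        while (Chain != 0)
        {
            Element *Next = Chain->m_Next;
            unsigned long Slot = CalculateSlot(Chain->m_Key,
                                               m_ActiveSlotCount,
                                               m_AllocatedSlotCount);
            Chain->m_Next = m_Slots[Slot];
            m_Slots[Slot] = Chain;
            Chain = Next;
        }
    }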
Of course, if we want to handle a lot of data, we will probably run out of memory before we run out of slot numbers. Since the ability to retrieve any record in any size table in constant time, on the average, would certainly be very valuable when dealing with very large databases, I considered that it was worth trying to adapt this algorithm to disk-based hash tables. All that was missing was a way to store the chains of records using a storage method appropriate for disk files.
Upon further consideration, I realized that the quantum file access method could be used to store a variable-length "storage element" pointed to by each hash slot, with the rest of the algorithm implemented pretty much as it was in the article. This would make it possible to store a very large number of strings by key and get back any of them with an average of a little more than one disk access, without having to know how big the file would be when we created it. This algorithm also has the pleasant effect of making deletions fairly simple to implement, with the file storage of deleted elements automatically reclaimed as they are removed.
Contrary to my usual practice elsewhere in this book, I have not developed a sample "application" program, but have instead opted to write a test program to validate the algorithm. This was very useful to me during the development of the algorithm; after all, this implementation of dynamic hashing is supposed to be able to handle hundreds of thousands of records while maintaining rapid access, so it is very helpful to be able to demonstrate that it can indeed do that.
The test program is hashtest.cpp (Figure hashtest.00).
codelist/hashtest.00
This program stores and retrieves strings by a nine-digit key, stored as a String value. To reduce the overhead of storing each entry as a separate object in the quantum file, all strings having the same hash code are combined into one "storage element" in the quantum file system; each storage element is addressed by one "hash slot". The current version of the dynamic hashing algorithm used by hashtest.cpp allocates one hash slot for every six strings in the file; since the average individual string length is about 60 characters, and there are 18 bytes of overhead for each string in a storage element (four bytes for the key length, four bytes for the data length, and 10 bytes for the key value including its null terminator), this means that the average storage element will be a bit under 500 bytes long (6 * (60 + 18) = 468 bytes). A larger element packing factor than six strings per element would produce a smaller hash table and would therefore be more space efficient. However, the choice of this value is not critical with the current implementation of this algorithm, because any storage elements that become too long to fit into a quantum will be broken up and stored separately by a mechanism which I will get to later.
Of course, in order to run meaningful tests, we have to do more than store records in the hash file; we also have to retrieve what has been stored, which means that we have to store the keys someplace so that we can use them again to retrieve records corresponding to those keys. In the current version of this algorithm, I use a FlexArray (i.e., a persistent array of strings, such as we examined in the previous chapter) to store the values of the keys. However, in the original version of this algorithm, I was storing the key as an unsigned long value, so I decided to use the quantum file storage to implement a persistent array of unsigned long values, and store the keys in such an array.
It was surprisingly easy to implement a persistent array of unsigned long values,8 for which I defined a typedef of Ulong, mostly to save typing. The header file for this data type is persist.h (Figure persist.00).
PersistentArrayUlong class (quantum\persist.h) (Figure persist.00)
codelist/persist.00
As you can see, this class isn't very complex, and most of the implementation code is also fairly straightforward. However, we get a lot out of those relatively few lines of code; these arrays are not only persistent, but they also automatically expand to any size up to and including the maximum size of a quantum file; with the current maximum of 10,000 16K blocks, a maximum-size PersistentArrayUlong could contain approximately 40,000,000 elements! Of course, we don't store each element directly in a separate addressable entry within a main object, as this would be inappropriate because the space overhead per item would be larger than the Ulongs we want to store! Instead, we employ a two-level system similar to the one used in the dynamic hashing algorithm; the quantum file system stores "segments" of data, each one containing as many Ulongs as will fit in a quantum. To store or retrieve an element, we determine which segment the element belongs to and access the element by its offset in that segment.
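As a concrete illustration of the two-level addressing, here is how the segment number and offset might be computed; the segment capacity shown is an assumed round number, since the real figure depends on how many Ulongs actually fit in a quantum.

    typedef unsigned long Ulong;

    const Ulong UlongsPerSegment = 4000;   // assumed; roughly a 16K quantum divided by 4 bytes per Ulong

    // Which segment an element lives in, and where it lives within that segment.
    Ulong SegmentNumber(Ulong p_ElementIndex)
    {
        return p_ElementIndex / UlongsPerSegment;
    }

    Ulong ElementOffset(Ulong p_ElementIndex)
    {
        return p_ElementIndex % UlongsPerSegment;
    }

Both StoreElement and GetElement, described below, begin with exactly this kind of calculation before touching the quantum file.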
However, before we can use a PersistentArrayUlong, we have to construct it, which we can do via the default constructor (Figure persist.01).
PersistentArrayUlong class (from quantum\persist.cpp) (Figure persist.01)
codelist/persist.01
This constructor doesn't actually create a usable array; it is only there to allow us to declare a PersistentArrayUlong before we want to use it. When we really want to construct a usable array, we use the normal constructor shown in Figure persist.02.
PersistentArrayUlong class (from quantum\persist.cpp) (Figure persist.02)
codelist/persist.02
As you can see, to construct a real usable array, we provide a pointer to the quantum file in which it is to be stored, along with a name for the array. The object directory of that quantum file is searched for a main object with the name specified in the constructor; if it is found, the construction is complete. Otherwise, a new object is created with one element, expandable to fill up the entire file system if necessary.
To store an element in the array we have created, we can use StoreElement (Figure persist.03).
PersistentArrayUlong::StoreElement function (from quantum\persist.cpp) (Figure persist.03)
codelist/persist.03
This routine first calculates which segment of the array contains the element we need to store, and the element number within that segment. Then, if we are running the "debugging" version (i.e., asserts are enabled), it checks whether the segment number is within the maximum range we set up when we created the array. This test should never fail unless there is something wrong with the calling routine (or its callers), so that the element number passed in is absurdly large. As discussed above, with all such conditional checks, we have to try to make sure that our testing is good enough to find any errors that might cause this to happen; with a "release" version of the program, this would be a fatal error.
Next, we check whether the segment number we need is already allocated to the array; if not, we increase the number of segments as needed by calling GrowMainObject, but don't actually initialize any new segments until they're accessed, so that "sparse" arrays won't take up as much room as ones that are filled in completely. Next, we get a copy of the segment containing the element to be updated; if it's of zero length, that means we haven't initialized it yet, so we have to allocate memory for the new segment and fill it with zeros. At this point, we are ready to create an AccessVector called TempUlongVector of type Ulong and use it as a "template" (no pun intended) to allow access to the element we want to modify. Since AccessVector has the semantics of an array, we can simply set the ElementNumber-th element of TempUlongVector to the value of the input argument p_Element; the result of this is to place the new element value into the correct place in the TempUlongVector array. Finally, we store TempUlongVector back into the main object, replacing the old copy of the segment.
To retrieve an element from an array, we can use GetElement (Figure persist.04).
PersistentArrayUlong::GetElement function (from quantum\persist.cpp) (Figure persist.04)
codelist/persist.04
First, we calculate the segment number and element number, and check (via qfassert) whether the segment number is within the range of allocated segments; if it isn't, we have committed the programming error of accessing an uninitialized value. Assuming this test is passed, we retrieve the segment, set up the temporary Vector TempUlongVector to allow access to the segment as an SVector of Ulongs, and return the value from the ElementNumber-th element of the array.
All this is very well if we want to write things like "Y.Put(100,100000L);" or "X = Y.Get(100);", to store or retrieve the 100th element of the Y "array", respectively. But wouldn't it be much nicer to be able to write "Y[100] = 100000L;" or "X = Y[100];" instead?
Clearly, that would be a big improvement in the syntax; as it happens, it's not hard to make such references possible, with the addition of only a few lines of code.9 Unfortunately, this code is not the most straightforward, but the syntactic improvement that it provides is worth the trouble. The key is operator[ ] (Figure persist.05).
PersistentArrayUlong::operator[ ] function (from quantum\persist.cpp) (Figure persist.05)
codelist/persist.05
This function returns a temporary value of a type that behaves differently in the context of an "lvalue" reference (i.e., a "write") than it does when referenced as an "rvalue" (i.e., a "read"). In order to follow how this process works, let's use the example in Figure persist1.
codelist/perexam.00
The first question to be answered is how the compiler decodes the following line:
Save[1000000L] = 1234567L;
According to the definition of PersistentArrayUlong::operator[ ], this operator returns a PersistentArrayUlongRef that is constructed with the two parameters *this and p_Index, where the former is the PersistentArrayUlong object for which operator[ ] was called (i.e., Save), and the latter is the value inside the [ ], which in this case is 1000000L. What is this return value? To answer this question, we have to look at the normal constructor for the PersistentArrayUlongRef class (Figure persist.06).
PersistentArrayUlongRef class (from quantum\persist.cpp) (Figure persist.06)
codelist/persist.06
This constructor sets m_PAU to a reference to p_PAU and m_Index to the value of p_Index, so the object returned from the operator[ ] call will be a PersistentArrayUlongRef with those values. Therefore, the compiler-generated code looks something like this, so far:
PersistentArrayUlongRef T(Save,1000000L);
T = 1234567L;
where T is an arbitrary name for a temporary object. The second of these lines is then translated into a call to PersistentArrayUlongRef::operator = (Figure persist.07), to do the assignment.
PersistentArrayUlongRef::operator = function (from quantum\persist.cpp) (Figure persist.07)
codelist/persist.07
The generated code now looks something like this:
PersistentArrayUlongRef T(Save,1000000L);
T.operator =(1234567L);
The operator = code, as you can see from the figure, calls the StoreElement operation for the object m_PAU, which as we noted above is a reference to the original object Save; the arguments to that call are m_Index, a copy of the index supplied in the original operator[ ] call, and p_Element, which is the value specified in the operator = call. Thus, the result is the same as that of the statement Save.StoreElement(1000000L,1234567L);, while the notation is that of a "normal" array access.
However, we've only handled the case where we're updating an element of the array. We also need to be able to retrieve values once they've been stored. To see how that works, let's follow the translation of the following line:
TestValue = Save[1000000L];
The process is fairly similar to what we've already done. The definition of PersistentArrayUlong::operator[] causes the compiler to generate code somewhat like the following:
PersistentArrayUlongRef T(Save,1000000L);
TestValue = T;
This time, however, rather than translating the second line into a call to PersistentArrayUlongRef::operator=, the compiler translates it into a call to PersistentArrayUlongRef::operator Ulong, a "conversion function" that allows a PersistentArrayUlongRef to be used where a Ulong is expected (Figure persist.08).
PersistentArrayUlongRef::operator Ulong function (from quantum\persist.cpp) (Figure persist.08)
codelist/persist.08
Therefore, the final generated code comes out something like this:
PersistentArrayUlongRef T(Save,1000000L);
TestValue = Ulong(T);
As should be evident by looking at the code for that conversion function, it merely calls the GetElement operation of the object m_PAU, which as we noted above is a reference to the original object Save, with the argument m_Index, which is a copy of the original element index. Thus, the result is the same as the statement TestValue = Save.GetElement(1000000L);, while the notation is that of a "normal" array access.
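To see the whole proxy mechanism in one place, here is a stripped-down, purely in-memory sketch of the technique; the real PersistentArrayUlongRef does the same two things, but its operator = and operator Ulong call StoreElement and GetElement on the quantum file rather than touching an ordinary array. The class and variable names here are illustrative.

    #include <iostream>

    typedef unsigned long Ulong;

    class UlongArray;

    // The proxy: remembers which array and which index it refers to.
    class UlongRef
    {
    public:
        UlongRef(UlongArray &p_Array, Ulong p_Index)
          : m_Array(p_Array), m_Index(p_Index) {}
        UlongRef &operator=(Ulong p_Element);   // used when the reference is an lvalue (a write)
        operator Ulong() const;                 // used when the reference is an rvalue (a read)
    private:
        UlongArray &m_Array;
        Ulong m_Index;
    };

    class UlongArray
    {
    public:
        void StoreElement(Ulong p_Index, Ulong p_Element) { m_Data[p_Index] = p_Element; }
        Ulong GetElement(Ulong p_Index) const { return m_Data[p_Index]; }
        UlongRef operator[](Ulong p_Index) { return UlongRef(*this, p_Index); }
    private:
        Ulong m_Data[100];
    };

    UlongRef &UlongRef::operator=(Ulong p_Element)
    {
        m_Array.StoreElement(m_Index, p_Element);   // a write goes through StoreElement
        return *this;
    }

    UlongRef::operator Ulong() const
    {
        return m_Array.GetElement(m_Index);         // a read goes through GetElement
    }

    int main()
    {
        UlongArray Y;
        Y[42] = 100000L;                  // operator[] then operator=
        Ulong X = Y[42];                  // operator[] then operator Ulong
        std::cout << X << std::endl;      // prints 100000
        return 0;
    }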
Before we move on to our next topic, I have a word of warning for you. If you use these "synthetic arrays" frequently, you may be tempted to inline the definitions of these auxiliary functions that make the array notation possible. I recommend that you don't do it, at least not without a lot of testing to make sure it works correctly. When I tried this with Borland C++ 3.1, the result appeared to work but generated terrible memory leaks; as far as I could determine, the memory that wasn't being freed was that belonging to the temporary objects that were created during the operation of these functions.
Now let's look at some of the other implementation details and problems I encountered in the process of getting the dynamic hashing algorithm and the hashtest.cpp test program to work.
During development of the test program, I discovered that running the tests on large numbers of records took a fairly large amount of time; to generate a quantum file with 500,000 records in it takes about 45 minutes, doing almost continual disk accesses the whole time.10 Although this is actually very fast when one considers the amount of data being processed, I still didn't want to have to execute such a program more often than I had to. Therefore, the test program needed the capability of incrementally adding records to an existing quantum file. However, this meant that the test program had to be able to start somewhere in the middle of the input file; adding the same records again and again wouldn't work because the algorithm isn't designed to handle records with duplicate keys.
As the test program is implemented, only two pieces of information need to be saved between runs in order to allow the addition of previously unused records to the quantum file: the index number in the SaveKeys array where the next key should be saved, and the offset into the input file where the next record begins. We don't have to save the name of the input file or the ASCII output file, since those are compiled into the program; of course, this would not be appropriate for a real application program, but in our test program we don't want to change the name of the input file between runs. Clearly, if the input file could change from one run to the next, the information as to where we stopped when reading records wouldn't be of much use. Since both of the data items that need to be preserved happen to be representable by an unsigned long value, I decided to use a PersistentArrayUlong named Save to save them between runs. As a result, the only difference between starting a new quantum file and adding records to an existing file is that when we start a new quantum file, the offset in the input file and the starting position for keys to be added to the SaveKeys array have to be reset to their initial values.
This ability to build on a previously created quantum file turned out to be quite useful in fixing an interesting bug that I ran into during capacity testing of the 16-bit version of this program. The theoretical capacity of a quantum file with the original size parameters, when used to support the dynamic hashing algorithm, could be calculated as 64K (maximum hash slots in the 16-bit implementation) * 6 (average records per hash slot), or 393,216. Although I didn't necessarily need to demonstrate that this exact number of records could actually be stored and retrieved successfully, especially in view of the number of hours that my machine would be doing continual random disk accesses, I felt that it was important to check that a very large number of records could be accommodated. I selected 250,000 as a nice round number that wouldn't take quite so long to test, and started to run tests for every multiple of 50,000 records up to 250,000.11 As each test was finished, I copied the resulting quantum file to a new name to continue the testing to the next higher multiple of 50,000 records.
Everything went along very nicely through the test that created a 150,000 record quantum file. Since adding 50,000 records to a file that already contained 150,000 records should have taken about 45 minutes, I started the next test, which should have generated a 200,000 record file, and went away to do some chores. Imagine my surprise when I came back and found that the program had aborted somewhere between elements 196,000 and 197,000 due to the failure of an assert that checks that all the records in a hash slot that is being split have the correct hash code to be put into either the old slot or the new slot. Upon investigating the reason for this bug, I discovered that the problem was that m_CurrentMaxSlotCount, which is used in DynamicHashArray::CalculateHash to calculate the hash code according to Larson's algorithm, was an unsigned rather than an unsigned long. As a result, when 32K slots were occupied and it was time to double the value of m_CurrentMaxSlotCount, its new value, which should have been 64K, was 0. This caused the hashing function to generate incorrect values and thus caused the assert to fail. Changing the type of m_CurrentMaxSlotCount to unsigned long solved the problem.
However, there was one other place where the code had to change to implement this solution, and that was in the destructor for DynamicHashArray. The reason for this change is that the dynamic hashing algorithm needs to store some state information in a persistent form. To be specific, we need to keep track of the number of active slots, the current maximum slot number, and the number of elements remaining before the next slot is to be activated. With these data items saved in the file, the next time we open the file, we're ready to add more records while keeping the previously stored ones accessible.
Unfortunately, we don't have a convenient persistent numeric array to save these values in, and it doesn't make much sense to create one just for three values. However, we do have a persistent string array, which we're using to store the hashed records, and we can use the first element of that array to store ASCII representations of the three values that must be maintained in the file.12
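A sketch of how the three values might be packed into, and recovered from, that first string element follows; the function and parameter names are my own, and the exact field layout here is illustrative, although the "%6lu" width does figure in the actual code, as described below.

    #include <cstdio>
    #include <string>

    // Pack the three state values into an ASCII string suitable for storing
    // as the first element of the persistent string array.
    std::string PackParameters(unsigned long p_ActiveSlotCount,
                               unsigned long p_CurrentMaxSlotCount,
                               unsigned long p_ElementsBeforeExpansion)
    {
        char Buffer[80];
        std::sprintf(Buffer, "%6lu %6lu %6lu",
                     p_ActiveSlotCount, p_CurrentMaxSlotCount, p_ElementsBeforeExpansion);
        return Buffer;
    }

    // Recover the three values when the file is reopened.
    void UnpackParameters(const std::string &p_Parameters,
                          unsigned long &p_ActiveSlotCount,
                          unsigned long &p_CurrentMaxSlotCount,
                          unsigned long &p_ElementsBeforeExpansion)
    {
        std::sscanf(p_Parameters.c_str(), "%lu %lu %lu",
                    &p_ActiveSlotCount, &p_CurrentMaxSlotCount, &p_ElementsBeforeExpansion);
    }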
This is handled during the normal constructor for the DynamicHashArray type (Figure dynhash.00), which calls the Open function to do the actual work (Figure dynhash.01).13
DynamicHashArray class (from quantum\dynhash.cpp) (Figure dynhash.00)
codelist/dynhash.00
DynamicHashArray::Open function (from quantum\dynhash.cpp) (Figure dynhash.01)
codelist/dynhash.01
The Open function retrieves these values from the first element of the array if it has already been created, and the destructor (Figure dynhash.02) stores them into the first element of the array.
DynamicHashArray class (from quantum\dynhash.cpp) (Figure dynhash.02)
codelist/dynhash.02
After the change to the type of m_CurrentMaxSlotCount, the sprintf format parameter for that type also had to change, to "%6lu" from "%6u", so that the whole value of the variable would be used. If I hadn't made this change, the data would have been written to the parameter string incorrectly, and the parameters wouldn't have been read back in correctly the next time the file was opened.
The previous version of this program, in the second edition of this book, did not have any means of handling the situation where the total size of the records with the same hash code is so large that the element used to store those records will no longer fit into a quantum. Therefore, in the event that too many records were put into a particular element, the program would fail. While this is acceptable in a program that is intended only to demonstrate the dynamic hashing algorithm, it is unacceptable for commercial use and therefore should really be addressed in a book intended for use in the real world.
I've thought about this problem off and on for some time without coming up with a really good solution. However, a few months ago I did figure out how to solve it in a reasonably easy and efficient way, which is illustrated in the code for DynamicHashArray::StoreElement (Figure dynhash.04).
DynamicHashArray::StoreElement function (from quantum\dynhash.cpp) (Figure dynhash.04)
codelist/dynhash.04
The basic idea is that whenever a particular element that we are trying to store in the hash array is about to exceed the size of a quantum, we create a new element in which we'll store the records that wouldn't fit in the previous element. The question, of course, is how we find the records stored in this new element, since they will not be stored in the location in the hash array where we would expect to find them.
The answer is that we modify the keys of the stored records in a way that we can do again when we're trying to find a particular record. Then we set a "chain" flag in the original element in the hash array indicating that it has overflowed, so that we don't mistakenly tell the user that the record in question is not in the file.
Exactly how do we modify the key of the record? By appending a character sequence to it that consists of a form feed character followed by an ASCII representation of the number of times that we've had an overflow in this particular chain. For example, if the original key was "ABC", the first overflow key would be "ABC\f0", where "\f" represents the form feed character, and "0" is the ASCII digit 0.
Of course, it's also possible that even the second element in which we're storing records with a particular hash key will overflow. However, the same algorithm will work in that case as well; in this case, the second element will have its "chain" flag set to indicate that the program should continue looking for the record in question if it is not found in the current element, and the key will be modified again to make it unique; to continue our previous example, if the original key was "ABC", the second overflow key would be "ABC\f1", where "\f" represents the form feed character, and "1" is the ASCII digit 1.
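A sketch of the key modification is shown below; the function name is mine, not the book's, but the suffix it builds follows the scheme just described.

    #include <cstdio>
    #include <string>

    // Build the key under which the Nth overflow element's records are stored:
    // the original key, a form feed, and the ASCII overflow count.
    std::string MakeOverflowKey(const std::string &p_OriginalKey, int p_OverflowCount)
    {
        char Suffix[16];
        std::sprintf(Suffix, "\f%d", p_OverflowCount);   // '\f' is the form feed character
        return p_OriginalKey + Suffix;
    }

    // MakeOverflowKey("ABC", 0) yields "ABC\f0"; MakeOverflowKey("ABC", 1) yields "ABC\f1".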
How did I choose this particular way of modifying the keys? Because it would not limit the users' choice of keys in a way that they would object to. Of course, I did have to inform the users that they should use only printable ASCII characters in their keys, but they did not consider that a serious limitation. If your users object to this limitation, then you'll have to come up with another way of constructing unique keys that won't collide with any keys that the users actually want to use.
As this suggests, there were a few tricky parts to this solution, but overall it really wasn't that difficult to implement. I haven't done any serious performance testing on its effects, but I don't expect them to be significant; after all, assuming that we select our original parameters properly, overflows should be a rare event.
I should also explain how we find a record that has overflowed. That is the job of DynamicHashArray::FindElement (Figure dynhash.05).
DynamicHashArray::FindElement function (from quantum\dynhash.cpp) (Figure dynhash.05)
codelist/dynhash.05
As you can see, the code to look up a record is not complicated very much by the possibility of an overflow. First, we calculate a slot number based on the key that we are given by the user. Then we check whether that key is found in the element for that slot. If it is, we are done, so we break out of the loop. However, if we haven't found the record we're looking for yet, we check to see whether the particular element that we are looking in has its chain flag set. If not, the record must not be in the file, so we break out of the loop.
On the other hand, if the chain flag is set in the element that we were looking at, then we have to keep looking. Therefore, we calculate what the key for that record would be if it were in the next element and continue processing at the top of the loop.
On the next pass through the loop, we'll retrieve the next element that the record might be in, based on its modified key. We'll continue through this loop until we either find the record we're looking for or get to an element whose chain flag is not set; the latter situation, of course, means that the desired record is not in the file.
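The loop can be modeled with a toy in-memory table, reusing the MakeOverflowKey sketch above; everything quantum-file-related is replaced here by a std::map, and all the names are illustrative, so this shows only the shape of the logic in DynamicHashArray::FindElement, not its actual code.

    #include <map>
    #include <string>

    std::string MakeOverflowKey(const std::string &p_OriginalKey, int p_OverflowCount);  // from the sketch above

    struct ToyElement
    {
        std::map<std::string, std::string> m_Records;   // key -> data within this storage element
        bool m_ChainFlag = false;                       // set when this element has overflowed
    };

    std::map<unsigned long, ToyElement> Table;          // slot number -> storage element
    const unsigned long SlotCount = 8;                  // stands in for the dynamic slot count

    unsigned long CalculateSlot(const std::string &p_Key)
    {
        unsigned long Hash = 0;                         // any reasonable string hash will do here
        for (unsigned char c : p_Key)
            Hash = Hash * 31 + c;
        return Hash % SlotCount;
    }

    bool FindRecord(const std::string &p_Key, std::string &p_Data)
    {
        std::string SearchKey = p_Key;
        for (int OverflowCount = 0; ; OverflowCount++)
        {
            ToyElement &Element = Table[CalculateSlot(SearchKey)];   // element for the (possibly modified) key
            std::map<std::string, std::string>::iterator it = Element.m_Records.find(SearchKey);
            if (it != Element.m_Records.end())
            {
                p_Data = it->second;                    // found it
                return true;
            }
            if (!Element.m_ChainFlag)
                return false;                           // no overflow chain: the record is not in the file
            SearchKey = MakeOverflowKey(p_Key, OverflowCount);   // look in the next element of the chain
        }
    }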
I've recently had the opportunity to use this algorithm in a commercial setting, and have discovered (once again) that how the hash code is calculated is critical to its performance. In this case, the keys were very poorly suited to the simple (if not simple-minded) calculation used in the version of DynamicHashArray::CalculateHash in Figure oldhash.00.
DynamicHashArray::CalculateHash function (Figure oldhash.00)
codelist/oldhash.00
The problem was that the keys used in the application were all very similar to one another. In particular, the last seven or eight characters in each key were likely to differ from the other keys in only one or two places. This caused tremendous overloading of the hash slots devoted to those keys, with corresponding performance deterioration.
Luckily, I was able to provide a solution to this problem in short order by using an algorithm that produced much better hash codes. Interestingly enough, the algorithm that I substituted for my poorly performing hash code wasn't even designed to be a hash code algorithm at all. Instead, it was a cyclical redundancy check (CRC) function whose purpose was to calculate a value based on a block of data so that when the data was later read again, the reading program could determine whether it had been corrupted.
On second thought, perhaps it wasn't so strange that a CRC algorithm would serve as a good hash code. After all, for it to do its job properly, any change to the data should be very likely to produce a different CRC. Therefore, even if two keys differed by only one or two bytes, their CRCs would almost certainly be different, which of course is exactly what we want for a hash code. As it happens, this substitution greatly improved the performance of the program, so apparently my choice of a new hash algorithm was appropriate.
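For reference, here is what a CRC-based slot calculation can look like. This is a generic bit-at-a-time CRC-32 combined with Larson's two rules; it is not the book's CRC32 class or the exact code of Figure dynhash.06, just an illustration of the idea.

    #include <string>

    // Generic bit-at-a-time CRC-32 (reflected, polynomial 0xEDB88320).
    unsigned long Crc32(const std::string &p_Key)
    {
        unsigned long Crc = 0xFFFFFFFFUL;
        for (unsigned char c : p_Key)
        {
            Crc ^= c;
            for (int Bit = 0; Bit < 8; Bit++)
                Crc = (Crc >> 1) ^ ((Crc & 1) ? 0xEDB88320UL : 0);
        }
        return Crc ^ 0xFFFFFFFFUL;
    }

    // Use the CRC of the key, rather than the key's characters directly, as the
    // input to Larson's slot calculation.
    unsigned long CalculateHash(const std::string &p_Key,
                                unsigned long p_ActiveSlotCount,
                                unsigned long p_AllocatedSlotCount)
    {
        unsigned long Slot = Crc32(p_Key) % p_AllocatedSlotCount;
        if (Slot >= p_ActiveSlotCount)
            Slot -= p_AllocatedSlotCount / 2;
        return Slot;
    }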
Fortunately, I came up with this solution just in time to include the new code on the CD-ROM in the back of this book. It is shown in Figure dynhash.06.
DynamicHashArray::CalculateHash function (from quantum\dynhash.cpp) (Figure dynhash.06)
codelist/dynhash.06
As for the CRC class, its interface is shown in Figure crc32.00.
CRC32 class (from quantum\crc32.h) (Figure crc32.00)
codelist/crc32.00
And its implementation, which also includes a brief description of the algorithm, is shown in Figure crc32.01.
CRC32 class (from quantum\crc32.cpp) (Figure crc32.01)
codelist/crc32.01
What are the limits on the maximum capacity of a dynamic hashing array? As it happens, there are two capacity parameters of the quantum file access method that we can adjust without affecting the implementation very much: BlockSize, which specifies how large each quantum is, and the maximum number of blocks allowed in the file, set by the MaxFileQuantumCount const in blocki.h. In the current implementation, BlockSize is set to 16K, which means we need 14 bits to specify the location of an item in a quantum. Since an ItemIndex (Figure blocki.08) uses a 16-bit word to hold the offset, we could increase the BlockSize to as much as 64K bytes. Let's take a look at the advantages and disadvantages of increasing the quantum size.
Suppose we increase the size of each quantum, for example, to 32K bytes from 16K. It's easy to see that this would double the maximum capacity of the file. What may not be so obvious is that this change would also decrease the memory requirements for efficient file access via dynamic hashing, for a given number of records in the file. To see why this is so, we have to look at the typical usage of disk buffers when looking up a string by key in the dynamic hashing algorithm.
Suppose we want to find the record that has the key 609643342. The algorithm calculates which hash slot points to the storage element in which a record with that key would be stored. It then calls the quantum file access routine GetModifiableElement to retrieve that storage element. GetModifiableElement retrieves the big pointer array block for the array that the storage elements are kept in; then it retrieves the little pointer array block for the correct storage element; and finally it gets the block where the storage element is stored, retrieves it from the block, and returns it. The dynamic hashing algorithm then searches the storage element to find the key we specified and, if it is found, extracts the record we want.14 So a total of three blocks are accessed for each retrieval of a string: the big pointer array block, a little pointer array block, and the final "leaf" block. The first of these blocks is referenced on every string retrieval, so it is almost certain to be in memory. The "leaf" block, on the other hand, is not very likely to be in memory, since the purpose of the hashing algorithm is to distribute the data as evenly as possible over the file: with a reasonably large file, most "leaf" accesses aren't going to be to one of the relatively few blocks we can keep in memory.
Fortunately, this pessimistic outlook does not apply to the second block retrieved, the little pointer array block. If we have 500,000 strings in our file, there are 83,333 storage elements that the quantum file algorithm has to deal with. With a 16K-byte block, approximately 4000 little pointer elements fit in each little pointer block, so the little pointer blocks would occupy 21 buffers.
Let's look at the situation if we go to 32K-byte blocks. If we double the number of strings in the average storage element to 12 so that the likelihood of an overflow would remain approximately the same as before, then the number of little pointer array elements needed to access a given number of records is halved; in addition, the number of little pointer array elements that fit in each little pointer block is doubled.15 This means that to be fairly sure that the little pointer block we need will be present in memory, instead of 21 blocks taking up about 350K of memory, we need only 6 blocks taking up about 200K of memory.
In the case of dynamic hashing, this effect greatly alleviates the primary drawback of increasing the block size, which is that the number of blocks that can be held in a given amount of memory is inversely proportional to the size of each block. In the general case, however, this reduction in the number of buffers can hurt the performance of the program; the system can "thrash" if the working set of data needed at any given time needs more buffers than are available.
The only other apparent drawback of increasing the size of the quanta is that the free space codes become less accurate, since the code remains fixed in size at one byte; with a 32K block, each increment in the size code represents 128 bytes. However, I doubt that this will cause any significant problems with space utilization.
The other fairly simple way to increase the capacity of the system is to increase the number of blocks that can be addressed. The ItemReference class (Figure blocki.12) defines objects that take up four bytes each, 20 bits for the m_QuantumNumber field and 12 bits for the m_RelativeItemNumber field.
ItemReference class (from quantum\blocki.h) (Figure blocki.12)
codelist/blocki.12
If we wanted to increase the number of quanta from its current maximum of 10,000 (set by the MaxFileQuantumCount const in blocki.h), we could increase it up to 1024K without changing the size of the m_QuantumNumber field.
The main drawback of increasing the maximum block count is that the free space list gets bigger; in fact, that's the reason that I've set the maximum file quantum count to 10,000 despite the fact that the quantum number field in the item reference class can handle a file with far more quanta. However, if our application needs so much data that a 160-MB maximum file size is too small, the extra space taken up by a larger free space list probably isn't an obstacle.
In this chapter, we have used the quantum file access method as the base for a disk-based variant of Larson's dynamic hashing. This algorithm provides efficient hash-coded access by key to a very large amount of variable-length textual data, while eliminating the traditional drawbacks of hashing, especially the need to specify the maximum size of the file in advance.
In the final chapter, we will summarize the algorithms we have covered in this book and discuss some other resources we can use to improve the efficiency of our programs.
Problems
1. How could the PersistentArrayUlong class be generalized to other data types? (You can find suggested approaches to problems in Chapter artopt.htm).
1. Let's call the value to be hashed x.
2. y = x >> 4 [divide by 16]
3. z = y << 4 [multiply result by 16; result must have low four bits = 0]
4. x % 16 = x - z [definition of "remainder"; result is low four bits of x]
Obviously, an analogous result holds true for the remainder after dividing by 8; therefore, the low three bits of these two values must be the same.
The GetString and PutString functions of the DynamicHashArray class increment the element index before using it. Thus, the request to read or update element 0 is translated into an operation on element 1, and so forth. To access element 0, the index is specified as -1.
This two-stage approach allows us to initialize the object via the Open function once we know the actual parameters; without separating construction and initialization, we would not have that capability.