In this chapter we will use a supermarket price lookup system to illustrate how to save storage by using a restricted character set and how to speed up access to records by employing hash coding (or "scatter storage") and caching (or keeping copies of recently accessed records in memory). We will look items up by their UPC (Universal Product Code), which is printed in the form of a "bar code" on virtually all supermarket items other than fresh produce. We will emphasize rapid retrieval of prices, as maintenance of such files is usually done after hours, when speed would be less significant.
Algorithms discussed: Hash Coding, Radix40 Data Representation, BCD Data Representation, Caching
To begin, let us assume that we can describe each item by the information in the structure definition in Figure initstruct.
Item information structure definition: a ten-character UPC code, a 21-character description, and the price (Figure initstruct)
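The figure itself is not reproduced here, but a structure matching the field sizes given later in this chapter (ten characters of UPC, 21 of description, and four bytes of price) might look something like the following sketch; the member types are my assumptions, not necessarily those of the original figure.

typedef struct {
    char upc[10];          /* ten decimal digits, one ASCII character per digit */
    char description[21];  /* item description */
    float price;           /* item price; a four-byte field (type assumed) */
} ItemRecord;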
Binary search of a 10,000-item file (Figure binary.search); the columns are: Number of accesses, Number of newly accessible records, Total accesses to find all records
Average number of accesses per record = 12.3631 accesses/record
Figure binary.search shows the calculation of the average number of accesses for a 10,000 item file. Notice that each line represents twice the number of records as the one above, with the exception of line 14. The entry for that line (1809) is the number of 14-access records needed to reach the capacity of our 10,000 record file. As you can see, the average number of accesses is approximately 12.4 per record. Therefore, at a typical hard disk speed of 10 milliseconds per access, we would need almost 125 milliseconds to look up an average record using a binary search. While this lookup time might not seem excessive, remember that a number of checkout terminals would probably be attempting to access the database at the same time, and the waiting time could become noticeable. We might also be concerned about the amount of wear on the disk mechanism that would result from this approach.
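To see where that 12.3631 figure comes from, here is a small sketch of my own (not code from the book) that reproduces the calculation: each additional probe of a binary search doubles the number of newly reachable records, and the last level is truncated so that exactly 10,000 records are accounted for.

#include <cstdio>

int main()
{
    const long file_size = 10000;   /* total records in the file */
    long reachable = 0;             /* records accounted for so far */
    long total_accesses = 0;        /* accesses summed over all records */
    long newly_reachable = 1;       /* records reachable with exactly k accesses */

    for (int accesses = 1; reachable < file_size; accesses++)
    {
        if (reachable + newly_reachable > file_size)
            newly_reachable = file_size - reachable;   /* 1809 on line 14 */
        reachable += newly_reachable;
        total_accesses += (long)accesses * newly_reachable;
        newly_reachable *= 2;
    }

    printf("Average accesses per record = %.4f\n",
        (double)total_accesses / file_size);           /* prints 12.3631 */
    return 0;
}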
Before we try to optimize our search, let us define some terms. There are two basic categories of storage devices, distinguished by the access they allow to individual records. The first type is sequential access;1 in order to read record 1000 from a sequential device, we must read records 1 through 999 first, or at least skip over them. The second type is direct access; on a direct access device, we can read record 1000 without going past all of the previous records. However, only some direct access devices allow nonsequential accesses without a significant time penalty; these are called random access devices. Unfortunately, disk drives are direct access devices, but not random access ones. The amount of time it takes to get to a particular data record depends on how close the read/write head is to the desired position; in fact, sequential reading of data may be more than ten times as fast as random access.
Is there a way to find a record in a large file with an average of about one nonsequential access? Yes; in fact, there are several such methods, varying in complexity. They are all variations on hash coding, or address calculation; as you will see, such methods actually can be implemented quite simply, although for some reason they have acquired a reputation for mystery.
Let's start by considering a linear or sequential search. That is, we start at the beginning of the file and read each record in the file until we find the one we want (because its key is the same as the key we are looking for). If we get to the end of the file without finding a record with the key we are looking for, the record isn't in the file. This is certainly a simple method, and indeed is perfectly acceptable for a very small file, but it has one major drawback: the average time it takes to find a given record increases every time we add another record. If the file gets twice as big, it takes twice as long to find a record, on the average. So this seems useless.
But what if, instead of having one big file, we had many little files, each with only a few records in it? Of course, we would need to know which of the little files to look in, or we wouldn't have gained anything. Is there any way to know that?
Let's see if we can find a way. Suppose that we have 1000 records to search through, keyed by telephone number. To speed up the lookup, we have divided the records into 100 subfiles, averaging 10 numbers each. We can use the last two digits of the telephone number to decide which subfile to look in (or to put a new record in), and then we have to search through only the records in that subfile. If we get to the end of the subfile without finding the record we are looking for, it's not in the file. That's the basic idea of hash coding.
But why did we use the last two digits, rather than the first two? Because they will probably be more evenly distributed than the first two digits. Most of the telephone numbers on your list probably fall within a few telephone exchanges near where you live (or work). For example, suppose my local telephone book contained a lot of 758 and 985 numbers and very few numbers from other exchanges. Therefore, if I were to use the first two digits for this hash coding scheme, I would end up with two big subfiles (numbers 75 and 98) and 98 smaller ones, thus negating most of the benefit of dividing the file. You see, even though the average subfile size would still be 10, about 90% of the records would be in the two big subfiles, which would have perhaps 450 records each. Therefore, the average search for 90% of the records would require reading 225 records, rather than the five we were planning on. That is why it is so important to get a reasonably even distribution of the data records in a hash-coded file.
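As a sketch of this idea (my own illustration, not code from the book), selecting a subfile by the last two digits of a phone number might look like this:

#include <cstring>

/* Pick one of 100 subfiles using the last two digits of the phone number,
   which are usually far more evenly distributed than the first two.
   Assumes the number contains at least two characters, all digits. */
int subfile_for(const char *phone_number)
{
    size_t length = strlen(phone_number);
    int tens = phone_number[length - 2] - '0';
    int ones = phone_number[length - 1] - '0';
    return tens * 10 + ones;      /* subfile number, 0 through 99 */
}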
It is inconvenient to have 100 little files lying around, and the time required to open and close each one as we need it makes this implementation inefficient. But there's no reason we couldn't combine all of these little files into one big one and use the hash code to tell us where we should start looking in the big file. That is, if we have a capacity of 1000 records, we could use the last two digits of the telephone number to tell us which "subfile" we need of the 100 "subfiles" in the big file (records 0-9, 10-19 ... 980-989, 990-999). To help visualize this, let's look at a smaller example: 10 subfiles having a capacity of four telephone numbers each and a hash code consisting of just the last digit of the telephone number (Figure initfile).
Subfile #
+-----------+-----------+-----------+-----------+
0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
1 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
2 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
How do we know whether a given entry in the file contains a real record or has never been used? One way is to add a "valid-data" flag to every entry in the file, which is initialized to "I" (for invalid) in the entries in Figure initfile, and set to "valid" (indicated by a "V" in that same position) as we store data in each entry. Then if we get to an invalid record while looking up a record in the file, we know that we are at the end of the subfile and therefore the record is not in the file (Figure distinct).
Subfile #
+-----------+-----------+-----------+-----------+
0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
1 | V 9876541 | V 2323231 | V 9898981 | I 0000000 |
+-----------+-----------+-----------+-----------+
2 | V 2345432 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
For example, if we are looking for the number "9898981", we start at the beginning of subfile 1 in Figure distinct (because the number ends in 1), and examine each record from there on. The first two entries have the numbers "9876541" and "2323231", which don't match, so we continue with the third one, which is the one we are looking for. But what if we were looking for "9898971"? Then we would go through the first three entries without finding a match. The fourth entry is "I 0000000", which is an invalid entry. This is the marker for the end of this subfile, so we know the number we are looking for isn't in the file.
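A sketch of that lookup logic, assuming an in-memory array of fixed-size subfiles and a hypothetical Entry layout of my own (the real program works with a disk file, as we will see later):

#include <cstring>

const int SUBFILE_CAPACITY = 4;

struct Entry
{
    char valid;       /* 'V' for a real record, 'I' for a never-used slot */
    char number[8];   /* seven-digit phone number plus terminating null */
};

/* Search one subfile for a phone number; return its position within the
   subfile, or -1 if it isn't there. An invalid entry marks the end of the
   subfile, so the search can stop as soon as one is encountered. */
int find_in_subfile(const Entry *subfile, const char *wanted)
{
    for (int i = 0; i < SUBFILE_CAPACITY; i++)
    {
        if (subfile[i].valid == 'I')
            return -1;                             /* end of subfile */
        if (strcmp(subfile[i].number, wanted) == 0)
            return i;                              /* found it */
    }
    return -1;                                     /* subfile full, not found */
}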
Now let's add another record to the file with the phone number "1212121". As before, we start at the beginning of subfile 1, since the number ends in 1. Although the first three records are already in use, the fourth (and last) record in the subfile is available, so we store our new record there, resulting in the situation in Figure merged.
Subfile #
+-----------+-----------+-----------+-----------+
0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
1 | V 9876541 | V 2323231 | V 9898981 | V 1212121 |
+-----------+-----------+-----------+-----------+
2 | V 2345432 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
But what happens when we try to add a record to a subfile that is already full? To answer that question, we have to see what would happen if we had added another record that belonged in subfile 1. There are a number of possible ways to handle this situation, but most of them are appropriate only for memory-resident files. As I mentioned above, reading the next record in a disk file is much faster than reading a record at a different place in the file. Therefore, for disk-based data, the most efficient place to put "overflow" records is in the next open place in the file, which means that adding a record with phone number "1234321" to this file would result in the arrangement in Figure overflow.2
Subfile #
+-----------+-----------+-----------+-----------+
0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
1 | V 9876541 | V 2323231 | V 9898981 | V 1212121 |
+-----------+-----------+-----------+-----------+
2 | V 2345432 | V 1234321 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
+-----------+-----------+-----------+-----------+
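A sketch of adding a record with this overflow rule (again my own illustration, using the same hypothetical Entry layout as before): if the record's own subfile is full, the new record spills into the next open entry in the file.

#include <cstring>

const int SUBFILE_CAPACITY = 4;
const int SUBFILE_COUNT = 10;
const int FILE_ENTRIES = SUBFILE_CAPACITY * SUBFILE_COUNT;

struct Entry
{
    char valid;       /* 'V' for a real record, 'I' for a never-used slot */
    char number[8];   /* seven-digit phone number plus terminating null */
};

/* Store a phone number, starting at the beginning of its subfile and
   spilling over into the following entries of the file if necessary.
   Returns the position used, or -1 if no open entry remains. (This small
   sketch does not wrap around to the start of the file; the final version
   of the algorithm does.) */
int add_number(Entry *file, const char *number)
{
    int subfile = number[strlen(number) - 1] - '0';   /* hash on last digit */

    for (int position = subfile * SUBFILE_CAPACITY;
         position < FILE_ENTRIES; position++)
    {
        if (file[position].valid == 'I')
        {
            file[position].valid = 'V';
            strcpy(file[position].number, number);
            return position;
        }
    }
    return -1;
}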
When considering these less-than-ideal characteristics, we should remember that other search methods have their own disadvantages, particularly in speed of lookup. All in all, the disadvantages of such a simply implemented access method seem rather minor in comparison with its benefits, at least for the current application. In addition, recent innovations in hashing have made it possible to improve the flexibility of SDBH methods by fairly simple changes. For example, the use of "last-come, first-served" hashing, which stores a newly added record exactly where the hash code indicates, greatly reduces the maximum time needed to find any record; it also makes it possible to determine for any given file what that maximum is, thus removing one of the barriers to using SDBH in a time-critical application.3 Even more recently, a spectacular advance in the state of the art has made it possible to increase the file capacity incrementally, as well as to delete records efficiently. Chapter dynhash.htm provides a C++ implementation of this remarkable innovation in hashing.
Of course, even one random disk access takes a significant amount of time, from the computer's point of view. Wouldn't it be better to avoid accessing the disk at all?
While that would result in the fastest possible lookup, it would require us to have the entire database in memory, which is usually not feasible. However, if we had some of the database in memory, perhaps we could eliminate some of the disk accesses.
A cache is a portion of a large database that we want to access, kept in a type of storage that can be accessed more rapidly than the type used to store the whole database. For example, if your system has an optical disk which is considerably slower than its hard disk, a portion of the database contained on the optical disk may be kept on the hard disk. Why don't we just use the hard disk for the whole database? The optical disk may be cheaper per megabyte or may have more capacity, or the removability and long projected life span of the optical diskettes may be the major reason. Of course, our example of a cache uses memory to hold a portion of the database that is stored on the hard disk, but the principle is the same: memory is more expensive than disk storage, and with certain computers, it may not be possible to install enough memory to hold a copy of the entire database even if price were no object.
If a few items account for most of the transactions, the use of a cache speeds up access to those items, since they are likely to be among the most recently used records. This means that if we keep a number of the most recently accessed records in memory, we can reduce the number of disk accesses significantly. However, we have to have a way to locate items in the cache quickly: this problem is very similar to a hash-coded file lookup, except that we have more freedom in deciding how to handle overflows, where the entry we wish to use is already being used by another item. Since the cache is only a copy of data that is also stored elsewhere, we don't have to be so concerned about what happens to overflowing entries; we can discard them if we wish to, since we can always get fresh copies from the disk.
The simplest caching lookup algorithm is a direct-mapped cache. That means each key corresponds to one and only one entry in the cache. In other words, overflow is handled by overwriting the previous record in that space with the new entry. The trouble with this method is that if two (or more) commonly used records happen to map to the same cache entry, only one of them can be in the cache at one time. This requires us to go back to the disk repeatedly for these conflicting records. The solution to this problem is to use a multiway associative cache. In this algorithm, each key corresponds to a cache line, which contains more than one entry. In our example, it contains eight entries. Therefore, if a number of our records have keys that map to the same line, up to eight of them can reside in the cache simultaneously. How did I decide on an eight-way associative cache? By trying various line sizes until I found the one that yielded the greatest performance.
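The distinction can be sketched as follows (the constants here are hypothetical values of my own, although MAPPING_FACTOR is the name the program actually uses for the number of entries per line): a direct-mapped cache turns the key into exactly one entry index, while an eight-way associative cache turns it into a line of eight candidate entries.

const int CACHE_LINES = 128;                       /* assumed line count */
const int MAPPING_FACTOR = 8;                      /* eight-way associative */
const int CACHE_ENTRIES = CACHE_LINES * MAPPING_FACTOR;

/* Direct-mapped: each key hash selects exactly one cache entry, so two
   frequently used keys that collide keep evicting each other. */
int direct_mapped_entry(unsigned key_hash)
{
    return key_hash % CACHE_ENTRIES;
}

/* Eight-way associative: the key hash selects a line, and the record may
   occupy any of the eight entries in that line. */
int line_first_entry(unsigned key_hash)
{
    return (key_hash % CACHE_LINES) * MAPPING_FACTOR;
}

int line_last_entry(unsigned key_hash)
{
    return line_first_entry(key_hash) + MAPPING_FACTOR - 1;
}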
The performance of a disk caching system is defined by its hit
ratio, or the proportion of accesses to the disk that are avoided by
using the cache. In order to estimate the performance of this caching
algorithm, we have to quantify our assumptions. I have written a
program called stestgen.cpp
to generate test keys in
which 20% of the items account for 80% of the database accesses (Figure
stestgen.00), along with another
called supinit.cpp
that initializes the database (Figure supinit.00), and a third called supert.cpp
to read the test keys and look them up (Figure supert.00). The results of this
simulation indicated that, using an eight-way associative cache,
approximately 44% of the disk accesses that would be needed in a
noncached system could be eliminated.
codelist/stestgen.00
codelist/supinit.00
codelist/supert.00
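The test programs themselves are in the listings above; just to illustrate the 80/20 idea, a key generator along these lines (my sketch, not the actual stestgen.cpp) could pick items like this:

#include <cstdlib>

/* Return the index of an item to access: 80% of the time pick one of the
   "popular" 20% of the items, otherwise pick one of the remaining items.
   Assumes item_count is at least 5. */
int pick_test_item(int item_count)
{
    int popular_count = item_count / 5;            /* the hot 20% */
    if (rand() % 100 < 80)
        return rand() % popular_count;
    else
        return popular_count + rand() % (item_count - popular_count);
}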
Figure line.size shows the results of my experimentation with various line sizes.
Line size   Hit ratio
2           .38
4           .42
8           .44
16          .43
The line size is defined by the constant MAPPING_FACTOR
in superm.h
(Figure superm.00a).
codelist/superm.00a
Now that we have added a cache to our optimization arsenal, only three more changes are necessary to reach the final lookup algorithm that we will implement. The first is to shrink each of the subfiles to one entry. That is, we will calculate a record address rather than a subfile address when we start trying to add or look up a record. This tends to reduce the length of the search needed to find a given record, as each record (on average) will have a different starting position, rather than a number of records having the same starting position, as is the case with longer subfiles.
The second change is that, rather than having a separate flag to indicate whether a record position in the file is in use, we will create an "impossible" key value to mean that the record is available. Since our key will consist only of decimal digits (compressed to two digits per byte), we can set the first digit to 0xf (the hex representation for 15), which cannot occur in a genuine decimal number. This will take the place of our "invalid" flag, without requiring extra storage in the record.
Finally, we have to deal with the possibility that the search for a record will encounter the end of the file, because the last position in the file is occupied by another record. In this case, we will wrap around to the beginning of the file and keep looking. In other words, position 0 in the file will be considered to follow immediately after the last position in the file.
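Taken together, those three changes might be sketched like this (helper names of my own; the real code appears in the listings discussed below): the key itself marks empty records, and probing wraps from the last file position back to position 0.

#include <cstdint>

const int FILE_SIZE = 12500;        /* assumed number of record positions */

/* A record whose first (high-order) BCD digit is 0xf is an empty position;
   no separate valid-data flag is needed. */
bool record_is_empty(const uint8_t *bcd_key)
{
    return (bcd_key[0] >> 4) == 0xf;
}

/* Convert a starting record address plus a probe count into a file
   position, wrapping around to position 0 after the last record. */
int probe_position(int starting_index, int probe)
{
    return (starting_index + probe) % FILE_SIZE;
}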
Now that we have decided on our lookup algorithm, we can shift our attention to reducing the amount of storage required for each record in our supermarket price lookup program. Without any special encoding, the disk storage requirements for one record would be 35 bytes (10 for the UPC, 21 for the description, and 4 for the price). For a file of 10,000 items, this would require 350 Kbytes; allowing 25% extra space to prevent the hashing from getting too slow means that the file would end up taking up about 437 Kbytes. For this application, disk storage space would not be a problem; however, the techniques we will use to reduce the file size are useful in many other applications as well. Also, searching a smaller file is likely to be faster, because the heads have to move a shorter distance on the average to get to the record where we are going to start our search.
If you look back at Figure initstruct,
you will notice that the upc
field is ten characters
long. Using the ASCII code for each digit, which is the usual
representation for character data, takes one byte per digit or 10 bytes
in all. I mentioned above that we would be using a limited character
set to reduce the size of the records. UPC codes are limited to the
digits 0 through 9; if we pack two digits into one byte, by using four
bits to represent each digit, we can cut that down to five bytes for
each UPC value stored. Luckily, this is quite simple, as you will see
when we discuss the BCD
(binary-coded decimal) conversion
code below. The other data compression method we will employ is to
convert the item descriptions from strings of ASCII characters, limited
to the sets 0-9, A-Z, and the special characters comma, period, minus,
and space, to Radix40
representation, mentioned in
Chapter prologue.htm. The main difference
between Radix40
conversions and those for BCD
is that in the former case we need to represent 40 different
characters, rather than just the 10 digits, and therefore the packing
of data must be done in a slightly more complicated way than just using
four bits per character.
Now that we have covered the optimizations that we will use in our
price lookup system, it's time to go through the code that implements
these algorithms. This specific implementation is set up to handle a
maximum of FILE_CAPACITY
items, defined in superm.h
(Figure superm.00a).5
Each of these items, as defined in the ItemRecord
structure in the same file, has a price, a description, and a key,
which is the UPC code. The key would be read in by a bar-code scanner
in a real system, although our test program will read it in from the
keyboard.
Several of the fields in the ItemRecord
structure
definition require some explanation, so let's take a closer look at
that definition, shown in Figure superm.00.
ItemRecord
struct definition (from superm\superm.h) (Figure superm.00)codelist/superm.00
The upc
field is defined as a BCD
(binary-coded decimal) value of ASCII_KEY_SIZE
digits
(contained in BCD_KEY_SIZE
bytes). The description
field is defined as a Radix40
field DESCRIPTION_WORDS
in size; each of these words contains three Radix40
characters.
A BCD
value is stored as two digits per byte, each
digit being represented by a four-bit code between 0000(0) and 1001(9).
Function ascii_to_BCD
in bcdconv.cpp
(Figure bcdconv.00) converts a
decimal number, stored as ASCII digits, to a BCD
value by
extracting each digit from the input argument and subtracting the code
for '0' from the digit value; BCD_to_ascii
(Figure bcdconv.01) does the reverse.
codelist/bcdconv.00
codelist/bcdconv.01
A UPC code is a ten-digit number between 0000000000 and 9999999999,
which unfortunately is too large to fit in a long integer of 32 bits.
Of course, we could store it in ASCII, but that would require 10 bytes
per UPC code. So BCD
representation saves five bytes per
item compared to ASCII.
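The actual conversion functions are in the listings referenced below; the essential step, packing two decimal digits into each byte, can be sketched like this (my illustration):

#include <cstring>

/* Pack an ASCII string of decimal digits into BCD, two digits per byte,
   with the earlier digit in the upper four bits. Assumes an even number of
   digits and an output buffer of at least strlen(ascii)/2 bytes. */
void pack_bcd(const char *ascii, unsigned char *bcd)
{
    size_t digits = strlen(ascii);
    for (size_t i = 0; i < digits; i += 2)
        bcd[i / 2] = (unsigned char)(((ascii[i] - '0') << 4) | (ascii[i + 1] - '0'));
}

/* Unpack BCD bytes back into an ASCII digit string; the caller supplies
   room for 2 * byte_count digits plus a terminating null. */
void unpack_bcd(const unsigned char *bcd, size_t byte_count, char *ascii)
{
    for (size_t i = 0; i < byte_count; i++)
    {
        ascii[2 * i]     = (char)('0' + (bcd[i] >> 4));
        ascii[2 * i + 1] = (char)('0' + (bcd[i] & 0x0f));
    }
    ascii[2 * byte_count] = '\0';
}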
A Radix40
field, as mentioned above, stores three
characters (from a limited set of possibilities) in 16 bits. This
algorithm (like some other data compression techniques) takes advantage
of the fact that the number of bits required to store a character
depends on the number of distinct characters to be represented.6 The BCD
functions described
above are an example of this approach. In this case, however, we need
more than just the 10 digits. If our character set can be limited to 40
characters (think of a Radix40
value as a "number" in
base 40), we can fit three of them in 16 bits, because 40^3 is less than 2^16.
Let's start by looking at the header file for the Radix40
conversion functions, which is shown in Figure radix40.00a.
Radix40
conversion
(superm\radix40.h) (Figure radix40.00a)codelist/radix40.00a
The legal_chars
array, shown in Figure radix40.00 defines the characters
that can be expressed in this implementation of Radix40
.7 The variable weights
contains
the multipliers to be used to construct a two-byte Radix40
value from the three characters that we wish to store in it.
legal_chars
array (from superm\radix40.cpp) (Figure radix40.00)codelist/radix40.00
As indicated in the comment at the beginning of the ascii_to_radix40
function (Figure radix40.01), the
job of that function is to convert a null-terminated ASCII character
string to Radix40
representation. After some
initialization and error checking, the main loop begins by incrementing
the index to the current word being constructed, after every third
character is translated. It then translates the current ASCII character
by indexing into the lookup_chars
array, which is shown
in Figure radix40.02. Any
character that translates to a value with its high bit set is an
illegal character and is converted to a hyphen; the result
flag is changed to S_ILLEGAL
if this occurs.
ascii_to_radix40
function (from
superm\radix40.cpp) (Figure radix40.01)codelist/radix40.01
lookup_chars
array (from superm\radix40.cpp) (Figure radix40.02)codelist/radix40.02
In the line radix40_data[current_word_index] +=
weights[cycle] * j;
, the character is added into the current
output word after being multiplied by the power of 40 that is
appropriate to its position. The first character in a word is
represented by its position in the legal_chars
string.
The second character is represented by 40 times that value and the
third by 1600 times that value, as you would expect for a base-40
number.
The complementary function radix40_to_ascii
(Figure radix40.03) decodes each character
unambiguously. First, the current character is extracted from the
current word by dividing by the weight appropriate to its position;
then the current word is updated so the next character can be
extracted. Finally, the ASCII value of the character is looked up in
the legal_chars
array.
radix40_to_ascii
function (from
superm\radix40.cpp) (Figure radix40.03)codelist/radix40.03
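Reduced to the bare arithmetic (my sketch; the real functions also translate between ASCII and the legal_chars/lookup_chars tables), packing and unpacking three character codes with the weights 1, 40, and 1600 looks like this:

/* Pack three character codes (each 0-39, i.e., positions in the legal
   character set) into one 16-bit word; 40 * 40 * 40 = 64,000 fits within
   the 65,536 values a 16-bit word can hold. */
unsigned pack_radix40(unsigned first, unsigned second, unsigned third)
{
    return first * 1 + second * 40 + third * 1600;
}

/* Recover the three character codes by dividing out the weights, starting
   with the highest-weighted character. */
void unpack_radix40(unsigned word, unsigned *first, unsigned *second, unsigned *third)
{
    *third = word / 1600;
    word -= *third * 1600;
    *second = word / 40;
    *first = word - *second * 40;
}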
Now that we have examined the user-defined types used in the ItemRecord
structure, we can go on to the PriceFile
structure, which
is used to keep track of the data for a particular price file.8 The best way to learn about this structure
is to follow the program as it creates, initializes, and uses it. The
function main
, which is shown in Figure superm.01, after checking that it was
called with the correct number of arguments, calls the initialize_price_file
function (Figure suplook.00) to
set up the PriceFile
structure.
main
function (from superm\superm.cpp) (Figure superm.01)codelist/superm.01
initialize_price_file
function (from
superm\suplook.cpp) (Figure suplook.00)codelist/suplook.00
The initialize_price_file
function allocates storage
for and initializes the PriceFile
structure, which is
used to control access to the price file. This structure contains
pointers to the file, to the array of cached records that we have in
memory, and to the array of record numbers of those cached records. As
we discussed earlier, the use of a cache can reduce the amount of time
spent reading records from the disk by maintaining copies of a number
of those records in memory, in the hope that they will be needed again.
Of course, we have to keep track of which records we have cached, so
that we can tell whether we have to read a particular record from the
disk or can retrieve a copy of it from the cache instead.
When execution starts, we don't have any records cached; therefore,
we initialize each entry in these arrays to an "invalid" state (the key
is set to INVALID_BCD_VALUE
). If file_mode
is set to CLEAR_FILE
, we write such an "invalid" record
to every position in the price file as well, so that any old data left
over from a previous run is erased.
Now that access to the price file has been set up, we can call the process
function (Figure superm.02). This
function allows us to enter items and/or look up their prices and
descriptions, depending on mode
.
process
function (from superm\superm.cpp) (Figure superm.02)codelist/superm.02
First, let's look at entering a new item (INPUT_MODE
).
We must get the UPC code, the description, and the price of the item.
The UPC code is converted to BCD
, the description to Radix40
,
and the price to unsigned. Then we call write_record
(Figure suplook.01) to add the
record to the file.
write_record
function (from superm\suplook.cpp) (Figure suplook.01)codelist/suplook.01
In order to write a record to the file, write_record
calls lookup_record_number
(Figure suplook.02) to determine where the
record should be stored so that we can retrieve it quickly later. The lookup_record_number
function does almost the same thing as lookup_record
(Figure suplook.03), except that
the latter returns a pointer to the record rather than its number.
Therefore, they are implemented as calls to a common function: lookup_record_and_number
(Figure suplook.04).
lookup_record_number
function (from
superm\suplook.cpp) (Figure suplook.02)codelist/suplook.02
lookup_record
function (from superm\suplook.cpp) (Figure suplook.03)codelist/suplook.03
lookup_record_and_number
function (from
superm\suplook.cpp) (Figure suplook.04)codelist/suplook.04
After a bit of setup code, lookup_record_and_number
determines whether the record we want is already in the cache, in which
case we don't have to search the file for it. To do this, we call compute_cache_hash
(Figure suplook.05), which in turn
calls compute_hash
(Figure suplook.06) to do most of the work of
calculating the hash code.
compute_cache_hash
function (from
superm\suplook.cpp) (Figure suplook.05)codelist/suplook.05
compute_hash
function (from superm\suplook.cpp) (Figure suplook.06)codelist/suplook.06
This may look mysterious, but it's actually pretty simple. After
clearing the hash code we are going to calculate, it enters a loop that
first shifts the old hash code one (decimal) place to the left, end
around, then adds the low four bits of the next character from the key
to the result. When it finishes this loop, it returns to the caller, in
this case compute_cache_hash
. How did I come up with this
algorithm?
Well, as you will recall from our example of looking up a telephone number, the idea of a hash code is to make the most of variations in the input data, so that there will be a wide distribution of "starting places" for the records in the file. If all the input values produced the same hash code, we would end up with a linear search again, which would be terribly slow. In this case, our key is a UPC code, which is composed of decimal digits. If each of those digits contributes equally to the hash code, we should be able to produce a fairly even distribution of hash codes, which are the starting points for searching through the file for each record. As we noted earlier, this is one of the main drawbacks of hashing: the difficulty of coming up with a good hashing algorithm. After analyzing the nature of the data, you may have to try a few different algorithms with some test data, until you get a good distribution of hash codes. However, the effort is usually worthwhile, since you can often achieve an average of slightly over one disk access per lookup (assuming that several records fit in one physical disk record).
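A sketch of a hash function along those lines (a simplification of my own, not necessarily the exact compute_hash in the listing; in particular, the real function does an end-around decimal shift, while here ordinary unsigned overflow is simply allowed to wrap):

#include <cstddef>

/* For each key character, shift the running hash one decimal place to the
   left and add in the character's low four bits (its BCD digit value). */
unsigned hash_key(const unsigned char *key, size_t key_length)
{
    unsigned hash_code = 0;
    for (size_t i = 0; i < key_length; i++)
        hash_code = hash_code * 10 + (key[i] & 0x0f);
    return hash_code;
}

/* The starting positions for searching are then the hash reduced modulo
   the number of cache entries or file records, respectively. */
unsigned starting_cache_index(unsigned hash_code, unsigned cache_entries)
{
    return hash_code % cache_entries;
}

unsigned starting_file_index(unsigned hash_code, unsigned file_records)
{
    return hash_code % file_records;
}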
Meanwhile, back at compute_cache_hash
, we convert the
result of compute_hash
, which is an unsigned
value, into an index into the cache. This is then returned to lookup_record_and_number
as the starting cache index. As mentioned above, we are using an
eight-way associative cache, in which each key can be stored in any of
eight entries in a cache line. This means that we need to know
where the line starts, which is computed by compute_starting_cache_hash
(Figure suplook.07) and where it
ends, which is computed by compute_ending_cache_hash
(Figure suplook.08).9
compute_starting_cache_hash
function (from
superm\suplook.cpp) (Figure suplook.07)codelist/suplook.07
compute_ending_cache_hash
function (from
superm\suplook.cpp) (Figure suplook.08)codelist/suplook.08
After determining the starting and ending positions where the key
might be found in the cache, we compare the key in each entry to the
key that we are looking for, and if they are equal, we have found the
record in the cache. In this event, we set the value of the record_number
argument to the file record number for this cache entry, and return
with the status set to FOUND
.
Otherwise, the record isn't in the cache, so we will have to look
for it in the file; if we find it, we will need a place to store it in
the cache. So we pick a "random" entry in the line (cache_replace_index
)
by calculating the remainder after dividing the number of accesses we
have made by the MAPPING_FACTOR
. This will generate an
entry index between 0 and the highest entry number, cycling through all
the possibilities on each successive access, thus not favoring a
particular entry number.
However, if the line has an invalid entry (where the key is INVALID_BCD_VALUE
),
we should use that one, rather than throwing out a real record that
might be needed later. Therefore, we search the line for such an empty
entry, and if we are successful, we set cache_replace_index
to its index.
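That replacement choice can be sketched as follows (argument names are mine; MAPPING_FACTOR is the real constant for the number of entries per line):

const int MAPPING_FACTOR = 8;   /* entries per cache line */

/* Choose which entry of a cache line to replace: reuse an empty entry if
   there is one, otherwise cycle through the entries using the running
   access count so that no particular entry is favored. */
int choose_replacement(const bool entry_in_use[MAPPING_FACTOR],
                       unsigned long access_count)
{
    for (int i = 0; i < MAPPING_FACTOR; i++)
        if (!entry_in_use[i])
            return i;
    return (int)(access_count % MAPPING_FACTOR);
}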
Next, we calculate the place to start looking in the file, via compute_file_hash
,
(Figure suplook.09), which is very
similar to compute_cache_hash
except that it uses the FILE_SIZE
constant in superm.h
(Figure superm.00a) to calculate the index
rather than the CACHE_SIZE
constant, as we want a
starting index in the file rather than in the cache.
compute_file_hash
function (from
superm\suplook.cpp) (Figure suplook.09)codelist/suplook.09
As we noted above, this is another of the few drawbacks of this hashing method: the size of the file must be decided in advance, rather than being adjustable as data is entered. The reason is that to find a record in the file, we must be able to calculate its approximate position in the file in the same manner as it was calculated when the record was stored. The calculation of the hash code is designed to distribute the records evenly throughout a file of known size; if we changed the size of the file, we wouldn't be able to find records previously stored. Of course, different files can have different sizes, as long as we know the size of the file we are operating on currently: the size doesn't have to be an actual constant as it is in our example, but it does have to be known in advance for each file.
Now we're ready to start looking for our record in the file at the
position specified by starting_file_index
. Therefore, we
enter a loop that searches from this starting position toward the end
of the file, looking for a record with the correct key. First we set
the file pointer to the first position to be read, using position_record
(Figure suplook.10), then read the
record at that position.
position_record
function (from superm\suplook.cpp) (Figure suplook.10)codelist/suplook.10
If the key in that record is the one we are looking for, our search
is successful. On the other hand, if the record is invalid, then the
record we are looking for is not in the file; when we add records to
the file, we start at the position given by starting_file_index
and store our new record in the first invalid record we find.10 Therefore, no record can overflow past an
invalid record, as the invalid record would have been used to store the
overflow record.
In either of these cases, we are through with the search, so we break out of the loop. On the other hand, if the entry is neither invalid nor the record we are looking for, we keep looking through the file until either we have found the record we want, we discover that it isn't in the file by encountering an invalid record, or we run off the end of the file. In the last case we start over at the beginning of the file.
If we have found the record, we copy it to the cache entry we've
previously selected and copy its record number into the list of record
numbers in the cache so that we'll know which record we have stored in
that cache position. Then we return to the calling function, write_record
,
with the record we have found. If we have determined that the record is
not in the file, then we obviously can't read it into the cache, but we
do want to keep track of the record number where we stopped, since that
is the record number that will be used for the record if we write it to
the file.
To clarify this whole process, let's make a file with room for only
nine records by changing FILE_SIZE
to 6 in superm.h
(Figure superm.00a). After adding
a few records, a dump looks like Figure initcondition.
Position  Key  Data
3. 0000121212: OATMEAL, 1 LB.:300
6. 0000012345: JELLY BEANS:150
Let's add a record with the key "23232" to the file. Its hash code turns out to be 3, so we look at position 3 in the file. That position is occupied by a record with key "121212", so we can't store our new record there. The next position we examine, number 4, is invalid, so we know that the record we are planning to add is not in the file. (Note that this is the exact sequence we follow to look up a record in the file as well). We use this position to hold our new record. The file now looks like Figure aftermilk.
Position  Key  Data
3. 0000121212: OATMEAL, 1 LB.:300
4. 0000023232: (the record just added)
6. 0000012345: JELLY BEANS:150
If a search reaches the last position in the file without finding either the record we want or an invalid entry, we must start over at the beginning of the file in order to continue looking. That is, position 0 is
the next logical position after the last one in the file. As it
happens, position 0 contains an invalid record, so we know that the
record we want isn't in the file.11 In
any event, we are now finished with lookup_record_and_number
.
Therefore, we return to lookup_record_number
, which
returns the record number to be used to write_record
(Figure suplook.01), along with a
status value of FILE_FULL
, FOUND
, or NOT_IN_FILE
(which is the status we want). FILE_FULL
is an error, as
we cannot add a record to a file that has reached its capacity. So is FOUND
,
in this situation, as we are trying to add a new record, not find one
that already exists. In either of these cases, we simply return the
status to the calling function, process
, (Figure superm.02), which gives an appropriate
error message and continues execution.
However, if the status is NOT_IN_FILE
, write_record
continues by positioning the file to the record number returned by lookup_record_number
,
writing the record to the file, and returns the status NOT_IN_FILE
to process
, which continues execution normally.
That concludes our examination of the input mode in process
.
The lookup mode is very similar, except that it uses the lookup_record
function (Figure suplook.03)
rather than lookup_record_number
, since it wants the
record to be returned, not just the record number. The lookup mode, of
course, also differs from the entry mode in that it expects the record
to be in the file, and displays the record data when found.
After process
terminates when the user enters "*"
instead of a code number to be looked up or entered, main
finishes up by calling terminate_price_file
(Figure suplook.11) , which closes the price
file and returns. All processing complete, main
exits to
the operating system.
terminate_price_file
function (from
superm\suplook.cpp) (Figure suplook.11)codelist/suplook.11
In this chapter, we have covered ways to save storage by using a restricted character set and to gain rapid access to data by an exact key, using hash coding and caching. In the next chapter we will see how to use bitmaps and distribution sorting to aid in rearranging information by criteria that can be specified at run-time.
(You can find suggested approaches to problems in Chapter artopt.htm).
Increasing FILE_CAPACITY would also increase the amount of memory required for the cache unless we reduced the cache size as a fraction of the file size by reducing the value .20 in the calculation of APPROXIMATE_CACHE_SIZE.
The legal_chars array must be kept in synchronization with the lookup_chars array, shown in Figure radix40.02.