In this chapter we will examine the Huffman coding and arithmetic coding methods of data compression and develop an implementation of the latter algorithm. The arithmetic coding algorithm allows a tradeoff between memory consumption and compression ratio; our emphasis will be on minimum memory consumption, on the assumption that the result would eventually be used as embedded code within a larger program.
Huffman Coding, Arithmetic Coding, Lookup Tables
Huffman coding is widely recognized as the most efficient method of encoding characters for data compression. This algorithm is a way of encoding different characters in different numbers of bits, with the most common characters encoded in the fewest bits. For example, suppose we have a message made up of the letters 'A', 'B', and 'C', which can occur in any combination. Figure huffman.freq shows the relative frequency of each of these letters, the Huffman code assigned to each one, and a fixed-length code for comparison.
Huffman Fixed-length
Letter Frequency Code Code
+--------+---------+--------------+
A | 1/4 | 00 | 00 |
B | 1/4 | 01 | 01 |
C | 1/2 | 1 | 10 |
+--------+---------+--------------+
Let's encode the message "CABCCABC" using both codes. The results are shown in Figure huffman.fixed.
Huffman Fixed-length
Letter Code Code
+-------+------------+
C | 1 | 10 |
A | 00 | 00 |
B | 01 | 01 |
C | 1 | 10 |
C | 1 | 10 |
A | 00 | 00 |
B | 01 | 01 |
C | 1 | 10 |
+-------+------------+
Total bits used 12 16
Let's see how arithmetic coding[1] would encode the first three characters of the same message, "CAB".
Cum. Previous Current Output
Message Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
A | 16 | 16 |000000( 0)-001111(15)| None | 00 | 00 |
B | 16 | 32 |010000(16)-011111(31)| None | 01 | 01 |
* C | 32 | 64 |100000(32)-111111(63)| None | 1 | 1 |
+-----+----+---------------------+-------+---------+-------+
Figure aritha1 is the first of several figures which contain the information needed to determine how arithmetic coding would encode messages of up to three characters from an alphabet consisting of the letters 'A', 'B', and 'C', with frequencies of 1/4, 1/4, and 1/2, respectively. The frequency of a message composed of three characters chosen independently is the product of the frequencies of those characters. Since the lowest common denominator of these three fractions is 4, the frequency of any three-character message will be a multiple of (1/4)^3, or 1/64. For example, the frequency of the message "CAB" will be (1/2)*(1/4)*(1/4), or 1/32 (=2/64). For this reason, we will express all of the frequency values in terms of 1/64ths.
Thus, the "Freq." column signifies the expected frequency of occurrence of each message, in units of 1/64; the "Cum. Freq." column accumulates the values in the first column; the "Codes" column indicates the range of codes that can represent each message[2]; the "Previous Output" column shows the bits that have been output before the current character was encoded; the "Current Output" column indicates what output we can produce at this point in the encoding process; and the "Output So Far" column shows the cumulative output for that message, starting with the first character encoded.
As the table indicates, since the first character happens to be a 'C', then we can output "1", because all possible messages starting with 'C' have codes starting with a "1". Let's continue with Figure aritha2 to see the encoding for a two-character message.
Cum. Previous Current Output
Message Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
AA | 4 | 4 |000000(00)-000011(03)| 00 | 00 | 0000 |
AB | 4 | 8 |000100(04)-000111(07)| 00 | 01 | 0001 |
AC | 8 | 16 |001000(08)-001111(15)| 00 | 1 | 001 |
BA | 4 | 20 |010000(16)-010011(19)| 01 | 00 | 0100 |
BB | 4 | 24 |010100(20)-010111(23)| 01 | 01 | 0101 |
BC | 8 | 32 |011000(24)-011111(31)| 01 | 1 | 011 |
* CA | 8 | 40 |100000(32)-100111(39)| 1 | 00 | 100 |
CB | 8 | 48 |101000(40)-101111(47)| 1 | 01 | 101 |
CC | 16 | 64 |110000(48)-111111(63)| 1 | 1 | 11 |
+-----+----+---------------------+-------+---------+-------+
After encoding the first two characters of our message, "CA", our cumulative output is "100", since the range of codes for messages starting with "CA" is from "100000" to "100111"; all these codes start with "100". The whole three-character message is encoded as shown in Figure aritha3.
Cum. Previous Current Output
Message Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
AAA | 1 | 1 |000000(00)-000000(00)| 0000 | 00 | 000000|
AAB | 1 | 2 |000001(01)-000001(01)| 0000 | 01 | 000001|
AAC | 2 | 4 |000010(02)-000011(03)| 0000 | 1 | 00001 |
ABA | 1 | 5 |000100(04)-000100(04)| 0001 | 00 | 000100|
ABB | 1 | 6 |000101(05)-000101(05)| 0001 | 01 | 000101|
ABC | 2 | 8 |000110(06)-000111(07)| 0001 | 1 | 00011 |
ACA | 2 | 10 |001000(08)-001001(09)| 001 | 00 | 00100 |
ACB | 2 | 12 |001010(10)-001011(11)| 001 | 01 | 00101 |
ACC | 4 | 16 |001100(12)-001111(15)| 001 | 1 | 0011 |
BAA | 1 | 17 |010000(16)-010000(16)| 0100 | 00 | 010000|
BAB | 1 | 18 |010001(17)-010001(17)| 0100 | 01 | 010001|
BAC | 2 | 20 |010010(18)-010011(19)| 0100 | 1 | 01001 |
BBA | 1 | 21 |010100(20)-010100(20)| 0101 | 00 | 010100|
BBB | 1 | 22 |010101(21)-010101(21)| 0101 | 01 | 010101|
BBC | 2 | 24 |010110(22)-010111(23)| 0101 | 1 | 01011 |
BCA | 2 | 26 |011000(24)-011001(25)| 011 | 00 | 01100 |
BCB | 2 | 28 |011010(26)-011011(27)| 011 | 01 | 01101 |
BCC | 4 | 32 |011100(28)-011111(31)| 011 | 1 | 0111 |
CAA | 2 | 34 |100000(32)-100001(33)| 100 | 00 | 10000 |
* CAB | 2 | 36 |100010(34)-100011(35)| 100 | 01 | 10001 |
CAC | 4 | 40 |100100(36)-100111(39)| 100 | 1 | 1001 |
CBA | 2 | 42 |101000(40)-101001(41)| 101 | 00 | 10100 |
CBB | 2 | 44 |101010(42)-101011(43)| 101 | 01 | 10101 |
CBC | 4 | 48 |101100(44)-101111(47)| 101 | 1 | 1011 |
CCA | 4 | 52 |110000(48)-110011(51)| 11 | 00 | 1100 |
CCB | 4 | 56 |110100(52)-110111(55)| 11 | 01 | 1101 |
CCC | 8 | 64 |111000(56)-111111(63)| 11 | 1 | 111 |
+-----+----+---------------------+-------+---------+-------+
We have generated exactly the same output from the same input as we did with Huffman coding. So far, this seems to be an exercise in futility; is arithmetic coding just another name for Huffman coding?

These two algorithms provide the same compression efficiency only when the frequencies of the characters to be encoded happen to be representable as integral powers of 1/2, as was the case in our examples so far; however, consider the frequency table shown in Figure huffman.poor.
Huffman Fixed-length
Letter Frequency Code Code
+--------+---------+--------------+
A | 3/4 | 0 | 0 |
B | 1/4 | 1 | 1 |
+--------+---------+--------------+
Using Huffman coding, there is no way to assign theoretically optimal codes to characters with such a frequency distribution.[3] Since the shortest possible Huffman code is one bit, the best we can do is to assign a one-bit code to each character, although this does not reflect the difference in their frequencies. In fact, Huffman coding can never provide any compression at all with a two-character alphabet.
However, such a situation is handled very well indeed by arithmetic compression. To see how, we will start by asking a fundamental question: what is a bit?
A bit is the amount of information required to specify an alternative that has a frequency of 50%; two bits can specify alternatives that have a frequency of 25%, and so forth. For example, the toss of a coin can result in either heads or tails, each of which can be optimally represented by a one-bit code; similarly, if a chance event has four equally likely outcomes, we can express each possible result most economically with two bits. On the other hand, as we have seen in our discussion of Huffman codes, we can have a number of alternatives that are not equally likely; in that case, we assign longer codes for those alternatives that are less likely. However, the shortest possible code in Huffman coding is one bit, which is assigned to an outcome with a frequency of one-half.
The general formula for the optimal length of a code specifying a particular outcome with frequency f is "log2(1/f)". In our previous examples, an outcome with a frequency of .5 should have a code length of log2(1/(1/2)) or 1 bit. Similarly, if an outcome has a frequency of .25 it should have a code length of log2(1/(1/4)), or two bits.
But what if one of the possible outcomes has a frequency greater than one-half? Logically, we should use less than one bit to specify it. For example, if we have a data file in which 84% of the characters are spaces, we can calculate the appropriate number of bits in the code for that character as log2(1/(.84)), or approximately .25 bits. If the remaining 255 characters all have equal frequencies, each of these frequencies is .0627% (that is, .000627), so that our formula reduces to log2(1/(.000627)), or approximately 10.63 bits each. This would result in an average code length of (.84)*(.25) + (.16)*(10.63), or 1.91 bits per character. By contrast, a Huffman code would require (.84)*1 + (.16)*9, or 2.28 bits per character. If we were compressing a 250-Kbyte file with such characteristics, using Huffman codes would produce a 71-Kbyte file, whereas arithmetic coding would result in a 60-Kbyte file, about a 20% difference between these two approaches.[4]
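Since this arithmetic is easy to get wrong, here is a small standalone sketch (not part of the compression program itself) that reproduces these numbers directly from the log2(1/f) formula:

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double f_space = 0.84;            // frequency of the space character
        double f_other = 0.16 / 255;      // each of the other 255 characters
        double space_bits = std::log2(1.0 / f_space);  // about .25 bits
        double other_bits = std::log2(1.0 / f_other);  // about 10.6 bits
        double average = 0.84 * space_bits + 0.16 * other_bits;
        std::printf("space: %.2f bits, other: %.2f bits, average: %.2f\n",
                    space_bits, other_bits, average);  // about 1.91 bits
        return 0;
    }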
Of course, we can't output a code of less than one bit. However, we can use a set of one or more bits to represent more than one character in the same message. This is why the statement that Huffman coding is the most efficient way to represent characters is true, but misleading; if our messages contain more than one character each, we may be able to combine the codes for a number of characters while consuming a fractional number of bits for some characters. To make this clearer, let's go through the encoding of a three-character message, "ABA", from a two-character alphabet in which the letter 'A' accounts for three-fourths of all the characters and the letter 'B' the remaining one-fourth. The situation after we see the first character is shown in Figure arithb1.
Message Cum. Previous Current Output
So Far Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
* A | 48 | 48 |000000( 0)-101111(47)| None | None | None |
B | 16 | 64 |110000(48)-111111(63)| None | 11 | 11 |
+-----+----+---------------------+-------+---------+-------+
Nothing! We don't know whether the first bit of our encoded message will be a 0 or a 1; that depends on what happens next. Remember, messages starting with the letter 'A' can have codes starting with 00 through 10, or three-fourths of all possible codes. An 'A' gives us somewhat less than 1/2 bit of information, not nearly enough to produce any output by itself. Now let's look at Figure arithb2 for the information needed to encode the next character.
Message Cum. Previous Current Output
So Far Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
AA | 36 | 36 |000000( 0)-100011(35)| None | None | None |
* AB | 12 | 48 |100100(36)-101111(47)| None | 10 | 10 |
BA | 12 | 60 |110000(48)-111011(59)| 11 | None | 11 |
BB | 4 | 64 |111100(60)-111111(63)| 11 | 11 | 1111 |
+-----+----+---------------------+-------+---------+-------+
We have two bits of output, since all the codes for messages starting with "AB" have the initial bits "10".[5] Let's continue with Figure arithb3 for the third (and last) character of our message.
Message Cum. Previous Current Output
So Far Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
AAA | 27 | 27 |000000( 0)-011010(26)| None | 0 | 0 |
AAB | 9 | 36 |011011(27)-100011(35)| None | None | None |
* ABA | 9 | 45 |100100(36)-101100(44)| 10 | None | 10 |
ABB | 3 | 48 |101101(45)-101111(47)| 10 | 11 | 1011 |
BAA | 9 | 57 |110000(48)-111000(56)| 11 | None | 11 |
BAB | 3 | 60 |111001(57)-111011(59)| 11 | 10 | 1110 |
BBA | 3 | 63 |111100(60)-111110(62)| 1111 | None | 1111 |
BBB | 1 | 64 |111111(63)-111111(63)| 1111 | 11 | 111111|
+-----+----+---------------------+-------+---------+-------+
We have still produced only two bits of output from our three-character message, "ABA". The best we could do with Huffman coding is three bits. However, this is not the extreme case; if we were encoding the most frequent message, "AAA", we would have only a "0" bit.[6]
This algorithm will work nicely if we happen to know in advance what the frequencies of all the possible characters will be. But how do we acquire this information? If our program reads an input file and writes an output file, we can go through the input file once, counting the number of times a given character appears, build the table of frequencies, and then make a second pass using this table to do the actual encoding. However, this is not possible when we are compressing data as it is being generated, as there is then no input file to be analyzed. This might seem to be an insurmountable obstacle.
Luckily, it is not; calculating the character frequencies as we encode yields about the same compression efficiency as precalculation, and in some cases even better efficiency! The reason for this surprising result is that most files (or sets of data) are not uniform throughout, but rather exhibit local variations in the frequency of a given character or set of characters. Calculating the frequencies on the fly often produces a better match for these changing characteristics.
Therefore, our approach is as follows. Every character is initially assumed to have the same frequency of occurrence. After each character is read and encoded, its entry in the table of frequencies is increased to reflect our having seen it. The next time we encode it, its encoding will account for the fact that we have seen it before.
That may seem quite simple for the sender, but not for the receiver. If the encoding table is being modified as we go, how does the receiver keep in step?
The receiver has the same initial table as the transmitter; each character has the same expected frequency of occurrence. Then, as the receiver decodes each character, it updates its copy of the frequency table. This also explains why we increase a character's frequency after we encode it, rather than before; until the receiver decodes the character, it cannot update the frequency of that character's occurrence.[7]
In our implementation, we achieve a large improvement in compression efficiency by using the context in which a character occurs when estimating its frequency of occurrence. It should be apparent that the frequency of a given character appearing in a message at a given point is quite dependent on the previous context. For example, in English text, a "Q" is almost certain to be followed by a "u", at least if we exclude articles about software such as DESQview! Ideally, therefore, the amount of information needed to encode a "u" following a "q" should be very small. The same principle, of course, applies to the encoding of a line feed following a carriage return in text files where this is a virtual certainty.
On the other hand, the amount of storage required to keep track of a large amount of previous context can become excessive.[8] Even one character of previous context requires the construction of 256 tables of frequencies, one for each possible previous character. A direct extension of the approach given in the reference cited earlier would require over 300 KB of storage for these tables.
We will apply a number of space-saving methods to reduce the above storage requirement by about 90%, to approximately 35 KB, while still achieving good data-compression performance in most cases.
In order to achieve such a reduction in memory consumption, we must
avoid storing anything that can be recalculated, such as the cumulative
frequencies of all possible characters.[9] We
must also dispense with the use of a self-organizing table that
attempts to speed up encoding and decoding by moving more frequently
used characters toward the front of the table, as is done in the
reference cited earlier. However, we must provide as large a dynamic
range of frequencies as possible: the larger the ratio between the
highest frequency and the lowest, the greater the possible efficiency.
The greatest dynamic range is needed when one character always occurs
in a particular context, such as line feed after carriage return.
Assuming that we must be able to encode any of the 256 possible
one-byte values, the algorithm limits the possible dynamic range to
approximately one-fourth of the range of an unsigned
value.[10] For maximum compression
efficiency we therefore need 256 tables of 256 two-byte entries each,
consuming a total of 128 KB.[11] When I
first implemented this algorithm, I saw no way to reduce this
significantly.
Then early one morning, I woke up with the answer. We need some very large frequency values and some very small ones, but surely not every one in between. Why not use a code to represent one of a number of frequency values? These values would be spaced out properly to get as close as possible to optimal compression efficiency, but each would be represented by a small index, rather than literally.
How small an index? First I considered using an eight-bit index. But that still would require 64 KB for a complete 256x256 table. Maybe a smaller index would help. But wouldn't that impose an unreasonable processing time penalty?
Amazingly enough, we can use a four-bit index with no time penalty: in fact, processing is actually faster than with a byte index! This seemingly magical feat is accomplished by one of my favorite time-saving methods: the lookup table.
You see, a significant amount of the CPU time needed to encode a symbol is spent in a loop that computes the necessary portion of the cumulative frequency table for the current context. If each frequency value occupied one byte, we would have to execute that loop once for every index in the table up to the ASCII value of the character whose cumulative frequency we are trying to compute. However, since we have packed two indexes into one byte, we can accumulate two indexes worth of frequency values with one addition, thus reducing the number of times the loop has to be executed by approximately 50%.
Each execution of the loop translates a pair of indexes contained
in one byte into a total frequency value that is the sum of their
individual frequency values by using the byte as an index into a
256-word table, which has one entry for each possible combination of
two indexes. This table (the both_weights
table) is
calculated once at the start of the program.
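As a sketch of that construction (the book's actual code is in start_model, Figure adapt.00; the array types here are assumptions):

    // Each byte of a frequency table holds two 4-bit frequency indexes;
    // both_weights maps that byte directly to the sum of the two
    // translated frequency values, so two symbols cost one addition.
    unsigned short translate[16];      // frequency value for each 4-bit index
    unsigned short both_weights[256];  // one entry per possible index pair

    void build_both_weights()
    {
        for (int high = 0; high < 16; high++)    // index in the high half
            for (int low = 0; low < 16; low++)   // index in the low half
                both_weights[high * 16 + low] =
                    translate[high] + translate[low];
    }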
Once we have determined the frequency of occurrence assigned to the character to be encoded, we can decide how many bits we can send and what their values are.
The receiver uses almost the same code as the sender, with two main exceptions. First, the receiver reads the input one bit at a time and outputs each character as soon as it is decoded, just the reverse of the sender. Second, rather than knowing how many frequency entries we have to accumulate in advance as the sender does, we have to accumulate them until we find the one that corresponds to the range we are trying to interpret. The latter situation reduces the efficiency of our loop control, which accounts for much of the difference in speed between encoding and decoding.
Let's start at the beginning, with the main
function
in encode.cpp
(Figure encode.00).
codelist/encode.00
This function opens the input and output files and sets up buffers
to speed up input and output, then calls start_model
(Figure adapt.00).
start_model
function (from compress\adapt.cpp) (Figure adapt.00)codelist/adapt.00
This function starts out by initializing the upgrade_threshold
array, which is used to determine when to promote a character to the
next higher frequency index value. As noted above, these values are not
consecutive, so that we can use a four-bit index rather than literal
values; this means that we have to promote a character only once in a
while rather than every time we see it, as we would do with literal
frequency values. How do we decide when to do this?
A pseudorandom approach seems best: we can't use a genuine random
number to tell us when to increment the index, because the receiver
would have no way of reproducing our decisions when decoding the
message. My solution is to keep a one-byte hash total of the ASCII
codes for all the characters that have been sent (char_total
)
and to increment the index in question whenever char_total
is greater than the corresponding value stored in the upgrade_threshold
array. That threshold value is calculated so that the probability of
incrementing the index is inversely proportional to the gap between
frequency values in the translation table.[12]
If, for example, each frequency value in the translation table were
twice the previous value, there would be a 1/2 probability of
incrementing the index each time a character was encountered. After we
finish initializing the upgrade_threshold
array, we set char_total
to 0, in preparation for accumulating our hash total.
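In outline, the decision looks something like the following sketch (the function names here are illustrative, not the book's):

    unsigned char char_total = 0;          // one-byte hash of all characters
    unsigned char upgrade_threshold[16];   // one threshold per index value

    void note_character(unsigned char ascii_code)
    {
        char_total += ascii_code;          // wraps modulo 256, like a hash
    }

    int should_upgrade(int freq_index)     // nonzero: promote to next index
    {
        // The receiver computes the identical char_total while decoding,
        // so both sides make exactly the same promotion decisions.
        return char_total > upgrade_threshold[freq_index];
    }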
The next operation in start_model
is to generate the both_weights
table. As we discussed above, this table allows us to translate from a
pair of frequency values (or weights) to the total frequency to which
they correspond. We calculate it by generating every possible pair and
adding them together to fill in the corresponding table entry. The
values in the translate
table are defined in Figure model.00, in the line that starts
with #define TRANSLATE_TABLE
. How did I generate these
values?
codelist/model.00
It wasn't easy. I knew that I wanted to allow the largest possible dynamic range, which means that the lowest value has to be 1 and the highest value has to be close to the maximum that can be accommodated by the algorithm (16,128). The reason I chose a top value lower than that maximum value is that if the highest value were 16,128, then the occurrence of any character other than the preferred one would cause the total frequency to exceed the allowable maximum, with the result that the table of frequencies would be recalculated to reduce the top value to the next lower step. This would greatly reduce the efficiency of the compression for this case. That accounts for the lowest and highest values. What about the ones in between?
Initially, I decided to use a geometric progression, much like the tuning of a piano; in such a progression, each value is a fixed multiple of the one before it. However, I found that I achieved better compression on a fairly large sample of files by starting the progression with the second value at 16 and ending with the next to the last at 1024. Why is this so?
The reason for leaving a big gap between the lowest frequency and the second lowest one is that many characters never occur in a particular situation. If they occur once, they are likely to recur later. Therefore, setting the next-to-the-lowest frequency to approximately one-tenth of 1% of the maximum value improves the efficiency of the compression. I have also found through experimentation that the compression ratio is improved if the first time a character is seen, it is given a frequency of about six-tenths of 1%, which requires an initial index of 7. Lower initial frequencies retard adaptation to popular new characters, and higher ones overemphasize new characters that turn out to be unpopular in the long run.
The reason to leave a large gap between the next-to-highest frequency and the highest one is that most of the time, a very skewed distribution has exactly one extremely frequent character. It is rare to find several very high-frequency characters that have about the same frequency. Therefore, allowing the highest frequency to approach the theoretical maximum produces the best results. Of course, these are only empirical findings. If you have samples that closely resemble the data that you will be compressing, you can try modifying these frequency values to improve the compression.
Continuing in start_model
(Figure adapt.00), we initialize NO_OF_CHARS
frequency_info
tables, one for every possible character.
Each of these tables will store the frequencies for characters that
follow a particular character. If we start to encode the string "This
is a test", the first character, 'T' will be encoded using the table
for character 0 (a null); this is arbitrary, since we haven't seen any
characters before this one. Then the 'h' will be encoded using the
table for 'T'; the 'i' will be encoded with the table for 'h', and so
forth. This approach takes advantage of the context dependence of
English text; as we noted before, after we see a 'q', the next
character is almost certain to be a 'u', so we should use a very short
code to represent a 'u' in that position.
However, our initialization code contains no information about the
characteristics of English text (or any other kind of data for that
matter). We assign the same frequency (the lowest possible) to every
character in every frequency table. As discussed above, the encoding
program will learn the appropriate frequencies for each character in
each table as it processes the input data. At the end of the
initialization for each table, we set its total_freq
value to the translation of the lowest frequency, multiplied by the
number of index pairs in the table. This value is needed to calculate
the code that corresponds to each character, and recalculating it every
time we access the table would be time-consuming.
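The layout of one such table might be sketched as follows (field names are assumptions; the real declarations are in the book's header files):

    const int NO_OF_CHARS = 256;

    struct frequency_info
    {
        unsigned char freq[NO_OF_CHARS / 2]; // two 4-bit indexes per byte
        unsigned short total_freq;           // cached sum of translated values
    };

    frequency_info model[NO_OF_CHARS];       // one table per previous character

    // 256 tables at about 130 bytes each comes to roughly 33 KB, in line
    // with the approximately 35 KB cited in the text.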
The last operation in start_model
initializes the char_translation
array, which is used to translate the internal representation of the
characters being encoded and decoded to their ASCII codes.[13]
Then we return to encode.cpp
.
The next operation in main
is to call start_outputing_bits
(Figure arenc.00).
start_outputing_bits
function (from compress\arenc.cpp) (Figure arenc.00)codelist/arenc.00
Of course, we can't send individual bits to the output file; we
have to accumulate at least one byte's worth. Whenever we have some
bits to be output, we will store them in buffer
; since we
haven't encoded any characters yet, we set it to 0. In order to keep
track of how many more bits we need before we can send buffer
to the output file, we set bits_to_go
to eight; then we
return to main
.
The next thing we do in main
is to call start_encoding
(Figure arenc.01).
start_encoding
function (from compress\arenc.cpp) (Figure arenc.01)codelist/arenc.01
This is a very short function, but it initializes some of the most
important variables in the program; low
, high
,
and bits_to_follow
. The first two of these keep track of
the range of codes for the current message; at this point we know
nothing about the message, so they are set to indicate the widest
possible range of messages, from 0 to TOP_VALUE
, which is
defined in arith.h
(Figure arith.00).
codelist/arith.00
In our current implementation, with 16-bit code values, TOP_VALUE
evaluates to 65535.[14] The third variable, bits_to_follow
,
keeps track of the number of bits that have been deferred for later
output. This is used where the range of possible codes for the current
message includes codes that start with 0 and some that start with 1; as
we have seen already, in such a situation we're not ready to send out
any bits yet. After initializing these variables, we return to main
.
Upon returning to main
, we need to do one more
initialization before we enter the main loop of our program, which is
executed once for each character in the input file; namely, setting oldch
to 0. This variable controls the context in which we will encode each
character. Since we haven't seen any characters as yet, it doesn't
really matter which frequency table we use, as long as the decoding
program selects the same initial table.
The first operation in the main loop is to get the character to be
encoded, via getc
. If we have reached the end of the
file, we break out of the loop. Otherwise, we call encode_symbol
(Figure arenc.02) to place the
representation of the current character in the output stream.
encode_symbol
function (from compress\arenc.cpp) (Figure arenc.02)codelist/arenc.02
This function takes two parameters: ch
, the character
to be encoded, and oldch
, which determines the frequency
table to be used for this encoding. As we have noted above, selecting a
frequency table based upon the previous character encoded provides far
better compression efficiency, as the frequency of occurrence of a
particular character is greatly affected by the preceding character.
encode_symbol
Although encode_symbol
is a short function, it is the
subtlest function in this chapter; fortunately, you can use this
algorithm effectively without going through the explanation below. The
version here closely follows the reference cited earlier;
if you are really interested in the details of operation of this
function, I strongly advise you to study that reference very carefully
in conjunction with the explanation here.
To clarify this algorithm, we will work through a somewhat
simplified example as we examine the code. For ease in calculation, we
will set TOP_VALUE
to 255, or 11111111 binary, rather
than 65535 as in the actual program; as a result, high
will start out at 255 as well. We will also use a single, constant
frequency table (Figure sampfreq)
containing only four entries rather than selecting a 256-entry table
according to the previous character and modifying it as we see each
character, so our translate
and both_weights
tables (Figures samptrans and sampboth, respectively) will be
adjusted correspondingly. Instead of ASCII, we will use the codes from
0 (for 'A') to 3 (for 'D') in our example message.
frequency table (Figure sampfreq)

            Frequency code
 Character  Index    Value
+----------+--------+--------+
     A     |   1    |   2    |
     B     |   3    |  52    |
     C     |   0    |   1    |
     D     |   2    |   8    |
+----------+--------+--------+

translate table (Figure samptrans)

 Index   Frequency Value
+-------+-----------------+
|   0   |        1        |
|   1   |        2        |
|   2   |        8        |
|   3   |       52        |
+-------+-----------------+
both_weights table (Figure sampboth)

          Second Index
First
Index    0    1    2    3
+----+----+----+----+
0 | 2 | 3 | 9 | 53 |
+----+----+----+----+
1 | 3 | 4 | 10 | 54 |
+----+----+----+----+
2 | 9 | 10 | 16 | 60 |
+----+----+----+----+
3 | 53 | 54 | 60 |104 |
+----+----+----+----+
Now we are ready to enter the frequency accumulation loop, which is shown in Figure arenc.03.
codelist/arenc.03
The reason we need this loop is that, as we saw before, we cannot
afford to keep the cumulative frequency tables in memory; they would
occupy hundreds of kilobytes of memory. Instead, we calculate the
cumulative frequency for each character being encoded, as we need it.
We do, however, maintain the total_freq variable from one
encoding to the next; recalculating it would require us to go all the
way through the frequency table for each character, even if we are
encoding a character with a low ASCII code. Since we are saving the
total frequency with the table, we have to accumulate the frequencies
only up to the ASCII code of the character we are encoding.
Let's see how this loop accumulates the frequencies of all the
characters up to the character we want to encode. The first interesting
item is that the loop executes only half as many times as the value of
the character being encoded, since we are packing two 4-bit indexes
into each byte of the frequency table. So the first statement in the
loop retrieves one of these pairs of indexes from the frequency table
and increments the pointer to point to the next pair. Then we index
into the both_weights
table with the index pair we just
retrieved and set total_pair_weight
to that entry in the
table. The both_weights
table is the key to translating
two 4-bit indexes into a total frequency. Each entry in the table is
the sum of the frequency values that correspond to the two indexes that
make up the byte we use to index into the table. Finally, we add total_pair_weight
to prev_cum
, which is accumulating all of the
frequencies.
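A sketch of the loop (the function form and names are illustrative; the real loop is inside encode_symbol, Figure arenc.02):

    // Accumulate the frequencies of all characters below 'symbol', two
    // characters per iteration via the both_weights lookup table.
    unsigned accumulate_frequencies(const unsigned char *freq_table,
                                    const unsigned short *both_weights,
                                    int symbol)
    {
        unsigned prev_cum = 0;
        const unsigned char *freq_ptr = freq_table;   // packed index pairs
        for (int i = 0; i < symbol / 2; i++)
        {
            unsigned char current_pair = *freq_ptr++; // two 4-bit indexes
            unsigned total_pair_weight = both_weights[current_pair];
            prev_cum += total_pair_weight;   // two frequencies, one addition
        }
        return prev_cum;
    }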
In our example, the first letter of our message is 'D', which has a symbol value of 3. Using the frequency table in Figure sampfreq, we execute the statements in the loop once. First, we set current_pair to the first entry in the frequency table, 00010011, which indicates that the frequency code for 'A' is 1 (0001 binary) and the frequency code for 'B' is 3 (0011 binary). Then we set total_pair_weight to entry (1,3) from the both_weights table, the sum of the frequencies to which this pair of indexes corresponds; its value is 54. The last statement in the loop adds this value to prev_cum, which was set to 0 before the loop was started.

The next section of code, shown in Figure arenc.04, finishes the accumulation of the cumulative frequencies of the character to be encoded and the previous character, for both of the possible alignments of the character we are encoding and the previous character.
codelist/arenc.04
If the target character has an even ASCII code, we already have the
correct value for prev_cum
; to calculate cum
,
the frequency accumulation for the current character, we need the first
index from the next byte, which is stored in the high half of the byte.
So we pick up current_pair
, shift its high index down,
and use it to retrieve the corresponding weight from the translate
table. Then we add that frequency value to prev_cum
to
calculate cum
.
On the other hand, if the target character has an odd ASCII code,
we need to update both prev_cum
and cum
.
First, we add the total weights for the last two characters to prev_cum
,
which results in cum
. Then we translate the high half of
the current_pair
and add that to prev_cum
.
In our example, the code of the first character is 3, so we need to update both prev_cum and cum. The value of current_pair is (0,2). Looking in the both_weights table, we set total_pair_weight to the translation of that pair, which is 9. Then cum is calculated as the sum of prev_cum and total_pair_weight, or 63. Then we extract the high part of current_pair, 0, which translates to 1; we add this amount to prev_cum, setting it to 55. This means that the code range associated with character "D" starts at 55 and ends slightly before 63, a range of 8 positions out of the total of 63. This will allow us to send out slightly less than three bits for this character, as we will narrow the range by a factor of more than 7.
Now that we have calculated the cumulative frequencies of the current character and the previous character, we are ready to narrow the range of frequencies that correspond to our message, as shown in Figure arenc.05.
codelist/arenc.05
The first line of this section of code calculates the previous range of frequencies for our message. Then the other lines calculate the new range of the message.
The purpose of low
and high
is to
delimit the frequency interval in which the current message falls; range
is just the size of the frequency interval extending from the old value
of low
to the old value of high
, inclusive.
In our example, the old value of low is 0 and the old value of high is 255. Therefore, the formula for range, (long)(high-low)+1, produces 256, which is its maximum value. This makes sense, as we have not yet used the information from the first character of our message.

The new values of low and high represent the narrowing of the previous range due to the frequency of the new character. We calculate the new value of high, which represents the high end of the new range of frequencies for our message after the most recent character is added to the message.[15] Similarly, the new value of low represents the low end of the range of frequencies for our message after the most recent character is added to the message.
In our example, low is still 0, range is 256, cum is 63, and total_freq is 63. The expression to calculate high is low + (range * cum) / total_freq - 1. Therefore, high is calculated as 0+(256*63)/63-1, or 255. This means that the new range of frequencies for our message ends slightly below 256. Next, we recalculate low. Its value is still 0 so far, range is 256, prev_cum is 55, and total_freq is 63. The expression to calculate low is low + (range * prev_cum)/total_freq. Therefore, we calculate the new value of low as 0+(256*55)/63, or 223. This means that the new range of frequencies for our message begins at 223.
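In compact form, the narrowing step amounts to the following (a sketch; the real code operates on the encoder's own variables in Figure arenc.05):

    // Narrow [low, high] to the slice belonging to the current character;
    // prev_cum, cum, and total_freq are as described in the text.
    void narrow_range(long &low, long &high,
                      unsigned prev_cum, unsigned cum, unsigned total_freq)
    {
        long range = (long)(high - low) + 1;         // size of the interval
        high = low + (range * cum) / total_freq - 1; // new top of interval
        low = low + (range * prev_cum) / total_freq; // new bottom of interval
    }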
Now we are finally ready to start sending bits out. The loop shown in Figure arenc.06 extracts as many bits as possible from the encoding of the message so far, and widens the range correspondingly to allow the encoding of more characters.
codelist/arenc.06
In order to understand this code, we need to look at the table of
possible initial bits of high
and low
,
which is given in Figure initcode.
The entries in this table could use some explanation. The first two
columns contain all the possible combinations of the first two bits of low
and high
; we know that low
cannot be
greater than high
, since these two values delimit the
range of codes for the message being compressed. If low
ever became greater than high
, it would be impossible to
encode any further characters. The "Action" column indicates what, if
anything, we can output now. Clearly, if low
and high
have the same first bit, we can output that bit. The entries labeled
"Next" indicate that since the separation between the values of low
and high
is at least one-fourth of the total range of
values (0-TOP_VALUE
), we can encode at least one more
character now; this is the reason for the limit on the total frequency
of all characters.
Low High Action
+--------+---------+---------+
| 00 | 00 | 0 |
| 00 | 01 | 0 |
| 00 | 10 | Next |
| 00 | 11 | Next |
| 01 | 01 | 0 |
| 01 | 10 | Defer |
| 01 | 11 | Next |
| 10 | 10 | 1 |
| 10 | 11 | 1 |
| 11 | 11 | 1 |
+--------+---------+---------+
The first test determines whether the highest
frequency value allocated to our message is less than one-half of the
total frequency range. If it is, we know that the first output bit is
0, so we call the bit_plus_follow_0
macro (Figure arenc.07) to output that bit. Let's
take a look at that macro.
First we call output_0
, which adds a 0 bit to the
output buffer and writes it out if it is full. Then, in the event that bits_to_follow
is greater than 0, we call output_1
and decrement bits_to_follow
until it reaches 0. Why do we do this?
bit_plus_follow_0
macro (from compress\arenc.cpp) (Figure arenc.07)codelist/arenc.07
The reason is that bits_to_follow
indicates the
number of bits that could not be output up till now because we didn't
know the first bit to be produced ("deferred" bits). For example, if
the range of codes for the message had been 011 through 100, we would
be unable to output any bits until the first bit was decided. However,
once we have enough information to decide that the first bit is 0, we
can output three bits, "011". The value of bits_to_follow
would be 2 in that case, since we have two deferred bits. Of course, if
the first bit turns out to be 1, we would emit "100" instead. The
reason that we know that the following bits must be the opposite of the
initial bit is that the only time we have to defer bits is when the
code range is split between codes starting with 0 and those starting
with 1; if both low
and high
started with
the same bit, we could send that bit out.
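Expressed as a single function (the book uses two macros, bit_plus_follow_0 and bit_plus_follow_1, in Figure arenc.07; output_bit is the bit-buffering sketch shown earlier):

    static long bits_to_follow;  // deferred bits awaiting the first known bit

    void bit_plus_follow(int bit, std::FILE *outfile)
    {
        output_bit(bit, outfile);        // the bit we finally know
        while (bits_to_follow > 0)
        {
            output_bit(!bit, outfile);   // deferred bits are the opposite
            bits_to_follow--;
        }
    }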
The values of HALF, FIRST_QTR, and THIRD_QTR are all based on TOP_VALUE (Figure arith.00). In our example, FIRST_QTR is 64, HALF is 128, and THIRD_QTR is 192. The current value of high is 255, which is more than HALF, so we continue with our tests.

Assuming that we haven't obtained the current output bit yet, we continue by testing the complementary condition. If low is greater than or equal to HALF, we know that the entire frequency range allocated to our message so far is in the top half of the total frequency range; therefore, the first bit of the message is 1. If this occurs, we output a 1 via bit_plus_follow_1. The next two lines reduce high and low by HALF, since we know that both are above that value.
In our example, low is 223, which is more than HALF. Therefore, we can call bit_plus_follow_1 to output a 1 bit. Then we adjust both low and high by subtracting HALF, to account for the 1 bit we have just produced; low is now 95 and high is 127.

If we haven't passed either of the tests to output a 0 or a 1 bit, we continue by testing whether the range is small enough but in the wrong position to provide output at this time. This is the situation labeled "Defer" in figure initcode. We know that the first two bits of the output will be either 01 or 10; we just don't know which of these two possibilities it will be. Therefore, we defer the output, incrementing bits_to_follow to indicate that we have done so; we also reduce both low and high by FIRST_QTR, since we know that both are above that value. If this seems mysterious, remember that the encoding information we want is contained in the differences between low and high, so we want to remove any redundancy in their values.[16] If we get to the break statement near the end of the loop, we still don't have any idea what the next output bit(s) will be. This means that the range of frequencies corresponding to our message is still greater than 50% of the maximum possible frequency; we must have encoded an extremely frequent character, since it hasn't even contributed one bit! If this happens, we break out of the output loop; obviously, we have nothing to declare.
In any of the other three cases, we now have arrived at the
statement low <<= 1;
with the values of low
and high
guaranteed to be less than half the maximum
possible frequency.[17] Therefore, we shift
the values of low
and high
up one bit to
make room for our next pass through the loop. One more detail: we
increment high
after shifting it because the range
represented by high
actually extends almost to the next
frequency value; we have to shift a 1 bit in rather than a 0 to keep
this relationship.
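Putting the four cases together, the output loop can be sketched as follows (the function form is illustrative; the book's macro-based version is in Figure arenc.06, and the constants live in arith.h):

    // For 16-bit code values (TOP_VALUE = 65535), the quarter points are:
    const long FIRST_QTR = 16384;          // one-quarter of the code range
    const long HALF = 2 * FIRST_QTR;       // one-half of the code range
    const long THIRD_QTR = 3 * FIRST_QTR;  // three-quarters of the range

    void emit_available_bits(long &low, long &high, std::FILE *outfile)
    {
        for (;;)
        {
            if (high < HALF)
                bit_plus_follow(0, outfile);  // both in lower half: emit 0
            else if (low >= HALF)
            {
                bit_plus_follow(1, outfile);  // both in upper half: emit 1
                low -= HALF;
                high -= HALF;
            }
            else if (low >= FIRST_QTR && high < THIRD_QTR)
            {
                bits_to_follow++;             // the "Defer" case: 01 vs. 10
                low -= FIRST_QTR;
                high -= FIRST_QTR;
            }
            else
                break;              // range still too wide; nothing to send
            low <<= 1;              // make room for the next character
            high = (high << 1) + 1; // high's range extends almost to high+1
        }
    }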
In our example, low is 95 and high is 127, which represents a range of frequencies from 95 to slightly less than 128. The shifts give us 190 for low and 255 for high, which represents a range from 190 to slightly less than 256. If we hadn't added 1 to high, the range would have been from 190 to slightly less than 255.
Since we have been over the code in this loop once already, we can
continue directly with the example. We start out with low
at 190 and high
at 255. Since high
is not
less than HALF
(128), we proceed to the second test,
where low
turns out to be greater than HALF
.
So we call bit_plus_follow_1
again as on the first loop
and then reduce both low
and high
by HALF
,
producing 62 for low
and 127 for high
. At
the bottom of the loop, we shift a 0 into low
and a 1
into high
, resulting in 124 and 255, respectively.
On the next pass through the loop, high
is not less
than HALF
, low
isn't greater than or equal
to HALF
, and high
isn't less than THIRD_QTR
(192), so we hit the break
and exit the loop. We have
sent two bits to the output buffer.
We are finished with encode_symbol
for this
character. Now we will start processing the next character of our
example message, which is 'B'. This character has a symbol value of 1,
as it is the second character in our character set. First, we set prev_cum
to 0. The frequency accumulation loop in Figure arenc.03 will not be executed at all,
since symbol/2
evaluates to 0; we fall through to the
adjustment code in Figure arenc.04
and select the odd path after the else
, since the symbol
code is odd. We set current_pair
to (1,3), since that is
the first entry in the frequency table. Then we set total_pair_weight
to the corresponding entry in the both_weights
table,
which is 54. Next, we set cum
to 0 + 54, or 54. The high
part of the current pair is 1, so high_half_weight
becomes entry 1 in the translate
table, or 2; we add this
to prev_cum
, which becomes 2 as well.
Now we have reached the first line in Figure arenc.05. Since the current value of low
is 124 and the current value of high is 255, the value of
range becomes 132. Next, we recalculate high
as 124 + (132*54)/63 - 1, or 236. The new value of low is
124 + (132*2)/63, or 128. We are ready to enter the output loop.
First, high is not less than HALF, so
the first test fails. Next, low is equal to HALF,
so the second test succeeds. Therefore, we call bit_plus_follow_1
to output a 1 bit; it would also output any deferred bits that we might
have been unable to send out before, although there aren't any at
present. We also adjust low and high by
subtracting HALF, to account for the bit we have just
sent; their new values are 0 and 108, respectively.
Next, we proceed to the statements beginning at low <<= 1;,
where we shift low and high
up, injecting a 0 and a 1, respectively; the new values are 0 for low
and 217 for high. On the next pass through the loop we
will discover that these values are too far apart to emit any more
bits, and we will break out of the loop and return to the main
function.
We could continue with a longer message, but I imagine you get the
idea. So let's return to the main
function (Figure encode.00).[18]
update_model
The next function called is update_model
(Figure adapt.01), which adjusts the
frequencies in the current frequency table to account for having seen
the most recent character.
update_model
function (from compress\adapt.cpp) (Figure adapt.01)codelist/adapt.01
The arguments to this function are symbol
, the
internal value of the character just encoded, and oldch
,
the previous character encoded, which indicates the frequency table
that was used to encode that character. What is the internal value of
the character? In the current version of the program, it is the same as
the ASCII code of the character; however, near the end of the chapter
we will employ an optimization that involves translating characters
from ASCII to an internal code to speed up translation.
This function starts out by adding the character's ASCII code to char_total
,
a hash total which is used in the simple pseudorandom number generator
that we use to help decide when to upgrade a character's frequency to
the next frequency index code. We use the symbol_translation
table to get the ASCII value of the character before adding it to char_total
;
this is present for compatibility with our final version which employs
character translation.
The next few lines initialize some variables: old_weight_code
,
which we set when changing a frequency index code or "weight code", so
that we can update the frequency total for this frequency table; temp_freq_info
,
a pointer to the frequency table structure for the current character;
and freq_ptr
, the address of the frequency table itself.
Next, we compute the index into the frequency table for the weight
code we want to examine. If this index is even, that means that this
symbol is in the high part of that byte in the frequency table. In this
case, we execute the code in the true
branch of the if
statement "if (symbol % 2 == 0)
". This starts by setting temp_freq
to the high four bits of the table entry. If the result is 0, this
character has the lowest possible frequency value; we assume that this
is because it has never been encountered before and set its frequency
index code to INIT_INDEX
. Then we update the total_freq
element in the frequency table.
However, if temp_freq
is not 0, we have to decide
whether to upgrade this character's frequency index code to the next
level. The probability of this upgrade is inversely proportional to the
ratio of the current frequency to the next frequency; the larger the
gap between two frequency code values, the less the probability of the
upgrade. So we compare char_total
to the entry in upgrade_threshold
;
if char_total
is greater, we want to do the upgrade, so
we record the previous frequency code in old_weight_code
and add HIGH_INCREMENT
to the byte containing the
frequency index for the current character. We have to use HIGH_INCREMENT
rather than 1 to adjust the frequency index, since the frequency code
for the current character occupies the high four bits of its byte.
Of course, the character is just as likely to be in the low part of
its byte; in that case, we execute the code in the false
branch of that if
statement, which corresponds exactly to
the above code. In either case, we follow up with the if
statement whose condition is "(old_weight_code != -1)
",
which tests whether a frequency index code was incremented. If it was,
we add the difference between the new code and the old one to the total_freq
entry in the current frequency table, unless the character previously
had a frequency index code of 0; in that case, we have already adjusted
the total_freq
entry.
The last operation in update_model
is to make sure
that the total of all frequencies in the frequency table does not
exceed the limit of MAX_FREQUENCY
; if that were to
happen, more than one character might map into the same value between high
and low
, so that unambiguous decoding would become
impossible. Therefore, if temp_total_freq
exceeds MAX_FREQUENCY
,
we have to reduce the frequency indexes until this is no longer the
case. The while
loop whose continuation expression is "(temp_total_freq
> MAX_FREQUENCY)
" takes care of this problem in the following
way.
First, we initialize temp_total_freq
to 0, as we will
use it to accumulate the frequencies as we modify them. Then we set freq_ptr
to the address of the first entry in the frequency table to be
modified. Now we are ready to step through all the bytes in the
frequency table; for each one, we test whether both indexes in the
current byte are 0. If so, we can't reduce them, so we just add the
frequency value corresponding to the translation of two 0 indexes (BOTH_WEIGHTS_ZERO
)
to temp_total_freq
.
Otherwise, we copy the current index pair into freq
.
If the high index is nonzero, we decrement it. Similarly, if the low
index is nonzero, we decrement it. After handling either or both of
these cases, we add the translation of the new index pair to temp_total_freq
.
After we have processed all of the index values in this way, we retest
the while
condition, and when temp_total_freq
is no longer out of range, we store it back into the frequency table
and return to the main program.
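A sketch of that rescaling pass (names follow the text; it reuses the frequency_info layout sketched earlier, and BOTH_WEIGHTS_ZERO is simply the translation of two zero indexes, i.e., both_weights[0]):

    const unsigned MAX_FREQUENCY = 16383; // one-fourth of the unsigned range

    void rescale_frequencies(frequency_info *temp_freq_info,
                             const unsigned short *both_weights)
    {
        unsigned temp_total_freq = temp_freq_info->total_freq;
        while (temp_total_freq > MAX_FREQUENCY)
        {
            temp_total_freq = 0;
            unsigned char *freq_ptr = temp_freq_info->freq;
            for (int i = 0; i < NO_OF_CHARS / 2; i++, freq_ptr++)
            {
                unsigned char freq = *freq_ptr;
                if (freq == 0)     // both indexes are 0: nothing to reduce
                    temp_total_freq += both_weights[0]; // BOTH_WEIGHTS_ZERO
                else
                {
                    unsigned char high_index = (unsigned char)(freq >> 4);
                    unsigned char low_index = (unsigned char)(freq & 0x0F);
                    if (high_index > 0)
                        high_index--;  // step the high index down one notch
                    if (low_index > 0)
                        low_index--;   // and likewise the low index
                    *freq_ptr =
                        (unsigned char)((high_index << 4) | low_index);
                    temp_total_freq += both_weights[*freq_ptr];
                }
            }
        }
        temp_freq_info->total_freq = (unsigned short)temp_total_freq;
    }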
Finally, we have returned to the main
function (Figure
encode.00), where we copy ch
to oldch
, so that the current character will be used to
select the frequency table for the next character to be encoded; then
we continue in the main loop until all characters have been processed.
When we reach EOF
in the input file, the main loop
terminates; we use encode_symbol
to encode EOF_SYMBOL
,
which tells the receiver to stop decoding. Then we call done_encoding
(Figure arenc.08) and done_outputing_bits
(Figure arenc.09) to flush any
remaining information to the output file. Finally, we close the input
and output files and exit.
done_encoding
function (from compress\arenc.cpp) (Figure arenc.08)codelist/arenc.08
done_outputing_bits
function (from
compress\arenc.cpp) (Figure arenc.09)codelist/arenc.09
Now that we have developed a working version of our program, let's see how much more performance we can get by judicious optimization in the right places. The first step is to find out where those places are. Therefore, I ran the program under Microsoft's profiler on a file of approximately 11 KB, compressing it to a little less than 5 KB (a compression ratio of 2.3 to 1). This didn't take very long, as you can see from the profile in Figure encode.pro1.
Total time: 167.100 milliseconds
80.010 49.0 encode_symbol(unsigned int,unsigned int) (arenc.obj)
29.139 17.9 _main (encode.obj)
18.717 11.5 output_1(void) (arenc.obj)
17.141 10.5 output_0(void) (arenc.obj)
15.957 9.8 update_model(int,unsigned char) (adapt.obj)
2.234 1.4 start_model(void) (adapt.obj)
The majority of the CPU time is spent in the encode_symbol
function, so that's where we'll look first. As I mentioned in the
discussion of the algorithm, a significant amount of the CPU time
needed to encode a symbol is spent in the loop shown in Figure arenc.03. Accordingly, we should be
able to increase the speed of our implementation by using a more
efficient representation of the characters to be encoded, so that the
accumulation loop will be executed fewer times in total. While we
cannot spare the space for self-organizing tables of character
translations, a fixed translation of characters according to some
reasonable estimate of their likelihood in the text is certainly
possible. Let's see how that would affect our performance.
After analyzing a number of (hopefully representative) text files, I
used the results to produce a list of the characters in descending
order of number of occurrences. This information is stored in the symbol_translation
table (Figure adapt.02), which is
used during decoding to translate from the internal code into ASCII.
symbol_translation
table (from compress\adapt.cpp) (Figure adapt.02)codelist/adapt.02
The inverse transformation, used by the encoder, is supplied by the
char_translation
table, which is initialized near the end
of the start_model
function (Figure adapt.00).
To use these new values for each character when encoding and
decoding, we have to make some small changes in the encoding and
decoding main programs. Specifically, in main
, we have to
use the translated character rather than the ASCII code as read from
the input file when we call the encode_symbol
function.
The new version of the main program for encoding is shown in Figure encode1.00.
codelist/encode1.00
Of course, corresponding changes have to be made to the main
function in the decoding program, which then looks like Figure decode1.00.
codelist/decode1.00
Figure char.trans shows the speed improvement resulting from this modification.
Total time: 145.061 milliseconds
58.980 41.8 encode_symbol(unsigned int,unsigned int) (arenc.obj)
25.811 18.3 _main (encode1.obj)
18.964 13.4 output_0(void) (arenc.obj)
18.586 13.2 output_1(void) (arenc.obj)
16.565 11.7 update_model(int,unsigned char) (adapt.obj)
2.213 1.6 start_model(void) (adapt.obj)
(You can find suggested approaches to problems in Chapter artopt.htm).
Footnotes

[10] Since the frequency values are stored in unsigneds, the maximum cumulative frequency is 16,383 if we wish to avoid the use of long arithmetic. This limits the dynamic range to 16,128 for the one common character and 1 for each of the other characters.

[12] The last entry in the upgrade_threshold array is set to 255, which prevents the index from being increased beyond 15, the maximum value that can be stored in a four-bit value; char_total, being an unsigned char variable, cannot exceed 255.

[14] For the precise definitions of low and high, we must wait until we examine encode_symbol (Figure arenc.02).

[15] The range represented by high actually extends almost to high+1; even the authors of the reference that is the source of the original version of this algorithm admit that this is confusing!

[18] At this point the encoded message is still held only in the range between low and high. Rest assured that it will not be left twisting in the wind; there is a function called done_encoding that makes sure that all of the remaining information is encoded in the output stream.
the remaining information is encoded in the output stream.