Return to the table of contents

In this chapter we will examine the Huffman coding and arithmetic coding methods of data compression and develop an implementation of the latter algorithm. The arithmetic coding algorithm allows a tradeoff among memory consumption and compression ratio; our emphasis will be on minimum memory consumption, on the assumption that the result would eventually be used as embedded code within a larger program.

Huffman Coding, Arithmetic Coding, Lookup Tables

**Huffman coding** is widely recognized as the most efficient
method of encoding characters for data compression. This algorithm is a
way of encoding different characters in different numbers of bits, with
the most common characters encoded in the fewest bits. For example,
suppose we have a message made up of the letters 'A', 'B', and 'C',
which can occur in any combination. Figure huffman.freq shows the relative
frequency of each of these letters and the Huffman code assigned to
each one.

Huffman Fixed-length

Letter Frequency Code Code

+--------+---------+--------------+

A | 1/4 | 00 | 00 |

B | 1/4 | 01 | 01 |

C | 1/2 | 1 | 10 |

+--------+---------+--------------+

Let's encode the message "CABCCABC" using both codes. The results are shown in Figure huffman.fixed.

Huffman Fixed-length

Letter Code Code

+-------+------------+

C | 1 | 10 |

A | 00 | 00 |

B | 01 | 01 |

C | 1 | 10 |

C | 1 | 10 |

A | 00 | 00 |

B | 01 | 01 |

C | 1 | 10 |

+-------+------------+

Total bits used 12 16

Let's see how arithmetic coding^{1} would encode the first three characters of
the same message, "CAB".

Cum. Previous Current Output

Message Freq. Freq. Codes Output Output So Far

+-----+----+---------------------+-------+---------+-------+

A | 16 | 16 |000000( 0)-001111(15)| None | 00 | 00 |

B | 16 | 32 |010000(16)-011111(31)| None | 01 | 01 |

* C | 32 | 64 |100000(32)-111111(63)| None | 1 | 1 |

+-----+----+---------------------+-------+---------+-------+

Figure aritha1
is the first of several figures which contain the information needed to
determine how arithmetic coding would encode messages of up to three
characters from an alphabet consisting of the letters 'A', 'B', and
'C', with frequencies of 1/4, 1/4, and 1/2, respectively. The frequency
of a message composed of three characters chosen independently is the
product of the frequencies of those characters. Since the lowest common
denominator of these three fractions is 1/4, the frequency of any
three-character message will be a multiple of (1/4)^{3}, or
1/64. For example, the frequency of the message "CAB" will be
(1/2)*(1/4)*(1/4), or 1/32 (=2/64). For this reason, we will express
all of the frequency values in terms of 1/64ths.

Thus, the "Freq." column signifies the expected frequency of
occurrence of each message, in units of 1/64; the "Cum. Freq." column
accumulates the values in the first column; the "Codes" column
indicates the range of codes that can represent each message^{2};
the "Previous Output" column shows the bits that have been output
before the current character was encoded; the "Current Output" column
indicates what output we can produce at this point in the encoding
process; and the "Output So Far" column shows the cumulative output for
that message, starting with the first character encoded.

As the table indicates, since the first character happens to be a 'C', then we can output "1", because all possible messages starting with 'C' have codes starting with a "1". Let's continue with Figure aritha2 to see the encoding for a two-character message.

Cum. Previous Current Output

Message Freq. Freq. Codes Output Output So Far

+-----+----+---------------------+-------+---------+-------+

AA | 4 | 4 |000000(00)-000011(03)| 00 | 00 | 0000 |

AB | 4 | 8 |000100(04)-000111(07)| 00 | 01 | 0001 |

AC | 8 | 16 |001000(08)-001111(15)| 00 | 1 | 001 |

BA | 4 | 20 |010000(16)-010011(19)| 01 | 00 | 0100 |

BB | 4 | 24 |010100(20)-010111(23)| 01 | 01 | 0101 |

BC | 8 | 32 |011000(24)-011111(31)| 01 | 1 | 011 |

* CA | 8 | 40 |100000(32)-100111(39)| 1 | 00 | 100 |

CB | 8 | 48 |101000(40)-101111(47)| 1 | 01 | 101 |

CC | 16 | 64 |110000(48)-111111(63)| 1 | 1 | 11 |

+-----+----+---------------------+-------+---------+-------+

After encoding the first two characters of our message, "CA", our cumulative output is "100", since the range of codes for messages starting with "CA" is from "100000" to "100111"; all these codes start with "100". The whole three-character message is encoded as shown in Figure aritha3.

We have generated exactly the same output from the same input as we did with Huffman coding. So far, this seems to be an exercise in futility; is arithmetic coding just another name for Huffman coding?

These two algorithms provide the same compression efficiency only when the frequencies of the characters to be encoded happen to be representable as integral powers of 1/2, as was the case in our examples so far; however, consider the frequency table shown in Figure huffman.poor.

Cum. Previous Current Output

Message Freq. Freq. Codes Output Output So Far

+-----+----+---------------------+-------+---------+-------+

AAA | 1 | 1 |000000(00)-000000(00)| 0000 | 00 | 000000|

AAB | 1 | 2 |000001(01)-000001(01)| 0000 | 10 | 000001|

AAC | 2 | 4 |000010(02)-000011(03)| 0000 | 1 | 00001 |

ABA | 1 | 5 |000100(04)-000100(04)| 0001 | 00 | 000100|

ABB | 1 | 6 |000101(05)-000101(05)| 0001 | 01 | 000101|

ABC | 2 | 8 |000110(06)-000111(07)| 0001 | 1 | 00011 |

ACA | 2 | 10 |001000(08)-001001(09)| 001 | 00 | 00100 |

ACB | 2 | 12 |001010(10)-001011(11)| 001 | 01 | 00101 |

ACC | 4 | 16 |001100(12)-001111(15)| 001 | 1 | 0011 |

BAA | 1 | 17 |010000(16)-010000(16)| 0100 | 00 | 010000|

BAB | 1 | 18 |010001(17)-010001(17)| 0100 | 01 | 010001|

BAC | 2 | 20 |010010(18)-010011(19)| 0100 | 1 | 01001 |

BBA | 1 | 21 |010100(20)-010100(20)| 0101 | 00 | 010100|

BBB | 1 | 22 |010101(21)-010101(21)| 0101 | 01 | 010101|

BBC | 2 | 24 |010110(22)-010111(23)| 0101 | 1 | 01011 |

BCA | 2 | 26 |011000(24)-011001(25)| 011 | 00 | 01100 |

BCB | 2 | 28 |011010(26)-011111(27)| 011 | 01 | 01101 |

BCC | 4 | 32 |011100(28)-011111(31)| 011 | 1 | 0111 |

CAA | 2 | 34 |100000(32)-100001(33)| 100 | 00 | 10000 |

* CAB | 2 | 36 |100010(34)-100011(35)| 100 | 01 | 10001 |

CAC | 4 | 40 |100100(36)-100111(39)| 100 | 1 | 1001 |

CBA | 2 | 42 |101000(40)-101001(41)| 101 | 00 | 10100 |

CBB | 2 | 44 |101010(42)-101011(43)| 101 | 01 | 10101 |

CBC | 4 | 48 |101100(44)-101111(47)| 101 | 1 | 1011 |

CCA | 4 | 52 |110000(48)-110011(51)| 11 | 00 | 1100 |

CCB | 4 | 56 |110100(52)-110111(55)| 11 | 01 | 1101 |

CCC | 8 | 64 |110000(56)-111111(63)| 11 | 1 | 111 |

+-----+----+---------------------+-------+---------+-------+

Huffman Fixed-length

Letter Frequency Code Code

+--------+---------+--------------+

A | 3/4 | 0 | 0 |

B | 1/4 | 1 | 1 |

+--------+---------+--------------+

Using Huffman coding, there is no way to
assign theoretically optimal codes to characters with such a frequency
distribution.^{3} Since the shortest
possible Huffman code is one bit, the best we can do is to assign a
one-bit code to each character, although this does not reflect the
difference in their frequencies. In fact, Huffman coding can never
provide any compression at all with a two-character alphabet.

However, such a situation is handled very well indeed by arithmetic compression. To see how, we will start by asking a fundamental question: what is a bit?

A bit is the amount of information required to specify an alternative that has a frequency of 50%; two bits can specify alternatives that have a frequency of 25%, and so forth. For example, the toss of a coin can result in either heads or tails, each of which can be optimally represented by a one-bit code; similarly, if a chance event has four equally likely outcomes, we can express each possible result most economically with two bits. On the other hand, as we have seen in our discussion of Huffman codes, we can have a number of alternatives that are not equally likely; in that case, we assign longer codes for those alternatives that are less likely. However, the shortest possible code in Huffman coding is one bit, which is assigned to an outcome with a frequency of one-half.

The general formula for the optimal length of a code specifying a
particular outcome with frequency *f* is "log_{2}(1/*f*)".
In our previous examples, an outcome with a frequency of .5 should have
a code length of log_{2}(1/(1/2)) or 1 bit. Similarly, if an
outcome has a frequency of .25 it should have a code length of log_{2}(1/(1/4)),
or two bits.

But what if one of the possible outcomes has a frequency greater
than one-half? Logically, we should use less than one bit to specify
it. For example, if we have a data file in which 84% of the characters
are spaces, we can calculate the appropriate number of bits in the code
for that character as log_{2}(1/(.84)), or approximately .25
bits. If the remaining 255 characters all have equal frequencies, each
of these frequencies is .0627%, so that our formula reduces to log_{2}(1/(.0627)),
or approximately 10.63 bits each. This would result in an average code
length of (.84)*(.25) + (.16)*10.63, or 1.91 bits per character. By
contrast, a Huffman code would require (.84)*1 + (.16)*9, or 2.28 bits
per character. If we were compressing a 250-Kbyte file with such
characteristics, using Huffman codes would produce a 71-Kbyte file,
whereas arithmetic coding would result in a 60-Kbyte file, about a 20%
difference between these two approaches.^{4}

Of course, we can't output a code of less than one bit. However, we can use a set of one or more bits to represent more than one character in the same message. This is why the statement that Huffman coding is the most efficient way to represent characters is true, but misleading; if our messages contain more than one character each, we may be able to combine the codes for a number of characters while consuming a fractional number of bits for some characters. To make this clearer, let's go through the encoding of a three-character message, "ABA", from a two-character alphabet in which the letter 'A' accounts for three-fourths of all the characters and the letter 'B' the remaining one-fourth. The situation after we see the first character is shown in Figure arithb1.

Message Cum. Previous Current Output

So Far Freq. Freq. Codes Output Output So Far

+-----+----+---------------------+-------+---------+-------+

* A | 48 | 48 |000000( 0)-101111(47)| None | None | None |

B | 16 | 64 |110000(48)-111111(63)| None | 11 | 11 |

+-----+----+---------------------+-------+---------+-------+

Nothing! We don't know whether the first bit of our encoded message will be a 0 or a 1; that depends on what happens next. Remember, messages starting with the letter 'A' can have codes starting with 00 through 10, or three-fourths of all possible codes. An 'A' gives us somewhat less than 1/2 bit of information, not nearly enough to produce any output by itself. Now let's look at Figure arithb2 for the information needed to encode the next character.

Message Cum. Previous Current Output

So Far Freq. Freq. Codes Output Output So Far

+-----+----+---------------------+-------+---------+-------+

AA | 36 | 36 |000000( 0)-100011(35)| None | None | None |

* AB | 12 | 48 |100100(36)-101111(47)| None | 10 | 10 |

BA | 12 | 60 |110000(48)-111011(59)| 11 | None | 11 |

BB | 4 | 64 |111100(60)-111111(63)| 11 | 11 | 1111 |

+-----+----+---------------------+-------+---------+-------+

We have two bits of output, since all the codes
for messages starting with "AB" have the initial bits "10".^{5} Let's continue with Figure arithb3 for the third (and last)
character of our message.

Message Cum. Previous Current Output

So Far Freq. Freq. Codes Output Output So Far

+-----+----+---------------------+-------+---------+-------+

AAA | 27 | 27 |000000( 0)-011010(26)| None | 0 | 0 |

AAB | 9 | 36 |011011(27)-100011(35)| None | None | None |

* ABA | 9 | 45 |100100(36)-101100(44)| 10 | None | 10 |

ABB | 3 | 48 |101101(45)-101111(47)| 10 | 11 | 1011 |

BAA | 9 | 57 |110000(48)-111000(56)| 11 | None | 11 |

BAB | 3 | 60 |111001(57)-111011(59)| 11 | 10 | 1110 |

BBA | 3 | 63 |111100(60)-111110(62)| 1111 | None | 1111 |

BBB | 1 | 64 |111111(63)-111111(63)| 1111 | 11 | 111111|

+-----+----+---------------------+-------+---------+-------+

We have still produced only two bits of output
from our three character message, "ABA". The best we could do with
Huffman coding is three bits. However, this is not the extreme case; if
we were encoding the most frequent message, "AAA", we would have only a
"0" bit.^{6}

This algorithm will work nicely if we happen to know in advance what the frequencies of all the possible characters will be. But how do we acquire this information? If our program reads an input file and writes an output file, we can go through the input file once, counting the number of times a given character appears, build the table of frequencies, and then make a second pass using this table to do the actual encoding. However, this is not possible when we are compressing data as it is being generated, as there is then no input file to be analyzed. This might seem to be an insurmountable obstacle.

Luckily, it is not; calculating the character frequencies as we encode yields about the same compression efficiency as precalculation, and in some cases even better efficiency! The reason for this surprising result is that most files (or sets of data) are not uniform throughout, but rather exhibit local variations in the frequency of a given character or set of characters. Calculating the frequencies on the fly often produces a better match for these changing characteristics.

Therefore, our approach is as follows. Every character is initially assumed to have the same frequency of occurrence. After each character is read and encoded, its entry in the table of frequencies is increased to reflect our having seen it. The next time we encode it, its encoding will account for the fact that we have seen it before.

That may seem quite simple for the sender, but not for the receiver. If the encoding table is being modified as we go, how does the receiver keep in step?

The receiver has the same initial table as the transmitter; each
character has the same expected frequency of occurrence. Then, as the
receiver decodes each character, it updates its copy of the frequency
table. This also explains why we increase a character's frequency after
we encode it, rather than before; until the receiver decodes the
character, it cannot update the frequency of that character's
occurrence.^{7}

In our implementation, we achieve a large improvement in compression efficiency by using the context in which a character occurs when estimating its frequency of occurrence. It should be apparent that the frequency of a given character appearing in a message at a given point is quite dependent on the previous context. For example, in English text, a "Q" is almost certain to be followed by a "u", at least if we exclude articles about software such as DESQview! Ideally, therefore, the amount of information needed to encode a "u" following a "q" should be very small. The same principle, of course, applies to the encoding of a line feed following a carriage return in text files where this is a virtual certainty.

On the other hand, the amount of storage required to keep track of
a large amount of previous context can become excessive.^{8}
Even one character of previous context requires the construction of 256
tables of frequencies, one for each possible previous character. A
direct extension of the approach given in the reference in footnote
would require over 300 KB of storage for these tables.

We will apply a number of space-saving methods to reduce the above storage requirement by about 90%, to approximately 35 KB, while still achieving good data-compression performance in most cases.

In order to achieve such a reduction in memory consumption, we must
avoid storing anything that can be recalculated, such as the cumulative
frequencies of all possible characters.^{9} We
must also dispense with the use of a self-organizing table that
attempts to speed up encoding and decoding by moving more frequently
used characters toward the front of the table, as is done in the
reference in footnote . However, we must provide as large a dynamic
range of frequencies as possible: the larger the ratio between the
highest frequency and the lowest, the greater the possible efficiency.
The greatest dynamic range is needed when one character always occurs
in a particular context, such as line feed after carriage return.
Assuming that we must be able to encode any of the 256 possible
one-byte values, the algorithm limits the possible dynamic range to
approximately one-fourth of the range of an `unsigned`

value.^{10} For maximum compression
efficiency we therefore need 256 tables of 256 two-byte entries each,
consuming a total of 128 KB.^{11} When I
first implemented this algorithm, I saw no way to reduce this
significantly.

Then early one morning, I woke up with the answer. We need some very large frequency values and some very small ones, but surely not every one in between. Why not use a code to represent one of a number of frequency values? These values would be spaced out properly to get as close as possible to optimal compression efficiency, but each would be represented by a small index, rather than literally.

How small an index? First I considered using an eight-bit index. But that still would require 64 KB for a complete 256x256 table. Maybe a smaller index would help. But wouldn't that impose an unreasonable processing time penalty?

Amazingly enough, we can use a four-bit index with no time penalty: in fact, processing is actually faster than with a byte index! This seemingly magical feat is accomplished by one of my favorite time-saving methods: the lookup table.

You see, a significant amount of the CPU time needed to encode a symbol is spent in a loop that computes the necessary portion of the cumulative frequency table for the current context. If each frequency value occupied one byte, we would have to execute that loop once for every index in the table up to the ASCII value of the character whose cumulative frequency we are trying to compute. However, since we have packed two indexes into one byte, we can accumulate two indexes worth of frequency values with one addition, thus reducing the number of times the loop has to be executed by approximately 50%.

Each execution of the loop translates a pair of indexes contained
in one byte into a total frequency value that is the sum of their
individual frequency values by using the byte as an index into a
256-word table, which has one entry for each possible combination of
two indexes. This table (the `both_weights`

table) is
calculated once at the start of the program.

Once we have determined the frequency of occurrence assigned to the character to be encoded, we can decide how many bits we can send and what their values are.

The receiver uses almost the same code as the sender, with two main exceptions. First, the receiver reads the input one bit at a time and outputs each character as soon as it is decoded, just the reverse of the sender. Second, rather than knowing how many frequency entries we have to accumulate in advance as the sender does, we have to accumulate them until we find the one that corresponds to the range we are trying to interpret. The latter situation reduces the efficiency of our loop control, which accounts for much of the difference in speed between encoding and decoding.

Let's start at the beginning, with the `main`

function
in `encode.cpp`

(Figure encode.00).

codelist/encode.00

This function opens the input and output files and sets up buffers
to speed up input and output, then calls `start_model`

(Figure adapt.00).

`start_model`

function (from compress\adapt.cpp) (Figure adapt.00)codelist/adapt.00

This function starts out by initializing the `upgrade_threshold`

array, which is used to determine when to promote a character to the
next higher frequency index value. As noted above, these values are not
consecutive, so that we can use a four-bit index rather than literal
values; this means that we have to promote a character only once in a
while rather than every time we see it, as we would do with literal
frequency values. How do we decide when to do this?

A pseudorandom approach seems best: we can't use a genuine random
number to tell us when to increment the index, because the receiver
would have no way of reproducing our decisions when decoding the
message. My solution is to keep a one-byte hash total of the ASCII
codes for all the characters that have been sent (`char_total`

)
and to increment the index in question whenever `char_total`

is greater than the corresponding value stored in the `upgrade_threshold`

array. That threshold value is calculated so that the probability of
incrementing the index is inversely proportional to the gap between
frequency values in the translation table.^{12}
If, for example, each frequency value in the translation table were
twice the previous value, there would be a 1/2 probability of
incrementing the index each time a character was encountered. After we
finish initializing the `upgrade_threshold`

array, we set `char_total`

to 0, in preparation for accumulating our hash total.

The next operation in `start_model`

is to generate the `both_weights`

table. As we discussed above, this table allows us to translate from a
pair of frequency values (or weights) to the total frequency to which
they correspond. We calculate it by generating every possible pair and
adding them together to fill in the corresponding table entry. The
values in the `translate`

table are defined in Figure model.00, in the line that starts
with `#define TRANSLATE_TABLE`

. How did I generate these
values?

codelist/model.00

It wasn't easy. I knew that I wanted to allow the largest possible dynamic range, which means that the lowest value has to be 1 and the highest value has to be close to the maximum that can be accommodated by the algorithm (16,128). The reason I chose a top value lower than that maximum value is that if the highest value were 16,128, then the occurrence of any character other than the preferred one would cause the total frequency to exceed the allowable maximum, with the result that the table of frequencies would be recalculated to reduce the top value to the next lower step. This would greatly reduce the efficiency of the compression for this case. That accounts for the lowest and highest values. What about the ones in between?

Initially, I decided to use a **geometric progression**, much
like the tuning of a piano; in such a progression, each value is a
fixed multiple of the one before it. However, I found that I achieved
better compression on a fairly large sample of files by starting the
progression with the second value at 16 and ending with the next to the
last at 1024. Why is this so?

The reason for leaving a big gap between the lowest frequency and the second lowest one is that many characters never occur in a particular situation. If they occur once, they are likely to recur later. Therefore, setting the next-to-the-lowest frequency to approximately one-tenth of 1% of the maximum value improves the efficiency of the compression. I have also found through experimentation that the compression ratio is improved if the first time a character is seen, it is given a frequency of about six-tenths of 1%, which requires an initial index of 7. Lower initial frequencies retard adaptation to popular new characters, and higher ones overemphasize new characters that turn out to be unpopular in the long run.

The reason to leave a large gap between the next-to-highest frequency and the highest one is that most of the time, a very skewed distribution has exactly one extremely frequent character. It is rare to find several very high-frequency characters that have about the same frequency. Therefore, allowing the highest frequency to approach the theoretical maximum produces the best results. Of course, these are only empirical findings. If you have samples that closely resemble the data that you will be compressing, you can try modifying these frequency values to improve the compression.

Continuing in `start_model`

(Figure adapt.00), we initialize `NO_OF_CHARS`

`frequency_info`

tables, one for every possible character.
Each of these tables will store the frequencies for characters that
follow a particular character. If we start to encode the string "This
is a test", the first character, 'T' will be encoded using the table
for character 0 (a null); this is arbitrary, since we haven't seen any
characters before this one. Then the 'h' will be encoded using the
table for 'T'; the 'i' will be encoded with the table for 'h', and so
forth. This approach takes advantage of the context dependence of
English text; as we noted before, after we see a 'q', the next
character is almost certain to be a 'u', so we should use a very short
code to represent a 'u' in that position.

However, our initialization code contains no information about the
characteristics of English text (or any other kind of data for that
matter). We assign the same frequency (the lowest possible) to every
character in every frequency table. As discussed above, the encoding
program will learn the appropriate frequencies for each character in
each table as it processes the input data. At the end of the
initialization for each table, we set its `total_freq`

value to the translation of the lowest frequency, multiplied by the
number of index pairs in the table. This value is needed to calculate
the code that corresponds to each character, and recalculating it every
time we access the table would be time-consuming.

The last operation in `start_model`

initializes the `char_translation`

array, which is used to translate the internal representation of the
characters being encoded and decoded to their ASCII codes.^{13}
Then we return to `encode.cpp`

.

The next operation in `main`

is to call `start_outputing_bits`

(Figure arenc.00).

`start_outputing_bits`

function (from compress\arenc.cpp) (Figure arenc.00)codelist/arenc.00

Of course, we can't send individual bits to the output file; we
have to accumulate at least one byte's worth. Whenever we have some
bits to be output, we will store them in `buffer`

; since we
haven't encoded any characters yet, we set it to 0. In order to keep
track of how many more bits we need before we can send `buffer`

to the output file, we set `bits_to_go`

to eight; then we
return to `main`

.

The next thing we do in `main`

is to call `start_encoding`

(Figure arenc.01).

`start_encoding`

function (from compress\arenc.cpp) (Figure arenc.01)codelist/arenc.01

This is a very short function, but it initializes some of the most
important variables in the program; `low`

, `high`

,
and `bits_to_follow`

. The first two of these keep track of
the range of codes for the current message; at this point we know
nothing about the message, so they are set to indicate the widest
possible range of messages, from 0 to `TOP_VALUE`

, which is
defined in `arith.h`

(Figure arith.00).

codelist/arith.00

In our current implementation, with 16-bit code values, `TOP_VALUE`

evaluates to 65535.^{14} The third variable, `bits_to_follow`

,
keeps track of the number of bits that have been deferred for later
output. This is used where the range of possible codes for the current
message includes codes that start with 0 and some that start with 1; as
we have seen already, in such a situation we're not ready to send out
any bits yet. After initializing these variables, we return to `main`

.

Upon returning to `main`

, we need to do one more
initialization before we enter the main loop of our program, which is
executed once for each character in the input file; namely, setting `oldch`

to 0. This variable controls the context in which we will encode each
character. Since we haven't seen any characters as yet, it doesn't
really matter which frequency table we use, as long as the decoding
program selects the same initial table.

The first operation in the main loop is to get the character to be
encoded, via `getc`

. If we have reached the end of the
file, we break out of the loop. Otherwise, we call `encode_symbol`

(Figure arenc.02) to place the
representation of the current character in the output stream.

`encode_symbol`

function (from compress\arenc.cpp) (Figure arenc.02)codelist/arenc.02

This function takes two parameters: `ch`

, the character
to be encoded, and `oldch`

, which determines the frequency
table to be used for this encoding. As we have noted above, selecting a
frequency table based upon the previous character encoded provides far
better compression efficiency, as the frequency of occurrence of a
particular character is greatly affected by the preceding character.

`encode_symbol`

Although `encode_symbol`

is a short function, it is the
subtlest function in this chapter; fortunately, you can use this
algorithm effectively without going through the explanation below. The
version here closely follows the reference in footnote ;
if you are really interested in the details of operation of this
function, I strongly advise you to study that reference very carefully
in conjunction with the explanation here.

To clarify this algorithm, we will work through a somewhat
simplified example as we examine the code. For ease in calculation, we
will set `TOP_VALUE`

to 255, or 11111111 binary, rather
than 65535 as in the actual program; as a result, `high`

will start out at 255 as well. We will also use a single, constant
frequency table (Figure sampfreq)
containing only four entries rather than selecting a 256-entry table
according to the previous character and modifying it as we see each
character, so our `translate`

and `both_weights`

tables (Figures samptrans and sampboth, respectively) will be
adjusted correspondingly. Instead of ASCII, we will use the codes from
0 (for 'A') to 3 (for 'D') in our example message.

Character Frequency code

Index Value

`both_weights`

table (Figure sampboth)First Second Index

Index 0 1 2 3

+----+----+----+----+

0 | 2 | 3 | 9 | 53 |

+----+----+----+----+

1 | 3 | 4 | 10 | 54 |

+----+----+----+----+

2 | 9 | 10 | 16 | 60 |

+----+----+----+----+

3 | 53 | 54 | 60 |104 |

+----+----+----+----+

Now we are ready to enter the frequency accumulation loop, which is shown in Figure arenc.03.

codelist/arenc.03

The reason we need this loop is that, as we saw before, we cannot
afford to keep the cumulative frequency tables in memory; they would
occupy hundreds of kilobytes of memory. Instead, we calculate the
cumulative frequency for each character being encoded, as we need it.
The `total_freq`

variable, however, we do maintain from one
encoding to the next; recalculating it would require us to go all the
way through the frequency table for each character, even if we are
encoding a character with a low ASCII code. Since we are saving the
total frequency with the table, we have to accumulate the frequencies
only up to the ASCII code of the character we are encoding.

Let's see how this loop accumulates the frequencies of all the
characters up to the character we want to encode. The first interesting
item is that the loop executes only half as many times as the value of
the character being encoded, since we are packing two 4-bit indexes
into each byte of the frequency table. So the first statement in the
loop retrieves one of these pairs of indexes from the frequency table
and increments the pointer to point to the next pair. Then we index
into the `both_weights`

table with the index pair we just
retrieved and set `total_pair_weight`

to that entry in the
table. The `both_weights`

table is the key to translating
two 4-bit indexes into a total frequency. Each entry in the table is
the sum of the frequency values that correspond to the two indexes that
make up the byte we use to index into the table. Finally, we add `total_pair_weight`

to `prev_cum`

, which is accumulating all of the
frequencies.

In our example, the first letter of our message is 'D', which has aThe next section of code, shown in Figure arenc.04, finishes the accumulation of the cumulative frequencies of the character to be encoded and the previous character, for both of the possible alignments of the character we are encoding and the previous character.`symbol`

value of 3. Using the frequency table in Figure sampfreq, we execute the statements in the loop once. First, we set`current_pair`

to the first entry in the`frequency`

table, 00010011, which indicates that the frequency code for 'A' is 1 (0001 binary) and the frequency code for 'B' is 3 (0011 binary). Then we set`total_pair_weight`

to entry (1,3) from the`both_weights`

table, the sum of the frequencies to which this pair of indexes corresponds; its value is 54. The last statement in the loop adds this value to`prev_cum`

, which was set to 0 before the loop was started.

codelist/arenc.04

If the target character has an even ASCII code, we already have the
correct value for `prev_cum`

; to calculate `cum`

,
the frequency accumulation for the current character, we need the first
index from the next byte, which is stored in the high half of the byte.
So we pick up `current_pair`

, shift its high index down,
and use it to retrieve the corresponding weight from the `translate`

table. Then we add that frequency value to `prev_cum`

to
calculate `cum`

.

On the other hand, if the target character has an odd ASCII code,
we need to update both `prev_cum`

and `cum`

.
First, we add the total weights for the last two characters to `prev_cum`

,
which results in `cum`

. Then we translate the high half of
the `current_pair`

and add that to `prev_cum`

.

In our example, the code of the first character is 3, so we need to update both`prev_cum`

and`cum`

. The value of`current_pair`

is (0,2). Looking in the`both_weights`

table, we set`total_pair_weight`

to the translation of that pair, which is 9. Then`cum`

is calculated as the sum of`prev_cum`

and`total_pair_weight`

, or 63. Then we extract the high part of`current_pair`

, 0, which translates to 1; we add this amount to`prev_cum`

, setting it to 55. This means that the code range associated with character "D" starts at 55 and ends slightly before 64, a range of 9 positions out of the total of 64. This will allow us to send out slightly less than three bits for this character, as we will narrow the range by a factor of more than 7.

Now that we have calculated the cumulative frequencies of the current character and the previous character, we are ready to narrow the range of frequencies that correspond to our message, as shown in Figure arenc.05.

codelist/arenc.05

The first line of this section of code calculates the previous range of frequencies for our message. Then the other lines calculate the new range of the message.

The purpose of `low`

and `high`

is to
delimit the frequency interval in which the current message falls; `range`

is just the size of the frequency interval extending from the old value
of `low`

to the old value of `high`

, inclusive.

In our example, the old value ofThe new values of`low`

is 0 and the old value of`high`

is 255. Therefore, the formula for`range`

,`(long)(high-low)+1`

, produces 256, which is its maximum value. This makes sense, as we have not yet used the information from the first character of our message.

`low`

and `high`

represent
the narrowing of the previous range due to the frequency of the new
character. We calculate the new value of `high`

, which
represents the high end of the new range of frequencies for our message
after the most recent character is added to the message.`low`

represents the low end of
the range of frequencies for our message after the most recent
character is added to the message.
In our example,`low`

is still 0,`range`

is 256,`cum`

is 63, and`total_freq`

is 63. The expression to calculate`high`

is`low + (range * cum) / total_freq - 1`

. Therefore,`high`

is calculated as 0+(256*63)/63-1, or 255. This means that the new range of frequencies for our message ends slightly below 256. Next, we recalculate`low`

. Its value is still 0 so far,`range`

is 256,`prev_cum`

is 55, and`total_freq`

is 63. The expression to calculate`low`

is`low + (range * prev_cum)/total_freq`

. Therefore, we calculate the new value of`low`

as 0+(256*55)/63, or 223. This means that the new range of frequencies for our message begins at 223.

Now we are finally ready to start sending bits out. The loop shown in Figure arenc.06 extracts as many bits as possible from the encoding of the message so far, and widens the range correspondingly to allow the encoding of more characters.

codelist/arenc.06

In order to understand this code, we need to look at the table of
possible initial bits of `high`

and `low`

,
which is given in Figure initcode.

The entries in this table could use some explanation. The first two
columns contain all the possible combinations of the first two bits of `low`

and `high`

; we know that `low`

cannot be
greater than `high`

, since these two values delimit the
range of codes for the message being compressed. If `low`

ever became greater than `high`

, it would be impossible to
encode any further characters. The "Action" column indicates what, if
anything, we can output now. Clearly, if `low`

and `high`

have the same first bit, we can output that bit. The entries labeled
"Next" indicate that since the separation between the values of `low`

and `high`

is at least one-fourth of the total range of
values (0-`TOP_VALUE`

), we can encode at least one more
character now; this is the reason for the limit on the total frequency
of all characters.

Low High Action

+--------+---------+---------+

| 00 | 00 | 0 |

| 00 | 01 | 0 |

| 00 | 10 | Next |

| 00 | 11 | Next |

| 01 | 01 | 0 |

| 01 | 10 | Defer |

| 01 | 11 | Next |

| 10 | 10 | 1 |

| 10 | 11 | 1 |

| 11 | 11 | 1 |

+--------+---------+---------+

The first test determines whether the highest
frequency value allocated to our message is less than one-half of the
total frequency range. If it is, we know that the first output bit is
0, so we call the `bit_plus_follow_0`

macro (Figure arenc.07) to output that bit. Let's
take a look at that macro.

First we call `output_0`

, which adds a 0 bit to the
output buffer and writes it out if it is full. Then, in the event that `bits_to_follow`

is greater than 0, we call `output_1`

and decrement `bits_to_follow`

until it reaches 0. Why do we do this?

`bit_plus_follow_0`

macro (from compress\arenc.cpp) (Figure arenc.07)codelist/arenc.07

The reason is that `bits_to_follow`

indicates the
number of bits that could not be output up till now because we didn't
know the first bit to be produced ("deferred" bits). For example, if
the range of codes for the message had been 011 through 100, we would
be unable to output any bits until the first bit was decided. However,
once we have enough information to decide that the first bit is 0, we
can output three bits, "011". The value of `bits_to_follow`

would be 2 in that case, since we have two deferred bits. Of course, if
the first bit turns out to be 1, we would emit "100" instead. The
reason that we know that the following bits must be the opposite of the
initial bit is that the only time we have to defer bits is when the
code range is split between codes starting with 0 and those starting
with 1; if both `low`

and `high`

started with
the same bit, we could send that bit out.

The values ofAssuming that we haven't obtained the current output bit yet, we continue by testing the complementary condition. If`HALF`

,`FIRST_QTR`

, and`THIRD_QTR`

are all based on`TOP_VALUE`

(Figure arith.00). In our example,`FIRST_QTR`

is 64,`HALF`

is 128, and`THIRD_QTR`

is 192. The current value of`high`

is 255, which is more than`HALF`

, so we continue with our tests.

`low`

is greater than or equal to `HALF`

, we know that the entire
frequency range allocated to our message so far is in the top half of
the total frequency range; therefore, the first bit of the message is
1. If this occurs, we output a 1 via `bit_plus_follow_1`

.
The next two lines reduce `high`

and `low`

by `HALF`

,
since we know that both are above that value.
In our example,If we haven't passed either of the tests to output a 0 or a 1 bit, we continue by testing whether the range is small enough but in the wrong position to provide output at this time. This is the situation labeled "Defer" in figure initcode. We know that the first two bits of the output will be either 01 or 10; we just don't know which of these two possibilities it will be. Therefore, we defer the output, incrementing`low`

is 223, which is more than`HALF`

. Therefore, we can call`bit_plus_follow_1`

to output a 1 bit. Then we adjust both`low`

and`high`

by subtracting`HALF`

, to account for the 1 bit we have just produced;`low`

is now 95 and`high`

is 127.

`bits_to_follow`

to
indicate that we have done so; we also reduce both `low`

and `high`

by `FIRST_QTR`

, since we know that
both are above that value. If this seems mysterious, remember that the
encoding information we want is contained in the differences between `low`

and `high`

, so we want to remove any redundancy in their
values.`break`

statement near the end of the loop, we still don't have any idea what
the next output bit(s) will be. This means that the range of
frequencies corresponding to our message is still greater than 50% of
the maximum possible frequency; we must have encoded an extremely
frequent character, since it hasn't even contributed one bit! If this
happens, we break out of the output loop; obviously, we have nothing to
declare.
In any of the other three cases, we now have arrived at the
statement `low <<= 1;`

with the values of `low`

and `high`

guaranteed to be less than half the maximum
possible frequency.^{17} Therefore, we shift
the values of `low`

and `high`

up one bit to
make room for our next pass through the loop. One more detail: we
increment `high`

after shifting it because the range
represented by `high`

actually extends almost to the next
frequency value; we have to shift a 1 bit in rather than a 0 to keep
this relationship.

In our example,`low`

is 95 and`high`

is 127, which represents a range of frequencies from 95 to slightly less than 128. The shifts give us 190 for`low`

and 255 for`high`

, which represents a range from 190 to slightly less than 256. If we hadn't added 1 to`high`

, the range would have been from 190 to slightly less than 255.

Since we have been over the code in this loop once already, we can
continue directly with the example. We start out with `low`

at 190 and `high`

at 255. Since `high`

is not
less than `HALF`

(128), we proceed to the second test,
where `low`

turns out to be greater than `HALF`

.
So we call `bit_plus_follow_1`

again as on the first loop
and then reduce both `low`

and `high`

by `HALF`

,
producing 62 for `low`

and 127 for `high`

. At
the bottom of the loop, we shift a 0 into `low`

and a 1
into `high`

, resulting in 124 and 255, respectively.

On the next pass through the loop, `high`

is not less
than `HALF`

, `low`

isn't greater than or equal
to `HALF`

, and `high`

isn't less than `THIRD_QTR`

(192), so we hit the `break`

and exit the loop. We have
sent two bits to the output buffer.

We are finished with `encode_symbol`

for this
character. Now we will start processing the next character of our
example message, which is 'B'. This character has a symbol value of 1,
as it is the second character in our character set. First, we set `prev_cum`

to 0. The frequency accumulation loop in Figure arenc.03 will not be executed at all,
since `symbol/2`

evaluates to 0; we fall through to the
adjustment code in Figure arenc.04
and select the odd path after the `else`

, since the symbol
code is odd. We set `current_pair`

to (1,3), since that is
the first entry in the frequency table. Then we set `total_pair_weight`

to the corresponding entry in the `both_weights`

table,
which is 54. Next, we set `cum`

to 0 + 54, or 54. The high
part of the current pair is 1, so `high_half_weight`

becomes entry 1 in the `translate`

table, or 2; we add this
to `prev_cum`

, which becomes 2 as well.

Now we have reached the first line in Figure arenc.05. Since the current value of `low`

is 124 and the current value of `high`

is 255, the value of
`range`

becomes 131. Next, we recalculate `high`

as 124 + (131*54)/63 - 1, or 235. The new value of `low`

is
124 + (131*2)/63, or 128. We are ready to enter the output loop.

First, `high`

is not less than `HALF`

, so
the first test fails. Next, `low`

is equal to `HALF`

,
so the second test succeeds. Therefore, we call `bit_plus_follow_1`

to output a 1 bit; it would also output any deferred bits that we might
have been unable to send out before, although there aren't any at
present. We also adjust `low`

and `high`

by
subtracting `HALF`

, to account for the bit we have just
sent; their new values are 0 and 107, respectively.

Next, we proceed to the statements beginning at ```
low
<<= 1;
```

, where we shift `low`

and `high`

up, injecting a 0 and a 1, respectively; the new values are 0 for `low`

and 215 for `high`

. On the next pass through the loop we
will discover that these values are too far apart to emit any more
bits, and we will break out of the loop and return to the main
function.

We could continue with a longer message, but I imagine you get the
idea. So let's return to the `main`

function (Figure encode.00).^{18}

`update_model`

The next function called is `update_model`

(Figure adapt.01), which adjusts the
frequencies in the current frequency table to account for having seen
the most recent character.

`update_model`

function (from compress\adapt.cpp) (Figure adapt.01)codelist/adapt.01

The arguments to this function are `symbol`

, the
internal value of the character just encoded, and `oldch`

,
the previous character encoded, which indicates the frequency table
that was used to encode that character. What is the internal value of
the character? In the current version of the program, it is the same as
the ASCII code of the character; however, near the end of the chapter
we will employ an optimization that involves translating characters
from ASCII to an internal code to speed up translation.

This function starts out by adding the character's ASCII code to `char_total`

,
a hash total which is used in the simple pseudorandom number generator
that we use to help decide when to upgrade a character's frequency to
the next frequency index code. We use the `symbol_translation`

table to get the ASCII value of the character before adding it to `char_total`

;
this is present for compatibility with our final version which employs
character translation.

The next few lines initialize some variables: `old_weight_code`

,
which we set when changing a frequency index code or "weight code", so
that we can update the frequency total for this frequency table; `temp_freq_info`

,
a pointer to the frequency table structure for the current character;
and `freq_ptr`

, the address of the frequency table itself.

Next, we compute the index into the frequency table for the weight
code we want to examine. If this index is even, that means that this
symbol is in the high part of that byte in the frequency table. In this
case, we execute the code in the `true`

branch of the `if`

statement "`if (symbol % 2 == 0)`

". This starts by setting `temp_freq`

to the high four bits of the table entry. If the result is 0, this
character has the lowest possible frequency value; we assume that this
is because it has never been encountered before and set its frequency
index code to `INIT_INDEX`

. Then we update the `total_freq`

element in the frequency table.

However, if `temp_freq`

is not 0, we have to decide
whether to upgrade this character's frequency index code to the next
level. The probability of this upgrade is inversely proportional to the
ratio of the current frequency to the next frequency; the larger the
gap between two frequency code values, the less the probability of the
upgrade. So we compare `char_total`

to the entry in `upgrade_threshold`

;
if `char_total`

is greater, we want to do the upgrade, so
we record the previous frequency code in `old_weight_code`

and add `HIGH_INCREMENT`

to the byte containing the
frequency index for the current character. We have to use `HIGH_INCREMENT`

rather than 1 to adjust the frequency index, since the frequency code
for the current character occupies the high four bits of its byte.

Of course, the character is just as likely to be in the low part of
its byte; in that case, we execute the code in the `fals`

`e`

branch of that `if`

statement, which corresponds exactly to
the above code. In either case, we follow up with the `if`

statement whose condition is "`(old_weight_code != -1)`

",
which tests whether a frequency index code was incremented. If it was,
we add the difference between the new code and the old one to the `total_freq`

entry in the current frequency table, unless the character previously
had a frequency index code of 0; in that case, we have already adjusted
the `total_freq`

entry.

The last operation in `update_model`

is to make sure
that the total of all frequencies in the frequency table does not
exceed the limit of `MAX_FREQUENCY`

; if that were to
happen, more than one character might map into the same value between `high`

and `low`

, so that unambiguous decoding would become
impossible. Therefore, if `temp_total_freq`

exceeds `MAX_FREQUENCY`

,
we have to reduce the frequency indexes until this is no longer the
case. The `while`

loop whose continuation expression is "```
(temp_total_freq
> MAX_FREQUENCY)
```

" takes care of this problem in the following
way.

First, we initialize `temp_total_freq`

to 0, as we will
use it to accumulate the frequencies as we modify them. Then we set `freq_ptr`

to the address of the first entry in the frequency table to be
modified. Now we are ready to step through all the bytes in the
frequency table; for each one, we test whether both indexes in the
current byte are 0. If so, we can't reduce them, so we just add the
frequency value corresponding to the translation of two 0 indexes (`BOTH_WEIGHTS_ZERO`

)
to `temp_total_freq`

.

Otherwise, we copy the current index pair into `freq`

.
If the high index is nonzero, we decrement it. Similarly, if the low
index is nonzero, we decrement it. After handling either or both of
these cases, we add the translation of the new index pair to `temp_total_freq`

.
After we have processed all of the index values in this way, we retest
the `while`

condition, and when `temp_total_freq`

is no longer out of range, we store it back into the frequency table
and return to the main program.

Finally, we have returned to the `main`

function (Figure
encode.00), where we copy `ch`

to `oldch`

, so that the current character will be used to
select the frequency table for the next character to be encoded; then
we continue in the main loop until all characters have been processed.

When we reach `EOF`

in the input file, the main loop
terminates; we use `encode_symbol`

to encode `EOF_SYMBOL`

,
which tells the receiver to stop decoding. Then we call `done_encoding`

(Figure arenc.08) and `done_outputing_bits`

(Figure arenc.09) to flush any
remaining information to the output file. Finally, we close the input
and output files and exit.

`done_encoding`

function (from compress\arenc.cpp) (Figure arenc.08)codelist/arenc.08

`done_outputing_bits`

function (from
compress\arenc.cpp) (Figure arenc.09)codelist/arenc.09

Now that we have developed a working version of our program, let's see how much more performance we can get by judicious optimization in the right places. The first step is to find out where those places are. Therefore, I ran the program under Microsoft's profiler on a file of approximately 11 KB, compressing it to a little less than 5 KB (a compression ratio of 2.3 to 1). This didn't take very long, as you can see from the profile in Figure encode.pro1.

Total time: 167.100 milliseconds

80.010 49.0 encode_symbol(unsigned int,unsigned int) (arenc.obj)

29.139 17.9 _main (encode.obj)

18.717 11.5 output_1(void) (arenc.obj)

17.141 10.5 output_0(void) (arenc.obj)

15.957 9.8 update_model(int,unsigned char) (adapt.obj)

2.234 1.4 start_model(void) (adapt.obj)

The majority of the CPU time is spent in the `encode_symbol`

function, so that's where we'll look first. As I mentioned in the
discussion of the algorithm, a significant amount of the CPU time
needed to encode a symbol is spent in the loop shown in Figure arenc.03. Accordingly, we should be
able to increase the speed of our implementation by using a more
efficient representation of the characters to be encoded, so that the
accumulation loop will be executed fewer times in total. While we
cannot spare the space for self-organizing tables of character
translations, a fixed translation of characters according to some
reasonable estimate of their likelihood in the text is certainly
possible. Let's see how that would affect our performance.

After analyzing a number of (hopefully representative) text files, I
used the results to produce a list of the characters in descending
order of number of occurrences. This information is stored in the `symbol_translation`

table (Figure adapt.02), which is
used during decoding to translate from the internal code into ASCII.

`symbol_translation`

table (from compress\adapt.cpp) (Figure adapt.02)codelist/adapt.02

The inverse transformation, used by the encoder, is supplied by the
`char_translation`

table, which is initialized near the end
of the `start_model`

function (Figure adapt.00).

To use these new values for each character when encoding and
decoding, we have to make some small changes in the encoding and
decoding main programs. Specifically, in `main`

, we have to
use the translated character rather than the ASCII code as read from
the input file when we call the `encode_symbol`

function.
The new version of the main program for encoding is shown in Figure encode1.00.

codelist/encode1.00

Of course, corresponding changes have to be made to the `main`

function in the decoding program, which then looks like Figure decode1.00.

codelist/decode1.00

Figure char.trans shows the speed improvement resulting from this modification.

Total time: 145.061 millisecond

58.980 41.8 encode_symbol(unsigned int,unsigned int) (arenc.obj)

25.811 18.3 _main (encode1.obj)

18.964 13.4 output_0(void) (arenc.obj)

18.586 13.2 output_1(void) (arenc.obj)

16.565 11.7 update_model(int,unsigned char) (adapt.obj)

2.213 1.6 start_model(void) (adapt.obj)

- How could arithmetic coding be used where each of a number of blocks of text must be decompressed without reference to the other blocks?
- If we knew that the data to be compressed consisted entirely of characters with ASCII codes 0 through 127, could we reduce the memory requirements of this algorithm further? If so, how much and how?

(You can find suggested approaches to problems in Chapter artopt.htm).

- I. H. Witten, R. M. Neal, and J. G. Cleary. "Arithmetic Coding for Data Compression". Commun. ACM 30(6), 520-540 (June 1987).
- This column is derived from the "Cum. Freq." column; it represents the range of codes that has been allocated to messages that begin with the characters in the "Message So Far" column. For example, messages beginning with the character 'A' have a cumulative frequency of 16/64. This means that messages that start with that letter have codes between the cumulative frequency of the previous message (in this case 0, since this is the first message) and 15/64. Similarly, since messages beginning with the letter 'B' have a cumulative frequency of 32/64, codes for such messages must lie between 16/64, which is the cumulative frequency of the previous entry, and 31/64.
- For that matter, a similar problem occurs when we have three characters of equal frequency; the characters have to end up with different code lengths in a Huffman code, even though their frequencies are equal.
- As we will see later, there are commonly occurring situations which have even more lopsided distributions than this, when the context in which each character is seen is taken into account; for example, following a carriage return in a DOS-formatted text file, the next character is almost certain to be a line feed. In such cases, we can use much less than one-fourth of a bit to represent a line feed character following a carriage return and gain far more efficiency compared to Huffman coding.
- Notice that if we were encoding "AA", we still wouldn't have even the first bit!
- The second most frequent message, "AAB", is a special case, which we will see again later. Although it occupies only nine of the 64 possible message positions, some of those nine start with a 0 and some with a 1; therefore, we don't know the first bit to be output. However, we do know that the code for messages beginning with "AAB" starts with either "011" or "100"; as we continue encoding characters, eventually we will have enough information to decide which. At that point, we will be able to output at least those first three bits.
- A consequence of this approach is that we cannot decompress data in any order other than the one in which it was compressed, which prevents direct access to records in compressed files. This limitation is the subject of one of the problems at the end of this chapter.
- A. Moffat. "Word-Based Text Compression".
*Software -- Practice and Experience*, 19(2), 185-198 (February 1989). - This is the column labeled "Cum. Freq." in Figures aritha1-aritha3 and arithb1-arithb3.
- For compilers which produce two-byte
`unsigned`

s, the maximum cumulative frequency is 16,383 if we wish to avoid the use of`long`

arithmetic. This limits the dynamic range to 16,128 for the one common character and 1 for each of the other characters. - Theoretically, we could use 14 bits per entry, but the CPU time required to encode or decode would be greatly increased by packing and unpacking frequencies stored in that way.
- The last entry in the
`upgrade_threshold`

array is set to 255, which prevents the index from being increased beyond 15, the maximum value that can be stored in a four-bit value;`char_total`

, being an`unsigned char`

variable, cannot exceed 255. - Although this array is not used in the initial version of our program, we will see near the end of this chapter how it contributes to our optimization effort.
- To fully understand the function of
`low`

and`high`

, we must wait until we examine`encode_symbol`

(Figure
arenc.02). For now, let's
just say that they indicate the current state of our knowledge about
the message being encoded; the closer together they are, the more bits
we are ready to output.
- Actually, the top end of the range of frequencies
for the message is just below
`high+1`

; even the authors of the reference in footnote , which is the source of the original version of this algorithm, admit that this is confusing! - In our example, we don't execute this particular adjustment.
- The proof of this is left to the reader as an exercise; see the problems at the end of the chapter.
- You may be wondering what happened to the
remaining information contained in
`low`

and`high`

. Rest assured that it will not be left twisting in the wind; there is a function called`done_encoding`

that makes sure that all of the remaining information is encoded in the output stream.