In this chapter we will examine the Huffman coding and arithmetic coding methods of data compression and develop an implementation of the latter algorithm. The arithmetic coding algorithm allows a tradeoff between memory consumption and compression ratio; our emphasis will be on minimum memory consumption, on the assumption that the result would eventually be used as embedded code within a larger program.
Huffman Coding, Arithmetic Coding, Lookup Tables
Huffman coding is widely recognized as the most efficient method of encoding characters for data compression. This algorithm is a way of encoding different characters in different numbers of bits, with the most common characters encoded in the fewest bits. For example, suppose we have a message made up of the letters 'A', 'B', and 'C', which can occur in any combination. Figure huffman.freq shows the relative frequency of each of these letters, the Huffman code assigned to each one, and a fixed-length code for comparison.
Huffman Fixed-length
Letter Frequency Code Code
+--------+---------+--------------+
A | 1/4 | 00 | 00 |
B | 1/4 | 01 | 01 |
C | 1/2 | 1 | 10 |
+--------+---------+--------------+
Let's encode the message "CABCCABC" using both codes. The results are shown in Figure huffman.fixed.
Huffman Fixed-length
Letter Code Code
+-------+------------+
C | 1 | 10 |
A | 00 | 00 |
B | 01 | 01 |
C | 1 | 10 |
C | 1 | 10 |
A | 00 | 00 |
B | 01 | 01 |
C | 1 | 10 |
+-------+------------+
Total bits used 12 16
Let's see how arithmetic coding[1] would encode the first three characters of the same message, "CAB".
Cum. Previous Current Output
Message Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
A | 16 | 16 |000000( 0)-001111(15)| None | 00 | 00 |
B | 16 | 32 |010000(16)-011111(31)| None | 01 | 01 |
* C | 32 | 64 |100000(32)-111111(63)| None | 1 | 1 |
+-----+----+---------------------+-------+---------+-------+
Figure aritha1 is the first of several figures which contain the information needed to determine how arithmetic coding would encode messages of up to three characters from an alphabet consisting of the letters 'A', 'B', and 'C', with frequencies of 1/4, 1/4, and 1/2, respectively. The frequency of a message composed of three characters chosen independently is the product of the frequencies of those characters. Since the lowest common denominator of these three fractions is 4, the frequency of any three-character message will be a multiple of (1/4)^3, or 1/64. For example, the frequency of the message "CAB" will be (1/2)*(1/4)*(1/4), or 1/32 (=2/64). For this reason, we will express all of the frequency values in terms of 1/64ths.
Thus, the "Freq." column signifies the expected frequency of occurrence of each message, in units of 1/64; the "Cum. Freq." column accumulates the values in the first column; the "Codes" column indicates the range of codes that can represent each message[2]; the "Previous Output" column shows the bits that have been output before the current character was encoded; the "Current Output" column indicates what output we can produce at this point in the encoding process; and the "Output So Far" column shows the cumulative output for that message, starting with the first character encoded.
As the table indicates, since the first character happens to be a 'C', then we can output "1", because all possible messages starting with 'C' have codes starting with a "1". Let's continue with Figure aritha2 to see the encoding for a two-character message.
Cum. Previous Current Output
Message Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
AA | 4 | 4 |000000(00)-000011(03)| 00 | 00 | 0000 |
AB | 4 | 8 |000100(04)-000111(07)| 00 | 01 | 0001 |
AC | 8 | 16 |001000(08)-001111(15)| 00 | 1 | 001 |
BA | 4 | 20 |010000(16)-010011(19)| 01 | 00 | 0100 |
BB | 4 | 24 |010100(20)-010111(23)| 01 | 01 | 0101 |
BC | 8 | 32 |011000(24)-011111(31)| 01 | 1 | 011 |
* CA | 8 | 40 |100000(32)-100111(39)| 1 | 00 | 100 |
CB | 8 | 48 |101000(40)-101111(47)| 1 | 01 | 101 |
CC | 16 | 64 |110000(48)-111111(63)| 1 | 1 | 11 |
+-----+----+---------------------+-------+---------+-------+
After encoding the first two characters of our message, "CA", our cumulative output is "100", since the range of codes for messages starting with "CA" is from "100000" to "100111"; all these codes start with "100". The whole three-character message is encoded as shown in Figure aritha3.
Cum. Previous Current Output
Message Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
AAA | 1 | 1 |000000(00)-000000(00)| 0000 | 00 | 000000|
AAB | 1 | 2 |000001(01)-000001(01)| 0000 | 01 | 000001|
AAC | 2 | 4 |000010(02)-000011(03)| 0000 | 1 | 00001 |
ABA | 1 | 5 |000100(04)-000100(04)| 0001 | 00 | 000100|
ABB | 1 | 6 |000101(05)-000101(05)| 0001 | 01 | 000101|
ABC | 2 | 8 |000110(06)-000111(07)| 0001 | 1 | 00011 |
ACA | 2 | 10 |001000(08)-001001(09)| 001 | 00 | 00100 |
ACB | 2 | 12 |001010(10)-001011(11)| 001 | 01 | 00101 |
ACC | 4 | 16 |001100(12)-001111(15)| 001 | 1 | 0011 |
BAA | 1 | 17 |010000(16)-010000(16)| 0100 | 00 | 010000|
BAB | 1 | 18 |010001(17)-010001(17)| 0100 | 01 | 010001|
BAC | 2 | 20 |010010(18)-010011(19)| 0100 | 1 | 01001 |
BBA | 1 | 21 |010100(20)-010100(20)| 0101 | 00 | 010100|
BBB | 1 | 22 |010101(21)-010101(21)| 0101 | 01 | 010101|
BBC | 2 | 24 |010110(22)-010111(23)| 0101 | 1 | 01011 |
BCA | 2 | 26 |011000(24)-011001(25)| 011 | 00 | 01100 |
BCB | 2 | 28 |011010(26)-011011(27)| 011 | 01 | 01101 |
BCC | 4 | 32 |011100(28)-011111(31)| 011 | 1 | 0111 |
CAA | 2 | 34 |100000(32)-100001(33)| 100 | 00 | 10000 |
* CAB | 2 | 36 |100010(34)-100011(35)| 100 | 01 | 10001 |
CAC | 4 | 40 |100100(36)-100111(39)| 100 | 1 | 1001 |
CBA | 2 | 42 |101000(40)-101001(41)| 101 | 00 | 10100 |
CBB | 2 | 44 |101010(42)-101011(43)| 101 | 01 | 10101 |
CBC | 4 | 48 |101100(44)-101111(47)| 101 | 1 | 1011 |
CCA | 4 | 52 |110000(48)-110011(51)| 11 | 00 | 1100 |
CCB | 4 | 56 |110100(52)-110111(55)| 11 | 01 | 1101 |
CCC | 8 | 64 |111000(56)-111111(63)| 11 | 1 | 111 |
+-----+----+---------------------+-------+---------+-------+
We have generated exactly the same output from the same input as we did with Huffman coding. So far, this seems to be an exercise in futility; is arithmetic coding just another name for Huffman coding?

These two algorithms provide the same compression efficiency only when the frequencies of the characters to be encoded happen to be representable as integral powers of 1/2, as was the case in our examples so far; however, consider the frequency table shown in Figure huffman.poor.
Huffman Fixed-length
Letter Frequency Code Code
+--------+---------+--------------+
A | 3/4 | 0 | 0 |
B | 1/4 | 1 | 1 |
+--------+---------+--------------+
Using Huffman coding, there is no way to assign theoretically optimal codes to characters with such a frequency distribution.[3] Since the shortest possible Huffman code is one bit, the best we can do is to assign a one-bit code to each character, although this does not reflect the difference in their frequencies. In fact, Huffman coding can never provide any compression at all with a two-character alphabet.
However, such a situation is handled very well indeed by arithmetic compression. To see how, we will start by asking a fundamental question: what is a bit?
A bit is the amount of information required to specify an alternative that has a frequency of 50%; two bits can specify alternatives that have a frequency of 25%, and so forth. For example, the toss of a coin can result in either heads or tails, each of which can be optimally represented by a one-bit code; similarly, if a chance event has four equally likely outcomes, we can express each possible result most economically with two bits. On the other hand, as we have seen in our discussion of Huffman codes, we can have a number of alternatives that are not equally likely; in that case, we assign longer codes for those alternatives that are less likely. However, the shortest possible code in Huffman coding is one bit, which is assigned to an outcome with a frequency of one-half.
The general formula for the optimal length of a code specifying a particular outcome with frequency f is "log2(1/f)". In our previous examples, an outcome with a frequency of .5 should have a code length of log2(1/(1/2)) or 1 bit. Similarly, if an outcome has a frequency of .25 it should have a code length of log2(1/(1/4)), or two bits.
But what if one of the possible outcomes has a frequency greater than one-half? Logically, we should use less than one bit to specify it. For example, if we have a data file in which 84% of the characters are spaces, we can calculate the appropriate number of bits in the code for that character as log2(1/(.84)), or approximately .25 bits. If the remaining 255 characters all have equal frequencies, each of these frequencies is .0627% (that is, .000627), so that our formula reduces to log2(1/(.000627)), or approximately 10.63 bits each. This would result in an average code length of (.84)*(.25) + (.16)*(10.63), or 1.91 bits per character. By contrast, a Huffman code would require (.84)*1 + (.16)*9, or 2.28 bits per character. If we were compressing a 250-Kbyte file with such characteristics, using Huffman codes would produce a 71-Kbyte file, whereas arithmetic coding would result in a 60-Kbyte file, about a 20% difference between these two approaches.[4]
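Since this arithmetic is easy to get wrong, here is a small standalone sketch (not part of the compression program itself) that reproduces these numbers directly from the log2(1/f) formula:

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double f_space = 0.84;            // frequency of the space character
        double f_other = 0.16 / 255;      // each of the other 255 characters
        double space_bits = std::log2(1.0 / f_space);  // about .25 bits
        double other_bits = std::log2(1.0 / f_other);  // about 10.6 bits
        double average = 0.84 * space_bits + 0.16 * other_bits;
        std::printf("space: %.2f bits, other: %.2f bits, average: %.2f\n",
                    space_bits, other_bits, average);  // about 1.91 bits
        return 0;
    }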
Of course, we can't output a code of less than one bit. However, we can use a set of one or more bits to represent more than one character in the same message. This is why the statement that Huffman coding is the most efficient way to represent characters is true, but misleading; if our messages contain more than one character each, we may be able to combine the codes for a number of characters while consuming a fractional number of bits for some characters. To make this clearer, let's go through the encoding of a three-character message, "ABA", from a two-character alphabet in which the letter 'A' accounts for three-fourths of all the characters and the letter 'B' the remaining one-fourth. The situation after we see the first character is shown in Figure arithb1.
Message Cum. Previous Current Output
So Far Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
* A | 48 | 48 |000000( 0)-101111(47)| None | None | None |
B | 16 | 64 |110000(48)-111111(63)| None | 11 | 11 |
+-----+----+---------------------+-------+---------+-------+
Nothing! We don't know whether the first bit of our encoded message will be a 0 or a 1; that depends on what happens next. Remember, messages starting with the letter 'A' can have codes starting with 00 through 10, or three-fourths of all possible codes. An 'A' gives us somewhat less than 1/2 bit of information, not nearly enough to produce any output by itself. Now let's look at Figure arithb2 for the information needed to encode the next character.
Message Cum. Previous Current Output
So Far Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
AA | 36 | 36 |000000( 0)-100011(35)| None | None | None |
* AB | 12 | 48 |100100(36)-101111(47)| None | 10 | 10 |
BA | 12 | 60 |110000(48)-111011(59)| 11 | None | 11 |
BB | 4 | 64 |111100(60)-111111(63)| 11 | 11 | 1111 |
+-----+----+---------------------+-------+---------+-------+
We have two bits of output, since all the codes for messages starting with "AB" have the initial bits "10".[5] Let's continue with Figure arithb3 for the third (and last) character of our message.
Message Cum. Previous Current Output
So Far Freq. Freq. Codes Output Output So Far
+-----+----+---------------------+-------+---------+-------+
AAA | 27 | 27 |000000( 0)-011010(26)| None | 0 | 0 |
AAB | 9 | 36 |011011(27)-100011(35)| None | None | None |
* ABA | 9 | 45 |100100(36)-101100(44)| 10 | None | 10 |
ABB | 3 | 48 |101101(45)-101111(47)| 10 | 11 | 1011 |
BAA | 9 | 57 |110000(48)-111000(56)| 11 | None | 11 |
BAB | 3 | 60 |111001(57)-111011(59)| 11 | 10 | 1110 |
BBA | 3 | 63 |111100(60)-111110(62)| 1111 | None | 1111 |
BBB | 1 | 64 |111111(63)-111111(63)| 1111 | 11 | 111111|
+-----+----+---------------------+-------+---------+-------+
We have still produced only two bits of output from our three-character message, "ABA". The best we could do with Huffman coding is three bits. However, this is not the extreme case; if we were encoding the most frequent message, "AAA", we would have only a "0" bit.[6]
This algorithm will work nicely if we happen to know in advance what the frequencies of all the possible characters will be. But how do we acquire this information? If our program reads an input file and writes an output file, we can go through the input file once, counting the number of times a given character appears, build the table of frequencies, and then make a second pass using this table to do the actual encoding. However, this is not possible when we are compressing data as it is being generated, as there is then no input file to be analyzed. This might seem to be an insurmountable obstacle.
Luckily, it is not; calculating the character frequencies as we encode yields about the same compression efficiency as precalculation, and in some cases even better efficiency! The reason for this surprising result is that most files (or sets of data) are not uniform throughout, but rather exhibit local variations in the frequency of a given character or set of characters. Calculating the frequencies on the fly often produces a better match for these changing characteristics.
Therefore, our approach is as follows. Every character is initially assumed to have the same frequency of occurrence. After each character is read and encoded, its entry in the table of frequencies is increased to reflect our having seen it. The next time we encode it, its encoding will account for the fact that we have seen it before.
That may seem quite simple for the sender, but not for the receiver. If the encoding table is being modified as we go, how does the receiver keep in step?
The receiver has the same initial table as the transmitter; each character has the same expected frequency of occurrence. Then, as the receiver decodes each character, it updates its copy of the frequency table. This also explains why we increase a character's frequency after we encode it, rather than before; until the receiver decodes the character, it cannot update the frequency of that character's occurrence.[7]
In our implementation, we achieve a large improvement in compression efficiency by using the context in which a character occurs when estimating its frequency of occurrence. It should be apparent that the frequency of a given character appearing in a message at a given point is quite dependent on the previous context. For example, in English text, a "Q" is almost certain to be followed by a "u", at least if we exclude articles about software such as DESQview! Ideally, therefore, the amount of information needed to encode a "u" following a "q" should be very small. The same principle, of course, applies to the encoding of a line feed following a carriage return in text files where this is a virtual certainty.
On the other hand, the amount of storage required to keep track of a large amount of previous context can become excessive.[8] Even one character of previous context requires the construction of 256 tables of frequencies, one for each possible previous character. A direct extension of the approach given in the reference cited earlier would require over 300 KB of storage for these tables.
We will apply a number of space-saving methods to reduce the above storage requirement by about 90%, to approximately 35 KB, while still achieving good data-compression performance in most cases.
In order to achieve such a reduction in memory consumption, we must
avoid storing anything that can be recalculated, such as the cumulative
frequencies of all possible characters.[9] We
must also dispense with the use of a self-organizing table that
attempts to speed up encoding and decoding by moving more frequently
used characters toward the front of the table, as is done in the
reference cited earlier. However, we must provide as large a dynamic
range of frequencies as possible: the larger the ratio between the
highest frequency and the lowest, the greater the possible efficiency.
The greatest dynamic range is needed when one character always occurs
in a particular context, such as line feed after carriage return.
Assuming that we must be able to encode any of the 256 possible
one-byte values, the algorithm limits the possible dynamic range to
approximately one-fourth of the range of an unsigned
value.[10] For maximum compression
efficiency we therefore need 256 tables of 256 two-byte entries each,
consuming a total of 128 KB.[11] When I
first implemented this algorithm, I saw no way to reduce this
significantly.
Then early one morning, I woke up with the answer. We need some very large frequency values and some very small ones, but surely not every one in between. Why not use a code to represent one of a number of frequency values? These values would be spaced out properly to get as close as possible to optimal compression efficiency, but each would be represented by a small index, rather than literally.
How small an index? First I considered using an eight-bit index. But that still would require 64 KB for a complete 256x256 table. Maybe a smaller index would help. But wouldn't that impose an unreasonable processing time penalty?
Amazingly enough, we can use a four-bit index with no time penalty: in fact, processing is actually faster than with a byte index! This seemingly magical feat is accomplished by one of my favorite time-saving methods: the lookup table.
You see, a significant amount of the CPU time needed to encode a symbol is spent in a loop that computes the necessary portion of the cumulative frequency table for the current context. If each frequency value occupied one byte, we would have to execute that loop once for every index in the table up to the ASCII value of the character whose cumulative frequency we are trying to compute. However, since we have packed two indexes into one byte, we can accumulate two indexes worth of frequency values with one addition, thus reducing the number of times the loop has to be executed by approximately 50%.
Each execution of the loop translates a pair of indexes contained
in one byte into a total frequency value that is the sum of their
individual frequency values by using the byte as an index into a
256-word table, which has one entry for each possible combination of
two indexes. This table (the both_weights
table) is
calculated once at the start of the program.
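As a sketch of that construction (the book's actual code is in start_model, Figure adapt.00; the array types here are assumptions):

    // Each byte of a frequency table holds two 4-bit frequency indexes;
    // both_weights maps that byte directly to the sum of the two
    // translated frequency values, so two symbols cost one addition.
    unsigned short translate[16];      // frequency value for each 4-bit index
    unsigned short both_weights[256];  // one entry per possible index pair

    void build_both_weights()
    {
        for (int high = 0; high < 16; high++)    // index in the high half
            for (int low = 0; low < 16; low++)   // index in the low half
                both_weights[high * 16 + low] =
                    translate[high] + translate[low];
    }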
Once we have determined the frequency of occurrence assigned to the character to be encoded, we can decide how many bits we can send and what their values are.
The receiver uses almost the same code as the sender, with two main exceptions. First, the receiver reads the input one bit at a time and outputs each character as soon as it is decoded, just the reverse of the sender. Second, rather than knowing how many frequency entries we have to accumulate in advance as the sender does, we have to accumulate them until we find the one that corresponds to the range we are trying to interpret. The latter situation reduces the efficiency of our loop control, which accounts for much of the difference in speed between encoding and decoding.
Let's start at the beginning, with the main
function
in encode.cpp
(Figure encode.00).
codelist/encode.00
This function opens the input and output files and sets up buffers
to speed up input and output, then calls start_model
(Figure adapt.00).
start_model
function (from compress\adapt.cpp) (Figure adapt.00)codelist/adapt.00
This function starts out by initializing the upgrade_threshold
array, which is used to determine when to promote a character to the
next higher frequency index value. As noted above, these values are not
consecutive, so that we can use a four-bit index rather than literal
values; this means that we have to promote a character only once in a
while rather than every time we see it, as we would do with literal
frequency values. How do we decide when to do this?
A pseudorandom approach seems best: we can't use a genuine random
number to tell us when to increment the index, because the receiver
would have no way of reproducing our decisions when decoding the
message. My solution is to keep a one-byte hash total of the ASCII
codes for all the characters that have been sent (char_total
)
and to increment the index in question whenever char_total
is greater than the corresponding value stored in the upgrade_threshold
array. That threshold value is calculated so that the probability of
incrementing the index is inversely proportional to the gap between
frequency values in the translation table.[12]
If, for example, each frequency value in the translation table were
twice the previous value, there would be a 1/2 probability of
incrementing the index each time a character was encountered. After we
finish initializing the upgrade_threshold
array, we set char_total
to 0, in preparation for accumulating our hash total.
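In outline, the decision looks something like the following sketch (the function names here are illustrative, not the book's):

    unsigned char char_total = 0;          // one-byte hash of all characters
    unsigned char upgrade_threshold[16];   // one threshold per index value

    void note_character(unsigned char ascii_code)
    {
        char_total += ascii_code;          // wraps modulo 256, like a hash
    }

    int should_upgrade(int freq_index)     // nonzero: promote to next index
    {
        // The receiver computes the identical char_total while decoding,
        // so both sides make exactly the same promotion decisions.
        return char_total > upgrade_threshold[freq_index];
    }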
The next operation in start_model
is to generate the both_weights
table. As we discussed above, this table allows us to translate from a
pair of frequency values (or weights) to the total frequency to which
they correspond. We calculate it by generating every possible pair and
adding them together to fill in the corresponding table entry. The
values in the translate
table are defined in Figure model.00, in the line that starts
with #define TRANSLATE_TABLE
. How did I generate these
values?
codelist/model.00
It wasn't easy. I knew that I wanted to allow the largest possible dynamic range, which means that the lowest value has to be 1 and the highest value has to be close to the maximum that can be accommodated by the algorithm (16,128). The reason I chose a top value lower than that maximum value is that if the highest value were 16,128, then the occurrence of any character other than the preferred one would cause the total frequency to exceed the allowable maximum, with the result that the table of frequencies would be recalculated to reduce the top value to the next lower step. This would greatly reduce the efficiency of the compression for this case. That accounts for the lowest and highest values. What about the ones in between?
Initially, I decided to use a geometric progression, much like the tuning of a piano; in such a progression, each value is a fixed multiple of the one before it. However, I found that I achieved better compression on a fairly large sample of files by starting the progression with the second value at 16 and ending with the next to the last at 1024. Why is this so?
The reason for leaving a big gap between the lowest frequency and the second lowest one is that many characters never occur in a particular situation. If they occur once, they are likely to recur later. Therefore, setting the next-to-the-lowest frequency to approximately one-tenth of 1% of the maximum value improves the efficiency of the compression. I have also found through experimentation that the compression ratio is improved if the first time a character is seen, it is given a frequency of about six-tenths of 1%, which requires an initial index of 7. Lower initial frequencies retard adaptation to popular new characters, and higher ones overemphasize new characters that turn out to be unpopular in the long run.
The reason to leave a large gap between the next-to-highest frequency and the highest one is that most of the time, a very skewed distribution has exactly one extremely frequent character. It is rare to find several very high-frequency characters that have about the same frequency. Therefore, allowing the highest frequency to approach the theoretical maximum produces the best results. Of course, these are only empirical findings. If you have samples that closely resemble the data that you will be compressing, you can try modifying these frequency values to improve the compression.
Continuing in start_model
(Figure adapt.00), we initialize NO_OF_CHARS
frequency_info
tables, one for every possible character.
Each of these tables will store the frequencies for characters that
follow a particular character. If we start to encode the string "This
is a test", the first character, 'T' will be encoded using the table
for character 0 (a null); this is arbitrary, since we haven't seen any
characters before this one. Then the 'h' will be encoded using the
table for 'T'; the 'i' will be encoded with the table for 'h', and so
forth. This approach takes advantage of the context dependence of
English text; as we noted before, after we see a 'q', the next
character is almost certain to be a 'u', so we should use a very short
code to represent a 'u' in that position.
However, our initialization code contains no information about the
characteristics of English text (or any other kind of data for that
matter). We assign the same frequency (the lowest possible) to every
character in every frequency table. As discussed above, the encoding
program will learn the appropriate frequencies for each character in
each table as it processes the input data. At the end of the
initialization for each table, we set its total_freq
value to the translation of the lowest frequency, multiplied by the
number of index pairs in the table. This value is needed to calculate
the code that corresponds to each character, and recalculating it every
time we access the table would be time-consuming.
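The layout of one such table might be sketched as follows (field names are assumptions; the real declarations are in the book's header files):

    const int NO_OF_CHARS = 256;

    struct frequency_info
    {
        unsigned char freq[NO_OF_CHARS / 2]; // two 4-bit indexes per byte
        unsigned short total_freq;           // cached sum of translated values
    };

    frequency_info model[NO_OF_CHARS];       // one table per previous character

    // 256 tables at about 130 bytes each comes to roughly 33 KB, in line
    // with the approximately 35 KB cited in the text.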
The last operation in start_model
initializes the char_translation
array, which is used to translate the internal representation of the
characters being encoded and decoded to their ASCII codes.[13]
Then we return to encode.cpp
.
The next operation in main
is to call start_outputing_bits
(Figure arenc.00).
start_outputing_bits
function (from compress\arenc.cpp) (Figure arenc.00)codelist/arenc.00
Of course, we can't send individual bits to the output file; we
have to accumulate at least one byte's worth. Whenever we have some
bits to be output, we will store them in buffer
; since we
haven't encoded any characters yet, we set it to 0. In order to keep
track of how many more bits we need before we can send buffer
to the output file, we set bits_to_go
to eight; then we
return to main
.
The next thing we do in main
is to call start_encoding
(Figure arenc.01).
start_encoding
function (from compress\arenc.cpp) (Figure arenc.01)codelist/arenc.01
This is a very short function, but it initializes some of the most
important variables in the program; low
, high
,
and bits_to_follow
. The first two of these keep track of
the range of codes for the current message; at this point we know
nothing about the message, so they are set to indicate the widest
possible range of messages, from 0 to TOP_VALUE
, which is
defined in arith.h
(Figure arith.00).
codelist/arith.00
In our current implementation, with 16-bit code values, TOP_VALUE
evaluates to 65535.[14] The third variable, bits_to_follow
,
keeps track of the number of bits that have been deferred for later
output. This is used where the range of possible codes for the current
message includes codes that start with 0 and some that start with 1; as
we have seen already, in such a situation we're not ready to send out
any bits yet. After initializing these variables, we return to main
.
Upon returning to main
, we need to do one more
initialization before we enter the main loop of our program, which is
executed once for each character in the input file; namely, setting oldch
to 0. This variable controls the context in which we will encode each
character. Since we haven't seen any characters as yet, it doesn't
really matter which frequency table we use, as long as the decoding
program selects the same initial table.
The first operation in the main loop is to get the character to be
encoded, via getc
. If we have reached the end of the
file, we break out of the loop. Otherwise, we call encode_symbol
(Figure arenc.02) to place the
representation of the current character in the output stream.
encode_symbol
function (from compress\arenc.cpp) (Figure arenc.02)codelist/arenc.02
This function takes two parameters: ch
, the character
to be encoded, and oldch
, which determines the frequency
table to be used for this encoding. As we have noted above, selecting a
frequency table based upon the previous character encoded provides far
better compression efficiency, as the frequency of occurrence of a
particular character is greatly affected by the preceding character.
encode_symbol
Although encode_symbol
is a short function, it is the
subtlest function in this chapter; fortunately, you can use this
algorithm effectively without going through the explanation below. The
version here closely follows the reference cited earlier;
if you are really interested in the details of operation of this
function, I strongly advise you to study that reference very carefully
in conjunction with the explanation here.
To clarify this algorithm, we will work through a somewhat
simplified example as we examine the code. For ease in calculation, we
will set TOP_VALUE
to 255, or 11111111 binary, rather
than 65535 as in the actual program; as a result, high
will start out at 255 as well. We will also use a single, constant
frequency table (Figure sampfreq)
containing only four entries rather than selecting a 256-entry table
according to the previous character and modifying it as we see each
character, so our translate
and both_weights
tables (Figures samptrans and sampboth, respectively) will be
adjusted correspondingly. Instead of ASCII, we will use the codes from
0 (for 'A') to 3 (for 'D') in our example message.
frequency table (Figure sampfreq)

            Frequency code
 Character  Index    Value
+----------+--------+--------+
     A     |   1    |   2    |
     B     |   3    |  52    |
     C     |   0    |   1    |
     D     |   2    |   8    |
+----------+--------+--------+

translate table (Figure samptrans)

 Index   Frequency Value
+-------+-----------------+
|   0   |        1        |
|   1   |        2        |
|   2   |        8        |
|   3   |       52        |
+-------+-----------------+
both_weights table (Figure sampboth)

          Second Index
First
Index    0    1    2    3
+----+----+----+----+
0 | 2 | 3 | 9 | 53 |
+----+----+----+----+
1 | 3 | 4 | 10 | 54 |
+----+----+----+----+
2 | 9 | 10 | 16 | 60 |
+----+----+----+----+
3 | 53 | 54 | 60 |104 |
+----+----+----+----+
Now we are ready to enter the frequency accumulation loop, which is shown in Figure arenc.03.
codelist/arenc.03
The reason we need this loop is that, as we saw before, we cannot
afford to keep the cumulative frequency tables in memory; they would
occupy hundreds of kilobytes of memory. Instead, we calculate the
cumulative frequency for each character being encoded, as we need it.
We do, however, maintain the total_freq variable from one
encoding to the next; recalculating it would require us to go all the
way through the frequency table for each character, even if we are
encoding a character with a low ASCII code. Since we are saving the
total frequency with the table, we have to accumulate the frequencies
only up to the ASCII code of the character we are encoding.
Let's see how this loop accumulates the frequencies of all the
characters up to the character we want to encode. The first interesting
item is that the loop executes only half as many times as the value of
the character being encoded, since we are packing two 4-bit indexes
into each byte of the frequency table. So the first statement in the
loop retrieves one of these pairs of indexes from the frequency table
and increments the pointer to point to the next pair. Then we index
into the both_weights
table with the index pair we just
retrieved and set total_pair_weight
to that entry in the
table. The both_weights
table is the key to translating
two 4-bit indexes into a total frequency. Each entry in the table is
the sum of the frequency values that correspond to the two indexes that
make up the byte we use to index into the table. Finally, we add total_pair_weight
to prev_cum
, which is accumulating all of the
frequencies.
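A sketch of the loop (the function form and names are illustrative; the real loop is inside encode_symbol, Figure arenc.02):

    // Accumulate the frequencies of all characters below 'symbol', two
    // characters per iteration via the both_weights lookup table.
    unsigned accumulate_frequencies(const unsigned char *freq_table,
                                    const unsigned short *both_weights,
                                    int symbol)
    {
        unsigned prev_cum = 0;
        const unsigned char *freq_ptr = freq_table;   // packed index pairs
        for (int i = 0; i < symbol / 2; i++)
        {
            unsigned char current_pair = *freq_ptr++; // two 4-bit indexes
            unsigned total_pair_weight = both_weights[current_pair];
            prev_cum += total_pair_weight;   // two frequencies, one addition
        }
        return prev_cum;
    }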
In our example, the first letter of our message is 'D', which has a symbol value of 3. Using the frequency table in Figure sampfreq, we execute the statements in the loop once. First, we set current_pair to the first entry in the frequency table, 00010011, which indicates that the frequency code for 'A' is 1 (0001 binary) and the frequency code for 'B' is 3 (0011 binary). Then we set total_pair_weight to entry (1,3) from the both_weights table, the sum of the frequencies to which this pair of indexes corresponds; its value is 54. The last statement in the loop adds this value to prev_cum, which was set to 0 before the loop was started.

The next section of code, shown in Figure arenc.04, finishes the accumulation of the cumulative frequencies of the character to be encoded and the previous character, for both of the possible alignments of the character we are encoding and the previous character.
codelist/arenc.04
If the target character has an even ASCII code, we already have the
correct value for prev_cum
; to calculate cum
,
the frequency accumulation for the current character, we need the first
index from the next byte, which is stored in the high half of the byte.
So we pick up current_pair
, shift its high index down,
and use it to retrieve the corresponding weight from the translate
table. Then we add that frequency value to prev_cum
to
calculate cum
.
On the other hand, if the target character has an odd ASCII code,
we need to update both prev_cum
and cum
.
First, we add the total weights for the last two characters to prev_cum
,
which results in cum
. Then we translate the high half of
the current_pair
and add that to prev_cum
.
In our example, the code of the first character is 3, so we need to update both prev_cum and cum. The value of current_pair is (0,2). Looking in the both_weights table, we set total_pair_weight to the translation of that pair, which is 9. Then cum is calculated as the sum of prev_cum and total_pair_weight, or 63. Then we extract the high part of current_pair, 0, which translates to 1; we add this amount to prev_cum, setting it to 55. This means that the code range associated with character "D" starts at 55 and ends slightly before 63, a range of 8 positions out of the total of 63. This will allow us to send out slightly less than three bits for this character, as we will narrow the range by a factor of more than 7.
Now that we have calculated the cumulative frequencies of the current character and the previous character, we are ready to narrow the range of frequencies that correspond to our message, as shown in Figure arenc.05.
codelist/arenc.05
The first line of this section of code calculates the previous range of frequencies for our message. Then the other lines calculate the new range of the message.
The purpose of low
and high
is to
delimit the frequency interval in which the current message falls; range
is just the size of the frequency interval extending from the old value
of low
to the old value of high
, inclusive.
In our example, the old value of low is 0 and the old value of high is 255. Therefore, the formula for range, (long)(high-low)+1, produces 256, which is its maximum value. This makes sense, as we have not yet used the information from the first character of our message.

The new values of low and high represent the narrowing of the previous range due to the frequency of the new character. We calculate the new value of high, which represents the high end of the new range of frequencies for our message after the most recent character is added to the message.[15] Similarly, the new value of low represents the low end of the range of frequencies for our message after the most recent character is added to the message.
In our example, low is still 0, range is 256, cum is 63, and total_freq is 63. The expression to calculate high is low + (range * cum) / total_freq - 1. Therefore, high is calculated as 0+(256*63)/63-1, or 255. This means that the new range of frequencies for our message ends slightly below 256. Next, we recalculate low. Its value is still 0 so far, range is 256, prev_cum is 55, and total_freq is 63. The expression to calculate low is low + (range * prev_cum)/total_freq. Therefore, we calculate the new value of low as 0+(256*55)/63, or 223. This means that the new range of frequencies for our message begins at 223.
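In compact form, the narrowing step amounts to the following (a sketch; the real code operates on the encoder's own variables in Figure arenc.05):

    // Narrow [low, high] to the slice belonging to the current character;
    // prev_cum, cum, and total_freq are as described in the text.
    void narrow_range(long &low, long &high,
                      unsigned prev_cum, unsigned cum, unsigned total_freq)
    {
        long range = (long)(high - low) + 1;         // size of the interval
        high = low + (range * cum) / total_freq - 1; // new top of interval
        low = low + (range * prev_cum) / total_freq; // new bottom of interval
    }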
Now we are finally ready to start sending bits out. The loop shown in Figure arenc.06 extracts as many bits as possible from the encoding of the message so far, and widens the range correspondingly to allow the encoding of more characters.
codelist/arenc.06
In order to understand this code, we need to look at the table of
possible initial bits of high
and low
,
which is given in Figure initcode.
The entries in this table could use some explanation. The first two
columns contain all the possible combinations of the first two bits of low
and high
; we know that low
cannot be
greater than high
, since these two values delimit the
range of codes for the message being compressed. If low
ever became greater than high
, it would be impossible to
encode any further characters. The "Action" column indicates what, if
anything, we can output now. Clearly, if low
and high
have the same first bit, we can output that bit. The entries labeled
"Next" indicate that since the separation between the values of low
and high
is at least one-fourth of the total range of
values (0-TOP_VALUE
), we can encode at least one more
character now; this is the reason for the limit on the total frequency
of all characters.
Low High Action
+--------+---------+---------+
| 00 | 00 | 0 |
| 00 | 01 | 0 |
| 00 | 10 | Next |
| 00 | 11 | Next |
| 01 | 01 | 0 |
| 01 | 10 | Defer |
| 01 | 11 | Next |
| 10 | 10 | 1 |
| 10 | 11 | 1 |
| 11 | 11 | 1 |
+--------+---------+---------+
The first test determines whether the highest
frequency value allocated to our message is less than one-half of the
total frequency range. If it is, we know that the first output bit is
0, so we call the bit_plus_follow_0
macro (Figure arenc.07) to output that bit. Let's
take a look at that macro.
First we call output_0
, which adds a 0 bit to the
output buffer and writes it out if it is full. Then, in the event that bits_to_follow
is greater than 0, we call output_1
and decrement bits_to_follow
until it reaches 0. Why do we do this?
bit_plus_follow_0
macro (from compress\arenc.cpp) (Figure arenc.07)codelist/arenc.07
The reason is that bits_to_follow
indicates the
number of bits that could not be output up till now because we didn't
know the first bit to be produced ("deferred" bits). For example, if
the range of codes for the message had been 011 through 100, we would
be unable to output any bits until the first bit was decided. However,
once we have enough information to decide that the first bit is 0, we
can output three bits, "011". The value of bits_to_follow
would be 2 in that case, since we have two deferred bits. Of course, if
the first bit turns out to be 1, we would emit "100" instead. The
reason that we know that the following bits must be the opposite of the
initial bit is that the only time we have to defer bits is when the
code range is split between codes starting with 0 and those starting
with 1; if both low
and high
started with
the same bit, we could send that bit out.
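Expressed as a single function (the book uses two macros, bit_plus_follow_0 and bit_plus_follow_1, in Figure arenc.07; output_bit is the bit-buffering sketch shown earlier):

    static long bits_to_follow;  // deferred bits awaiting the first known bit

    void bit_plus_follow(int bit, std::FILE *outfile)
    {
        output_bit(bit, outfile);        // the bit we finally know
        while (bits_to_follow > 0)
        {
            output_bit(!bit, outfile);   // deferred bits are the opposite
            bits_to_follow--;
        }
    }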
The values of HALF, FIRST_QTR, and THIRD_QTR are all based on TOP_VALUE (Figure arith.00). In our example, FIRST_QTR is 64, HALF is 128, and THIRD_QTR is 192. The current value of high is 255, which is more than HALF, so we continue with our tests.

Assuming that we haven't obtained the current output bit yet, we continue by testing the complementary condition. If low is greater than or equal to HALF, we know that the entire frequency range allocated to our message so far is in the top half of the total frequency range; therefore, the first bit of the message is 1. If this occurs, we output a 1 via bit_plus_follow_1. The next two lines reduce high and low by HALF, since we know that both are above that value.
In our example, low is 223, which is more than HALF. Therefore, we can call bit_plus_follow_1 to output a 1 bit. Then we adjust both low and high by subtracting HALF, to account for the 1 bit we have just produced; low is now 95 and high is 127.

If we haven't passed either of the tests to output a 0 or a 1 bit, we continue by testing whether the range is small enough but in the wrong position to provide output at this time. This is the situation labeled "Defer" in figure initcode. We know that the first two bits of the output will be either 01 or 10; we just don't know which of these two possibilities it will be. Therefore, we defer the output, incrementing bits_to_follow to indicate that we have done so; we also reduce both low and high by FIRST_QTR, since we know that both are above that value. If this seems mysterious, remember that the encoding information we want is contained in the differences between low and high, so we want to remove any redundancy in their values.[16] If we get to the break statement near the end of the loop, we still don't have any idea what the next output bit(s) will be. This means that the range of frequencies corresponding to our message is still greater than 50% of the maximum possible frequency; we must have encoded an extremely frequent character, since it hasn't even contributed one bit! If this happens, we break out of the output loop; obviously, we have nothing to declare.
In any of the other three cases, we now have arrived at the
statement low <<= 1;
with the values of low
and high
guaranteed to be less than half the maximum
possible frequency.[17] Therefore, we shift
the values of low
and high
up one bit to
make room for our next pass through the loop. One more detail: we
increment high
after shifting it because the range
represented by high
actually extends almost to the next
frequency value; we have to shift a 1 bit in rather than a 0 to keep
this relationship.
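Putting the four cases together, the output loop can be sketched as follows (the function form is illustrative; the book's macro-based version is in Figure arenc.06, and the constants live in arith.h):

    // For 16-bit code values (TOP_VALUE = 65535), the quarter points are:
    const long FIRST_QTR = 16384;          // one-quarter of the code range
    const long HALF = 2 * FIRST_QTR;       // one-half of the code range
    const long THIRD_QTR = 3 * FIRST_QTR;  // three-quarters of the range

    void emit_available_bits(long &low, long &high, std::FILE *outfile)
    {
        for (;;)
        {
            if (high < HALF)
                bit_plus_follow(0, outfile);  // both in lower half: emit 0
            else if (low >= HALF)
            {
                bit_plus_follow(1, outfile);  // both in upper half: emit 1
                low -= HALF;
                high -= HALF;
            }
            else if (low >= FIRST_QTR && high < THIRD_QTR)
            {
                bits_to_follow++;             // the "Defer" case: 01 vs. 10
                low -= FIRST_QTR;
                high -= FIRST_QTR;
            }
            else
                break;              // range still too wide; nothing to send
            low <<= 1;              // make room for the next character
            high = (high << 1) + 1; // high's range extends almost to high+1
        }
    }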
In our example, low is 95 and high is 127, which represents a range of frequencies from 95 to slightly less than 128. The shifts give us 190 for low and 255 for high, which represents a range from 190 to slightly less than 256. If we hadn't added 1 to high, the range would have been from 190 to slightly less than 255.
Since we have been over the code in this loop once already, we can
continue directly with the example. We start out with low
at 190 and high
at 255. Since high
is not
less than HALF
(128), we proceed to the second test,
where low
turns out to be greater than HALF
.
So we call bit_plus_follow_1
again as on the first loop
and then reduce both low
and high
by HALF
,
producing 62 for low
and 127 for high
. At
the bottom of the loop, we shift a 0 into low
and a 1
into high
, resulting in 124 and 255, respectively.
On the next pass through the loop, high
is not less
than HALF
, low
isn't greater than or equal
to HALF
, and high
isn't less than THIRD_QTR
(192), so we hit the break
and exit the loop. We have
sent two bits to the output buffer.
We are finished with encode_symbol
for this
character. Now we will start processing the next character of our
example message, which is 'B'. This character has a symbol value of 1,
as it is the second character in our character set. First, we set prev_cum
to 0. The frequency accumulation loop in Figure arenc.03 will not be executed at all,
since symbol/2
evaluates to 0; we fall through to the
adjustment code in Figure arenc.04
and select the odd path after the else
, since the symbol
code is odd. We set current_pair
to (1,3), since that is
the first entry in the frequency table. Then we set total_pair_weight
to the corresponding entry in the both_weights
table,
which is 54. Next, we set cum
to 0 + 54, or 54. The high
part of the current pair is 1, so high_half_weight
becomes entry 1 in the translate
table, or 2; we add this
to prev_cum
, which becomes 2 as well.
Now we have reached the first line in Figure arenc.05. Since the current value of low
is 124 and the current value of high is 255, the value of
range becomes 132. Next, we recalculate high
as 124 + (132*54)/63 - 1, or 236. The new value of low is
124 + (132*2)/63, or 128. We are ready to enter the output loop.
First, high is not less than HALF, so
the first test fails. Next, low is equal to HALF,
so the second test succeeds. Therefore, we call bit_plus_follow_1
to output a 1 bit; it would also output any deferred bits that we might
have been unable to send out before, although there aren't any at
present. We also adjust low and high by
subtracting HALF, to account for the bit we have just
sent; their new values are 0 and 108, respectively.
Next, we proceed to the statements beginning at low <<= 1;,
where we shift low and high
up, injecting a 0 and a 1, respectively; the new values are 0 for low
and 217 for high. On the next pass through the loop we
will discover that these values are too far apart to emit any more
bits, and we will break out of the loop and return to the main
function.
We could continue with a longer message, but I imagine you get the
idea. So let's return to the main
function (Figure encode.00).[18]
update_model
The next function called is update_model
(Figure adapt.01), which adjusts the
frequencies in the current frequency table to account for having seen
the most recent character.
update_model
function (from compress\adapt.cpp) (Figure adapt.01)codelist/adapt.01
The arguments to this function are symbol
, the
internal value of the character just encoded, and oldch
,
the previous character encoded, which indicates the frequency table
that was used to encode that character. What is the internal value of
the character? In the current version of the program, it is the same as
the ASCII code of the character; however, near the end of the chapter
we will employ an optimization that involves translating characters
from ASCII to an internal code to speed up translation.
This function starts out by adding the character's ASCII code to char_total
,
a hash total which is used in the simple pseudorandom number generator
that we use to help decide when to upgrade a character's frequency to
the next frequency index code. We use the symbol_translation
table to get the ASCII value of the character before adding it to char_total
;
this is present for compatibility with our final version which employs
character translation.
The next few lines initialize some variables: old_weight_code
,
which we set when changing a frequency index code or "weight code", so
that we can update the frequency total for this frequency table; temp_freq_info
,
a pointer to the frequency table structure for the current character;
and freq_ptr
, the address of the frequency table itself.
Next, we compute the index into the frequency table for the weight
code we want to examine. If this index is even, that means that this
symbol is in the high part of that byte in the frequency table. In this
case, we execute the code in the true
branch of the if
statement "if (symbol % 2 == 0)
". This starts by setting temp_freq
to the high four bits of the table entry. If the result is 0, this
character has the lowest possible frequency value; we assume that this
is because it has never been encountered before and set its frequency
index code to INIT_INDEX
. Then we update the total_freq
element in the frequency table.
However, if temp_freq
is not 0, we have to decide
whether to upgrade this character's frequency index code to the next
level. The probability of this upgrade is inversely proportional to the
ratio of the current frequency to the next frequency; the larger the
gap between two frequency code values, the less the probability of the
upgrade. So we compare char_total
to the entry in upgrade_threshold
;
if char_total
is greater, we want to do the upgrade, so
we record the previous frequency code in old_weight_code
and add HIGH_INCREMENT
to the byte containing the
frequency index for the current character. We have to use HIGH_INCREMENT
rather than 1 to adjust the frequency index, since the frequency code
for the current character occupies the high four bits of its byte.
Of course, the character is just as likely to be in the low part of
its byte; in that case, we execute the code in the false
branch of that if
statement, which corresponds exactly to
the above code. In either case, we follow up with the if
statement whose condition is "(old_weight_code != -1)
",
which tests whether a frequency index code was incremented. If it was,
we add the difference between the new code and the old one to the total_freq
entry in the current frequency table, unless the character previously
had a frequency index code of 0; in that case, we have already adjusted
the total_freq
entry.
The last operation in update_model
is to make sure
that the total of all frequencies in the frequency table does not
exceed the limit of MAX_FREQUENCY
; if that were to
happen, more than one character might map into the same value between high
and low
, so that unambiguous decoding would become
impossible. Therefore, if temp_total_freq
exceeds MAX_FREQUENCY
,
we have to reduce the frequency indexes until this is no longer the
case. The while
loop whose continuation expression is "(temp_total_freq
> MAX_FREQUENCY)
" takes care of this problem in the following
way.
First, we initialize temp_total_freq
to 0, as we will
use it to accumulate the frequencies as we modify them. Then we set freq_ptr
to the address of the first entry in the frequency table to be
modified. Now we are ready to step through all the bytes in the
frequency table; for each one, we test whether both indexes in the
current byte are 0. If so, we can't reduce them, so we just add the
frequency value corresponding to the translation of two 0 indexes (BOTH_WEIGHTS_ZERO
)
to temp_total_freq
.
Otherwise, we copy the current index pair into freq
.
If the high index is nonzero, we decrement it. Similarly, if the low
index is nonzero, we decrement it. After handling either or both of
these cases, we add the translation of the new index pair to temp_total_freq
.
After we have processed all of the index values in this way, we retest
the while
condition, and when temp_total_freq
is no longer out of range, we store it back into the frequency table
and return to the main program.
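A sketch of that rescaling pass (names follow the text; it reuses the frequency_info layout sketched earlier, and BOTH_WEIGHTS_ZERO is simply the translation of two zero indexes, i.e., both_weights[0]):

    const unsigned MAX_FREQUENCY = 16383; // one-fourth of the unsigned range

    void rescale_frequencies(frequency_info *temp_freq_info,
                             const unsigned short *both_weights)
    {
        unsigned temp_total_freq = temp_freq_info->total_freq;
        while (temp_total_freq > MAX_FREQUENCY)
        {
            temp_total_freq = 0;
            unsigned char *freq_ptr = temp_freq_info->freq;
            for (int i = 0; i < NO_OF_CHARS / 2; i++, freq_ptr++)
            {
                unsigned char freq = *freq_ptr;
                if (freq == 0)     // both indexes are 0: nothing to reduce
                    temp_total_freq += both_weights[0]; // BOTH_WEIGHTS_ZERO
                else
                {
                    unsigned char high_index = (unsigned char)(freq >> 4);
                    unsigned char low_index = (unsigned char)(freq & 0x0F);
                    if (high_index > 0)
                        high_index--;  // step the high index down one notch
                    if (low_index > 0)
                        low_index--;   // and likewise the low index
                    *freq_ptr =
                        (unsigned char)((high_index << 4) | low_index);
                    temp_total_freq += both_weights[*freq_ptr];
                }
            }
        }
        temp_freq_info->total_freq = (unsigned short)temp_total_freq;
    }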
Finally, we have returned to the main
function (Figure
encode.00), where we copy ch
to oldch
, so that the current character will be used to
select the frequency table for the next character to be encoded; then
we continue in the main loop until all characters have been processed.
When we reach EOF
in the input file, the main loop
terminates; we use encode_symbol
to encode EOF_SYMBOL
,
which tells the receiver to stop decoding. Then we call done_encoding
(Figure arenc.08) and done_outputing_bits
(Figure arenc.09) to flush any
remaining information to the output file. Finally, we close the input
and output files and exit.
done_encoding
function (from compress\arenc.cpp) (Figure arenc.08)codelist/arenc.08
done_outputing_bits
function (from
compress\arenc.cpp) (Figure arenc.09)codelist/arenc.09
Now that we have developed a working version of our program, let's see how much more performance we can get by judicious optimization in the right places. The first step is to find out where those places are. Therefore, I ran the program under Microsoft's profiler on a file of approximately 11 KB, compressing it to a little less than 5 KB (a compression ratio of 2.3 to 1). This didn't take very long, as you can see from the profile in Figure encode.pro1.
Total time: 167.100 milliseconds
80.010 49.0 encode_symbol(unsigned int,unsigned int) (arenc.obj)
29.139 17.9 _main (encode.obj)
18.717 11.5 output_1(void) (arenc.obj)
17.141 10.5 output_0(void) (arenc.obj)
15.957 9.8 update_model(int,unsigned char) (adapt.obj)
2.234 1.4 start_model(void) (adapt.obj)
The majority of the CPU time is spent in the encode_symbol
function, so that's where we'll look first. As I mentioned in the
discussion of the algorithm, a significant amount of the CPU time
needed to encode a symbol is spent in the loop shown in Figure arenc.03. Accordingly, we should be
able to increase the speed of our implementation by using a more
efficient representation of the characters to be encoded, so that the
accumulation loop will be executed fewer times in total. While we
cannot spare the space for self-organizing tables of character
translations, a fixed translation of characters according to some
reasonable estimate of their likelihood in the text is certainly
possible. Let's see how that would affect our performance.
After analyzing a number of (hopefully representative) text files, I
used the results to produce a list of the characters in descending
order of number of occurrences. This information is stored in the symbol_translation
table (Figure adapt.02), which is
used during decoding to translate from the internal code into ASCII.
symbol_translation
table (from compress\adapt.cpp) (Figure adapt.02)codelist/adapt.02
The inverse transformation, used by the encoder, is supplied by the
char_translation
table, which is initialized near the end
of the start_model
function (Figure adapt.00).
To use these new values for each character when encoding and
decoding, we have to make some small changes in the encoding and
decoding main programs. Specifically, in main
, we have to
use the translated character rather than the ASCII code as read from
the input file when we call the encode_symbol
function.
The new version of the main program for encoding is shown in Figure encode1.00.
codelist/encode1.00
Of course, corresponding changes have to be made to the main
function in the decoding program, which then looks like Figure decode1.00.
codelist/decode1.00
Figure char.trans shows the speed improvement resulting from this modification.
Total time: 145.061 milliseconds
58.980 41.8 encode_symbol(unsigned int,unsigned int) (arenc.obj)
25.811 18.3 _main (encode1.obj)
18.964 13.4 output_0(void) (arenc.obj)
18.586 13.2 output_1(void) (arenc.obj)
16.565 11.7 update_model(int,unsigned char) (adapt.obj)
2.213 1.6 start_model(void) (adapt.obj)
(You can find suggested approaches to problems in Chapter artopt.htm).
Footnotes

[10] Since the frequency values are stored in unsigneds, the maximum cumulative frequency is 16,383 if we wish to avoid the use of long arithmetic. This limits the dynamic range to 16,128 for the one common character and 1 for each of the other characters.

[12] The last entry in the upgrade_threshold array is set to 255, which prevents the index from being increased beyond 15, the maximum value that can be stored in a four-bit value; char_total, being an unsigned char variable, cannot exceed 255.

[14] For the precise definitions of low and high, we must wait until we examine encode_symbol (Figure arenc.02).

[15] The range represented by high actually extends almost to high+1; even the authors of the reference that is the source of the original version of this algorithm admit that this is confusing!

[18] At this point the encoded message is still held only in the range between low and high. Rest assured that it will not be left twisting in the wind; there is a function called done_encoding that makes sure that all of the remaining information is encoded in the output stream.
the remaining information is encoded in the output stream.