sparse strings
 
regardless of what the input data represents—text, binary, images, or anything else—we
 
can think of the input stream as a string of bits. if most of the bits are zeros, the string
 
is sparse. sparse strings can be compressed very efficiently, and this section describes
 
methods developed specifically for this task. before getting to the individual methods
 
it may be useful to convince the reader that sparse strings are not a theoretical concept
 
but do occur commonly in practice. here are some examples.
 
1. a drawing. imagine a drawing, technical or artistic, done with a black pen on
 
white paper. if the drawing is not very complex, most of it remains white. when
 
such a drawing is scanned and digitized, most of the resulting pixels are white, and the
 
percentage of black ones is small. the resulting bitmap is an example of a sparse string.
 
2. a  bitmap index for a large data base. imagine a large data base of text documents.
 
a bitmap for such a data base is a set of bit-strings (or bitvectors) that makes it easy
 
to identify all the documents where a given word w appears. to implement a bitmap,
 
we first have to prepare a list of all the distinct words wj in all the documents. suppose
 
that there are w such words. the next step is to go over each document di and prepare
 
a bit-string di that is w bits long, containing a 1 in position j if word wj appears in
 
document di. the bitmap is the set of all those bit-strings. depending on the database,
 
such bit-strings may be sparse.
 
(indexing large databases is an important operation, since a computerized database
 
should be easy to search. the traditional method of indexing is to prepare a concordance.
 
originally, the word concordance referred to any comprehensive index of the bible, but
 
today there are concordances for the collected works of shakespeare, wordsworth, and
 
many others. a computerized concordance is called an inverted file. such a file includes
 
one entry for each term in the documents constituting the database. an entry is a list
 
of pointers to all the occurrences of the term, similar to an index of a book. a pointer
 
may be the number of a chapter, of a section, a page, a page-line pair, of a sentence, or
 
even of a word. an inverted file where pointers point to individual words is considered
 
fine grained. one where they point to, say, a chapter is considered coarse grained. a
 
fine-grained inverted file may seem preferable, but it must use large pointers, and as a
 
result, it may turn out to be so large that it may have to be stored in compressed form.)
 
3. sparse strings have also been mentioned in section 4.8.3, in connection with jpeg.
 
the methods described here (except prefix compression, section 8.5.5) are due to
 
[fraenkel 85]. section 2.4 discusses the golomb codes and illustrates their application
 
to sparse strings.
 
  or-ing bits
 
this method starts with a sparse string l1 of size n1 bits. in the first step, l1 is divided
 
into k substrings of equal size. in each substring all bits are logically or-ed, and the
 
results (one bit per substring) become string l2, which will be compressed in step 2. all
 
zero substrings of l1 are now deletingd. here is an example of a sparse, 64-bit string l1,
 
which we divide into 16 substrings of size 4 each:
 
l1 = 0000|0000|0000|0100|0000|0000|0000|1000|0000
 
|0000|0000|0000|0010|0000|0000|0000.
 
after oring each 4-bit substring we get the 16-bit string l2 = 0001|0001|0000|1000.
 
in step 2, the same process is applied to l2, and the result is the 4-bit string
 
l3 = 1101, which is short enough so no more compression steps are needed. after
 
deleting all zero substrings in l1 and l2, we end up with the three short strings
 
l1 = 0100|1000|0010, l2 = 0001|0001|1000, l3 = 1101.
 
the output stream consists of seven 4-bit substrings instead of the original 16! (a few
 
more numbers are needed, to indicate how long each substring is.)
 
the decoder works differently (this is an asymmetric compression method). it starts
 
with l3 and considers each of its 1-bits a pointer to a substring of l2 and each of its
 
0-bits a pointer to a substring of all zeros that is not stored in l2. this way string l2
 
can be reconstructed from l3, and string l1, in turn, from l2. figure 8.21 illustrates
 
this process. the substrings shown in square brackets are the ones not contained in the
 
compressed stream.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
this method becomes highly inefficient for strings that are not sparse, and may
 
easily result in expansion. we analyze the worst case, where every group of l1 is
 
nonzero. all n1 bits of string l1 need be written on the output stream. this already
 
shows that there is going to be no compression. string l2 consists of n1/k 1’s, so all of
 
it has to be written on the output stream. string l3 similarly consists of n1/k2 1’s, and
 
so on. the size of the output stream is thus