انت هنا الان : شبكة جامعة بابل > موقع الكلية > نظام التعليم الالكتروني > مشاهدة المحاضرة

oring

الكلية كلية العلوم للبنات     القسم قسم الحاسبات     المرحلة 4
أستاذ المادة محمد عبيد مهدي الجبوري       3/25/2011 8:08:39 AM

sparse strings

 

regardless of what the input data represents—text, binary, images, or anything else—we

 

can think of the input stream as a string of bits. if most of the bits are zeros, the string

 

is sparse. sparse strings can be compressed very efficiently, and this section describes

 

methods developed specifically for this task. before getting to the individual methods

 

it may be useful to convince the reader that sparse strings are not a theoretical concept

 

but do occur commonly in practice. here are some examples.

 

1. a drawing. imagine a drawing, technical or artistic, done with a black pen on

 

white paper. if the drawing is not very complex, most of it remains white. when

 

such a drawing is scanned and digitized, most of the resulting pixels are white, and the

 

percentage of black ones is small. the resulting bitmap is an example of a sparse string.

 

2. a  bitmap index for a large data base. imagine a large data base of text documents.

 

a bitmap for such a data base is a set of bit-strings (or bitvectors) that makes it easy

 

to identify all the documents where a given word w appears. to implement a bitmap,

 

we first have to prepare a list of all the distinct words wj in all the documents. suppose

 

that there are w such words. the next step is to go over each document di and prepare

 

a bit-string di that is w bits long, containing a 1 in position j if word wj appears in

 

document di. the bitmap is the set of all those bit-strings. depending on the database,

 

such bit-strings may be sparse.

 

(indexing large databases is an important operation, since a computerized database

 

should be easy to search. the traditional method of indexing is to prepare a concordance.

 

originally, the word concordance referred to any comprehensive index of the bible, but

 

today there are concordances for the collected works of shakespeare, wordsworth, and

 

many others. a computerized concordance is called an inverted file. such a file includes

 

one entry for each term in the documents constituting the database. an entry is a list

 

of pointers to all the occurrences of the term, similar to an index of a book. a pointer

 

may be the number of a chapter, of a section, a page, a page-line pair, of a sentence, or

 

even of a word. an inverted file where pointers point to individual words is considered

 

fine grained. one where they point to, say, a chapter is considered coarse grained. a

 

fine-grained inverted file may seem preferable, but it must use large pointers, and as a

 

result, it may turn out to be so large that it may have to be stored in compressed form.)

 

3. sparse strings have also been mentioned in section 4.8.3, in connection with jpeg.

 

the methods described here (except prefix compression, section 8.5.5) are due to

 

[fraenkel 85]. section 2.4 discusses the golomb codes and illustrates their application

 

to sparse strings.

 

  or-ing bits

 

this method starts with a sparse string l1 of size n1 bits. in the first step, l1 is divided

 

into k substrings of equal size. in each substring all bits are logically or-ed, and the

 

results (one bit per substring) become string l2, which will be compressed in step 2. all

 

zero substrings of l1 are now deletingd. here is an example of a sparse, 64-bit string l1,

 

which we divide into 16 substrings of size 4 each:

 

l1 = 0000|0000|0000|0100|0000|0000|0000|1000|0000

 

|0000|0000|0000|0010|0000|0000|0000.

 

after oring each 4-bit substring we get the 16-bit string l2 = 0001|0001|0000|1000.

 

in step 2, the same process is applied to l2, and the result is the 4-bit string

 

l3 = 1101, which is short enough so no more compression steps are needed. after

 

deleting all zero substrings in l1 and l2, we end up with the three short strings

 

l1 = 0100|1000|0010, l2 = 0001|0001|1000, l3 = 1101.

 

the output stream consists of seven 4-bit substrings instead of the original 16! (a few

 

more numbers are needed, to indicate how long each substring is.)

 

the decoder works differently (this is an asymmetric compression method). it starts

 

with l3 and considers each of its 1-bits a pointer to a substring of l2 and each of its

 

0-bits a pointer to a substring of all zeros that is not stored in l2. this way string l2

 

can be reconstructed from l3, and string l1, in turn, from l2. figure 8.21 illustrates

 

this process. the substrings shown in square brackets are the ones not contained in the

 

compressed stream.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

this method becomes highly inefficient for strings that are not sparse, and may

 

easily result in expansion. we analyze the worst case, where every group of l1 is

 

nonzero. all n1 bits of string l1 need be written on the output stream. this already

 

shows that there is going to be no compression. string l2 consists of n1/k 1’s, so all of

 

it has to be written on the output stream. string l3 similarly consists of n1/k2 1’s, and

 

so on. the size of the output stream is thus

 

 


المادة المعروضة اعلاه هي مدخل الى المحاضرة المرفوعة بواسطة استاذ(ة) المادة . وقد تبدو لك غير متكاملة . حيث يضع استاذ المادة في بعض الاحيان فقط الجزء الاول من المحاضرة من اجل الاطلاع على ما ستقوم بتحميله لاحقا . في نظام التعليم الالكتروني نوفر هذه الخدمة لكي نبقيك على اطلاع حول محتوى الملف الذي ستقوم بتحميله .