
Lossy Compression

College: College of Science for Women     Department: Computer Science     Stage: 4
Course instructor: Mohammed Obaid Mahdi Al-Jubouri       4/2/2011 12:04:38 PM

2.1 Introduction

 

In this section we will give some examples of more sophisticated models. All these techniques take advantage of the "context" in some way. This can either be done by transforming the data before coding (e.g., run-length coding, move-to-front coding, and residual coding), or by directly using conditional probabilities based on a context (JBIG and PPM).

 

2.2 Lossless Methods

 

2.2.1 Huffman Codes

 

Huffman codes are optimal prefix codes generated from a set of probabilities by a particular algorithm, the Huffman coding algorithm. David Huffman developed the algorithm as a student in a class on information theory at MIT in 1950. The algorithm is now probably the most prevalently used component of compression algorithms, used as the back end of gzip, JPEG and many other utilities.

 

The Huffman algorithm is very simple and is most easily described in terms of how it generates the prefix-code tree.

 

* Start with a forest of trees, one for each message. Each tree contains a single vertex with weight w_i = p_i.

 

* Repeat until only a single tree remains:

 

Select the two trees with the lowest-weight roots (say w_1 and w_2).

 

Combine them into a single tree by adding a new root with weight w_1 + w_2, and making the two trees its children. It does not matter which is the left or right child, but our convention will be to put the lower-weight root on the left if w_1 ≠ w_2.

 

For a code of size n this algorithm will require n − 1 steps since every complete binary tree with n leaves has n − 1 internal nodes, and each step creates one internal node. If we use a priority queue with O(log n)-time insertions and find-mins (e.g., a heap) the algorithm will run in O(n log n) time. The key property of Huffman codes is that they generate optimal prefix codes.
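The forest-merging steps above can be sketched directly with a heap. The probabilities below are made up for illustration; a real back end would also handle serializing the tree.

```python
import heapq

def huffman_codes(probs):
    """Build a prefix code from {message: probability} via the Huffman algorithm.

    Start with one single-vertex tree per message; repeatedly merge the two
    lowest-weight trees under a new root until one tree remains, then read
    each message's code off the tree (left = 0, right = 1).
    """
    # Each heap entry: (weight, tiebreak, tree); a tree is a message (leaf)
    # or a (left, right) pair (internal node).
    heap = [(w, i, m) for i, (m, w) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # lowest-weight root
        w2, _, t2 = heapq.heappop(heap)  # second-lowest
        count += 1
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # lower weight on left
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
        return codes
    _, _, root = heap[0]
    return walk(root, "")

codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
```

For these probabilities the code lengths come out as 1, 2, 3, 3 bits, matching the self information of each message exactly, since every probability here is a power of 1/2.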

 

 

2.2.2 Arithmetic Coding

 

Arithmetic coding is a technique for coding that allows the information from the messages in a message sequence to be combined to share the same bits. The technique allows the total number of bits sent to asymptotically approach the sum of the self information of the individual messages (recall that the self information of a message with probability p is defined as −log2 p).

 

To see the significance of this, consider sending a thousand messages each having probability .999. Using a Huffman code, each message has to take at least 1 bit, requiring 1000 bits to be sent. On the other hand the self information of each message is log2(1/.999) ≈ .00144 bits, so the sum of this self-information over 1000 messages is only 1.4 bits. It turns out that arithmetic coding will send all the messages using only 3 bits, a factor of hundreds fewer than a Huffman coder. Of course this is an extreme case, and when all the probabilities are small, the gain will be less significant. Arithmetic coders are therefore most useful when there are large probabilities in the probability distribution.

 

The main idea of arithmetic coding is to represent each possible sequence of n messages by a separate interval on the number line between 0 and 1, e.g. the interval from .2 to .5. For a sequence of messages with probabilities p_1, …, p_n, the algorithm will assign the sequence to an interval of size ∏ p_i, by starting with an interval of size 1 (from 0 to 1) and narrowing the interval by a factor of p_i on each message i. We can bound the number of bits required to uniquely identify an interval of size s, and use this to relate the length of the representation to the self information of the messages.
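The interval-narrowing step can be sketched as follows. The alphabet and probabilities are made up for illustration, and a real coder emits bits incrementally with integer arithmetic rather than computing the interval in floating point:

```python
def narrow_interval(seq, probs):
    """Narrow [0, 1) by a factor of p_i for each message i in the sequence.

    probs maps each message to its probability; cumulative sums give each
    message its own sub-interval of the current interval.
    """
    # Cumulative start of each message's slice of [0, 1).
    cum, total = {}, 0.0
    for m, p in probs.items():
        cum[m] = total
        total += p
    lo, size = 0.0, 1.0
    for m in seq:
        lo += size * cum[m]     # move to the message's sub-interval
        size *= probs[m]        # interval shrinks by a factor of p_m
    return lo, lo + size

lo, hi = narrow_interval("ba", {"a": 0.5, "b": 0.3, "c": 0.2})
```

The final interval has size equal to the product of the message probabilities, so any number inside it identifies the whole sequence using roughly −log2(size) bits.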

 

 

2.2.3 Run-Length Coding

 

Probably the simplest coding scheme that takes advantage of the context is run-length coding. Although there are many variants, the basic idea is to identify strings of adjacent messages of equal value and replace them with a single occurrence along with a count. For example, the message sequence acccbbaaabb could be transformed to (a,1), (c,3), (b,2), (a,3), (b,2). Once transformed, a probability coder (e.g., Huffman coder) can be used to code both the message values and the counts. It is typically important to probability code the run-lengths since short lengths (e.g., 1 and 2) are likely to be much more common than long lengths (e.g., 1356).
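The transform step (without the follow-up probability coding) is a few lines, using the example sequence from the text:

```python
def run_length_encode(msg):
    """Replace each run of equal adjacent messages with (value, count)."""
    runs = []
    for x in msg:
        if runs and runs[-1][0] == x:
            runs[-1] = (x, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((x, 1))               # start a new run
    return runs

def run_length_decode(runs):
    """Inverse transform: expand each (value, count) back into a run."""
    return "".join(v * n for v, n in runs)

runs = run_length_encode("acccbbaaabb")
```

The transform is lossless, so decoding the run list reproduces the original sequence exactly.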

 

An example of a real-world use of run-length coding is the ITU-T T4 (Group 3) standard for facsimile (fax) machines. At the time of writing (1999), this was the standard for all home and business fax machines used over regular phone lines. Fax machines transmit black-and-white images. Each pixel is called a pel and the horizontal resolution is fixed at 8.05 pels/mm. The vertical resolution varies depending on the mode. The T4 standard uses run-length encoding to code each sequence of black and white pixels. Since there are only two message values, black and white, only the run-lengths need to be transmitted. The T4 standard specifies the start color by placing a dummy white pixel at the front of each row so that the first run is always assumed to be a white run. For example, the sequence bbbbwwbbbbb would be transmitted as 1,4,2,5. The T4 standard uses static Huffman codes to encode the run-lengths, and uses separate codes for the black and white run-lengths. To account for runs of more than 64, it has separate codes to specify multiples of 64. For example, a run-length of 150 would consist of the code for 128 followed by the code for 22. A small subset of the codes are given in Table 4.1. These Huffman codes are based on the probability of each run-length measured over a large number of documents. (ITU-T is part of the International Telecommunications Union.)

 

 

2.2.4 Move-to-Front Coding

 

Another simple coding scheme that takes advantage of the context is move-to-front coding. This is used as a sub-step in several other algorithms including the Burrows-Wheeler algorithm discussed later. The idea of move-to-front coding is to preprocess the message sequence by converting it into a sequence of integers, which hopefully is biased toward integers with low values. The algorithm then uses some form of probability coding to code these values. In practice the conversion and coding are interleaved, but we will describe them as separate passes. The algorithm assumes that each message comes from the same alphabet, and starts with a total order on the alphabet (e.g., [a, b, c, d, …]). For each message, the first pass of the algorithm outputs the position of the character in the current order of the alphabet, and then updates the order so that the character is at the head. For example, coding the character c with an order [a, b, c, d, …] would output a 3 and change the order to [c, a, b, d, …]. This is repeated for the full message sequence. The second pass converts the sequence of integers into a bit sequence using Huffman or arithmetic coding.

 

The hope is that equal characters often appear close to each other in the message sequence so that the integers will be biased to have low values. This will give a skewed probability distribution and good compression.
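The first pass can be sketched directly from the description above (a real implementation would use a cheaper data structure than a plain list for the order update):

```python
def move_to_front_encode(msg, alphabet):
    """Output each character's 1-based position in the current order,
    then move that character to the head of the order."""
    order = list(alphabet)
    out = []
    for ch in msg:
        pos = order.index(ch) + 1            # 1-based position, as in the text
        out.append(pos)
        order.insert(0, order.pop(pos - 1))  # move ch to the front
    return out

mtf_codes = move_to_front_encode("ccbbaa", "abcd")
```

On a run-heavy input like "ccbbaa" every repeated character codes as a 1, showing the bias toward low values that the probability coder then exploits.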

 


2.2.5 Residual Coding: JPEG-LS

 

Residual compression is another general compression technique used as a sub-step in several algorithms. As with move-to-front coding, it preprocesses the data so that the message values have a better skew in their probability distribution, and then codes this distribution using a standard probability coder. The approach can be applied to message values that have some meaningful total order (i.e., in which being close in the order implies similarity), and is most commonly used for integer values. The idea of residual coding is that the encoder tries to guess the next message value based on the previous context and then outputs the difference between the actual and guessed value. This is called the residual. The hope is that this residual is biased toward low values so that it can be effectively compressed. Assuming the decoder has already decoded the previous context, it can make the same guess as the coder and then use the residual it receives to correct the guess. By not specifying the residual to its full accuracy, residual coding can also be used for lossy compression.

 

Residual coding is used in JPEG lossless (JPEG-LS), which is used to compress both grayscale and color images. Here we discuss how it is used on grayscale images. Color images can simply be compressed by compressing each of the three color planes separately. The algorithm compresses images in raster order—the pixels are processed starting at the top-most row of an image from left to right and then the next row, continuing down to the bottom. When guessing a pixel the encoder and decoder therefore have at their disposal the pixels to the left in the current row and all the pixels above it in the previous rows. The JPEG-LS algorithm just uses 4 other pixels as a context for the guess—the pixel to the left (W), above and to the left (NW), above (N), and above and to the right (NE). The guess works in two stages. The first stage makes the following guess for each pixel value.

 

                G = min(W, N)      if NW ≥ max(W, N)
                    max(W, N)      if NW ≤ min(W, N)
                    W + N − NW     otherwise                    (3)

 

This might look like a magical equation, but it is based on the idea of taking an average of nearby pixels while taking account of edges. The first and second clauses capture horizontal and vertical edges. For example if N > W and NW ≥ N this indicates a horizontal edge and W is used as the guess. The last clause captures diagonal edges.
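The first-stage guess can be written out as a short function; the pixel values in the example call are hypothetical:

```python
def jpegls_initial_guess(w, n, nw):
    """First-stage pixel guess: min/max of W and N at an edge,
    otherwise the planar interpolation W + N - NW."""
    if nw >= max(w, n):
        return min(w, n)   # edge: take the smaller neighbor
    if nw <= min(w, n):
        return max(w, n)   # edge: take the larger neighbor
    return w + n - nw      # smooth region / diagonal edge

g = jpegls_initial_guess(10, 40, 5)
```

With W = 10, N = 40, NW = 5 the NW pixel is below both neighbors, so the guess is max(W, N) = 40, consistent with a vertical edge between the current column and the one to its left.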

 

Given an initial guess G, a second pass adjusts that guess based on local gradients. It uses the three gradients between the pairs of pixels (NW, W), (NW, N), and (N, NE). Based on the value of the gradients (the difference between the two adjacent pixels) each is classified into one of 9 groups. This gives a total of 729 contexts, of which only 365 are needed because of symmetry. Each context stores its own adjustment value which is used to adjust the guess. Each context also stores information about the quality of previous guesses in that context. This can be used to predict variance and can help the probability coder. Once the algorithm has the final guess for the pixel, it determines the residual and codes it.

 

 


2.2.6 Context Coding: PPM

 

Over the past decade, variants of the PPM algorithm have consistently given either the best or close to the best compression ratios.

 

The main idea of PPM (Prediction by Partial Matching) is to take advantage of the previous k characters to generate a conditional probability of the current character. The simplest way to do this would be to keep a dictionary for every possible string s of k characters, and for each string have counts for every character x that follows s. The conditional probability of x in the context s is then C(x|s)/C(s), where C(x|s) is the number of times x follows s and C(s) is the number of times s appears. The probability distributions can then be used by a Huffman or arithmetic coder to generate a bit sequence. For example, we might have a dictionary with qu appearing 100 times and e appearing 45 times after qu. The conditional probability of the e is then .45 and the coder should use about 1 bit to encode it. Note that the probability distribution will change from character to character since each context has its own distribution. In terms of decoding, as long as the context precedes the character being coded, the decoder will know the context and therefore know which probability distribution to use.

 

There are two problems with the basic dictionary method described in the previous paragraph. First, the dictionaries can become very large. There is no solution to this problem other than to keep k small, typically 3 or 4. A second problem is what happens if the count is zero. We cannot use zero probabilities in any of the coding methods (they would imply infinitely long strings). One way to get around this is to assume a probability of not having seen a sequence before and evenly distribute this probability among the possible following characters that have not been seen. Unfortunately this gives a completely even distribution, when in reality we might know that a is more likely than b, even without knowing its context.

 

The PPM algorithm has a clever way to deal with the case when a context has not been seen before, and is based on the idea of partial matching. The algorithm builds the dictionary on the fly starting with an empty dictionary, and every time the algorithm comes across a string it has not seen before it tries to match a string of one shorter length. This is repeated for shorter and shorter lengths until a match is found. For each length 0, 1, …, k the algorithm keeps statistics of patterns

 

[Figure 2.1: An example of the PPM table for k = 2 on the string accbaccacba, showing the order-0, order-1, and order-2 context counts.]

 

it has seen before and counts of the following characters. In practice this can all be implemented in a single trie. In the case of the length-0 contexts the counts are just counts of each character seen, assuming no context.

 

An example table is given in Figure 2.1 for the string accbaccacba. Now consider following this string with a c. Since the algorithm has the context ba followed by c in its dictionary, it can output the c based on its probability in this context. Although we might think the probability should be 1, since c is the only character that has ever followed ba, we need to give some probability of no match, which we will call the "escape" probability. We will get back to how this probability is set shortly. If instead of c the next character to code is an a, then the algorithm does not find a match for a length-2 context so it looks for a match of length 1, in this case the context is the previous a. Since a has never been followed by another a, the algorithm still does not find a match, and looks for a match with a zero-length context. In this case it finds the a and uses the appropriate probability for a (4/11). What if the algorithm needs to code a d? In this case the algorithm does not even find the character in the zero-length context, so it assigns the character a probability assuming all unseen characters have even likelihood.
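Building the context statistics for orders 0 through k can be sketched with nested dictionaries; a real implementation would use a single trie and also maintain the escape counts:

```python
from collections import defaultdict

def ppm_counts(s, k):
    """For each context length 0..k, count the characters that follow
    each context string seen in s."""
    counts = {j: defaultdict(lambda: defaultdict(int)) for j in range(k + 1)}
    for i, ch in enumerate(s):
        for j in range(k + 1):
            if i >= j:
                ctx = s[i - j:i]            # the j characters before position i
                counts[j][ctx][ch] += 1
    return counts

counts = ppm_counts("accbaccacba", 2)
```

On the example string, the order-0 context has seen a 4 times out of 11 characters (the 4/11 used above), and the only character ever seen after the order-2 context ba is c.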

 

2.2.7 LZW Compression

 

LZW compression is named after its developers, A. Lempel and J. Ziv, with later modifications by Terry A. Welch. It is the foremost technique for general purpose data compression due to its simplicity and versatility. Typically, you can expect LZW to compress text, executable code, and similar data files to about one-half their original size. LZW also performs well when presented with extremely redundant data files, such as tabulated numbers, computer source code, and acquired signals. Compression ratios of 5:1 are common for these cases. LZW is the basis of several personal computer utilities that claim to "double the capacity of your hard drive." LZW compression is always used in GIF image files, and offered as an option in TIFF and PostScript. LZW compression is protected under U.S. patent number 4,558,302, granted December 10, 1985 to Sperry Corporation (now the Unisys Corporation) [10].

 

 

 

 

Figure 2.2: Example code table.

    Code number    Translation
    -----------    -----------
    0000           0
    0001           1
      .              .
      .              .
    0254           254
    0255           255
    0256           145 201 4
    0257           243 245
      .              .
      .              .
    4095           xxx

 

 

                                                                                                                                                                                     

 

Original data stream: 123 145 201 4 119 89 243 245 59 11 206 145 201 4 243 245

 

 

Code-table encoded: 123 256 119 89 257 59 11 206 256 257 …

 

 

 

Example of code table compression. This is the basis of the popular LZW compression method. Encoding occurs by identifying sequences of bytes in the original file that exist in the code table. The 12-bit code representing the sequence is placed in the compressed file instead of the sequence. The first 256 entries in the table correspond to the single byte values, 0 to 255, while the remaining entries correspond to sequences of bytes. The LZW algorithm is an efficient way of generating the code table based on the particular data being compressed. (The code table in this figure is a simplified example, not one actually generated by the LZW algorithm.)
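A minimal LZW encoder in this spirit is sketched below. Note that, unlike the simplified pre-built table in the figure, a real LZW table only ever contains sequences the encoder has already seen once, so the codes it emits for this stream differ from the figure's:

```python
def lzw_encode(data):
    """LZW: grow a code table of byte sequences on the fly; emit the code
    for the longest match, then add (match + next byte) as a new entry."""
    table = {bytes([i]): i for i in range(256)}  # codes 0-255: single bytes
    next_code = 256
    out, cur = [], b""
    for b in data:
        ext = cur + bytes([b])
        if ext in table:
            cur = ext                   # keep extending the current match
        else:
            out.append(table[cur])      # emit code for longest match
            table[ext] = next_code      # add the extended sequence
            next_code += 1
            cur = bytes([b])
    if cur:
        out.append(table[cur])
    return out

lzw_out = lzw_encode(bytes([123, 145, 201, 4, 119, 89, 243, 245,
                            59, 11, 206, 145, 201, 4, 243, 245]))
```

On this short stream the first pass through each sequence only builds table entries, so the encoder benefits once "145 201" and "243 245" repeat near the end; on longer, more redundant data the savings compound quickly.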

 

 

2.3 Lossy Compression Techniques

 

Lossy compression is compression in which some of the information from the original message sequence is lost. This means the original sequence cannot be regenerated from the compressed sequence. Just because information is lost doesn't mean the quality of the output is reduced. For example, random noise has very high information content, but when present in an image or a sound file, we would typically be perfectly happy to drop it. Also certain losses in images or sound might be completely imperceptible to a human viewer (e.g. the loss of very high frequencies). For this reason, lossy compression algorithms on images can often get a factor of 2 better compression than lossless algorithms with an imperceptible loss in quality. However, when quality does start degrading in a noticeable way, it is important to make sure it degrades in a way that is least objectionable to the viewer (e.g., dropping random pixels is probably more objectionable than dropping some color information). For these reasons, the way most lossy compression techniques are used is highly dependent on the media that is being compressed. Lossy compression for sound, for example, is very different than lossy compression for images.

 

In this section we go over some general techniques that can be applied in various contexts, and in the next two sections we go over more specific examples and techniques.

 

 

2.3.1 Scalar Quantization

 

A simple way to implement lossy compression is to take the set of possible messages S and reduce it to a smaller set S' by mapping each element of S to an element in S'. For example we could take

 

 

 

 

Figure 2.3: Examples of (a) uniform and (b) non-uniform scalar quantization.

 

8-bit integers and divide by 4 (i.e., drop the lower two bits), or take a character set in which upper and lowercase characters are distinguished and replace all the uppercase ones with lowercase ones. This general technique is called quantization. Since the mapping used in quantization is many-to-one, it is irreversible and therefore lossy.

 

In the case that the set S comes from a total order and the total order is broken up into regions that map onto the elements of S', the mapping is called scalar quantization. The example of dropping the lower two bits given in the previous paragraph is an example of scalar quantization. Applications of scalar quantization include reducing the number of color bits or gray-scale levels in images (used to save memory on many computer monitors), and classifying the intensity of frequency components in images or sound into groups (used in JPEG compression). In fact we mentioned an example of quantization when talking about JPEG-LS. There quantization is used to reduce the number of contexts instead of the number of message values. In particular we categorized each of 3 gradients into one of 9 levels so that the context table needs only 9^3 entries (actually only (9^3 + 1)/2 due to symmetry).
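The divide-by-4 example of uniform scalar quantization takes one line each way; reconstructing at the middle of each region is one common choice for the inverse:

```python
def uniform_quantize(pixel):
    """Map an 8-bit value (0-255) to one of 64 levels by dropping
    the low two bits (integer divide by 4)."""
    return pixel >> 2

def dequantize(level):
    """Reconstruct at the middle of the 4-wide region; the lost low
    bits cannot be recovered, which is what makes the scheme lossy."""
    return (level << 2) + 2

level = uniform_quantize(201)
```

Every value 200 through 203 maps to level 50 and comes back as 202, so the reconstruction error is at most 2 gray levels.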

 

The term uniform scalar quantization is typically used when the mapping is linear. Again, the example of dividing 8-bit integers by 4 is a linear mapping. In practice it is often better to use a nonuniform scalar quantization. For example, it turns out that the eye is more sensitive to low values of red than to high values. Therefore we can get better quality compressed images by making the regions in the low values smaller than the regions in the high values. Another choice is to base the nonlinear mapping on the probability of different input values. In fact, this idea can be formalized—for a given error metric and a given probability distribution over the input values, we want a mapping that will minimize the expected error. For certain error metrics, finding this mapping might be hard. For the root-mean-squared error metric there is an iterative algorithm known as the Lloyd-Max algorithm that will find the optimal mapping. An interesting point is that finding this optimal mapping will have the effect of decreasing the effectiveness of any probability coder that is used on the output. This is because the mapping will tend to more evenly spread the probabilities in S'.

 

 

 

Figure 2.4: Example of vector quantization for a height-weight chart.

 

2.3.2 Vector Quantization

 

Scalar quantization allows one to separately map each color of a color image into a smaller set of output values. In practice, however, it can be much more effective to map regions of 3-d color space into output values. By more effective we mean that a better compression ratio can be achieved based on an equivalent loss of quality.

 

The general idea of mapping a multidimensional space into a smaller set of messages S' is called vector quantization. Vector quantization is typically implemented by selecting a set of representatives from the input space, and then mapping all other points in the space to the closest representative. The representatives could be fixed for all time and part of the compression protocol, or they could be determined for each file (message sequence) and sent as part of the sequence. The most interesting aspect of vector quantization is how one selects the representatives. Typically it is implemented using a clustering algorithm that finds some number of clusters of points in the data. A representative is then chosen for each cluster by either selecting one of the points in the cluster or using some form of centroid for the cluster. Finding good clusters is a whole interesting topic on its own.
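The mapping step (given representatives already chosen by some clustering algorithm) is just a nearest-neighbor search. The height/weight representatives below are hypothetical, chosen only to illustrate the idea:

```python
def vq_encode(points, reps):
    """Map each point to the index of its nearest representative."""
    def d2(p, q):
        # squared Euclidean distance; no need for the square root
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [min(range(len(reps)), key=lambda i: d2(p, reps[i])) for p in points]

# Hypothetical (height cm, weight kg) representatives.
reps = [(160, 55), (175, 70), (190, 90)]
idx = vq_encode([(162, 58), (188, 95)], reps)
```

Each point is then transmitted as a small index into the shared representative table rather than as two full coordinates.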

 

Vector quantization is most effective when the variables along the dimensions of the space are correlated. Figure 2.4 gives an example of possible representatives for a height-weight chart. There is clearly a strong correlation between people's height and weight and therefore the representatives can be concentrated in areas of the space that make physical sense, with higher densities in more common regions. Using such representatives is much more effective than separately using scalar quantization on the height and weight.

 

We should note that vector quantization, as well as scalar quantization, can be used as part of a lossless compression technique. In particular if in addition to sending the closest representative, the coder sends the distance from the point to the representative, then the original point can be reconstructed. The distance is often referred to as the residual. In general this would not lead to any compression, but if the points are tightly clustered around the representatives, then the technique can be very effective for lossless compression since the residuals will be small and probability coding will work well in reducing the number of bits.

 


 

 

2.3.3 Transform Coding

 

The idea of transform coding is to transform the input into a different form which can then either be compressed better, or for which we can more easily drop certain terms without as much qualitative loss in the output. One form of transform is to select a linear set of basis functions {φ_i} that span the space to be transformed. Some common sets include sin, cos, polynomials, spherical harmonics, Bessel functions, and wavelets. Figure 18 shows some examples of the first three basis functions for discrete cosine, polynomial, and wavelet transformations. For a set of n values, transforms can be expressed as an n × n matrix T. Multiplying the input by this matrix T gives the transformed coefficients. Multiplying the coefficients by T^(−1) will convert the data back to the original form. For example, the coefficients for the discrete cosine transform (DCT) are

 

                T[i][j] = cos( (2j + 1)iπ / 2n ),   0 ≤ i, j < n        (4)

The DCT is one of the most commonly used transforms in practice for image compression, more so than the discrete Fourier transform (DFT). This is because the DFT assumes periodicity, which is not necessarily true in images. In particular to represent a linear function over a region requires many large amplitude high-frequency components in a DFT. This is because the periodicity assumption will view the function as a sawtooth, which is highly discontinuous at the teeth requiring the high-frequency components. The DCT does not assume periodicity and will only require much lower amplitude high-frequency components. The DCT also does not require a phase, which is typically represented using complex numbers in the DFT.
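The matrix view of the transform can be sketched directly (this uses the unnormalized cosine matrix of equation (4); practical codecs add per-row scale factors so that T is orthonormal):

```python
import math

def dct_matrix(n):
    """Unnormalized DCT matrix: T[i][j] = cos((2j + 1) * i * pi / 2n)."""
    return [[math.cos((2 * j + 1) * i * math.pi / (2 * n)) for j in range(n)]
            for i in range(n)]

def transform(T, x):
    """Multiply the input vector by the transform matrix."""
    return [sum(T[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

T = dct_matrix(4)
coeffs = transform(T, [1.0, 1.0, 1.0, 1.0])
```

For a constant input every coefficient except the first is (numerically) zero, illustrating property (2) below: smooth inputs concentrate their energy in a few low-frequency terms, and the rest can be quantized aggressively or dropped.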

 

For the purpose of compression, the properties we would like of a transform are (1) to decorrelate the data, (2) to have many of the transformed coefficients be small, and (3) to have it so that from the point of view of perception, some of the terms are more important than others.

 


2.4 Other Lossy Transform Codes

 

2.4.1 Wavelet Compression

 

JPEG and MPEG decompose images into sets of cosine waveforms. Unfortunately, cosine is a periodic function; this can create problems when an image contains strong aperiodic features. Such local high-frequency spikes would require an infinite number of cosine waves to encode properly. JPEG and MPEG solve this problem by breaking up images into fixed-size blocks and transforming each block in isolation. This effectively clips the infinitely-repeating cosine function, making it possible to encode local features.

 

An alternative approach would be to choose a set of basis functions that exhibit good locality without artificial clipping. Such basis functions, called "wavelets", could be applied to the entire image, without requiring blocking and without degenerating when presented with high-frequency local features.

 

How do we derive a suitable set of basis functions? We start with a single function, called a "mother function". Whereas cosine repeats indefinitely, we want the wavelet mother function Φ(x) to be contained within some local region, and approach zero as we stray further away:

 

                      Φ(x) → 0   as   x → ±∞                    (5)

 

The family of basis functions are scaled and translated versions of this mother function. For some scaling factor s and translation factor l,

 

                      Φ_{s,l}(x) = 2^{s/2} Φ(2^s x − l)         (6)

 

A well-known family of wavelets are the Haar wavelets, which are derived from the following mother function:

 

                      Φ(x) =  1   for 0 ≤ x < 1/2
                             −1   for 1/2 ≤ x < 1
                              0   otherwise                     (7)
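The mother function, and one level of the Haar transform in its common averaging/differencing formulation, can be sketched as follows (the sample signal is made up):

```python
def haar_mother(x):
    """Haar mother function: +1 on the first half of [0, 1),
    -1 on the second half, 0 elsewhere."""
    if 0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def haar_step(values):
    """One level of the Haar transform: pairwise averages (coarse signal)
    and pairwise half-differences (wavelet detail coefficients)."""
    pairs = list(zip(values[::2], values[1::2]))
    avgs = [(a + b) / 2 for a, b in pairs]
    diffs = [(a - b) / 2 for a, b in pairs]
    return avgs, diffs

avgs, diffs = haar_step([9.0, 7.0, 3.0, 5.0])
```

Applying the step recursively to the averages gives the full multi-level transform; on smooth signals the detail coefficients are small and quantize well, which is where the compression comes from.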

 

 

2.4.2 Model-Based Compression

 

We briefly present one last transform coding scheme, model-based compression. The idea here is to characterize the source data in terms of some strong underlying model. The popular example here is faces. We might devise a general model of human faces, describing them in terms of anatomical parameters like nose shape, eye separation, skin color, cheekbone angle, and so on. Instead of transmitting the image of a face, we could transmit the parameters that define that face within our general model. Assuming that we have a suitable model for the data at hand, we may be able to describe the entire system using only a few bytes of parameter data.

 

Both sender and receiver share a large body of a priori knowledge contained in the model itself (e.g., the fact that faces have two eyes and one nose). The more information is shared in the model, the less need be transmitted with any given data set. Like wavelet compression, model-based compression works by characterizing data in terms of a deeper underlying generator. Model-based encoding has found applicability in such areas as computerized recognition of four-legged animals or facial expressions.

 

