DNA compression

College: College of Science for Women     Department: Department of Computers     Stage: 4
Course instructor: Mohammed Obaid Mahdi Al-Jubouri       4/19/2011 6:32:44 AM

In this lecture, we study one basic question: the compressibility of DNA sequences. Life represents order. It is not chaotic or random. Thus, we expect the DNA sequences that encode life to be nonrandom. In other words, they should be very compressible. There is also strong biological evidence that supports this claim: it is well known that DNA sequences contain many tandem repeats; it is also well known that many essential genes have many copies; and it has been conjectured that genes sometimes duplicate themselves for evolutionary or simply for "selfish" purposes. All this gives more concrete support to the claim that DNA sequences should be reasonably compressible. Our purpose is to study such subtleties in DNA sequences.

Almost all known biological functions can be traced to the sequence of nucleotides, or bases, known as DNA, which is found in the cells of every organism. The base sequence in the DNA is transcribed one-to-one in the cell to another sequence of nucleotides known as RNA (whose bases are identical to the bases of DNA with one small difference, noted below). The latter, in the presence of a number of catalysts, is translated into a set of proteins, which are sequences of amino acids that contribute to a host of higher-level cellular functions of the organism. The ‘genetic code’ used in the translation maps a triplet of bases, or ‘codon’, to an amino acid. The code is degenerate, with 61 of the 64 possible codons mapping to 20 amino acids (the remaining three code for terminators).

The entire DNA sequence of an organism is called a ‘genome’. It consists of segments of coding regions and non-coding regions (the non-coding segments that lie within genes are called ‘introns’). A coding region that codes for a single protein is called a ‘gene’. In most species, the non-coding regions are far greater in size than the coding regions. For example, the human genome consists of about 3 billion base pairs arranged in 23 pairs of ‘chromosomes’, each of which contains a number of genes. Only a small fraction of the genome is coding (at most about 10%); the rest is non-coding, with much of it known or suspected to be involved in the chemistry of transcription and translation.

 

DNA’s alphabet is the set {T, C, A, G} (corresponding to the bases thymine, cytosine, adenine, and guanine), while RNA’s is {U, C, A, G} (corresponding to the bases uracil, cytosine, adenine, and guanine). The protein alphabet is a set of 20 amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}, where the letters in the set are the 1-letter codes for alanine, cysteine, aspartic acid, glutamic acid, phenylalanine, glycine, histidine, isoleucine, lysine, leucine, methionine, asparagine, proline, glutamine, arginine, serine, threonine, valine, tryptophan, and tyrosine.
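As a small illustration of the alphabets and the codon count described above, the following Python sketch maps a DNA string to RNA in the simplified one-to-one sense used in this lecture (T becomes U) and enumerates all 64 codons. The function name `transcribe` and the constant names are illustrative assumptions; the three terminator codons listed (UAA, UAG, UGA) are those of the standard genetic code.

```python
from itertools import product

DNA_ALPHABET = "TCAG"
RNA_ALPHABET = "UCAG"
# 1-letter codes of the 20 amino acids.
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Terminator (stop) codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}


def transcribe(dna: str) -> str:
    """Map a DNA string to RNA base by base (T becomes U)."""
    dna = dna.upper()
    unknown = set(dna) - set(DNA_ALPHABET)
    if unknown:
        raise ValueError(f"not a DNA sequence, unexpected symbols: {sorted(unknown)}")
    return dna.replace("T", "U")


# All triplets over the RNA alphabet: 4 * 4 * 4 = 64 codons.
ALL_CODONS = ["".join(triplet) for triplet in product(RNA_ALPHABET, repeat=3)]
# Codons that map to an amino acid: everything except the terminators.
SENSE_CODONS = [codon for codon in ALL_CODONS if codon not in STOP_CODONS]

print(len(ALL_CODONS), len(SENSE_CODONS), len(PROTEIN_ALPHABET))  # 64 61 20
print(transcribe("aagtacacgtacagt"))  # AAGUACACGUACAGU
```

Since 61 codons map to only 20 amino acids, several codons must share an amino acid, which is what "degenerate" means here.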

We will present a lossless DNA compression algorithm, based on exact matching, that gives the best compression results, and we will apply this algorithm to image files. There are two approaches to compressing DNA sequences:

 

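A. Exact matching

Ex. aagtacacgtacagt

can be written as aagtacac(3,5)gt, where the pair (3,5) stands for the 5 bases that start at position 3 of the sequence (gtaca), which are repeated at that point.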

 

B. Approximate matching

Ex. s1 = gaccgtca

    s2 = gaccggca

Given s1, s2 can be written as (R, 6, g), which means: replace the base at position 6 with g.
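The two notations above can be made concrete with a short Python sketch. It assumes that a pair (p, l) means "copy l bases starting at position p of the sequence decoded so far" and that (R, p, c) means "replace the base at position p with c", with positions counted from 1 as in the examples. The token representation, the greedy search, the `min_len` threshold, and the function names are illustrative assumptions, not the lecture's own implementation.

```python
from typing import List, Tuple, Union

# A token is either a literal string of bases or a (position, length) pointer.
Token = Union[str, Tuple[int, int]]


def encode_exact(seq: str, min_len: int = 4) -> List[Token]:
    """Greedy exact matching: replace a repeat by a pointer to its earlier copy."""
    tokens: List[Token] = []
    literal: List[str] = []
    i = 0
    while i < len(seq):
        best_pos, best_len = 0, 0
        for j in range(i):  # try every earlier start position
            k = 0
            while i + k < len(seq) and seq[j + k] == seq[i + k]:
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len >= min_len:
            if literal:  # flush pending literal bases first
                tokens.append("".join(literal))
                literal = []
            tokens.append((best_pos + 1, best_len))  # 1-based position
            i += best_len
        else:
            literal.append(seq[i])
            i += 1
    if literal:
        tokens.append("".join(literal))
    return tokens


def decode_exact(tokens: List[Token]) -> str:
    """Expand literals and (position, length) pointers back into a sequence."""
    out: List[str] = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)
            continue
        pos, length = tok
        start = pos - 1
        if not 0 <= start < len(out):
            raise ValueError(f"pointer {tok} refers outside the decoded text")
        # Copy one base at a time so a repeat may overlap its own copy.
        for offset in range(length):
            out.append(out[start + offset])
    return "".join(out)


def apply_replace(reference: str, pos: int, base: str) -> str:
    """Apply (R, pos, base): substitute one base at a 1-based position."""
    if not 1 <= pos <= len(reference):
        raise ValueError("position out of range")
    return reference[: pos - 1] + base + reference[pos:]


tokens = encode_exact("aagtacacgtacagt")
print(tokens)                               # ['aagtacac', (3, 5), 'gt']
print(decode_exact(tokens))                 # aagtacacgtacagt
print(apply_replace("gaccgtca", 6, "g"))    # gaccggca
```

The quadratic search in `encode_exact` is chosen for readability only; a practical encoder would index earlier substrings (for example with a hash table or suffix structure) to find repeats faster.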

 

 

 

 

 

