
International Journal of Computer Trends and Technology- volume4Issue3- 2013

Digital Signal Processing Techniques for Protein Coding Regions Identification of Rheumatic Arthritis (RA) Disease
Dr.K.B.Ramesh1, Prabhu Shankar.K.S2, Dr.B.P.Mallikarjunaswamy3, Dr.E.T.Puttaiah4
(1) Associate Professor, Dept. of Instrumentation Technology, R.V.College of Engineering, Bangalore, India (2) Biomedical Signal Processing and Instrumentation, I.T Dept., R.V.College of Engineering, Bangalore, India. (3) Professor, Department of Computer Science and Engineering, SSIT, Tumkur, Karnataka, India. (4) Professor, Dept. of Environmental science, Vice-chancellor, Gulbarga University, Karnataka, India.

Abstract: Rheumatic arthritis is a chronic autoimmune disease that ultimately affects a person's ability to carry out everyday tasks. The prediction of genes involved in RA is an important application in bioinformatics. For protein analysis, the locations and lengths of coding regions in a genomic sequence play a prominent role in predicting exons. DSP tools have been applied in this field based on the observation that coding regions have a prominent period-3 spectrum peak at frequency f = 1/3 due to the presence of codons (triplets of nucleotides), while non-coding regions lack such a peak. This paper reviews different digital signal processing techniques that lead to improved computational methods for solving useful problems in genomic information science and technology.
Keywords: DNA, rheumatic arthritis (RA), gene prediction, period-3 periodicity, digital signal processing techniques.

Bioinformatics is a new, growing area of science that uses computational approaches to answer biological questions. With the explosion of sequence and structural information available to researchers, bioinformatics is playing an increasingly large role in the study of fundamental biomedical problems. In all areas of biological and medical research, the role of the computer has been dramatically enhanced over the last five to ten years. While the first wave of computational analysis focused on sequence analysis, where many important unsolved problems remain, current and future needs particularly concern the sophisticated integration of extremely diverse sets of data. These novel types of data originate from a variety of experimental techniques, many of which can produce data at the level of entire cells, organs, organisms, or even populations. The main driving force behind these changes has been the advent of new, efficient experimental techniques, primarily DNA sequencing, that have led to an exponential growth of linear descriptions of protein, DNA and RNA molecules.

The Bioinformatics Toolbox offers computational molecular biologists and other research scientists an open and extensible environment in which to explore ideas, prototype new algorithms, and build applications in drug research, genetic engineering, and other genomics and proteomics projects. The toolbox provides access to genomic and proteomic data formats, analysis techniques, and specialized visualizations for genomic and proteomic sequence and microarray analysis. Most functions are implemented in the open MATLAB language, so users can customize the algorithms or develop their own.

Rheumatoid arthritis is a chronic form of inflammatory arthritis that may affect many tissues and organs but principally attacks flexible (synovial) joints.

Fig. 1 Joint affected by RA

Rheumatoid arthritis can also produce diffuse inflammation in the lungs, the membrane around the heart, the membranes of the lung (pleura), and the white of the eye, as well as nodular lesions, most commonly in subcutaneous tissue. Rheumatoid arthritis may affect many different joints and damage cartilage, tendons and ligaments; it can even wear away the ends of the bones. One common outcome is joint deformity and disability. Some people with RA develop rheumatoid nodules,

ISSN: 2231-2803 http://www.internationaljournalssrg.org


lumps of tissue that form under the skin, often over bony areas exposed to pressure. These occur most often around the elbows but can be found elsewhere on the body, such as on the fingers, over the spine or on the heels. Over time, the inflammation that characterizes RA can also affect numerous organs and internal systems.

A DNA sequence is made from an alphabet of four elements, namely A, C, G and T (adenine, cytosine, guanine and thymine, respectively). These letters represent molecules called nucleotides or bases. A large number of functions in living organisms are governed by proteins, which are sequences of amino acids. As shown in Fig. 2, a DNA sequence can be divided into genes and intergenic spaces. The genes are responsible for protein synthesis. A gene can be divided into two sub-regions called exons and introns. Only the exons are involved in protein coding; the introns do not participate in protein synthesis.

Fig. 2 Various regions in a DNA molecule

The bases in the exon region can be imagined to be divided into groups of three adjacent bases. Each triplet is called a codon. Scanning the gene from left to right, a codon sequence can be defined by concatenating the codons in all the exons. Each codon instructs the cell machinery to synthesize an amino acid. Since there are 64 possible codons but only 20 amino acids, the mapping from codons to amino acids is many-to-one. The codon sequence therefore uniquely identifies an amino acid sequence, which defines a protein. It is known that exons (coding regions) are rich in nucleotides C and G whereas introns (non-coding regions) are rich in nucleotides A and T, and that protein-coding regions of nucleotide sequences exhibit a period-3 property, which likely results from the three-base length of the codons used to generate amino acids. According to [6], the complementary strands are statistically symmetric. Thus, the non-coding region for the 5'-3' direction will have the same period-3 property as the coding region. This means that the period-3 property still holds in prokaryotes and can be used to detect coding regions. Gene identification is a very complex problem, and the identification of period-3 regions is only a step towards gene and exon identification.

The process is divided into four steps, as shown in Fig. 3. The first step converts the DNA sequence into a numerical sequence, since DSP tools can handle only numerical entities. The next step chooses a window, so that a long sequence is divided into frames and a specific length of sequence is processed at a time. The window is then slid over the whole sequence. It has been shown that the choice of window and its length affect the prediction results. In addition, researchers in this field can address the weaknesses of their methods by improving the efficiency of each component of the process. The third stage is an important stage of the process: here the period-3 component is extracted to discriminate between protein-coding and non-protein-coding regions. Each time the sliding window is shifted by one or more base positions, the power spectrum at frequency index L/3 is computed, where L is the window length. After the period-3 property has been extracted, the sequence is classified into exon (protein-coding) and intron (non-protein-coding) regions using a threshold or a learning process. Thresholding is one of the major challenges in this field, since the optimal threshold value can differ from one sequence to another.

Fig. 3 Functional Block Diagram

Organization of the paper: Each type of technique is reviewed in Sections II, III, IV, V and VI. The conclusion is drawn in Section VII, followed by the references.

II. METHODS AND RESULTS

P. P. Vaidyanathan and Byung-Jun Yoon [2] proposed a digital filtering technique for the prediction of genes. Here the sliding window is regarded as a digital filter, with the impulse response shown in Fig. 4.

Fig. 4 Impulse responses of Band pass Filter.
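As a minimal illustration of the sliding-window view, the following Python sketch evaluates the power at frequency 2π/3 in each window of a DNA string. It assumes binary indicator sequences for the four bases as the numerical mapping and sums the power over the four bases; the function names `indicator_sequences` and `period3_spectrum` are hypothetical, and the default window of 351 simply follows the window size quoted later in the text. It is a sketch under these assumptions, not the implementation used in [2].

```python
import cmath

BASES = "ACGT"


def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per base."""
    dna = dna.upper()
    return {b: [1.0 if ch == b else 0.0 for ch in dna] for b in BASES}


def period3_spectrum(dna, window=351, step=1):
    """Power at frequency 2*pi/3 in each window, summed over the four bases.

    Equivalent to taking DFT bin k = window/3 of every frame when the
    window length is a multiple of 3.
    """
    kernel = [cmath.exp(-2j * cmath.pi * n / 3) for n in range(window)]
    indicators = indicator_sequences(dna)
    spectrum = []
    for start in range(0, len(dna) - window + 1, step):
        power = 0.0
        for seq in indicators.values():
            frame = seq[start:start + window]
            dft_bin = sum(x * w for x, w in zip(frame, kernel))
            power += abs(dft_bin) ** 2
        spectrum.append(power)
    return spectrum
```

A strongly period-3 input such as a repeated codon yields a large, constant value, whereas a homopolymer yields a value near zero; windows whose value exceeds a chosen threshold would then be flagged as candidate coding regions.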

An efficient filter design can reduce the computational complexity and can isolate period-3 behaviour from background information such as 1/f noise more effectively. Here the band-pass filter has a passband centred at $\omega_0 = 2\pi/3$ and a minimum stopband attenuation of about 13 dB, as shown in Fig. 4. Consider a digital filter $H(z)$ with the indicator sequence $x_G(n)$ as its input, and let $y_G(n)$ denote its output, as indicated in Fig. 5.

Fig. 5 Digital filter H(z) with frequency response $H(e^{j\omega})$

The narrow-band filter $H(z)$ [1] has been regarded as an antinotch IIR filter for gene prediction, and the gene sequence F56F11.4 in the C. elegans chromosome III was used for the analysis. This gene has five exons, as depicted in Fig. 6.

Fig. 6 Output obtained from Digital filter

III. CORRELATING AND FILTERING APPROACH

Lun Huang, Mohammad Al Bataineh and G. E. Atkin [3] introduced a novel gene detection method based on the period-3 property, which uses correlation [13] and filtering techniques. Here the complex poly-phase set $\{1, j, -1, -j\}$ is used, and the resulting sequence is correlated with four sequences to predict the exon regions. These four sequences are composed by using 3 out of the 4 types of bases in a period-3 pattern. The genome sequence $X(n)$ is taken as the input, and the four period-3 sequences are $S_1(n)$, $S_2(n)$, $S_3(n)$ and $S_4(n)$. The outputs after the correlation are

$$Y_i(n) = X(n) * S_i(-n), \quad i = 1, \dots, 4,$$

where $*$ denotes convolution. The maximum ratio combination algorithm [8] is used to combine the four output sequences. Period-3 filtering is used as a sliding window. To predict the exons, a window of length $L$ is used for the analysis ($L$ is odd, so that the window is symmetric around its centre). The algorithm is as follows.

1. $L_0$ and $L_1$ are initialized to $L_x/3$, and $p_0$ is set to 2, where $L_0$ is the initial window length, $L_1$ is the current window length, and $p_0$ is the original peak number.
2. A window of length $L_1$ is used to detect the peaks of the filter output sequence $f(n)$.
3. If the number of detected peaks is $p_1 < p_0 + 1$, update $L_1$ and go to step 2; otherwise, set the minimum value of the variance $V_{\min} = \infty$.
4. For this $p_1$-peak distribution, the peak interval sequence $\{I_j\}$, $j = 1, 2, \dots, p_1 - 1$, is found.
5. The variance $V_1$ of $\{I_j\}$ about its mean $\bar{I}$ is computed.
6. If $V_1 < V_{\min}$, set $V_{\min} = V_1$, $p_0 = p_1$, $L_0 = L_1$ and go back to step 4; otherwise set $L_1 = L_0$.
7. In the same way as in steps 2 and 3, minimize $L_1$ for peak number $p_1 = p_0$, and set $L = 2\lceil \tfrac{1}{3}\min(L_1) \rceil + 1$, where $\min(\cdot)$ is the minimization function and $\lceil \cdot \rceil$ rounds its argument up to the next integer.

This approach gave better results than the DFT and SONF approaches, detecting more coding regions. It was applied to both prokaryote and eukaryote genome sequences, and the output obtained is shown in Fig. 7.

Fig. 7 Exons predicted by novel gene prediction method.
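The building blocks of this section can be sketched in Python as follows: the correlation written as a convolution with the time-reversed template, the variance of the peak intervals from steps 4 and 5, and the final odd window length from step 7. The function names are hypothetical, the use of the population variance is an assumption (the source does not preserve the exact formula), and peak detection itself is not shown. This is an illustrative sketch, not the implementation of [3].

```python
import math


def correlate_with_template(x, s):
    """Y(n) = X(n) * S(-n): full convolution of x with the time-reversed s.

    Output index m corresponds to lag m - (len(s) - 1).
    """
    reversed_s = s[::-1]
    y = [0j] * (len(x) + len(reversed_s) - 1)
    for i, xi in enumerate(x):
        for k, sk in enumerate(reversed_s):
            y[i + k] += xi * sk
    return y


def peak_interval_variance(peak_positions):
    """Variance of the intervals between consecutive peaks (steps 4 and 5).

    Needs at least two peaks. Population variance is assumed here.
    """
    positions = sorted(peak_positions)
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    mean = sum(intervals) / len(intervals)
    return sum((i - mean) ** 2 for i in intervals) / len(intervals)


def final_window_length(min_l1):
    """Step 7: L = 2 * ceil(min(L1) / 3) + 1, which is always odd."""
    return 2 * math.ceil(min_l1 / 3) + 1
```

Evenly spaced peaks give a variance of zero, which is why the algorithm keeps the window length that minimizes this quantity.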


There was a disadvantage with the methods introduced in [2, 3]: harmonics of the frequency $2\pi/3$ appear along with the $2\pi/3$ frequency component. These harmonic frequencies give a false measure of the period-3 property by adding peak strength to both exons and introns. [4] therefore proposed a harmonic suppression (HS) filter and a minimum variance (MV) filter to suppress the harmonics. The harmonic suppression filter has dominant zeros at the multiples of the frequency $2\pi/6$ except at $2\pi/3$, where a dominant pole is placed, so that the filter suppresses the harmonic frequencies while passing the frequency $2\pi/3$. The peak at 16 kHz is evidence of the pole at angular frequency $2\pi/3$ radians. The pole-zero plot of the designed filter has poles of magnitude 0.898, 0.898, 0.998 and 0.898 at angular frequencies $\omega = 0, 2\pi/6, 2\pi/3, \pi$ radians, respectively, and zeros of magnitude 0.998, 0.998, 0.898 and 0.998 at angular frequencies $\omega = 0, 2\pi/6, 2\pi/3, \pi$ radians, respectively. Suppressing the higher-period harmonics is not necessary, because their contribution in a window of only 351 samples is negligible, but it is necessary to suppress the harmonics of $2\pi/3$. Fig. 8 shows that the HS filter is able to detect the shorter exons where the antinotch filter failed; however, the problem of spurious peaks in introns remains, because the complex-conjugate harmonic frequency components are not attenuated. These can be suppressed by the minimum variance filter [9].

Fig. 8 Output obtained by HS Filter.

Another approach [4] to gene prediction makes use of the minimum variance filter, which is an adaptation of the Maximum Likelihood Method (MLM) developed by Capon for the analysis of two-dimensional power spectral densities. The minimum variance spectrum estimation technique gives the flexibility to minimize the power at the side-lobe frequencies, thus maximizing the power in the main lobe. The technique involves the following steps:

1) Design a band-pass filter $g(n)$ with centre frequency $\omega = 2\pi/3$ so that the filter rejects the maximum amount of out-of-band power while passing the component at frequency $\omega$ with no distortion.
2) Filter the DNA sequence $x(n)$ with the filter and estimate the power in each output process $y(n)$.

Hence the impulse response of such a filter for a given input sequence can be given as

$$g = \frac{R_x^{-1} e}{e^H R_x^{-1} e}$$

where $e^H$ is the Hermitian (complex conjugate) transpose of the exponential vector $e$, $R_x$ is the $p \times p$ Toeplitz autocorrelation matrix of the samples in the current window, and $g$ is the impulse response of the minimum variance filter with band-pass frequency $\omega = 2\pi/3$. In [4], the band-pass filter was first designed to reject the maximum amount of out-of-band power at centre frequency $\omega = 2\pi/3$ while passing the component at frequency $\omega$ with no distortion; the power was then estimated by filtering the DNA sequence $x(n)$ and measuring each output process $y(n)$. Fig. 9 shows the predicted exons using the MV filter. Accuracy was improved by suppressing the harmonic frequencies with the HS filter and the adaptive minimum variance filter.

V. RECURSIVE WIENER-KHINCHINE THEOREM

[5] proposed a comparative approach based on the recursive Wiener-Khinchine theorem (RWKT) [10] for locating protein-coding regions. It is an algorithm that applies the Wiener-Khinchine theorem to the autocorrelation function of a sliding window to estimate the power spectral density [14]. With a window of size N, the DFT of each window is calculated from that of the previous N-point window by just one complex multiplication and two real additions. In [5], both the DFT and the RWKT were applied to a DNA sequence, and it was shown that the RWKT provides a better estimate of the power spectral density than the DFT but is inefficient in terms of computation time.

Fig. 10 Output of RWKT with good resolution
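The idea of updating a window's DFT from the previous window with one complex multiplication and two real additions can be illustrated with the generic sliding-DFT recurrence $X_{m+1}(k) = \big(X_m(k) - x(m) + x(m+N)\big)\,e^{j2\pi k/N}$ for a real input and integer bin index $k$. The Python sketch below implements this recurrence; it is a standard sliding-DFT update and is not claimed to be the exact recursion of [5] or [10]. The function names `dft_bin` and `sliding_dft_bin` are hypothetical.

```python
import cmath


def dft_bin(frame, k):
    """Directly compute DFT bin k of one frame (used for the first window)."""
    n_pts = len(frame)
    return sum(
        x * cmath.exp(-2j * cmath.pi * k * n / n_pts)
        for n, x in enumerate(frame)
    )


def sliding_dft_bin(x, window, k):
    """Track DFT bin k over all length-`window` frames of a real sequence x.

    k must be an integer; for the period-3 bin use k = window // 3 with a
    window length that is a multiple of 3.
    """
    twiddle = cmath.exp(2j * cmath.pi * k / window)
    current = dft_bin(x[:window], k)
    out = [current]
    for m in range(len(x) - window):
        # Drop the oldest sample, add the newest, then rotate the phase.
        current = (current - x[m] + x[m + window]) * twiddle
        out.append(current)
    return out
```

Each entry of the result matches the bin obtained by recomputing the DFT of that frame from scratch, while the per-shift cost stays constant instead of growing with the window length.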


VI. FFT, FIR, IIR METHODS

[6] compared the performance of three spectral methods, the FFT, FIR and IIR (DF-based) methods, for the classification of short introns and exons. Computational complexity is taken as a factor in the comparison, and it depends on the FFT algorithm used. The K-Quaternary Code with the nucleotide-to-numeric mapping C = -1, G = -j, A = 1, T = j is adopted. Three thresholds are studied: the mid threshold $T_m$, the proportional threshold $T_p$, and the cumulative distribution threshold $T_c$. The first spectral method, the FFT-based spectral method, uses the fast Fourier transform (FFT) to compute the period-3 value of a numerically represented nucleotide sequence with the procedures described in [2]. The second (third) spectral method, the FIR DF-based (IIR DF-based) spectral method, uses an FIR (IIR) [11] band-pass digital filter with its passband centred at a normalized frequency of $2\pi/3$ radians to compute the period-3 value of a numerically represented nucleotide sequence. The FFT requires a total of $L\log_2(L)$ complex-to-complex multiplications and $L\log_2(L)$ complex additions, or $4L\log_2(L)$ real multiplications and $2L\log_2(L)$ real additions. An even Nth-order linear-phase FIR digital filter requires a total of $L(N/2+1)$ complex-to-real multiplications and $L(N+1)$ complex additions, or $2L(N/2+1)$ real multiplications and $2L(N+1)$ real additions. An Nth-order [12] IIR digital filter requires a total of $L(2N+1)$ complex-to-real multiplications and $2LN$ complex additions, or $2L(2N+1)$ real multiplications and $4LN$ real additions.

VII. CONCLUSION

A current problem in the field of genomic signal processing is to detect the genes in the DNA sequence associated with rheumatic arthritis. In this paper we review some of the DSP techniques proposed so far, with a view to implementing them on an RA sequence to predict the genes responsible for the disease and then comparing the result with the normal sequence, in order to determine exactly where the mutation takes place in the RA sequence. The results obtained from the above techniques, incorporated in a software module, can be used for the characterization of the disease and will support physicians and analysts in diagnosis and in better understanding the development, treatment and prevention of the disease. From the review, the RWKT [5] and FFT [6] methods are the best for gene prediction, offering better resolution and faster computation, respectively, compared to the DFT method.

REFERENCES

[1] P. P. Vaidyanathan and B.-J. Yoon, "Digital filters for gene prediction applications," in Proc. Asilomar Conference on Signals, Systems, and Computers, pp. 306-310, Nov. 2002.
[2] P. P. Vaidyanathan and B.-J. Yoon, "Digital filters for gene prediction applications," Proc. Asilomar Conf. Signals Syst. Comput., pp. 306-310, 2002.
[3] Lun Huang, Mohammad Al Bataineh, G. E. Atkin, Siyun Wang, and Wei Zhang, "A Novel Gene Detection Method Based on Period-3 Property," 31st Annual International Conference of the IEEE, September 2-6, 2009.
[4] Vikrant Tomar and Dipesh Gandhi, "Digital Signal Processing for Gene Prediction," IEEE, pp. 435-439, 2008.
[5] M. Roy and S. Barman, "Spectral analysis of DNA sequence using Recursive Weiner Khinchine theorem - A comparative approach," 978-1-4577, 2011.
[6] Benjamin Y. M. Kwan, Jennifer Y. Y. Kwan, and Hon Keung Kwan, "Spectral Techniques for Classifying Short Exon and Intron Sequences," IEEE, pp. 219-226, 2012.
[7] M. K. Hota and V. K. Srinivasa, "DSP technique for gene and exon prediction taking complex indicator sequence," pp. 354-359, Pune, India, 03-05 January 2009.
[8] Mohammad Al Bataineh, Lun Huang, Ismaeel Muhamed, Nick Menhart, and Guillermo Atkin, "Gene Expression Analysis using Communications, Coding and Information Theory Based Models," BIOCOMP'09, July 13-16, 2009.
[9] M. H. Hayes, Statistical Digital Signal Processing and Modeling, John Wiley & Sons, Inc., USA, 1996.
[10] Khalid M. Aamir and Mohammad A. Maud, "Recursive Weiner Khinchine theorem," World Academy of Science, Engineering and Technology 2, 2007.
[11] I. W. Selesnick, M. Lang, and C. S. Burrus, "Constrained least square design of FIR filters without specified transition bands," IEEE Transactions on Signal Processing, vol. 44, no. 8, pp. 1879-1892, August 1996.
[12] B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent, and A. Nekrutenko, "Galaxy: A platform for interactive large scale genome analysis," Genome Research, vol. 15, issue 10, pp. 1451-1455, 15 October 2005.
[13] H. Herzel, E. N. Trifonov, O. Weiss, and I. Große, "Interpreting correlations in biosequences," Physica A, vol. 249, pp. 449-459, 1998.
[14] J. Tuqan and A. Rashdi, "A DSP approach for finding the codons bias in DNA sequence," IEEE journal on signal processing, vol. 2, no. 3, pp. 343-356, Jun. 2008.
