CONTENTS:
I. ABSTRACT
II. INTRODUCTION
III. EARLIER MEASURES
IV. NEWER APPROACHES
   a) USING NEURAL NETWORKS
   b) USING CELLULAR AUTOMATA
   c) DISTRIBUTED DATABASES
V. TWO STAGE CLASSIFIER
VI. FUTURE WORKS
VII. CONCLUSION
VIII. REFERENCES

PROTEIN CODING REGION IDENTIFICATION
Abstract:- Genes carry the instructions for making proteins, encoded in a cell as specific sequences of nucleotides in DNA molecules. However, the regions of a gene that code for proteins may occupy only a small part of a sequence. The problem of gene recognition is to define an algorithm which takes a DNA sequence as input and produces as output a feature table describing the location and structure of the patterns making up any genes present in the sequence. Identifying the coding region is of vital importance in understanding these genes. Here I present certain pattern classifiers which are applied to the test data set of Fickett and Tung to identify protein coding regions. First, we consider some, though not all, of the earlier measures for protein coding region identification (PCRI) on this data set. Then we propose both the feed-forward connectivity and the backpropagation approaches of neural networks for test data identification. We find the CRM module to be highly accurate but space consuming, and we compare its results. Next we propose a cellular automata (CA) based pattern classifier to identify the coding region of a DNA sequence in both centralized and distributed databases. The CA is simple and efficient, and produces a more accurate classifier than has previously been obtained for a range of different sequence lengths. We also compare some of the older measures with the FMACA based pattern classifier. Finally, a genetic algorithm based two-stage pattern classifier is considered for PCRI. Although it was not implemented here, it has proved effective on various other databases, and suitable work can be done in this direction.


INTRODUCTION:
1. What is Protein Coding Region Identification?
It is a technique to identify the coding (exon) and non-coding (intron) regions in a DNA sequence. There are two types of regions: coding and non-coding. Our objective is to define an algorithm which takes a DNA sequence as input and produces as output a feature table describing the location and structure of the coding regions. A protein coding gene is any pattern in a DNA sequence that results in the generation of a protein product, through transcription into mRNA followed by translation.
Fig 1:- Protein Synthesis in a DNA

2. How to build an algorithm for PCRI?
A great many ideas have been suggested for recognition of the components of genes, but for a systematic approach to building a comprehensive recognition algorithm one thing is still missing, namely an objective comparison and evaluation of the competing recognition techniques. A natural overall approach to building gene recognition algorithms is to first construct component algorithms that recognize the major features of genes (statistical bias in exon sequence, the patterns at intron junctions, promoters, enhancers, etc.), and then to build a combined algorithm that recognizes when all these component patterns occur in a pattern consistent with that present in a gene.
Fig 2:- How to Develop an Algorithm

3. Objectives of PCRI:
1. It can be a good mechanism for genetic engineering.
2. Finding protein coding regions in these sequences biochemically is time consuming and expensive.
3. The possibility exists of using computational procedures to analyze the raw base sequence information and to predict coding regions (as well as other signal regions) by the use of duplicate fragments.

EARLIER MEASURES:
A. Measures for Protein Coding Region Identification
Definition:- A coding measure is a vector of measurements on a DNA sequence. According to Fickett and Tung there are 21 coding measures. A hyperplane equation was used.
I. Dicodon Measure:- The vector of 4096 dicodon frequencies, one for every possible dicodon, where the dicodon counts are accumulated at sites whose starting point is a multiple of 3.
II. Hexamer-1 & 2:- Similar to the dicodon measure, but offset by 1 and 2 respectively.
III. Open Reading Frame Measure:- Simply the length of the longest sequence of codons that does not contain a stop codon.
IV. Run Measure:- The Run feature is a vector of length 14, with one entry for each non-trivial subset S of {A, C, G, T}. Each entry gives the length of the longest contiguous subsequence whose entries are all from S.
V. Position asymmetry measure:- The extent to which a nucleotide is asymmetrically distributed over the three codon positions. If $f(b, i)$ is the frequency of base $b$ at codon position $i$ and $\mu(b) = \sum_i f(b, i)/3$, asymmetry is defined as $asymm(b) = \sum_i (f(b, i) - \mu(b))^2$.
VI. Codon usage Measure:- The codon usage feature is a vector of 64 codon frequencies.
VII. Diamino Acid Usage measure:- Again a vector, of the 440 diamino acid frequencies, which are obtained by translating the nucleotide sequence to an amino acid string.
Fig 3:- Training Set of Fickett and Tung Database
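Three of the measures above (codon usage, open reading frame and position asymmetry) can be sketched in a few lines. This is only an illustrative sketch: the function names are hypothetical, only reading frame 0 is considered, and the stop codons TAA, TAG and TGA of the standard genetic code are assumed.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def codons(seq):
    """Split seq into non-overlapping codons, reading frame 0 only."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_usage(seq):
    """Vector of 64 codon frequencies, in lexicographic codon order."""
    counts = Counter(codons(seq))
    total = sum(counts.values()) or 1
    return [counts["".join(c)] / total for c in product(BASES, repeat=3)]


def longest_orf(seq):
    """Length (in codons) of the longest run containing no stop codon."""
    best = run = 0
    for codon in codons(seq):
        run = 0 if codon in STOP_CODONS else run + 1
        best = max(best, run)
    return best


def position_asymmetry(seq):
    """asymm(b) = sum_i (f(b, i) - mu(b))**2 for each base b."""
    n = len(seq) // 3
    result = {}
    for b in BASES:
        # f[i] is the frequency of base b at codon position i.
        f = [sum(seq[3 * k + i] == b for k in range(n)) / n if n else 0.0
             for i in range(3)]
        mu = sum(f) / 3
        result[b] = sum((x - mu) ** 2 for x in f)
    return result
```

For instance, in "ACGACG" base A always sits at the first codon position, so its asymmetry is high, while a sequence of only A has asymmetry 0 for every base.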

B. Types of approaches
Direct approach: Look for stretches that can be interpreted as protein using the genetic code. Example: Open Reading Frame measure.
Statistical approaches: Use other knowledge about likely coding regions. Example: Dicodon usage.
Fig 4:- Open Reading Frame Measure
Calculation Windows: Many sequence analyses require calculating some statistic over a long sequence, looking for regions where the statistic is unusually high or low. To do this, we define a window size to be the width of the region over which each calculation is to be done.
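The window idea can be illustrated as follows. This is a minimal sketch: the helper names are hypothetical, and the GC fraction is used only as a stand-in statistic that is easy to check by hand; any coding measure could be passed in instead.

```python
def windowed_statistic(seq, window, step, statistic):
    """Apply `statistic` to each window of width `window`, moving by `step`.

    Returns a list of (start position, statistic value) pairs.
    """
    return [
        (start, statistic(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def gc_fraction(fragment):
    """Stand-in statistic: fraction of G or C bases in the fragment."""
    return sum(base in "GC" for base in fragment) / len(fragment)


profile = windowed_statistic("ATATGCGCGCATAT", window=4, step=2,
                             statistic=gc_fraction)
```

Here the profile peaks at the windows starting at positions 4 and 6, which is the kind of "unusually high" region a window scan is meant to expose.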

NEWER APPROACHES:
1. Using neural networks:
I. Definition:- Neural networks were originally conceived as computational models of the way in which the human brain works. Like the human brain, they consist of many units connected to each other by variable strength links.
II. Architecture:- The simplest neural networks have only two layers, one 'input' layer and one 'output' layer, and are known as 'perceptrons'. But we can have multiple layers also.
III. Processing:- The unit (also known as a neuron or node) is the main processing element of the neural network. It receives a set of input signals and, combined with an internal function, converts the input signals to an output signal.
IV. Learning:- The task for the network is to relate the variables it receives in the input layer to some desired behaviour at the output layer by repeatedly presenting the examples to the network in a process known as training. This training is known as supervised learning. Once trained, the neural network should be capable of predicting an output given a previously unseen set of inputs.
V. Unsupervised learning:- Unsupervised learning does not require a desired response (and therefore output data) to train, and is concerned with discovering a common set of features across the input data in one class or pattern.
VI. Weights:- A weight is a number associated with a link between nodes, which decreases or increases with each learning step. So what we get is a measure for learning.
VII. Feed-Forward Approach:- The name of this network is due to the direction of flow of the data.
VIII. Back Propagation:- The errors incurred during learning are propagated back through the network.

PERCEPTRON LEARNING RULE
1. Initialize the network weights and unit thresholds to some small random values.
2. Present the input data (i1, i2, i3, ..., in) and the desired output (o) data to the network.
3. Calculate the output from the network using the expression: f(s)
4. Adapt the weights:
   (a) if correct, w(t + 1) = w(t);
   (b) if output is 0 and should be 1, w(t + 1) = w(t) + i(t);
   (c) if output is 1 and should be 0, w(t + 1) = w(t) - i(t).
Where w(t) is the weight value at time t, w(t + 1) is the weight value at time t + 1 (i.e. after updating) and f(s) is the function used to compute the output in the network.
Fig 5:- Perceptron Learning Rule
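The perceptron learning rule can be sketched as below. This is an illustrative sketch, not the report's implementation: f(s) is assumed to be a hard threshold at 0, the unit threshold is folded into a bias weight on a constant input of 1, and the weights start at zero rather than at small random values so that the run is reproducible.

```python
def step(s):
    """Assumed output function f(s): 1 if s > 0, else 0."""
    return 1 if s > 0 else 0


def predict(weights, inputs):
    """Weighted sum of the inputs (plus bias) passed through f(s)."""
    x = [1.0] + list(inputs)  # constant input 1.0 carries the threshold
    return step(sum(w * v for w, v in zip(weights, x)))


def train_perceptron(samples, epochs=20):
    """samples: list of (inputs, desired output in {0, 1})."""
    weights = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        for inputs, desired in samples:
            if predict(weights, inputs) == desired:
                continue                      # (a) correct: leave weights alone
            sign = 1 if desired == 1 else -1  # (b) add input, (c) subtract input
            x = [1.0] + list(inputs)
            weights = [w + sign * v for w, v in zip(weights, x)]
    return weights


AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
and_weights = train_perceptron(AND_SAMPLES)
```

The logical AND is linearly separable, so the rule settles on weights that classify all four patterns correctly; a two-layer perceptron cannot do the same for XOR, which is one motivation for the multi-layer networks that follow.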

Backpropagation:-
1. Initialize the thresholds and weights.
2. Present the input data (i1, i2, i3, ..., in) and the desired output (o) data to the network.
3. Calculate the output from the network using the expression f(s) for each layer in the network. The output from the final layer is the vector of output values.
4. Adapt the weights, starting from the output layer and moving backwards, using the equation $w_{ij}(t + 1) = w_{ij}(t) + \eta \, \delta_{pj} \, o_{pj}$, where $\eta$ is the gain term and $\delta_{pj}$ is the error term for pattern $p$ at node $j$. Here d is the desired output, o is the actual output and c is a constant used in the sigmoid function. The sum shown for hidden layers is over the k nodes in the layer above the current layer, for which the $\delta_{pj}$ will already have been computed.
Fig 6:- BackPropagation
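A small worked sketch of these four steps is given below for a network with one hidden layer. Several things here are assumptions made for illustration: the sigmoid constant c is taken as 1, the error terms use the usual sigmoid-derivative form o(1 - o), the gain term is 0.5, and the training task (logical OR with three hidden nodes) is chosen only because it is easy to check.

```python
import math
import random


def sigmoid(s):
    """Sigmoid output function with constant c = 1 (assumption)."""
    return 1.0 / (1.0 + math.exp(-s))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the single network output."""
    xb = [1.0] + list(x)  # constant 1.0 carries each node's threshold
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = [1.0] + hidden
    return hidden, sigmoid(sum(w * v for w, v in zip(w_out, hb)))


def train_step(x, d, w_hidden, w_out, eta=0.5):
    """One weight update for one pattern; d is the desired output."""
    hidden, o = forward(x, w_hidden, w_out)
    delta_out = o * (1.0 - o) * (d - o)  # error term at the output node
    # Hidden error terms reuse the output error term computed just above.
    delta_hidden = [h * (1.0 - h) * delta_out * w_out[j + 1]
                    for j, h in enumerate(hidden)]
    hb = [1.0] + hidden
    for j in range(len(w_out)):
        w_out[j] += eta * delta_out * hb[j]
    xb = [1.0] + list(x)
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[j] * xb[i]


def total_error(samples, w_hidden, w_out):
    """Sum of squared differences between desired and actual outputs."""
    return sum((d - forward(x, w_hidden, w_out)[1]) ** 2 for x, d in samples)


rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(4)]
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
error_before = total_error(OR_SAMPLES, w_hidden, w_out)
for _ in range(2000):
    for x, d in OR_SAMPLES:
        train_step(x, d, w_hidden, w_out)
error_after = total_error(OR_SAMPLES, w_hidden, w_out)
```

The hidden-layer error terms are computed before the output weights are changed, which matches the "moving backwards" order of step 4.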

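Feed Forward Connectivity:-
I. We are given a collection of examples labeled true or false: for exons the label is true, for introns it is false.
II. We can employ a supervised neural network by using Bayes theorem:
$p(T \mid \{x_i\}) = \dfrac{p(\{x_i\} \mid T)\, p(T)}{p(\{x_i\} \mid T)\, p(T) + p(\{x_i\} \mid F)\, p(F)}$
Here $p(T)$ and $p(F)$ are the a priori probabilities of the true class and the false class.
III. Assuming that the codons at individual positions are independent of each other,
$p(\{x_i\} \mid T) = \prod_{i,s} p_{si}^{x_{si}}$
where $p_{si}$ represents the probability of codon $s$ at position $i$ in the true class, and $x_{si}$ is 1 if codon $s$ occurs at position $i$ and 0 otherwise. If the probabilities do not depend on position, so that $p_{si} = p_s$ in the true class and $q_s$ in the false class, and $N_s$ is the number of occurrences of codon $s$, this gives
$p(T \mid \{x_i\}) = \dfrac{1}{1 + \exp\left(\sum_s T_s N_s + \theta\right)}$, with $T_s = \log(q_s / p_s)$ and $\theta = \log(p(F) / p(T))$.
Fig 7:- Feed Forward Connectivity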
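The link between Bayes theorem and the exponential (sigmoid) form can be checked numerically. The sketch below assumes position-independent codon probabilities p (true class) and q (false class); the two-codon probability tables and counts are invented purely for illustration.

```python
import math


def posterior_bayes(counts, p, q, prior_true):
    """p(T | x) computed directly from Bayes theorem."""
    like_true = math.prod(p[s] ** n for s, n in counts.items())
    like_false = math.prod(q[s] ** n for s, n in counts.items())
    numerator = like_true * prior_true
    return numerator / (numerator + like_false * (1.0 - prior_true))


def posterior_sigmoid(counts, p, q, prior_true):
    """The same posterior written as 1 / (1 + exp(sum_s T_s N_s + theta))."""
    theta = math.log((1.0 - prior_true) / prior_true)
    weighted = sum(n * math.log(q[s] / p[s]) for s, n in counts.items())
    return 1.0 / (1.0 + math.exp(weighted + theta))


# Hypothetical toy values, not taken from any real data set.
p = {"GAA": 0.7, "TTT": 0.3}   # codon probabilities in the true (exon) class
q = {"GAA": 0.4, "TTT": 0.6}   # codon probabilities in the false (intron) class
counts = {"GAA": 3, "TTT": 1}  # N_s: occurrences of each codon in the window

direct = posterior_bayes(counts, p, q, prior_true=0.5)
via_sigmoid = posterior_sigmoid(counts, p, q, prior_true=0.5)
```

Both routes give the same number, which is why a single sigmoid unit whose inputs are the codon counts N_s can represent this posterior exactly.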

3. CRM Based Approach:-
1. CRM means Coding Recognition Module.
2. The Coding Module is used to find exons, or parts of exons, that encode protein.
3. To determine the likelihood that a given sequence position is coding, a program written in the C language calculates the sensor values for a 99 base sequence, and the sensor signals between 0.0 and 1.0 are then evaluated.
4. A back propagation neural net was created to integrate the output from the 7 sensor algorithms.
5. This network consists of 7 input nodes, 2 hidden layers of 14 and 5 nodes, and an output node.
6. In the following table we will see that it is a very accurate algorithm.
Fig 10:- Using a CRM Module
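The shape of such a network (7 inputs, hidden layers of 14 and 5 nodes, one output) can be sketched as a plain forward pass. This is only a structural illustration: the weights below are random and untrained, the sigmoid activation is an assumption, and the sensor values are made up, so the output carries no biological meaning.

```python
import math
import random

LAYER_SIZES = [7, 14, 5, 1]  # sensors, hidden layer 1, hidden layer 2, output


def make_network(rng):
    """One weight per input plus a bias weight for every node (untrained)."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])
    ]


def forward(network, sensors):
    """Propagate 7 sensor values in [0, 1] through the layers."""
    activations = list(sensors)
    for layer in network:
        inputs = [1.0] + activations  # constant 1.0 feeds the bias weight
        activations = [
            1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(node, inputs))))
            for node in layer
        ]
    return activations[0]


network = make_network(random.Random(0))
score = forward(network, [0.2, 0.9, 0.5, 0.7, 0.1, 0.6, 0.8])
```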

Fig 11:- Table showing a database comparison for CRM
NOTE:- We see that our classifier gives almost 99% accuracy.

2. Using Cellular Automata:-
A Cellular Automaton (CA) can be viewed as an autonomous finite state machine (FSM) consisting of a number of cells. In a 3-neighborhood dependency, the next state $q_i(t+1)$ of a cell is assumed to be dependent only on itself and on its two neighbours (left and right), and is denoted as $q_i(t+1) = f(q_{i-1}(t), q_i(t), q_{i+1}(t))$.

Fuzzy Cellular Automata:-
An elementary FCA is a linear array of cells which evolves in time. Each cell of the array assumes a state $q_i$, a rational value in the interval [0, 1] (fuzzy states), and changes its state according to a local evolution function on its own state and the states of its two neighbours. The global evolution results from synchronous application of the local rule to all cells of the array.

FMACA:-
• A FMACA (fuzzy multiple attractor CA) is a special class of FCA that can efficiently model an associative memory to perform pattern recognition/classification tasks.
• Its state transition behaviour consists of multiple components. Each component, as noted in Fig 12, is an inverted tree, each rooted on a cyclic state.
• A cycle in a component is referred to as an attractor.
• The transient states finally settle down to one of the attractor basins after passing through a lot of intermediate states.
Fig 12:- State diagram of 3 cell 3 state FMACA
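The synchronous 3-neighbourhood update can be sketched generically, with the local rule f passed in as a function. Two assumptions are made for illustration: cells beyond the ends of the array are treated as 0 (a null boundary), and the fuzzy rule shown is a simple bounded sum of the two neighbours, not one of the specific FMACA rules.

```python
from fractions import Fraction


def ca_step(cells, rule):
    """Apply q_i(t+1) = rule(q_{i-1}, q_i, q_{i+1}) to all cells at once."""
    padded = [0] + list(cells) + [0]  # null boundary (assumption)
    return [rule(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def boolean_rule_90(left, centre, right):
    """Boolean rule 90: XOR of the two neighbours."""
    return left ^ right


def fuzzy_bounded_sum(left, centre, right):
    """Illustrative fuzzy rule: min(1, left + right) on rational states."""
    return min(Fraction(1), left + right)


boolean_next = ca_step([0, 1, 0, 0], boolean_rule_90)
fuzzy_next = ca_step([Fraction(1, 2), Fraction(0), Fraction(3, 4)],
                     fuzzy_bounded_sum)
```

Using `Fraction` keeps every fuzzy state an exact rational value in [0, 1], in line with the definition of the FCA states above.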

FMACA Based K-means Clustering Algorithm:-
We show the algorithmic approach to building a tree classifier.
Algorithm 1:-
Step 1. A FMACA with k attractor basins is generated.
Step 2. The elements of the training set S get distributed into k attractor basins/nodes.
Step 3. Let S' be the set of elements in an attractor basin. If S' belongs to any one particular class (i.e. all patterns of S' are covered by an attractor basin/node belonging to only one particular class), then mark it as a leaf node and label that attractor basin/node as that class.
Step 4. Otherwise, if S' belongs to more than one class, then repeat steps 1 to 4 for S'.
Step 5. Stop.
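The recursive structure of Algorithm 1 can be sketched as follows. Generating an actual FMACA is outside the scope of this sketch, so a hypothetical `partition` function stands in for "distribute the patterns into k attractor basins"; the depth limit and the majority-label fallback are also additions for safety, not part of the algorithm as stated.

```python
from collections import Counter


def build_tree(samples, partition, depth=0, max_depth=10):
    """samples: non-empty list of (pattern, label) pairs.

    partition: stand-in for an FMACA; maps samples to a list of groups.
    """
    labels = {label for _, label in samples}
    if len(labels) == 1:  # basin covers one class: mark as leaf
        return {"leaf": True, "label": next(iter(labels))}
    groups = [g for g in partition(samples) if g]
    if depth >= max_depth or len(groups) < 2:  # safety fallback (addition)
        majority = Counter(l for _, l in samples).most_common(1)[0][0]
        return {"leaf": True, "label": majority}
    # Mixed basin: repeat the procedure on each sub-basin.
    return {"leaf": False,
            "children": [build_tree(g, partition, depth + 1, max_depth)
                         for g in groups]}


def split_on_mean(samples):
    """Hypothetical stand-in partitioner with k = 2 'basins'."""
    mean = sum(p[0] for p, _ in samples) / len(samples)
    low = [s for s in samples if s[0][0] <= mean]
    high = [s for s in samples if s[0][0] > mean]
    return [low, high]


training_set = [((0.1,), "non-coding"), ((0.2,), "non-coding"),
                ((0.8,), "coding"), ((0.9,), "coding")]
tree = build_tree(training_set, split_on_mean)
```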

Implementation:-
An important issue in applying FMACA to DNA sequence classification is how to encode DNA sequences as the input of the FMACA based tree structured classifier. From a one dimensional point of view, a DNA sequence contains characters from a 4-letter alphabet S = {A, C, G, T}. To use a FMACA based classifier we need a limited window, e.g. 100 base pairs, which can be used for standard experimental observations; the classifier computes the features of that window alone. The goal is to identify whether the window is entirely coding or entirely non-coding. We have followed this procedure in our study, and we have used the benchmark human DNA database developed by Fickett and Tung.

Errors and other Benchmarks:-
The objective function (which is to be minimized) is the sum of squared distances between each DNA sequence and its assigned cluster centre:
$SS_{distances} = \sum_x [x - C(x)]^2$
where $C(x)$ is the mean of the cluster that DNA position $x$ is assigned to. Minimizing the $SS_{distances}$ is equivalent to minimizing the Mean Squared Error (MSE). It is wrong, however, to state that a lesser MSE means better classification. Thus good implementation must be the focus.

The basic criteria for FMACA tree design are:
1. Low error rate, leading to maximum classification accuracy.
2. Less number of nodes, leading to low memory overhead.
3. Small tree height, leading to low retrieval time.
Time complexity is O(n) by using the Dependency Vector scheme.
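The objective function can be written out directly. In this sketch each sequence is assumed to be already encoded as a numeric feature vector, and the function names and the four sample points are hypothetical.

```python
def ss_distances(points, assignment):
    """Sum of squared distances from each point to its cluster mean C(x)."""
    clusters = {}
    for point, cluster in zip(points, assignment):
        clusters.setdefault(cluster, []).append(point)
    # The centre of a cluster is the component-wise mean of its members.
    centres = {
        c: [sum(column) / len(members) for column in zip(*members)]
        for c, members in clusters.items()
    }
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, centres[cluster]))
        for point, cluster in zip(points, assignment)
    )


def mean_squared_error(points, assignment):
    """MSE is SSdistances divided by the fixed number of points."""
    return ss_distances(points, assignment) / len(points)


points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
good = [0, 0, 1, 1]  # left pair together, right pair together
bad = [0, 1, 0, 1]   # clusters cut across the wide gap
```

Because the MSE is just the sum of squares divided by a constant, the assignment that minimizes one also minimizes the other.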

Here we have compared it with some of the other measures and found that our FMACA based classifier is indeed a very accurate classifier. It also has very low memory overhead and retrieval time. We have shown that in the graphs below.

Fig 12:- FMACA INTERFACE

DISTRIBUTED DATABASES
• Distributed classifiers work in an ensemble learning approach.
• Multiple classifiers act as base classifiers at different data sources.
• Results derived from the base classifiers are next combined at a central location to enhance the classification accuracy of predictive models.

ISSUES IN DISTRIBUTED DATABASES
• The aggregation of the base classifiers.
• The communication cost from the different local data sources to the central location where the final classification is performed.
• The base classifier at each local site must have high classification accuracy.
• It must be scalable to handle disk resident datasets.
• It should have high throughput with low storage requirements.
• The volume of data transferred from each local site to the central site should be minimum.
• Aggregation of the base classifiers must induce globally meaningful knowledge.

We have shown here the algorithmic approach for distributed database PCRI.

Experimental observation
1. Coding Measure: Typically, a coding measure is a vector of measurements on a DNA subsequence. There are 21 coding measures, which Fickett and Tung have surveyed and compared.
2. The benchmark human data include three different datasets. For the first dataset, non-overlapping human DNA sequences of length 54 have been extracted from all human sequences, with shorter pieces at the ends discarded. Every sequence is labeled according to whether it is entirely coding, entirely non-coding, or mixed, and the mixed sequences are discarded. The dataset also includes the reverse complement of every sequence. This means that one-half of the data is guaranteed to be from the non-sense strand of the DNA, which makes the problem of identifying coding regions somewhat harder.
3. For Centralized Datasets: A training set is used exclusively to construct a FMACA tree structured pattern classifier. The tree then classifies the test set.
4. For Distributed Datasets: The total training set is equally distributed to each local site, and then the same scheme is employed at each local site to synthesize the FMACA. The FMACA based base classifiers (BCi's) are next collected at the centralized location, where predictive accuracy is tested with the test set. The predictive accuracy of the proposed scheme is evaluated through the majority voting scheme.

Comparison: All the attributes in a dataset are normalized to facilitate FMACA learning. Suppose the possible value range of an attribute $attr_i$ is $(attrval_{i,min}, attrval_{i,max})$, and the real value that class element $j$ takes at $attr_i$ is $attrval_{ij}$; then the normalized value of $attrval_{ij}$ is
$\dfrac{attrval_{ij} - attrval_{i,min}}{attrval_{i,max} - attrval_{i,min}}$
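Three of the data handling steps described above can be sketched briefly: taking the reverse complement of a sequence, scaling an attribute into [0, 1] (min-max scaling is assumed here), and combining the outputs of the base classifiers by majority vote. The function names are hypothetical, and ties in the vote are broken in favour of the label seen first, which is a choice made only for this sketch.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def normalize(value, minimum, maximum):
    """Min-max scale a raw attribute value into the interval [0, 1]."""
    return (value - minimum) / (maximum - minimum)


def majority_vote(predictions):
    """Label predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Predictions of three hypothetical base classifiers for one test sequence.
combined = majority_vote(["coding", "non-coding", "coding"])
```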


As seen below, the GA employed significantly decreases the memory overhead.

FUTURE WORKS
1. Use of GA in FMACA classifiers.
2. Use of an autoregressive modelling approach for better protein coding region identification.
3. Use of NN tree classification for MLP (multi layer perceptron).
4. Use of SVM (Support Vector Machines).

ADVANTAGES
1. Can be used successfully to locate a cancerous protein coding region.
2. Can be used to genetically build protein which can ensure higher immunity.
3. Can be used to distinguish between different kinds of genes.
Fig 13:- Cancer Treatment by Knowing the Protein Code

CONCLUSION:-
We hereby conclude that by using efficient PCRI we can successfully make corrections to faulty genes. Thereby a lot of new opportunities emerge. It is not far when our dream of becoming immortal and space travellers will come true.

REFERENCES:-
1. J. W. Fickett and C. S. Tung, "Assessment of protein coding measures," Nucleic Acids Res., vol. 20, pp. 6441-6450, 1992.
2. J. W. Fickett, "Recognition of protein coding regions in DNA sequences," Nucleic Acids Res., vol. 10, pp. 5303-5318, 1982.
3. E. C. Uberbacher and R. J. Mural, "Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach," Proc. Natl. Acad. Sci. USA, vol. 88, pp. 11261-11265, 1991.
4. R. Farber, A. Lapedes, and K. Sirotkin, "Determination of eucaryotic protein coding regions using neural networks and information theory," J. Mol. Biol., vol. 226, pp. 471-479, 1992.
5. P. Maji and P. Pal Chaudhuri, "Fuzzy Cellular Automata For Modeling Pattern Classifier," IEICE Transactions on Information and Systems, vol. E88-D, no. 4, pp. 691-702, April 2005.
6. S. Wolfram, A New Kind of Science, Wolfram Media, Champaign, IL, USA, 2002.
7. G. Cattaneo, P. Flocchini, G. Mauri, and N. Santoro (1993), "Fuzzy cellular automata and their chaotic behavior," in Proc. International Symposium on Nonlinear Theory and its Applications, Hawaii.
8. P. Flocchini, F. Geurts, A. Mingarelli, and N. Santoro (2000), "Convergence and Aperiodicity in Fuzzy Cellular Automata: Revisiting Rule 90," Physica D.
9. Angelo B. Mingarelli (2007), "A study of fuzzy and many-valued logics in cellular automata."
10. Heather Betel and Paola Flocchini, "On The Relationship Between Boolean and Fuzzy Cellular Automata."
11. Pradipta Maji and P. Pal Chaudhuri, "Fuzzy Cellular Automata Based Associative Memory for Pattern Recognition."
12. Pradipta Maji and Chandra Das, "Knowledge Discovery in Distributed Biological Datasets Using Fuzzy Cellular Automata."
13. Pradipta Maji, Chandrama Shaw, Niloy Ganguly, Biplab K. Sikdar and P. Pal Chaudhuri, "Theory and Application of Cellular Automata for Pattern Classification."
14. Pradipta Maji and P. Pal Chaudhuri, "Basins of Attraction of Cellular Automata Based Associative Memory and its Rule Space."
15. Pradipta Maji, Samik Barua, Sumanta Das, and P. Pal Chaudhuri, "Cellular Automata in Protein Coding Region Identification," Proceedings of ICISIP-2005.
