
(IJCNS) International Journal of Computer and Network Security, 35

Vol. 1, No. 2, November 2009

An Assimilated Approach for Statistical Genome Streak Assay between Matriclinous Datasets

Hassan Mathkour (1), Muneer Ahmad (2) and Hassan Mehmood Khan (3)
King Saud University, Department of Computer Science,
College of Computer and Information Sciences,
P.O. Box 51178, Riyadh 11543, Saudi Arabia
(1) binmathkour@yahoo.com
(2) muneerahmadmalik@yahoo.com
(3) hasmkh@gmail.com

Abstract: Genome streak assay for matriclinous datasets using ORF (Open Reading Frame) artistries has recently become a titillating area of inquest for bioinformatics inquisitors. There is a strong inquest focus on metaphorical assay between matriclinous behaviors and a multeity of peculiar species. Antagonistic to choate genome streak assay, scientists are now trying to contemplate a peculiarly ensconced assay to get a better sense of the pertinency among matriclinous datasets. This marvel will help to better understand species. We adduce an ORF statistical assay for matriclinous datasets of the species Chimaera Monstrosa and Poly Odontidae. To complete this assay, we use a mongrel approach that combines a generic contrivance for statistical assay with a specific approach designed for outperformance. At first exemplification, the matriclinous datasets are rarefied for better usage at the next level. These sets are then passed through ensconces of filters that perform DNA to protein translation. Statistical correlation is performed during this translation. This ensconced architecture helps in better understanding the tenor of affinity and aberrations in genomic streaks.

Keywords: Open Reading Frame, codon count, amino acid, preprocessing filter, Nucleotide

1. Introduction
Owing to the existing and continuously growing bulk of biological data coming from genome projects and experiments nowadays, protein structure prediction and its systematic translation need an efficient and effective way to streak, analyze and compare coded biological DNA streak information. The genome streak assay is directly related to streak correlation and alignment. Streak affinity is a way to predict the functional affinity among genes and has been used as a tool for functional prediction. Assay and correlation of DNA streaks and genes are useful for finding how these genes are organized and what their similarities and aberrations are [1]. These fundamental problems are NP hard [14, 17] and need optimal solutions that can be achieved by improving algorithms and computing architecture [2]. Little work has been done on mongrel statistical assay of genomic data against exponentially increasing problem size. Usage of computer aided artistries alone is not the solution. There is a need to work in computational molecular biological experiments by means of DNA streak assay. Finding unique streaks on the entire target genome is one of the most important problems in molecular biology [3].
The overall goal of this paper is to adduce an assimilated approach that performs metaphorical assay between species of the same class, revealing that peptide translation in both has a tenor of aberrations. This task is accomplished by using ORF with statistical assay. The method used for this purpose is a composite artistry that consists of a series of filters from the preprocessing level to the final assay.
The human genome project has built rich databases which have attracted inquest titillates from biologists and computer scientists to explore and mine these precious datasets. Computer aided applications can now reveal the hidden information in the complex helix DNA structure. They have also made it possible to perform fast and accurate assay. This has been made effective by the availability of cost effective and handy assay tools. Scientists have developed novel ideas, and implemented and resolved complex situations in computational biology whose direct feasible solution was not possible, yielding optimal solutions in some cases for streak assay, an NP hard problem [5, 9, 14, 17].
This paper is organized as follows. Section 2 highlights some related work. Section 3 describes the proposed artistry (elaborated in subsections). Section 4 contains fundamental concluding remarks for this metaphorical assay. Section 5 re-adduces an acknowledgement and Section 6 contains the references.

2. Literature review
Rajita Kumar [17] gives an approach for a distributed bioinformatics computing system. It was designed for disease detection, criminal forensics and protein assay. It is a combination of peculiar distributed algorithms that are used to search and identify a triplet repeat pattern in a DNA streak. It consists of a search algorithm that computes the number of occurrences of a given pattern in a matriclinous streak. The distributed sub-streak identification algorithm was designed to detect repeating patterns, with sequential and distributed implementations of algorithms relevant to peculiar triplet repeat search patterns and matriclinous streaks. The result of this system shows that as complexity

of the algorithm increases, the response time also increases. There is space to make this work better for more DNA streaks of various lengths.
Ken-ichi Kurata [9] adduces an artistry to find unique genome streaks in distributed environment databases. Kurata implemented the method on the European Data Grid and showed its results. The author worked on the unique streaks of E. coli O157 (12 genome). The genome is divided into smaller pieces that are processed individually. In an example quoted by the author, the total file size is 256 MB when it is hashed to 7. It is possible to divide the genomic files into at most 4^7 = 16384 pieces of 15 KB each. This method results in memory consumption and increases file size. This data grid method is not useful for parallelizing biologically important data.
Ao Li [16] proposes a genome streak learning method by simplifying a Bayesian network. The nodes in the Bayesian network are selected as features. A feature selection algorithm is used for structure learning. This algorithm is based on a matriclinous algorithm. The researcher used a dataset of 570 vertebrate streaks, including 2079 true donor sites. This approach is limited to donor site prediction and also confirms that the nucleotides closer to the donor site are the key elements in gene expression. There is a need to improve the structure learning method, the valuable features, the assay and so on.
DNA chips [7] have a main role in disease diagnosis, drug discovery and gene identification. Elaine Garbarine [7] used an approach to detect unique gene regions of particular species. This artistry, named the information theoretic method, exploits genome vocabularies to distinguish between pathogens. This approach is useful only for finding the gene streaks and the most distinguished similarities between two organisms. Oligo probes were used to distinguish between two genes. Experiments were conducted on data from the Sanger Institute. Currently 32 out of 92 bacterial pathogen sequencing projects are completed. The author selected a pair of genomes to test the algorithm. Results were shown for 12-mer and 25-mer oligo pathogen probe sets and confirmed that the Garbarine method is less likely to cross-mongrelize.
José Lousado [12] developed a software application for large-scale assay of codon-triplet associations to shed new light on this problem. This algorithm describes codon-triplet context biases, codon-triplet assay and the identification of alterations to the standard matriclinous code. The method adduces an evolutionary understanding of codons within open reading frames (ORF).
Gene-Split [8] is an application that shows codon triplet patterns in genomes and complete sets of ORFs. Generally this application gives the opportunity to study the characteristics of codon and amino acid triplets in any genome for the extraction of hidden patterns.
Hua Zheng et al. [13] adduce an artistry that assimilates the low pass filter and the wavelet de-noising method. Conventional artistries use the low pass filter with cheap hardware, resulting in degraded de-noising quality. By properly choosing the cut-off frequency and the wavelet de-noising frequency, some enhancement can be made to the signal to noise ratio, and the processed signals can meet the requirement of single base pair resolution in DNA sequencing. The vector of the targeting signal can be decomposed into an orthogonal matrix of wavelet functions. This is an iterative method with n levels, and the signal can be conventionally reconstructed by the inverse DWT.
Binwei Weng et al. [14] apply the wavelet transform to extract features from the original measurements. They partition the data into subsequent partitions by a hierarchical clustering method. The terahertz spectroscopy of peculiar DNA samples shows that the wavelet domain assay aids the clustering process. The authors clustered six DNA samples into two groups. The data was cleansed before processing, and the wavelet function utilized was the Haar wavelet. The signal trend is separated from the original records. The size of clusters may be calculated as the maximum distance between two points within a cluster. Another preprocessing step is balancing the data, which can achieve normalization of the data.
Bilu et al. [15] propose an alignment algorithm for the NP hard alignment problem of streaks. The authors outperform an alignment procedure by sufficing optimal alignment of predefined streak segments; they contemplate choate streaks rather than letters and estimate running time by restricting the search space of the dynamic programming algorithm. The authors take aid from the observation that encoding streaks used in NP hard problems are not necessarily depictions of protein and DNA streaks. Time expedition is calculated by taking advantage of the biological nature of streaks, antagonistic to traditional approaches that offer good computation leading to optimal alignment; more stress is given to the structure of the input streaks.
Tuqan and Rushdi [6] propose an approach for finding the complete periodicity in DNA streaks. The approach is spliced into three channels: firstly they explain the underlying contrivance for period 3 components, secondly they directly relate the identification of these components to finding nucleotide bias in the codon spectrum, and thirdly they completely characterize the DNA spectrum by a set of numerical streaks. The authors relate the signal processing problem to the genomic one through their proposed multirate DSP model. The model identifies the essential components involved in the codon bias while maintaining the dual nature of the problem. This marvel can further help in understanding the biological significance of codon bias. The period 3 component detection works for one kind of genes and may not be suitable for all matriclinous datasets.
Ma, Chan et al. [4] have shown the functionality of popular clustering algorithms for the assay of microarray data and concluded that the performance of these algorithms can be further increased. The authors also propose an evolutionary algorithm for microarray data assay in which there is no need to calculate the number of clusters in advance. The algorithm was tested with simulated and peculiar datasets. Noise and missing values are a big issue in this regard. The marvel is depicted by encoding the entire cluster grouping in a chromosome so that each gene encodes one cluster and each cluster contains the labels of the data used in it.

Crossover and mutation are performed suitably. The proposed algorithm has been observed to be slow compared to other prevailing algorithms.

3. THE ADDUCED TECHNIQUE
The titillate mainly lies in finding genome regions that are responsible for protein translation.

Figure 1. Ensconced Architecture

We have developed an ensconced architecture, shown in Figure 1 above, for this assay that runs from preprocessing of raw data to the final translation assay. For this purpose we have used matriclinous datasets of Chimaera Monstrosa (rabbit fish, NC_003136) and Poly Odontidae (paddle fish, NC_004419) [18].
At the preprocessing stage, raw datasets are passed through a filter that outputs a more rarefied form of the data, which can be further used for the actual metaphorical assay between species.

Figure 2. Dataset before filter application

It is evident from Figure 2 that the dataset contains characters other than pure nucleotide bases. These illegal characters are removed by application of a cleansing filter. At first exemplification it is worth noting that the assay should be made with original data values; any garbage left in the collection may lead to detrition of results.

Figure 3. Pre-processed Dataset

Figure 3 depicts that the preprocessed data contains only pure nucleotide bases without any anomalies. This rarefied data is later fed into the next ensconce for the actual assay.
First we display the ORF in a nucleotide streak and find the start and stop codons. By using the streak indices for start and stop, we can extract the sub-streaks and determine the codon distribution effectively. The most informative and titillating marvel is that the choate process is broken into steps, and each step fully performs the metaphorical assay relevant to DNA to protein translation.

A. SIZE OF DATASETS
1. Chimaera Monstrosa contains 18580 nucleotides of Adenine, Guanine, Thymine and Cytosine. The cumulative size of the data becomes 37160 bytes arranged in the form of a uni-vector.
2. Poly Odontidae contains 16512 nucleotides of Adenine, Guanine, Thymine and Cytosine. The cumulative size of the data becomes 33024 bytes arranged in the form of a uni-vector.

B. ORF IN NUCLEOTIDE STREAKS
It is worth noting that the metaphorical assay between both species is done at the translation level, so this level is vital in the assay. We split this ensconce into three more ensconces to get a better benefit from this ensconced assay. In each phase, our titillate lies in determining the accurate start and stop positions of the codons that perform the relative assay.

C. ORF PRIMARY FRAMES
At the ORF primary frame level, the ORF of Chimaera Monstrosa is as follows.

Figure 4. ORF of Chimaera Monstrosa in Frame 1

Figure 4 shows that the start position for the first frame is at 7156 and the second at 8761. These start positions re-adduce the major translation regions in the entire frames. These regions are pure depictions of tri-nucleotide molecules. This process leads towards the extraction of sub-chains that will later be shifted to peptide regions.

Figure 5. ORF of Poly Odontidae in Frame 1

Likewise we get the ORF in the second dataset, of Poly Odontidae, shown in Figure 5. Then, by entering the start positions, we can get the stop codons. The positions in Frame 1 of the second dataset are 10798 to 11395 and 14641 to

15559. It is clear that there is an evident aberration in codon regions for both frames of these species. The corresponding translated regions are so entirely peculiar that we cannot even guess at the idea of sub-channel affinity.

D. ORF SECONDARY FRAMES
At the second level, we intend to find the codon positions for Frame 2 of both species.

Figure 6. Frame 2 (Chimaera Monstrosa)

Figure 6 describes that the major ORFs start from 2753, 5426 and 10325. This re-adduces that there is a series of other regions occupied between the first and second frame that do not contribute to the peptide translation regions.

Figure 7. Frame 2 (Poly Odontidae)

Similarly, Frame 2 of Poly Odontidae, shown in Figure 7, describes its codon positions from 11120 to 11465 and 12464 to 12887. This shows a massive aberration in the datasets at this level. As we move with increasing nucleotide sub-streaks, we may get larger aberrations, but this case does not seem to be true for all matriclinous datasets. This is the reason that this marvel has been given importance in the selection of these particular sets.

E. ORF TERTIARY FRAMES
Discussing the last frame set in this streak, we first find the codon composition for these frames. For exemplification, consider Frame 3 of Chimaera Monstrosa.

Figure 8. Frame 3 (Chimaera Monstrosa)

Figure 8 shows that the major ORFs start from 4019, 11948 and 14328. This massive aberration in codon compositions also provides evidence that the first translated region lies at some four thousand while the second and third regions have jump gaps. This is the variation in translated regions in species.

Figure 9. Frame 3 (Poly Odontidae)

In Figure 9, the third frame for Poly Odontidae goes from 2796 to 3242, 6315 to 6722 and 12753 to 13217. Fig. 8 shows that the first 2 codon positions are relatively similar while the third position again describes a jump gap. Performing metaphorical assay at this level reveals that both matriclinous datasets show a kind of extremity in behavior which makes them relevant at certain codon compositions and peculiar at others.

F. CODON COUNT
The codon count describes the tri-nucleotide behavior of streaks. We need to find the tenor of pertinency in terms of strengths of nucleotide bases. For exemplification, we have selected Frame 1 from the codon composition of both species and compare the strength.

Figure 10. Codon count (Chimaera Monstrosa in Frame 1)

Figure 10 re-adduces the codon count for Chimaera Monstrosa. Our aim focuses on metaphorical assay of codon strength at this stage. For this purpose, we need to calculate the codon count for Poly Odontidae. Figure 11 shows the codon count of the first ORF of Poly Odontidae.

Figure 11. Codon count (Poly Odontidae in Frame 1)

G. STRENGTH OF AMINO ACID IN THE PROTEIN STREAK
At the last phase of this metaphorical assay, we need to find the relevant strength of peptide pairs in the protein streaks (obtained as a translation from DNA to protein).

Figure 14. Strength of amino acid (Chimaera Monstrosa)
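
The cleansing filter of Section 3 and the start/stop codon search of Sections B to E can be illustrated with a short sketch. This is not the authors' implementation: the function names `cleanse` and `find_orfs`, the ATG start codon and the stop codons TAA, TAG and TGA are assumptions taken from the standard genetic code (the vertebrate mitochondrial code differs; for example, TGA encodes tryptophan there). Positions are reported 1-based and inclusive, in the style of the ranges quoted for each frame.

```python
import re

START = "ATG"                    # assumed start codon
STOPS = {"TAA", "TAG", "TGA"}    # assumed stop codons (standard genetic code)


def cleanse(raw: str) -> str:
    """Cleansing filter: upper-case the input and keep only A, C, G and T."""
    return re.sub(r"[^ACGT]", "", raw.upper())


def find_orfs(seq: str, frame: int):
    """Scan one forward reading frame (1, 2 or 3) of a cleansed streak.

    Returns a list of (start, stop) positions, 1-based and inclusive,
    running from the first base of a start codon to the last base of
    the next in-frame stop codon.
    """
    orfs = []
    start = None
    for i in range(frame - 1, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START:
            start = i                       # open a new ORF
        elif start is not None and codon in STOPS:
            orfs.append((start + 1, i + 3))  # close it at the stop codon
            start = None
    return orfs


if __name__ == "__main__":
    # Hypothetical raw record with line numbers, spaces and lower case.
    raw = "1 atgaaatag ccatgggg\n61 ttttaa"
    seq = cleanse(raw)
    for frame in (1, 2, 3):
        print(frame, find_orfs(seq, frame))
```

An ORF that is still open when the streak ends is dropped in this sketch; whether to keep such partial frames is a design choice the paper does not discuss.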
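
The codon count of Section F and the amino acid strength of Section G both reduce to counting symbols, once over the codons of an ORF and once over its translated protein. The sketch below shows one way to do this under stated assumptions: the names `codon_count`, `translate` and `amino_acid_strength` are hypothetical, the input is an already cleansed ORF, and the translation table is the standard genetic code rather than the vertebrate mitochondrial code, so it is an illustration rather than a reproduction of the figures.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon (assumption).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codons(orf: str):
    """Split an ORF into consecutive codons, ignoring a trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def codon_count(orf: str) -> Counter:
    """Count how often each tri-nucleotide occurs in the ORF."""
    return Counter(codons(orf))


def translate(orf: str) -> str:
    """Translate an ORF to a protein streak, stopping at the first stop codon."""
    protein = []
    for codon in codons(orf):
        amino = CODON_TABLE[codon]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


def amino_acid_strength(protein: str) -> Counter:
    """Count how often each amino acid occurs in the protein streak."""
    return Counter(protein)


if __name__ == "__main__":
    orf = "ATGAAATTTAAATAG"  # hypothetical toy ORF
    print(codon_count(orf))
    print(translate(orf))
    print(amino_acid_strength(translate(orf)))
```

Comparing two species then amounts to comparing the two `Counter` objects, for example by subtracting counts codon by codon.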
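
The atomic compositions and molecular weights reported in this section are related by a weighted sum: each atom count is multiplied by its atomic mass and the products are added. The paper does not state how its weights were computed, so the sketch below is only a consistency check; the atomic masses and the names `ATOMIC_MASS` and `molecular_weight` are assumptions.

```python
# Assumed standard atomic masses in g/mol.
ATOMIC_MASS = {"C": 12.0107, "H": 1.00794, "N": 14.0067, "O": 15.9994, "S": 32.065}

# Atom counts as reported for the two translated proteins.
CHIMAERA = {"C": 1220, "H": 1886, "N": 298, "O": 341, "S": 12}
POLYODON = {"C": 940, "H": 1488, "N": 276, "O": 266, "S": 14}


def molecular_weight(composition: dict) -> float:
    """Sum of atom count times atomic mass over all atoms in the composition."""
    return sum(ATOMIC_MASS[atom] * count for atom, count in composition.items())


if __name__ == "__main__":
    for name, comp in (("Chimaera Monstrosa", CHIMAERA), ("Poly Odontidae", POLYODON)):
        print(f"{name}: {molecular_weight(comp):.4e}")
```

With these assumed masses the sums come to about 26568.6 and 21360.5, which agree with the reported 2.6569e+004 and 2.1360e+004 at the printed precision.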

Figure 14 shows the strength of amino acid in Chimaera Monstrosa. We now determine the atomic decomposition and molecular weight of the protein:
C: 1220 H: 1886 N: 298 O: 341 S: 12
Molecular weight is 2.6569e+004

The strength of amino acid in the protein streak of Poly Odontidae is depicted in Figure 15 below.

Figure 15. Strength of amino acid (Poly Odontidae)

Similarly, the atomic decomposition and molecular weight of the protein are:
C: 940 H: 1488 N: 276 O: 266 S: 14
Molecular weight is 2.1360e+004

Comparing the amino acid streaks of both species obtained from the primary codon translation, we see the atomic decomposition in Table 1

Table 1: (Amino Acid streak correlation)
Atom    Chimaera Monstrosa    Poly Odontidae
C       1220                  940
H       1886                  1488
N       298                   276
O       341                   266
S       12                    14

and the corresponding molecular weight in Table 2.

Table 2: (Molecular weight correlation)
Chimaera Monstrosa    Poly Odontidae
2.6569e+004           2.1360e+004

These results clearly describe the marvel that both species, despite being from the same class, differ greatly in their patterns of ORF.

4. Conclusion
An Open Reading Frame (ORF) begins with a start codon region. The subsequent region contains nucleotides whose length is a multiple of 3 and ends with a stop codon. This paper describes the phase wise metaphorical assay of two matriclinous datasets of the species Chimaera Monstrosa and Poly Odontidae. It re-adduces an assimilated approach composed of step by step processes to elaborate the results effectively. The process gives more stress to peptide translation using the Open Reading Frame concept and a data refining methodology. At the end we look for all outcomes that make this effort optimal by performing a sensitive assay at the DNA to protein conversion. Variations at each step were observed even though the data classes remained the same.

Acknowledgements
This work was partially supported by the Research Center, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.

References
[1] Ravi Gupta, Ankush Mittal, Kuldip Singh, Prateek Bajpai, Suraj and Prakash, "A Time Series Approach for Identification of Exons and Introns", 10th International Conference on Information Technology, 2007, page(s): 91-93
[2] Patrick Ma and C.C. Keith Chan, "Discovering Clusters in Gene Expression Data using Evolutionary Approach", 15th IEEE International Conference on Tools with Artificial Intelligence, 2003, page(s): 459-466
[3] Tejaswi Gowda, Samuel Leshner, Sarma Vrudhula and Seungchan Kim, "Threshold logic gene regulatory Networks", International Workshop on Genomic Signal Processing and Statistics, 2007, page(s): 1-4, ISBN: 978-1-4244-0998-3
[4] P.C.H. Ma, K.C.C. Chan, Xin Yao and D.K.Y. Chiu, "An evolutionary clustering algorithm for gene expression microarray data analysis", IEEE Transactions on Evolutionary Computation, 2006, Volume 10, Issue 3, page(s): 296-314
[5] Daniel Miranker, "Evolving Models of Biological Sequence Similarity", First International Workshop on 2008, page(s): 3-9
[6] J. Tuqan and A. Rushdi, "A DSP approach for finding the codon Bias in DNA Sequences", IEEE Journal of Selected Topics in Signal Processing, 2008, Volume 2, Issue 3, page(s): 343-356
[7] Elaine Garbarine and Gail Rosen, "An information theoretic method of microarray probe design for genome classification", 30th Annual International Conference of the Engineering in Medicine and Biology Society, 2008, page(s): 3779-3782
[8] P.H.-M Chang, Von-Wun Soo, Tai-Yu Chen, Wei-Shen Lai, Shiun-Cheng Su and Yu-Ling Huang, "Automating the determination of open reading frames in genomic sequences using the Web service techniques - a case study using SARS coronavirus", Fourth IEEE Symposium on Bioinformatics and Bioengineering, 2004, page(s): 451-458
[9] Ken-ichi Kurata, Vincent Breton and Hiroshi Nakamura, "A Method to Find Unique Sequences on Distributed Genomic Databases", IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003, 3rd Volume, page(s): 62-69
[10] Nasreddine Hireche, J.M. Pierre Langlois and Gabriela Nicolescu, "Survey of biological high performance computing: Algorithms, Implementations and Outlook Research", Canadian Conference on Electrical and Computer Engineering, 2006, page(s): 1926-1929
[11] Bartkowiak, "Nonlinear Dimensionality Reduction by Isomap and MLEdim as Applied to Amino-Acid Distribution in Yeast ORFs", Computer Information Systems and Industrial Management Application, 2008, page(s): 183-188
[12] José Lousado and R. Gabriela Moura, "Exploiting codon-triplets association for genome primary structure analysis", International Conference on Biocomputation, Bioinformatics, and Biomedical Technologies, 2008, page(s): 155-158
[13] Hua Zheng, Yan Shi, Jie Wang, Liqiang Wang and Zukang Lu, "The Analysis on the Signals Denoising and Single Base Pair Resolution of DNA Sequencing", International Symposium on Biophotonics, Nanophotonics and Metamaterials, 2006, page(s): 118-121
[14] Binwei Weng, Guangchi Xuan, J. Kolodzey and K.E. Barner, "Discriminating DNA Sequences from Terahertz Spectroscopy - A Wavelet Domain Analysis", Proceedings of the IEEE 32nd Annual Northeast Bioengineering Conference, 2006, page(s): 211-212
[15] Y. Bilu, P.K. Agarwal and R. Kolodny, "Faster Algorithms for Optimal Multiple Sequence Alignment Based on Pairwise Comparisons", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2006, Volume 3, Issue 4, page(s): 408-422
[16] Ao Li, Tao Wang, Yun Zhou, Ming-hui Wang and Huan-qing Feng, "An efficient structure learning method in gene prediction", Proceedings of the International Conference on Neural Networks and Signal Processing, 2003, Volume 1, page(s): 567-570
[17] Rajita Kumar, Arooshi Kumar and Sanjuli Agarwa, "A Distributed Bioinformatics Computing System for Analysis of DNA Sequences", IEEE proceedings of Southeast Conference, 2007, page(s): 358-363
[18] http://www.ncbi.nlm.nih.gov

Author Profile
Hassan Mathkour is a professor in the Department of Computer Science. He serves in the College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia, as the Vice Dean for Quality, Assurance and Development. He received his PhD from the University of Iowa, USA, in 1986. His research interests include Databases, Artificial Intelligence, Bio-informatics, NLP and Computational sciences.
