BIO-INFORMATICS
Abstract:
and information technology merge to form a single discipline. The ultimate goal of the
field is to enable the discovery of new biological insights as well as to create a global
perspective from which unifying principles in biology can be discerned. At the beginning
of the "genomic revolution," a bioinformatics concern was the creation and maintenance
sequences. Development of this type of database involved not only design issues, but the
development of complex interfaces whereby researchers could both access existing data
comprehensive picture of normal cellular activities so that researchers may study how
these activities are altered in different disease states. Therefore, the field of
bioinformatics has evolved such that the most pressing task now involves the analysis
and interpretation of various types of data, including nucleotide and amino acid
sequences, protein domains, and protein structures. The actual process of analyzing and
1 Email: chinna_chetan05@yahoo.com
Visit: www.geocities.com/chinna_chetan05/forfriends.html
The development and implementation of tools that enable efficient access to, and
BIO-INFORMATICS
The collection, storage, and analysis of biochemical and biological
genomics.
In the last few decades, advances in molecular biology and the equipment
available for research in this field have allowed the increasingly rapid sequencing of large
portions of the genomes of several species. In fact, to date, several bacterial genomes, as
well as those of some simple eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast)
have been sequenced in full. The Human Genome Project, designed to sequence all 24 of
GenBank and EMBL, have been growing at exponential rates. This deluge of information
has necessitated the careful storage, organization and indexing of sequence information.
Information science has been applied to biology to produce the field called
Bioinformatics.
The simplest tasks used in bioinformatics concern the creation and maintenance of
databases of biological information. Nucleic acid sequences (and the protein sequences
derived from them) comprise the majority of such databases. While the storage and or
developing an interface whereby researchers can both access existing information and
The most pressing tasks in bioinformatics involve the analysis of sequence information.
Computational Biology is the name given to this process, and it involves the following:
evolutionary relationships.
The process of evolution has produced DNA sequences that encode proteins with very
using algorithms that have been derived from our knowledge of physics, chemistry and
most importantly, from the analysis of other proteins with similar amino acid sequences.
The diagram below summarizes the process by which DNA sequences are used to model
protein structure. The processes involved in this transformation are detailed in the pages
that follow.
sequence
similarity search against expressed sequence tags (ESTs) to assess the suitability
strategy was constructed to examine the potential of this approach, and was
applied to test sets containing all human genomic sequences longer than 5 kb in
a gene. These ESTs provide immediate access to the corresponding cDNA clones
The apparent false-positive rate rose to 55% of ESTs among all sequences and
20% among benchmark sequences at the lowest stringency, indicating that many
alignments span multiple exons, and thus aid in the construction of gene
multiple cDNA libraries frequently cluster over genes, providing a starting point
for crude expression profiles. Clone IDs may be used to form EST pairs, and
genomic sequences.
Expressed Sequence Tags (ESTs) are short, single-pass cDNA
415 000 human ESTs represent a valuable, low priced, and easily accessible
biological reagent.
As many ESTs are derived from yet uncharacterized genes, dbEST is a prime
starting point for the identification of novel mRNAs. Conversely, other genes are
the WashU-Merck EST project. These ESTs were collected by querying dbEST
with the genomic sequences of 15 human genes. When we aligned the matching
ESTs to the genomic sequences, we found that in one gene, 73% of the ESTs
which derive from spliced or partially spliced transcripts either contain intron
sequences or are spliced at previously unreported sites; other genes have lower
percentages of such ESTs, and some have none. This finding suggests that ESTs
certain genes. In a related analysis of pairs of ESTs which are reported to derive
from a single gene, we found that as many as 26% of the pairs do not both align
with the sequence of the same gene. We suspect that some of these unusual ESTs
result from artifacts in EST generation, and caution researchers that they may find
greatly diverged but homologous family members. To assist in these efforts, the pattern-
hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein
database for other instances of the input pattern, and uses those found as seeds for the
construction of local alignments to the query sequence. The random distribution of PHI-
BLAST alignment scores is studied analytically and empirically. In many instances, the
that are not recognizably related using traditional single-pass database search methods.
The function of many RNAs depends crucially on their structure. Therefore, the design of
RNA molecules with specific structural properties has many potential applications, e.g. in
the context of investigating the function of biological RNAs, of creating new ribozymes,
solving the following RNA secondary structure design problem: given a secondary
structure, find an RNA sequence (if any) that is predicted to fold to that structure. Unlike
hard computationally. Our new algorithm, "RNA Secondary Structure Designer (RNA-
SSD)", is based on stochastic local search, a prominent general approach for solving hard
empirically modelled structures from the biological literature shows that RNA-SSD
substantially outperforms the best known algorithm for this problem, RNAinverse from
the Vienna RNA Package. In particular, the new algorithm is able to solve structures,
challenging task in itself, provides the scientist with a wealth of information, albeit of
limited use. The power of a database comes not from the collection of information, but from
its analysis. A sequence of DNA does not necessarily constitute a gene. It may constitute
Luckily, in agreement with evolutionary principles, scientific research to date has shown
that all genes share common elements. For many genetic elements, it has been possible to
construct consensus sequences, those sequences best representing the norm for a given
promoters, enhancers, polyadenylation signal sequences and protein binding sites. These
Genetic elements share common sequences, and it is this fact that allows mathematical
There are a myriad of steps following the location of a gene locus to the realization of a
A proper analysis to locate a genetic locus will usually have already pinpointed at least
the approximate sites of the transcriptional start and stop. Such an analysis is usually
sufficient in determining protein structure. It is the start and end codons for translation
The first codon in a messenger RNA sequence is almost always AUG. While this reduces
the number of candidate codons, the reading frame of the sequence must also be taken
into consideration.
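
The interplay between AUG candidates and reading frames can be illustrated with a minimal Python sketch. The function name `start_codons_by_frame` and the example sequence are assumptions made for illustration; the sketch only lists where AUG occurs and which of the three frames of the mRNA each occurrence falls in, and it does not decide which candidate is the true start.

```python
def start_codons_by_frame(mrna):
    """Group the 0-based positions of every AUG in an mRNA by reading frame (0, 1 or 2)."""
    mrna = mrna.upper()
    frames = {0: [], 1: [], 2: []}
    pos = mrna.find("AUG")
    while pos != -1:
        # The frame of a codon is its start position modulo 3.
        frames[pos % 3].append(pos)
        # Continue one base further so overlapping candidates are not missed.
        pos = mrna.find("AUG", pos + 1)
    return frames


if __name__ == "__main__":
    # Hypothetical sequence with one AUG in each frame.
    print(start_codons_by_frame("GGAUGCCAUGAAAUGUAA"))
```

Each frame that contains an AUG remains a candidate until other evidence, such as the absence of early stop codons, rules it out.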
There are six reading frames possible for a given DNA sequence, three on each strand,
that must be considered, unless further information is available. Since genes are usually
transcribed away from their promoters, the definitive location of this element can reduce
the number of possible frames to three. There is no strong consensus sequence surrounding
translation start codons across different species. Therefore, locating the appropriate start
codon involves finding a frame in which there are no apparent abrupt stop codons.
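
The six-frame search described here can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a gene-finding method: it assumes the standard stop codons TAA, TAG and TGA, treats the first ATG in a frame as the start, and ignores introns. The names `six_frame_orfs`, `orfs_in_frame` and `reverse_complement` are hypothetical helpers introduced for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}            # standard stop codons (assumed genetic code)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, offset):
    """Yield (start, end) for each ATG...stop stretch in one frame.

    `end` is exclusive and includes the stop codon.
    """
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield start, i + 3
            start = None


def six_frame_orfs(dna):
    """Collect open reading frames from all six frames, longest first."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset):
                # Coordinates refer to the strand that was scanned.
                results.append((strand, offset, start, end, seq[start:end]))
    return sorted(results, key=lambda r: r[3] - r[2], reverse=True)


if __name__ == "__main__":
    for orf in six_frame_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

If the promoter location is known, only the three frames of the transcribed strand need to be scanned, which corresponds to keeping a single value of `strand` in the loop.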
Knowledge of a protein's predicted molecular mass can assist this analysis. Incorrect
reading frames usually predict relatively short peptide sequences. Therefore, it might
seem deceptively simple to ascertain the correct frame. In bacteria, such is frequently the
In eukaryotes, the reading frame is discontinuous at the level of the DNA because of the
presence of introns. Unless one is working with a cDNA sequence in analysis, these
introns must be spliced out and the exons joined to give the sequence that actually codes
Intron/exon splice sites can be predicted on the basis of their common features. Most
introns begin with the nucleotides GT and end with the nucleotides AG. There is a branch
sequence near the downstream end of each intron involved in the splicing event. There is
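
The GT...AG rule alone can be turned into a rough candidate scan, sketched below in Python. This is only an illustration of that single feature: it ignores the branch sequence and any other signals, so it will report many spans that are not real introns. The function name `candidate_introns` and the length limits `min_len` and `max_len` are assumptions chosen for the example, not values from the text.

```python
def candidate_introns(dna, min_len=6, max_len=10_000):
    """Return (start, end) spans that begin with GT and end with AG.

    `end` is exclusive. The length limits are illustrative assumptions.
    """
    dna = dna.upper()
    # Possible 5' ends of an intron (donor sites).
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    # Possible 3' ends of an intron (acceptor sites), stored as exclusive ends.
    acceptors = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptors
        if min_len <= a - d <= max_len
    ]


if __name__ == "__main__":
    seq = "AAGTCCCCAGTT"  # hypothetical sequence
    for start, end in candidate_introns(seq):
        print(start, end, seq[start:end])
```

Every GT is paired with every downstream AG within the length limits, so the number of candidates grows quickly on real sequences; additional features are needed to narrow the list.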
With the completed primary amino acid sequence in hand, the challenge of modelling the
three-dimensional structure of the protein awaits. This process uses a wide range of data
and CPU-intensive computer analysis. Most often, one is only able to obtain a rough
model of the protein, and several conformations of the protein may exist that are equally
probable.
Importance of bio-informatics
characterized organisms.