Visit: www.geocities.com/chinna_chetan05/forfriends.html

BIO-INFORMATICS

Abstract:

Bio-informatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. At the beginning of the "genomic revolution," a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues, but also the development of complex interfaces whereby researchers could both access existing data and submit new or revised data.

Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:

• The development and implementation of tools that enable efficient access to, and use and management of, various types of information; and
• The development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence and to predict protein structure and/or function.

Email: chinna_chetan05@yahoo.com

BIO-INFORMATICS

The collection, storage, and analysis of biochemical and biological information using computers, especially as applied in molecular genetics and genomics.

In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of the genomes of several species. In fact, to date, several bacterial genomes, as well as those of some simple eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast), have been sequenced in full. The Human Genome Project, designed to sequence all 24 distinct human chromosomes (22 autosomes plus X and Y), is also progressing. Popular sequence databases, such as GenBank and EMBL, have been growing at exponential rates. This deluge of information has necessitated the careful storage, organization and indexing of sequence information. Information science has been applied to biology to produce the field called Bioinformatics.

The simplest tasks used in bioinformatics concern the creation and maintenance of databases of biological information. Nucleic acid sequences (and the protein sequences derived from them) comprise the majority of such databases. While the storage and organization of millions of nucleotides is far from trivial, designing a database and developing an interface whereby researchers can both access existing information and submit new entries is only the beginning.

The most pressing tasks in bioinformatics involve the analysis of sequence information. Computational Biology is the name given to this process, and it involves the following:

• Finding the genes in the DNA sequences of various organisms.
• Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences.
• Clustering protein sequences into families of related sequences and the development of protein models.
• Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.
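As an illustration of the alignment task in the list above, the following is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch method). The function name `global_align` and the match, mismatch and gap scores are assumptions chosen for the example and do not come from this paper; practical protein alignment normally scores residue pairs with a substitution matrix such as BLOSUM62 rather than a flat match/mismatch value.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = global_align("ACGT", "AGT")
    print(s)
    print(x)
    print(y)
```

Pairwise scores of this kind are one possible input for clustering sequences into families or for building a distance-based phylogenetic tree.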

The process of evolution has produced DNA sequences that encode proteins with very specific functions. It is possible to predict the three-dimensional structure of a protein using algorithms that have been derived from our knowledge of physics, chemistry and, most importantly, from the analysis of other proteins with similar amino acid sequences. The diagram below summarizes the process by which DNA sequences are used to model protein structure. The processes involved in this transformation are detailed in the pages that follow.

Analysis of EST-driven gene annotation in human genomic sequence

A systematic analysis of gene identification in genomic sequence is performed by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique.

A comparison of expressed sequence tags (ESTs) to human genomic sequences.

Expressed sequence tags (ESTs) are short, single-pass cDNA sequences generated from randomly selected library clones. The approximately 415 000 human ESTs represent a valuable, low-priced, and easily accessible biological reagent.

As many ESTs are derived from yet uncharacterized genes, dbEST is a prime starting point for the identification of novel mRNAs. Conversely, other genes are represented by hundreds of ESTs, a redundancy which may provide data about rare mRNA isoforms. Here we present an analysis of >1000 ESTs generated by the WashU-Merck EST project. These ESTs were collected by querying dbEST with the genomic sequences of 15 human genes. When we aligned the matching ESTs to the genomic sequences, we found that in one gene, 73% of the ESTs which derive from spliced or partially spliced transcripts either contain intron sequences or are spliced at previously unreported sites; other genes have lower percentages of such ESTs, and some have none. This finding suggests that ESTs could provide researchers with novel information about alternative splicing in certain genes. In a related analysis of pairs of ESTs which are reported to derive from a single gene, we found that in as many as 26% of the pairs the two ESTs do not both align with the sequence of the same gene. We suspect that some of these unusual ESTs result from artifacts in EST generation, and caution researchers that they may find such clones while analyzing sequences in dbEST.

Protein sequence similarity searches using patterns as seeds.

Protein families often are characterized by conserved sequence patterns or motifs. A researcher frequently wishes to evaluate the significance of a specific pattern within a protein, or to exploit knowledge of known motifs to aid the recognition of greatly diverged but homologous family members. To assist in these efforts, the pattern-hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein sequence and a pattern of interest that it contains. PHI-BLAST searches a protein database for other instances of the input pattern, and uses those found as seeds for the construction of local alignments to the query sequence. The random distribution of PHI-BLAST alignment scores is studied analytically and empirically. In many instances, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HSP90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.
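The seeding step described above, finding other instances of the input pattern in a database, can be sketched as follows. This is only an illustration and not the PHI-BLAST implementation: the functions `prosite_to_regex` and `find_seeds` are hypothetical helpers, the two database sequences are invented, and only a small subset of PROSITE-style pattern syntax is handled. The example pattern is the well-known P-loop (Walker A) motif `[AG]-x(4)-G-K-[ST]`. The later stages (extending each seed into a local alignment and assessing its statistical significance) are not shown.

```python
import re

# One pattern element: x, a residue, a [set] or an {excluded set},
# optionally followed by a repeat count such as (4) or (2,5).
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a compiled regex."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, low, high = m.groups()
        if residue == "x":
            core = "."                          # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # any residue except these
        else:
            core = residue                      # literal residue or [set]
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(core + repeat)
    return re.compile("".join(parts))


def find_seeds(pattern, database):
    """Return (sequence name, start index, matched text) for every hit."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping occurrences be reported as well.
    scanner = re.compile("(?=(" + regex.pattern + "))")
    seeds = []
    for name, sequence in database.items():
        for hit in scanner.finditer(sequence):
            seeds.append((name, hit.start(1), hit.group(1)))
    return seeds


if __name__ == "__main__":
    # Toy sequences invented for this example.
    toy_database = {"toy1": "MKLGAPGSGKSTLAR", "toy2": "MSSNNNQQQ"}
    print(find_seeds("[AG]-x(4)-G-K-[ST]", toy_database))
    # prints [('toy1', 3, 'GAPGSGKS')]
```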

A New Algorithm for RNA Secondary Structure Design

The function of many RNAs depends crucially on their structure. Therefore, the design of RNA molecules with specific structural properties has many potential applications, e.g. in the context of investigating the function of biological RNAs, of creating new ribozymes, or of designing artificial RNA nanostructures. Here, we present a new algorithm for solving the following RNA secondary structure design problem: given a secondary structure, find an RNA sequence (if any) that is predicted to fold to that structure. Unlike the (pseudoknot-free) secondary structure prediction problem, this problem appears to be hard computationally. Our new algorithm, "RNA Secondary Structure Designer (RNA-SSD)", is based on stochastic local search, a prominent general approach for solving hard combinatorial problems. A thorough empirical evaluation on computationally predicted structures of biological sequences and artificially generated RNA structures as well as on empirically modelled structures from the biological literature shows that RNA-SSD substantially outperforms the best known algorithm for this problem, RNAinverse from the Vienna RNA Package. In particular, the new algorithm is able to consistently solve structures for which RNAinverse is unable to find solutions.
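To make the design problem concrete, the sketch below covers only a preliminary step, not RNA-SSD itself: it reads a target structure and seeds a random sequence whose paired positions are able to form base pairs. The dot-bracket notation for the structure, and the names `pair_table`, `seed_sequence` and `is_compatible`, are assumptions introduced for this example. A full designer would additionally need a folding predictor to test whether the candidate is actually predicted to fold into the target, and a local search loop to repair it when it is not.

```python
import random

# Watson-Crick pairs used when seeding, plus G-U wobble pairs when checking.
SEED_PAIRS = [("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")]
ALLOWED_PAIRS = set(SEED_PAIRS) | {("G", "U"), ("U", "G")}


def pair_table(structure):
    """Map each paired position of a dot-bracket string to its partner."""
    stack, pairs = [], {}
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def seed_sequence(structure, rng):
    """Random sequence whose paired positions hold complementary bases."""
    pairs = pair_table(structure)
    seq = [None] * len(structure)
    for i in range(len(structure)):
        if seq[i] is not None:
            continue  # already filled in as the partner of an earlier base
        if i in pairs:
            seq[i], seq[pairs[i]] = rng.choice(SEED_PAIRS)
        else:
            seq[i] = rng.choice("ACGU")
    return "".join(seq)


def is_compatible(seq, structure):
    """True if every pair required by the structure can form in seq."""
    pairs = pair_table(structure)
    return len(seq) == len(structure) and all(
        (seq[i], seq[j]) in ALLOWED_PAIRS for i, j in pairs.items()
    )


if __name__ == "__main__":
    target = "((((...))))"  # a small hairpin, chosen for illustration
    candidate = seed_sequence(target, random.Random(0))
    print(candidate, is_compatible(candidate, target))
```

Compatibility is necessary but not sufficient: a compatible sequence may still be predicted to fold into a different structure, which is why the search part of the problem is hard.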

Searching for Genes

The collecting, organizing and indexing of sequence information into a database, a challenging task in itself, provides the scientist with a wealth of information, albeit of limited use. The power of a database comes not from the collection of information, but from its analysis. A sequence of DNA does not necessarily constitute a gene. It may constitute only a fragment of a gene or, alternatively, it may contain several genes.

Luckily, in agreement with evolutionary principles, scientific research to date has shown that all genes share common elements. For many genetic elements, it has been possible to construct consensus sequences, those sequences best representing the norm for a given class of organisms (e.g., bacteria, eukaryotes). Common genetic elements include promoters, enhancers, polyadenylation signal sequences and protein binding sites. These elements have also been subdivided further into subelements.
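A consensus sequence of the kind described above can be derived from a set of aligned example sites by taking the most frequent base in each column. The sketch below assumes the sites are already aligned and of equal length; the five example sites are invented for illustration (they resemble the bacterial -10 promoter element, whose consensus is TATAAT), and the helper names `consensus` and `scan` are hypothetical.

```python
from collections import Counter


def consensus(sites):
    """Most frequent base in each column of equal-length aligned sites.

    Ties are resolved in favour of the base seen first in that column.
    """
    if len({len(site) for site in sites}) != 1:
        raise ValueError("sites must be non-empty and of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sites)
    )


def scan(sequence, motif, max_mismatches=0):
    """Start positions where sequence matches motif within the tolerance."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Made-up aligned sites, not real data.
    example_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]
    motif = consensus(example_sites)
    print(motif)                       # prints TATAAT
    print(scan("GGCTATAATGC", motif))  # prints [3]
```

Gene-finding programs generally go beyond a single consensus string and weight each position by how strongly it is conserved, but the consensus gives the basic idea of why shared elements make sequence data searchable.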

Genetic elements share common sequences, and it is this fact that allows mathematical algorithms to be applied to the analysis of sequence data. A computer program for finding genes will contain at least the following elements.

The Challenge of Protein Modelling

Many steps lie between the location of a gene locus and the realization of a three-dimensional model of the protein that it encodes.

1. Location of Transcription Start/Stop

A proper analysis to locate a genetic locus will usually have already pinpointed at least the approximate sites of the transcriptional start and stop. Such an approximate analysis is usually sufficient for determining protein structure. It is the start and end codons for translation that must be determined with accuracy.

2. Location of Translation Start/Stop

The first translated codon in a messenger RNA sequence is almost always AUG. While this reduces the number of candidate codons, the reading frame of the sequence must also be taken into consideration.

There are six reading frames possible for a given DNA sequence, three on each strand, that must be considered, unless further information is available. Since genes are usually transcribed away from their promoters, the definitive location of this element can reduce the number of possible frames to three. There is no strong consensus sequence surrounding translation start codons that holds across different species. Therefore, locating the appropriate start codon involves finding a frame in which there are no apparent abrupt stop codons.

Knowledge of a protein's predicted molecular mass can assist this analysis. Incorrect reading frames usually predict relatively short peptide sequences. Therefore, it might seem deceptively simple to ascertain the correct frame. In bacteria, such is frequently the case. However, eukaryotes add a new obstacle to this process: INTRONS!
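The six-frame search described in this step can be sketched as follows for an intron-free sequence. The code works on DNA, so the start codon appears as ATG rather than AUG, and it uses the three standard stop codons TAA, TAG and TGA. The function names and the short test sequence are assumptions made for the example; the longest open reading frame is simply taken as the most plausible candidate, in line with the observation that incorrect frames tend to give short peptides.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(dna, offset):
    """Longest ATG-to-stop stretch in one frame (stop codon excluded).

    Open reading frames that run off the end without a stop are ignored.
    """
    best = ""
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            if i - start > len(best):
                best = dna[start:i]
            start = None                   # look for the next ORF
    return best


def six_frame_orfs(dna):
    """Longest ORF in each of the six frames, keyed '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    result = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            result[strand + str(offset + 1)] = longest_orf_in_frame(seq, offset)
    return result


if __name__ == "__main__":
    # Invented sequence with a single short ORF in frame +3.
    orfs = six_frame_orfs("CCATGAAACCCGGGTTTTAAG")
    best_frame = max(orfs, key=lambda frame: len(orfs[frame]))
    print(best_frame, orfs[best_frame])    # prints +3 ATGAAACCCGGGTTT
```

If the promoter location is known, only the three frames on the transcribed strand need to be examined, which corresponds to restricting the loop to one of the two strands.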

3. Detection of Intron/Exon Splice Sites

In eukaryotes, the reading frame is discontinuous at the level of the DNA because of the presence of introns. Unless one is working with a cDNA sequence in analysis, these introns must be spliced out and the exons joined to give the sequence that actually codes for the protein.

Intron/exon splice sites can be predicted on the basis of their common features. Most introns begin with the nucleotides GT and end with the nucleotides AG. There is a branch sequence near the downstream end of each intron involved in the splicing event. There is a moderate consensus around this branch site.
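The GT...AG rule lends itself to a simple enumeration of candidate introns, sketched below. This is a deliberately naive illustration: the function names, the minimum intron length and the toy sequence are assumptions made for the example, and the branch site is not modelled at all. Because GT and AG are common dinucleotides, the rule alone produces many false candidates, so it only narrows the search.

```python
def candidate_introns(dna, min_length=10):
    """All (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive, so dna[start:end] is the
    candidate intron. The min_length default is an arbitrary assumption.
    """
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [j for j in range(2, len(dna) + 1) if dna[j - 2:j] == "AG"]
    return [
        (i, j)
        for i in donors
        for j in acceptors
        if j - i >= min_length
    ]


def splice(dna, introns):
    """Remove the given non-overlapping introns and join the exons."""
    exons, position = [], 0
    for start, end in sorted(introns):
        exons.append(dna[position:start])
        position = end
    exons.append(dna[position:])
    return "".join(exons)


if __name__ == "__main__":
    # Toy gene: exon, one intron (GT...AG), exon. Invented for illustration.
    gene = "ATGGCC" + "GTAAGTCCCTTTTAG" + "AAATAA"
    print(candidate_introns(gene))   # prints [(6, 21), (10, 21)]
    print(splice(gene, [(6, 21)]))   # prints ATGGCCAAATAA
```

Even this tiny example yields two candidates for one real intron, which shows why the additional signals mentioned above, such as the branch site, are needed to choose among them.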

4. 3-D Structure Modelling

With the completed primary amino acid sequence in hand, the challenge of modelling the three-dimensional structure of the protein awaits. This process uses a wide range of data and CPU-intensive computer analysis. Most often, one is only able to obtain a rough model of the protein, and several conformations of the protein may exist that are equally probable.

Importance of bio-informatics

The rationale for applying computational approaches to facilitate the understanding of various biological processes includes:

• a more global perspective in experimental design; and
• the ability to capitalize on the emerging technology of database mining, the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better characterized organisms.
