
BIOINFORMATICS

Biology easily has 500 years of exciting problems to work on.
--Donald E. Knuth
Ever since the structure of DNA was unraveled in 1953, molecular biology has witnessed tremendous advances.
The need to process the ever-growing volume of biological data has created entirely new problems that are interdisciplinary in nature.
Scientists from the biological sciences are the creators and ultimate users of this data.
Because of the huge size and high complexity of biological data, help from many other disciplines is required, in particular from mathematics and computer science.
This need has created a new field: Computational Molecular Biology and Bioinformatics.

The Commercial Market


The current bioinformatics market is worth 300 million / year (half of it software).
Prediction: $2 billion / year in 5-6 years.
~50 bioinformatics companies:
Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, Bioinformatics Solutions, BIOCON (India).

Computational Molecular Biology & Bioinformatics (CMB)
CMB consists of the development and use of computer science and mathematical techniques to solve problems in molecular biology.

Bioinformatics
Unit 1. Basic Concepts
Unit 2. Suffix Trees and Applications
Unit 3. Sequence Alignment: Pairwise Alignment, Multiple Alignments
Unit 4. Sequencing
Unit 5. Motif Prediction

Unit 1. Basic Concepts of Molecular Biology
Cellular architecture, nucleic acids (RNA & DNA), DNA replication, repair and recombination. Transcription, genetic code, gene expression, protein structure and function, molecular biology tools. Statistical methods: estimation, hypothesis testing, random walks, Markov models (HMM).

Unit 2. Suffix Trees
Definition and examples, Ukkonen's linear-time suffix tree algorithm, applications (exact string matching, longest common substring of two strings, recognizing DNA contamination). Pairwise sequence alignment (edit distance, dynamic programming calculation of edit distance, string similarity, gaps).

Unit 3. Sequence Alignment
Pairwise sequence alignment (local), HMM for pairwise alignment. Multiple string alignments: need for MSA, family and superfamily representation, multiple sequence comparison for structural inferences, multiple alignments with sum-of-pairs, consensus objective functions. Profile HMM for multiple sequence alignment. Database searching for similar sequences (FASTA, BLAST), PAM and BLOSUM substitution matrices.

Unit 4. Sequencing
Fragment assembly (shortest common superstring algorithms based on multigraphs), sequencing by hybridization, protein sequencing.

Unit 5. Motif Prediction
Motif prediction, gene prediction, introduction to protein structure prediction.

BOOKS RECOMMENDED
D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. (Chapters: 5, 6, 7, 10, 11, 14, 15)
J. Setubal & J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, 1997. (Chapters: 1, 8)
W.J. Ewens & G.R. Grant, Statistical Methods in Bioinformatics, Springer, 2001.
R. Durbin, S.R. Eddy, A. Krogh & G.J. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998. (Chapters: 3, 5, 6)
R.C. Deonier, S. Tavaré & M.S. Waterman, Computational Genome Analysis, Springer, 2005.
N.C. Jones & P.A. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
D.E. Krane & M.L. Raymer, Fundamental Concepts of Bioinformatics, Pearson Education, 2003.
J. Tisdall, Beginning Perl for Bioinformatics, O'Reilly, 2001.
M.S. Waterman, Introduction to Computational Biology, CRC Press, 2000.
A.D. Baxevanis & B.F.F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley-Interscience, 2001.
M. Ridley, Genome: The Autobiography of a Species, Fourth Estate, 1999.
Lodish, Berk, Zipursky, Baltimore & Darnell, Molecular Cell Biology, W.H. Freeman, 2000.

Class 2

UNIT 1
LIFE AT ITS SIMPLEST
DNA
RNA
PROTEIN
GENETIC-CODE

QUICK PRIMER ON GENETICS
EVERY CELL IN THE HUMAN BODY CONTAINS A COPY OF THE GENOME.
THINK OF THE GENOME AS A BOOK: THE BLUEPRINT THAT CONTAINS DETAILS OF WHAT EACH INDIVIDUAL OUGHT TO BE LIKE.
THE HUMAN GENOME IS ORGANIZED INTO 23 PAIRS OF CHROMOSOMES.
IF THE GENOME WERE A BOOK, THE CHROMOSOMES WOULD BE ITS CHAPTERS.
EACH OF THESE CHAPTERS TELLS HUNDREDS TO THOUSANDS OF STORIES CALLED GENES.
Genes influence, e.g., the colour of an individual's skin, eyes and hair, left- or right-handedness, and many other traits.

What did Venter do?
He could figure out what language the book was written in and how to read it.
He could figure out the grammar of the book and, therefore, how to write in the language.
This is the kind of language that would allow him to create life.

Venter's magic: new life?
Venter used his knowledge to create Mycoplasma laboratorium, a chromosome that is 381 genes long. Transplanted into a living cell, it is expected to take control of the cell and become a new life-form.
It is hoped that such a new life-form could mop up excess carbon dioxide and contribute to resolving problems like global warming.

Contd
Until now, scientists have managed to take the genome out of one cell, put it into another cell, and so create an altogether new organism.
But nobody knew how to create the genome itself. Venter did just that.


Living vs. Nonliving
Both kinds of matter are composed of the same atoms and conform to the same physical and chemical rules.
What is the difference, then?

Living vs. Nonliving
Living things can move, reproduce, grow, and eat.
They participate actively in their environment.
Living beings act the way they do because of a complex array of chemical reactions that occur inside them. These reactions never cease.
A living organism is constantly exchanging matter and energy with its surroundings.
Anything that is in equilibrium with its surroundings can generally be considered dead. (Exceptions are vegetative forms, like seeds, and viruses, which may be completely inactive for long periods of time and are not dead.)

Life starts
Life started some 3.5 billion years ago, shortly after the Earth itself was formed.
The first life forms were very simple, but over billions of years a continuously acting process called evolution made them evolve and diversify.
Both complex and simple organisms have a similar molecular chemistry, or biochemistry.
The main actors in the chemistry of life are molecules called proteins and nucleic acids.

ACTORS IN THE CHEMISTRY OF LIFE: Proteins & nucleic acids
Proteins are responsible for what a living being is and does in a physical sense.
"We are our proteins" --Russell Doolittle
Nucleic acids, on the other hand, encode the information necessary to produce proteins and are responsible for passing this recipe along to subsequent generations.
Much recent research is devoted to understanding the structure and function of proteins and nucleic acids.

Proteins
Different roles of proteins:
Enzymes
Carry signals
Transport small molecules such as oxygen
Form cellular structures (tissues)
Regulate cell processes (such as defense mechanisms)
What are proteins made of?
Amino acids: a chain of amino acids = a protein

Amino acids
Backbone of the polypeptide chain
Convention:
Begin at the N-terminal
End at the C-terminal
Torsion (rotation) angles around:
the Cα-N bond (φ)
the Cα-C bond (ψ)

PROTEIN STRUCTURE
A protein is not just a linear sequence of residues (its primary structure).
Proteins actually fold in 3D, presenting secondary, tertiary and quaternary structures.
The 3D shape of a protein is related to its function.
Because a protein can be built from 20 different kinds of amino acids, the resulting 3D structure is very complex and without symmetry.
No simple and accurate method for determining the 3D structure is known.

Genomic Code
DNA: deoxyribonucleic acid
Basic unit = nucleotide
Sugar, phosphate, base (A, G, T, C): adenine, guanine, thymine, cytosine

Contd
DNA is a chain of simpler molecules.
Actually it is a double chain (two strands).
Each single chain has a backbone consisting of repetitions of the same basic unit.
This unit is formed by a sugar molecule called 2'-deoxyribose attached to a phosphate residue.
The sugar molecule contains five carbon atoms, labeled 1' through 5'.
DNA molecules also have an orientation (a strand starts at the 5' end and finishes at the 3' end).

Contd
Attached to each 1' carbon in the backbone are other molecules called bases.
There are 4 kinds of bases:
A (ADENINE)
G (GUANINE)
C (CYTOSINE)
T (THYMINE)

CONTD
Bases A & G belong to a larger group of substances called purines, whereas C & T belong to the pyrimidines.
When we see the basic unit of a DNA molecule as consisting of the sugar, the phosphate, and its base, we call it a nucleotide.
Bases and nucleotides are not the same thing.
A DNA molecule having only a few nucleotides is referred to as an oligonucleotide.
DNA molecules in nature are very long, much longer than proteins.
In a human cell, each DNA molecule has tens to hundreds of millions of nucleotides.

Contd
DNA molecules are double stranded.
The two strands are tied together in a helical structure (Watson & Crick, 1953).
Each base in one strand is paired with a base in the other strand.
A pairs with T and C pairs with G (complementary bases, or Watson-Crick base pairs).
The base pair (bp) provides the unit of length.
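
Because A always pairs with T and C with G, one strand determines the other. The following is a minimal Python sketch of that rule, assuming a strand is stored as an uppercase string over A, C, G, T and read from its 5' end to its 3' end. The function names are illustrative and not part of the course material.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base paired with each position of the strand."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return complement(strand)[::-1]

if __name__ == "__main__":
    top = "ATTGCC"                   # made-up example strand
    print(complement(top))           # TAACGG
    print(reverse_complement(top))   # GGCAAT
    print(len(top), "bp")            # length measured in base pairs
```

The reversal is needed because the two strands run in opposite directions, so the partner strand is conventionally written back to front.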

RNA
RNA is a nucleic acid made from a long chain of nucleotides.
Each nucleotide consists of a nitrogenous base, a ribose sugar, and a phosphate.
RNA is very similar to DNA, but differs from it in the following basic compositional and structural ways.

DNA vs. RNA
DNA is double stranded; RNA is single stranded.
DNA is a very long chain of nucleotides; RNA is a comparatively shorter chain of nucleotides.
DNA contains deoxyribose; RNA contains ribose.
DNA is more stable; RNA is less stable and more prone to hydrolysis.
In DNA the complementary nucleotide to adenine is thymine; in RNA it is uracil.
DNA performs essentially one function; there are different kinds of RNA performing different functions.

Class 3
26/06/08

Central Dogma of Molecular Biology
How does the information in DNA result in proteins?
A promoter is a region before each gene in the DNA that serves as an indication to the cellular machinery that a gene is ahead.
Once the beginning of a gene has been recognized, a copy of the gene is made on an RNA molecule.
The resulting RNA is mRNA (with U substituted for T). This process is called transcription.
The mRNA will be used to manufacture a protein.
After transcription, the introns are spliced out of the mRNA. Introns are the parts of a gene that are not used in protein synthesis.
After the introns are spliced out, the shortened mRNA, containing copies of only the exons plus regulatory regions at the beginning and end, leaves the nucleus.

Contd
Because of the intron/exon phenomenon, we use different names for the entire gene and for the spliced sequence consisting of exons only.
The former is called genomic DNA and the latter complementary DNA, or cDNA.
tRNAs are the molecules that actually implement the genetic code in a process called translation. They make the connection between a codon and the specific amino acid this codon codes for.
When a stop codon appears, no tRNA associates with it and the synthesis ends.

Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
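
The example above can be reproduced with a short Python sketch. It assumes the DNA string is the coding strand (so transcription is just T replaced by U, as on the slide), uses the standard genetic code with one-letter amino acid symbols, and ignores promoters, introns, and start-codon search. The names `transcribe` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order.
# '*' marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna: str) -> str:
    """Copy a coding-strand DNA sequence into mRNA (substitute U for T)."""
    return dna.replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons from position 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":      # no tRNA matches a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

if __name__ == "__main__":
    dna = "CCTGAGCCAACTATTGATGAA"
    mrna = transcribe(dna)
    print(mrna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(mrna))  # PEPTIDE
```

The lookup table plays the role the text assigns to tRNA: it connects each codon to the amino acid it codes for.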

JUNK DNA
Genes are certain contiguous regions of the chromosome, but they do not cover the entire molecule.
There are intergenic regions which do not have any known function. They are called junk DNA because they appear to be there for no particular use.
Recent research has shown that junk DNA has more information content than previously believed.
The amount of junk DNA varies from species to species; in humans, more than 90% of the DNA is classed as junk.

OPEN READING FRAME (ORF)
An ORF in a DNA sequence is a contiguous stretch of the sequence that begins at a start codon and contains an integral number of codons, none of which is a stop codon.
Consider the sequence TAATCGAATGGGC.
First reading frame: TAA TCG AAT GGG
Second reading frame: AAT CGA ATG GGC
Third reading frame: ATC GAA TGG
A fourth reading, starting at position 4, gives TCG AAT GGG, which is just a subset of the frame starting at position 1.
Sometimes we talk about 6, not 3, different frames in a sequence (look at the opposite strand and count another 3).
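
The three frames listed above, and the three more on the opposite strand, can be enumerated mechanically. This is a small Python sketch under the assumption that the sequence is an uppercase DNA string; incomplete codons at the end of a frame are dropped, as in the example. Function names are illustrative.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def codons(seq: str, offset: int) -> list:
    """Split seq into complete codons, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def reading_frames(seq: str) -> list:
    """Return the three reading frames of one strand."""
    return [codons(seq, offset) for offset in range(3)]

def six_frames(seq: str) -> list:
    """Three frames of the given strand plus three of the opposite strand."""
    opposite = "".join(PAIR[base] for base in reversed(seq))
    return reading_frames(seq) + reading_frames(opposite)

if __name__ == "__main__":
    for frame in reading_frames("TAATCGAATGGGC"):
        print(" ".join(frame))
    # TAA TCG AAT GGG
    # AAT CGA ATG GGC
    # ATC GAA TGG
```

An offset of 3 would only repeat the first frame minus its first codon, which is why there are exactly three frames per strand.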

Genome
The complete set of chromosomes inside a cell is called a genome.
The number of chromosomes in a genome is characteristic of the species.
Every cell in a human being has 46 chromosomes, whereas in mice this number is 40.

Is the genome like a computer program?
Genome = computer program: the genome of an organism is seen as a computer program that completely specifies the organism.
Cell machinery = interpreter of this program.
Biological functions performed by proteins = execution of this program.

Class 4
01/07/08

The Eye of the Fly
Fruit flies (Drosophila melanogaster) have a gene called eyeless which, if it is knocked out (i.e. eliminated from the genome using molecular biology methods), results in fruit flies without eyes.
It is obvious that the eyeless gene plays a role in eye development.
Researchers have identified a human gene responsible for a condition called aniridia.
In humans who are missing this gene (or in whom the gene has mutated just enough for its protein product to stop functioning properly), the eyes develop without irises.

Contd
If the gene for aniridia is inserted into an eyeless Drosophila "knockout", it causes the production of normal Drosophila eyes. It is an interesting observation.
Could there be some similarity in how eyeless and aniridia function, even though flies and humans are vastly different organisms?
To gain insight into how eyeless and aniridia work, we can compare their sequences.

Contd
20 years ago, looking for similarity between the eyeless and aniridia DNA sequences would have been like looking for a needle in a haystack.
Most scientists compared the respective gene sequences by hand, aligning them one under the other in a word processor and looking for matches character by character.
This was time-consuming and hard on the eyes.

Contd
In the late 1980s, fast computer programs for comparing sequences changed molecular biology forever.
Many tools that are widely available to the biology community, including everything from multiple alignment, phylogenetic analysis, motif identification, and homology modeling software to web-based database search services, rely on pairwise sequence comparison algorithms as a core element of their function.
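
As one hedged illustration of what a pairwise sequence comparison can look like, here is edit distance computed by dynamic programming, a technique named in the Unit 2 outline. This is a sketch with unit costs for insertion, deletion, and substitution; it is not the scoring scheme of any particular tool, and the example strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))        # 1: delete the C
    print(edit_distance("kitten", "sitting"))  # 3
```

The table is filled row by row, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.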

How is the genome studied?
Say, the human genome.
Sequencing: the basic information we want to extract from any piece of DNA is its base pair sequence. The process of obtaining this information is called sequencing.
A human chromosome has around 10^8 base pairs, but the largest pieces of DNA that can be sequenced in the laboratory are about 700 bp long.
=> There is a gap of some 10^5 between the scale of what we can actually sequence and the size of a chromosome.
This gap is at the heart of many problems in computational biology.
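
The size of this gap is simple arithmetic. The sketch below takes the chromosome size as 10^8 base pairs and the longest directly sequenced piece as 700 bp, both from the slide; the variable names are illustrative, and the piece count is a lower bound that assumes no overlap between pieces.

```python
chromosome_bp = 10**8   # approximate size of a human chromosome
read_bp = 700           # longest piece that can be sequenced directly

# Ratio between the two scales: roughly 1.4 * 10^5.
scale_gap = chromosome_bp / read_bp

# Fewest non-overlapping pieces needed to cover the chromosome
# (ceiling division).
min_pieces = -(-chromosome_bp // read_bp)

if __name__ == "__main__":
    print(f"{scale_gap:.2e}")  # 1.43e+05
    print(min_pieces)          # 142858
```

In practice pieces must overlap so that they can be put back in order, so the real number is larger.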

Cutting & Breaking DNA
Because a DNA molecule is so long, some tool is needed to cut it at specific points (like a pair of scissors) or to break it apart in some way.
The pair of scissors is provided by restriction enzymes.
They cut DNA molecules at all places where a certain sequence appears (usually a palindromic sequence, i.e. one that equals its own reverse complement).
Some common types of restriction enzymes are 4-cutters, 6-cutters, and 8-cutters.
It is rare to see an odd cutter because sequences of odd length cannot be palindromes in this sense: the middle base would have to be its own complement.
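
In DNA, a palindrome is a sequence that reads the same on both strands, that is, one equal to its own reverse complement. The Python sketch below checks this property and lists where a recognition site occurs. GAATTC is the recognition site of the 6-cutter EcoRI; the longer test sequence is made up, and the reported positions are where the site starts, not the exact bond an enzyme would cut.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the opposite strand read 5' to 3'."""
    return "".join(PAIR[base] for base in reversed(seq))

def is_dna_palindrome(site: str) -> bool:
    """True if the site reads the same on both strands."""
    return site == reverse_complement(site)

def site_positions(dna: str, site: str) -> list:
    """0-based start positions of every occurrence of site in dna."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions

if __name__ == "__main__":
    print(is_dna_palindrome("GAATTC"))  # True
    print(is_dna_palindrome("GAATC"))   # False: odd length
    print(site_positions("AAGAATTCGGGAATTCT", "GAATTC"))  # [2, 10]
```

For an odd-length site the middle base would have to pair with itself, which no base does, so `is_dna_palindrome` is always false for odd lengths.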

COPYING DNA
Also known as DNA amplification.
Very important in DNA cloning.
Given a piece of DNA, one way of obtaining further copies is to use nature itself.
We insert this piece into the genome of an organism (the host) and then let the organism multiply.
As the host multiplies, the inserted piece is multiplied along with the original DNA.
Then we kill the host and dispose of the rest, keeping only the inserts in the desired quantity.
DNA produced in this way is called recombinant.

Reading & measuring DNA
Reading is done with a technique known as gel electrophoresis, which is based on the separation of molecules by their size.
This process involves a gel medium and a strong electric field.

HUMAN GENOME PROJECT

What is the Human Genome Project?
A U.S. government project coordinated by the Department of Energy and the National Institutes of Health.
Goals (1998-2003):
identify the approximately 100,000 genes in human DNA (the estimate at the time; later revised to about 20,000-25,000)
determine the sequences of the 3 billion bases that make up human DNA
store this information in databases
develop tools for data analysis
address the ethical, legal, and social issues that arise from genome research

Why is the Department of Energy involved?
After atomic bombs were dropped during World War II, Congress told DOE to conduct studies to understand the biological and health effects of radiation and of the chemical by-products of all energy production.
The best way to study these effects is at the DNA level.

Whose genome is being sequenced?
The first reference genome is a composite genome from several different people.
It is generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups.

Benefits of HGP Research
improvements in medicine
microbial genome research for fuel and environmental cleanup
DNA forensics
improved agriculture and livestock
better understanding of evolution and human migration
more accurate risk assessment

Ethical, Legal, and Social Implications of HGP Research
fairness in the use of genetic information
privacy and confidentiality
psychological impact and stigmatization
genetic testing
reproductive issues
education, standards, and quality control
commercialization
conceptual and philosophical implications

For More Information...

Human Genome Project Information Website


http://www.ornl.gov/hgmis

Contd
A large effort like this cannot be undertaken by a single lab.
On the computer science side, databases with updated and consistent information have to be maintained.
Fast access to the data has to be provided.
After the sequencing, there is still the difficult task of analyzing the data obtained.

Contd
Work on the treatment of genetic diseases based on data produced by the Human Genome Project is still going on, although pioneering efforts have already yielded encouraging results.

Class 5
08/07/08

What is a database?
A collection of information, usually stored in an electronic format, that can be searched by a computer.

A brief history of biological databases
1965: M. O. Dayhoff et al. publish the Atlas of Protein Sequence and Structure.
1982: EMBL initiates a DNA sequence database, followed within a year by GenBank (then at LANL) and in 1984 by the DNA Data Bank of Japan.
1988: EMBL/GenBank/DDBJ agree on a common format for data elements.

Biological databases: why?
There are two main functions of biological databases:
To make biological data available to scientists.
As much as possible of a particular type of information should be available in one single place (book, site, database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. And not all data is actually published explicitly in an article (genome sequences!).
To make biological data available in computer-readable form.
Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.

The different types of databases
One may characterize the available biological databases by several different properties. Here is a list to help you think about the various properties a particular database may have.

Type of data
nucleotide sequences
protein sequences
protein sequence patterns or motifs
macromolecular 3D structure
gene expression data
metabolic pathways

Contd

Data entry and quality control
Scientists (teams) deposit data directly
Appointed curators add and update data
Are erroneous data removed or marked?
Type and degree of error checking
Consistency, redundancy, conflicts, updates

Primary or derived data
Primary databases: experimental results entered directly into the database
Secondary databases: results of analysis of primary databases
Aggregate of many databases
Links to other data items
Combination of data
Consolidation of data

Contd
Technical design
Flat-files
Relational database (SQL)
Object-oriented database (e.g. CORBA, XML)

Maintainer status
Large, public institution (e.g. EMBL, NCBI)
Quasi-academic institute (e.g. Swiss Institute of Bioinformatics, TIGR)
Academic group or scientist
Commercial company

Availability
Publicly available, no restrictions
Available, but with copyright
Accessible, but not downloadable
Academic, but not freely available
Proprietary, commercial; possibly free for academics

Accession codes vs identifiers
Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system where an entry can be identified in two different ways.
Basically, it has two names:
Identifier
Accession code (or number)

Contd
Identifier
An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of letters and digits that generally is interpretable in some meaningful way by a human, for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first denotes the protein and the second denotes the species it is found in. For example, KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
An identifier can change. For example, the database curators may decide that the identifier for an entry is no longer appropriate. However, this happens rarely enough that it is not a big problem in practice.

Contd
Accession code (number)
An accession code (or number) is a number (possibly with a few characters in front) that uniquely identifies an entry in its database. For example, the accession code for KRAF_HUMAN in SWISS-PROT is P04049.
The main conceptual difference from the identifier is that it is supposed to be stable: any given accession code will, as soon as it has been issued, always refer to that entry or to the entries derived from it. It is often called the primary key for the entry. The accession code, once issued, must always point to its entry, even after large changes have been made to the entry. This means that in discussions about specific database entries (e.g. an article about a specific protein), one should always give the accession code for the entry in the relevant database.
When two entries are merged into a single one, the new entry will have both accession codes, where one will be the primary and the other the secondary accession code. When an entry is split into two, both new entries will get new accession codes, but will also have the old accession code as a secondary code.
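
The merge and split rules can be made concrete with a toy data structure. This is only a sketch of the bookkeeping described above, not the schema of SWISS-PROT or any other database; the `Entry` class, the function names, and every identifier and accession code in the example are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    identifier: str   # human-readable name; may change
    accession: str    # primary accession code; stable once issued
    secondary: list = field(default_factory=list)  # older codes that still resolve here

def merge(first: Entry, second: Entry, identifier: str) -> Entry:
    """Merge two entries: one code stays primary, all others become secondary."""
    secondary = [second.accession] + first.secondary + second.secondary
    return Entry(identifier, first.accession, secondary)

def split(entry: Entry, parts: list) -> list:
    """Split an entry; parts is a list of (identifier, new_accession) pairs."""
    old_codes = [entry.accession] + entry.secondary
    return [Entry(ident, acc, list(old_codes)) for ident, acc in parts]

def lookup(entries: list, code: str) -> list:
    """Return every entry that a primary or secondary code points to."""
    return [e for e in entries if code == e.accession or code in e.secondary]

if __name__ == "__main__":
    # Hypothetical identifiers and codes, invented for this example.
    a = Entry("ABC1_HUMAN", "X00001")
    b = Entry("ABC2_HUMAN", "X00002")
    merged = merge(a, b, "ABC_HUMAN")
    print(merged.accession, merged.secondary)   # X00001 ['X00002']
    halves = split(merged, [("ABCA_HUMAN", "X00003"), ("ABCB_HUMAN", "X00004")])
    print([e.identifier for e in lookup(halves, "X00001")])
    # ['ABCA_HUMAN', 'ABCB_HUMAN']
```

Because old codes are never discarded, a code quoted in an old article still leads to the current entry or entries, which is the stability property the text asks for.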

Nucleotide sequence databases
Primary nucleotide sequence databases
The databases EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases.
They include sequences submitted directly by scientists and genome sequencing groups, and sequences taken from the literature and patents. There is comparatively little error checking and there is a fair amount of redundancy.
The entries in the EMBL, GenBank and DDBJ databases are synchronized on a daily basis, and the accession numbers are managed in a consistent manner between these three centers.
The nucleotide databases have reached such large sizes that they are available in subdivisions that allow searches or downloads that are more limited, and hence less time-consuming. For example, GenBank currently has 17 divisions.
There are no legal restrictions on the use of the data in these databases. However, there are some patented sequences in the databases.

Contd
EMBL www.ebi.ac.uk/embl/
The EMBL (European Molecular Biology Laboratory) nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK.
GenBank www.ncbi.nlm.nih.gov/Genbank/
The GenBank nucleotide database is maintained by the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), a federal agency of the US government. It can be accessed and searched through the Entrez system at NCBI, or one can download the entire database as flat files.
DDBJ www.ddbj.nig.ac.jp
The DNA Data Bank of Japan began as a collaboration with EMBL and GenBank. It is run by the National Institute of Genetics. One can search for entries by accession number.

Other nucleotide sequence databases
Secondary databases
The following databases contain subsets of the EMBL/GenBank databases. Some also contain more information or links than the primary ones, or organize the data differently to better serve some specific purpose. However, the nucleotide sequences themselves should always be available in the EMBL/GenBank databases. In this sense, the databases below are secondary databases.
UniGene http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene
The UniGene system attempts to process the GenBank sequence data into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location.
SGD http://www.yeastgenome.org/
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.
EBI Genomes www.ebi.ac.uk/genomes/
This web site provides access and statistics for the completed genomes, and information about ongoing projects.
Genome Biology www.ncbi.nlm.nih.gov/Genomes/
The Genome Biology site at NCBI contains information about the available complete genomes.
Ensembl www.ensembl.org
Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.

Protein sequence databases
The two protein sequence databases SWISS-PROT and PIR are different from the nucleotide databases in that they are both curated.
This means that groups of designated curators (scientists) prepare the entries from the literature and/or contacts with external experts.

SWISS-PROT, TrEMBL
www.expasy.ch/sprot
SWISS-PROT is a protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.
It was started in 1986 by Amos Bairoch in the Department of Medical Biochemistry at the University of Geneva. This database is generally considered one of the best protein sequence databases in terms of the quality of the annotation. Its size is given in the table below.
TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT. The procedure that is used to produce it was developed by Rolf Apweiler. The annotation of an entry in TrEMBL has not (yet) reached the standards required for inclusion into SWISS-PROT proper.
The SWISS-PROT database has some legal restrictions: the entries themselves are copyrighted, but freely accessible and usable by academic researchers. Commercial companies must buy a license from SIB.

PIR pir.georgetown.edu
The Protein Information Resource (PIR) is a division of the National Biomedical Research Foundation (NBRF) in the US. It is involved in a collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japanese International Protein Sequence Database (JIPID). The PIR-PSD (Protein Sequence Database) release 70.01 (22 Oct 2000) contains 254,293 entries.
PIR grew out of Margaret Dayhoff's work in the middle of the 1960s. It strives to be comprehensive, well-organized, accurate, and consistently annotated. However, it is generally believed that its entry annotation does not reach the level of completeness found in SWISS-PROT. Although SWISS-PROT and PIR overlap extensively, there are still many sequences which can be found in only one of them.
One can search for entries or do sequence similarity searches at the PIR site. PIR also produces NRL-3D, which is a database of sequences extracted from the three-dimensional structures in the Protein Databank (PDB).
It appears that the PIR web site, and possibly also the underlying database, has improved considerably over the past year. This means that if one is interested in protein sequences, there is now even more reason to check out PIR.

Other relevant databases
GeneCards www.genecards.org
GeneCards is a database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. It is a typical example of a secondary database, which contains many links to other databases, and attempts to consolidate the information that is available for a specific class of entity, in this case human genes.
GeneLynx www.genelynx.org
GeneLynx is a database of Web links for human genes. It contains pointers to a large number of other databases. This is also a typical secondary database. It is maintained by Boris Lenhard and Wyeth Wasserman at CGB, KI, Sweden.

Contd
KEGG www.genome.ad.jp/kegg/
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects.
Amos' WWW links page www.expasy.org/links.html
A page of many links to biological databases and/or web sites.

POPULAR BIOINFORMATICS DATABASES
http://everest.bic.nus.edu.sg/~bhuvana/lsm2104/popualr-bioinformatics-databases.htm

Class 6
10/07/08

Growth of GenBank database
[Figure: growth of GenBank over time, plotted as base pairs and as sequences]

www.ncbi.nlm.nih.gov
Created in 1988 as part of the National
Library of Medicine at NIH
Establish public databases
Research in computational biology
Develop software tools for sequence
analysis
Disseminate biomedical information

Types of databases at NCBI
Primary databases
Original submissions by experimentalists
Content controlled by the submitter
Examples: GenBank, SNP, GEO
Derivative databases
Built from primary data
Content controlled by third party (NCBI)
Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain, Gene

Entrez

Literature & Text


PubMed
15 million citations in MEDLINE
Links to participating online journals

Books
Linked from PubMed and other records

Searchable from within Entrez

Nucleotide databases

Type         Database                   Records
Primary      GenBank / EMBL / DDBJ      54,694,591
Derivative   RefSeq                      1,132,972
Derivative   Third Party Annotation          4,763
Derivative   PDB                             5,887
Total                                   55,838,213

EMBL / GenBank / DDBJ
(EMBL: European Molecular Biology Laboratory)
Archive containing all sequences from:

genome projects
sequencing centers
individual scientists
patent offices

Database is doubling every 15 months


Sequences from >200,000 different species
>1000 new species added every month
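
As a rough illustration of what a 15-month doubling time implies, the sketch below assumes idealized exponential growth. The function name growth_factor is made up for this example and the model ignores any change in the doubling time.

```python
def growth_factor(months_elapsed, doubling_months=15):
    """Factor by which a database grows if it doubles every `doubling_months`.

    Assumes ideal exponential growth: size(t) = size(0) * 2 ** (t / T).
    """
    return 2 ** (months_elapsed / doubling_months)


# Five years (60 months) at a 15-month doubling time is four doublings.
five_year_factor = growth_factor(60)
print(f"Growth over 5 years: {five_year_factor:.0f}x")
```

Under this assumption an archive of 55 million records would reach several hundred million within five years, which is why local copies quickly go out of date.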

Protein Databases
Genpept

CDS from GenBank entries

TrEMBL (1996)

Automatic CDS translations from EMBL

Highly redundant
Not all experimentally determined
Many inaccuracies

Secondary protein database


SWISS-PROT (1986)

Best annotated, least redundant

PIR (Protein Information Resource)

More automated annotation


Collaborations with MIPS and JIPID


UniProt (2003)

UniProt (Universal Protein Resource) is a central repository of
protein sequence and function created by joining the information
contained in Swiss-Prot, TrEMBL, and PIR.

UniProt
UniProt Knowledgebase (UniProtKB)
Central access point for extensive curated protein
information, including function, classification, and
cross-reference.

UniProt Non-redundant Reference (UniRef)
Set of databases that combine closely related sequences into a
single record to speed searches.

UniProt Archive (UniParc)
Comprehensive repository, reflecting the history of all
protein sequences.
No annotation, used internally

NCBI Derivative Sequence Data

[Figure: sequence fragments submitted by labs flow into GenBank; derivative data sets are produced from them by curators (RefSeq), by genome assembly, and by algorithms (UniGene)]

High-throughput DNA sequencing


Top image: confocal detection by the MegaBACE sequencer
of fluorescently labeled DNA

Bottom image: computer image of sequence read by
automated sequencer

The trend of data growth

The 21st century is a century of biotechnology & bioinformatics:

Genomics: New sequence information is being produced at increasing
rates. (The contents of GenBank double every one and a half years.)

[Figure: GenBank growth, nucleotides (billions, scale 0 to 8) against years, 1980 to 2000]

Microarray: Global expression analysis: RNA levels of every
gene in the genome analyzed in parallel.

Proteomics: Global protein analysis generates large mass
spectra libraries.

Metabolomics: Global metabolite analysis: 25,000 secondary
metabolites characterized

Glycomics: Global sugar metabolism analysis

How to handle the large amount of information?

Drew Sheneman, New Jersey--The Newark Star Ledger

Answer: bioinformatics and Internet

Bioinformatics: NEED FOR ALGORITHM?

In the 1960s: the birth of bioinformatics
IBM 7090 computer
Margaret Oakley Dayhoff created:
The first protein database
The first program for sequence assembly

There is a need for computers and algorithms that allow:
Access, processing, storing, sharing, retrieving, visualizing, annotating

Why do we need the Internet?


"Omics" projects and the information associated with them involve a
huge amount of data that is stored on computers all over the world.

Because it is impossible to maintain up-to-date copies of all relevant
databases within the lab, access to the data is via the internet.

[Figure: "You are here" and remote "Database storage", connected by arrows labeled "Your request" and "Results"]


Scope

Make you familiar with bioinformatics resources available on the web

They are big databases and searching either one should produce
similar results because they exchange information routinely.
-GenBank (NCBI): http://www.ncbi.nlm.nih.gov
-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp
-TIGR: http://tigr.org/tdb/tgi
-Yeast: http://yeastgenome.org
-E. coli: http://colibase.bham.ac.uk/blast/

Specialized databases: tissues, species
-ESTs (Expressed Sequence Tags)
~at NCBI http://www.ncbi.nlm.nih.gov/dbEST
~at TIGR http://tigr.org/tdb/tgi
- ...many more!

Protein (amino acid) databases


They are big databases too:
-Swiss-Prot (very high level of annotation)
http://au.expasy.org/
-PIR (Protein Information Resource): the world's most
comprehensive catalog of information on proteins
http://www.pir.uniprot.org/

Translated databases:
-TrEMBL (translated EMBL): includes entries that have
not been annotated yet into Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html
-GenPept (translation of coding regions in GenBank)
-pdb (sequences derived from the 3D structures in the
Brookhaven PDB) http://www.rcsb.org/pdb/

Database homology searching


Uses algorithms that give searches a mathematical basis, so that
results can be expressed in terms of statistical significance.

Assumes that sequence, structure, and function are inter-related.

All similarity searching methods rely on the concepts of alignment
and distance between sequences.

A similarity score is calculated from a distance: the number of DNA
bases or amino acids that are different between two sequences.
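
The distance described above can be sketched as a position-by-position count of differences (a Hamming distance). This is a minimal illustration that assumes the two sequences are already aligned and of equal length; the function names and the example sequences are invented here.

```python
def hamming_distance(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def percent_identity(seq_a, seq_b):
    """Similarity derived from the distance: share of identical positions."""
    distance = hamming_distance(seq_a, seq_b)
    return 100.0 * (len(seq_a) - distance) / len(seq_a)


# Illustrative sequences: they differ at positions 3 and 6.
dist = hamming_distance("GATTACA", "GACTATA")
ident = percent_identity("GATTACA", "GACTATA")
print(dist, round(ident, 1))
```

Real search tools do not require equal lengths; they first align the sequences, inserting gaps, and then score the alignment.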

Calculating alignment scores


Scoring system: Uses scoring matrices that allow biologists to quantify
the quality of sequence alignments.

The raw score S is calculated by summing the scores for each aligned
position and the scores for gaps. Gap creation/extension scores are
inherent to the scoring system in use (BLAST, FASTA)

The score for an identity or a mismatch is given by the specified
substitution matrix (e.g., BLOSUM62).
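
The summation of the raw score S can be sketched as follows. This is only an illustration: instead of a real substitution matrix such as BLOSUM62 it uses a flat toy scheme (identity +4, mismatch -1), and it assumes one common gap convention in which a gap of length k costs gap_open + k * gap_extend. All names and values are assumptions made for the example.

```python
def raw_score(row_a, row_b, match=4, mismatch=-1, gap_open=-11, gap_extend=-1):
    """Sum per-column scores of an alignment given as two gapped strings.

    Toy scoring: a real search would look each residue pair up in a
    substitution matrix instead of using flat match/mismatch values.
    Simplification: adjacent gaps in different rows count as one gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            if not in_gap:          # first column of a gap: creation penalty
                score += gap_open
                in_gap = True
            score += gap_extend     # every gap column: extension penalty
        else:
            in_gap = False
            score += match if x == y else mismatch
    return score


# 6 identities (+24), 1 mismatch (-1), one gap of length 1 (-12)
s = raw_score("QKESG-PS", "QQESGLPS")
print(s)
```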

Devising a scoring system


Some popular scoring matrices are:

PAM (Percent Accepted Mutation): for evolutionary studies.
For example, PAM1 corresponds to 1 accepted point mutation per
100 amino acids.
BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding
common motifs. For example in BLOSUM62, the alignment is
created using sequences sharing no more than 62% identity.

How the matrices were created:

Very similar sequences were aligned.

From these alignments, the frequency of substitution between
each pair of amino acids was calculated and then PAM1 was built.

The full series of PAM matrices can be calculated by multiplying
the PAM1 mutation probability matrix by itself; each result is
then normalized to log-odds format.
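
The extrapolation from PAM1 to higher PAM distances is a repeated matrix multiplication applied to mutation probabilities (before any log-odds conversion). The sketch below shows the idea on an invented 3-letter alphabet; the numbers in toy_pam1 are hypothetical and are not taken from the real 20 x 20 PAM1 matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1-like" matrix: row i holds the probabilities that
# residue i stays the same or is replaced after one unit of evolution.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after multiplication, but the diagonal shrinks: over longer evolutionary distances a residue is less likely to remain unchanged.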

Devising a scoring system


Importance:

Scoring matrices appear in all analyses involving sequence comparison.
The choice of matrix can strongly influence the outcome of the analysis.
Understanding the theory underlying a given scoring matrix can aid in
making the proper choice:
-Some matrices reflect similarity: good for database searching
-Some reflect distance: good for phylogenies

Log-odds matrices, a normalisation method for matrix values:

S_ij = log( q_ij / (p_i p_j) )

S_ij is the logarithm of the ratio between the probability that two
residues, i and j, are aligned by evolutionary descent and the
probability that they are aligned by chance.
q_ij are the frequencies that i and j are observed to align in sequences
known to be related. p_i and p_j are their frequencies of occurrence in
the set of sequences.
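
A log-odds score of the usual form S_ij = log( q_ij / (p_i p_j) ) can be computed directly from the three frequencies. The sketch below uses base-2 logarithms (scores in bits); the frequencies are made-up numbers chosen only to show the sign of the result.

```python
import math


def log_odds(q_ij, p_i, p_j):
    """Log-odds score in bits: observed pair frequency vs. chance expectation."""
    return math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies, not taken from any real matrix.
favoured = log_odds(q_ij=0.04, p_i=0.1, p_j=0.1)     # seen 4x more often than chance
disfavoured = log_odds(q_ij=0.0025, p_i=0.1, p_j=0.1)  # seen 4x less often than chance
print(round(favoured, 2), round(disfavoured, 2))
```

Positive scores mark substitutions observed more often than expected by chance in related sequences, negative scores mark substitutions observed less often.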

Database search methods: Sequence Alignment


Two broad classes of sequence alignments exist:

Global alignment (not sensitive):
QKESGPSSSYC
VQQESGLVRTTC

Local alignment (faster):
ESG
ESG

The most widely used local similarity algorithms are:

Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/;
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
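
To make the idea of local alignment concrete, here is a minimal Smith-Waterman sketch applied to the two example sequences. It assumes a simple scoring scheme chosen for illustration (match +1, mismatch -2, linear gap penalty -2) rather than a real substitution matrix, so it is not what the listed servers run.

```python
def smith_waterman(a, b, match=1, mismatch=-2, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below 0, which makes it local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          h[i - 1][j] + gap,     # gap in b
                          h[i][j - 1] + gap)     # gap in a
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, part_a, part_b = smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC")
print(score, part_a, part_b)
```

With these scores the best local alignment is the shared ESG segment. The full matrix has len(a) x len(b) cells, which is why an exhaustive Smith-Waterman search over a whole database is slow compared with the heuristic tools.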

Which algorithm to use for database similarity search?


Speed:

BLAST > FASTA > Smith-Waterman (It is VERY SLOW and uses a
LOT OF COMPUTER POWER)

Sensitivity/statistics:

FASTA is more sensitive, misses fewer homologues
Smith-Waterman is even more sensitive.
BLAST calculates probabilities
FASTA is more accurate for DNA-DNA searches than BLAST

Genomics: Completed genomes as of 2002

Currently the genomes of over 600 organisms have been sequenced, by
whole-genome shotgun or map-based strategies:

Organism                                  Base pairs
54 Bacteria                               0.8-6 million
Yeast                                     15 million
C. elegans (roundworm)                    100 million
Drosophila (fruitfly)                     120 million
Arabidopsis (thale cress)                 130 million
Rice                                      435 million
Human                                     3 billion
Fugu (puffer fish)                        365 million
Anopheles (malaria-carrying mosquito)     278 million

This generates large amounts of information to be handled by individual
computers.

Tools to search databases


The dilemma: DNA or protein?

Search by similarity: using nucleotide seq. or using amino acid seq.

Is the comparison of two nucleotide sequences accurate?

By translating into amino acid sequence, are we losing information?

The genetic code is degenerate (Two or more codons can represent
the same amino acid)
Very different DNA sequences may code for similar protein sequences
We certainly do not want to miss those cases!
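
The effect of degeneracy can be shown with a small example: two invented DNA sequences that share only 2 of 9 bases yet encode the same tripeptide. Only the six codons needed for the example are listed, taken from the standard genetic code; the sequences themselves are hypothetical.

```python
# Subset of the standard genetic code (only the codons used below).
CODONS = {
    "CTT": "L", "TTG": "L",   # leucine
    "TCT": "S", "AGC": "S",   # serine
    "CGT": "R", "AGA": "R",   # arginine
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(seq_a, seq_b):
    """Number of identical positions in two equal-length sequences."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x == y)


dna_a = "CTTTCTCGT"
dna_b = "TTGAGCAGA"
print(identity(dna_a, dna_b), "of", len(dna_a), "bases identical")
print(translate(dna_a), translate(dna_b))
```

A nucleotide search would score these two sequences as nearly unrelated, while a protein-level comparison finds a perfect match.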

Reasons for translating


Comparing DNA sequences gives more random matches:

[Figure: two DNA alignments, "A good alignment with end-gaps" and "A very poor alignment"; the poor one still shows almost 50% identity!]

Conservation of protein in evolution (DNA similarity decays faster!)

Conclusion:

It is almost always better to compare coding sequences in their amino acid form,
especially if they are very divergent.
Very highly similar nucleotide sequences may give better results.
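
One way to see why DNA comparisons produce more random matches is to compute the identity expected by chance at a single ungapped position. The sketch assumes independent residues drawn from the given frequencies; the uniform frequencies used here are a simplification, since real base and amino acid compositions are uneven.

```python
def chance_identity(frequencies):
    """Probability that two independently drawn residues are identical."""
    return sum(p * p for p in frequencies)


dna_chance = chance_identity([0.25] * 4)        # 4 equally frequent bases
protein_chance = chance_identity([0.05] * 20)   # 20 equally frequent amino acids
print(f"DNA: {dna_chance:.0%}, protein: {protein_chance:.0%}")
```

Under these assumptions unrelated DNA sequences already agree at about one position in four, and allowing gaps raises the apparent identity further, whereas unrelated proteins agree at about one position in twenty.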

BLAST and FASTA variants


FASTA: Compares a DNA query to DNA database, or a protein query
to protein database
FASTX: Compares a translated DNA query to a protein database
TFASTA: Compares a protein query to a translated DNA database

BLASTN: Compares a DNA query to DNA database.
BLASTP: Compares a protein query to protein database.
BLASTX: Compares the 6-frame translations of DNA query to protein
database.
TBLASTN: Compares a protein query to the 6-frame translations of a DNA
database.
TBLASTX: Compares the 6-frame translations of DNA query to the 6-frame
translations of a DNA database (each sequence is comparable to
BLASTP searches!)

PSI-BLAST: Performs iterative database searches. The results from each round
are incorporated into a 'position specific' score matrix, which is
used for further searching
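
The choice among the BLAST variants follows mechanically from the type of query and the type of database. The helper below is a hypothetical lookup written for this summary, not part of any BLAST distribution.

```python
def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST variant from query/database types ('dna' or 'protein').

    `translate_both` selects TBLASTX for a DNA query against a DNA
    database, i.e. comparison at the protein level in all six frames.
    """
    table = {
        ("dna", "dna"): "tblastx" if translate_both else "blastn",
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",     # query translated in 6 frames
        ("protein", "dna"): "tblastn",    # database translated in 6 frames
    }
    key = (query_type.lower(), database_type.lower())
    if key not in table:
        raise ValueError(f"unknown combination: {key}")
    return table[key]


print(choose_blast_program("dna", "protein"))
print(choose_blast_program("protein", "dna"))
```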

A practical example of sequence alignment


http://www.ncbi.nlm.nih.gov

BLAST results

Detailed BLAST results

E value: the expectation value, i.e. the number of hits as good as
yours that are expected to be found by chance. The lower the E, the
more significant the score.
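
For reference, BLAST-style E values follow the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The sketch below evaluates it; the values of lambda and K depend on the scoring system, and the ones used here, like the lengths, are illustrative assumptions only.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least `raw_score`."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameters (assumed, not read from a real BLAST report).
LAM, K = 0.267, 0.041
M, N = 300, 50_000_000

for s in (40, 80, 120):
    print(s, round(bit_score(s, LAM, K), 1), f"{e_value(s, M, N, LAM, K):.2e}")
```

The E value falls exponentially as the score rises and grows linearly with database size, so the same alignment becomes less significant as databases grow.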

Database searching tips


Use the latest database version.

Use BLAST first, then a finer tool (FASTA, ...)

Search both strands when using FASTA.

Translate sequences where relevant

Search the 6-frame translation of a DNA database

E < 0.05 is statistically significant, usually biologically
interesting.

If the query has repeated segments, delete them and
repeat the search

Most widely used sites for sequence analysis


Sites for alignment of 2 sequences:

T-COFFEE (http://www.ch.embnet.org/software/TCoffee.html): more accurate
than ClustalW for sequences with less than 30% identity.
ClustalW (http://www.ch.embnet.org/software/ClustalW.html;
http://align.genome.jp)
bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html )
MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html )

Sites for DNA to protein translation:

These algorithms can translate DNA sequences in any of the 3 forward or 3
reverse sense frames.
Translate (http://au.expasy.org/tools/dna.html)
Translate a DNA sequence: (http://www.vivo.colostate.edu/molkit/translate/index.html )
Transeq (http://www.ebi.ac.uk/emboss/transeq)
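
What such translation tools compute can be sketched in a few lines. The example below builds the standard genetic code from its compact TCAG-ordered string and translates a short invented sequence in the three forward and three reverse frames; the frame labels (+1 to -3) are a naming choice for this sketch and individual tools may label frames differently.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks stop.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Translate a sequence in 3 forward (+) and 3 reverse (-) frames."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


frames = six_frames("ATGGCC")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```

This is also what BLASTX, TBLASTN and TBLASTX do internally before comparing at the protein level.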
