Bioinformatics Class

BIOINFORMATICS
"Biology easily has 500 years of exciting problems to work on."
--Donald E. Knuth
Ever since the structure of DNA was unraveled in 1953, molecular biology has witnessed tremendous advances.
The need to process the ever-growing volume of biological data has created entirely new problems that are interdisciplinary in nature.
Scientists from the biological sciences are the creators & ultimate users of this data.
Because of the huge size & high complexity of biological data, help is required from many other disciplines, in particular mathematics & computer science.
This need has created a new field: Computational Molecular Biology & Bioinformatics
The Commercial Market
Current bioinformatics market is worth $300 million / year (half software)
Prediction: $2 billion / year in 5-6 years
~50 bioinformatics companies:
Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics,
CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic,
GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools,
Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist,
eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic,
GeneFormatics, Molecular Simulations, Bioinformatics Solutions, BIOCON (India)
Computational Molecular Biology & Bioinformatics (CMB)
CMB consists of the development and use of computer science & mathematical techniques to solve problems in molecular biology.
Bioinformatics
Unit 1. Basic Concepts
Unit 2. Suffix Trees and Applications
Unit 3. Sequence Alignment: pairwise alignment, multiple alignments
Unit 4. Sequencing
Unit 5. Motif Prediction
Bioinformatics
Unit 1. Basic Concepts of Molecular Biology
Cellular architecture, nucleic acids (RNA & DNA), DNA replication, repair and recombination.
Transcription, genetic code, gene expression, protein structure and function, molecular biology tools.
Statistical methods: estimation, hypothesis testing, random walks, Markov models (HMM).
Unit 2. Suffix Trees
Definition and examples, Ukkonen's linear-time suffix tree algorithm, applications (exact string matching,
longest common substrings of two strings, recognizing DNA contamination).
Pairwise sequence alignment (edit distance, dynamic programming calculation of edit distance,
string similarity, gaps).
Unit 3. Sequence Alignment
Pairwise sequence alignment (local), HMM for pairwise alignment.
Multiple string alignments: need for MSA, family & superfamily representation, multiple sequence
comparison for structural inferences, multiple alignments with sum-of-pairs and consensus objective functions.
Profile HMM for multiple sequence alignment.
Database searching for similar sequences (FASTA, BLAST), PAM and BLOSUM substitution matrices.
Unit 4. Sequencing
Fragment assembly (shortest common superstring, algorithms based on multigraphs),
sequencing by hybridization, protein sequencing.
Unit 5. Motif Prediction
Motif prediction, gene prediction, introduction to protein structure prediction.
BOOKS RECOMMENDED
Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. (Chapters: 5, 6, 7, 10, 11, 14, 15)
J. Setubal & J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, 1997. (Chapters: 1, 8)
W.J. Ewens & G.R. Grant, Statistical Methods in Bioinformatics, Springer, 1989.
R. Durbin, S.R. Eddy, A. Krogh and G.J. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998. (Chapters: 3, 5 & 6)
R.C. Deonier, S. Tavaré, M.S. Waterman, Computational Genome Analysis, Springer, 2005.
N.C. Jones and P.A. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
D.E. Krane, M.L. Raymer, Fundamental Concepts of Bioinformatics, Pearson Education, 2003.
J. Tisdall, Beginning Perl for Bioinformatics, O'Reilly, 2001.
M.S. Waterman, Introduction to Computational Biology, CRC Press, 2000.
A. Baxevanis & B. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley-Interscience, 2001.
M. Ridley, Genome: The Autobiography of a Species, Fourth Estate, 1999.
Lodish, Berk, Zipursky, Baltimore & Darnell, Molecular Cell Biology, W.H. Freeman, 2000.
Class 2
UNIT 1
LIFE AT ITS SIMPLEST
DNA
RNA
PROTEIN
GENETIC-CODE
QUICK PRIMER ON GENETICS
ALMOST EVERY CELL IN THE HUMAN BODY CONTAINS A COPY OF THE GENOME
THINK OF THE GENOME AS A BOOK: THE BLUEPRINT THAT
CONTAINS DETAILS OF WHAT EACH INDIVIDUAL OUGHT TO BE LIKE.
NOW, EACH HUMAN GENOME CONTAINS 23 PAIRS OF CHROMOSOMES.
IF THE GENOME WERE A BOOK, THEN THINK OF
CHROMOSOMES AS THE CHAPTERS IN IT
EACH OF THESE CHAPTERS TELLS SEVERAL
THOUSAND STORIES CALLED GENES,
e.g. the colour of an individual's skin, eyes & hair, left- or right-handedness,
and many other traits
What did Venter do?
He figured out what language the book was written in & how to read it.
He figured out the grammar of the book &, therefore, how to write in the language.
This is the kind of knowledge that may allow him to create life.
Venter: magic, new life?
Venter used his knowledge to design Mycoplasma laboratorium, a synthetic chromosome
that is 381 genes long. Transplanted into a living cell, it is expected to take control
of the cell and become a new life-form.
It is hoped that such a new life-form could mop up excess carbon dioxide and
contribute to resolving problems like global warming.
Contd
Until now, scientists have managed to take the genome out of one cell, put it into
another cell and so create an altogether new organism.
But nobody knew how to create the genome itself. Venter did just that.
Living vs. Nonliving
Both kinds of matter are composed of the same atoms and conform to the same
physical and chemical rules.
What is the difference then?
Living vs. Nonliving
Living things can move, reproduce, grow, eat
They actively participate in their environment
Living beings act the way they do due to a complex array of chemical reactions
that occur inside them. These reactions never cease.
A living organism is constantly exchanging matter & energy with its surroundings.
Anything that is in equilibrium with its surroundings can generally be considered dead
(exceptions are vegetative forms, like seeds, and viruses, which may be completely
inactive for long periods of time and are not dead.)
Life starts
Life started some 3.5 billion years ago, relatively soon after the Earth itself was formed.
The first life forms were very simple, but over billions of years a continuously
acting process called evolution made them evolve and diversify
Both complex and simple organisms have a similar molecular chemistry, or biochemistry.
The main actors in the chemistry of life are molecules called proteins and nucleic acids
ACTORS IN CHEMISTRY OF LIFE: Proteins & nucleic acids
Proteins are responsible for what a living being is and does in a physical sense.
"We are our proteins" --Russell Doolittle
Nucleic acids, on the other hand, encode the information necessary to produce proteins
and are responsible for passing along this recipe to subsequent generations
Much recent research is devoted to understanding the structure and function of
proteins and nucleic acids
Proteins
Different Roles of Proteins
Enzymes
Carry signals
Transport small molecules such as oxygen
Form cellular structures (tissues)
Regulate cell processes (such as defense mechanisms)
What are proteins made of?
Amino acids: a chain of amino acids = protein
Amino acids
Backbone of polypeptide chain
Convention
Begin at N-terminal
End at C-terminal
Torsion or rotation angles around:
Cα-N bond (φ)
Cα-C bond (ψ)
PROTEIN STRUCTURE
A protein is not just a linear sequence of residues (its primary structure)
Proteins actually fold in 3D, presenting secondary, tertiary and quaternary structures
The 3D shape of a protein is related to its function
A protein can be made of 20 different kinds of amino acids, which makes the
resulting 3D structure very complex and without symmetry
No simple and accurate method for determining the 3D structure is known.
Genomic Code
DNA
deoxyribonucleic acid
Basic unit = nucleotide
Sugar, phosphate, base (A, G, T, C): adenine, guanine, thymine, cytosine.
Contd
DNA is a chain of simpler molecules
Actually it is a double chain (two strands)
Each single chain has a backbone consisting of repetitions of the same basic unit
This unit is formed by a sugar molecule called 2'-deoxyribose attached to a phosphate residue
The sugar molecule contains five carbon atoms, labeled 1' through 5'
DNA molecules also have an orientation (a strand starts at the 5' end and finishes at the 3' end)
Contd
Attached to each 1' carbon in the backbone are other molecules called bases
There are 4 kinds of bases:
A (ADENINE)
G (GUANINE)
C (CYTOSINE)
T (THYMINE)
CONTD
Bases A & G belong to a larger group of substances called purines, whereas C & T belong to the pyrimidines
When we see the basic unit of a DNA molecule as consisting of the sugar, the phosphate, & its base, we call it a nucleotide
Bases & nucleotides are not the same thing.
A DNA molecule having only a few nucleotides is referred to as an oligonucleotide
DNA molecules in nature are very long, much longer than proteins
In a human cell, DNA molecules have up to hundreds of millions of nucleotides
Contd
DNA molecules are double stranded
The two strands are tied together in a helical structure (Watson & Crick, 1953)
Each base in one strand is paired with a base in the other strand
A pairs with T (complementary bases / Watson-Crick base pairs)
C pairs with G
The base pair (bp) provides the unit of length
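The pairing rule above (A with T, C with G) and the 5' to 3' orientation of each strand can be sketched in a few lines of Python. This is only an illustration; the function names complement and reverse_complement are chosen here and do not come from the slides.

```python
# Watson-Crick pairing as stated above: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5' -> 3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("CCTGAG"))          # GGACTC
    print(reverse_complement("CCTGAG"))  # CTCAGG
```

Because the two strands run in opposite directions, the opposite strand is usually written as the reverse complement rather than the plain complement.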
RNA
RNA is a nucleic acid made from a long chain of nucleotides
Each nucleotide consists of a nitrogenous base, a ribose sugar, and a phosphate
RNA is very similar to DNA, but differs from it in the following basic
compositional and structural ways
DNA vs. RNA
DNA
DNA is double stranded
Very long chain of nucleotides
DNA contains deoxyribose
DNA is more stable
Complementary nucleotide to adenine is thymine
DNA performs essentially one function
RNA
RNA is single stranded
Comparatively shorter chain of nucleotides
RNA contains ribose
Less stable, more prone to hydrolysis
Complementary nucleotide to adenine is uracil
There are different kinds of RNA performing different functions
Class 3
26/06/08
Central Dogma of Molecular Biology
How does the information in DNA result in proteins?
A promoter is a region before each gene in the DNA that serves as an indication
to the cellular machinery that a gene is ahead.
Once the beginning of a gene has been recognized, a copy of the gene is made on an RNA molecule.
The resulting RNA is mRNA (the same sequence as the gene, with U substituted for T).
This process is called transcription.
The mRNA will be used to manufacture a protein.
After transcription, the introns are spliced out of the mRNA; introns are the
parts of a gene that are not used in protein synthesis
After the introns are spliced out, the shortened mRNA, containing copies of only the
exons plus regulatory regions at the beginning & end, leaves the nucleus
Contd
Because of the intron/exon phenomenon, we use different names for the entire gene
& for the spliced sequence consisting of exons only.
The former is called genomic DNA & the latter complementary DNA or cDNA
tRNAs are the molecules that actually implement the genetic code in a process
called translation. They make the connection between a codon and the specific
amino acid this codon codes for.
When a stop codon appears, no tRNA associates with it and the synthesis ends.
Central Dogma: DNA -> RNA -> Protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
RNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
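The DNA -> RNA -> Protein example above can be reproduced with a minimal Python sketch. It assumes the DNA given is the coding strand (so transcription is just T -> U), ignores splicing, and uses the standard genetic code; the names transcribe, translate and CODON_TABLE are illustrative, not from the slides.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in UCAG order for each codon position
# ("*" marks a stop codon).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: copy the coding strand, substituting U for T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon or the end is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # no tRNA matches a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(mrna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(mrna))  # PEPTIDE
```

A real cell starts translation at a start codon; the sketch simply starts at the first base, which is enough to reproduce the slide's example.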
JUNK DNA
Genes are certain contiguous regions of the chromosome, but
they do not cover the entire molecule
There are intergenic regions which do not have any known
function. They are called junk DNA because they appear to
be there for no particular use.
Recent research has shown that junk DNA has more information
content than previously believed
The amount of junk DNA varies from species to species
Humans: more than 90% junk DNA
OPEN READING FRAME (ORF)
An ORF in a DNA sequence is a contiguous stretch of the sequence beginning at a
start codon and containing an integral number of codons, none of which is a stop codon
Consider the sequence TAATCGAATGGGC
first reading frame: TAA TCG AAT GGG
second reading frame: AAT CGA ATG GGC
third reading frame: ATC GAA TGG
fourth reading frame: TCG AAT GGG
The fourth frame is just the first frame without its first codon, so one strand has
only 3 distinct reading frames
Sometimes we talk about 6, not 3, different frames in a sequence (look
at the opposite strand and count another 3)
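The reading frames listed above can be generated mechanically. The following Python sketch splits a sequence into its three frames, adds the three frames of the opposite strand, and collects ORFs as defined on this slide. It assumes ATG as the only start codon and TAA, TAG, TGA as stop codons; all function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def codons(seq, offset):
    """Split seq into complete codons, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def reading_frames(seq):
    """The three reading frames of one strand."""
    return [codons(seq, offset) for offset in range(3)]


def six_frames(seq):
    """Three frames of the strand plus three of its reverse complement."""
    opposite = "".join(COMPLEMENT[base] for base in reversed(seq))
    return reading_frames(seq) + reading_frames(opposite)


def find_orfs(seq):
    """Stretches that begin at a start codon and contain no stop codon."""
    orfs = []
    for frame in reading_frames(seq):
        current = None
        for codon in frame:
            if current is None:
                if codon == START_CODON:
                    current = [codon]
            elif codon in STOP_CODONS:
                orfs.append("".join(current))
                current = None
            else:
                current.append(codon)
        if current is not None:  # ORF runs to the end of the sequence
            orfs.append("".join(current))
    return orfs


if __name__ == "__main__":
    for frame in reading_frames("TAATCGAATGGGC"):
        print(" ".join(frame))
    print(find_orfs("TAATCGAATGGGC"))  # ['ATGGGC']
```

In the slide's example only the second frame contains a start codon (ATG), so the single ORF found is ATGGGC.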
Genome
The complete set of chromosomes inside a cell is called a genome.
The number of chromosomes in a genome is characteristic of the species
Almost every cell in a human being has 46 chromosomes, whereas in mice this number is 40
Is the genome like a computer program?
Genome = computer program
The genome of an organism can be seen as a computer program that completely
specifies the organism
Cell machinery = interpreter of this program
Biological functions performed by proteins = execution of this program
Class 4
01/07/08
The Eye of the Fly
Fruit flies (Drosophila melanogaster) have a gene called eyeless which, if it is
knocked out (i.e. eliminated from the genome using molecular biology methods),
results in fruit flies without eyes
It is evident that the eyeless gene plays a role in eye development
Researchers have identified a human gene responsible for a condition called aniridia
In humans who are missing this gene (or in whom the gene has mutated just enough
for its protein product to stop functioning properly), the eyes develop without irises
Contd
If the gene for aniridia is inserted into an eyeless Drosophila "knockout", it causes
the production of normal Drosophila eyes. It is an interesting observation.
Could there be some similarity in how eyeless and aniridia function, even though
flies & humans are vastly different organisms?
To gain insight into how eyeless & aniridia work, we can compare their sequences
Contd
20 years ago, finding a similarity between the eyeless & aniridia DNA sequences
would have been like looking for a needle in a haystack
Most scientists compared gene sequences by hand, aligning them one under
the other in a word processor & looking for matches character by character.
This was time consuming & hard on the eyes.
Contd
In the late 1980s, fast computer programs for comparing sequences changed
molecular biology forever
Many tools that are widely available to the biology community, including everything
from multiple alignment, phylogenetic analysis, motif identification, &
homology modeling software to web-based database search services, rely on pairwise
sequence comparison algorithms as a core element of their function
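The slides do not say which pairwise comparison algorithm these tools use. As one simple instance, the edit distance named in Unit 2 of the syllabus can be computed by dynamic programming. The Python sketch below assumes unit cost for each insertion, deletion and substitution, which is a simplification; practical tools use substitution matrices and gap penalties.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "ACGT"))      # 0
    print(edit_distance("kitten", "sitting"))  # 3
```

The table has (len(a) + 1) x (len(b) + 1) cells, but only the previous row is kept, so memory grows with the length of one sequence rather than the product of both.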
How is the genome studied?
Say, the human genome
Sequencing: The basic information we want to extract from any piece of DNA is its
base pair sequence. The process of obtaining this information is called sequencing
A human chromosome has around 10^8 base pairs,
but the largest pieces of DNA that can be sequenced in the laboratory are about 700 bp long.
=> there is a gap of some 10^5 between the scale of what we can actually sequence
and the size of a chromosome (10^8 / 700 is about 1.4 x 10^5).
This gap is at the heart of many problems in computational biology
Cutting & Breaking DNA
Because a DNA molecule is so long, some tool to cut it at specific points (like a
pair of scissors) or to break it apart in some way is needed
The pair of scissors is provided by restriction enzymes
They cut DNA molecules at all places where a certain sequence appears (usually a
palindromic sequence, i.e. one that equals its own reverse complement)
Some common types of restriction enzymes are 4-cutters, 6-cutters, & 8-cutters,
named for the length of the sequence they recognize
It is rare to see an odd cutter because sequences of odd length cannot be
palindromes in this sense: the middle base would have to be its own complement
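A short Python sketch of the two ideas above: testing whether a recognition sequence is a palindrome in the DNA sense, and listing every position where it occurs. The sequence GAATTC (the recognition site of the 6-cutter EcoRI) is used as an example that is not in the slides, and the function names are illustrative. The sketch reports where the site occurs, not the exact position of the cut within it.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindromic(site: str) -> bool:
    """A DNA palindrome reads the same on both strands in the 5' -> 3' direction."""
    return site == reverse_complement(site)


def find_sites(dna: str, site: str) -> list:
    """0-based start positions of every (possibly overlapping) occurrence of site."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


if __name__ == "__main__":
    print(is_palindromic("GAATTC"))                  # True
    print(is_palindromic("GAATC"))                   # False (odd length)
    print(find_sites("AAGAATTCGGGAATTC", "GAATTC"))  # [2, 10]
```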
COPYING DNA
Also known as DNA amplification
Very important in DNA cloning
Given a piece of DNA, one way of obtaining further copies is to use nature itself
We insert this piece into the genome of an organism (the host) and then let the
organism multiply.
As the host multiplies, the inserted piece is copied along with the host's original DNA.
Then we kill the hosts & dispose of the rest, keeping only the inserts in the
desired quantity.
DNA produced in this way is called recombinant.
Reading & measuring DNA
Reading is done with a technique known as gel electrophoresis, which is based on
the separation of molecules by their size
This process involves a gel medium & a strong electric field
HUMAN GENOME PROJECT
What is the Human Genome Project?
U.S. govt. project coordinated by the Department of Energy and the National
Institutes of Health
goals (1998-2003)
identify the approximately 100,000 genes in human DNA (the estimate at the time)
determine the sequences of the 3 billion bases that make up human DNA
store this information in databases
develop tools for data analysis
address the ethical, legal, and social issues that arise from genome research
Why is the Department of Energy involved?
-after atomic bombs were dropped during World War II, Congress told DOE to conduct
studies to understand the biological and health effects of radiation and chemical
by-products of all energy production
-best way to study these effects is at the DNA level
Whose genome is being sequenced?
the first reference genome is a composite genome from several different people
generated from 10-20 primary samples taken from numerous anonymous donors across
racial and ethnic groups
Benefits of HGP Research

improvements in medicine
microbial genome research for fuel and
environmental cleanup
DNA forensics
improved agriculture and livestock
better understanding of evolution and human
migration
more accurate risk assessment
Ethical, Legal, and Social Implications of HGP Research
fairness in the use of genetic information
privacy and confidentiality
psychological impact and stigmatization
genetic testing
reproductive issues
education, standards, and quality control
commercialization
conceptual and philosophical implications
For More Information...
Human Genome Project Information Website
http://www.ornl.gov/hgmis
Contd
A large effort like this cannot be undertaken by a single lab!
On the computer science side, databases with updated & consistent information
have to be maintained,
and fast access to the data has to be provided
After the sequencing there is still the difficult task of analyzing the data obtained
Contd..
The development of treatments for genetic diseases based on data produced by the
Human Genome Project is still ongoing, although encouraging pioneering efforts
have already yielded results.
Class 5
08/07/08
What is a database?
A collection of information, usually stored in an electronic format, that can be
searched by a computer.
A brief history of biological databases
1965 M. O. Dayhoff et al. publish the Atlas of Protein Sequence and Structure
1982 EMBL initiates a DNA sequence database, followed within a year by GenBank
(then at LANL) and in 1984 by the DNA Data Bank of Japan
1988 EMBL/GenBank/DDBJ agree on a common format for data elements
Biological databases: why?
There are two main functions of biological databases:
To make biological data available to scientists.
As much as possible of a particular type of information should be available in one
single place (book, site, database). Published data may be difficult to find or
access, and collecting it from the literature is very time-consuming. And not all
data is actually published explicitly in an article (genome sequences!).
To make biological data available in computer-readable form.
Since analysis of biological data almost always involves computers, having the data
in computer-readable form (rather than printed on paper) is a necessary first step.
The different types of databases
One may characterize the available biological databases by several different
properties. Here is a list to help you think about the various properties a
particular database may have
Type of data
nucleotide sequences
protein sequences
protein sequence patterns or motifs
macromolecular 3D structure
gene expression data
metabolic pathways
Contd
Data entry and quality control

Scientists (teams) deposit data directly
Appointed curators add and update data
Are erroneous data removed or marked?
Type and degree of error checking
Consistency, redundancy, conflicts, updates
Primary or derived data
Primary databases: experimental results entered directly into the database
Secondary databases: results of analysis of primary databases
Aggregate of many databases
Links to other data items
Combination of data
Consolidation of data
Contd
Technical design
Flat-files
Relational database (SQL)
Object-oriented database (e.g. CORBA, XML)
Maintainer status
Large, public institution (e.g. EMBL, NCBI)
Quasi-academic institute (e.g. Swiss Institute of Bioinformatics, TIGR)
Academic group or scientist
Commercial company
Availability
Publicly available, no restrictions
Available, but with copyright
Accessible, but not downloadable
Academic, but not freely available
Proprietary, commercial; possibly free for academics
Accession codes vs identifiers
Many databases in bioinformatics (SWISS-PROT, EMBL, GenBank, Pfam) use a system
where an entry can be identified in two different ways.
Basically, it has two names:
Identifier
Accession code (or number)
Contd
Identifier
An identifier ("locus" in GenBank, "entry name" in SWISS-PROT) is a string of
letters and digits that generally is interpretable in some meaningful way by a
human, for instance as a recognizable abbreviation of the full protein or gene name.
SWISS-PROT uses a system where the entry name consists of two parts: the first
denotes the protein and the second denotes the species it is found in. For example,
KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens.
An identifier can change. For example, the database curators may decide that the
identifier for an entry is no longer appropriate. However, this happens so rarely
that it is not really a big problem.
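The two-part entry name can be taken apart with a one-line split. This Python sketch assumes the two parts are separated by a single underscore, as in the KRAF_HUMAN example; the function name split_entry_name is illustrative.

```python
def split_entry_name(entry_name: str) -> tuple:
    """Split a SWISS-PROT style entry name into (protein, species) parts.

    Assumes the form PROTEIN_SPECIES with an underscore separator,
    as in the KRAF_HUMAN example.
    """
    protein, species = entry_name.split("_", 1)
    return protein, species


if __name__ == "__main__":
    print(split_entry_name("KRAF_HUMAN"))  # ('KRAF', 'HUMAN')
```

Because identifiers can change, code that needs a stable reference should store the accession code and treat the entry name as a human-readable label only.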
Contd
Accession code (number)
An accession code (or number) is a number (possibly with a few
characters in front) that uniquely identifies an entry in its database. For

example, the accession code for KRAF_HUMAN in SWISS-PROT is
P04049.
The main conceptual difference from the identifier is that it is supposed
to be stable: any given accession code will, as soon as it has been
issued, always refer to that entry, or its ancestors. It is often called the
primary key for the entry. The accession code, once issued, must
always point to its entry, even after large changes have been made to
the entry. This means that in discussions about specific database
entries (e.g. an article about a specific protein), one should always give
the accession code for the entry in the relevant database.
When two entries are merged into a single one, the new entry will have both
accession codes, where one will be the primary and the other the secondary
accession code. When an entry is split into two, both new entries will get new
accession codes, but will also have the old accession code as a secondary code.
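The merge and split rules above can be modelled with a small data structure. The Python sketch below is a toy model, not how any real database is implemented; the Entry class, the merge and split functions, and the accession codes X00001 to X00004 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    identifier: str                    # may change over time
    primary_accession: str             # stable once issued
    secondary_accessions: List[str] = field(default_factory=list)

    def accessions(self) -> List[str]:
        """All accession codes that must keep pointing at this entry."""
        return [self.primary_accession, *self.secondary_accessions]


def merge(kept: Entry, absorbed: Entry, identifier: str) -> Entry:
    """Merged entry keeps one primary code; all other codes become secondary."""
    return Entry(
        identifier,
        kept.primary_accession,
        kept.secondary_accessions + absorbed.accessions(),
    )


def split(entry: Entry, new_entries: List[Tuple[str, str]]) -> List[Entry]:
    """Each new entry gets a new primary code and inherits the old codes."""
    return [
        Entry(identifier, accession, entry.accessions())
        for identifier, accession in new_entries
    ]


if __name__ == "__main__":
    # Hypothetical identifiers and accession codes, for illustration only.
    merged = merge(Entry("A_HUMAN", "X00001"), Entry("B_HUMAN", "X00002"), "A_HUMAN")
    print(merged.accessions())  # ['X00001', 'X00002']
    for part in split(merged, [("C_HUMAN", "X00003"), ("D_HUMAN", "X00004")]):
        print(part.identifier, part.accessions())
```

In both operations no accession code is ever discarded, which is what lets an old citation of a code still lead to the current entry.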
Nucleotide sequence databases

Primary nucleotide sequence databases
The databases EMBL, GenBank, and DDBJ are the three primary
nucleotide sequence databases:
They include sequences submitted directly by scientists and genome
sequencing groups, and sequences taken from the literature and patents.
There is comparatively little error checking and there is a fair amount of
redundancy.
The entries in the EMBL, GenBank and DDBJ databases are
synchronized on a daily basis, and the accession numbers are
managed in a consistent manner between these three centers.
The nucleotide databases have reached such large sizes that they are
available in subdivisions that allow searches or downloads that are
more limited, and hence less time-consuming. For example, GenBank
currently has 17 divisions.
There are no legal restrictions on the use of the data in these
databases. However, there are some patented sequences in the
databases.
Contd
EMBL www.ebi.ac.uk/embl/
The EMBL (European Molecular Biology Laboratory) nucleotide

sequence database is maintained by the European Bioinformatics
Institute (EBI) in Hinxton, Cambridge, UK.
GenBank www.ncbi.nlm.nih.gov/Genbank/
The GenBank nucleotide database is maintained by the National Center
for Biotechnology Information (NCBI), which is part of the National
Institutes of Health (NIH), a federal agency of the US government.
It can be accessed and searched through the Entrez system at NCBI, or
one can download the entire database as flat files.
DDBJ www.ddbj.nig.ac.jp
The DNA Data Bank of Japan began as a collaboration with EMBL and
GenBank. It is run by the National Institute of Genetics. One can search
for entries by accession number.
Other nucleotide sequence databases: secondary databases
The following databases contain subsets of the EMBL/GenBank databases. Some also
contain more information or links than the primary ones, or have a different organization of
the data to better serve some specific purpose. However, the nucleotide sequences themselves
should always be available in the EMBL/GenBank databases. In this sense, the databases
below are secondary databases.
UniGene http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene
The UniGene system attempts to process the GenBank sequence data into a non-redundant
set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a
unique gene, as well as related information such as the tissue types in which the gene has
been expressed and map location.
SGD http://www.yeastgenome.org/
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular
biology and genetics of the yeast Saccharomyces cerevisiae.
EBI Genomes www.ebi.ac.uk/genomes/
This web site provides access and statistics for the completed genomes, and information
about ongoing projects.
Genome Biology www.ncbi.nlm.nih.gov/Genomes/
The Genome Biology site at NCBI contains information about the available complete
genomes.
Ensembl www.ensembl.org
Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software
system which produces and maintains automatic annotation on eukaryotic genomes.
Protein sequence databases
The two protein sequence databases SWISS-PROT and PIR are different from the
nucleotide databases in that they are both curated.
This means that groups of designated curators (scientists) prepare the entries
from the literature and/or contacts with external experts.
SWISS-PROT, TrEMBL
www.expasy.ch/sprot
SWISS-PROT is a protein sequence database which strives to provide a high

level of annotations (such as the description of the function of a protein, its
domains structure, post-translational modifications, variants, etc.), a minimal
level of redundancy and high level of integration with other databases.
It was started in 1986 by Amos Bairoch in the Department of Medical
Biochemistry at the University of Geneva. This database is generally considered
one of the best protein sequence databases in terms of the quality of the
annotation. Its size is given in the table below.
TrEMBL is a computer-annotated supplement of SWISS-PROT that contains
all the translations of EMBL nucleotide sequence entries not yet integrated in
SWISS-PROT. The procedure that is used to produce it was developed by Rolf
Apweiler. The annotation of an entry in TrEMBL has not (yet) reached the
standards required for inclusion into SWISS-PROT proper.
The SWISS-PROT database has some legal restrictions: the entries
themselves are copyrighted, but freely accessible and usable by academic
researchers. Commercial companies must pay a license fee to SIB.
PIR pir.georgetown.edu
The Protein Information Resource (PIR) is a division of the National Biomedical

Research Foundation (NBRF) in the US. It is involved in a collaboration with the
Munich Information Center for Protein Sequences (MIPS) and the Japanese
International Protein Sequence Database (JIPID). The PIR-PSD (Protein
Sequence Database) release 70.01 (22 Oct 2000) contains 254,293 entries.
PIR grew out of Margaret Dayhoff's work in the middle of the 1960s. It strives to
be comprehensive, well-organized, accurate, and consistently annotated.
However, it is generally believed that its entry annotation does not reach the
same level of completeness as that of SWISS-PROT. Although SWISS-PROT and PIR
overlap extensively, there are still many sequences which can be found in only
one of them.
One can search for entries or do sequence similarity searches at the PIR site.
PIR also produces the NRL-3D, which is a database of sequences extracted
from the three-dimensional structures in the Protein Databank (PDB).
It appears that the PIR web site, and possibly also the underlying database, has
improved considerably over the past year. This means that if one is interested
in protein sequences, there is now even more reason to check out PIR.
Other relevant databases

GeneCards www.genecards.org
GeneCards is a database of human genes, their products and

their involvement in diseases. It offers concise information about
the functions of all human genes that have an approved symbol,
as well as selected others. It is a typical example of a
secondary database, which contains many links to other
databases, and attempts to consolidate the information that is
available for a specific class of entity, in this case human genes.
GeneLynx www.genelynx.org
GeneLynx is a database of Web links for human genes. It

contains pointers to a large number of other databases. This is
also a typical secondary database. It is maintained by Boris
Lenhard and Wyeth Wasserman at CGB, KI, Sweden.
Contd
KEGG www.genome.ad.jp/kegg/
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to computerize
current knowledge of molecular and cellular biology in terms of the information
pathways that consist of interacting molecules or genes and to provide links from
the gene catalogs produced by genome sequencing projects.
Amos' WWW links page
www.expasy.org/links.html
A page of many links to biological databases and/or web sites
POPULAR BIOINFORMATICS DATABASES
http://everest.bic.nus.edu.sg/~bhuvana/lsm2104/popualr-bioinformatics-databases.htm
Class 6
10/07/08
[Figure: Growth of the GenBank database, in base pairs and sequences]
www.ncbi.nlm.nih.gov
Created in 1988 as part of the National
Library of Medicine at NIH
Establish public databases
Research in computational biology
Develop software tools for sequence
analysis
Disseminate biomedical information
Types of databases at NCBI

Primary databases
Original submissions by experimentalists

Content controlled by the submitter
Examples: GenBank, SNP, GEO
Derivative databases
Built from primary data
Content controlled by third party (NCBI)
Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain, Gene
Entrez
Literature & Text

PubMed
15 million citations in MEDLINE
Links to participating online journals
Books
Linked from PubMed and other records
Searchable from within Entrez
Nucleotide databases
Primary
  GenBank / EMBL / DDBJ      54,694,591
Derivative
  RefSeq                      1,132,972
  Third Party Annotation          4,763
  PDB                             5,887
Total                        55,838,213
EMBL / GenBank / DDBJ (EMBL: European Molecular Biology Laboratory)
Archive containing all sequences from:
-genome projects
-sequencing centers
-individual scientists
-patent offices
Database is doubling every 15 months
Sequences from >200,000 different species
>1000 new species added every month
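As a rough worked example of what a fixed doubling time implies, the sketch below projects database size under idealized exponential growth. The function name and the starting figure are illustrative assumptions, not values published by NCBI or EMBL.

```python
def projected_size(current_size: float, months_ahead: float,
                   doubling_months: float = 15.0) -> float:
    """Size after `months_ahead` months if it doubles every `doubling_months` months."""
    return current_size * 2 ** (months_ahead / doubling_months)


# Hypothetical starting point of 55 million records.
start = 55_000_000
for years in (1, 5, 10):
    print(f"{years:2d} years: about {projected_size(start, 12 * years):,.0f} records")
```

With a 15-month doubling time, five years corresponds to four doublings, so the projected size is 16 times the starting size.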
Protein Databases
GenPept
CDS from GenBank entries
TrEMBL (1996)
Automatic CDS translations from EMBL
Highly redundant
Not all experimentally determined
Many inaccuracies
Secondary protein databases
SWISS-PROT (1986)
-Best annotated, least redundant
PIR (Protein Information Resource)
-More automated annotation
-Collaborations with MIPS and JIPID
UniProt (2003)
UniProt (Universal Protein Resource) is a central
repository of protein sequence and function
created by joining the information contained in
Swiss-Prot, TrEMBL, and PIR.
UniProt
UniProt Knowledgebase (UniProtKB)
-Central access point for extensive curated protein
information, including function, classification, and
cross-reference.
UniProt Non-redundant Reference (UniRef)
-Set of databases that combine closely related
sequences into a single record to speed searches.
UniProt Archive (UniParc)
-Comprehensive repository, reflecting the history of all
protein sequences.
-No annotation, used internally
NCBI Derivative Sequence Data
[Figure: diagram of sequence fragments with the labels Labs, GenBank, Curators, RefSeq, Genome Assembly, Algorithms, and UniGene]
High-throughput DNA sequencing
Top image: confocal detection by the MegaBACE sequencer
of fluorescently labeled DNA
Bottom image: computer image of sequence read by
automated sequencer
The trend of data growth
The 21st century is a century of biotechnology & bioinformatics:
Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every one and a half years.)
Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel.
Proteomics: Global protein analysis generates large mass spectra libraries.
Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized
Glycomics: Global sugar metabolism analysis
[Figure: growth of GenBank, nucleotides (billions, axis 0 to 8) against years (1980 to 2000)]
How to handle the large amount of information?
[Cartoon: Drew Sheneman, The Newark Star-Ledger, New Jersey]
Answer: bioinformatics and Internet
Bioinformatics: NEED FOR ALGORITHMS?
In the 1960s: the birth of bioinformatics
-IBM 7090 computer
-Margaret Oakley Dayhoff created:
 the first protein database
 the first program for sequence assembly
There is a need for computers and algorithms that allow:
accessing, processing, storing, sharing, retrieving, visualizing, and annotating data
Why do we need the Internet?
"Omics" projects and the information associated with them involve a huge amount
of data that is stored on computers all over the world.
Because it is impossible to maintain up-to-date copies of all relevant
databases within the lab, access to the data is via the Internet.
[Figure: diagram with the labels Database storage, You are here, Your request, and Results]
Scope
Make you familiar with bioinformatics resources available on the web
They are big databases and searching either one should produce
similar results because they exchange information routinely.
-GenBank (NCBI): http://www.ncbi.nlm.nih.gov
-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp
-TIGR: http://tigr.org/tdb/tgi
-Yeast: http://yeastgenome.org
-E. coli: http://colibase.bham.ac.uk/blast/
Specialized databases: Tissues, species
-ESTs (Expressed Sequence Tags)
~at NCBI http://www.ncbi.nlm.nih.gov/dbEST
~at TIGR http://tigr.org/tdb/tgi
- ...many more!
Protein (amino acid) databases
They are big databases too:
-Swiss-Prot (very high level of annotation)
http://au.expasy.org/
-PIR (Protein Information Resource): the world's most
comprehensive catalog of information on proteins
http://www.pir.uniprot.org/
Translated databases:
-TrEMBL (translated EMBL): includes entries that have
not yet been annotated and integrated into Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html
-GenPept (translation of coding regions in GenBank)
-PDB (sequences derived from the 3D structures in the
Brookhaven PDB) http://www.rcsb.org/pdb/
Database homology searching
Use algorithms that give searches an efficient mathematical basis
that can be translated to statistical significance.
Assumes that sequence, structure, and function are inter-related.
All similarity searching methods rely on the concepts of alignment
and distance between sequences.
A similarity score is calculated from a distance: the number of DNA
bases or amino acids that are different between two sequences.
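The distance described here can be sketched as a simple mismatch count between two sequences of equal length (a Hamming distance). This is a minimal illustration only: real search tools align the sequences first and allow gaps, and the function names are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Share of identical positions, as a percentage."""
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


seq1 = "GATTACA"
seq2 = "GACTATA"
print(hamming_distance(seq1, seq2))            # 2 differing positions
print(round(percent_identity(seq1, seq2), 1))  # 5 of 7 positions identical
```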
Calculating alignment scores
Scoring system: Uses scoring matrices that allow biologists to quantify the
quality of sequence alignments.
The raw score S is calculated by summing the scores for each aligned
position and the scores for gaps. Gap creation/extension scores are
inherent to the scoring system in use (BLAST, FASTA)
The score for an identity or a mismatch is given by the specified substitution
matrix (e.g., BLOSUM62).
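The summation above can be sketched for an alignment that has already been built. The example assumes a toy substitution function (+2 for identity, -1 for mismatch) in place of a real matrix such as BLOSUM62, and hypothetical gap scores of -5 for the first gapped position and -1 for each further one; none of these values are the defaults of BLAST or FASTA.

```python
def toy_substitution(x: str, y: str) -> int:
    """Stand-in for a substitution matrix lookup (assumed values)."""
    return 2 if x == y else -1


def raw_score(row_a: str, row_b: str, substitution=toy_substitution,
              gap_open: int = -5, gap_extend: int = -1) -> int:
    """Sum substitution and gap scores over two gapped, equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    gap_row = None  # which row the currently open gap is in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            # First position of a gap pays gap_open, later ones pay gap_extend.
            score += gap_extend if gap_row == current else gap_open
            gap_row = current
        else:
            score += substitution(x, y)
            gap_row = None
    return score


print(raw_score("QKESG-PS", "QQESGLVS"))  # 2-1+2+2+2-5-1+2 = 3
```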
Devising a scoring system
Some popular scoring matrices are:
PAM (Percent Accepted Mutation): for evolutionary studies.
For example, PAM1 corresponds to 1 accepted point mutation per 100 amino
acids.
BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding
common motifs. For example, BLOSUM62 is derived from alignments
of sequences sharing no more than 62% identity.
How the matrices were created:
Very similar sequences were aligned.
From these alignments, the frequency of substitution between
each pair of amino acids was calculated and then PAM1 was built.
The full series of PAM matrices can be calculated by multiplying the
PAM1 mutation probability matrix by itself; each result is then
normalized to log-odds format.
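Extrapolating PAM1 to larger evolutionary distances can be illustrated with a toy model. The sketch below assumes the multiplication is applied to a mutation probability matrix (each row sums to 1) before any log-odds conversion, and it uses a made-up three-letter alphabet rather than the real 20 x 20 Dayhoff data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply matrix m by itself n times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1-like" matrix: each residue stays unchanged with probability 0.99.
M1 = [[0.990, 0.006, 0.004],
      [0.005, 0.990, 0.005],
      [0.003, 0.007, 0.990]]

M250 = mat_power(M1, 250)
for row in M250:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, while the diagonal shrinks as more substitutions accumulate.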
Devising a scoring system
Importance:
Scoring matrices appear in all analyses
involving sequence comparison.
The choice of matrix can strongly influence
the outcome of the analysis.
Understanding the theory underlying a given
scoring matrix can aid in making the proper
choice:
-Some matrices reflect similarity: good for
database searching
-Some reflect distance: good for phylogenies
Log-odds matrices, a normalisation method for matrix values:
S_{ij} = \log \frac{q_{ij}}{p_i \, p_j}
S_ij is the logarithm of the ratio between the probability that two residues, i and j,
are aligned by evolutionary descent and the probability that they are aligned by chance.
q_ij is the frequency with which i and j are observed to align in sequences known to
be related. p_i and p_j are their frequencies of occurrence in the set of sequences.
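A log-odds score compares the observed pair frequency q_ij with the frequency p_i * p_j expected by chance. The sketch below assumes the common form S_ij = log(q_ij / (p_i * p_j)) with a base-2 logarithm; the frequencies are invented for illustration and published matrices additionally scale and round the values.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, base: float = 2.0) -> float:
    """Log of observed pair frequency over the frequency expected by chance."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies: both residues occur with frequency 0.1.
print(log_odds(0.020, 0.1, 0.1))  # pair seen twice as often as chance: positive score
print(log_odds(0.005, 0.1, 0.1))  # pair seen half as often as chance: negative score
```

Positive scores mark substitutions seen more often than chance in related sequences, negative scores mark substitutions seen less often.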
Database search methods: Sequence Alignment
Two broad classes of sequence alignments exist:
Global alignment (not sensitive):
QKESGPSSSYC
VQQESGLVRTTC
Local alignment (faster):
ESG
ESG
The most widely used local similarity algorithms are:
Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/;
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
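To make local alignment concrete, here is a minimal Smith-Waterman sketch applied to the two example sequences above. It is a simplification under stated assumptions: match/mismatch scores of +2/-1 stand in for a substitution matrix, gaps carry a linear penalty of -2, and only one best-scoring alignment is returned.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC")
print(score)
print(top)
print(bottom)
```

With these assumed scores the best local alignment covers the shared ESG core plus the flanking residues that still raise the score. The full matrix is filled in, which is why Smith-Waterman is slower than heuristic tools such as BLAST and FASTA.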
Which algorithm to use for database similarity search?
Speed:
BLAST > FASTA > Smith-Waterman (it is VERY SLOW and uses a
LOT OF COMPUTER POWER)
Sensitivity/statistics:
FASTA is more sensitive than BLAST and misses fewer homologues
Smith-Waterman is even more sensitive.
BLAST calculates probabilities
FASTA is more accurate for DNA-DNA searches than BLAST
Genomics: Completed genomes as of 2002
Currently the genomes of over 600 organisms have been sequenced.
Sequencing strategies shown: whole-genome shotgun and map-based.
Organism                                 Base pairs
54 Bacteria                              0.8-6 million
Yeast                                    15 million
C. elegans (roundworm)                   100 million
Drosophila (fruitfly)                    120 million
Arabidopsis (thale cress)                130 million
Rice                                     435 million
Human                                    3 billion
Fugu (puffer fish)                       365 million
Anopheles (malaria-carrying mosquito)    278 million
This generates large amounts of information to be handled by individual
computers.
Tools to search databases
The dilemma: DNA or protein?
Search by similarity
-Using nucleotide seq.
-Using amino acid seq.
Is the comparison of two nucleotide sequences accurate?
By translating into amino acid sequence, are we losing information?
The genetic code is degenerate (two or more codons can represent
the same amino acid)
Very different DNA sequences may code for similar protein sequences
We certainly do not want to miss those cases!
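The effect of degeneracy can be shown with a small sketch that translates two different DNA sequences using the standard genetic code. The two example sequences are made up for illustration, and the helper assumes input containing only A, C, G, and T.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna1 = "CTTAGC"  # hypothetical sequence: CTT AGC
dna2 = "TTGTCA"  # hypothetical sequence: TTG TCA
identical = sum(x == y for x, y in zip(dna1, dna2))
print(translate(dna1), translate(dna2))  # both encode Leu-Ser
print(f"{identical} of {len(dna1)} nucleotides identical")
```

The two DNA strings share only one of six nucleotides yet encode the same dipeptide, which a nucleotide-level search could easily miss.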
Reasons for translating
Comparing DNA sequences gives more random matches:
[Figure: two DNA alignments labeled "A good alignment with end-gaps" and "A very poor alignment", annotated "Almost 50% identity!"]
Conservation of protein in evolution (DNA similarity decays faster!)
Conclusion:
It is almost always better to compare coding sequences in their amino acid form,
especially if they are very divergent.
Very highly similar nucleotide sequences may give better results when compared as DNA.
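One way to see why DNA comparisons produce more random matches is to compute the identity expected by chance between two unrelated, ungapped sequences. The sketch assumes independent positions and uniform letter frequencies, which real sequences only approximate.

```python
def chance_identity(frequencies) -> float:
    """Probability that two independently drawn letters are identical."""
    return sum(f * f for f in frequencies)


dna_chance = chance_identity([1 / 4] * 4)        # four nucleotides
protein_chance = chance_identity([1 / 20] * 20)  # twenty amino acids
print(f"DNA: {dna_chance:.0%}, protein: {protein_chance:.0%}")
```

Under these assumptions unrelated DNA sequences already agree at about a quarter of positions, and allowing gaps raises that figure further, whereas unrelated proteins agree at about one position in twenty.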
BLAST and FASTA variants
FASTA: Compares a DNA query to DNA database, or a protein query
to protein database
FASTX: Compares a translated DNA query to a protein database
TFASTA: Compares a protein query to a translated DNA database
BLASTN: Compares a DNA query to DNA database.
BLASTP: Compares a protein query to protein database.
BLASTX: Compares the 6-frame translations of DNA query to protein
database.
TBLASTN: Compares a protein query to the 6-frame translations of a DNA
database.
TBLASTX: Compares the 6-frame translations of DNA query to the 6-frame
translations of a DNA database (each sequence is comparable to
BLASTP searches!)
PSI-BLAST: Performs iterative database searches. The results from each round
are incorporated into a 'position specific' score matrix, which is
used for further searching
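The choice among the BLAST variants listed above depends only on the query type, the database type, and whether DNA should be compared at the protein level. A minimal lookup sketch follows; the function and argument names are hypothetical and not part of any BLAST interface.

```python
def choose_blast_program(query: str, database: str,
                         compare_as_protein: bool = False) -> str:
    """Pick a BLAST variant from the query and database types ("dna" or "protein")."""
    if query == "dna" and database == "dna":
        # DNA against DNA can be compared directly or via 6-frame translations.
        return "TBLASTX" if compare_as_protein else "BLASTN"
    table = {
        ("protein", "protein"): "BLASTP",
        ("dna", "protein"): "BLASTX",    # query is translated in 6 frames
        ("protein", "dna"): "TBLASTN",   # database is translated in 6 frames
    }
    return table[(query, database)]


print(choose_blast_program("dna", "protein"))
print(choose_blast_program("dna", "dna", compare_as_protein=True))
```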
A practical example of sequence alignment

http://www.ncbi.nlm.nih.gov
BLAST results
Detailed BLAST results
E value: the expectation value, i.e. the number of hits scoring at least as well as
yours that would be expected to occur by chance in a database of that size.
The lower the E, the more significant the score.
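For intuition, the expectation value is commonly modeled with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths and S is the raw score. The sketch below uses placeholder values for K and lambda, which in practice depend on the scoring matrix and gap penalties; it is not a reimplementation of BLAST's statistics.

```python
import math


def expect_value(raw_score: float, query_len: int, db_len: int,
                 lam: float, k: float) -> float:
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return 1.0 - math.exp(-e_value)


# Placeholder parameters chosen for illustration only.
LAMBDA, K = 0.267, 0.041
for score in (60, 100, 140):
    e = expect_value(score, query_len=300, db_len=10**8, lam=LAMBDA, k=K)
    print(score, f"E = {e:.3g}", f"P = {p_value(e):.3g}")
```

E falls exponentially as the score rises, and for small values E and P are nearly equal, which is why the two are often used interchangeably.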
Database searching tips
Use the latest database version.
Use BLAST first, then a finer tool (FASTA, ...)
Search both strands when using FASTA.
Translate sequences where relevant
Search the 6-frame translation of a DNA database
E < 0.05 is statistically significant, usually biologically
interesting.
If the query has repeated segments, delete them and
repeat the search
Most widely used sites for sequence analysis
Sites for alignment of two or more sequences:
T-COFFEE (http://www.ch.embnet.org/software/TCoffee.html): more accurate
than ClustalW for sequences with less than 30% identity.
ClustalW (http://www.ch.embnet.org/software/ClustalW.html;
http://align.genome.jp)
bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html )
MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html )
Sites for DNA to protein translation:
These tools can translate DNA sequences in any of the three forward or three
reverse reading frames.
Translate (http://au.expasy.org/tools/dna.html)
Translate a DNA sequence: (http://www.vivo.colostate.edu/molkit/translate/index.html )
Transeq (http://www.ebi.ac.uk/emboss/transeq)
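What such translation tools compute can be sketched as follows: the three forward frames start at offsets 0, 1, and 2 of the sequence, and the three reverse frames do the same on its reverse complement. This is a minimal sketch using the standard genetic code and assuming input containing only A, C, G, and T; it does not reproduce the options of the tools listed above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna: str) -> dict:
    """Translations of the three forward (+) and three reverse (-) frames."""
    forward, reverse = dna.upper(), reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for name, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(name, protein)
```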
