The need to process the ever growing biological data has created
Scientists from biological sciences are the creators & ultimate users of
this data.
Due to huge size & high complexity of the biological data, the help of
This need has created a new field Computational Molecular Biology &
bioinformatics
Prediction: $2
~50
Bioinformatics companies:
Bioinformatics
Unit 1. Basic Concepts
Unit 2. Suffix Trees and Applications
Unit 3. Sequence alignment: pairwise
Bioinformatics
Unit 1. Basic Concepts of Molecular Biology:
Unit 4. Sequencing
Fragment Assembly (Shortest common super
BOOKS RECOMMENDED
Dan Gusfield, Algorithms on Strings, Trees and
Class 2
UNIT 1
LIFE AT ITS SIMPLEST
DNA
RNA
PROTEIN
GENETIC-CODE
QUICK-PRIMER ON GENETICS.
EVERY CELL IN THE HUMAN BODY CONTAINS A COPY OF
THE GENOME
to create life.
Venter: magic new life?
Venter used his knowledge to create
Contd
Until now, scientists have managed to take
Anything that is in
Life starts
Life started some 3.5 billion years ago, shortly
Proteins
Different Roles of Proteins
Enzymes
Carry signals
Transport small molecules such as oxygen
Form cellular structures (tissues)
Regulate cell processes (such as defense
mechanisms)
What are proteins made of?
Amino acids: a chain of amino acids = protein
Amino acids
angles around:
C-N bond (φ)
C-C bond (ψ)
PROTEIN STRUCTURE
A protein is not just a linear sequence of residues
(primary structure)
Proteins actually fold in 3D, presenting secondary,
tertiary and quaternary structures
3D shape of a protein is related to its function
Proteins can be made out of 20 different kinds of
amino acids, which makes the resulting 3D structure very
complex and without symmetry
No simple and accurate method for determining the
3D structure is known.
Genomic Code
DNA
deoxyribonucleic acid
Basic unit = nucleotide
Sugar,Phosphate,Base
(A, G, T, C)
adenine, thymine
cytosine, guanine.
Contd
DNA is a chain of simpler molecules
Actually it is a double chain (strands)
Each simple chain has a backbone consisting of
Contd
Attached to each 1' carbon in the backbone
CONTD
Bases A & G belong to a larger group of substances called
Contd
DNA molecules are double strands
The two strands are tied together in a helical structure (Watson
RNA
RNA is a nucleic acid made from long chain
of nucleotides
Each nucleotide consists of a nitrogen base,
a ribose sugar, and a phosphate
RNA is very similar to DNA, but differs in the
following basic compositional and
structural ways
DNA
Usually a double-stranded chain of nucleotides
DNA contains deoxyribose
DNA is more stable
Complementary nucleotide to adenine is thymine
DNA performs essentially one function
RNA
Usually a single-stranded chain of nucleotides
RNA contains ribose
RNA is more prone to hydrolysis
Complementary nucleotide to adenine is uracil
There are different kinds of RNA performing different functions
Class 3
26/06/08
Contd
Because of the intron/exon phenomenon, we use
CCTGAGCCAACTATTGATGAA
transcription
RNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
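The transcription and translation steps in the example above can be reproduced with a short sketch. It assumes the standard genetic code and that the DNA shown is the coding strand, so transcription amounts to replacing T with U; the names `CODON_TABLE`, `transcribe` and `translate` are illustrative and not part of the course material.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at each
# position (first base varies slowest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA matching the coding strand (every T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = rna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

This sketch ignores introns: it treats the input as an already spliced coding sequence.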
JUNK DNA
Genes are certain contiguous regions of the chromosome, but
There are intergenic regions which do not have any known
Recent research has shown that junk DNA has more information
Genome
The complete set of chromosomes inside a cell is
called a genome.
The number of chromosomes in a genome is
characteristic of the species
Every cell in a human being has 46
chromosomes, whereas in mice this number
is 40
Class 4
01/07/08
Contd
If the gene for aniridia is inserted into an
Contd
20 years ago similarity between eyeless &
Contd
In the late 1980s, fast computer programs for
COPYING DNA
Also known as DNA amplification
Very important in DNA cloning
Given a piece of DNA, one way of obtaining further
environmental cleanup
DNA forensics
improved agriculture and livestock
better understanding of evolution and human
migration
more accurate risk assessment
Contd
A large effort like this cannot be undertaken
by a single lab!
On computer science side, databases with
updated & consistent information have to be
maintained,
Fast access to the data has to be provided
After the sequencing there is still the difficult
task of analyzing the data obtained
Contd..
Treatment of genetic diseases based on data
produced by the Human Genome Project is
still going on, although encouraging
pioneering efforts have already yielded
results.
Class 5
08/07/08
What is a database?
A collection of information, usually stored in
databases:
form.
Since analysis of biological data almost always involves
computers, having the data in computer-readable form (rather
than printed on paper) is a necessary first step.
Type of data
nucleotide sequences
protein sequences
protein sequence patterns or motifs
macromolecular 3D structure
gene expression data
metabolic pathways
Technical design
Flat-files
Relational database (SQL)
Object-oriented database (e.g. CORBA, XML)
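To make the difference between the first two technical designs concrete, the following is a minimal sketch that loads a FASTA-style flat file into a relational (SQLite) table and then retrieves an entry by its accession code. The two records, the table name `entry` and the helper `parse_fasta` are hypothetical and chosen only for illustration; real databases use much richer flat-file formats and schemas.

```python
import sqlite3

# Hypothetical flat-file content in FASTA style: a '>' header line with an
# accession code and description, followed by one or more sequence lines.
FLAT_FILE = """\
>X00001 hypothetical example entry one
CCTGAGCCAACT
ATTGATGAA
>X00002 hypothetical example entry two
ATGGCCTAA
"""


def parse_fasta(text):
    """Yield (accession, description, sequence) tuples from FASTA text."""
    accession, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                yield accession, description, "".join(chunks)
            header = line[1:].split(maxsplit=1) or ["unnamed"]
            accession = header[0]
            description = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if accession is not None:
        yield accession, description, "".join(chunks)


# Relational design: one row per entry, indexed by accession code.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry ("
    "accession TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
)
conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", parse_fasta(FLAT_FILE))

row = conn.execute(
    "SELECT sequence FROM entry WHERE accession = ?", ("X00001",)
).fetchone()
print(row[0])  # CCTGAGCCAACTATTGATGAA
```

With the flat file, finding an entry means scanning the text; with the relational table, the lookup is a keyed query.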
Maintainer status
Large, public institution (e.g. EMBL, NCBI)
Quasi-academic institute (e.g. Swiss Institute of Bioinformatics, TIGR)
Academic group or scientist
Commercial company
Availability
Publicly available, no restrictions
Available, but with copyright
Accessible, but not downloadable
Academic, but not freely available
Proprietary, commercial; possibly free for academics
Identifier
Accession code (or number)
Contd
Identifier
An identifier ("locus" in GenBank, "entry name" in SWISS-
Contd
Accession code (number)
An accession code (or number) is a number (possibly with a few
Contd
EMBL www.ebi.ac.uk/embl/
The DNA Data Bank of Japan began as a collaboration with EMBL and
GenBank. It is run by the National Institute of Genetics. One can search
for entries by accession number.
The following databases contain subsets of the EMBL/GenBank databases. Some also
contain more information or links than the primary ones, or have a different organization of
the data to better serve some specific purpose. However, the nucleotide sequences themselves
should always be available in the EMBL/GenBank databases. In this sense, the databases
below are secondary databases.
UniGene http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene
The UniGene system attempts to process the GenBank sequence data into a non-redundant
set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a
unique gene, as well as related information such as the tissue types in which the gene has
been expressed and map location.
SGD http://www.yeastgenome.org/
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular
biology and genetics of the yeast Saccharomyces cerevisiae.
EBI Genomes www.ebi.ac.uk/genomes/
This web site provides access and statistics for the completed genomes, and information
about ongoing projects.
Genome Biology www.ncbi.nlm.nih.gov/Genomes/
The Genome Biology site at NCBI contains information about the available complete
genomes.
Ensembl www.ensembl.org
Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software
system which produces and maintains automatic annotation on eukaryotic genomes.
SWISS-PROT, TrEMBL
www.expasy.ch/sprot
PIR pir.georgetown.edu
PIR grew out of Margaret Dayhoff's work in the mid-1960s. It strives to
be comprehensive, well-organized, accurate, and consistently annotated.
However, it is generally believed that its entry annotation does not reach the
level of completeness of SWISS-PROT. Although SWISS-PROT and PIR
overlap extensively, there are still many sequences which can be found in only
one of them.
One can search for entries or do sequence similarity searches at the PIR site.
PIR also produces the NRL-3D, which is a database of sequences extracted
from the three-dimensional structures in the Protein Databank (PDB).
It appears that the PIR web site, and possibly also the underlying database, has
improved considerably over the past year. This means that if one is interested
in protein sequences, there is now even more reason to check out PIR.
Contd
KEGG www.genome.ad.jp/kegg/
POPULAR BIOINFORMATICS DATABASES
http://everest.bic.nus.edu.sg/~bhuvana/lsm2104/popualr-bioinformatics-databases.htm
Class 6
10/07/08
www.ncbi.nlm.nih.gov
Created in 1988 as part of the National
Library of Medicine at NIH
Establish public databases
Research in computational biology
Develop software tools for sequence
analysis
Disseminate biomedical information
Derivative databases
Built from primary data
Content controlled by third party (NCBI)
Entrez
Books
Linked from PubMed and other records
Nucleotide databases
Type        Database                  Count
Primary     GenBank / EMBL / DDBJ     54,694,591
Derivative  RefSeq                     1,132,972
            Third Party Annotation         4,763
            PDB                            5,887
Total                                 55,838,213
genome projects
sequencing centers
individual scientists
patent offices
Protein Databases
Genpept
TrEMBL (1996)
Highly redundant
Not all experimentally determined
Many inaccuracies
UniProt (2003)
UniProt Knowledgebase (UniProtKB)
Central access point for extensive curated protein
information, including function, classification, and
cross-reference.
UniProt Non-redundant Reference (UniRef)
[Figure: scattered sequence fragments with the labels RefSeq, Genome Assembly, Labs, GenBank, UniGene and Algorithms]
Bottom image: computer image of sequence read by automated sequencer
The 21st century is a century of biotechnology & bioinformatics:
[Figure: growth of sequence data; x-axis: Years (1980 to 2000), y-axis: Nucleotides (billion), 0 to 8]
There
Because
[Figure: diagram with the labels Database storage, Your request, Results, You are here]
Scope
Make
They
Specialized databases: tissues, species
- ESTs (Expressed Sequence Tags)
  ~ at NCBI http://www.ncbi.nlm.nih.gov/dbEST
  ~ at TIGR http://tigr.org/tdb/tgi
- ... many more!
http://www.pir.uniprot.org/
Translated databases:
- TrEMBL (translated EMBL): includes entries that have
not yet been annotated and integrated into Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html
Assumes
All
A similarity
The
The
How
S is the log of the ratio between the probability that two residues, i and j, are aligned by evolutionary descent
and the probability that they are aligned by chance.
qij are the frequencies that i and j are observed to align in sequences known to
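Substitution matrices such as PAM and BLOSUM conventionally express this score as a log-odds ratio. The sketch below assumes the usual form S_ij = scale * log2(q_ij / (p_i * p_j)), where p_i and p_j are the background frequencies of the two residues; the background frequencies, the `scale` factor and the numbers used in the example are assumptions for illustration, not values from the slides.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, scale: float = 2.0) -> float:
    """Log-odds substitution score.

    q_ij  : observed frequency with which residues i and j are aligned
    p_i   : background frequency of residue i (assumed input)
    p_j   : background frequency of residue j (assumed input)
    scale : multiplier applied to the log2 ratio (assumed; 2.0 gives half-bit units)
    """
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("all frequencies must be positive")
    return scale * math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies: both residues occur with background frequency 0.05,
# so alignment by chance alone would be expected at 0.05 * 0.05 = 0.0025.
print(round(log_odds_score(0.01, 0.05, 0.05), 3))      # 4.0  (aligned more often than chance)
print(round(log_odds_score(0.0025, 0.05, 0.05), 3))    # 0.0  (aligned as often as chance)
print(round(log_odds_score(0.000625, 0.05, 0.05), 3))  # -4.0 (aligned less often than chance)
```

A positive score marks a pair seen together more often than chance predicts, and a negative score marks a pair seen less often.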
Global alignment:
VQQESGLVRTTC
not sensitive
ESG
The
Local alignment:
ESG
faster
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
BLAST > FASTA > Smith-Waterman (Smith-Waterman is VERY SLOW and uses a
LOT OF COMPUTER POWER)
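A minimal sketch of Smith-Waterman local alignment shows where the cost comes from: every cell of a table of size len(a) x len(b) is filled in. The scoring values (match = 2, mismatch = -1, linear gap = -2) are assumptions for illustration; real searches use a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming table; scores never drop below zero,
    # which is what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("VQQESGLVRTTC", "ESG"))  # (6, 'ESG', 'ESG')
```

The local alignment picks out the shared ESG segment and ignores the rest of the longer sequence. Heuristic tools such as FASTA and BLAST gain their speed by avoiding this full table.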
Sensitivity/statistics:
Map-based
Organism                  Base pairs
54 Bacteria               0.8-6 million
Yeast                     15 million
C. elegans (roundworm)    100 million
Drosophila (fruitfly)     120 million
(name lost)               130 million
Rice                      435 million
Human                     3 billion
(name lost)               365 million
(name lost)               278 million
computers.
Conclusion:
It is almost always better to compare coding sequences in their amino acid form,
especially if they are very divergent.
Very highly similar nucleotide sequences may give better results.
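The conclusion above can be illustrated with a small sketch: two coding sequences that differ at every third codon position still encode the same protein. The second sequence is a hypothetical variant constructed for this example using synonymous codons from the standard genetic code; the helper names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence in frame 1 (stops shown as '*')."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq1 = "CCTGAGCCAACTATTGATGAA"
seq2 = "CCAGAACCGACAATAGACGAG"  # hypothetical: synonymous change in every codon

print(round(identity(seq1, seq2), 2))              # 0.67 at the nucleotide level
print(identity(translate(seq1), translate(seq2)))  # 1.0 at the amino acid level
```

Here one third of the nucleotides differ while the proteins are identical, so the amino acid comparison retains the signal that the nucleotide comparison partly loses.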
BLASTP:
BLASTX:
TBLASTN:
TBLASTX:
PSI-BLAST: Performs iterative database searches. The results from each round
BLAST results
Use
Search
Translate
Search
If
http://align.genome.jp)
bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)
MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html)
Sites
These algorithms can translate DNA sequences in any of the three forward or three
reverse reading frames.
Translate (http://au.expasy.org/tools/dna.html)
Translate a DNA sequence: (http://www.vivo.colostate.edu/molkit/translate/index.html)
Transeq (http://www.ebi.ac.uk/emboss/transeq)
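What these tools do can be sketched in a few lines. The example assumes the standard genetic code; the frame labels (+1 to +3, -1 to -3) and function names are illustrative and do not reflect the interface of any of the sites listed above.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate starting at `offset`; stops are '*', unknown codons are 'X'."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return a dict mapping frame label to protein for all six frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate_frame(dna, offset)
        frames[f"-{offset + 1}"] = translate_frame(reverse, offset)
    return frames


for label, protein in sorted(six_frame_translation("CCTGAGCCAACTATTGATGAA").items()):
    print(label, protein)
```

Frame +1 of this input reads PEPTIDE; the other five frames show what the same DNA would encode if read from a shifted start or from the opposite strand.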