Techniques
Chapter 8
8.4. Mining sequence patterns in
biological data
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
2006 Jiawei Han and Micheline Kamber. All rights reserved.
08/18/15
Summary
Nucleotides (bases)
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
Source:
www.mtsinai.on.ca/pdmg/Genetics/basic.htm
In RNA, one special codon (AUG) serves as the start codon, and a few others (UAA, UAG, UGA) serve as stop codons.
There are 20 standard amino acids (A, L, V, S, ...), but in certain rare situations others can be added to that list.
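As a rough illustration of how start and stop codons delimit a coding region, the sketch below scans an mRNA string for the first start codon and reads codons in frame until a stop codon. It assumes the standard genetic code (start AUG; stops UAA, UAG, UGA); the function name `first_orf` and the test sequence are made up for this example.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}

def first_orf(mrna):
    """Return the codons from the first start codon up to, but not including,
    the first in-frame stop codon (or up to the end if no stop is found)."""
    start = mrna.find(START)
    if start == -1:
        return []  # no start codon present
    codons = []
    # Step through the sequence three bases at a time, in frame with the start.
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOPS:
            break
        codons.append(codon)
    return codons

print(first_orf("GGAUGGCUUAAGG"))  # ['AUG', 'GCU']
```

Each returned codon would then be mapped to one amino acid through a codon table, which is omitted here.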
Biological Information: From Genes to Proteins
Gene (DNA) → Transcription → RNA → Translation → Protein → Protein folding
Related fields: genomics, molecular biology, structural biology, biophysics
sequence
3D structure
protein functions
Spliceosome
Ribosome
Functional genomics:
studies the function of all
the proteins of a genome
A cell is made up of molecular components that can be viewed as 3D structures of various shapes.
In a living cell, the molecules interact with each other (with respect to shape and location). An important type of interaction involves catalysis, in which an enzyme facilitates the interaction.
A metabolic pathway is a chain of molecular interactions involving enzymes.
Signaling pathways are molecular interactions that enable communication through the cell's membrane.
Source: www.mtsinai.on.ca/pdmg/images/pairscolour.jpg
It can produce 300k base pairs per day at relatively low cost
Bioinformatics
Computational management and analysis of biological information
Interdisciplinary field (molecular biology, statistics, computer science, genomics, genetics, databases, chemistry, radiology)
Bioinformatics vs. computational biology (more on algorithm correctness, …)
[Figure labels: Bioinformatics, Functional Genomics, Genomics, Proteomics, Structural Bioinformatics]
Research
(I) Genomics to Biology
Algorithms Used in
Bioinformatics
Using script languages: script on the Web to analyze data and applications
Summary
Comparing Sequences
Nucleotides: identical
Goal:
Method:
Example: aligning HEAGAWGHEE and PAWHEAE. Two candidate alignments:

HEAGAWGHE-E
P-A--W-HEAE

HEAGAWGHE-E
--P-AW-HEAE
[Table: substitution scores for the residues of the example sequences]
Gap penalty: -8
Gap extension: -8

HEAGAWGHE-E
--P-AW-HEAE
Score, one term per column: (-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

HEAGAWGHE-E
P-A--W-HEAE
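The column-by-column scoring of the alignment HEAGAWGHE-E / --P-AW-HEAE can be sketched as follows. This is a minimal illustration: the `SCORES` table holds only the substitution scores that appear in the example (A/P, A/A, W/W, H/H, E/E), every gap column costs -8, and the name `alignment_score` is hypothetical.

```python
GAP = -8  # cost of each gap column in the example

# Only the substitution scores used by the example alignment.
SCORES = {
    ("A", "P"): -1,
    ("A", "A"): 5,
    ("W", "W"): 15,
    ("H", "H"): 10,
    ("E", "E"): 6,
}

def alignment_score(top, bottom):
    """Sum one score per alignment column (gap cost or substitution score)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        elif (a, b) in SCORES:
            total += SCORES[(a, b)]
        else:
            total += SCORES[(b, a)]  # the score table is symmetric
    return total

score = alignment_score("HEAGAWGHE-E", "--P-AW-HEAE")
print(score)  # 1
```

The alignment has 11 columns, five of which contain a gap, so the total is 5 × (-8) plus the six substitution scores, which gives 1.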
Formal Description
Problem: PairSeqAlign
Input: Two sequences x, y
Scoring matrix s
Gap penalty d
Gap extension penalty e
Output: The optimal sequence alignment
Difficulty: If x, y are of size n, then the number of possible global alignments is
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
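To get a feel for how quickly the number of global alignments grows, the short sketch below evaluates the binomial count (2n choose n) exactly and compares it with the estimate 2^(2n) / sqrt(pi n). The function names are illustrative only.

```python
import math

def num_global_alignments(n):
    """Exact count (2n choose n) for two sequences of length n."""
    return math.comb(2 * n, n)

def stirling_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)

for n in (5, 10, 20):
    print(n, num_global_alignments(n), round(stirling_estimate(n)))
```

Even for n = 20 the count exceeds 10^11, which is why enumerating alignments is impractical and dynamic programming is used instead.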
Uses weights for the outmost edges that encourage the best
overall (global) alignment
[Figure: partial alignments of HEAGAWGHE-E and --P-AW-HEAE (prefixes HEAGA and --P-A) with candidate cell scores -25, -33, -33 and -20; labels: "Gap with top", "Top and bottom"]
Global Alignment
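$F(i,j)$ is computed from three neighboring cells, where $d > 0$ is the size of the gap penalty:
$F(i-1,j-1) + s(x_i, y_j)$: $x_i$ aligned to $y_j$
$F(i-1,j) - d$: $x_i$ aligned to gap
$F(i,j-1) - d$: $y_j$ aligned to gap
$O(n^2)$ algorithm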
genomes.
Pair-Wise Sequence
Alignment
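Global alignment (Needleman-Wunsch)
Given $s(x_i, y_j)$, $d$
$F(0,0) = 0$
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

Local alignment (Smith-Waterman)
Given $s(x_i, y_j)$, $d$
$F(0,0) = 0$
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
Alignment: $0 \rightarrow F(i,j)$

Here $d > 0$ is the size of the gap penalty (8 in the earlier example).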
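The two recurrences differ only in the extra 0 option and in where the final score is read. A minimal sketch in Python is given below. It assumes a linear gap cost d > 0, the usual boundary values F(i,0) = -i·d and F(0,j) = -j·d for the global case, and a toy match/mismatch scoring function `simple_score` that stands in for a real substitution matrix; all names are illustrative.

```python
def align_score(x, y, s, d, local=False):
    """Fill the DP table F and return the optimal alignment score.

    s(a, b) is the substitution score, d > 0 the linear gap cost.
    local=False: global alignment; local=True: local alignment.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, n + 1):
            F[i][0] = -d * i
        for j in range(1, m + 1):
            F[0][j] = -d * j
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned to y_j
                F[i - 1][j] - d,                          # x_i aligned to gap
                F[i][j - 1] - d,                          # y_j aligned to gap
            )
            if local:
                F[i][j] = max(0, F[i][j])  # a local alignment may restart here
                best = max(best, F[i][j])
    return best if local else F[n][m]

def simple_score(a, b):
    """Toy scoring: +1 for a match, -1 for a mismatch (an assumption)."""
    return 1 if a == b else -1

print(align_score("GATTACA", "GATTACA", simple_score, 2))
print(align_score("TTACGTT", "GGACGGG", simple_score, 2, local=True))
```

Only the score is returned here; recovering the alignment itself would require storing back-pointers and tracing back from the final cell (global) or from the best-scoring cell (local).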
Heuristic Alignment
Algorithms
Motivation
Motivation
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
---ATGGGTTTGACAGCACATGATCGT---CAGCT
$X^1 = x^1_1, \ldots, x^1_{m_1}$
$X^2 = x^2_1, \ldots, x^2_{m_2}$
...
$X^N = x^N_1, \ldots, x^N_{m_N}$
$S(a^*) = 21$
Intuition:
A perfectly aligned
column has one single
symbol (least
uncertainty)
A poorly aligned
column has many
distinct symbols (high
uncertainty)
Count of symbol $a$ in column $i$ (the normalizing sum runs over all symbols $a'$)
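One common way to quantify this intuition is the Shannon entropy of the symbol frequencies in a column. The sketch below assumes that measure (the slide's own formula is not shown here), uses log base 2, and treats a gap character like any other symbol; `column_entropy` is an illustrative name.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbol frequencies in one column."""
    counts = Counter(column)   # count of each symbol in the column
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(column_entropy("AAAA"))  # perfectly aligned column: zero uncertainty
print(column_entropy("ACGT"))  # four distinct symbols: 2 bits
```

A column with a single symbol has entropy 0, and the value grows as the symbols become more evenly spread.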
Multidimensional Dynamic
Programming
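Assumptions: (1) columns are independent (2) linear gap cost
$S(m) = G + \sum_i s(m_i)$, with $G(g) = d\,g$

$\alpha_{i_1,i_2,\ldots,i_N}$: score for the prefixes ending at $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$

$\alpha_{0,0,\ldots,0} = 0$

$$\alpha_{i_1,i_2,\ldots,i_N} = \max \begin{cases}
\alpha_{i_1-1,\,i_2-1,\ldots,\,i_N-1} + S(x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}) \\
\alpha_{i_1,\,i_2-1,\ldots,\,i_N-1} + S(-, x^2_{i_2}, \ldots, x^N_{i_N}) \\
\alpha_{i_1-1,\,i_2,\ldots,\,i_N-1} + S(x^1_{i_1}, -, \ldots, x^N_{i_N}) \\
\quad\vdots \\
\alpha_{i_1,\,i_2,\ldots,\,i_N-1} + S(-, \ldots, -, x^N_{i_N}) \\
\quad\vdots \\
\alpha_{i_1-1,\,i_2,\ldots,\,i_N} + S(x^1_{i_1}, -, \ldots, -)
\end{cases}$$

Alignment: $(0,0,\ldots,0) \rightarrow (|x^1|, \ldots, |x^N|)$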
Complexity of Dynamic
Programming
$S(a) = \sum_{k<l} S(a^{kl})$
$S(a^{kl}) \ge \beta^{kl}$
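The sum-of-pairs idea, scoring a multiple alignment as the sum of the scores of all pairwise alignments it induces, can be sketched as below. The pairwise scoring values (match +1, mismatch -1, gap -2, gap against gap ignored) are assumptions chosen for illustration, as are the function names.

```python
from itertools import combinations

def pair_score(row_k, row_l, match=1, mismatch=-1, gap=-2):
    """Score the pairwise alignment induced by two rows of the alignment."""
    total = 0
    for a, b in zip(row_k, row_l):
        if a == "-" and b == "-":
            continue            # column absent from this pairwise alignment
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def sum_of_pairs(rows):
    """Sum pair_score over all pairs k < l of alignment rows."""
    return sum(pair_score(rows[k], rows[l])
               for k, l in combinations(range(len(rows)), 2))

print(sum_of_pairs(["AC-", "ACG", "A-G"]))  # 0 + (-3) + 0 = -3
```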
Iterative Refinement
Many Heuristics
Summary
Transition probabilities
$\Pr(x_i = a \mid x_{i-1} = g) = 0.16$
$\Pr(x_i = c \mid x_{i-1} = g) = 0.34$
$\Pr(x_i = g \mid x_{i-1} = g) = 0.38$
$\Pr(x_i = t \mid x_{i-1} = g) = 0.12$
$\sum_{x_i} \Pr(x_i \mid x_{i-1} = g) = 1$
a set of states
$$\Pr(x) = \Pr(x_L \mid x_{L-1})\, \Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\, \Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$$
Pr(cggt)=Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
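The factorization of Pr(cggt) can be evaluated with a small first-order Markov chain, as in the sketch below. Only the transition row for a preceding g (0.16, 0.34, 0.38, 0.12) comes from the text; the other rows and the initial distribution are uniform placeholders assumed purely so the example runs.

```python
ALPHABET = "acgt"

# Uniform placeholder rows (an assumption), then the row given for g.
TRANSITIONS = {prev: {nxt: 0.25 for nxt in ALPHABET} for prev in ALPHABET}
TRANSITIONS["g"] = {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12}

# Uniform initial distribution Pr(x_1), also an assumption.
INITIAL = {sym: 0.25 for sym in ALPHABET}

def sequence_probability(x):
    """Pr(x) = Pr(x_1) * product over i >= 2 of Pr(x_i | x_{i-1})."""
    prob = INITIAL[x[0]]
    for prev, cur in zip(x, x[1:]):
        prob *= TRANSITIONS[prev][cur]
    return prob

p = sequence_probability("cggt")  # Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)
print(p)
```

With these placeholder values the product is 0.25 × 0.25 × 0.38 × 0.12, roughly 0.00285.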
Example Application
CpG islands
$$\mathrm{score}(x) = \log \frac{\Pr(x \mid \text{CpGModel})}{\Pr(x \mid \text{nullModel})}$$
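A log-odds score of this kind can be sketched with two first-order Markov models. The transition values below are invented toy numbers, not estimates from real data: the "CpG" model makes c followed by g more likely than the "null" model does, and all other transitions are uniform. The initial-symbol term is assumed equal in both models and therefore dropped.

```python
import math

def uniform_model():
    """Transition table with every Pr(next | prev) set to 0.25."""
    return {p: {n: 0.25 for n in "acgt"} for p in "acgt"}

# Toy parameters (assumptions for illustration only).
cpg_model = uniform_model()
cpg_model["c"] = {"a": 0.2, "c": 0.2, "g": 0.4, "t": 0.2}
null_model = uniform_model()
null_model["c"] = {"a": 0.3, "c": 0.3, "g": 0.1, "t": 0.3}

def log_odds(x, model, null):
    """Sum of log ratios of transition probabilities along the sequence."""
    score = 0.0
    for prev, cur in zip(x, x[1:]):
        score += math.log(model[prev][cur] / null[prev][cur])
    return score

print(log_odds("cgcg", cpg_model, null_model))  # positive: looks CpG-like
print(log_odds("aaaa", cpg_model, null_model))  # zero: models agree
```

A positive score means the sequence is better explained by the CpG model, a negative score by the null model.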
Why use
$$\mathrm{score}(x) = \log \frac{\Pr(x \mid \text{CpGModel})}{\Pr(x \mid \text{nullModel})}$$
parameters
Pr(gctaca)=Pr(gctac)Pr(a|gctac)
Hidden State
Learning
Classification
Segmentation
Learning
Classification
Segmentation
Transition Probabilities
$a_{kl} = \Pr(\pi_i = l \mid \pi_{i-1} = k)$
Probability of transition from state k to state l
Emission Probabilities
$e_k(b) = \Pr(x_i = b \mid \pi_i = k)$
An HMM Example
$\Pr(AAC, \pi) = a_{01}\, e_1(A)\, a_{11}\, e_1(A)\, a_{13}\, e_3(C)\, a_{35}$
$= 0.5 \times 0.4 \times 0.2 \times 0.4 \times 0.8 \times 0.3 \times 0.6$
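The joint probability of a sequence and a state path multiplies one transition and one emission per symbol, plus a final transition into the end state. The sketch below uses only the parameters that appear in the worked product (a01 = 0.5, a11 = 0.2, a13 = 0.8, a35 = 0.6, e1(A) = 0.4, e3(C) = 0.3); state 0 is taken as the begin state and state 5 as the end state, and `joint_probability` is an illustrative name.

```python
# Transition probabilities a_kl and emission probabilities e_k(b)
# taken from the worked example (only the entries the path needs).
a = {(0, 1): 0.5, (1, 1): 0.2, (1, 3): 0.8, (3, 5): 0.6}
e = {(1, "A"): 0.4, (3, "C"): 0.3}

def joint_probability(x, path, begin=0, end=5):
    """Pr(x, path): product of a_{prev,state} * e_state(symbol), then a_{last,end}."""
    prob = 1.0
    prev = begin
    for symbol, state in zip(x, path):
        prob *= a[(prev, state)] * e[(state, symbol)]
        prev = state
    return prob * a[(prev, end)]

p = joint_probability("AAC", [1, 1, 3])
print(p)
```

For the path 1, 1, 3 this evaluates the seven-factor product above, which comes to about 0.002304.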
Initialization
Recursion
For emitting states ($i = 1, \ldots, L$):
$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$
For silent (non-emitting) states:
$$f_l(i) = \sum_k f_k(i)\, a_{kl}$$
Initialization
f0(0) = 1, f1(0) = 0, …, f5(0) = 0
f1(1) = 0.3*(1*0.5 + 0*0.2) = 0.15
f2(1) = 0.4*(1*0.5 + 0*0.8)
f1(2) = e1(A)*(f0(1)*a01 + f1(1)*a11)
= 0.4*(0*0.5 + 0.15*0.2)
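The forward recursion for emitting states can be sketched as below. The HMM is partly assumed: the values a01 = 0.5, a11 = 0.2, a13 = 0.8, a35 = 0.6, a22 = 0.8, e1(A) = 0.4 and e3(C) = 0.3 match numbers that appear in the worked examples, while every other transition and emission, the choice of T as the first symbol (so that e1(T) = 0.3 and e2(T) = 0.4), and the test sequence "TAGA" are illustrative assumptions.

```python
BEGIN, END = 0, 5

# Transition probabilities a_kl (entries not in the text are assumptions).
trans = {
    (0, 1): 0.5, (0, 2): 0.5,
    (1, 1): 0.2, (1, 3): 0.8,
    (2, 2): 0.8, (2, 4): 0.2,
    (3, 3): 0.4, (3, 5): 0.6,
    (4, 4): 0.1, (4, 5): 0.9,
}

# Emission probabilities e_k(b) for the emitting states 1..4 (mostly assumed).
emit = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    3: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    4: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def forward(x):
    """Return the table f[i][k] and Pr(x), summed over all state paths."""
    f = [{k: 0.0 for k in range(END)} for _ in range(len(x) + 1)]
    f[0][BEGIN] = 1.0  # initialization: f_0(0) = 1, all other f_k(0) = 0
    for i in range(1, len(x) + 1):
        for l in emit:
            # f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
            incoming = sum(f[i - 1][k] * trans.get((k, l), 0.0)
                           for k in range(END))
            f[i][l] = emit[l][x[i - 1]] * incoming
    # Termination: transition from each state into the end state.
    total = sum(f[len(x)][k] * trans.get((k, END), 0.0) for k in range(END))
    return f, total

f, total = forward("TAGA")
print(f[1][1], f[1][2], f[2][1], total)
```

Under these assumptions the first table entries reproduce the hand computation: f1(1) = 0.15 and f1(2) = 0.4 × (0.15 × 0.2).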
Algorithm sketch:
$O(S^2 L)$
$O(M S^2 L)$
Summary