Patrick Schmid
CSE 497
Spring 2004
What is a DNA Microarray?
Also known as DNA Chip
Allows simultaneous measurement of the
level of transcription for every gene in a
genome (gene expression)
Transcription?
Process of copying DNA into messenger RNA (mRNA)
Environment dependent!
Microarrays detect mRNA, or rather the more stable cDNA
What is a DNA Microarray? (cont.)
How do we manufacture a
microarray?
Start with individual genes, e.g. the ~6,200
genes of the yeast genome
Amplify all of them using polymerase chain
reaction (PCR)
“Spot” them on a medium, e.g. an ordinary
glass microscope slide
Each spot is about 100 µm in diameter
Spotting is done by a robot
Complex and potentially expensive task
How do we manufacture a
microarray?
Example
Example (cont.)
Overview of Example
Reading an array
Laser scans array and produces images
One laser for each color, e.g. one for green, one for
red
Image analysis, main tasks:
Noise suppression
Spot localization and detection, including the extraction of
the background intensity, the spot position, and the spot
boundary and size
Data quantification and quality assessment
Image analysis is a topic for a book of its own:
Kamberova, G. & Shah, S. "DNA Array Image Analysis: Nuts & Bolts". DNA Press LLC, 2002
Reading an array (cont.)
Block  Column  Row  Gene Name  Red    Green  Red:Green Ratio
1      1       1    tub1       2,345  2,467  0.95
1      1       2    tub2       3,589  2,158  1.66
1      1       3    sec1       4,109  1,469  2.80
1      1       4    sec2       1,500  3,589  0.42
1      1       5    sec3       1,246  1,258  0.99
1      1       6    act1       1,937  2,104  0.92
1      1       7    act2       2,561  1,562  1.64
1      1       8    fus1       2,962  3,012  0.98
1      1       9    idp2       3,585  1,209  2.97
1      1       10   idp1       2,796  1,005  2.78
1      1       11   idh1       2,170  4,245  0.51
1      1       12   idh2       1,896  2,996  0.63
1      1       13   erd1       1,023  3,354  0.31
1      1       14   erd2       1,698  2,896  0.59
Campbell & Heyer, 2003
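The Red:Green ratio column follows directly from the two intensity columns; a small sketch using three rows from the table above:

```python
# Red and green intensities for three spots, taken from the table above.
rows = [("tub1", 2345, 2467), ("sec1", 4109, 1469), ("erd1", 1023, 3354)]

# Ratio of red to green intensity, rounded as in the table.
ratios = {name: round(red / green, 2) for name, red, green in rows}
print(ratios)  # {'tub1': 0.95, 'sec1': 2.8, 'erd1': 0.31}
```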
Real DNA Microarray
Y-fold
Biologists would rather deal with folds than with ratios
A fold is nothing more than saying "times"
We express it either as a Y-fold repression or a Y-fold induction
For repression (ratio < 1), the fold is the inverse of the ratio
Ratio of 0.33 = 3-fold repression
Ratio of 10 = 10-fold induction
Fractional ratios can cause problems for techniques that analyze and compare gene expression patterns
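The ratio-to-fold conversion can be sketched in a few lines (repression folds use the inverse of the ratio):

```python
def fold(ratio):
    """Convert an expression ratio to a (Y, direction) fold description."""
    if ratio >= 1:
        return ratio, "induction"    # e.g. ratio 10 -> 10-fold induction
    return 1 / ratio, "repression"   # e.g. ratio 0.25 -> 4-fold repression

print(fold(10))    # (10, 'induction')
print(fold(0.25))  # (4.0, 'repression')
```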
Color Coding
Tables are difficult to read
Campbell & Heyer, 2003
Logarithmic transformation
log2 is commonly used
Sometimes log10 is used
Example:
log2(0.0625) = log2(1/16) =
log2(1) – log2(16) = -log2(16) = -4
log2 transformations ease identification of doublings
or halvings in ratios
log10 transformations ease identification of order of
magnitude changes
Key attribute: equally sized induction and repression
receive equal treatment visually and mathematically
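A small sketch of the transformation: ratios that are exact doublings or halvings map onto evenly spaced integer steps.

```python
import math

ratios = [0.0625, 0.25, 1, 4, 16]            # halvings and doublings
log2_vals = [math.log2(r) for r in ratios]
print(log2_vals)  # [-4.0, -2.0, 0.0, 2.0, 4.0]
```

Note that a ratio of 1 (no change) maps to 0, and a ratio and its reciprocal get values of equal magnitude and opposite sign.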
Complication: Time Series
Biologists care more about the process of
adaptation than about the end result
For example, measure every 2 hours for 10 hours
(depletion of oxygen)
31,000 gene expression ratios
Or 6,200 different graphs with five data points each
Question: Are there any genes that responded in
similar ways to the depletion of oxygen?
Example data: fold change (ratios)
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 1 8 12 16 12 8
Gene D 1 3 4 4 3 2
Gene E 1 4 8 8 8 8
Gene F 1 1 1 0.25 0.25 0.1
Gene G 1 2 3 4 3 2
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene I 1 4 8 4 1 0.5
Gene J 1 2 1 2 1 2
Gene K 1 1 1 1 3 3
Gene L 1 2 3 4 3 2
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Campbell & Heyer, 2003
Example data: log2 transformation
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 0 3 3.58 4 3.58 3
Gene D 0 1.58 2 2 1.58 1
Gene E 0 2 3 3 3 3
Gene F 0 0 0 -2 -2 -3.32
Gene G 0 1 1.58 2 1.58 1
Gene H 0 -1 -1.60 -2 -1.60 -1
Gene I 0 2 3 2 0 -1
Gene J 0 1 0 1 0 1
Gene K 0 0 0 0 1.58 1.58
Gene L 0 1 1.58 2 1.58 1
Gene M 0 -1.60 -2 -2 -1.60 -1
Gene N 0 -3 -3.59 -4 -3.59 -3
Campbell & Heyer, 2003
Pearson Correlation Coefficient r
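The formula for r, in the computational form used in the worked example on the next slide:

```latex
r = \frac{\sum XY \;-\; \frac{\sum X \sum Y}{N}}
         {\sqrt{\left(\sum X^{2} - \frac{(\sum X)^{2}}{N}\right)
                \left(\sum Y^{2} - \frac{(\sum Y)^{2}}{N}\right)}}
```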
Pearson Correlation Coefficient r (cont.)
X = Gene C = (0, 3.00, 3.58, 4, 3.58, 3)
Y = Gene D = (0, 1.58, 2.00, 2, 1.58, 1)
∑XY = (0)(0) + (3)(1.58) + (3.58)(2) + (4)(2) + (3.58)(1.58) + (3)(1) = 28.5564
∑X = 3 + 3.58 + 4 + 3.58 + 3 = 17.16
∑X² = 3² + 3.58² + 4² + 3.58² + 3² = 59.6328
∑Y = 1.58 + 2 + 2 + 1.58 + 1 = 8.16
∑Y² = 1.58² + 2² + 2² + 1.58² + 1² = 13.9928
N = 6
∑XY – ∑X∑Y/N = 28.5564 – (17.16)(8.16)/6 = 5.2188
∑X² – (∑X)²/N = 59.6328 – (17.16)²/6 = 10.5552
∑Y² – (∑Y)²/N = 13.9928 – (8.16)²/6 = 2.8952
r = 5.2188 / sqrt((10.5552)(2.8952)) = 0.944
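The same calculation as a self-contained sketch of the computational formula:

```python
from math import sqrt

def pearson(X, Y):
    """Pearson r via the computational formula worked through above."""
    N = len(X)
    num = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / N
    den = sqrt((sum(x * x for x in X) - sum(X) ** 2 / N)
               * (sum(y * y for y in Y) - sum(Y) ** 2 / N))
    return num / den

gene_C = [0, 3.00, 3.58, 4, 3.58, 3]
gene_D = [0, 1.58, 2.00, 2, 1.58, 1]
print(round(pearson(gene_C, gene_D), 3))  # 0.944
```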
Example data: Pearson correlation
coefficient
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 1 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.94 1 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E 0.96 0.84 1 -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.40 -0.10 -0.57 1 -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.95 -0.94 -0.89 0.35 -1 1 -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0.41 0.68 0.21 0.60 0.48 -0.48 1 0 -0.75 0.48 -0.68 -0.41
Gene J 0.36 0.24 0.30 -0.43 0.22 -0.21 0 1 0 0.22 -0.24 -0.36
Gene K 0.23 -0.07 0.43 -0.79 0.11 -0.11 -0.75 0 1 0.11 0.07 -0.23
Gene L 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene M -0.94 -1 -0.84 0.10 -0.94 0.94 -0.68 -0.24 0.07 -0.94 1 0.94
Gene N -1 -0.94 -0.96 0.40 -0.95 0.95 -0.41 -0.36 -0.23 -0.95 0.94 1
Example: Reorganization of data
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene K 1 1 1 1 3 3
Gene J 1 2 1 2 1 2
Gene E 1 4 8 8 8 8
Gene C 1 8 12 16 12 8
Gene L 1 2 3 4 3 2
Gene G 1 2 3 4 3 2
Gene D 1 3 4 4 3 2
Gene I 1 4 8 4 1 0.5
Gene F 1 1 1 0.25 0.25 0.1
Campbell & Heyer, 2003
Clustering of example
Clustering of entire yeast genome
Hierarchical Clustering
Algorithm:
First, find the two most similar genes in the entire
set of genes. Join these together into a cluster.
Now join the next two most similar objects (an
object can be a gene or a cluster), forming a new
cluster. Add the new cluster to the list of available
objects, and remove the two objects used to form
the new cluster. Continue this process, joining
objects in the order of their similarity to one
another, until there is only one object on the list –
a single cluster containing all genes.
(Campbell & Heyer, 2003)
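A minimal sketch of this agglomeration in Python, using average linkage (cluster similarity = mean correlation over all cross-cluster gene pairs) on the pairwise correlations of genes C, D, E, F and G from the matrix above; the merge order and values match the worked example on the following slides:

```python
from itertools import combinations

# Pairwise Pearson correlations for genes C, D, E, F, G (from the matrix above).
r = {("C", "D"): 0.94, ("C", "E"): 0.96, ("C", "F"): -0.40, ("C", "G"): 0.95,
     ("D", "E"): 0.84, ("D", "F"): -0.10, ("D", "G"): 0.94,
     ("E", "F"): -0.57, ("E", "G"): 0.89, ("F", "G"): -0.35}

def sim(a, b):
    """Leaf-level similarity, independent of argument order."""
    return r[(a, b)] if (a, b) in r else r[(b, a)]

def avg_link(c1, c2):
    """Average linkage: mean similarity over all cross-cluster leaf pairs."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

clusters = [("C",), ("D",), ("E",), ("F",), ("G",)]
merges = []
while len(clusters) > 1:
    # Join the two most similar objects (genes or clusters).
    (i, j), best = max(
        (((i, j), avg_link(clusters[i], clusters[j]))
         for i, j in combinations(range(len(clusters)), 2)),
        key=lambda t: t[1])
    merged = clusters[i] + clusters[j]
    merges.append((merged, best))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for members, s in merges:
    print(members, round(s, 3))
```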
Hierarchical Clustering (cont.)
[Upper triangle of the Pearson correlation matrix from the previous table; the first step is to scan it for the most similar pair of genes.]
Hierarchical Clustering (cont.)
The worked example continues with five genes: C, D, E, F and G. The most similar pair, Gene C and Gene E (r = 0.96), is joined into cluster 1. Cluster 1's correlation with each remaining gene is the average of Gene C's and Gene E's correlations with that gene:

          Gene D   Gene F   Gene G
1         0.89     -0.485   0.92
Gene D             -0.10    0.94
Gene F                      -0.35

Average observations:
Gene D: (0.94 + 0.84)/2 = 0.89
Gene F: (-0.40 + (-0.57))/2 = -0.485
Gene G: (0.95 + 0.89)/2 = 0.92

Dendrogram so far: cluster 1 = (C, E)
Hierarchical Clustering (cont.)
The next most similar pair is Gene D and Gene G (r = 0.94); they are joined into cluster 2.

Dendrogram so far: cluster 1 = (C, E), cluster 2 = (G, D)
Hierarchical Clustering (cont.)
          2        Gene F
1         0.905    -0.485
2                  -0.225

Clusters 1 and 2 are now the most similar pair (0.905 = (0.89 + 0.92)/2) and are joined into cluster 3.
Hierarchical Clustering (cont.)
          Gene F
3         -0.355

Finally, cluster 3 is joined with Gene F (-0.355 = (-0.485 + (-0.225))/2) into cluster 4, which contains all five genes.
Hierarchical Clustering (cont.)
Did this algorithm not look familiar?

Final dendrogram: cluster 4 = (Gene F, cluster 3), cluster 3 = (cluster 1, cluster 2), with leaves ordered F, C, E, G, D.
Hierarchical Clustering (cont.)
Hierarchical Clustering (cont.)
We differentiate hierarchical clustering algorithms by how they compute the distance between clusters when they agglomerate:
Single Linkage
Shortest link between two clusters
Complete Linkage
Longest link between two clusters
Average Linkage
Average of distances between all
pairs of objects
Average Group Linkage
Groups once formed are
represented by their mean values,
and then those are averaged
Which one did we use in the previous example?
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
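As a concrete sketch, the three pairwise linkage rules applied to the cross-cluster correlations between (C, E) and (G, D) from the worked example (note that with similarities rather than distances, the "shortest link" corresponds to the most similar pair):

```python
# Cross-cluster correlations between {C, E} and {D, G} from the example.
pairs = {("C", "D"): 0.94, ("C", "G"): 0.95, ("E", "D"): 0.84, ("E", "G"): 0.89}
vals = list(pairs.values())

single   = max(vals)              # single linkage: most similar pair
complete = min(vals)              # complete linkage: least similar pair
average  = sum(vals) / len(vals)  # average linkage: mean over all pairs

print(single, complete, round(average, 3))
```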
Clustering Overview
Clustering Overview (cont.)
Different Clustering Methods
Unsupervised
Hierarchical Clustering
k-means Clustering
Thursday
Self-organizing map
Thursday
Supervised
Support vector machine
Ensemble classifier
Data Mining
Support Vector Machines
Linear regression:
x = w0 + w1a1 + w2a2 + … + wkak
x is the class, the ai are the attribute values, and the wj are the weights
Given a distance vector Y with distances ai, to which class x does Y belong?
What do we mean by a class x?
Primitive method: Y is in one class if x < 0.5, and in the other class if x ≥ 0.5
Support Vector Machines (cont.)
Support Vector Machines (cont.)
This means…
Two pairs of classes
Weight vector for class 1:
w0(1) + w1(1)a1 + w2(1)a2 + … + wk(1)ak
Weight vector for class 2:
w0(2) + w1(2)a1 + w2(2)a2 + … + wk(2)ak
An instance will be assigned to class 1 rather than
class 2 if
w0(1) + w1(1)a1 + w2(1)a2 + … + wk(1)ak > w0(2) + w1(2)a1 + w2(2)a2 + … + wk(2)ak
We can rewrite this as
(w0(1) - w0(2)) + (w1(1) - w1(2)) a1 + … + (wk(1) - wk(2)) ak > 0
Hyperplane
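As a toy numeric sketch (the weight vectors below are invented for illustration), the comparison collapses into the sign of the difference hyperplane:

```python
# Hypothetical weight vectors for two classes, each [w0, w1, w2].
w_class1 = [0.5, 1.2, -0.7]
w_class2 = [0.1, 0.4, 0.9]

def predict(a):
    """Return 1 if the difference hyperplane (w(1) - w(2)) is positive at a, else 2."""
    diff = [u - v for u, v in zip(w_class1, w_class2)]
    score = diff[0] + sum(w * x for w, x in zip(diff[1:], a))
    return 1 if score > 0 else 2

print(predict([1.0, 0.2]))  # instance on the class-1 side
print(predict([0.0, 1.0]))  # instance on the class-2 side
```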
Support Vector Machines (cont.)
We can only represent linear boundaries between classes so far
Trick: Transform the input using a nonlinear mapping, then construct
a linear model in the new space
Example: Use all products of n factors (2 attributes, n=3):
x = w1a1³ + w2a1²a2 + w3a1a2² + w4a2³
Then use multi-response linear regression
However, for 10 attributes and including all products with 5 factors,
we would need to determine more than 2000 coefficients
Linear regression is O(n3) in time
Problem: Training is infeasible
Another problem: overfitting. The resulting model will be "too nonlinear", because there are just too many parameters in the model.
Support Vector Machines (cont.)
Convex hull of points
is the tightest
enclosing polygon
Maximum margin
hyperplane
Instances closest to
hyperplane are called
support vectors
Support vectors define
maximum margin
hyperplane uniquely
Support Vector Machines (cont.)
We only need the set of support vectors; everything else is irrelevant
A hyperplane separating two classes can then be written as
x = w0 + w1a1 + w2a2
Or
x = b + ∑i αiγi (a(i) ∙ a)
where the sum runs over the support vectors i
γi is the class value of a(i)
b and the αi are numeric values to be determined
Vector a represents a test instance
The a(i) are the support vectors
Determining b and αi is a constrained quadratic optimization
problem that can be solved with off-the-shelf software packages
Support Vector Machines do not overfit, because there are
usually only a few support vectors
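A sketch of this decision function with hypothetical support vectors, class values γi, and coefficients αi and b (in a real SVM these come out of the quadratic optimizer); the exponent n anticipates the polynomial kernel introduced on the next slide:

```python
# Hypothetical optimizer output: support vectors a(i), class values, multipliers.
support_vectors = [(1.0, 2.0), (2.0, 1.0), (-1.0, -1.0)]
gamma = [1, 1, -1]          # class value of each support vector
alpha = [0.3, 0.2, 0.5]     # learned coefficients alpha_i
b = 0.1

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def decision(a, n=1):
    """x = b + sum_i alpha_i * gamma_i * (a(i) . a)^n."""
    return b + sum(al * g * dot(sv, a) ** n
                   for sv, g, al in zip(support_vectors, gamma, alpha))

print(round(decision((1.0, 1.0)), 3))  # positive => one class, negative => the other
```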
Support Vector Machines (cont.)
Did I not introduce Support Vector Machines by
talking about non-linear class boundaries?
x = b + ∑ αiγi (a(i) ∙ a)n
n is the number of factors
(x ∙ y)n is called a polynomial kernel
A good way of choosing n is by starting with n=1
and incrementing it until estimated error ceases to
improve
If you want to know more:
SVMs in general: Witten & Frank, 2000 (lecture material
based on this)
Application to cancer classification: Cho & Won, 2003
Demo – Shneiderman
References
Brown, P. & Botstein, D. "Exploring the new world of the genome with DNA microarrays". Nature Genetics Supplement, vol. 21, January 1999
Campbell, A. & Heyer, L. "Discovering Genomics, Proteomics, & Bioinformatics". Benjamin Cummings, 2003
Cheung, V., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R. & Childs, G. "Making and reading microarrays". Nature Genetics Supplement, vol. 21, January 1999
Cho, S. & Won, H. "Machine Learning in DNA Microarray Analysis for Cancer Classification". Proceedings of the First Asia-Pacific Bioinformatics Conference, vol. 19, Australian Computer Society Inc., 2003
Eisen, M., Spellman, P., Brown, P. & Botstein, D. "Cluster analysis and display of genome-wide expression patterns". Proc. Natl. Acad. Sci. USA, vol. 95, pp. 14863–14868, December 1998
Seo, J. & Shneiderman, B. "Interactively Exploring Hierarchical Clustering Results". IEEE Computer, July 2002
Witten, I. & Frank, E. "Data Mining". Morgan Kaufmann Publishers, 2000