Patrick Schmid
CSE 497
Spring 2004
What is a DNA Microarray?
Also known as DNA Chip
Allows simultaneous measurement of the
level of transcription for every gene in a
genome (gene expression)
Transcription?
Process of copying DNA into messenger RNA (mRNA)
Environment dependent!
Microarrays detect mRNA, or rather the more stable cDNA
What is a DNA Microarray? (cont.)
How do we manufacture a
microarray?
Start with individual genes, e.g. the ~6,200
genes of the yeast genome
Amplify all of them using polymerase chain
reaction (PCR)
“Spot” them on a medium, e.g. an ordinary
glass microscope slide
Each spot is about 100 µm in diameter
Spotting is done by a robot
Complex and potentially expensive task
How do we manufacture a
microarray?
Example
Example (cont.)
Overview of Example
Reading an array
Laser scans array and produces images
One laser for each color, e.g. one for green, one for
red
Image analysis, main tasks:
Noise suppression
Spot localization and detection, including the extraction of
the background intensity, the spot position, and the spot
boundary and size
Data quantification and quality assessment
Image analysis is a topic for a book of its own:
Kamberova, G. & Shah, S. "DNA Array Image Analysis: Nuts & Bolts". DNA Press LLC, 2002
Reading an array (cont.)
Block  Column  Row  Gene Name  Red    Green  Red:Green Ratio
1      1       1    tub1       2,345  2,467  0.95
1      1       2    tub2       3,589  2,158  1.66
1      1       3    sec1       4,109  1,469  2.80
1      1       4    sec2       1,500  3,589  0.42
1      1       5    sec3       1,246  1,258  0.99
1      1       6    act1       1,937  2,104  0.92
1      1       7    act2       2,561  1,562  1.64
1      1       8    fus1       2,962  3,012  0.98
1      1       9    idp2       3,585  1,209  2.97
1      1       10   idp1       2,796  1,005  2.78
1      1       11   idh1       2,170  4,245  0.51
1      1       12   idh2       1,896  2,996  0.63
1      1       13   erd1       1,023  3,354  0.31
1      1       14   erd2       1,698  2,896  0.59
Campbell & Heyer, 2003
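The Red:Green ratio column follows directly from the two intensity columns; a small sketch using three rows from the table above:

```python
# Red and green intensities for three spots, taken from the table above.
rows = [("tub1", 2345, 2467), ("sec1", 4109, 1469), ("erd1", 1023, 3354)]

# Ratio of red to green intensity, rounded as in the table.
ratios = {name: round(red / green, 2) for name, red, green in rows}
print(ratios)  # {'tub1': 0.95, 'sec1': 2.8, 'erd1': 0.31}
```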
Real DNA Microarray
Y-fold
Biologists would rather deal with folds than with ratios
A fold is nothing more than saying "times"
We express it either as a Y-fold repression or a Y-fold induction
For repression (ratio < 1), the fold is the inverse of the ratio
Ratio of 0.33 = 3-fold repression
Ratio of 10 = 10-fold induction
Fractional ratios can cause problems for techniques that analyze and compare gene expression patterns
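The ratio-to-fold conversion can be sketched in a few lines (repression folds use the inverse of the ratio):

```python
def fold(ratio):
    """Convert an expression ratio to a (Y, direction) fold description."""
    if ratio >= 1:
        return ratio, "induction"    # e.g. ratio 10 -> 10-fold induction
    return 1 / ratio, "repression"   # e.g. ratio 0.25 -> 4-fold repression

print(fold(10))    # (10, 'induction')
print(fold(0.25))  # (4.0, 'repression')
```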
Color Coding
Tables are difficult to read
Campbell & Heyer, 2003
Logarithmic transformation
log2 is commonly used
Sometimes log10 is used
Example:
log2(0.0625) = log2(1/16) =
log2(1) – log2(16) = -log2(16) = -4
log2 transformations ease identification of doublings
or halvings in ratios
log10 transformations ease identification of order of
magnitude changes
Key attribute: equally sized induction and repression
receive equal treatment visually and mathematically
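A small sketch of the transformation: ratios that are exact doublings or halvings map onto evenly spaced integer steps.

```python
import math

ratios = [0.0625, 0.25, 1, 4, 16]            # halvings and doublings
log2_vals = [math.log2(r) for r in ratios]
print(log2_vals)  # [-4.0, -2.0, 0.0, 2.0, 4.0]
```

Note that a ratio of 1 (no change) maps to 0, and a ratio and its reciprocal get values of equal magnitude and opposite sign.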
Complication: Time Series
Biologists care more about the process of
adaptation than about the end result
For example, measure every 2 hours for 10 hours
(depletion of oxygen)
31,000 gene expression ratios
Or 6,200 different graphs with five data points each
Question: Are there any genes that responded in
similar ways to the depletion of oxygen?
Example data: fold change (ratios)
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 1 8 12 16 12 8
Gene D 1 3 4 4 3 2
Gene E 1 4 8 8 8 8
Gene F 1 1 1 0.25 0.25 0.1
Gene G 1 2 3 4 3 2
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene I 1 4 8 4 1 0.5
Gene J 1 2 1 2 1 2
Gene K 1 1 1 1 3 3
Gene L 1 2 3 4 3 2
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Campbell & Heyer, 2003
Example data: log2 transformation
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 0 3 3.58 4 3.58 3
Gene D 0 1.58 2 2 1.58 1
Gene E 0 2 3 3 3 3
Gene F 0 0 0 -2 -2 -3.32
Gene G 0 1 1.58 2 1.58 1
Gene H 0 -1 -1.60 -2 -1.60 -1
Gene I 0 2 3 2 0 -1
Gene J 0 1 0 1 0 1
Gene K 0 0 0 0 1.58 1.58
Gene L 0 1 1.58 2 1.58 1
Gene M 0 -1.60 -2 -2 -1.60 -1
Gene N 0 -3 -3.59 -4 -3.59 -3
Campbell & Heyer, 2003
Pearson Correlation Coefficient r
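The formula for r, in the computational form used in the worked example on the next slide:

```latex
r = \frac{\sum XY \;-\; \frac{\sum X \sum Y}{N}}
         {\sqrt{\left(\sum X^{2} - \frac{(\sum X)^{2}}{N}\right)
                \left(\sum Y^{2} - \frac{(\sum Y)^{2}}{N}\right)}}
```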
Pearson Correlation Coefficient r (cont.)
X = Gene C = (0, 3.00, 3.58, 4, 3.58, 3)
Y = Gene D = (0, 1.58, 2.00, 2, 1.58, 1)
∑XY = (0)(0) + (3)(1.58) + (3.58)(2) + (4)(2) + (3.58)(1.58) + (3)(1) = 28.5564
∑X = 3 + 3.58 + 4 + 3.58 + 3 = 17.16
∑X² = 3² + 3.58² + 4² + 3.58² + 3² = 59.6328
∑Y = 1.58 + 2 + 2 + 1.58 + 1 = 8.16
∑Y² = 1.58² + 2² + 2² + 1.58² + 1² = 13.9928
N = 6
∑XY – ∑X∑Y/N = 28.5564 – (17.16)(8.16)/6 = 5.2188
∑X² – (∑X)²/N = 59.6328 – (17.16)²/6 = 10.5552
∑Y² – (∑Y)²/N = 13.9928 – (8.16)²/6 = 2.8952
r = 5.2188 / sqrt((10.5552)(2.8952)) = 0.944
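The same calculation as a self-contained sketch of the computational formula:

```python
from math import sqrt

def pearson(X, Y):
    """Pearson r via the computational formula worked through above."""
    N = len(X)
    num = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / N
    den = sqrt((sum(x * x for x in X) - sum(X) ** 2 / N)
               * (sum(y * y for y in Y) - sum(Y) ** 2 / N))
    return num / den

gene_C = [0, 3.00, 3.58, 4, 3.58, 3]
gene_D = [0, 1.58, 2.00, 2, 1.58, 1]
print(round(pearson(gene_C, gene_D), 3))  # 0.944
```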
Example data: Pearson correlation
coefficient
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 1 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.94 1 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E 0.96 0.84 1 -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.40 -0.10 -0.57 1 -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.95 -0.94 -0.89 0.35 -1 1 -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0.41 0.68 0.21 0.60 0.48 -0.48 1 0 -0.75 0.48 -0.68 -0.41
Gene J 0.36 0.24 0.30 -0.43 0.22 -0.21 0 1 0 0.22 -0.24 -0.36
Gene K 0.23 -0.07 0.43 -0.79 0.11 -0.11 -0.75 0 1 0.11 0.07 -0.23
Gene L 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene M -0.94 -1 -0.84 0.10 -0.94 0.94 -0.68 -0.24 0.07 -0.94 1 0.94
Gene N -1 -0.94 -0.96 0.40 -0.95 0.95 -0.41 -0.36 -0.23 -0.95 0.94 1
Example: Reorganization of data
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene K 1 1 1 1 3 3
Gene J 1 2 1 2 1 2
Gene E 1 4 8 8 8 8
Gene C 1 8 12 16 12 8
Gene L 1 2 3 4 3 2
Gene G 1 2 3 4 3 2
Gene D 1 3 4 4 3 2
Gene I 1 4 8 4 1 0.5
Gene F 1 1 1 0.25 0.25 0.1
Campbell & Heyer, 2003
Clustering of example
Clustering of entire yeast genome
Hierarchical Clustering
Algorithm:
First, find the two most similar genes in the entire
set of genes. Join these together into a cluster.
Now join the next two most similar objects (an
object can be a gene or a cluster), forming a new
cluster. Add the new cluster to the list of available
objects, and remove the two objects used to form
the new cluster. Continue this process, joining
objects in the order of their similarity to one
another, until there is only one object on the list –
a single cluster containing all genes.
(Campbell & Heyer, 2003)
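A minimal sketch of this agglomeration in Python, using average linkage (cluster similarity = mean correlation over all cross-cluster gene pairs) on the pairwise correlations of genes C, D, E, F and G from the matrix above; the merge order and values match the worked example on the following slides:

```python
from itertools import combinations

# Pairwise Pearson correlations for genes C, D, E, F, G (from the matrix above).
r = {("C", "D"): 0.94, ("C", "E"): 0.96, ("C", "F"): -0.40, ("C", "G"): 0.95,
     ("D", "E"): 0.84, ("D", "F"): -0.10, ("D", "G"): 0.94,
     ("E", "F"): -0.57, ("E", "G"): 0.89, ("F", "G"): -0.35}

def sim(a, b):
    """Leaf-level similarity, independent of argument order."""
    return r[(a, b)] if (a, b) in r else r[(b, a)]

def avg_link(c1, c2):
    """Average linkage: mean similarity over all cross-cluster leaf pairs."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

clusters = [("C",), ("D",), ("E",), ("F",), ("G",)]
merges = []
while len(clusters) > 1:
    # Join the two most similar objects (genes or clusters).
    (i, j), best = max(
        (((i, j), avg_link(clusters[i], clusters[j]))
         for i, j in combinations(range(len(clusters)), 2)),
        key=lambda t: t[1])
    merged = clusters[i] + clusters[j]
    merges.append((merged, best))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for members, s in merges:
    print(members, round(s, 3))
```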
Hierarchical Clustering (cont.)
[Upper triangle of the Pearson correlation matrix from the previous table; the first step is to scan it for the most similar pair of genes.]
Hierarchical Clustering (cont.)
The worked example continues with five genes: C, D, E, F and G. The most similar pair, Gene C and Gene E (r = 0.96), is joined into cluster 1. Cluster 1's correlation with each remaining gene is the average of Gene C's and Gene E's correlations with that gene:

          Gene D   Gene F   Gene G
1         0.89     -0.485   0.92
Gene D             -0.10    0.94
Gene F                      -0.35

Average observations:
Gene D: (0.94 + 0.84)/2 = 0.89
Gene F: (-0.40 + (-0.57))/2 = -0.485
Gene G: (0.95 + 0.89)/2 = 0.92

Dendrogram so far: cluster 1 = (C, E)
Hierarchical Clustering (cont.)
The next most similar pair is Gene D and Gene G (r = 0.94); they are joined into cluster 2.

Dendrogram so far: cluster 1 = (C, E), cluster 2 = (G, D)
Hierarchical Clustering (cont.)
          2        Gene F
1         0.905    -0.485
2                  -0.225

Clusters 1 and 2 are now the most similar pair (0.905 = (0.89 + 0.92)/2) and are joined into cluster 3.
Hierarchical Clustering (cont.)
          Gene F
3         -0.355

Finally, cluster 3 is joined with Gene F (-0.355 = (-0.485 + (-0.225))/2) into cluster 4, which contains all five genes.
Hierarchical Clustering (cont.)
Did this algorithm not look familiar?

Final dendrogram: cluster 4 = (Gene F, cluster 3), cluster 3 = (cluster 1, cluster 2), with leaves ordered F, C, E, G, D.
Hierarchical Clustering (cont.)
Hierarchical Clustering (cont.)
We differentiate hierarchical clustering algorithms by how they compute the distance between clusters when they agglomerate:
Single Linkage
Shortest link between two clusters
Complete Linkage
Longest link between two clusters
Average Linkage
Average of distances between all
pairs of objects
Average Group Linkage
Groups once formed are
represented by their mean values,
and then those are averaged
Which one did we use in the previous example?
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
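As a concrete sketch, the three pairwise linkage rules applied to the cross-cluster correlations between (C, E) and (G, D) from the worked example (note that with similarities rather than distances, the "shortest link" corresponds to the most similar pair):

```python
# Cross-cluster correlations between {C, E} and {D, G} from the example.
pairs = {("C", "D"): 0.94, ("C", "G"): 0.95, ("E", "D"): 0.84, ("E", "G"): 0.89}
vals = list(pairs.values())

single   = max(vals)              # single linkage: most similar pair
complete = min(vals)              # complete linkage: least similar pair
average  = sum(vals) / len(vals)  # average linkage: mean over all pairs

print(single, complete, round(average, 3))
```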
Clustering Overview
Clustering Overview (cont.)
Different Clustering Methods
Unsupervised
Hierarchical Clustering
k-means Clustering
Thursday
Self-organizing map
Thursday
Supervised
Support vector machine
Ensemble classifier
Data Mining
Support Vector Machines
Linear regression:
x = w0 + w1a1 + w2a2 + … + wkak
x is the class, the ai are the attribute values, and the wj are the weights
Given a distance vector Y with distances ai, to which class x does Y belong?
What do we mean by a class x?
Primitive method: Y is in one class if x < 0.5, and in the other class if x ≥ 0.5
Support Vector Machines (cont.)
Support Vector Machines (cont.)
This means…
Two pairs of classes
Weight vector for class 1:
w0(1) + w1(1)a1 + w2(1)a2 + … + wk(1)ak
Weight vector for class 2:
w0(2) + w1(2)a1 + w2(2)a2 + … + wk(2)ak
An instance will be assigned to class 1 rather than
class 2 if
w0(1) + w1(1)a1 + w2(1)a2 + … + wk(1)ak > w0(2) + w1(2)a1 + w2(2)a2 + … + wk(2)ak
We can rewrite this as
(w0(1) - w0(2)) + (w1(1) - w1(2)) a1 + … + (wk(1) - wk(2)) ak > 0
Hyperplane
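As a toy numeric sketch (the weight vectors below are invented for illustration), the comparison collapses into the sign of the difference hyperplane:

```python
# Hypothetical weight vectors for two classes, each [w0, w1, w2].
w_class1 = [0.5, 1.2, -0.7]
w_class2 = [0.1, 0.4, 0.9]

def predict(a):
    """Return 1 if the difference hyperplane (w(1) - w(2)) is positive at a, else 2."""
    diff = [u - v for u, v in zip(w_class1, w_class2)]
    score = diff[0] + sum(w * x for w, x in zip(diff[1:], a))
    return 1 if score > 0 else 2

print(predict([1.0, 0.2]))  # instance on the class-1 side
print(predict([0.0, 1.0]))  # instance on the class-2 side
```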
Support Vector Machines (cont.)
We can only represent linear boundaries between classes so far
Trick: Transform the input using a nonlinear mapping, then construct
a linear model in the new space
Example: Use all products of n factors (2 attributes, n=3):
x = w1a1³ + w2a1²a2 + w3a1a2² + w4a2³
Then use multi-response linear regression
However, for 10 attributes and including all products with 5 factors,
we would need to determine more than 2000 coefficients
Linear regression is O(n3) in time
Problem: Training is infeasible
Another problem: overfitting. The resulting model will be "too nonlinear", because there are just too many parameters in the model.
Support Vector Machines (cont.)
Convex hull of points
is the tightest
enclosing polygon
Maximum margin
hyperplane
Instances closest to
hyperplane are called
support vectors
Support vectors define
maximum margin
hyperplane uniquely
Support Vector Machines (cont.)
We only need the set of support vectors; everything else is irrelevant
A hyperplane separating two classes can then be written as
x = w0 + w1a1 + w2a2
Or
x = b + ∑i αiγi (a(i) ∙ a)
where the sum runs over the support vectors i
γi is the class value of a(i)
b and the αi are numeric values to be determined
Vector a represents a test instance
The a(i) are the support vectors
Determining b and αi is a constrained quadratic optimization
problem that can be solved with off-the-shelf software packages
Support Vector Machines do not overfit, because there are
usually only a few support vectors
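A sketch of this decision function with hypothetical support vectors, class values γi, and coefficients αi and b (in a real SVM these come out of the quadratic optimizer); the exponent n anticipates the polynomial kernel introduced on the next slide:

```python
# Hypothetical optimizer output: support vectors a(i), class values, multipliers.
support_vectors = [(1.0, 2.0), (2.0, 1.0), (-1.0, -1.0)]
gamma = [1, 1, -1]          # class value of each support vector
alpha = [0.3, 0.2, 0.5]     # learned coefficients alpha_i
b = 0.1

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def decision(a, n=1):
    """x = b + sum_i alpha_i * gamma_i * (a(i) . a)^n."""
    return b + sum(al * g * dot(sv, a) ** n
                   for sv, g, al in zip(support_vectors, gamma, alpha))

print(round(decision((1.0, 1.0)), 3))  # positive => one class, negative => the other
```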
Support Vector Machines (cont.)
Did I not introduce Support Vector Machines by
talking about non-linear class boundaries?
x = b + ∑ αiγi (a(i) ∙ a)n
n is the number of factors
(x ∙ y)n is called a polynomial kernel
A good way of choosing n is by starting with n=1
and incrementing it until estimated error ceases to
improve
If you want to know more:
SVMs in general: Witten & Frank, 2000 (lecture material
based on this)
Application to cancer classification: Cho & Won, 2003
Demo – Shneiderman
References
Brown, P. & Botstein, D. "Exploring the new world of the genome with DNA microarrays". Nature Genetics Supplement, vol. 21, January 1999
Campbell, A. & Heyer, L. "Discovering Genomics, Proteomics, & Bioinformatics". Benjamin Cummings, 2003
Cheung, V., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R. & Childs, G. "Making and reading microarrays". Nature Genetics Supplement, vol. 21, January 1999
Cho, S. & Won, H. "Machine Learning in DNA Microarray Analysis for Cancer Classification". Proceedings of the First Asia-Pacific Bioinformatics Conference, vol. 19, Australian Computer Society Inc., 2003
Eisen, M., Spellman, P., Brown, P. & Botstein, D. "Cluster analysis and display of genome-wide expression patterns". Proc. Natl. Acad. Sci. USA, vol. 95, pp. 14863–14868, December 1998
Seo, J. & Shneiderman, B. "Interactively Exploring Hierarchical Clustering Results". IEEE Computer, July 2002
Witten, I. & Frank, E. "Data Mining". Morgan Kaufmann Publishers, 2000