Principal Component Analysis

See online tutorials such as
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
[Figure: scatter plot of data points in the (X1, X2) plane with two new axes Y1 and Y2. Note: Y1 is the first eigenvector, Y2 is the second; Y2 is ignorable. Key observation: the variance along Y1 is the largest.]
Principal Component Analysis: one attribute first

Temperature samples: 42, 40, 30, 35, 30, 40, 30

Sample variance:

    s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
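The sample-variance formula above can be checked numerically. A minimal sketch in NumPy, assuming the seven temperature values listed on this slide:

```python
import numpy as np

# Temperature samples from the slide
temps = np.array([42, 40, 30, 35, 30, 40, 30], dtype=float)

# Sample variance: sum of squared deviations divided by (n - 1)
n = len(temps)
s2 = np.sum((temps - temps.mean()) ** 2) / (n - 1)

# np.var with ddof=1 uses the same (n - 1) denominator
print(s2)
```

Note the `ddof=1` convention: `np.var` divides by n by default, so the sample variance needs `ddof=1` to match the formula.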
Now consider two dimensions

X=Temperature  Y=Humidity
40  90
40  90
40  90
30  90
15  70
15  70
15  70
30  90
15  70
30  70
30  70
30  90
40  70

Covariance measures the correlation between X and Y:
cov(X,Y)=0: independent
cov(X,Y)>0: move in the same direction
cov(X,Y)<0: move in opposite directions

    \mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
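The covariance formula can be verified on the (Temperature, Humidity) pairs above; a minimal sketch, using `np.cov` as a cross-check:

```python
import numpy as np

# (Temperature, Humidity) pairs from the slide
X = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40], float)
Y = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70], float)

# cov(X, Y) = sum((X_i - mean_X) * (Y_i - mean_Y)) / (n - 1)
n = len(X)
cov_xy = np.sum((X - X.mean()) * (Y - Y.mean())) / (n - 1)

# np.cov returns the full 2x2 matrix; entry [0, 1] is cov(X, Y)
print(cov_xy)
```

Here `cov_xy` comes out positive, consistent with temperature and humidity moving in the same direction in this data.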
More than two attributes: covariance matrix

Contains covariance values between all possible dimensions (=attributes):

    C_{n \times n} = (c_{ij} \mid c_{ij} = \mathrm{cov}(Dim_i, Dim_j))

Example of a matrix acting on one of its eigenvectors:

    \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}
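Building the covariance matrix for several attributes is one call in NumPy. A sketch, assuming a small hypothetical data matrix with rows as observations and columns as dimensions:

```python
import numpy as np

# Hypothetical data: rows = observations, columns = dimensions (attributes)
D = np.array([[40, 90], [30, 90], [15, 70], [30, 70], [40, 70]], float)

# C[i, j] = cov(Dim_i, Dim_j); rowvar=False makes the columns the variables
C = np.cov(D, rowvar=False)
print(C)
```

The diagonal entries are the variances of the individual dimensions, and the matrix is symmetric because cov(Dim_i, Dim_j) = cov(Dim_j, Dim_i).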
Eigenvalues & eigenvectors

    A x = \lambda x \iff (A - \lambda I) x = 0

How to calculate x and \lambda:
Calculate det(A - \lambda I); this yields a polynomial of degree n
Determine the roots of det(A - \lambda I) = 0; the roots are the eigenvalues \lambda
Solve (A - \lambda I) x = 0 for each \lambda to obtain the eigenvectors x
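In practice the characteristic polynomial is rarely solved by hand; `np.linalg.eig` does both steps. A sketch using the worked example matrix from the previous slide:

```python
import numpy as np

# A x = lambda x, with the example matrix [[2, 3], [2, 1]]
A = np.array([[2.0, 3.0], [2.0, 1.0]])
vals, vecs = np.linalg.eig(A)

# The largest eigenvalue is 4, with eigenvector proportional to (3, 2)
i = int(np.argmax(vals))
print(vals[i], vecs[:, i])
```

`eig` returns eigenvectors as columns, normalized to unit length, so the returned vector is a scalar multiple of (3, 2).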
Principal components

First principal component (PC1):
The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector, the direction along which there is the greatest variation.

Second principal component (PC2):
The direction with the maximum variation left in the data, orthogonal to PC1.

In general, only a few directions capture most of the variability in the data.
Steps of PCA

Let \bar{X} be the mean vector (taking the mean of all rows).
Adjust the original data by the mean: X' = X - \bar{X}
Compute the covariance matrix C of the adjusted X.
Find the eigenvectors and eigenvalues of C.

For matrix C, an eigenvector is a (column) vector e having the same direction as Ce, i.e. Ce = \lambda e; \lambda is called an eigenvalue of C:

    Ce = \lambda e \iff (C - \lambda I)e = 0

Most data mining packages do this for you.
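The steps above can be sketched as one small function. This is a minimal sketch in NumPy, not a package implementation; the helper name `pca` and the example data (the X1, X2 columns from the later example slide) are chosen for illustration:

```python
import numpy as np

def pca(X, p):
    """Sketch of the steps above: mean-adjust, covariance matrix,
    eigendecomposition, keep the top-p eigenvectors."""
    mean = X.mean(axis=0)            # mean vector over all rows
    Xc = X - mean                    # adjust the original data by the mean
    C = np.cov(Xc, rowvar=False)     # covariance matrix of adjusted X
    vals, vecs = np.linalg.eigh(C)   # eigh: C is symmetric
    order = np.argsort(vals)[::-1]   # largest eigenvalue first
    return vals[order][:p], vecs[:, order[:p]], mean

# Example data from the slides (X1, X2 columns)
X = np.array([[19, 63], [39, 74], [30, 87], [30, 23], [15, 32]], float)
vals, vecs, mean = pca(X, 1)
```

`eigh` is used instead of `eig` because a covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.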
Eigenvalues

Calculate the eigenvalues \lambda and eigenvectors x of the covariance matrix.

The eigenvalues \lambda_j are used to calculate the percentage of total variance V_j captured by each component j:

    V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x}
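The percentage-of-variance formula is a one-liner; a sketch with hypothetical eigenvalues (the values 4.5, 2.0, 1.0, 0.5 are assumptions for illustration):

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix
eigvals = np.array([4.5, 2.0, 1.0, 0.5])

# V_j = 100 * lambda_j / sum of all eigenvalues
V = 100.0 * eigvals / eigvals.sum()
print(V)
```

By construction the percentages sum to 100, which is what a scree plot such as the one on the next slide displays per component.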
Principal components - Variance

[Figure: scree plot showing the variance (%) captured by each of PC1-PC10; the y-axis runs from 0 to 25%.]
Transformed Data

The eigenvalue \lambda_j corresponds to the variance on each component j.
Thus, sort the eigenvectors by \lambda_j.
Take the first p eigenvectors e_i, where p is the number of top eigenvalues.
These are the directions with the largest variances.

    \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix} = \begin{pmatrix} e_1^T \\ e_2^T \\ \vdots \\ e_p^T \end{pmatrix} \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix}
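The transformation is a matrix-vector product: eigenvectors stacked as rows times the mean-adjusted observation. A sketch with hypothetical numbers (the orthonormal directions, observation, and mean below are assumptions for illustration):

```python
import numpy as np

# Hypothetical orthonormal directions stacked as rows (p = 2, n = 2)
E = np.array([[0.8, 0.6],
              [-0.6, 0.8]])
x_i = np.array([3.0, 4.0])      # one observation
x_bar = np.array([1.0, 2.0])    # mean vector

# y_ik = e_k^T (x_i - x_bar), all components at once
y_i = E @ (x_i - x_bar)
print(y_i)
```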
An Example

Mean1=24.1, Mean2=53.8

X1   X2   X1'    X2'
19   63   -5.1    9.25
39   74   14.9   20.25
30   87    5.9   33.25
30   23    5.9  -30.75
15   32   -9.1  -21.75

[Figure: scatter plots of the original data (Series1) and of the mean-adjusted data.]
Covariance Matrix

    C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}
If we only keep one dimension: e2

We keep the dimension of e_2 = (0.21, -0.98).

    y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} = 0.21 \, x_{i1} - 0.98 \, x_{i2}

Projected values y_i: -10.14, -16.72, -31.35, 31.374

[Figure: the projected values y_i plotted along the e2 direction.]
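The projected values on this slide can be reproduced from the mean-adjusted columns (X1', X2') of the earlier example; a minimal sketch:

```python
import numpy as np

# Mean-adjusted rows (X1', X2') from the example slide, and e2 from this one
adj = np.array([[-5.1,   9.25],
                [14.9,  20.25],
                [ 5.9,  33.25],
                [ 5.9, -30.75]])
e2 = np.array([0.21, -0.98])

# y_i = 0.21 * x_i1 - 0.98 * x_i2, for all rows at once
y = adj @ e2
print(y)
```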
PCA -> Original Data

Retrieving the old data (e.g. in data compression):

    RetrievedRowData = (RowFeatureVector^T x FinalData) + OriginalMean

This yields the original data using the chosen components.
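The retrieval formula can be sketched end to end: project with the eigenvector rows, back-project with their transpose, then add the mean back. The small data matrix below is a hypothetical example; keeping all components gives exact reconstruction:

```python
import numpy as np

# Hypothetical data: rows = observations
X = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 4.0], [5.0, 6.0]])
mean = X.mean(axis=0)
Xc = X - mean

vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
E = vecs[:, np.argsort(vals)[::-1]].T   # RowFeatureVector: eigenvectors as rows

final = E @ Xc.T                        # FinalData: projected data
retrieved = (E.T @ final).T + mean      # RowFeatureVector^T x FinalData + mean
```

Dropping some rows of `E` before projecting gives the lossy, compressed version instead of the exact one.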
Principal components

General properties of principal components:
summary variables
linear combinations of the original variables
uncorrelated with each other
capture as much of the original variance as possible
Applications: Gene expression analysis

Reference: Raychaudhuri et al. (2000)
Purpose: determine a core set of conditions for useful gene comparison
Dimensions: conditions; observations: genes
Yeast sporulation dataset (7 conditions, 6118 genes)
Result: two components capture most of the variability (90%)
Issues: uneven data intervals, data dependencies
PCA is common prior to clustering
Crisp clustering questioned: genes may correlate with multiple clusters
Alternative: determination of genes' closest neighbours
Two-Way (Angle) Data Analysis

Genes: 10^3-10^4; Samples/Conditions: 10^1-10^2
PCA - example
PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 8973 genes
reduced to 2
PCA on 100 top significant genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 100 genes
reduced to 2
PCA of genes (Leukemia data)
Plot of 8973 genes, dimension of 34 patients reduced to 2