
Principal Component Analysis

CMPUT 466/551

Nilanjan Ray

Overview

Principal component analysis (PCA) is a way to reduce data dimensionality.

PCA projects high-dimensional data to a lower dimension.

PCA projects the data in the least-squares sense: it captures the big (principal) variability in the data and ignores the small variability.
PCA: An Intuitive Approach

Let us say we have x_i, i = 1...N data points in p dimensions (p is large).

If we want to represent the data set by a single point x_0, then the natural choice is the sample mean:

x_0 = m = \frac{1}{N} \sum_{i=1}^{N} x_i

Can we justify this choice mathematically? It turns out that if you minimize

J_0(x_0) = \sum_{i=1}^{N} \| x_0 - x_i \|^2

you get the above solution, viz., the sample mean.

Source: Chapter 3 of [DHS]
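As a quick numerical sanity check of this claim, the following sketch (NumPy, with illustrative random data) confirms that moving x_0 away from the sample mean can only increase J_0:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))        # N = 100 points in p = 4 dimensions (illustrative)
m = x.mean(axis=0)                   # sample mean

def J0(x0):
    """J_0(x_0) = sum_i ||x_0 - x_i||^2"""
    return ((x0 - x) ** 2).sum()

# The sample mean minimizes J0: any perturbation increases the objective.
assert J0(m) <= J0(m + 0.1)
assert J0(m) <= J0(m + rng.normal(size=4))
```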
PCA: An Intuitive Approach

Representing the data set x_i, i = 1...N by its mean is quite uninformative.

So let's try to represent the data by a straight line of the form:

x = m + a e

This is the equation of a straight line that passes through m; e is a unit vector along the straight line, and the signed distance of a point x from m is a.

The training points projected onto this straight line would be:

x_i = m + a_i e,  i = 1...N
PCA: An Intuitive Approach

Form the objective function:

J_1(a_1, a_2, ..., a_N, e) = \sum_{i=1}^{N} \| (m + a_i e) - x_i \|^2
                           = \sum_{i=1}^{N} a_i^2 \|e\|^2 - 2 \sum_{i=1}^{N} a_i e^T (x_i - m) + \sum_{i=1}^{N} \| x_i - m \|^2

Let's now determine the a_i's. Partially differentiating with respect to a_i we get:

a_i = e^T (x_i - m)

Plugging in this expression for a_i in J_1 we get:

J_1(e) = - e^T S e + \sum_{i=1}^{N} \| x_i - m \|^2

where

S = \sum_{i=1}^{N} (x_i - m)(x_i - m)^T

is called the scatter matrix. So minimizing J_1 is equivalent to maximizing e^T S e.
PCA: An Intuitive Approach

Maximize e^T S e subject to the constraint that e is a unit vector:

e^T e = 1

Use the Lagrange multiplier method to form the objective function:

e^T S e - \lambda (e^T e - 1)

Differentiate to obtain the equation:

2 S e - 2 \lambda e = 0, or S e = \lambda e

The solution is that e is the eigenvector of S corresponding to the largest eigenvalue.
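A minimal NumPy sketch of this result (synthetic data with illustrative sizes): the unit vector maximizing e^T S e is the top eigenvector of the scatter matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# N = 200 points in p = 5 dimensions, given correlated coordinates (illustrative).
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

m = X.mean(axis=0)
Xc = X - m
S = Xc.T @ Xc                         # scatter matrix S = sum_i (x_i - m)(x_i - m)^T

# S is symmetric, so eigh applies; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
e = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

# e attains the maximum of e^T S e over unit vectors:
assert np.isclose(e @ S @ e, eigvals[-1])
```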
PCA: An Intuitive Approach

The preceding analysis can be extended in the following way. Instead of projecting the data points onto a straight line, we may now want to project them onto a d-dimensional plane of the form:

x = m + a_1 e_1 + ... + a_d e_d

where d is much smaller than the original dimension p. In this case one can form the objective function:

J_d = \sum_{i=1}^{N} \| (m + \sum_{k=1}^{d} a_{ik} e_k) - x_i \|^2

It can also be shown that the vectors e_1, e_2, ..., e_d are the d eigenvectors corresponding to the d largest eigenvalues of the scatter matrix

S = \sum_{i=1}^{N} (x_i - m)(x_i - m)^T
PCA: Visually
Data points are represented in a rotated orthogonal coordinate system: the origin
is the mean of the data points and the axes are provided by the eigenvectors.
Computation of PCA

In practice we compute PCA via the SVD (singular value decomposition).

Form the centered data matrix:

X_{p,N} = [ (x_1 - m), ..., (x_N - m) ]

Compute its SVD:

X_{p,N} = U_{p,p} D_{p,p} (V_{N,p})^T

U and V are orthogonal matrices; D is a diagonal matrix.
Computation of PCA

Note that the scatter matrix can be written as:

S = X X^T = U D^2 U^T

So the eigenvectors of S are the columns of U, and the eigenvalues are the diagonal elements of D^2.

Take only a few significant eigenvalue-eigenvector pairs, d << p; the new reduced-dimension representation becomes:

\tilde{x}_i = m + U_{p,d} (U_{p,d})^T (x_i - m)
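The reduced-dimension representation can be sketched as follows (NumPy; the sizes and the choice d = 3 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p, N, d = 8, 300, 3                     # illustrative sizes with d << p
X_raw = rng.normal(size=(p, N))

m = X_raw.mean(axis=1, keepdims=True)
X = X_raw - m                           # centered data matrix
U, _, _ = np.linalg.svd(X, full_matrices=False)
Ud = U[:, :d]                           # the d most significant eigenvectors

# x~_i = m + U_{p,d} (U_{p,d})^T (x_i - m), applied to all points at once:
X_tilde = m + Ud @ (Ud.T @ X)

# Each x~_i lives in the d-dimensional affine subspace m + span(Ud).
err = np.linalg.norm(X_raw - X_tilde)   # residual discarded by the projection
```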
Computation of PCA

Sometimes we are given only a few high-dimensional data points, i.e., p >> N.

In such cases compute the SVD of X^T:

(X^T)_{N,p} = V_{N,N} D_{N,N} (U_{p,N})^T

So that we get:

X_{p,N} = U_{p,N} D_{N,N} (V_{N,N})^T

Then proceed as before; choose only d < N significant eigenvalues for data representation:

\tilde{x}_i = m + U_{p,d} (U_{p,d})^T (x_i - m)
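When p >> N, the economy-size SVD avoids ever forming a p x p matrix; a sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
p, N = 5000, 40                         # many dimensions, few points (illustrative)
X_raw = rng.normal(size=(p, N))

m = X_raw.mean(axis=1, keepdims=True)
X = X_raw - m                           # centered data matrix

# full_matrices=False keeps U at p x N (not p x p), matching X = U_{p,N} D_{N,N} V^T.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert U.shape == (p, N) and Vt.shape == (N, N)

# Centering costs one degree of freedom, so at most N - 1 singular values are nonzero.
num_nonzero = (d > 1e-8).sum()
```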
PCA: A Gaussian Viewpoint

x \sim \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - m)^T \Sigma^{-1} (x - m) \right) = \prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}\, \sigma_i} \exp\left( -\frac{(u_i^T (x - m))^2}{2 \sigma_i^2} \right)

where the covariance matrix \Sigma is estimated from the scatter matrix as (1/N)S; the u_i's and \sigma_i^2's are respectively the eigenvectors and eigenvalues of \Sigma = (1/N)S.

If p is large, then we need an even larger number of data points to estimate the covariance matrix. So, when a limited number of training data points is available, the estimate of the covariance matrix goes quite wrong. This is known as the curse of dimensionality in this context.

To combat the curse of dimensionality, we discard the smaller eigenvalues and are content with:

x \sim \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\, \sigma_i} \exp\left( -\frac{(u_i^T (x - m))^2}{2 \sigma_i^2} \right), where d < \min(p, N)
PCA Examples
Image compression example

Novelty detection example
Kernel PCA

The assumption behind PCA is that the data points x are multivariate Gaussian.

Often this assumption does not hold.

However, it may still be possible that a transformation φ(x) is Gaussian; then we can perform PCA in the space of φ(x).

Kernel PCA performs this PCA; however, because of the kernel trick, it never computes the mapping φ(x) explicitly!
KPCA: Basic Idea
Kernel PCA Formulation

We need the following fact. Let v be an eigenvector of the scatter matrix

S = \sum_{i=1}^{N} x_i x_i^T

Then v belongs to the linear space spanned by the data points x_i, i = 1, 2, ..., N.

Proof:

\lambda v = S v = \sum_{i=1}^{N} x_i x_i^T v = \sum_{i=1}^{N} (x_i^T v) x_i, so v = \frac{1}{\lambda} \sum_{i=1}^{N} (x_i^T v) x_i
Kernel PCA Formulation

Let C be the scatter matrix of the centered mapping φ(x):

C = \sum_{i=1}^{N} \phi(x_i) \phi(x_i)^T

Let w be an eigenvector of C; then, by the fact above, w can be written as a linear combination:

w = \sum_{k=1}^{N} \alpha_k \phi(x_k)

Also, we have the eigen equation:

C w = \lambda w

Combining, we get:

\left( \sum_{i=1}^{N} \phi(x_i) \phi(x_i)^T \right) \left( \sum_{k=1}^{N} \alpha_k \phi(x_k) \right) = \lambda \sum_{k=1}^{N} \alpha_k \phi(x_k)
Kernel PCA Formulation

Taking the inner product of both sides with each φ(x_l):

\phi(x_l)^T \left( \sum_{i=1}^{N} \phi(x_i) \phi(x_i)^T \right) \left( \sum_{k=1}^{N} \alpha_k \phi(x_k) \right) = \lambda\, \phi(x_l)^T \sum_{k=1}^{N} \alpha_k \phi(x_k), \quad l = 1, 2, ..., N

gives

K^2 \alpha = \lambda K \alpha, where K_{ij} = \phi(x_i)^T \phi(x_j)

K is called the kernel or Gram matrix.
Kernel PCA Formulation

From the eigen equation

K \alpha = \lambda \alpha

and the fact that the eigenvector w is normalized to 1, we obtain:

\| w \|^2 = \left( \sum_{i=1}^{N} \alpha_i \phi(x_i) \right)^T \left( \sum_{i=1}^{N} \alpha_i \phi(x_i) \right) = \alpha^T K \alpha = \lambda\, \alpha^T \alpha = 1
KPCA Algorithm

Step 1: Compute the Gram matrix: K_{ij} = k(x_i, x_j), i, j = 1, ..., N

Step 2: Compute the (eigenvalue, eigenvector) pairs of K: (\lambda_l, \alpha^l), l = 1, ..., M

Step 3: Normalize the eigenvectors: \alpha^l \leftarrow \alpha^l / \sqrt{\lambda_l}

Thus, an eigenvector w^l of C is now represented as:

w^l = \sum_{k=1}^{N} \alpha_k^l \phi(x_k)

To project a test feature φ(x) onto w^l we need to compute:

(w^l)^T \phi(x) = \sum_{k=1}^{N} \alpha_k^l \phi(x_k)^T \phi(x) = \sum_{k=1}^{N} \alpha_k^l k(x_k, x)

So, we never need φ explicitly.
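The three steps and the projection formula can be sketched as follows (NumPy; the RBF kernel and its gamma parameter are illustrative choices rather than anything from the slides, and feature-map centering is omitted here):

```python
import numpy as np

def kpca(X, d, gamma=0.5):
    """Steps 1-3 above; returns the training data and normalized coefficients."""
    # Step 1: Gram matrix K_ij = k(x_i, x_j) with an RBF kernel (illustrative).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    # Step 2: (eigenvalue, eigenvector) pairs of K; eigh returns ascending order.
    lam, alpha = np.linalg.eigh(K)
    lam, alpha = lam[::-1][:d], alpha[:, ::-1][:, :d]
    # Step 3: normalize so that ||w^l|| = 1, i.e. (alpha^l)^T alpha^l = 1 / lambda_l.
    return X, alpha / np.sqrt(lam), gamma

def project(model, x):
    """(w^l)^T phi(x) = sum_k alpha_k^l k(x_k, x) -- phi is never computed."""
    X, alpha, gamma = model
    k_x = np.exp(-gamma * ((X - x) ** 2).sum(-1))
    return k_x @ alpha

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))            # N = 30 points in p = 2 dimensions
model = kpca(X, d=3)
proj = project(model, X[0])             # 3 kernel principal components of x_1
```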
Feature Map Centering

So far we assumed that the feature map φ(x) is centered for the data points x_1, ..., x_N, i.e.,

\sum_{i=1}^{N} \phi(x_i) = 0

Actually, this centering can be done on the Gram matrix without ever explicitly computing the feature map φ(x):

\tilde{K} = (I - 11^T / N)\, K\, (I - 11^T / N)

is the kernel matrix for the centered features. A similar expression exists for projecting test features onto the feature eigenspace.

Schölkopf, Smola, Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Technical Report #44, Max Planck Institute, 1996.
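A small sketch of the centering formula (NumPy; the polynomial kernel here is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))
K = (X @ X.T + 1.0) ** 2                 # Gram matrix for an illustrative kernel

N = K.shape[0]
C = np.eye(N) - np.ones((N, N)) / N      # centering matrix I - 11^T / N
K_tilde = C @ K @ C                      # K~ = (I - 11^T/N) K (I - 11^T/N)

# Every row and column of K~ sums to zero, consistent with sum_i phi(x_i) = 0.
assert np.allclose(K_tilde.sum(axis=0), 0.0)
assert np.allclose(K_tilde.sum(axis=1), 0.0)
```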
KPCA: USPS Digit Recognition

Kernel function: k(x, y) = (x^T y)^d

Classifier: linear SVM with kernel principal components as features

N = 3000, p = 16-by-16 image

Linear PCA