
Lecture Slides for
Introduction to Machine Learning, 2nd Edition
ETHEM ALPAYDIN
The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Why Reduce Dimensionality?
Reduces time complexity: Less computation
Reduces space complexity: Fewer parameters
Saves the cost of observing the discarded features
Simpler models are more robust on small datasets
More interpretable; simpler explanation
Data visualization (structure, groups, outliers, etc.) if
plotted in 2 or 3 dimensions

Feature Selection vs Extraction
Feature selection: Choosing k < d important features,
ignoring the remaining d − k
Subset selection algorithms
Feature extraction: Project the original x_i, i = 1,...,d
dimensions to new k < d dimensions, z_j, j = 1,...,k
Principal components analysis (PCA), linear
discriminant analysis (LDA), factor analysis (FA)

Subset Selection
There are 2^d subsets of d features
Forward search: Add the best feature at each step
Set of features F initially empty
At each iteration, find the best new feature
j = argmin_i E(F ∪ x_i)
Add x_j to F if E(F ∪ x_j) < E(F)
Hill-climbing O(d²) algorithm
Backward search: Start with all features and remove
one at a time, if possible
Floating search (add k, remove l)
A minimal code sketch of forward search follows below.
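
A minimal sketch of greedy forward search, assuming a user-supplied `error(X_subset, y)` function (for example, a cross-validated error estimate of some classifier); the function names and signature are illustrative, not from the slides:

```python
import numpy as np

def forward_selection(X, y, error, max_features=None):
    """Greedy forward search: grow the feature set F one feature at a time
    while the error keeps decreasing. `error(X_subset, y)` is any
    user-supplied validation-error estimate (an assumption of this sketch)."""
    d = X.shape[1]
    F = []                                   # indices of selected features
    best_err = np.inf                        # E(F) for the current F
    limit = max_features if max_features is not None else d
    while len(F) < limit:
        candidates = [i for i in range(d) if i not in F]
        errs = [error(X[:, F + [i]], y) for i in candidates]
        best = int(np.argmin(errs))          # j = argmin_i E(F ∪ x_i)
        if errs[best] < best_err:            # add x_j only if E(F ∪ x_j) < E(F)
            F.append(candidates[best])
            best_err = errs[best]
        else:
            break                            # no remaining feature improves the error
    return F
```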

Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is
projected there, information loss is minimized.
The projection of x on the direction of w is: z = w^T x
Find w such that Var(z) is maximized:
Var(z) = Var(w^T x) = E[(w^T x − w^T μ)²]
 = E[(w^T x − w^T μ)(w^T x − w^T μ)]
 = E[w^T (x − μ)(x − μ)^T w]
 = w^T E[(x − μ)(x − μ)^T] w = w^T Σ w
where Var(x) = E[(x − μ)(x − μ)^T] = Σ

Maximize Var(z) subject to ||w||=1

max_w1  w1^T Σ w1 − α (w1^T w1 − 1)

Σ w1 = α w1   that is, w1 is an eigenvector of Σ
Choose the one with the largest eigenvalue for Var(z) to be max
Second principal component: Max Var(z2), s.t. ||w2|| = 1 and
orthogonal to w1
max_w2  w2^T Σ w2 − α (w2^T w2 − 1) − β (w2^T w1 − 0)

Σ w2 = α w2   that is, w2 is another eigenvector of Σ,
and so on.

What PCA does
z = W^T (x − m)
where the columns of W are the eigenvectors of Σ, and m
is the sample mean
Centers the data at the origin and rotates the axes
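
A minimal NumPy sketch of this projection, assuming the data sit in the rows of X; the function name and return values are illustrative, not the book's code. The eigenvalues are returned sorted, which the k-selection snippet further below reuses:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (N x d) onto the top-k principal components:
    z = W^T (x - m), with W built from the eigenvectors of the sample covariance."""
    m = X.mean(axis=0)                       # sample mean m
    Xc = X - m                               # center the data at the origin
    S = np.cov(Xc, rowvar=False)             # d x d sample covariance (estimate of Σ)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]        # eigenvalues in descending order
    W = eigvecs[:, order[:k]]                # d x k matrix of top-k eigenvectors
    Z = Xc @ W                               # N x k projected coordinates
    return Z, W, eigvals[order]
```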

How to choose k ?
Proportion of Variance (PoV) explained
PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λk + ... + λd)

when the λi are sorted in descending order
Typically, stop at PoV > 0.9
Scree graph plots PoV vs k; stop at the elbow
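
Continuing the PCA sketch above, a small helper (assumed, not from the slides) that picks k from the sorted eigenvalues by the PoV rule:

```python
import numpy as np

def choose_k(sorted_eigvals, threshold=0.9):
    """Smallest k whose proportion of variance explained (PoV) reaches `threshold`."""
    pov = np.cumsum(sorted_eigvals) / np.sum(sorted_eigvals)   # PoV for k = 1, ..., d
    return int(np.searchsorted(pov, threshold)) + 1
```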

Factor Analysis
Find a small number of factors z which, when combined,
generate x:
x_i − μ_i = v_i1 z_1 + v_i2 z_2 + ... + v_ik z_k + ε_i
where z_j, j = 1,...,k, are the latent factors with
E[z_j] = 0, Var(z_j) = 1, Cov(z_i, z_j) = 0, i ≠ j,
ε_i are the noise sources with
Var(ε_i) = ψ_i, Cov(ε_i, ε_j) = 0, i ≠ j, Cov(ε_i, z_j) = 0,
and v_ij are the factor loadings
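
To make the generative model concrete, a hedged NumPy sketch that samples data from x − μ = Vz + ε under the assumptions above; all sizes and values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 1000, 5, 2                      # samples, observed dims, latent factors (illustrative)
V = rng.normal(size=(d, k))               # factor loadings v_ij (fixed here for illustration)
psi = rng.uniform(0.1, 0.5, size=d)       # noise variances ψ_i
mu = rng.normal(size=d)                   # mean vector μ

Z = rng.normal(size=(N, k))               # latent factors: zero mean, unit variance, uncorrelated
eps = rng.normal(scale=np.sqrt(psi), size=(N, d))   # noise ε with Var(ε_i) = ψ_i, independent
X = mu + Z @ V.T + eps                    # generative model: x − μ = V z + ε

# The model implies Cov(x) = V V^T + diag(ψ); the sample estimate should be close for large N
print(np.abs(np.cov(X, rowvar=False) - (V @ V.T + np.diag(psi))).max())
```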

PCA vs FA
PCA: from x to z    z = W^T (x − μ)
FA:  from z to x    x = V z + ε

Factor Analysis
In FA, factors zj are stretched, rotated and translated to
generate x

Multidimensional Scaling
Given pairwise distances between N points,
d_ij, i, j = 1,...,N,
place them on a low-dimensional map such that the distances are preserved.
z = g(x | θ)   Find θ that minimizes the Sammon stress

E(θ | X) = Σ_{r,s} ( ||z^r − z^s|| − ||x^r − x^s|| )² / ||x^r − x^s||²
         = Σ_{r,s} ( ||g(x^r | θ) − g(x^s | θ)|| − ||x^r − x^s|| )² / ||x^r − x^s||²
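
A hedged sketch using scikit-learn's MDS on a precomputed distance matrix; note that this implementation minimizes a metric stress via SMACOF rather than exactly the Sammon stress above, and the data here are purely illustrative:

```python
import numpy as np
from sklearn.manifold import MDS

# Illustrative data; D is the N x N matrix of pairwise distances d_ij
X = np.random.default_rng(0).normal(size=(100, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
Z = mds.fit_transform(D)          # 2-D coordinates whose pairwise distances approximate D
```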

Map of Europe by MDS

Map from CIA The World Factbook: http://www.cia.gov/

Linear Discriminant Analysis
Find a low-dimensional
space such that when x is
projected, classes are
well-separated.
Find w that maximizes

J(w) = (m1 − m2)² / (s1² + s2²)

where, with r^t = 1 if x^t ∈ C1 and 0 otherwise,
m1 = Σ_t w^T x^t r^t / Σ_t r^t
s1² = Σ_t (w^T x^t − m1)² r^t
(m2 and s2² are defined analogously for C2)

Between-class scatter:
(m1 − m2)² = (w^T m_1 − w^T m_2)²
           = w^T (m_1 − m_2)(m_1 − m_2)^T w
           = w^T S_B w   where S_B = (m_1 − m_2)(m_1 − m_2)^T

Within-class scatter:
s1² = Σ_t (w^T x^t − m1)² r^t
    = Σ_t w^T (x^t − m_1)(x^t − m_1)^T w r^t = w^T S_1 w
where S_1 = Σ_t r^t (x^t − m_1)(x^t − m_1)^T
s1² + s2² = w^T S_W w   where S_W = S_1 + S_2

Fisher's Linear Discriminant
Find w that maximizes

J(w) = w^T S_B w / w^T S_W w = |w^T (m_1 − m_2)|² / w^T S_W w

LDA solution: w = c S_W^{-1} (m_1 − m_2)
Parametric solution:
w = Σ^{-1} (μ_1 − μ_2)
when p(x | C_i) ~ N(μ_i, Σ)
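
A small NumPy sketch of the two-class Fisher direction w ∝ S_W^{-1}(m_1 − m_2); the function name and the unit-norm scaling are choices of this sketch, not the slides':

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Two-class Fisher direction w proportional to S_W^{-1}(m_1 - m_2).
    X is N x d, y holds labels in {0, 1}; a minimal sketch."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)          # class means m_1, m_2
    S0 = (X0 - m0).T @ (X0 - m0)                        # within-class scatter of class 0
    S1 = (X1 - m1).T @ (X1 - m1)                        # within-class scatter of class 1
    Sw = S0 + S1                                        # total within-class scatter S_W
    w = np.linalg.solve(Sw, m0 - m1)                    # direction maximizing J(w)
    return w / np.linalg.norm(w)                        # overall scale (and sign) is arbitrary

# Projecting the data: z = X @ w gives the 1-D discriminant coordinates
```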

K>2 Classes
Within-class scatter:
S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)^T
S_W = Σ_{i=1}^{K} S_i

Between-class scatter:
S_B = Σ_{i=1}^{K} N_i (m_i − m)(m_i − m)^T   where m = (1/K) Σ_{i=1}^{K} m_i

Find W that maximizes
J(W) = |W^T S_B W| / |W^T S_W W|
The solution is the largest eigenvectors of S_W^{-1} S_B; S_B has maximum rank K − 1
(a sketch follows below)
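
An illustrative multi-class sketch that builds S_W and S_B and keeps the leading generalized eigenvectors; scipy's `eigh(Sb, Sw)` solves S_B w = λ S_W w, which assumes S_W is positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, k):
    """Project X (N x d) onto the top k discriminant directions (k <= K - 1).
    A sketch: solves the generalized eigenproblem S_B w = lambda S_W w."""
    classes = np.unique(y)
    d = X.shape[1]
    means = np.array([X[y == c].mean(axis=0) for c in classes])   # class means m_i
    m = means.mean(axis=0)                                         # m = (1/K) sum_i m_i
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c, m_i in zip(classes, means):
        Xc = X[y == c]
        Sw += (Xc - m_i).T @ (Xc - m_i)                 # within-class scatter S_i
        diff = (m_i - m)[:, None]
        Sb += len(Xc) * diff @ diff.T                   # N_i (m_i - m)(m_i - m)^T
    eigvals, eigvecs = eigh(Sb, Sw)                     # generalized eigenvectors of S_W^{-1} S_B
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]       # keep the k largest
    return X @ W
```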

Isomap
Geodesic distance is the distance along the manifold on which
the data lie, as opposed to the Euclidean distance in the
input space

Isomap
Instances r and s are connected in the graph if
||x^r − x^s|| < ε or if x^s is one of the k nearest neighbors of x^r
The edge length is ||x^r − x^s||
For two nodes r and s not connected directly, the distance is the
length of the shortest path between them
Once the N×N distance matrix is thus formed, use MDS to find
a lower-dimensional mapping (see the sketch below)
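
A hedged sketch using scikit-learn's Isomap, which builds the k-nearest-neighbor graph, computes graph shortest-path distances, and embeds them with an MDS-style step; the data below are purely illustrative:

```python
import numpy as np
from sklearn.manifold import Isomap

# Illustrative data: N points in a d-dimensional input space
X = np.random.default_rng(0).normal(size=(500, 10))

iso = Isomap(n_neighbors=10, n_components=2)   # k-NN graph, then geodesic distances
Z = iso.fit_transform(X)                       # 2-D embedding preserving geodesic distances
```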

Optdigits after Isomap (with neighborhood graph).
[2-D scatter of the Optdigits classes in the Isomap space; figure omitted]
Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html

Locally Linear Embedding
1. Given x^r, find its neighbors x_s^(r)
2. Find the reconstruction weights W_rs that minimize

E(W | X) = Σ_r || x^r − Σ_s W_rs x_s^(r) ||²

3. Find the new coordinates z^r that minimize

E(z | W) = Σ_r || z^r − Σ_s W_rs z_s^(r) ||²

(a sketch follows below)
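
A hedged sketch using scikit-learn's LocallyLinearEmbedding, which carries out both steps (neighbor reconstruction weights, then the embedding coordinates); the data are illustrative:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.default_rng(0).normal(size=(500, 10))    # illustrative data

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Z = lle.fit_transform(X)    # new coordinates z^r preserving the local weights W_rs
```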

LLE on Optdigits
[2-D scatter of the Optdigits classes in the LLE space; figure omitted]

Matlab source from http://www.cs.toronto.edu/~roweis/lle/code.html
