
10 Dimensionality Reduction with Principal Component Analysis

Data in real life is often high dimensional. For example, if we want to estimate the price of our house in a year's time, we can use data that helps us to do this: the type of house, the size, the number of bedrooms and bathrooms, the value of houses in the neighborhood when they were bought, the distance to the next train station and park, the number of crimes committed in the neighborhood, the economic climate, etc. There are many things that influence the house price, and we collect this information in a dataset that we can use to estimate the house price. Another example is a 640×480 pixels color image, which is a data point in a million-dimensional space, where every pixel corresponds to three dimensions, one for each color channel (red, green, blue).

Working directly with high-dimensional data comes with some difficulties: it is hard to analyze, interpretation is difficult, visualization is nearly impossible, and (from a practical point of view) storage can be expensive. However, high-dimensional data also has some nice properties: for example, high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions. Dimensionality reduction exploits structure and correlation and allows us to work with a more compact representation of the data, ideally without losing information. We can think of dimensionality reduction as a compression technique, similar to jpg or mp3, which are compression algorithms for images and music.

In this chapter, we will discuss principal component analysis (PCA), an algorithm for linear dimensionality reduction. PCA, proposed by Pearson (1901) and Hotelling (1933), has been around for more than 100 years and is still one of the most commonly used techniques for data compression, data visualization and the identification of simple patterns, latent factors and structures of high-dimensional data. In the signal processing community, PCA is also known as the Karhunen-Loève transform. In this chapter, we will explore the concept of linear dimensionality reduction with PCA in more detail, drawing on our understanding of basis and basis change (see Sections 2.6.1 and 2.7.2), projections (see Section 3.6), eigenvalues (see Section 4.2), Gaussian distributions (see Section 6.6) and constrained optimization (see Section 7.2).

Dimensionality reduction generally exploits the property of high-dimensional data (e.g., images) that it often lies on a low-dimensional subspace, and that many dimensions are highly correlated, redundant or contain little information. Figure 10.1 gives an illustrative example in two dimensions. Although the data in Figure 10.1(a) does not quite lie on a line, the data does not vary much in the x2-direction, so that we can express it as if it were on a line, with nearly no loss; see Figure 10.1(b). The data in Figure 10.1(b) requires only the x1-coordinate to describe it and lies in a one-dimensional subspace of R^2.

Draft chapter (July 10, 2018) from "Mathematics for Machine Learning", © 2018 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To be published by Cambridge University Press. Report errata and feedback to http://mml-book.com. Please do not post or distribute this file; please link to https://mml-book.com.

Figure 10.1: Illustration of dimensionality reduction. (a) The original dataset (with x1 and x2 coordinates) does not vary much along the x2 direction. (b) The data from (a) can be represented using the x1 coordinate alone, with nearly no loss; only the x1 coordinate is relevant in the compressed dataset.

Figure 10.2: Graphical illustration of PCA. In PCA, we find a compressed version x̃ of original data x that has an intrinsic lower-dimensional representation z.

10.1 Problem Setting

In PCA, we are interested in finding projections x̃n of data points xn that are as similar to the original data points as possible, but which have a significantly lower intrinsic dimensionality. Figure 10.1 gives an illustration of what this could look like.

Figure 10.2 illustrates the setting we consider in PCA, where z represents the intrinsic lower dimension of the compressed data x̃ and plays the role of a bottleneck, which controls how much information can flow between x and x̃.

More concretely, we consider i.i.d. data points x1, . . . , xN ∈ R^D. If our observed data lives in R^D, we look for an M-dimensional subspace U ⊆ R^D, dim(U) = M < D, onto which we project the data. We denote the projected data by x̃n ∈ U, and their coordinates (with respect to an appropriate basis in U) by zn. Our aim is to find the x̃n so that they are as similar to the original data xn as possible.

Figure 10.3: Examples of handwritten digits from the MNIST dataset.

Consider R^2 with the canonical basis e1 = [1, 0]^T, e2 = [0, 1]^T. From Chapter 2 we know that x ∈ R^2 can be represented as a linear combination of these basis vectors, e.g.,

[5, 3]^T = 5e1 + 3e2 .    (10.1)

However, when we consider the set of vectors

x̃ = [0, z]^T ∈ R^2 ,  z ∈ R ,    (10.2)

they can always be written as 0e1 + ze2. To represent these vectors it is sufficient to remember/store the coordinate/code z of the e2 vector.

More precisely, the set of x̃ vectors (with the standard vector addition and scalar multiplication) forms a vector subspace U (see Section 2.4) with dim(U) = 1 because U = span[e2]. (The dimension of a vector space corresponds to the number of its basis vectors; see Section 2.6.1.)

In PCA, we consider the relationship between the original data x and its low-dimensional code z to be linear, so that z = B^T x for a suitable matrix B.
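To make this linear code concrete, here is a minimal NumPy sketch (illustrative only; the basis B below is hand-picked, not computed by PCA):

```python
import numpy as np

# Hypothetical illustration: D = 3 dimensions, M = 2 dimensional code.
# B has orthonormal columns; here we simply take two canonical basis vectors.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])      # shape (D, M) = (3, 2)

x = np.array([5.0, 3.0, 0.0])   # a data point that happens to lie in span(b1, b2)

z = B.T @ x                     # low-dimensional code z = B^T x, shape (M,)
x_tilde = B @ z                 # reconstruction from the code, shape (D,)
```

Because this particular x lies in the subspace spanned by the columns of B, the reconstruction x_tilde recovers x exactly; for a general x, x_tilde is the orthogonal projection of x onto that subspace.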

Throughout this chapter, we will use the MNIST digits dataset (http://yann.lecun.com/exdb/mnist/) as a recurring example, which contains 60,000 examples of handwritten digits 0–9. Each digit is an image of size 28 × 28, i.e., it contains 784 pixels, so that we can interpret every image in this dataset as a vector x ∈ R^784. Examples of these digits are shown in Figure 10.3.

In the following, we will derive PCA from two different perspectives. First, we derive PCA by maintaining as much variance as possible in the projected space. Second, we will derive PCA by minimizing the average squared reconstruction error, which directly links to many concepts in Chapters 3 and 4.


10.2 Maximum Variance Perspective

Figure 10.1 gave an example of how a two-dimensional dataset can be represented using a single coordinate. In Figure 10.1(b), we chose to ignore the x2-coordinate of the data because it did not add too much information, so that the compressed data is similar to the original data in Figure 10.1(a). We could have chosen to ignore the x1-coordinate instead, but then the compressed data would have been very dissimilar from the original data, and much information in the data would have been lost.

If we interpret information content in the data as how "space filling" the dataset is, then we can describe the information contained in the data by looking at the spread of the data. From Chapter 6 we know that the variance is an indicator of the spread of the data, and it is possible to formulate PCA as a dimensionality reduction algorithm that maximizes the variance in the low-dimensional representation of the data to retain as much information as possible. Now, let us formulate this objective more concretely.

Consider a dataset x1, . . . , xN, xn ∈ R^D, with mean 0 that possesses the data covariance matrix (empirical covariance)

S = (1/N) Σ_{n=1}^N xn xn^T .    (10.3)

Furthermore, we consider the low-dimensional code zn = B^T xn ∈ R^M of xn, where B ∈ R^{D×M}. Our aim is to find a matrix B that retains as much information as possible when compressing data. We assume that the columns of B are orthonormal, i.e., bi^T bj = 0 for i ≠ j and bi^T bi = 1. (The columns b1, . . . , bM of B form a basis of the M-dimensional subspace in which the projected data x̃ = BB^T x ∈ R^D live.) Retaining most information is formulated as capturing the largest amount of variance in the low-dimensional code (Hotelling, 1933).

Remark (Centered Data). Let us assume that µ = Ex[x] is the (empirical) mean of the data. Using the properties of the variance, which we discussed in Section 6.4.4, we obtain

Vz[z] = Vx[B^T(x − µ)] = Vx[B^T x − B^T µ] = Vx[B^T x] ,    (10.4)

i.e., the variance of the low-dimensional code does not depend on the mean of the data. Therefore, we assume without loss of generality that the data has mean 0 for the remainder of this section. With this assumption the mean of the low-dimensional code is also 0 since Ez[z] = Ex[B^T x] = B^T Ex[x] = 0. ♦

We maximize the variance of the low-dimensional code using a sequential approach. We start by seeking a single vector b1 ∈ R^D that maximizes the variance of the projected data, i.e., we aim to maximize the first coordinate z1 of z ∈ R^M so that

V1 := V[z1] = (1/N) Σ_{n=1}^N z1n^2 ,    (10.5)

where we defined z1n as the first coordinate of the low-dimensional representation zn ∈ R^M of xn ∈ R^D. Note that the first component of zn is given by

z1n = b1^T xn ,    (10.6)

such that we obtain

V1 = (1/N) Σ_{n=1}^N (b1^T xn)^2 = (1/N) Σ_{n=1}^N b1^T xn xn^T b1    (10.7a)
   = b1^T ( (1/N) Σ_{n=1}^N xn xn^T ) b1 = b1^T S b1 .    (10.7b)

It is clear that arbitrarily increasing the magnitude of the vector b1 increases V1. Therefore, we restrict all solutions to ‖b1‖ = 1, which results in a constrained optimization problem in which we seek the direction along which the data varies most.

With the restriction of the solution space to unit vectors we end up with the constrained optimization problem

max_{b1} b1^T S b1    (10.8)
subject to ‖b1‖^2 = 1 .    (10.9)

The Lagrangian of this problem is

L(b1, λ1) = b1^T S b1 + λ1 (1 − b1^T b1) ,    (10.10)

and the partial derivatives of L with respect to b1 and λ1 are

∂L/∂b1 = 2 b1^T S − 2 λ1 b1^T ,    (10.11)
∂L/∂λ1 = 1 − b1^T b1 ,    (10.12)

respectively. Setting these partial derivatives to 0 gives us the relations

b1^T b1 = 1 ,    (10.13)
S b1 = λ1 b1 .    (10.14)


This eigenvector property allows us to rewrite our variance objective as

V1 = b1^T S b1 = λ1 b1^T b1 = λ1 ,    (10.15)

i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue that is associated with the basis vector b1 that spans this subspace. Therefore, to maximize the variance of the low-dimensional code we choose the basis vector belonging to the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component. We can determine the effect/contribution of the principal component b1 in the original data space by mapping the coordinate z1n back into data space, which gives us the projected data point

x̃n = b1 z1n = b1 b1^T xn ∈ R^D    (10.16)

in the original data space.
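The derivation above suggests a direct recipe: compute S, take the eigenvector belonging to its largest eigenvalue, and project. A minimal NumPy sketch on synthetic data (the dataset and variable names are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic centered 2-D data with most variance along the first axis.
X = rng.normal(size=(2, 200)) * np.array([[3.0], [0.5]])  # columns are data points
X = X - X.mean(axis=1, keepdims=True)                     # center the data

S = (X @ X.T) / X.shape[1]            # data covariance matrix, cf. (10.3)
eigvals, eigvecs = np.linalg.eigh(S)  # eigh: ascending eigenvalues for symmetric S
b1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
lambda1 = eigvals[-1]

z1 = b1 @ X                           # coordinates z_1n = b1^T x_n, cf. (10.6)
V1 = np.mean(z1 ** 2)                 # equals lambda1, cf. (10.15)

X_tilde = np.outer(b1, z1)            # projections x~_n = b1 b1^T x_n, cf. (10.16)
```

Since the synthetic data varies mostly along the first axis, b1 comes out (up to sign) close to the first canonical basis vector, and V1 matches lambda1.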

Remark. Although x̃n is a D-dimensional vector, it only requires a single coordinate z1n to represent it with respect to the basis vector b1 ∈ R^D. ♦

(The quantity √λ1 is also called the loading of the unit vector b1 and represents the standard deviation of the data accounted for by the principal subspace span[b1].)

Generally, the mth principal component can be found by subtracting the effect of the first m − 1 principal components from the data, thereby trying to find principal components that compress the remaining information. We achieve this by first subtracting the contribution of the m − 1 principal components from the data, similar to (10.16), so that we arrive at the new data matrix

X̂ := X − Σ_{i=1}^{m−1} bi bi^T X ,    (10.17)

where X = [x1, . . . , xN] ∈ R^{D×N} contains the data points as column vectors. The matrix X̂ in (10.17) contains only the information that has not yet been compressed.

Remark (Notation). Throughout this chapter, we do not follow the convention of collecting data x1, . . . , xN as rows of the data matrix, but we define them to be the columns of X. This means that our data matrix X is a D × N matrix instead of the conventional N × D matrix. The reason for our choice is that the algebra operations work out smoothly without the need to either transpose the matrix or to redefine vectors as row vectors that are left-multiplied onto matrices. ♦

To find the mth principal component, we maximize the variance

Vm = V[zm] = (1/N) Σ_{n=1}^N zmn^2 = (1/N) Σ_{n=1}^N (bm^T x̂n)^2 = bm^T Ŝ bm ,    (10.18)

subject to ‖bm‖^2 = 1, where we followed the same steps as in (10.7b) and defined Ŝ as the data covariance matrix of X̂. As previously, when we looked at the first principal component alone, we solve a constrained optimization problem and discover that the optimal solution bm is the eigenvector of Ŝ that belongs to the largest eigenvalue of Ŝ.

However, it also turns out that bm is an eigenvector of S. Since

Ŝ = (1/N) Σ_{n=1}^N x̂n x̂n^T = (1/N) Σ_{n=1}^N ( xn − Σ_{i=1}^{m−1} bi bi^T xn ) ( xn − Σ_{i=1}^{m−1} bi bi^T xn )^T    (10.19a)
  = (1/N) Σ_{n=1}^N ( xn xn^T − 2 xn xn^T Σ_{i=1}^{m−1} bi bi^T + Σ_{i=1}^{m−1} bi bi^T xn xn^T Σ_{i=1}^{m−1} bi bi^T ) ,    (10.19b)

right-multiplying Ŝ by bm and exploiting that bm is orthogonal to bi for i = 1, . . . , m − 1 (all terms involving sums up to m − 1 vanish) yields

Ŝ bm = (1/N) Σ_{n=1}^N x̂n x̂n^T bm = (1/N) Σ_{n=1}^N xn xn^T bm = S bm = λm bm .    (10.20)

In the last step, we exploited the fact that bm is an eigenvector of Ŝ. Therefore, bm is also an eigenvector of the original data covariance matrix S, and the corresponding eigenvalue λm is the mth largest eigenvalue of S. Moreover, the variance of the data projected onto the mth principal component is

Vm = bm^T S bm = λm bm^T bm = λm    (10.21)

since bm^T bm = 1. This means that the variance of the data, when projected onto an M-dimensional subspace, equals the sum of the eigenvalues that belong to the corresponding eigenvectors of the data covariance matrix.

In practice, we do not have to compute principal components sequentially, but we can compute all of them at the same time. If we are looking for a projection onto an M-dimensional subspace so that as much variance as possible is retained in the projection, then PCA tells us to choose the columns of B to be the eigenvectors that belong to the M largest eigenvalues of the data covariance matrix. The maximum amount of variance PCA can capture with the first M principal components is

V = Σ_{m=1}^M λm ,    (10.22)

where the λm are the M largest eigenvalues of the data covariance matrix S. Consequently, the variance lost by data compression via PCA is

J = Σ_{j=M+1}^D λj .    (10.23)
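We can check (10.22) and (10.23) numerically. In this hedged NumPy sketch (synthetic data and variable names are illustrative), the total variance of the projected codes matches the sum of the M largest eigenvalues, and the lost variance matches the sum of the remaining ones:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, M = 5, 1000, 2
A = rng.normal(size=(D, D))               # mixing matrix to correlate dimensions
X = A @ rng.normal(size=(D, N))           # columns are data points
X = X - X.mean(axis=1, keepdims=True)     # center the data

S = (X @ X.T) / N                         # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
B = eigvecs[:, -M:]                       # eigenvectors of the M largest eigenvalues

Z = B.T @ X                               # low-dimensional codes
V_retained = np.sum(np.var(Z, axis=1))    # total variance of the projected data
V_lost = np.trace(S) - V_retained         # variance not captured by the projection
```

Here V_retained equals the sum of the M largest eigenvalues of S, and V_lost equals the sum of the D − M smallest ones, mirroring (10.22) and (10.23).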

To find the M-dimensional subspace onto which an orthogonal projection maximizes the variance of the data, we thus need to compute the M eigenvectors that belong to the M largest eigenvalues of the data covariance matrix. In Section 10.4, we will return to this point and discuss how to efficiently compute these M eigenvectors.

10.3 Projection Perspective

Figure 10.4: Illustration of the projection approach to PCA. We aim to find a one-dimensional subspace (line) of R^2 so that the distance vector between projected (orange) and original (blue) data is as small as possible.

In the following, we will derive PCA as an algorithm for linear dimensionality reduction that minimizes the average projection error. We will draw heavily from Chapters 2 and 3. In the previous section, we derived PCA by maximizing the variance in the projected space to retain as much information as possible. In the following, we will look at the difference vectors between the original data xn and their reconstruction x̃n and minimize this distance so that xn and x̃n are as close as possible. Figure 10.4 illustrates this setting.

Assume an (ordered) orthonormal basis (ONB) B = (b1, . . . , bD) of R^D, i.e., bi^T bj = 1 if and only if i = j, and 0 otherwise. From Section 2.5 we know that every x ∈ R^D can be written as a linear combination of the basis vectors of R^D, i.e.,

x = Σ_{d=1}^D zd bd    (10.24)

for suitable coordinates zd ∈ R. We are interested in finding vectors x̃ that live in a lower-dimensional subspace U of R^D, so that x̃ is as similar to x as possible. As x̃ ∈ U ⊆ R^D, we can also express x̃ as a linear combination

Figure 10.5: Simplified projection setting. (a) A vector x ∈ R^2 (red cross) shall be projected onto a one-dimensional subspace U ⊆ R^2 spanned by b. (b) Differences x − x̃ for 50 candidates x̃ are shown by the red lines.

x̃ = Σ_{d=1}^D zd bd .    (10.25)

(For example, vectors x̃ ∈ U could be vectors on a plane in R^3. The dimensionality of the plane is 2, but the vectors still have three coordinates in R^3.)

Let us assume dim(U) = M, where M < D = dim(R^D). Then, we can find basis vectors b1, . . . , bD of R^D so that at least D − M of the coefficients zd are equal to 0, and we can rearrange the way we index the basis vectors bd such that the coefficients that are zero appear at the end. This allows us to express x̃ as

x̃ = Σ_{m=1}^M zm bm + Σ_{j=M+1}^D 0 bj = Σ_{m=1}^M zm bm = B z ∈ R^D ,    (10.26)

where we defined

B := [b1, . . . , bM] ∈ R^{D×M} ,    (10.27)
z := [z1, . . . , zM]^T ∈ R^M .    (10.28)

In the following, we are interested in finding the optimal coordinates z and basis vectors b1, . . . , bM such that x̃ is as similar to the original data point x as possible, i.e., we aim to minimize the (Euclidean) distance ‖x − x̃‖. Figure 10.5 illustrates this setting.

Without loss of generality, we assume that the dataset X = {x1, . . . , xN}, xn ∈ R^D, is centered at 0, i.e., E[X] = 0.

Remark. Without the zero-mean assumption, we would arrive at exactly the same solution, but the notation would be substantially more cluttered. ♦

We are interested in finding the best linear projection of X onto a lower-dimensional subspace U of R^D with dim(U) = M and orthonormal basis vectors b1, . . . , bM. We will call this subspace U the principal subspace, and (b1, . . . , bM) is an orthonormal basis of the principal subspace. The projections of the data points onto the principal subspace are

x̃n := Σ_{m=1}^M zmn bm = B zn ∈ R^D ,    (10.29)

where

zn := [z1n, . . . , zMn]^T ∈ R^M ,  n = 1, . . . , N ,    (10.30)

is the coordinate vector of x̃n with respect to the basis (b1, . . . , bM). More specifically, we are interested in having the x̃n as similar to xn as possible. There are many ways to measure similarity.

The similarity measure we use in the following is the squared Euclidean norm ‖x − x̃‖^2 between x and x̃. We therefore define our objective as minimizing the average squared Euclidean distance (reconstruction error) (Pearson, 1901)

J := (1/N) Σ_{n=1}^N ‖xn − x̃n‖^2 .    (10.31)

In order to find this optimal linear projection, we need to find the orthonormal basis of the principal subspace and the coordinates zn of the projections with respect to these basis vectors. All these parameters enter our objective (10.31) through x̃n.

In order to find the coordinates zn and the ONB of the principal subspace, we optimize J by computing the partial derivatives of J with respect to all parameters of interest (i.e., the coordinates and the basis vectors), setting them to 0, and solving for the parameters. We detail these steps next. We will first determine the optimal coordinates zin and then the basis vectors b1, . . . , bM of the principal subspace, i.e., the subspace in which x̃ lives.

Since the parameters we are interested in, i.e., the basis vectors bi and the coordinates zin of the projection with respect to the basis of the principal subspace, only enter the objective J through x̃n, we obtain

∂J/∂zin = (∂J/∂x̃n)(∂x̃n/∂zin) ,    (10.32)
∂J/∂bi = (∂J/∂x̃n)(∂x̃n/∂bi)    (10.33)

for i = 1, . . . , M and n = 1, . . . , N, where

∂J/∂x̃n = −(2/N)(xn − x̃n)^T ∈ R^{1×D} .    (10.34)

In the following, we determine the optimal coordinates zin first before finding the ONB of the principal subspace.


Coordinates

Let us start by finding the coordinates z1n, . . . , zMn of the projections x̃n for n = 1, . . . , N. We assume that (b1, . . . , bD) is an ordered ONB of R^D. From (10.32) we require the partial derivative

∂x̃n/∂zin = ∂/∂zin ( Σ_{m=1}^M zmn bm ) = bi    (10.35)

for i = 1, . . . , M, such that we obtain

∂J/∂zin = −(2/N)(xn − x̃n)^T bi = −(2/N)( xn − Σ_{m=1}^M zmn bm )^T bi    (10.36)
        = −(2/N)(xn^T bi − zin bi^T bi) = −(2/N)(xn^T bi − zin) ,    (10.37)

where we exploited that the bi form an ONB, i.e., bi^T bi = 1. Setting this partial derivative to 0 yields the optimal coordinates

zin = xn^T bi = bi^T xn    (10.38)

for i = 1, . . . , M and n = 1, . . . , N. This means the optimal coordinates zin of the projection x̃n are the coordinates of the orthogonal projection (see Section 3.6) of the original data point xn onto the one-dimensional subspace that is spanned by bi. Consequently:

- The optimal projection x̃n of xn is an orthogonal projection.
- The coordinates of x̃n with respect to the basis b1, . . . , bM are the coordinates of the orthogonal projection of xn onto the principal subspace.
- An orthogonal projection is the best linear mapping we can find given the objective (10.31).

Remark (Orthogonal Projections with Orthonormal Basis Vectors). Let us briefly recap orthogonal projections from Section 3.6. If (b1, . . . , bD) is an orthonormal basis of R^D, then

x̃ = bj (bj^T bj)^{−1} bj^T x = bj bj^T x ∈ R^D    (10.39)

is the orthogonal projection of x onto the subspace spanned by the jth basis vector (note that bj^T bj = 1), and zj = bj^T x is the coordinate of this projection with respect to the basis vector bj that spans that subspace, since zj bj = x̃. Figure 10.6 illustrates this setting.

Figure 10.6: Optimal projection of a vector x ∈ R^2 onto a one-dimensional subspace (continuation from Figure 10.5). (a) Distances ‖x − x̃‖ for some x̃ ∈ U. (b) The vector x̃ that minimizes the distance in panel (a) is its orthogonal projection onto U. The coordinate of the projection x̃ with respect to the basis vector b that spans U is the factor we need to scale b by in order to "reach" x̃.

More generally, if we aim to project onto an M-dimensional subspace of R^D, we obtain the orthogonal projection of x onto the M-dimensional subspace with orthonormal basis vectors b1, . . . , bM as

x̃ = B (B^T B)^{−1} B^T x = B B^T x ,    (10.40)

where B^T B = I since the columns of B are orthonormal, and the coordinates of this projection with respect to the ordered basis (b1, . . . , bM) are z := B^T x, as discussed in Section 3.6.

We can think of the coordinates as a representation of the projected vector in a new coordinate system defined by (b1, . . . , bM). Note that although x̃ ∈ R^D, we only need M coordinates z1, . . . , zM to represent this vector; the other D − M coordinates with respect to the basis vectors (bM+1, . . . , bD) are always 0. ♦
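A small NumPy sketch of (10.39)/(10.40) (the subspace here is a random, purely illustrative choice): for a matrix B with orthonormal columns, the general projection formula B(B^T B)^{-1} B^T collapses to BB^T, and the residual x − x̃ is orthogonal to the subspace:

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 4, 2
# Orthonormal basis of a random M-dimensional subspace, obtained via QR.
B, _ = np.linalg.qr(rng.normal(size=(D, M)))  # columns of B are orthonormal

x = rng.normal(size=D)
z = B.T @ x                   # coordinates of the projection, z = B^T x
x_tilde = B @ z               # orthogonal projection x~ = B B^T x, cf. (10.40)

# General projection formula; reduces to BB^T because B^T B = I.
general = B @ np.linalg.inv(B.T @ B) @ B.T @ x
```

The residual check B^T (x − x̃) ≈ 0 confirms that the displacement vector has no component inside the subspace.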

We already determined the optimal coordinates of the projected data for a given ONB (b1, . . . , bD) of R^D, only M of which were non-zero. What remains is to determine the basis vectors that span the principal subspace.

Before we get started, let us briefly introduce the concept of an orthogonal complement.

Remark (Orthogonal Complement). Consider a D-dimensional vector space V and an M-dimensional subspace U ⊆ V. Then its orthogonal complement U⊥ is a (D − M)-dimensional subspace of V and contains all vectors in V that are orthogonal to every vector in U. Furthermore, every vector x ∈ V can be (uniquely) decomposed into

x = Σ_{m=1}^M λm bm + Σ_{j=1}^{D−M} ψj bj⊥ ,  λm, ψj ∈ R ,    (10.41)

where (b1, . . . , bM) is a basis of U and (b1⊥, . . . , b⊥D−M) is a basis of U⊥. ♦

To determine the basis vectors b1, . . . , bM of the principal subspace, we rephrase the loss function (10.31) using the results we have so far. This will make it easier to find the basis vectors. To reformulate the loss function, we exploit our results from before and obtain

x̃n = Σ_{m=1}^M zmn bm = Σ_{m=1}^M (xn^T bm) bm ,    (10.42)

where we used (10.38). Since (xn^T bm) bm = bm (bm^T xn), we can equivalently write

x̃n = ( Σ_{m=1}^M bm bm^T ) xn .    (10.43)

m=1

Since we can generally write the original data point xn as a linear combination of all basis vectors, we can also write

xn = Σ_{d=1}^D zdn bd = Σ_{d=1}^D (xn^T bd) bd = ( Σ_{d=1}^D bd bd^T ) xn    (10.44a)
   = ( Σ_{m=1}^M bm bm^T ) xn + ( Σ_{j=M+1}^D bj bj^T ) xn ,    (10.44b)

where we used (10.38) and split the sum with D terms into a sum over M terms and a sum over D − M terms. With this result, we find that the displacement vector xn − x̃n, i.e., the difference vector between the original data point and its projection, is

xn − x̃n = ( Σ_{j=M+1}^D bj bj^T ) xn    (10.45)
         = Σ_{j=M+1}^D (xn^T bj) bj .    (10.46)

This means the difference is exactly the projection of the data point onto the orthogonal complement of the principal subspace: we identify the matrix Σ_{j=M+1}^D bj bj^T in (10.45) as the projection matrix that performs this projection. This also means the displacement vector xn − x̃n lies in the subspace that is orthogonal to the principal subspace, as illustrated in Figure 10.7.

Figure 10.7: Orthogonal projection and displacement vectors. When projecting data points xn (blue) onto subspace U we obtain x̃n (orange). The displacement vector x̃n − xn lies completely in the orthogonal complement U⊥ of U.

Remark (Low-Rank Approximation). In (10.45), we saw that the projection matrix that projects x onto x̃ is given by

Σ_{m=1}^M bm bm^T = B B^T .    (10.47)

Since it is constructed as a sum of M rank-one matrices bm bm^T, we see that BB^T has rank M. The total squared reconstruction error can be written as

Σ_{n=1}^N ‖xn − x̃n‖^2 = Σ_{n=1}^N ‖xn − B B^T xn‖^2 = Σ_{n=1}^N ‖(I − B B^T) xn‖^2 .    (10.48)

This means that finding a projection such that the squared distance between the original data xn and their projections x̃n, n = 1, . . . , N, is minimized is equivalent to finding the best rank-M approximation BB^T of the identity matrix I; see Section 4.6. ♦

Now, we have all the tools to reformulate the loss function (10.31):

J = (1/N) Σ_{n=1}^N ‖xn − x̃n‖^2 = (1/N) Σ_{n=1}^N ‖ Σ_{j=M+1}^D (bj^T xn) bj ‖^2 ,    (10.49)

where we used (10.46). We now explicitly compute the squared norm and exploit the fact that the bj form an ONB, which yields

J = (1/N) Σ_{n=1}^N Σ_{j=M+1}^D (bj^T xn)^2 = (1/N) Σ_{n=1}^N Σ_{j=M+1}^D bj^T xn bj^T xn    (10.50a)
  = (1/N) Σ_{n=1}^N Σ_{j=M+1}^D bj^T xn xn^T bj ,    (10.50b)

where we exploited the symmetry of the dot product in the last step to write bj^T xn = xn^T bj. We can now swap the sums and obtain

J = Σ_{j=M+1}^D bj^T ( (1/N) Σ_{n=1}^N xn xn^T ) bj = Σ_{j=M+1}^D bj^T S bj    (10.51a)
  = Σ_{j=M+1}^D tr(bj^T S bj) = Σ_{j=M+1}^D tr(S bj bj^T) = tr( ( Σ_{j=M+1}^D bj bj^T ) S ) ,    (10.51b)

where Σ_{j=M+1}^D bj bj^T is a projection matrix.

where we exploited the property that the trace operator tr(·), see (4.16), is linear and invariant to cyclic permutations of its arguments. Since we assumed that our dataset is centered, i.e., E[X] = 0, we identify S as the data covariance matrix. We see that the projection matrix in (10.51b) is constructed as a sum of rank-one matrices b_j b_j^\top so that it itself is of rank D − M.

Equation (10.51a) implies that we can formulate the average squared reconstruction error equivalently via the data covariance matrix, projected onto the orthogonal complement of the principal subspace. Minimizing the average squared reconstruction error is therefore equivalent to minimizing the variance of the data when projected onto the subspace we ignore, i.e., the orthogonal complement of the principal subspace. Equivalently, we maximize the variance of the projection that we retain in the principal subspace, which links the projection loss immediately to the maximum-variance formulation of PCA discussed in Section 10.2. But this then also means that we will obtain the same solution that we obtained for the maximum-variance perspective. Therefore, we skip the slightly lengthy derivation here and summarize the results from earlier in the light of the projection perspective.

The average squared reconstruction error, when projecting onto the M-dimensional principal subspace, is

J = \sum_{j=M+1}^{D} \lambda_j ,   (10.52)

where λ_j are the eigenvalues of the data covariance matrix. Therefore, to minimize (10.52) we need to select the smallest D − M eigenvalues, which then implies that their corresponding eigenvectors are the basis of the orthogonal complement of the principal subspace. Consequently, this means that the basis of the principal subspace comprises the eigenvectors b_1, ..., b_M that belong to the largest M eigenvalues of the data covariance matrix.

Draft (2018-07-10) from Mathematics for Machine Learning. Errata and feedback to https://mml-book.com.


Figure 10.8: Embedding of MNIST digits '0' (blue) and '1' (orange) in a two-dimensional principal subspace using PCA. Four example embeddings of the digits '0' and '1' in the principal subspace are highlighted in red with their corresponding original digit.

Figure 10.8 visualizes the training data of the MNIST digits '0' and '1' embedded in the vector subspace spanned by the first two principal components. We can see a relatively clear separation between '0's (blue dots) and '1's (orange dots), and we can see the variation within each individual cluster.

10.4 Eigenvector Computation

In the previous sections, we obtained the basis of the principal subspace as the eigenvectors that belong to the largest eigenvalues of the data covariance matrix

S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top ,   (10.53)

X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N} .   (10.54)

To get the eigenvalues (and the corresponding eigenvectors) of S, we can follow two approaches:

• We perform an eigendecomposition (see Section 4.2) and compute the eigenvalues and eigenvectors of S directly.
• We use a singular value decomposition (see Section 4.5). Since S is symmetric and factorizes into X X^\top (ignoring the factor 1/N), the eigenvalues of S are the squared singular values of X. More specifically, if


X = U \Sigma V^\top ,   (10.55)

where U ∈ R^{D×D} and V ∈ R^{N×N} are orthogonal matrices and Σ ∈ R^{D×N} is a matrix whose only non-zero entries are the singular values σ_ii ≥ 0, then it follows that

S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma V^\top V \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top .   (10.56)

With the results from Section 4.5 we get that the columns of U are the eigenvectors of X X^\top (and therefore S). Furthermore, the eigenvalues of S are related to the singular values of X via

\lambda_i = \frac{\sigma_i^2}{N} .   (10.57)
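As a quick numerical check of the relationship λ_i = σ_i²/N, the two routes can be compared in a short sketch (using NumPy with made-up random data; the variable names are ours, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(D, N))            # columns are data points
X = X - X.mean(axis=1, keepdims=True)  # center the data

S = X @ X.T / N                        # data covariance matrix (10.53)

# Route 1: eigendecomposition of S (eigh returns eigenvalues in ascending order)
evals, evecs = np.linalg.eigh(S)

# Route 2: SVD of X; eigenvalues of S are squared singular values divided by N (10.57)
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
evals_svd = sigma**2 / N               # descending order

# Both routes agree on the spectrum of S
assert np.allclose(np.sort(evals)[::-1], evals_svd)
```

Note that `np.linalg.eigh` exploits the symmetry of S, whereas the SVD route never forms S explicitly, which can be numerically preferable.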

Eigenvalue and eigenvector computations are also important in other fundamental machine learning methods that require matrix decompositions. In theory, as we discussed in Section 4.2, we can solve for the eigenvalues as roots of the characteristic polynomial. However, for matrices larger than 4 × 4 we would need to find the roots of a polynomial of degree 5 or higher, and the Abel-Ruffini theorem (Ruffini, 1799; Abel, 1826) states that there exists no algebraic solution to this problem for polynomials of degree 5 or more. Therefore, in practice, we solve for eigenvalues or singular values using iterative methods (e.g., np.linalg.eigh or np.linalg.svd), which are implemented in all modern packages for linear algebra.

In many applications (such as PCA presented in this chapter), we only require a few eigenvectors. It would be wasteful to compute the full decomposition and then discard all eigenvectors with eigenvalues beyond the first few. It turns out that if we are interested in only the first few eigenvectors (with the largest eigenvalues), iterative processes, which directly optimize these eigenvectors, are computationally more efficient than a full eigendecomposition (or SVD). In the extreme case of only needing the first eigenvector, a simple method called the power iteration is very efficient. Power iteration chooses a random vector x_0 and follows the iteration

x_{k+1} = \frac{S x_k}{\|S x_k\|} , \quad k = 0, 1, \ldots .   (10.58)

This means the vector x_k is multiplied by S in every iteration and then normalized, i.e., we always have ||x_k|| = 1. This sequence of vectors converges to the eigenvector associated with the largest eigenvalue of S. The original Google PageRank algorithm (Page et al., 1999) uses such an algorithm for ranking web pages based on their hyperlinks.
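The iteration (10.58) can be sketched in a few lines of NumPy (the test matrix and iteration count below are made up for illustration):

```python
import numpy as np

def power_iteration(S, num_iters=1000, seed=0):
    """Approximate the eigenvector of S belonging to the largest eigenvalue."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=S.shape[0])    # random starting vector x_0
    for _ in range(num_iters):
        x = S @ x
        x = x / np.linalg.norm(x)      # keep ||x_k|| = 1, cf. (10.58)
    return x

# Sanity check against a full eigendecomposition on a random covariance matrix
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
S = A @ A.T                            # symmetric positive semi-definite
b1 = power_iteration(S)
evals, evecs = np.linalg.eigh(S)
top = evecs[:, -1]                     # eigenvector of the largest eigenvalue
assert abs(abs(b1 @ top) - 1.0) < 1e-2   # equal up to sign
```

The convergence speed depends on the ratio of the two largest eigenvalues; for nearly degenerate spectra the iteration can be slow.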


10.5 PCA Algorithm

In the following, we will go through the individual steps of PCA using a running example, which is summarized in Figure 10.9. We are given a two-dimensional data set (Figure 10.9(a)), and we want to use PCA to project it onto a one-dimensional subspace.

1. Mean subtraction We start by centering the data: we compute the mean µ of the dataset and subtract it from every single data point. This ensures that the data set has mean 0 (Figure 10.9(b)). Mean subtraction is not strictly necessary but reduces the risk of numerical problems.
2. Standardization Divide the data points by the standard deviation σ_d of the dataset for every dimension d = 1, ..., D. Now the data is unit free, and it has variance 1 along each axis, which is indicated by the two arrows in Figure 10.9(c). This step completes the standardization of the data.

3. Eigendecomposition of the covariance matrix Compute the data covariance matrix and its eigenvalues and corresponding eigenvectors. In Figure 10.9(d), the eigenvectors are scaled by the magnitude of the corresponding eigenvalue. The longer vector spans the principal subspace, which we denote by U. The data covariance matrix is represented by the ellipse.

4. Projection We can project any data point x_* ∈ R^D onto the principal subspace: To get this right, we need to standardize x_* using the mean and standard deviation of the data set that we used to compute the data covariance matrix, i.e.,

x_*^{(d)} \leftarrow \frac{x_*^{(d)} - \mu^{(d)}}{\sigma_d} , \quad d = 1, \ldots, D ,   (10.59)

where x^{(d)} is the dth component of x. Then, we obtain the projected data point as

\tilde{x}_* = B B^\top x_*   (10.60)

with coordinates z_* = B^\top x_* with respect to the basis of the principal subspace. Here, B is the matrix that contains the eigenvectors that belong to the largest eigenvalues of the data covariance matrix as columns.

5. Moving back to data space To see our projection in the original data format (i.e., before standardization), we need to undo the standardization (10.59): we multiply by the standard deviation before adding the mean so that we obtain

\tilde{x}_*^{(d)} \leftarrow \tilde{x}_*^{(d)} \sigma_d + \mu^{(d)} , \quad d = 1, \ldots, D ,   (10.61)

where µ^{(d)} and σ_d are the mean and standard deviation of the training


Figure 10.9: Steps of PCA. (a) Original dataset. (b) Step 1: Centering by subtracting the mean from each data point. (c) Step 2: Dividing by the standard deviation to make the data unit free; the data has variance 1 along each axis. (d) Step 3: Computing eigenvalues and eigenvectors (arrows) of the data covariance matrix (ellipse). (e) Step 4: Projecting the data onto the subspace spanned by the eigenvectors belonging to the largest eigenvalues (principal subspace). (f) Step 5: Undoing the standardization and moving the projected data back into the original data space from (a).

data in the dth dimension, respectively. Figure 10.9(f) illustrates the projection in the original data format.
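The five steps above can be put together in a short NumPy sketch (made-up data; as a sanity check, projecting onto all D components, i.e., M = D, must reproduce the data exactly):

```python
import numpy as np

def pca_reconstruct(X, M):
    """Run the five PCA steps on X of shape (D, N), columns are data points."""
    # Step 1: centering
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # Step 2: standardization (divide each dimension by its standard deviation)
    sigma = Xc.std(axis=1, keepdims=True)
    Xs = Xc / sigma
    # Step 3: eigendecomposition of the data covariance matrix
    S = Xs @ Xs.T / X.shape[1]
    evals, evecs = np.linalg.eigh(S)     # ascending eigenvalues
    B = evecs[:, ::-1][:, :M]            # eigenvectors of the M largest eigenvalues
    # Step 4: projection onto the principal subspace, x_tilde = B B^T x (10.60)
    Xt = B @ (B.T @ Xs)
    # Step 5: undo the standardization (10.61)
    return Xt * sigma + mu

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 50)) * np.array([[3.0], [0.5]])
X_full = pca_reconstruct(X, M=2)   # M = D: the projection is lossless
assert np.allclose(X_full, X)
```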

In the following, we will apply PCA to the MNIST digits dataset (http://yann.lecun.com/exdb/mnist/), which contains 60,000 examples of handwritten digits 0–9. Each digit is an image of size 28 × 28, i.e., it contains 784 pixels, so that we can interpret every image in this dataset as a vector x ∈ R^784. Examples of these digits are shown in Figure 10.3. For illustration purposes, we apply PCA to a subset of the MNIST digits, and we focus on the digit '8'. We used 5,389 training images of the digit '8' and determined the principal subspace as detailed in this chapter. We then used the learned projection matrix to reconstruct a set of test images, which is illustrated in Figure 10.10. The first row of Figure 10.10 shows a set of four original digits from the test set. The following rows show reconstructions of exactly these digits when using a principal subspace of dimensions 1, 10, 100, 500, respectively. We can see that even with a single-dimensional principal subspace we get a half-way decent reconstruction, although it is blurry and generic. With an increasing number of principal components (PCs), the reconstructions become sharper and more details can be accounted for. With 500 principal components, we effectively obtain a near-perfect reconstruction. If we were to choose 784 PCs, we would recover the exact digit without any compression loss.

Figure 10.10: Effect of an increasing number of principal components on reconstruction. The first row shows the original digits; the following rows show reconstructions using 1, 10, 100, and 500 PCs.

Figure 10.11: Average reconstruction error as a function of the number of principal components.

The average squared reconstruction error, when using M principal components, is

\frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{j=M+1}^{D} \lambda_j .   (10.62)

In Figure 10.11, we see that

the importance of the principal components drops off rapidly, and only

marginal gains can be achieved by adding more PCs. With about 550 PCs,


we can essentially fully reconstruct the training data that contains the

digit ‘8’.

10.6 PCA in High Dimensions

In order to do PCA, we need to compute the data covariance matrix. In D dimensions, the data covariance matrix is a D × D matrix. Computing the eigenvalues and eigenvectors of this matrix is computationally expensive as it scales cubically in D. Therefore, PCA, as we discussed earlier, will be infeasible in very high dimensions. For example, if our x_n are images with 10,000 pixels (e.g., 100 × 100 pixel images), we would need to compute the eigendecomposition of a 10,000 × 10,000 covariance matrix. In the following, we provide a solution to this problem for the case that we have substantially fewer data points than dimensions, i.e., N ≪ D.

Assume we have a data set x_1, ..., x_N, x_n ∈ R^D. Assuming the data is centered, the data covariance matrix is given as

S = \frac{1}{N} X X^\top \in \mathbb{R}^{D \times D} ,   (10.63)

5510 We now assume that N D, i.e., the number of data points is smaller

5511 than the dimensionality of the data. Then the rank of the covariance ma-

5512 trix S is N , and it has D − N + 1 many eigenvalues that are 0. Intuitively,

5513 this means that there are some redundancies.

5514 In the following, we will exploit this and turn the D × D covariance

5515 matrix into an N × N covariance matrix whose eigenvalues are all greater

5516 than 0.

In PCA, we ended up with the eigenvector equation

S b_m = \lambda_m b_m , \quad m = 1, \ldots, M ,   (10.64)

where b_m is a basis vector of the principal subspace. Let us rewrite this equation a bit: With S defined in (10.63), we obtain

S b_m = \frac{1}{N} X X^\top b_m = \lambda_m b_m .   (10.65)

We now multiply X^\top ∈ R^{N×D} from the left-hand side, which yields

\frac{1}{N} \underbrace{X^\top X}_{N \times N} \underbrace{X^\top b_m}_{=:\, c_m} = \lambda_m X^\top b_m \iff \frac{1}{N} X^\top X c_m = \lambda_m c_m ,   (10.66)

a new eigenvector/eigenvalue equation: λ_m remains an eigenvalue, but the eigenvector is now c_m := X^\top b_m of the matrix (1/N) X^\top X ∈ R^{N×N}. Assuming we have no duplicate data points, this matrix has rank N


and is invertible. This also implies that (1/N) X^\top X has the same (non-zero) eigenvalues as the data covariance matrix S. But this is now an N × N matrix, so that we can compute the eigenvalues and eigenvectors much more efficiently than for the original D × D data covariance matrix.

Now that we have the eigenvectors of (1/N) X^\top X, we are going to recover the original eigenvectors, which we still need for PCA. If we left-multiply our eigenvalue/eigenvector equation with X, we get

\underbrace{\frac{1}{N} X X^\top}_{S} X c_m = \lambda_m X c_m   (10.67)

and we recover the data covariance matrix again. This now also means that we recover X c_m as an eigenvector of S.

Remark. If we want to apply the PCA algorithm that we discussed in Section 10.5, we need to normalize the eigenvectors X c_m of S so that they have norm 1. ♦
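This trick can be sketched in NumPy (with made-up random data where N ≪ D); the recovered, normalized vector X c_m is indeed an eigenvector of S:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 20, 500                      # far fewer data points than dimensions
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

# Eigendecomposition of the small N x N matrix (1/N) X^T X, cf. (10.66)
evals, C = np.linalg.eigh(X.T @ X / N)
lam, c = evals[-1], C[:, -1]        # largest eigenvalue and its eigenvector c_m

# Recover the corresponding eigenvector of S = (1/N) X X^T and normalize it
b = X @ c
b = b / np.linalg.norm(b)

S = X @ X.T / N                     # only formed here to verify the claim
assert np.allclose(S @ b, lam * b)
```

Note that only the N × N matrix is decomposed; the D × D covariance matrix is formed above purely to verify the result.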

10.7 Latent Variable Perspective

In the previous sections, we derived PCA without any notion of a probabilistic model, using the maximum-variance and the projection perspectives. On the one hand, this approach may be appealing as it allows us to sidestep all the mathematical difficulties that come with probability theory; on the other hand, a probabilistic model would offer us more flexibility and useful insights. More specifically, a probabilistic model would

• come with a likelihood function, and we can explicitly deal with noisy observations (which we did not even discuss earlier),
• allow us to do Bayesian model comparison via the marginal likelihood as discussed in Section 8.4,
• view PCA as a generative model, which allows us to simulate new data,
• allow us to make straightforward connections to related algorithms,
• deal with data dimensions that are missing at random by applying Bayes' theorem,
• give us a notion of the novelty of a new data point,
• allow us to extend the model fairly straightforwardly, e.g., to a mixture of PCA models,
• have the PCA we derived in earlier sections as a special case,
• allow for a fully Bayesian treatment by marginalizing out the model parameters.

By introducing a continuous-valued latent variable z ∈ R^M, it is possible to phrase PCA as a probabilistic latent-variable model. Tipping and Bishop (1999) proposed this latent-variable model as Probabilistic PCA (PPCA). PPCA addresses most of the issues above, and the PCA solution that we obtained by maximizing the variance in the projected space or by minimizing the reconstruction error is obtained as the special case of maximum likelihood estimation in a noise-free setting.

Figure 10.12: Graphical model for probabilistic PCA. The observations x_n explicitly depend on corresponding latent variables z_n ∼ N(0, I). The model parameters B, µ and the likelihood parameter σ are shared across the dataset.

In PPCA, we explicitly write down the probabilistic model for linear dimensionality reduction. For this we assume a continuous latent variable z ∈ R^M with a standard-normal prior p(z) = N(0, I) and a linear relationship between the latent variables and the observed data x, where

x = B z + \mu + \epsilon \in \mathbb{R}^D ,   (10.68)

where ε ∼ N(0, σ²I) is Gaussian observation noise, and B ∈ R^{D×M} and µ ∈ R^D describe the linear/affine mapping from latent to observed variables. Therefore, PPCA links latent and observed variables via

p(x \,|\, z, B, \mu, \sigma^2) = \mathcal{N}(x \,|\, B z + \mu, \, \sigma^2 I) .   (10.69)

Overall, PPCA induces the following generative process:

z_n \sim \mathcal{N}(z \,|\, 0, I) ,   (10.70)

x_n \,|\, z_n \sim \mathcal{N}(x \,|\, B z_n + \mu, \, \sigma^2 I) .   (10.71)

To generate a data point that is typical given the model parameters, we follow an ancestral sampling scheme: We first sample a latent variable z_n from p(z). Then, we use z_n in (10.69) to sample a data point conditioned on the sampled z_n, i.e., x_n ∼ p(x | z_n, B, µ, σ²).
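Ancestral sampling takes only a few lines of NumPy (the parameters B, µ, σ below are made up for illustration); with many samples, the empirical covariance of the generated data approaches BB^\top + σ²I:

```python
import numpy as np

def sample_ppca(B, mu, sigma, N, seed=0):
    """Ancestral sampling: z_n ~ N(0, I), then x_n ~ N(B z_n + mu, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    D, M = B.shape
    Z = rng.normal(size=(N, M))               # sample latent variables from p(z)
    eps = sigma * rng.normal(size=(N, D))     # Gaussian observation noise
    return Z @ B.T + mu + eps                 # data matrix of shape (N, D)

B = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])   # made-up D=3, M=2 parameters
X = sample_ppca(B, mu=np.zeros(3), sigma=0.1, N=100_000)

# Empirical covariance should approximate B B^T + sigma^2 I
emp = np.cov(X.T)
assert np.max(np.abs(emp - (B @ B.T + 0.01 * np.eye(3)))) < 0.15
```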

This generative process allows us to write down the probabilistic model (i.e., the joint distribution of all random variables) as

p(x, z \,|\, B, \mu, \sigma^2) = p(x \,|\, z, B, \mu, \sigma^2) \, p(z) ,   (10.72)

which immediately gives rise to the graphical model in Figure 10.12 using the results from Section 8.5.

Remark. Note the direction of the arrow that connects the latent variables z and the observed data x: The arrow points from z to x, which means that the PPCA model assumes a lower-dimensional latent cause z for the high-dimensional observations x. In the end, we are obviously interested in finding something out about z given some observations. To get there we will apply Bayesian inference to "invert" the arrow implicitly and go from observations to latent variables. ♦

Figure 10.13: Generating new MNIST digits. The latent variables z can be used to generate new data x̃ = Bz. The closer we stay to the training data, the more realistic the generated data.

Figure 10.13 shows the latent coordinates of the MNIST digits '8' found by PCA when using a two-dimensional principal subspace (blue dots). We can query any vector z_* in this latent space and generate an image x̃_* = B z_* that resembles the digit '8'. We show eight such generated images with their corresponding latent space representations. Depending on where we query the latent space, the generated images look different (shape, rotation, size, ...). If we query away from the training data, we see more and more artefacts, e.g., the top-left and top-right digits. Note that the intrinsic dimensionality of these generated images is only two.

Using the results from Chapter 6, we obtain the marginal distribution of the data x by integrating out the latent variable z so that

p(x \,|\, B, \mu, \sigma^2) = \int p(x \,|\, z, B, \mu, \sigma^2) \, p(z) \, \mathrm{d}z = \int \mathcal{N}(x \,|\, B z + \mu, \, \sigma^2 I) \, \mathcal{N}(z \,|\, 0, I) \, \mathrm{d}z .   (10.73)


From Section 6.6, we know that the solution to this integral is a Gaussian distribution with mean

\mathbb{E}[x] = \mathbb{E}_z[B z + \mu] + \mathbb{E}_\epsilon[\epsilon] = \mu   (10.74)

and with covariance matrix

\mathbb{V}[x] = \mathbb{V}_z[B z + \mu] + \mathbb{V}_\epsilon[\epsilon] = B \, \mathbb{V}_z[z] \, B^\top + \sigma^2 I = B B^\top + \sigma^2 I .   (10.75)

The marginal distribution in (10.73) is the PPCA likelihood, which we can use for maximum likelihood or MAP estimation of the model parameters.

Remark. Although the conditional distribution in (10.69) is also a likelihood, we cannot use it for maximum likelihood estimation as it still depends on the latent variables. The likelihood function we require for maximum likelihood (or MAP) estimation should only be a function of the data x and the model parameters, but not of the latent variables. ♦

From Section 6.6 we also know that a Gaussian random variable z and a linear/affine transformation of it are jointly Gaussian distributed. We already know the marginals p(z) = N(z | 0, I) and p(x) = N(x | µ, BB^\top + σ²I). The missing cross-covariance is given as

\mathrm{Cov}[x, z] = \mathrm{Cov}_z[B z + \mu, z] = B \, \mathrm{Cov}_z[z, z] = B .   (10.76)

Therefore, the probabilistic model of PPCA, i.e., the joint distribution of latent and observed random variables, is explicitly given by

p(x, z \,|\, B, \mu, \sigma^2) = \mathcal{N}\left( \begin{bmatrix} x \\ z \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ 0 \end{bmatrix}, \begin{bmatrix} B B^\top + \sigma^2 I & B \\ B^\top & I \end{bmatrix} \right) ,   (10.77)

with a mean vector of length D + M and a covariance matrix of size (D + M) × (D + M).

The joint Gaussian distribution p(x, z | B, µ, σ²) in (10.77) allows us to determine the posterior distribution p(z | x) immediately by applying the rules of Gaussian conditioning from Section 6.6.1. The posterior distribution of the latent variable given an observation x is then

p(z \,|\, x) = \mathcal{N}(z \,|\, m, C) ,   (10.78a)

m = B^\top (B B^\top + \sigma^2 I)^{-1} (x - \mu) ,   (10.78b)

C = I - B^\top (B B^\top + \sigma^2 I)^{-1} B .   (10.78c)

The posterior distribution in (10.78) reveals a few things. First, the posterior mean m is effectively an orthogonal projection of the mean-centered data x − µ onto the vector subspace spanned by the columns of B. In the noise-free limit σ → 0, we recover the projection equation (3.55) from Section 3.6. Second, the posterior covariance matrix C does not directly depend on the observed data x.
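Equations (10.78b) and (10.78c) translate directly into NumPy; the sketch below (with made-up parameters B, µ, σ) also illustrates that the posterior covariance C does not depend on the observation x:

```python
import numpy as np

def ppca_posterior(x, B, mu, sigma):
    """Posterior p(z | x) = N(z | m, C), following (10.78)."""
    D, M = B.shape
    K_inv = np.linalg.inv(B @ B.T + sigma**2 * np.eye(D))  # (B B^T + sigma^2 I)^{-1}
    m = B.T @ K_inv @ (x - mu)                             # posterior mean (10.78b)
    C = np.eye(M) - B.T @ K_inv @ B                        # posterior covariance (10.78c)
    return m, C

B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])        # made-up D=3, M=2 example
m, C = ppca_posterior(x=np.array([1.0, 2.0, 0.0]),
                      B=B, mu=np.zeros(3), sigma=0.5)
_, C2 = ppca_posterior(x=np.array([-3.0, 0.5, 4.0]),
                       B=B, mu=np.zeros(3), sigma=0.5)
assert np.allclose(C, C2)   # same covariance for two different observations
```

For large D, one would in practice avoid the explicit D × D inverse (e.g., via the matrix inversion lemma), but the direct form keeps the sketch close to (10.78).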

If we now have a new observation x_* in data space, we can use (10.78) to determine the posterior distribution of the corresponding latent variable z_*. The covariance matrix C allows us to assess how confident the embedding is. A covariance matrix C with a small determinant (which measures volumes) tells us that the latent embedding z_* is fairly certain. If we obtain a posterior distribution p(z_* | x_*) that is uncertain, we may be faced with an outlier. However, we can explore this posterior distribution to understand what other data points x are plausible under it. To do this, we can exploit the generative process underlying PPCA, which allows us to generate new data that are plausible under this posterior. This can be achieved as follows:

1. Sample a latent variable z_* ∼ p(z_* | x_*) from the posterior distribution over the latent variables (10.78).
2. Sample a reconstructed vector x̃_* ∼ p(x | z_*, B, µ, σ²) from (10.69).

If we repeat this process many times, we can explore the posterior distribution (10.78) on the latent variables z_* and its implications on the observed data. The sampling process effectively hypothesizes data that is plausible under the posterior distribution.

10.8 Further Reading

We derived PCA from two perspectives: (a) maximizing the variance in the projected space; (b) minimizing the average reconstruction error. However, PCA can also be interpreted from different perspectives. Let us recap what we have done: We took high-dimensional data x ∈ R^D and used a matrix B^\top to find a lower-dimensional representation z ∈ R^M. The columns of B are the eigenvectors of the data covariance matrix S that are associated with the largest eigenvalues. Once we have a low-dimensional representation z, we can get a high-dimensional version of it (in the original data space) as x ≈ x̃ = Bz = BB^\top x ∈ R^D, where BB^\top is a projection matrix.

We can also think of PCA as a linear auto-encoder, as illustrated in Figure 10.14. An auto-encoder encodes the data x_n ∈ R^D to a code z_n ∈ R^M and tries to decode it to an x̃_n similar to x_n. The mapping from the data to the code is called the encoder; the mapping from the code back to the original data space is called the decoder. If we consider linear mappings where the code is given by z_n = B^\top x_n ∈ R^M and we are interested in minimizing the average squared error between the data x_n and its reconstruction


Figure 10.14: PCA can be viewed as a linear auto-encoder. It encodes the high-dimensional data x into a lower-dimensional representation (code) z ∈ R^M and decodes z using a decoder. The decoded vector x̃ is the orthogonal projection of the original data x onto the M-dimensional principal subspace.

x̃_n = B z_n, n = 1, ..., N, we obtain

\frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \|x_n - B B^\top x_n\|^2 .   (10.79)

This means we end up with the same objective function as in (10.31) that we discussed in Section 10.3, so that we obtain the PCA solution when we minimize the squared auto-encoding loss. If we replace the linear mapping of PCA with a nonlinear mapping, we get a nonlinear auto-encoder. A prominent example of this is a deep auto-encoder, where the linear functions are replaced with deep neural networks. In this context, the encoder is also known as a recognition network or inference network, whereas the decoder is also called a generator.

Another interpretation of PCA is related to information theory. We can think of the code as a smaller or compressed version of the original data point. When we reconstruct our original data using the code, we do not get the exact data point back, but a slightly distorted or noisy version of it. This means that our compression is "lossy". Intuitively, we want to maximize the correlation between the original data and the lower-dimensional code. More formally, this is related to the mutual information, a core concept in information theory (MacKay, 2003): we would obtain the same PCA solution as in Section 10.3 by maximizing the mutual information.

In our discussion on PPCA, we assumed that the parameters of the model, i.e., B, µ and the likelihood parameter σ², are known. Tipping and Bishop (1999) describe how to derive maximum likelihood estimates for these parameters in the PPCA setting (note that we use a different notation in this chapter). The maximum likelihood parameters, when projecting D-dimensional data onto an M-dimensional subspace, are given by

\mu_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n ,   (10.80)

B_{\text{ML}} = T (\Lambda - \sigma^2 I)^{\frac{1}{2}} ,   (10.81)

\sigma^2_{\text{ML}} = \frac{1}{D - M} \sum_{j=M+1}^{D} \lambda_j ,   (10.82)

where T ∈ R^{D×M} contains the M eigenvectors of the data covariance matrix, and Λ = diag(λ_1, ..., λ_M) ∈ R^{M×M} is a diagonal matrix with the eigenvalues belonging to the principal axes on its diagonal. The maximum likelihood solution B_ML is unique up to arbitrary rotations, i.e., we can right-multiply B_ML with any rotation matrix R ∈ R^{M×M}, so that (10.81) essentially is a singular value decomposition (see Section 4.5). An outline of the proof is given by Tipping and Bishop (1999).

The maximum likelihood estimate for µ given in (10.80) is the sample mean of the data. The maximum likelihood estimator for the observation noise variance σ² given in (10.82) is the average variance in the orthogonal complement of the principal subspace, i.e., the average leftover variance that we cannot capture with the first M principal components is treated as observation noise.
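The estimates (10.80)–(10.82) can be sketched in NumPy from a single eigendecomposition of the sample covariance (made-up data; an illustration, not a reference implementation — in particular, the arbitrary rotation R in B_ML is fixed to the identity here):

```python
import numpy as np

def ppca_mle(X, M):
    """ML estimates (10.80)-(10.82); X has shape (D, N), columns are data points."""
    D, N = X.shape
    mu = X.mean(axis=1)                              # sample mean (10.80)
    Xc = X - mu[:, None]
    evals, evecs = np.linalg.eigh(Xc @ Xc.T / N)     # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]       # reorder to descending
    sigma2 = evals[M:].sum() / (D - M)               # average leftover variance (10.82)
    T, Lam = evecs[:, :M], np.diag(evals[:M])
    B = T @ np.sqrt(Lam - sigma2 * np.eye(M))        # (10.81), with R = I
    return B, mu, sigma2

rng = np.random.default_rng(4)
# Two high-variance directions plus low-variance "noise" dimensions
X = rng.normal(size=(5, 1000)) * np.array([3.0, 2.0, 0.3, 0.3, 0.3])[:, None]
B, mu, sigma2 = ppca_mle(X, M=2)
assert B.shape == (5, 2) and sigma2 > 0
```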

In the noise-free limit where σ → 0, PPCA and PCA provide identical solutions: Since the data covariance matrix S is symmetric, it can be diagonalized (see Section 4.4), i.e., there exists a matrix T of eigenvectors of S so that

S = T \Lambda T^{-1} .   (10.83)

In the PPCA model, the data covariance matrix is the covariance matrix of the likelihood p(X | B, µ, σ²), which is BB^\top + σ²I; see (10.75). For σ → 0, we obtain BB^\top, so that this data covariance must equal the PCA data covariance (and its factorization given in (10.83)), so that

\mathrm{Cov}[X] = T \Lambda T^{-1} = B B^\top \iff B = T \Lambda^{\frac{1}{2}} ,   (10.84)

5650 From (10.81) and (10.83) it becomes clear that (P)PCA performs a de-

5651 composition of the data covariance matrix. We can also think of PCA as a

5652 method for finding the best rank-M approximation of the data covariance

5653 matrix so that we can apply the Eckart-Young Theorem from Section 4.6,

5654 which states that the optimal solution can be determined using a singular

5655 value decomposition.

5656 In a streaming setting, where data arrives sequentially, it is recom-

5657 mended to use the iterative Expectation Maximization (EM) algorithm for

5658 maximum likelihood estimation (Roweis, 1998).

Similar to our discussion on linear regression in Chapter 9, we can place a prior distribution on the parameters of the model and integrate them out, thereby (a) avoiding point estimates of the parameters and the issues that come with these point estimates (see Section 8.4) and (b) allowing for an automatic selection of the appropriate dimensionality M of the latent space. In this Bayesian PCA, which was proposed by Bishop (1999), the generative process allows us to integrate the model parameters out instead of conditioning on them, which addresses overfitting issues. Since this integration is analytically intractable, Bishop (1999) proposes to use approximate inference methods, such as MCMC or variational inference. We refer to the work by Gilks et al. (1996) and Blei et al. (2017) for more details on these approximate inference techniques.

In PPCA, we considered the linear model x_n = B z_n + ε with p(z_n) = N(0, I) and ε ∼ N(0, σ²I), i.e., all observation dimensions are affected by the same amount of noise. If we allow each observation dimension d to have a different variance σ_d², we obtain factor analysis (FA) (Spearman, 1904; Bartholomew et al., 2011). This means FA gives the likelihood some more flexibility than PPCA, but still forces the data to be explained by the model parameters B, µ. (An overly flexible likelihood would be able to explain more than just the noise.) However, FA no longer allows for a closed-form solution to maximum likelihood, so that we need to use an iterative scheme, such as the EM algorithm, to estimate the model parameters. While in PPCA all stationary points are global optima, this no longer holds for FA. Compared to PPCA, FA does not change if we scale the data, but it does return different solutions if we rotate the data.

5684 Independent Component Analysis (ICA) is also closely related to PCA.

5685 Starting again with the model xn = Bz n + we now change the prior

blind-source 5686 on z n to non-Gaussian distributions. ICA can be used for blind-source sep-

separation 5687 aration. Imagine you are in a busy train station with many people talking.

5688 Your ears play the role of microphones, and they linearly mix different

5689 speech signals in the train station. The goal of blind-source separation is

5690 to identify the constituent parts of the mixed signals. As discussed above

5691 in the context of maximum likelihood estimation for PPCA, the original

5692 PCA solution is invariant to any rotation. Therefore, PCA can identify the

5693 best lower-dimensional subspace in which the signals live, but not the sig-

5694 nals themselves (Murphy, 2012). ICA addresses this issue by modifying

5695 the prior distribution p(z) on the latent sources to require non-Gaussian

Murphy2012 5696 priors p(z). We refer to the book by Murphy2012 for more details on ICA.

PCA, factor analysis, and ICA are three examples of dimensionality reduction with linear models. Cunningham and Ghahramani (2015) provide a broader survey of linear dimensionality reduction.

The (P)PCA model we discussed here allows for several important extensions. In Section 10.6, we explained how to do PCA when the input dimensionality D is significantly greater than the number N of data points. By exploiting the insight that PCA can be performed by computing (many) inner products, this idea can be pushed to the extreme by considering infinite-dimensional features. The kernel trick is the basis of kernel PCA and allows us to implicitly compute inner products between infinite-dimensional features (Schölkopf et al., 1998; Schölkopf and Smola, 2002).
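As a rough sketch of the idea (not the book's implementation), kernel PCA replaces the data covariance with a centered kernel matrix and works entirely with inner products; here with an RBF kernel, where the function names and parameter choices are illustrative:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K_ij = exp(-gamma * ||x_i - x_j||^2), an implicit inner product
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def kernel_pca(X, n_components=2, gamma=1.0):
    N = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Center the kernel matrix: centers the features implicitly.
    J = np.full((N, N), 1.0 / N)
    Kc = K - J @ K - K @ J + J @ K @ J
    eigvals, eigvecs = np.linalg.eigh(Kc)       # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas  # projections of the training points

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
Y = kernel_pca(X, n_components=2)
```

Since the centered kernel matrix has zero column sums, the resulting projections are centered as well, mirroring the mean subtraction in ordinary PCA.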

There are nonlinear dimensionality reduction techniques that are derived from PCA. The auto-encoder perspective of PCA that we discussed above can be used to render PCA as a special case of a deep auto-encoder. In the deep auto-encoder, both the encoder and the decoder are represented by multi-layer feedforward neural networks, which themselves are nonlinear mappings. If we set the activation functions in these neural networks to be the identity, the model becomes equivalent to PCA.

A different approach to nonlinear dimensionality reduction is the Gaussian Process Latent Variable Model (GP-LVM) proposed by Lawrence (2005). The GP-LVM starts off with the latent-variable perspective that we used to derive PPCA and replaces the linear relationship between the latent variables z and the observations x with a Gaussian process (GP). Instead of estimating the parameters of the mapping (as we do in PPCA), the GP-LVM marginalizes out the model parameters and makes point estimates of the latent variables z. Similar to Bayesian PCA, the Bayesian GP-LVM proposed by Titsias and Lawrence (2010) maintains a distribution on the latent variables z and uses approximate inference to integrate them out as well.
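The equivalence between PCA and a linear auto-encoder can be checked directly: with identity activations, a one-layer encoder and decoder compose to the linear map x̂ = W Wᵀ x, and choosing W as the top principal directions yields the best rank-M reconstruction (Eckart–Young). A sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, M = 100, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated data
X = X - X.mean(axis=0)                                 # center, as in PCA

# W holds the top-M principal directions (right singular vectors of X).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:M].T

# Linear auto-encoder with identity activations:
# encoder z = W^T x, decoder x_hat = W z.
X_hat = X @ W @ W.T

# Eckart-Young: this equals the best rank-M approximation of X.
X_svd = U[:, :M] * S[:M] @ Vt[:M]
print(np.allclose(X_hat, X_svd))  # True
```

A trained linear auto-encoder with M hidden units recovers the same subspace (though not necessarily the same basis, by the rotation invariance discussed earlier).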

© 2018 Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong. To be published by Cambridge University Press.
