10 Dimensionality Reduction with Principal Component Analysis

Data in real life is often high dimensional. For example, if we want to estimate the price of our house in a year's time, we can use data that helps us to do this: the type of house, the size, the number of bedrooms and bathrooms, the value of houses in the neighborhood when they were bought, the distance to the next train station and park, the number of crimes committed in the neighborhood, the economic climate, etc. Many things influence the house price, and we collect this information in a data set that we can use to estimate the house price. Another example is a 640 × 480 pixel color image, which is a data point in a nearly million-dimensional space (640 · 480 · 3 = 921,600), where every pixel corresponds to three dimensions, one for each color channel (red, green, blue).
Working directly with high-dimensional data comes with some difficulties: It is hard to analyze, interpretation is difficult, visualization is nearly impossible, and (from a practical point of view) storage can be expensive. However, high-dimensional data also has some nice properties: For example, high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions. Dimensionality reduction exploits structure and correlation and allows us to work with a more compact representation of the data, ideally without losing information. We can think of dimensionality reduction as a compression technique, similar to jpg or mp3, which are compression algorithms for images and music.
In this chapter, we will discuss principal component analysis (PCA), an algorithm for linear dimensionality reduction. PCA, proposed by Pearson (1901) and Hotelling (1933), has been around for more than 100 years and is still one of the most commonly used techniques for data compression, data visualization and the identification of simple patterns, latent factors and structures of high-dimensional data. In the signal processing community, PCA is also known as the Karhunen-Loève transform. In this chapter, we will explore the concept of linear dimensionality reduction with PCA in more detail, drawing on our understanding of basis and basis change (see Sections 2.6.1 and 2.7.2), projections (see Section 3.6), eigenvalues (see Section 4.2), Gaussian distributions (see Section 6.6) and constrained optimization (see Section 7.2).
Draft chapter (July 10, 2018) from “Mathematics for Machine Learning” © 2018 by Marc Peter Deisenroth, A Aldo Faisal, and Cheng Soon Ong. To be published by Cambridge University Press. Report errata and feedback to http://mml-book.com. Please do not post or distribute this file, please link to https://mml-book.com.

Figure 10.1 Illustration: Dimensionality reduction. (a) Dataset with x_1 and x_2 coordinates; the original dataset does not vary much along the x_2 direction. (b) Compressed dataset where only the x_1 coordinate is relevant; the data from (a) can be represented using the x_1 coordinate alone with nearly no loss.

Figure 10.2 Graphical illustration of PCA (diagram: original x ∈ R^D, code z ∈ R^M, compressed x̃ ∈ R^D). In PCA, we find a compressed version x̃ of original data x that has an intrinsic lower-dimensional representation z.

Dimensionality reduction generally exploits the property of high-dimensional data (e.g., images) that it often lies on a low-dimensional subspace, and that many dimensions are highly correlated, redundant or contain little information. Figure 10.1 gives an illustrative example in two dimensions. Although the data in Figure 10.1(a) does not quite lie on a line, the data does not vary much in the x_2-direction, so that we can express it as if it were on a line with nearly no loss, see Figure 10.1(b). The data in Figure 10.1(b) requires only the x_1-coordinate to describe and lies in a one-dimensional subspace of R^2.
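The idea behind Figure 10.1 can be checked numerically. The following is a minimal NumPy sketch on synthetic data (the standard deviations 2.5 and 0.1 and all variable names are assumptions made for illustration, not values from the figure): we drop the x_2 coordinate and measure how much of the average squared length of the data is lost.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
# Hypothetical 2-D data: large spread along x1, small spread along x2.
X = np.vstack([rng.normal(0.0, 2.5, N), rng.normal(0.0, 0.1, N)])  # shape (2, N)

# "Compress" by keeping only the x1 coordinate and setting x2 to 0.
X_compressed = X.copy()
X_compressed[1, :] = 0.0

# Average squared distance between original and compressed points,
# relative to the average squared length of the original points.
loss = np.mean(np.sum((X - X_compressed) ** 2, axis=0))
total = np.mean(np.sum(X ** 2, axis=0))
relative_loss = loss / total
print(f"relative loss: {relative_loss:.4f}")
```

Dropping x_1 instead would lose almost all of the spread, which is the asymmetry that PCA exploits.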

10.1 Problem Setting

In PCA, we are interested in finding projections x̃_n of data points x_n that are as similar to the original data points as possible, but which have a significantly lower intrinsic dimensionality. Figure 10.1 gives an illustration of what this could look like.

Figure 10.2 illustrates the setting we consider in PCA, where z is the low-dimensional representation of the compressed data x̃ and plays the role of a bottleneck, which controls how much information can flow between x and x̃.

More concretely, we consider i.i.d. data points x_1, ..., x_N ∈ R^D, and we search for a low-dimensional, compressed representation (code) z_n of x_n. If our observed data lives in R^D, we look for an M-dimensional subspace U ⊆ R^D, dim(U) = M < D, onto which we project data. We denote the projected data as x̃_n ∈ U, and their coordinates (with respect to an appropriate basis in U) with z_n. Our aim is to find x̃_n so that they are as similar to the original data x_n as possible.

Figure 10.3 Examples of handwritten digits from the MNIST dataset.

Example 10.1 (Coordinate Representation/Code)

Consider R^2 with the canonical basis e_1 = [1, 0]^⊤, e_2 = [0, 1]^⊤. From Chapter 2 we know that x ∈ R^2 can be represented as a linear combination of these basis vectors, e.g.,

    \begin{bmatrix} 5 \\ 3 \end{bmatrix} = 5 e_1 + 3 e_2 .    (10.1)

However, when we consider the set of vectors

    \tilde{x} = \begin{bmatrix} 0 \\ z \end{bmatrix} \in \mathbb{R}^2 , \quad z \in \mathbb{R} ,    (10.2)

they can always be written as 0e_1 + ze_2. To represent these vectors it is sufficient to remember/store the coordinate/code z of the e_2 vector.
More precisely, the set of x̃ vectors (with the standard vector addition and scalar multiplication) forms a vector subspace U (see Section 2.4) with dim(U) = 1 because U = span[e_2]. (The dimension of a vector space corresponds to the number of its basis vectors, see Section 2.6.1.)

In PCA, we consider the relationship between the original data x and its low-dimensional code z to be linear so that z = B^⊤ x for a suitable matrix B.
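Example 10.1 and the linear relationship z = B^⊤ x can be written out in a few lines. This is a small NumPy sketch in which B consists of the single basis vector e_2; the concrete vectors and variable names are chosen only for illustration.

```python
import numpy as np

e2 = np.array([0.0, 1.0])
B = e2.reshape(2, 1)            # basis of U = span[e2], shape (D, M) = (2, 1)

x_tilde = np.array([0.0, 3.0])  # a vector that lies in U
z = B.T @ x_tilde               # its code: a single number, shape (1,)
x_back = B @ z                  # mapping the code back into R^2 recovers x_tilde

x = np.array([5.0, 3.0])        # the vector from (10.1), which is not in U
z_x = B.T @ x                   # its code keeps only the e2 coordinate
```

For a vector in U the code is lossless, whereas for x = [5, 3]^⊤ the e_1 coordinate is discarded when we map back with B.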
Throughout this chapter, we will use the MNIST digits dataset (http://yann.lecun.com/exdb/mnist/) as a recurring example, which contains 60,000 examples of handwritten digits 0 to 9. Each digit is an image of size 28 × 28, i.e., it contains 784 pixels so that we can interpret every image in this dataset as a vector x ∈ R^784. Examples of these digits are shown in Figure 10.3.

In the following, we will derive PCA from two different perspectives. First, we derive PCA by maintaining as much variance as possible in the projected space. Second, we will derive PCA by minimizing the average squared reconstruction error, which directly links to many concepts in Chapters 3 and 4.


10.2 Maximum Variance Perspective

Figure 10.1 gave an example of how a two-dimensional dataset can be represented using a single coordinate. In Figure 10.1(b), we chose to ignore the x_2-coordinate of the data because it did not add too much information so that the compressed data is similar to the original data in Figure 10.1(a). We could have chosen to ignore the x_1-coordinate, but then the compressed data would have been very dissimilar from the original data, and much information in the data would have been lost.

If we interpret information content in the data as how “space filling” the data set is, then we can describe the information contained in the data by looking at the spread of the data. From Chapter 6 we know that the variance is an indicator of the spread of the data, and it is possible to formulate PCA as a dimensionality reduction algorithm that maximizes the variance in the low-dimensional representation of the data to retain as much information as possible. Now, let us formulate this objective more concretely.
Consider a dataset x_1, ..., x_N, x_n ∈ R^D, with mean 0 that possesses the data covariance matrix (empirical covariance)

    S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top .    (10.3)
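As an illustration of (10.3), the following NumPy sketch computes S for synthetic centered data stored as columns of X (the chapter's D × N convention). The data and names are assumptions for the example; `np.cov` with `bias=True` is used only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 1000
X = rng.normal(size=(D, N))               # data points are the columns of X
X = X - X.mean(axis=1, keepdims=True)     # center so that the mean is 0

S = (X @ X.T) / N                         # (10.3) written as a matrix product
S_loop = sum(np.outer(x, x) for x in X.T) / N   # the same sum, term by term
S_numpy = np.cov(X, bias=True)            # NumPy's estimator with 1/N scaling
```

Writing the sum of outer products as X X^⊤ / N is what makes the column convention convenient.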

Furthermore, we assume a low-dimensional representation z_n = B^⊤ x_n ∈ R^M of x_n, where B ∈ R^{D×M}.

Our aim is to find a matrix B that retains as much information as possible when compressing data. We assume that the columns b_1, ..., b_M of B are orthonormal so that b_i^⊤ b_j = 0 if i ≠ j and b_i^⊤ b_i = 1. These columns form a basis of the M-dimensional subspace in which the projected data x̃ = BB^⊤ x ∈ R^D live. Retaining most information is formulated as capturing the largest amount of variance in the low-dimensional code (Hotelling, 1933).

Remark. (Centered Data) Let us assume that µ = E_x[x] is the (empirical) mean of the data. Using the properties of the variance, which we discussed in Section 6.4.4, we obtain

    \mathbb{V}_z[z] = \mathbb{V}_x[B^\top (x - \mu)] = \mathbb{V}_x[B^\top x - B^\top \mu] = \mathbb{V}_x[B^\top x] ,    (10.4)

i.e., the variance of the low-dimensional code does not depend on the mean of the data. Therefore, we assume without loss of generality that the data has mean 0 for the remainder of this section. With this assumption the mean of the low-dimensional code is also 0 since E_z[z] = E_x[B^⊤ x] = B^⊤ E_x[x] = 0. ♦
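The statement in (10.4) can be verified empirically. In this sketch the data, its nonzero mean, and the matrix B (obtained from a QR decomposition of a random matrix, simply to get orthonormal columns) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, N = 4, 2, 2000
shift = np.array([[3.0], [-1.0], [0.5], [10.0]])
X = rng.normal(size=(D, N)) + shift            # data with a nonzero mean
B, _ = np.linalg.qr(rng.normal(size=(D, M)))   # some B with orthonormal columns

mu = X.mean(axis=1, keepdims=True)
Z_raw = B.T @ X                  # code of the uncentered data
Z_centered = B.T @ (X - mu)      # code of the centered data

cov_raw = np.cov(Z_raw, bias=True)
cov_centered = np.cov(Z_centered, bias=True)
```

Both covariance matrices agree, and the centered code has mean 0, as stated in the remark.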
We maximize the variance of the low-dimensional code using a sequential approach. We start by seeking a single vector b_1 ∈ R^D that maximizes the variance of the projected data, i.e., we aim to maximize the variance of the first coordinate z_1 of z ∈ R^M so that

    V_1 := \mathbb{V}[z_1] = \frac{1}{N} \sum_{n=1}^{N} z_{1n}^2    (10.5)

is maximized, where we exploited the i.i.d. assumption of the data and defined z_{1n} as the first coordinate of the low-dimensional representation z_n ∈ R^M of x_n ∈ R^D. Note that the first component of z_n is given by

    z_{1n} = b_1^\top x_n .    (10.6)

We use this relationship now in (10.5), which yields

    V_1 = \frac{1}{N} \sum_{n=1}^{N} (b_1^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} b_1^\top x_n x_n^\top b_1    (10.7a)
        = b_1^\top \Big( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \Big) b_1 = b_1^\top S b_1 ,    (10.7b)

where S is the data covariance matrix defined in (10.3).
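The identity between (10.5) and (10.7b) holds for any direction b_1, which a short NumPy sketch can confirm. The data and the particular unit vector below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 3, 500
X = rng.normal(size=(D, N)) * np.array([[3.0], [1.0], [0.2]])
X = X - X.mean(axis=1, keepdims=True)   # centered data, columns are points
S = X @ X.T / N                         # data covariance matrix (10.3)

b1 = np.array([1.0, 2.0, 2.0]) / 3.0    # an arbitrary unit vector
z1 = b1 @ X                             # (10.6): first coordinate of every code
V1_direct = np.mean(z1 ** 2)            # (10.5)
V1_quadratic = b1 @ S @ b1              # (10.7b)
```

The quadratic form b_1^⊤ S b_1 lets us evaluate the variance along a direction without touching the individual data points again.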


It is clear that arbitrarily increasing the magnitude of the vector b_1 increases V_1. Therefore, we restrict all solutions to ‖b_1‖^2 = 1, which results in a constrained optimization problem in which we seek the direction along which the data varies most.
With the restriction of the solution space to unit vectors we end up with the constrained optimization problem

    \max_{b_1} \; b_1^\top S b_1    (10.8)
    \text{subject to} \; \|b_1\|^2 = 1 .    (10.9)

Following Section 7.2, we obtain the Lagrangian

    \mathcal{L} = V_1 + \lambda_1 (1 - b_1^\top b_1) = b_1^\top S b_1 + \lambda_1 (1 - b_1^\top b_1)    (10.10)

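to solve this constrained optimization problem. The partial derivatives of L with respect to b_1 and λ_1 are

    \frac{\partial \mathcal{L}}{\partial b_1} = 2 b_1^\top S - 2 \lambda_1 b_1^\top ,    (10.11)
    \frac{\partial \mathcal{L}}{\partial \lambda_1} = 1 - b_1^\top b_1 ,    (10.12)

respectively. Setting these partial derivatives to 0, and transposing the first one using the symmetry of S, gives us the relations

    b_1^\top b_1 = 1 ,    (10.13)
    S b_1 = \lambda_1 b_1 ,    (10.14)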

i.e., we see that b_1 is an eigenvector of the data covariance matrix S, and the Lagrange multiplier λ_1 plays the role of the corresponding eigenvalue. This eigenvector property allows us to rewrite our variance objective as

    V_1 = b_1^\top S b_1 = \lambda_1 b_1^\top b_1 = \lambda_1 ,    (10.15)

i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue that is associated with the basis vector b_1 that spans this subspace. Therefore, to maximize the variance of the low-dimensional code we choose the basis vector belonging to the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component. We can determine the effect/contribution of the principal component b_1 in the original data space by mapping the coordinate z_{1n} back into data space, which gives us the projected data point

    \tilde{x}_n = b_1 z_{1n} = b_1 b_1^\top x_n \in \mathbb{R}^D    (10.16)

in the original data space.
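The result above can be illustrated with a short NumPy sketch. It assumes synthetic correlated data and uses `np.linalg.eigh`, which returns the eigenvalues of a symmetric matrix in ascending order; comparing against random unit vectors is only a sanity check, not a proof of optimality.

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 3, 1000
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, N))   # correlated toy data
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / N

eigvals, eigvecs = np.linalg.eigh(S)      # ascending eigenvalues, S is symmetric
lam1, b1 = eigvals[-1], eigvecs[:, -1]    # largest eigenvalue and its eigenvector

V1 = b1 @ S @ b1                          # variance along b1, see (10.15)
X_tilde = np.outer(b1, b1) @ X            # (10.16) applied to every column of X

# Sanity check: no random unit vector should capture more variance than b1.
trials = rng.normal(size=(D, 200))
trials /= np.linalg.norm(trials, axis=0)
best_random = max(b @ S @ b for b in trials.T)
```

The projected data X_tilde still lives in R^D, but all its columns are multiples of b_1.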
Remark. Although x̃_n is a D-dimensional vector it only requires a single coordinate z_{1n} to represent it with respect to the basis vector b_1 ∈ R^D. The quantity √λ_1 is also called the loading of the unit vector b_1 and represents the standard deviation of the data accounted for by the principal subspace span[b_1]. ♦

Generally, the mth principal component can be found by subtracting the effect of the first m − 1 principal components from the data and then finding the direction that captures most of the remaining variance. Subtracting the contribution of the first m − 1 principal components, similar to (10.16), gives the new data matrix

    \hat{X} := X - \sum_{i=1}^{m-1} b_i b_i^\top X ,    (10.17)

where X = [x_1, ..., x_N] ∈ R^{D×N} contains the data points as column vectors. The matrix X̂ in (10.17) contains only the information that has not yet been compressed.
Remark (Notation). Throughout this chapter, we do not follow the convention of collecting data x_1, ..., x_N as rows of the data matrix, but we define them to be the columns of X. This means that our data matrix X is a D × N matrix instead of the conventional N × D matrix. The reason for our choice is that the algebra operations work out smoothly without the need to either transpose the matrix or to redefine vectors as row vectors that are left-multiplied onto matrices. ♦
To find the mth principal component, we maximize the variance

    V_m = \mathbb{V}[z_m] = \frac{1}{N} \sum_{n=1}^{N} z_{mn}^2 = \frac{1}{N} \sum_{n=1}^{N} (b_m^\top \hat{x}_n)^2 = b_m^\top \hat{S} b_m ,    (10.18)

subject to ‖b_m‖^2 = 1, where x̂_n denotes the nth column of X̂, we followed the same steps as in (10.7b), and we defined Ŝ as the data covariance matrix of X̂. As previously, when we looked at the first principal component alone, we solve a constrained optimization problem and discover that the optimal solution b_m is the eigenvector of Ŝ that belongs to the largest eigenvalue of Ŝ.
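However, it also turns out that b_m is an eigenvector of S. Since

    \hat{S} = \frac{1}{N} \sum_{n=1}^{N} \hat{x}_n \hat{x}_n^\top \overset{(10.17)}{=} \frac{1}{N} \sum_{n=1}^{N} \Big( x_n - \sum_{i=1}^{m-1} b_i b_i^\top x_n \Big) \Big( x_n - \sum_{i=1}^{m-1} b_i b_i^\top x_n \Big)^\top    (10.19a)
            = \frac{1}{N} \sum_{n=1}^{N} \Big( x_n x_n^\top - x_n x_n^\top \sum_{i=1}^{m-1} b_i b_i^\top - \sum_{i=1}^{m-1} b_i b_i^\top x_n x_n^\top + \sum_{i=1}^{m-1} b_i b_i^\top x_n x_n^\top \sum_{i=1}^{m-1} b_i b_i^\top \Big) ,    (10.19b)

we can multiply b_m onto Ŝ and obtain

    \hat{S} b_m = \frac{1}{N} \sum_{n=1}^{N} \hat{x}_n \hat{x}_n^\top b_m = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top b_m = S b_m = \lambda_m b_m .    (10.20)

Here we applied the orthogonality conditions b_i^⊤ b_m = 0 for all i = 1, ..., m − 1, so that every term in (10.19b) that ends in the sum over b_i b_i^⊤ vanishes when multiplied by b_m. The remaining cross term sums to \sum_{i=1}^{m-1} b_i (b_i^\top S b_m), which also vanishes: by the preceding steps, b_1, ..., b_{m−1} are eigenvectors of the symmetric matrix S, so that b_i^⊤ S b_m = λ_i b_i^⊤ b_m = 0. In the end, we exploited the fact that b_m is an eigenvector of Ŝ. Therefore, b_m is also an eigenvector of the original data covariance matrix S, and the corresponding eigenvalue λ_m is the mth largest eigenvalue of S. Moreover, the variance of the data projected onto the mth principal component is

    V_m = b_m^\top S b_m \overset{(10.20)}{=} \lambda_m b_m^\top b_m = \lambda_m    (10.21)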

since b_m^⊤ b_m = 1. This means that the variance of the data, when projected onto an M-dimensional subspace, equals the sum of the eigenvalues that belong to the corresponding eigenvectors of the data covariance matrix.
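The sequential construction can be traced numerically. The sketch below assumes synthetic data and takes b_1, ..., b_{m−1} to be the leading eigenvectors of S (as the derivation establishes); it then checks that the top eigenvector of the deflated covariance Ŝ is the mth eigenvector of S.

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 4, 2000
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / N

eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort in descending order

m = 3                                     # look for the third principal component
B_prev = eigvecs[:, : m - 1]              # b_1, ..., b_{m-1}
X_hat = X - B_prev @ B_prev.T @ X         # deflated data, (10.17)
S_hat = X_hat @ X_hat.T / N               # covariance of the deflated data

vals_hat, vecs_hat = np.linalg.eigh(S_hat)
lam_m, b_m = vals_hat[-1], vecs_hat[:, -1]   # top eigenpair of S_hat
```

Eigenvectors are only determined up to sign, so a comparison with the mth eigenvector of S should use the absolute value of their inner product.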
In practice, we do not have to compute principal components sequentially, but we can compute all of them at the same time. If we are looking for a projection onto an M-dimensional subspace so that as much variance as possible is retained in the projection, then PCA tells us to choose the columns of B to be the eigenvectors that belong to the M largest eigenvalues of the data covariance matrix. The maximum amount of variance PCA can capture with the first M principal components is

    V = \sum_{m=1}^{M} \lambda_m ,    (10.22)

where the λ_m are the M largest eigenvalues of the data covariance matrix S. Consequently, the variance lost by data compression via PCA is

    J = \sum_{j=M+1}^{D} \lambda_j .    (10.23)
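Equations (10.22) and (10.23) translate directly into code. This NumPy sketch uses synthetic data and illustrative names; it computes all M principal components at once and compares the captured and lost variance with the eigenvalue sums.

```python
import numpy as np

rng = np.random.default_rng(6)
D, N, M = 5, 3000, 2
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / N

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]         # indices from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

B = eigvecs[:, :M]                        # all M principal components at once
Z = B.T @ X                               # codes, shape (M, N)

captured = eigvals[:M].sum()              # (10.22)
lost = eigvals[M:].sum()                  # (10.23)
variance_of_code = np.sum(np.mean(Z ** 2, axis=1))
total_variance = np.trace(S)
```

The ratio captured / total_variance is a common way to decide how large M should be, although the text does not prescribe a rule.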

To summarize, to determine the M-dimensional subspace for which an orthogonal projection maximizes the variance of the data we need to compute the M eigenvectors that belong to the M largest eigenvalues of the data covariance matrix. In Section 10.4, we will return to this point and discuss how to efficiently compute these M eigenvectors.

Figure 10.4 Illustration of the projection approach to PCA. We aim to find a one-dimensional subspace (line) of R^2 so that the distance vector between projected (orange) and original (blue) data is as small as possible.

10.3 Projection Perspective

In the following, we will derive PCA as an algorithm for linear dimensionality reduction that minimizes the average projection error. We will draw heavily from Chapters 2 and 3. In the previous section, we derived PCA by maximizing the variance in the projected space to retain as much information as possible. In the following, we will look at the difference vectors between the original data x_n and their reconstruction x̃_n and minimize this distance so that x_n and x̃_n are as close as possible. Figure 10.4 illustrates this setting.

10.3.1 Setting and Objective

Assume an (ordered) orthonormal basis (ONB) B = (b_1, ..., b_D) of R^D, i.e., b_i^⊤ b_j = 1 if and only if i = j and 0 otherwise. From Section 2.5 we know that every x ∈ R^D can be written as a linear combination of the basis vectors of R^D, i.e.,

    x = \sum_{d=1}^{D} z_d b_d    (10.24)

for z_d ∈ R. We are interested in finding vectors x̃ ∈ R^D, which live in a lower-dimensional subspace U of R^D, so that x̃ is as similar to x as possible. As x̃ ∈ U ⊆ R^D, we can also express x̃ as a linear combination of the basis vectors of R^D so that

    \tilde{x} = \sum_{d=1}^{D} z_d b_d .    (10.25)

Figure 10.5 Simplified projection setting. (a) Setting: A vector x ∈ R^2 (red cross) shall be projected onto a one-dimensional subspace U ⊆ R^2 spanned by b. (b) Differences x − x̃ for 50 candidates x̃ are shown by the red lines.

Let us assume dim(U) = M where M < D = dim(R^D). Then, we can find basis vectors b_1, ..., b_D of R^D so that at least D − M of the coefficients z_d are equal to 0, and we can rearrange the way we index the basis vectors b_d such that the coefficients that are zero appear at the end. (For example, vectors x̃ ∈ U could be vectors on a plane in R^3. The dimensionality of the plane is 2, but the vectors still have three coordinates in R^3.) This allows us to express x̃ as

    \tilde{x} = \sum_{m=1}^{M} z_m b_m + \sum_{j=M+1}^{D} 0 \, b_j = \sum_{m=1}^{M} z_m b_m = B z \in \mathbb{R}^D ,    (10.26)

where we defined

    B := [b_1, \dots, b_M] \in \mathbb{R}^{D \times M} ,    (10.27)
    z := [z_1, \dots, z_M]^\top \in \mathbb{R}^M .    (10.28)

In the following, we use exactly this kind of representation of x̃ to find optimal coordinates z and basis vectors b_1, ..., b_M such that x̃ is as similar to the original data point x as possible, i.e., we aim to minimize the (Euclidean) distance ‖x − x̃‖. Figure 10.5 illustrates this setting.

Without loss of generality, we assume that the dataset X = {x_1, ..., x_N}, x_n ∈ R^D, is centered at 0, i.e., E[X] = 0.

Remark. Without the zero-mean assumption, we would arrive at exactly the same solution but the notation would be substantially more cluttered. ♦
We are interested in finding the best linear projection of X onto a lower-dimensional subspace U of R^D with dim(U) = M and orthonormal basis vectors b_1, ..., b_M. We will call this subspace U the principal subspace, and (b_1, ..., b_M) is an orthonormal basis of the principal subspace. The projections are denoted by

    \tilde{x}_n := \sum_{m=1}^{M} z_{mn} b_m = B z_n \in \mathbb{R}^D ,    (10.29)

where B ∈ R^{D×M} is given in (10.27) and

    z_n := [z_{1n}, \dots, z_{Mn}]^\top \in \mathbb{R}^M , \quad n = 1, \dots, N ,    (10.30)

is the coordinate vector of x̃_n with respect to the basis (b_1, ..., b_M). More specifically, we are interested in having the x̃_n as similar to x_n as possible. There are many ways to measure similarity.
The similarity measure we use in the following is the squared Euclidean norm ‖x − x̃‖^2 between x and x̃. We therefore define our objective as minimizing the average squared Euclidean distance (reconstruction error) (Pearson, 1901)

    J := \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 .    (10.31)
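The objective (10.31) is easy to evaluate for any candidate basis and coordinates. The helper below is a hypothetical function written for illustration (its name and signature are not from the text); it takes the data X (D × N), a basis B (D × M) and coordinates Z (M × N) and returns J.

```python
import numpy as np

def reconstruction_error(X, B, Z):
    """Average squared reconstruction error (10.31) with x_tilde_n = B z_n."""
    X_tilde = B @ Z                                  # (10.29), one column per point
    return np.mean(np.sum((X - X_tilde) ** 2, axis=0))

# Two data points in R^2, projected onto the line spanned by e1.
X = np.array([[1.0, 2.0],
              [1.0, 0.0]])
B = np.array([[1.0],
              [0.0]])
Z_good = np.array([[1.0, 2.0]])    # coordinates that keep the e1 component
Z_zero = np.zeros((1, 2))          # coordinates that discard everything

J_good = reconstruction_error(X, B, Z_good)
J_zero = reconstruction_error(X, B, Z_zero)
```

In this toy case the first point loses its e_2 component (squared error 1) and the second is reconstructed exactly, so J_good is the average of 1 and 0.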
In order to find this optimal linear projection, we need to find the orthonormal basis of the principal subspace and the coordinates z_n of the projections with respect to these basis vectors. All these parameters enter our objective (10.31) through x̃_n.

To find them, we optimize J by computing the partial derivatives of J with respect to all parameters of interest (i.e., the coordinates and the basis vectors), setting them to 0, and solving for the parameters. We detail these steps next. We will first determine the optimal coordinates z_{in} and then the basis vectors b_1, ..., b_M of the principal subspace, i.e., the subspace in which x̃ lives.

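10.3.2 Optimization

Since the parameters we are interested in, i.e., the basis vectors b_i and the coordinates z_{in} of the projection with respect to the basis of the principal subspace, only enter the objective J through x̃_n, we obtain

    \frac{\partial J}{\partial z_{in}} = \frac{\partial J}{\partial \tilde{x}_n} \frac{\partial \tilde{x}_n}{\partial z_{in}} ,    (10.32)
    \frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial \tilde{x}_n} \frac{\partial \tilde{x}_n}{\partial b_i}    (10.33)

for i = 1, ..., M and n = 1, ..., N, where

    \frac{\partial J}{\partial \tilde{x}_n} = -\frac{2}{N} (x_n - \tilde{x}_n)^\top \in \mathbb{R}^{1 \times D} .    (10.34)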
In the following, we determine the optimal coordinates z_{in} first before finding the ONB of the principal subspace.

Coordinates
Let us start by finding the coordinates z_{1n}, ..., z_{Mn} of the projections x̃_n for n = 1, ..., N. We assume that (b_1, ..., b_D) is an ordered ONB of R^D. From (10.32) we require the partial derivative

    \frac{\partial \tilde{x}_n}{\partial z_{in}} \overset{(10.29)}{=} \frac{\partial}{\partial z_{in}} \Big( \sum_{m=1}^{M} z_{mn} b_m \Big) = b_i    (10.35)

for i = 1, ..., M, such that we obtain

    \frac{\partial J}{\partial z_{in}} \overset{(10.34),(10.35)}{=} -\frac{2}{N} (x_n - \tilde{x}_n)^\top b_i \overset{(10.29)}{=} -\frac{2}{N} \Big( x_n - \sum_{m=1}^{M} z_{mn} b_m \Big)^\top b_i    (10.36)
    \overset{\text{ONB}}{=} -\frac{2}{N} \big( x_n^\top b_i - z_{in} \underbrace{b_i^\top b_i}_{=1} \big) = -\frac{2}{N} ( x_n^\top b_i - z_{in} ) .    (10.37)

Setting this partial derivative to 0 yields immediately the optimal coordinates

    z_{in} = x_n^\top b_i = b_i^\top x_n    (10.38)
for i = 1, ..., M and n = 1, ..., N. This means that the optimal coordinates z_{in} of the projection x̃_n are the coordinates of the orthogonal projection (see Section 3.6) of the original data point x_n onto the one-dimensional subspace that is spanned by b_i. Consequently:

• The optimal projection x̃_n of x_n is an orthogonal projection.
• The coordinates of x̃_n with respect to the basis b_1, ..., b_M are the coordinates of the orthogonal projection of x_n onto the principal subspace.
• An orthogonal projection is the best linear mapping we can find given the objective (10.31).
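The closed-form coordinates in (10.38) can be compared with a generic least-squares solver that does not use orthonormality. The sketch below assumes a random orthonormal basis obtained from a QR decomposition and a random vector x, both purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
D, M = 5, 2
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))   # random ONB of R^D (columns of Q)
B = Q[:, :M]                                    # first M basis vectors
x = rng.normal(size=D)

z_closed_form = B.T @ x                         # (10.38) for i = 1, ..., M
# Generic least squares: argmin_z ||x - B z||^2, without using orthonormality.
z_lstsq, *_ = np.linalg.lstsq(B, x, rcond=None)

residual = x - B @ z_closed_form                # difference vector x - x_tilde
```

The residual is orthogonal to every basis vector of the subspace, which is the defining property of an orthogonal projection.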
Remark (Orthogonal Projections with Orthonormal Basis Vectors). Let us briefly recap orthogonal projections from Section 3.6. If (b_1, ..., b_D) is an orthonormal basis of R^D then

    \tilde{x} = b_j (\underbrace{b_j^\top b_j}_{=1})^{-1} b_j^\top x = b_j b_j^\top x \in \mathbb{R}^D    (10.39)

is the orthogonal projection of x onto the subspace spanned by the jth basis vector, and z_j = b_j^⊤ x is the coordinate of this projection with respect to the basis vector b_j that spans that subspace since z_j b_j = x̃. Figure 10.6 illustrates this setting.

More generally, if we aim to project onto an M-dimensional subspace of R^D, we obtain the orthogonal projection of x onto the M-dimensional subspace with orthonormal basis vectors b_1, ..., b_M as

    \tilde{x} = B (\underbrace{B^\top B}_{=I})^{-1} B^\top x = B B^\top x ,    (10.40)

where we defined B := [b_1, ..., b_M] ∈ R^{D×M}. The coordinates of this projection with respect to the ordered basis (b_1, ..., b_M) are z := B^⊤ x as discussed in Section 3.6.

We can think of the coordinates as a representation of the projected vector in a new coordinate system defined by (b_1, ..., b_M). Note that although x̃ ∈ R^D we only need M coordinates z_1, ..., z_M to represent this vector; the other D − M coordinates with respect to the basis vectors (b_{M+1}, ..., b_D) are always 0. ♦

Figure 10.6 Optimal projection of a vector x ∈ R^2 onto a one-dimensional subspace (continuation from Figure 10.5). (a) Distances ‖x − x̃‖ for some x̃ ∈ U, see panel (b) for the setting. (b) The vector x̃ that minimizes the distance in panel (a) is its orthogonal projection onto U. The coordinate of the projection x̃ with respect to the basis vector b that spans U is the factor we need to scale b in order to “reach” x̃.
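The simplification in the remark on orthogonal projections, from B(B^⊤B)^{-1}B^⊤ to BB^⊤, can be checked with a short NumPy sketch. Here A is a generic (non-orthonormal) basis of a random subspace and B is an orthonormal basis of the same subspace obtained by QR; both are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(8)
D, M = 4, 2
A = rng.normal(size=(D, M))            # a generic, non-orthonormal basis of U
B, _ = np.linalg.qr(A)                 # an orthonormal basis of the same subspace

P_general = A @ np.linalg.inv(A.T @ A) @ A.T   # general projection matrix
P_onb = B @ B.T                                # simplified form, see (10.40)

x = rng.normal(size=D)
z = B.T @ x                            # M coordinates of the projection
x_tilde = B @ z                        # projected vector, still in R^D
```

Both matrices describe the same projection, and applying it twice changes nothing, as expected of a projection matrix.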

Basis of the Principal Subspace

We already determined the optimal coordinates of the projected data for a given ONB (b_1, ..., b_D) of R^D, only M of which were non-zero. What remains is to determine the basis vectors that span the principal subspace. Before we get started, let us briefly introduce the concept of an orthogonal complement.
Remark. (Orthogonal Complement) Consider a D-dimensional vector space V and an M-dimensional subspace U ⊆ V. Then its orthogonal complement U^⊥ is a (D − M)-dimensional subspace of V and contains all vectors in V that are orthogonal to every vector in U. Furthermore, every vector x ∈ V can be (uniquely) decomposed into

    x = \sum_{m=1}^{M} \lambda_m b_m + \sum_{j=1}^{D-M} \psi_j b_j^\perp , \quad \lambda_m, \psi_j \in \mathbb{R} ,    (10.41)

where (b_1, ..., b_M) is a basis of U and (b_1^⊥, ..., b_{D−M}^⊥) is a basis of U^⊥. ♦
To determine the basis vectors b_1, ..., b_M of the principal subspace, we rephrase the loss function (10.31) using the results we have so far. This will make it easier to find the basis vectors. To reformulate the loss function, we exploit our results from before and obtain

    \tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m \overset{(10.38)}{=} \sum_{m=1}^{M} (x_n^\top b_m) b_m .    (10.42)

We now exploit the symmetry of the dot product, which yields

    \tilde{x}_n = \Big( \sum_{m=1}^{M} b_m b_m^\top \Big) x_n .    (10.43)

Since we can generally write the original data point x_n as a linear combination of all basis vectors, we can also write

    x_n = \sum_{d=1}^{D} z_{dn} b_d \overset{(10.38)}{=} \sum_{d=1}^{D} (x_n^\top b_d) b_d = \Big( \sum_{d=1}^{D} b_d b_d^\top \Big) x_n    (10.44a)
        = \Big( \sum_{m=1}^{M} b_m b_m^\top \Big) x_n + \Big( \sum_{j=M+1}^{D} b_j b_j^\top \Big) x_n ,    (10.44b)

where we split the sum with D terms into a sum over M and a sum over D − M terms. With this result, we find that the displacement vector x_n − x̃_n, i.e., the difference vector between the original data point and its projection, is

    x_n - \tilde{x}_n = \Big( \sum_{j=M+1}^{D} b_j b_j^\top \Big) x_n    (10.45)
                      = \sum_{j=M+1}^{D} (x_n^\top b_j) b_j .    (10.46)

This means the difference is exactly the projection of the data point onto the orthogonal complement of the principal subspace: We identify the matrix \sum_{j=M+1}^{D} b_j b_j^\top in (10.45) as the projection matrix that performs this projection. This also means the displacement vector x_n − x̃_n lies in the subspace that is orthogonal to the principal subspace as illustrated in Figure 10.7.
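The decomposition in (10.44b) and the displacement formula (10.45) can be illustrated as follows. The orthonormal basis is again a random one from a QR decomposition and the data is synthetic; these choices and the variable names are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(9)
D, M, N = 4, 2, 50
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))   # ONB (b_1, ..., b_D) as columns
B, B_perp = Q[:, :M], Q[:, M:]                 # bases of U and of its complement

X = rng.normal(size=(D, N))
X_tilde = B @ B.T @ X                          # projections, see (10.43)
displacement = X - X_tilde                     # left-hand side of (10.45)
via_complement = B_perp @ B_perp.T @ X         # right-hand side of (10.45)
```

Because the two projection matrices sum to the identity, every data point splits into a part in the principal subspace and a part in its orthogonal complement.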
Remark (Low-Rank Approximation). In (10.43), we saw that the projection matrix, which projects x onto x̃, is given by

    \sum_{m=1}^{M} b_m b_m^\top = B B^\top .    (10.47)

By construction as a sum of rank-one matrices b_m b_m^⊤ we see that BB^⊤ is symmetric and has rank M. Therefore, the average reconstruction error can also be written as

    \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \|x_n - B B^\top x_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} \|(I - B B^\top) x_n\|^2 .    (10.48)

Finding orthonormal basis vectors b_1, ..., b_M so that the difference between the original data x_n and their projections x̃_n, n = 1, ..., N, is minimized is equivalent to finding the best rank-M approximation BB^⊤ of the identity matrix I, see Section 4.6. In this sense, PCA finds the best rank-M approximation of the identity matrix. ♦

Figure 10.7 Orthogonal projection and displacement vectors. When projecting data points x_n (blue) onto subspace U_1 we obtain x̃_n (orange). The displacement vector x̃_n − x_n lies completely in the orthogonal complement U_2 of U_1. (In the plot, the two subspaces are labeled U and U^⊥.)
Now, we have all the tools to reformulate the loss function (10.31).

    J = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 \overset{(10.46)}{=} \frac{1}{N} \sum_{n=1}^{N} \Big\| \sum_{j=M+1}^{D} (b_j^\top x_n) b_j \Big\|^2 .    (10.49)

We now explicitly compute the squared norm and exploit the fact that the b_j form an ONB, which yields

    J = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (b_j^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n b_j^\top x_n    (10.50a)
      = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n x_n^\top b_j ,    (10.50b)

where we exploited the symmetry of the dot product in the last step to write b_j^⊤ x_n = x_n^⊤ b_j. We can now swap the sums and obtain

    J = \sum_{j=M+1}^{D} b_j^\top \underbrace{\Big( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \Big)}_{=: S} b_j = \sum_{j=M+1}^{D} b_j^\top S b_j    (10.51a)
      = \sum_{j=M+1}^{D} \mathrm{tr}(b_j^\top S b_j) = \sum_{j=M+1}^{D} \mathrm{tr}(S b_j b_j^\top) = \mathrm{tr}\Big( \underbrace{\Big( \sum_{j=M+1}^{D} b_j b_j^\top \Big)}_{\text{projection matrix}} S \Big) ,    (10.51b)

where we exploited the property that the trace operator tr(·), see (4.16), is linear and invariant to cyclic permutations of its arguments. Since we assumed that our dataset is centered, i.e., E[X] = 0, we identify S as the data covariance matrix. We see that the projection matrix in (10.51b) is constructed as a sum of rank-one matrices b_j b_j^⊤ so that it itself is of rank D − M.

Equation (10.51a) implies that we can formulate the average squared reconstruction error equivalently in terms of the data covariance matrix, projected onto the orthogonal complement of the principal subspace. Minimizing the average squared reconstruction error is therefore equivalent to minimizing the variance of the data when projected onto the subspace we ignore, i.e., the orthogonal complement of the principal subspace. Equivalently, we maximize the variance of the projection that we retain in the principal subspace, which links the projection loss immediately to the maximum-variance formulation of PCA discussed in Section 10.2. But this then also means that we will obtain the same solution that we obtained for the maximum-variance perspective. Therefore, we skip the slightly lengthy derivation here and summarize the results from earlier in the light of the projection perspective.
average squared
The average squared reconstruction error, when projecting onto the M -
reconstruction error
is equivalent to dimensional principal subspace, is
maximizing the
variance of the
projected data.
$$J = \sum_{j=M+1}^{D} \lambda_j , \tag{10.52}$$

where $\lambda_j$ are the eigenvalues of the data covariance matrix. Therefore, to minimize (10.52) we need to select the smallest $D - M$ eigenvalues, which then implies that their corresponding eigenvectors form a basis of the orthogonal complement of the principal subspace. Consequently, a basis of the principal subspace is given by the eigenvectors $b_1, \dots, b_M$ that belong to the largest $M$ eigenvalues of the data covariance matrix.
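
The identity (10.52) can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic dataset (the data, the sizes `N`, `D`, `M` and all variable names are illustrative assumptions, not part of the text): it projects centered data onto the top-$M$ eigenvectors and compares the average squared reconstruction error with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 500, 5, 2   # illustrative sizes (assumption)

# Synthetic correlated data, one data point per column, then centered.
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

S = X @ X.T / N                                  # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)             # ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

B = eigvecs[:, :M]                               # basis of the principal subspace
X_tilde = B @ (B.T @ X)                          # projections B B^T x_n

# Average squared reconstruction error, left-hand side of (10.49).
J = np.mean(np.sum((X - X_tilde) ** 2, axis=0))
print(J, eigvals[M:].sum())                      # the two values should agree
```

Up to floating-point error, both printed numbers coincide, which is the statement of (10.52) for this particular dataset.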

Draft (2018-07-10) from Mathematics for Machine Learning. Errata and feedback to https://mml-book.com.

Example 10.2 (MNIST Digits Embedding)

Figure 10.8 Embedding of MNIST digits 0 (blue) and 1 (orange) in a two-dimensional principal subspace using PCA. Four example embeddings of the digits ‘0’ and ‘1’ in the principal subspace are highlighted in red with their corresponding original digit.

Figure 10.8 visualizes the training data of the MNIST digits ‘0’ and
‘1’ embedded in the vector subspace spanned by the first two principal
components. We can see a relatively clear separation between ‘0’s (blue
dots) and ‘1’s (orange dots), and we can see the variation within each
individual cluster.

10.4 Eigenvector Computation

In the previous sections, we obtained the basis of the principal subspace as the eigenvectors that belong to the largest eigenvalues of the data covariance matrix
\begin{align}
S &= \frac{1}{N}\sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top , \tag{10.53}\\
X &= [x_1, \dots, x_N] \in \mathbb{R}^{D \times N} . \tag{10.54}
\end{align}
To get the eigenvalues (and the corresponding eigenvectors) of $S$, we can follow two approaches, eigendecomposition or SVD:

• We perform an eigendecomposition (see Section 4.2) and compute the eigenvalues and eigenvectors of $S$ directly.
• We use a singular value decomposition (see Section 4.5). Since $S$ is symmetric and factorizes into $XX^\top$ (ignoring the factor $\frac{1}{N}$), the eigenvalues of $S$ are the squared singular values of $X$. More specifically, if the SVD of $X$ is given by
$$X = U \Sigma V^\top , \tag{10.55}$$
where $U \in \mathbb{R}^{D \times D}$ and $V^\top \in \mathbb{R}^{N \times N}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{D \times N}$ is a matrix whose only non-zero entries are the singular values $\sigma_{ii} \geq 0$, then it follows that
$$S = \frac{1}{N} XX^\top = \frac{1}{N} U \Sigma V^\top V \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top . \tag{10.56}$$
With the results from Section 4.5 we get that the columns of $U$ are the eigenvectors of $XX^\top$ (and therefore $S$). Furthermore, the eigenvalues of $S$ are related to the singular values of $X$ via
$$\lambda_i = \frac{\sigma_i^2}{N} . \tag{10.57}$$
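
The relation (10.57) between the two approaches can be illustrated with a short sketch. It assumes NumPy and random synthetic data (the data and variable names are assumptions for illustration only) and uses `np.linalg.eigh` and `np.linalg.svd`.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 4, 200                                  # illustrative sizes (assumption)
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)          # centered data, points as columns

# Approach 1: eigendecomposition of the covariance matrix S.
S = X @ X.T / N
lam = np.linalg.eigh(S)[0][::-1]               # eigenvalues, sorted descending

# Approach 2: SVD of the data matrix X (singular values come sorted descending).
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
lam_svd = sigma ** 2 / N                       # (10.57)

print(np.allclose(lam, lam_svd))               # eigenvalues agree
print(np.allclose(S @ U, U * lam_svd))         # columns of U are eigenvectors of S
```

Working with the SVD of $X$ avoids forming $XX^\top$ explicitly, which is one reason it is often preferred in practice.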

Practical aspects. Finding eigenvalues and eigenvectors is also important in other fundamental machine learning methods that require matrix decompositions. In theory, as we discussed in Section 4.2, we can solve for the eigenvalues as roots of the characteristic polynomial. However, for matrices larger than $4 \times 4$ this is not possible because we would need to find the roots of a polynomial of degree 5 or higher, and the Abel-Ruffini theorem (Ruffini, 1799; Abel, 1826) states that there exists no general algebraic solution to this problem for polynomials of degree 5 or more. Therefore, in practice, we solve for eigenvalues or singular values using iterative methods, which are implemented in all modern packages for linear algebra (e.g., np.linalg.eigh or np.linalg.svd).
In many applications (such as PCA presented in this chapter), we only require a few eigenvectors. It would be wasteful to compute the full decomposition and then discard all eigenvectors with eigenvalues that are beyond the first few. It turns out that if we are interested in only the first few eigenvectors (with the largest eigenvalues), iterative processes that directly optimize these eigenvectors are computationally more efficient than a full eigendecomposition (or SVD). In the extreme case of only needing the first eigenvector, a simple method called the power iteration is very efficient. Power iteration chooses a random vector $x_0$ that is not in the null space of $S$ and follows the iteration
$$x_{k+1} = \frac{S x_k}{\|S x_k\|} , \quad k = 0, 1, \dots . \tag{10.58}$$

This means the vector $x_k$ is multiplied by $S$ in every iteration and then normalized, i.e., we always have $\|x_k\| = 1$. This sequence of vectors converges to the eigenvector associated with the largest eigenvalue of $S$. The original Google PageRank algorithm (Page et al., 1999) uses such an algorithm for ranking web pages based on their hyperlinks.
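
The iteration (10.58) can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function name `power_iteration`, the stopping tolerance and the iteration cap are choices made here, not part of the text. It assumes $S$ is symmetric positive semidefinite with a unique largest eigenvalue.

```python
import numpy as np

def power_iteration(S, num_iters=1000, tol=1e-12, seed=0):
    """Approximate the leading eigenvector of a symmetric PSD matrix S via (10.58)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=S.shape[0])      # random start vector x_0
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        x_new = S @ x                    # multiply by S ...
        x_new /= np.linalg.norm(x_new)   # ... and normalize
        converged = np.linalg.norm(x_new - x) < tol
        x = x_new
        if converged:
            break
    # The Rayleigh quotient x^T S x estimates the largest eigenvalue (||x|| = 1).
    return x, x @ S @ x

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # eigenvalues 3 and 1
v, lam = power_iteration(S)
print(lam)                               # approximately 3.0
print(np.abs(v))                         # approximately [0.7071, 0.7071]
```

The convergence rate depends on the ratio of the two largest eigenvalues, so the method slows down when they are close.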


10.5 PCA Algorithm

In the following, we will go through the individual steps of PCA using a running example, which is summarized in Figure 10.9. We are given a two-dimensional data set (Figure 10.9(a)), and we want to use PCA to project it onto a one-dimensional subspace.

1. Mean subtraction. We start by centering the data by computing the mean $\mu$ of the dataset and subtracting it from every single data point. This ensures that the data set has mean 0 (Figure 10.9(b)). Mean subtraction is not strictly necessary but reduces the risk of numerical problems.
2. Standardization. Divide the data points by the standard deviation $\sigma_d$ of the dataset for every dimension $d = 1, \dots, D$. Now the data is unit free, and it has variance 1 along each axis, which is indicated by the two arrows in Figure 10.9(c). This step completes the standardization of the data.
3. Eigendecomposition of the covariance matrix. Compute the data covariance matrix and its eigenvalues and corresponding eigenvectors. In Figure 10.9(d), the eigenvectors are scaled by the magnitude of the corresponding eigenvalue. The longer vector spans the principal subspace, which we denote by $U$. The data covariance matrix is represented by the ellipse.
4. Projection. We can project any data point $x_* \in \mathbb{R}^D$ onto the principal subspace. To get this right, we need to standardize $x_*$ using the mean and standard deviation of the data set that we used to compute the data covariance matrix, i.e.,
$$x_*^{(d)} \leftarrow \frac{x_*^{(d)} - \mu^{(d)}}{\sigma_d} , \quad d = 1, \dots, D , \tag{10.59}$$
where $x_*^{(d)}$ is the $d$th component of $x_*$. Then, we obtain the projected data point as
$$\tilde{x}_* = BB^\top x_* \tag{10.60}$$
with coordinates $z_* = B^\top x_*$ with respect to the basis of the principal subspace. Here, $B$ is the matrix that contains the eigenvectors that belong to the largest eigenvalues of the data covariance matrix as columns.
5. Moving back to data space. To see our projection in the original data format (i.e., before standardization), we need to undo the standardization (10.59) and multiply by the standard deviation before adding the mean so that we obtain
$$\tilde{x}_*^{(d)} \leftarrow \tilde{x}_*^{(d)} \sigma_d + \mu^{(d)} , \quad d = 1, \dots, D , \tag{10.61}$$
where $\mu^{(d)}$ and $\sigma_d$ are the mean and standard deviation of the training data in the $d$th dimension, respectively. Figure 10.9(f) illustrates the projection in the original data format.

Figure 10.9 Steps of PCA. (a) Original dataset. (b) Step 1: Centering by subtracting the mean from each data point. (c) Step 2: Dividing by the standard deviation to make the data unit free. Data has variance 1 along each axis. (d) Step 3: Compute eigenvalues and eigenvectors (arrows) of the data covariance matrix (ellipse). (e) Step 4: Project data onto the subspace spanned by the eigenvectors belonging to the largest eigenvalues (principal subspace). (f) Step 5: Undo the standardization and move projected data back into the original data space from (a). All panels show $x_1$ (horizontal) against $x_2$ (vertical).
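
The five steps above can be sketched as follows. This is a minimal illustration assuming NumPy and data stored with one point per column; the function names `pca_fit` and `pca_project` and the synthetic two-dimensional dataset are assumptions made for this example.

```python
import numpy as np

def pca_fit(X, M):
    """Steps 1-3 for data X of shape (D, N), one data point per column."""
    mu = X.mean(axis=1, keepdims=True)          # step 1: mean
    sigma = X.std(axis=1, keepdims=True)        # step 2: per-dimension std
    Z = (X - mu) / sigma                        # standardized data
    S = Z @ Z.T / Z.shape[1]                    # step 3: data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:M]       # indices of the M largest eigenvalues
    return mu, sigma, eigvecs[:, order]         # B has the eigenvectors as columns

def pca_project(X_new, mu, sigma, B):
    """Steps 4-5 for points X_new of shape (D, K)."""
    X_std = (X_new - mu) / sigma                # (10.59)
    Z = B.T @ X_std                             # coordinates in the principal subspace
    X_tilde = B @ Z                             # (10.60)
    return Z, X_tilde * sigma + mu              # (10.61): back to data space

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, 2.0], [[3.0, 1.5], [1.5, 1.0]], size=300).T
mu, sigma, B = pca_fit(X, M=1)
Z, X_tilde = pca_project(X, mu, sigma, B)
print(Z.shape, X_tilde.shape)                   # (1, 300) (2, 300)
```

New points are standardized with the training mean and standard deviation, as required in step 4, rather than with their own statistics.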

Example 10.3 (MNIST Digits: Reconstruction)

In the following, we will apply PCA to the MNIST digits dataset (http://yann.lecun.com/exdb/mnist/), which contains 60,000 examples of handwritten digits 0–9. Each digit is an image of size $28 \times 28$, i.e., it contains 784 pixels so that we can interpret every image in this dataset as a vector $x \in \mathbb{R}^{784}$. Examples of these digits are shown in Figure 10.3. For illustration purposes, we apply PCA to a subset of the MNIST digits, and we focus on the digit ‘8’. We used 5,389 training images of the digit ‘8’ and determined the principal subspace as detailed in this chapter. We then used the learned projection matrix to reconstruct a set of test images, which is illustrated in Figure 10.10. The first row of Figure 10.10 shows a set of four original digits from the test set. The following rows show reconstructions of exactly these digits when using a principal subspace of dimensions 1, 10, 100, 500, respectively. We can see that even with a single-dimensional principal subspace we get a halfway decent reconstruction of the original digits, which, however, is blurry and generic. With an increasing number of principal components (PCs) the reconstructions become sharper and more details can be accounted for. With 500 principal components, we effectively obtain a near-perfect reconstruction. If we were to choose 784 PCs we would recover the exact digit without any compression loss.

Figure 10.10 Effect of increasing number of principal components on reconstruction. Rows from top to bottom: Original, PCs: 1, PCs: 10, PCs: 100, PCs: 500.

Figure 10.11 Average reconstruction error as a function of the number of principal components. Axes: number of PCs (horizontal), average reconstruction error (vertical).

Figure 10.11 shows the average squared reconstruction error, which, consistent with (10.52), is
$$\frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \sum_{d=M+1}^{D} \lambda_d , \tag{10.62}$$

as a function of the number of principal components. We can see that the importance of the principal components drops off rapidly, and only marginal gains can be achieved by adding more PCs. With about 550 PCs, we can essentially fully reconstruct the training data that contains the digit ‘8’.

10.6 PCA in High Dimensions

In order to do PCA, we need to compute the data covariance matrix. In $D$ dimensions, the data covariance matrix is a $D \times D$ matrix. Computing the eigenvalues and eigenvectors of this matrix is computationally expensive as it scales cubically in $D$. Therefore, PCA, as we discussed earlier, will be infeasible in very high dimensions. For example, if our $x_n$ are images with 10,000 pixels (e.g., $100 \times 100$ pixel images), we would need to compute the eigendecomposition of a $10{,}000 \times 10{,}000$ covariance matrix. In the following, we provide a solution to this problem for the case that we have substantially fewer data points than dimensions, i.e., $N \ll D$.
Assume we have a data set $x_1, \dots, x_N$, $x_n \in \mathbb{R}^D$. Assuming the data is centered, the data covariance matrix is given as
$$S = \frac{1}{N} XX^\top \in \mathbb{R}^{D \times D} , \tag{10.63}$$
where $X = [x_1, \dots, x_N]$ is a $D \times N$ matrix whose columns are the data points.
We now assume that $N \ll D$, i.e., the number of data points is smaller than the dimensionality of the data. Then the rank of the covariance matrix $S$ is at most $N$, so it has at least $D - N$ eigenvalues that are 0. Intuitively, this means that there are some redundancies.

In the following, we will exploit this and turn the $D \times D$ covariance matrix into an $N \times N$ covariance matrix whose eigenvalues are all greater than 0.
In PCA, we ended up with the eigenvector equation
$$S b_m = \lambda_m b_m , \quad m = 1, \dots, M , \tag{10.64}$$
where $b_m$ is a basis vector of the principal subspace. Let us re-write this equation a bit: With $S$ defined in (10.63), we obtain
$$S b_m = \frac{1}{N} XX^\top b_m = \lambda_m b_m . \tag{10.65}$$
We now multiply $X^\top \in \mathbb{R}^{N \times D}$ from the left-hand side, which yields
$$\frac{1}{N} \underbrace{X^\top X}_{N \times N} \underbrace{X^\top b_m}_{=:c_m} = \lambda_m X^\top b_m \iff \frac{1}{N} X^\top X c_m = \lambda_m c_m , \tag{10.66}$$

and we get a new eigenvector/eigenvalue equation: $\lambda_m$ is still the eigenvalue, but the eigenvector is now $c_m := X^\top b_m$ of the matrix $\frac{1}{N} X^\top X \in \mathbb{R}^{N \times N}$. Assuming we have no duplicate data points, this matrix has rank $N$ and is invertible. This also implies that $\frac{1}{N} X^\top X$ has the same (non-zero) eigenvalues as the data covariance matrix $S$. But this is now an $N \times N$ matrix, so that we can compute the eigenvalues and eigenvectors much more efficiently than for the original $D \times D$ data covariance matrix.
Now that we have the eigenvectors of $\frac{1}{N} X^\top X$, we are going to recover the original eigenvectors, which we still need for PCA. If we left-multiply our eigenvalue/eigenvector equation with $X$, we get
$$\underbrace{\frac{1}{N} XX^\top}_{S} X c_m = \lambda_m X c_m \tag{10.67}$$
and we recover the data covariance matrix again. This now also means that we recover $X c_m$ as an eigenvector of $S$.
Remark. If we want to apply the PCA algorithm that we discussed in Section 10.5 we need to normalize the eigenvectors $X c_m$ of $S$ so that they have norm 1. ♦
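
The procedure of this section can be sketched as follows. This is a minimal illustration assuming NumPy and random synthetic data with $N \ll D$; the sizes and variable names are assumptions made for this example. It solves the $N \times N$ eigenproblem (10.66), maps the eigenvectors back via (10.67), normalizes them as in the remark, and checks the eigenvector equation (10.64) without ever forming the $D \times D$ matrix $S$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, M = 1000, 20, 3                      # many dimensions, few data points
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)      # centered data, points as columns

K = X.T @ X / N                            # N x N matrix from (10.66)
lam, C = np.linalg.eigh(K)                 # ascending eigenvalues
lam, C = lam[::-1][:M], C[:, ::-1][:, :M]  # keep the M largest

B = X @ C                                  # (10.67): X c_m is an eigenvector of S
B /= np.linalg.norm(B, axis=0)             # normalize each column to norm 1

S_times_B = X @ (X.T @ B) / N              # S b_m computed without forming S
print(np.allclose(S_times_B, B * lam))     # eigenvector equation (10.64) holds
print(np.allclose(B.T @ B, np.eye(M)))     # columns are orthonormal
```

Here the eigendecomposition is performed on a $20 \times 20$ matrix instead of a $1000 \times 1000$ one.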

10.7 Latent Variable Perspective

In the previous sections, we derived PCA without any notion of a probabilistic model using the maximum-variance and the projection perspectives. On the one hand, this approach may be appealing as it allows us to sidestep all the mathematical difficulties that come with probability theory. On the other hand, a probabilistic model would offer us more flexibility and useful insights. More specifically, a probabilistic model would
• come with a likelihood function, and we can explicitly deal with noisy observations (which we did not even discuss earlier),
• allow us to do Bayesian model comparison via the marginal likelihood as discussed in Section 8.4,
• view PCA as a generative model, which allows us to simulate new data,
• allow us to make straightforward connections to related algorithms,
• deal with data dimensions that are missing at random by applying Bayes’ theorem,
• give us a notion of the novelty of a new data point,
• allow us to extend the model fairly straightforwardly, e.g., to a mixture of PCA models,
• have the PCA we derived in earlier sections as a special case,
• allow for a fully Bayesian treatment by marginalizing out the model parameters.
By introducing a continuous-valued latent variable $z \in \mathbb{R}^M$ it is possible to phrase PCA as a probabilistic latent-variable model. Tipping and Bishop (1999) proposed this latent-variable model as Probabilistic PCA (PPCA). PPCA addresses most of the issues above, and the PCA solution that we obtained by maximizing the variance in the projected space or by minimizing the reconstruction error is obtained as the special case of maximum likelihood estimation in a noise-free setting.

Figure 10.12 Graphical model for probabilistic PCA. The observations $x_n$ explicitly depend on corresponding latent variables $z_n \sim \mathcal{N}(0, I)$. The model parameters $B, \mu$ and the likelihood parameter $\sigma$ are shared across the dataset. (Nodes: $z_n$, $x_n$ inside a plate $n = 1, \dots, N$; parameters $B$, $\mu$, $\sigma$.)

10.7.1 Generative Process and Probabilistic Model

In PPCA, we explicitly write down the probabilistic model for linear dimensionality reduction. For this we assume a continuous latent variable $z \in \mathbb{R}^M$ with a standard-Normal prior $p(z) = \mathcal{N}(0, I)$ and a linear relationship between the latent variables and the observed data $x$ where
$$x = Bz + \mu + \epsilon \in \mathbb{R}^D , \tag{10.68}$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ is Gaussian observation noise, and $B \in \mathbb{R}^{D \times M}$ and $\mu \in \mathbb{R}^D$ describe the linear/affine mapping from latent to observed variables. Therefore, PPCA links latent and observed variables via
$$p(x \mid z, B, \mu, \sigma^2) = \mathcal{N}(x \mid Bz + \mu, \sigma^2 I) . \tag{10.69}$$
Overall, PPCA induces the following generative process:
\begin{align}
z_n &\sim \mathcal{N}(z \mid 0, I) \tag{10.70}\\
x_n \mid z_n &\sim \mathcal{N}(x \mid B z_n + \mu, \sigma^2 I) \tag{10.71}
\end{align}
To generate a data point that is typical given the model parameters, we follow an ancestral sampling scheme: We first sample a latent variable $z_n$ from $p(z)$. Then, we use $z_n$ in (10.69) to sample a data point conditioned on the sampled $z_n$, i.e., $x_n \sim p(x \mid z_n, B, \mu, \sigma^2)$.

This generative process allows us to write down the probabilistic model (i.e., the joint distribution of all random variables) as
$$p(x, z \mid B, \mu, \sigma^2) = p(x \mid z, B, \mu, \sigma^2)\, p(z) , \tag{10.72}$$
which immediately gives rise to the graphical model in Figure 10.12 using the results from Section 8.5.
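
Ancestral sampling from (10.70) and (10.71) can be sketched as follows. This is a minimal illustration assuming NumPy; the parameter values for `B`, `mu` and `sigma` are hypothetical and chosen only to make the example concrete.

```python
import numpy as np

rng = np.random.default_rng(3)
D, M, N = 3, 2, 200_000                    # illustrative sizes (assumption)

# Hypothetical model parameters.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                 # D x M
mu = np.array([1.0, -2.0, 0.5])            # D
sigma = 0.3                                # noise standard deviation

# Step 1: sample latent variables z_n ~ N(0, I), one per column (10.70).
Z = rng.normal(size=(M, N))

# Step 2: sample x_n | z_n ~ N(B z_n + mu, sigma^2 I), one per column (10.71).
X = B @ Z + mu[:, None] + sigma * rng.normal(size=(D, N))

print(X.shape)                             # (3, 200000)
```

Each column of `X` is one draw from the joint model (10.72) with the latent variable discarded.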
Remark. Note the direction of the arrow that connects the latent variables $z$ and the observed data $x$: The arrow points from $z$ to $x$, which means that the PPCA model assumes a lower-dimensional latent cause $z$ for high-dimensional observations $x$. In the end, we are obviously interested in finding something out about $z$ given some observations. To get there we will apply Bayesian inference to “invert” the arrow implicitly and go from observations to latent variables. ♦

Example 10.4 (Generating Data from Latent Variables)

Figure 10.13 Generating new MNIST digits. The latent variables $z$ can be used to generate new data $\tilde{x} = Bz$. The closer we stay to the training data the more realistic the generated data.

Figure 10.13 shows the latent coordinates of the MNIST digits ‘8’ found by PCA when using a two-dimensional principal subspace (blue dots). We can query any vector $z_*$ in this latent space and generate an image $\tilde{x}_* = B z_*$ that resembles the digit ‘8’. We show eight such generated images with their corresponding latent space representation. Depending on where we query the latent space, the generated images look different (shape, rotation, size, ...). If we query away from the training data, we see more and more artefacts, e.g., the top-left and top-right digits. Note that the intrinsic dimensionality of these generated images is only two.

10.7.2 Likelihood and Joint Distribution

Using the results from Chapter 6, we obtain the marginal distribution of the data $x$ by integrating out the latent variable $z$ so that
$$\begin{aligned}
p(x \mid B, \mu, \sigma^2) &= \int p(x \mid z, B, \mu, \sigma^2)\, p(z)\, dz \\
&= \int \mathcal{N}(x \mid Bz + \mu, \sigma^2 I)\, \mathcal{N}(z \mid 0, I)\, dz .
\end{aligned} \tag{10.73}$$


From Section 6.6, we know that the solution to this integral is a Gaussian distribution with mean
$$\mathbb{E}[x] = \mathbb{E}_z[Bz + \mu] + \mathbb{E}_\epsilon[\epsilon] = \mu \tag{10.74}$$
and with covariance matrix
$$\begin{aligned}
\mathbb{V}[x] &= \mathbb{V}_z[Bz + \mu] + \mathbb{V}_\epsilon[\epsilon] = \mathbb{V}_z[Bz] + \sigma^2 I \\
&= B\, \mathbb{V}_z[z]\, B^\top + \sigma^2 I = BB^\top + \sigma^2 I .
\end{aligned} \tag{10.75}$$
The marginal distribution in (10.73) is the PPCA likelihood, which we can use for maximum likelihood or MAP estimation of the model parameters.
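
As a minimal sketch of how this likelihood could be evaluated, the following function computes the log of the Gaussian marginal $\mathcal{N}(x \mid \mu, BB^\top + \sigma^2 I)$ summed over a dataset. It assumes NumPy and i.i.d. data stored as columns; the function name `ppca_log_likelihood` is an assumption made for this example.

```python
import numpy as np

def ppca_log_likelihood(X, B, mu, sigma2):
    """Sum of log N(x_n | mu, B B^T + sigma2 I) over the columns x_n of X."""
    D, N = X.shape
    C = B @ B.T + sigma2 * np.eye(D)               # marginal covariance (10.75)
    diff = X - mu[:, None]                         # x_n - mu, per column
    _, logdet = np.linalg.slogdet(C)               # log-determinant of C
    # Squared Mahalanobis distances (x_n - mu)^T C^{-1} (x_n - mu).
    maha = np.sum(diff * np.linalg.solve(C, diff), axis=0)
    return -0.5 * np.sum(D * np.log(2.0 * np.pi) + logdet + maha)

# Sanity check: with B = 0, mu = 0, sigma2 = 1 and x = 0 the density is a
# standard Gaussian evaluated at its mean.
value = ppca_log_likelihood(np.zeros((3, 1)), np.zeros((3, 2)), np.zeros(3), 1.0)
print(np.isclose(value, -1.5 * np.log(2.0 * np.pi)))
```

Using `slogdet` and `solve` instead of an explicit determinant and inverse is a numerical design choice, not something prescribed by the text.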
Remark. Although the conditional distribution in (10.69) is also a likelihood, we cannot use it for maximum likelihood estimation as it still depends on the latent variables. The likelihood function we require for maximum likelihood (or MAP) estimation should only be a function of the data $x$ and the model parameters, and must not depend on the latent variables. ♦
From Section 6.6 we also know that a Gaussian random variable $z$ and a linear/affine transformation $x = Bz$ of it are jointly Gaussian distributed. We already know the marginals $p(z) = \mathcal{N}(z \mid 0, I)$ and $p(x) = \mathcal{N}(x \mid \mu, BB^\top + \sigma^2 I)$. The missing cross-covariance is given as
$$\operatorname{Cov}[x, z] = \operatorname{Cov}_z[Bz + \mu, z] = B \operatorname{Cov}_z[z, z] = B . \tag{10.76}$$
Therefore, the probabilistic model of PPCA, i.e., the joint distribution of latent and observed random variables, is explicitly given by
$$p(x, z \mid B, \mu, \sigma^2) = \mathcal{N}\left( \begin{bmatrix} x \\ z \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ 0 \end{bmatrix}, \begin{bmatrix} BB^\top + \sigma^2 I & B \\ B^\top & I \end{bmatrix} \right) , \tag{10.77}$$
with a mean vector of length $D + M$ and a covariance matrix of size $(D + M) \times (D + M)$.

10.7.3 Posterior Distribution

The joint Gaussian distribution $p(x, z \mid B, \mu, \sigma^2)$ in (10.77) allows us to determine the posterior distribution $p(z \mid x)$ immediately by applying the rules of Gaussian conditioning from Section 6.6.1. The posterior distribution of the latent variable given an observation $x$ is then
\begin{align}
p(z \mid x) &= \mathcal{N}(z \mid m, C) , \tag{10.78a}\\
m &= B^\top (BB^\top + \sigma^2 I)^{-1} (x - \mu) , \tag{10.78b}\\
C &= I - B^\top (BB^\top + \sigma^2 I)^{-1} B . \tag{10.78c}
\end{align}
The posterior distribution in (10.78) reveals a few things. First, the posterior mean $m$ is effectively an orthogonal projection of the mean-centered data $x - \mu$ onto the vector subspace spanned by the columns of $B$. If we ignore the measurement noise contribution we immediately recover the projection equation (3.55) from Section 3.6. Second, the posterior covariance matrix $C$ does not directly depend on the observed data $x$.

If we now have a new observation $x_*$ in data space, we can use (10.78) to determine the posterior distribution of the corresponding latent variable $z_*$. The covariance matrix $C$ allows us to assess how confident the embedding is. A covariance matrix $C$ with a small determinant (which measures volumes) tells us that the latent embedding $z_*$ is fairly certain. If we obtain a posterior distribution $p(z_* \mid x_*)$ that is uncertain, we may be faced with an outlier. We can nevertheless explore this posterior distribution to understand what other data points $x$ are plausible under it. To do this, we exploit the generative process underlying PPCA, which allows us to generate new data that are plausible under the posterior over the latent variables. This can be achieved as follows:

1. Sample a latent variable $z_* \sim p(z \mid x_*)$ from the posterior distribution over the latent variables (10.78).
2. Sample a reconstructed vector $\tilde{x}_* \sim p(x \mid z_*, B, \mu, \sigma^2)$ from (10.69).

If we repeat this process many times, we can explore the posterior distribution (10.78) on the latent variables $z_*$ and its implications on the observed data. The sampling process effectively hypothesizes data that are plausible under the posterior distribution.
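
The posterior (10.78) and the two sampling steps above can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `ppca_posterior`, the parameter values and the observation `x_star` are hypothetical.

```python
import numpy as np

def ppca_posterior(x, B, mu, sigma2):
    """Posterior mean m and covariance C of z given x, following (10.78)."""
    D, M = B.shape
    A = B @ B.T + sigma2 * np.eye(D)
    m = B.T @ np.linalg.solve(A, x - mu)            # (10.78b)
    C = np.eye(M) - B.T @ np.linalg.solve(A, B)     # (10.78c)
    return m, 0.5 * (C + C.T)                       # symmetrize against round-off

rng = np.random.default_rng(4)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                          # hypothetical parameters
mu = np.array([1.0, -2.0, 0.5])
sigma2 = 0.09
x_star = np.array([2.0, -1.0, 2.0])                 # hypothetical new observation

m, C = ppca_posterior(x_star, B, mu, sigma2)

# Step 1: sample latent variables z* ~ p(z | x*), one per row.
z_samples = rng.multivariate_normal(m, C, size=5)
# Step 2: sample reconstructions x~* ~ p(x | z*, B, mu, sigma2), one per row.
noise = np.sqrt(sigma2) * rng.normal(size=(5, B.shape[0]))
x_samples = z_samples @ B.T + mu + noise

print(z_samples.shape, x_samples.shape)             # (5, 2) (5, 3)
```

Note that `C` is computed without reference to `x_star`, in line with the observation that the posterior covariance does not depend on the observed data.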

10.8 Further Reading

We derived PCA from two perspectives: a) maximizing the variance in the projected space; b) minimizing the average reconstruction error. However, PCA can also be interpreted from different perspectives. Let us re-cap what we have done: We took high-dimensional data $x \in \mathbb{R}^D$ and used a matrix $B^\top$ to find a lower-dimensional representation $z \in \mathbb{R}^M$. The columns of $B$ are the eigenvectors of the data covariance matrix $S$ that are associated with the largest eigenvalues. Once we have a low-dimensional representation $z$, we can get a high-dimensional version of it (in the original data space) as $x \approx \tilde{x} = Bz = BB^\top x \in \mathbb{R}^D$, where $BB^\top$ is a projection matrix.
We can also think of PCA as a linear auto-encoder as illustrated in Figure 10.14. An auto-encoder encodes the data $x_n \in \mathbb{R}^D$ to a code $z_n \in \mathbb{R}^M$ and tries to decode it to an $\tilde{x}_n$ similar to $x_n$. The mapping from the data to the code is called the encoder, and the mapping from the code back to the original data space is called the decoder. If we consider linear mappings where the code is given by $z_n = B^\top x_n \in \mathbb{R}^M$ and we are interested in minimizing the average squared error between the data $x_n$ and its reconstruction $\tilde{x}_n = B z_n$, $n = 1, \dots, N$, we obtain
$$\frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N}\sum_{n=1}^{N} \|x_n - BB^\top x_n\|^2 . \tag{10.79}$$

Figure 10.14 PCA can be viewed as a linear auto-encoder. It encodes the high-dimensional data $x$ into a lower-dimensional representation (code) $z \in \mathbb{R}^M$ and decodes $z$ using a decoder. The decoded vector $\tilde{x}$ is the orthogonal projection of the original data $x$ onto the $M$-dimensional principal subspace. (Diagram: original $x \in \mathbb{R}^D$, encoder $B^\top$, code $z \in \mathbb{R}^M$, decoder $B$, projection $\tilde{x} \in \mathbb{R}^D$.)
This means we end up with the same objective function as in (10.31) that we discussed in Section 10.3 so that we obtain the PCA solution when we minimize the squared auto-encoding loss. If we replace the linear mapping of PCA with a nonlinear mapping, we get a nonlinear auto-encoder. A prominent example of this is a deep auto-encoder where the linear functions are replaced with deep neural networks. In this context, the encoder is also known as recognition network or inference network, whereas the decoder is also called a generator.
Another interpretation of PCA is related to information theory. We can think of the code as a smaller or compressed version of the original data point. When we reconstruct our original data using the code, we do not get the exact data point back, but a slightly distorted or noisy version of it. This means that our compression is “lossy”. Intuitively we want to maximize the correlation between the original data and the lower-dimensional code. More formally, this is related to the mutual information. We would then get the same solution to PCA we discussed in Section 10.3 by maximizing the mutual information, a core concept in information theory (MacKay, 2003).
In our discussion on PPCA, we assumed that the parameters of the model, i.e., $B$, $\mu$ and the likelihood parameter $\sigma^2$, are known. Tipping and Bishop (1999) describe how to derive maximum likelihood estimates for these parameters in the PPCA setting (note that we use a different notation in this chapter). The maximum likelihood parameters, when projecting $D$-dimensional data onto an $M$-dimensional subspace, are given by
\begin{align}
\mu_{\mathrm{ML}} &= \frac{1}{N}\sum_{n=1}^{N} x_n , \tag{10.80}\\
B_{\mathrm{ML}} &= T(\Lambda - \sigma^2 I)^{\frac{1}{2}} , \tag{10.81}\\
\sigma^2_{\mathrm{ML}} &= \frac{1}{D - M}\sum_{j=M+1}^{D} \lambda_j , \tag{10.82}
\end{align}

where $T \in \mathbb{R}^{D \times M}$ contains $M$ eigenvectors of the data covariance matrix, and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_M) \in \mathbb{R}^{M \times M}$ is a diagonal matrix with the eigenvalues belonging to the principal axes on its diagonal. The maximum likelihood solution $B_{\mathrm{ML}}$ is unique up to an arbitrary rotation, i.e., we can right-multiply $B_{\mathrm{ML}}$ with any rotation matrix $R \in \mathbb{R}^{M \times M}$ so that (10.81) essentially is a singular value decomposition (see Section 4.5). An outline of the proof is given by Tipping and Bishop (1999).

The maximum likelihood estimate for $\mu$ given in (10.80) is the sample mean of the data. The maximum likelihood estimator for the observation noise variance $\sigma^2$ given in (10.82) is the average variance in the orthogonal complement of the principal subspace, i.e., the average leftover variance that we cannot capture with the first $M$ principal components is treated as observation noise.
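
The estimates (10.80) to (10.82) can be sketched as follows. This is a minimal illustration assuming NumPy, $M < D$, and the rotation chosen as the identity; the function name `ppca_ml` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def ppca_ml(X, M):
    """Maximum likelihood PPCA parameters for data X of shape (D, N), M < D."""
    D, N = X.shape
    mu = X.mean(axis=1)                              # (10.80)
    Xc = X - mu[:, None]
    S = Xc @ Xc.T / N                                # data covariance matrix
    lam, T = np.linalg.eigh(S)
    lam, T = lam[::-1], T[:, ::-1]                   # sort descending
    sigma2 = lam[M:].mean()                          # (10.82)
    scale = np.sqrt(np.maximum(lam[:M] - sigma2, 0.0))
    B = T[:, :M] * scale                             # (10.81): T (Lambda - sigma2 I)^(1/2)
    return mu, B, sigma2

# Synthetic data drawn from a PPCA model with hypothetical parameters.
rng = np.random.default_rng(5)
D, M, N = 4, 2, 5000
B_true = rng.normal(size=(D, M))
mu_true = np.array([0.5, -1.0, 2.0, 0.0])
X = B_true @ rng.normal(size=(M, N)) + mu_true[:, None] + 0.5 * rng.normal(size=(D, N))

mu_ml, B_ml, sigma2_ml = ppca_ml(X, M)
print(mu_ml.shape, B_ml.shape, sigma2_ml)
```

The model covariance $B_{\mathrm{ML}} B_{\mathrm{ML}}^\top + \sigma^2_{\mathrm{ML}} I$ then matches the data covariance on the principal subspace and replaces the remaining eigenvalues by their average.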
In the noise-free limit where $\sigma \to 0$, PPCA and PCA provide identical solutions: Since the data covariance matrix $S$ is symmetric, it can be diagonalized (see Section 4.4), i.e., there exists a matrix $T$ of eigenvectors of $S$ so that
$$S = T \Lambda T^{-1} . \tag{10.83}$$
In the PPCA model, the data covariance matrix is the covariance matrix of the likelihood $p(X \mid B, \mu, \sigma^2)$, which is $BB^\top + \sigma^2 I$, see (10.75). For $\sigma \to 0$, we obtain $BB^\top$ so that this data covariance must equal the PCA data covariance (and its factorization given in (10.83)) so that
$$\operatorname{Cov}[X] = T \Lambda T^{-1} = BB^\top \iff B = T \Lambda^{\frac{1}{2}} , \tag{10.84}$$
which is exactly the maximum likelihood estimate in (10.81) for $\sigma = 0$.
From (10.81) and (10.83) it becomes clear that (P)PCA performs a decomposition of the data covariance matrix. We can also think of PCA as a method for finding the best rank-$M$ approximation of the data covariance matrix so that we can apply the Eckart-Young Theorem from Section 4.6, which states that the optimal solution can be determined using a singular value decomposition.

In a streaming setting, where data arrives sequentially, it is recommended to use the iterative Expectation Maximization (EM) algorithm for maximum likelihood estimation (Roweis, 1998).
Similar to our discussion on linear regression in Chapter 9, we can place a prior distribution on the parameters of the model and integrate them out, thereby a) avoiding point estimates of the parameters and the issues that come with these point estimates (see Section 8.4) and b) allowing for an automatic selection of the appropriate dimensionality $M$ of the latent space. This Bayesian PCA, which was proposed by Bishop (1999), places a (hierarchical) prior $p(\mu, B, \sigma^2)$ on the model parameters. The generative process allows us to integrate the model parameters out instead of conditioning on them, which addresses overfitting issues. Since this integration is analytically intractable, Bishop (1999) proposes to use approximate inference methods, such as MCMC or variational inference. We refer to the work by Gilks et al. (1996) and Blei et al. (2017) for more details on these approximate inference techniques.

In PPCA, we considered the linear model $x_n = B z_n + \epsilon$ with
$p(z_n) = \mathcal{N}(0, I)$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$,
i.e., all observation dimensions are affected by the same amount of noise.
If we allow each observation dimension d to have a different variance
$\sigma_d^2$, we obtain factor analysis (FA) (Spearman, 1904; Bartholomew
et al., 2011). This means that FA gives the likelihood some more flexibility
than PPCA, but still forces the data to be explained by the model
parameters B, µ. (An overly flexible likelihood would be able to explain
more than just the noise.) However, FA no longer allows for a closed-form
maximum likelihood solution, so we need to use an iterative scheme, such as
the EM algorithm, to estimate the model parameters. While in PPCA all
stationary points are global optima, this no longer holds for FA. Compared
to PPCA, FA does not change if we scale the data, but it does return
different solutions if we rotate the data.
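
The difference between the two noise models can be sketched by sampling from both. The example below assumes a zero mean (µ = 0) and uses hypothetical values for B, for the shared PPCA variance, and for the per-dimension FA variances; none of these numbers come from the text. It checks that the empirical covariance is close to BB^T + σ²I for PPCA and to BB^T + diag(σ_1², ..., σ_D²) for FA.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 4, 2, 200_000

# Hypothetical loading matrix B (D x M), chosen only for illustration.
B = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0],
              [-0.5, 0.5]])

sigma2_ppca = 0.5                          # PPCA: one variance for all dimensions
psi_fa = np.array([0.1, 0.5, 1.0, 2.0])    # FA: one variance per dimension

def sample(noise_var):
    """Draw N samples x = B z + eps with z ~ N(0, I), eps ~ N(0, diag(noise_var))."""
    Z = rng.normal(size=(M, N))
    eps = rng.normal(size=(D, N)) * np.sqrt(noise_var)[:, None]
    return B @ Z + eps

X_ppca = sample(np.full(D, sigma2_ppca))
X_fa = sample(psi_fa)

# Empirical covariances (rows of X are observation dimensions).
cov_ppca = np.cov(X_ppca)
cov_fa = np.cov(X_fa)

# Model covariances implied by the two noise assumptions.
model_ppca = B @ B.T + sigma2_ppca * np.eye(D)
model_fa = B @ B.T + np.diag(psi_fa)

print(np.abs(cov_ppca - model_ppca).max())
print(np.abs(cov_fa - model_fa).max())
```

Only the diagonal of the covariance differs between the two models, which is where the additional flexibility of the FA likelihood enters.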

Independent Component Analysis (ICA) is also closely related to PCA.
Starting again with the model $x_n = B z_n + \epsilon$, we now change the
prior on $z_n$ to non-Gaussian distributions. ICA can be used for
blind-source separation. Imagine you are in a busy train station with many
people talking. Your ears play the role of microphones, and they linearly
mix different speech signals in the train station. The goal of blind-source
separation is to identify the constituent parts of the mixed signals. As
discussed above in the context of maximum likelihood estimation for PPCA,
the original PCA solution is invariant to any rotation. Therefore, PCA can
identify the best lower-dimensional subspace in which the signals live, but
not the signals themselves (Murphy, 2012). ICA addresses this issue by
requiring the prior distribution p(z) on the latent sources to be
non-Gaussian. We refer to the book by Murphy (2012) for more details on ICA.
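
The rotation ambiguity can be checked numerically. With a standard Gaussian prior on z, the marginal covariance of x is BB^T + σ²I, and replacing B by BR for any orthogonal matrix R leaves it unchanged. The matrix B, the rotation angle, and the noise variance below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 4, 2
B = rng.normal(size=(D, M))   # arbitrary loading matrix (assumption)
sigma2 = 0.1                  # arbitrary noise variance (assumption)

# A rotation of the two-dimensional latent space.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B_rot = B @ R

# Marginal covariance of x under a Gaussian prior z ~ N(0, I).
cov_original = B @ B.T + sigma2 * np.eye(D)
cov_rotated = B_rot @ B_rot.T + sigma2 * np.eye(D)

same_marginal = np.allclose(cov_original, cov_rotated)
same_loadings = np.allclose(B, B_rot)
print(same_marginal, same_loadings)
```

Two different loading matrices therefore describe the same Gaussian model, so the data alone cannot distinguish between them. A non-Gaussian prior is not invariant under such rotations, which is the property ICA relies on.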

PCA, factor analysis and ICA are three examples of dimensionality
reduction with linear models. Cunningham and Ghahramani (2015) provide
a broader survey of linear dimensionality reduction.

The (P)PCA model we discussed here allows for several important
extensions. In Section 10.6, we explained how to do PCA when the input
dimensionality D is significantly greater than the number N of data
points. By exploiting the insight that PCA can be performed by computing
(many) inner products, this idea can be pushed to the extreme by
considering infinite-dimensional features. The kernel trick is the basis of
kernel PCA and allows us to implicitly compute inner products between
infinite-dimensional features (Schölkopf et al., 1998; Schölkopf and Smola,
2002).
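
A minimal kernel PCA sketch, under stated assumptions, might look as follows. Here rows of X are data points (unlike the column convention used elsewhere in this chapter), the RBF kernel and its parameter `gamma` are arbitrary choices, and `rbf_kernel` and `kernel_pca_scores` are hypothetical helper names. The sketch only uses the N × N kernel matrix and never forms explicit features. As a sanity check, a linear kernel reproduces the ordinary PCA scores up to sign.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca_scores(K, M):
    """Projections of the training points onto the top-M kernel principal components."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    # Center the kernel matrix, i.e. center the (implicit) features.
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:M]
    lam, V = eigvals[idx], eigvecs[:, idx]
    # The score of point n on component j is sqrt(lambda_j) * V[n, j].
    return V * np.sqrt(np.maximum(lam, 0.0))

rng = np.random.default_rng(0)
N, D, M = 50, 3, 2
X = rng.normal(size=(N, D)) * np.array([3.0, 2.0, 1.0])

# Nonlinear embedding with an RBF kernel.
Z_rbf = kernel_pca_scores(rbf_kernel(X, gamma=0.1), M)

# Sanity check: a linear kernel recovers standard PCA scores up to sign.
Z_lin = kernel_pca_scores(X @ X.T, M)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_pca = U[:, :M] * s[:M]
print(np.allclose(np.abs(Z_lin), np.abs(Z_pca)))
```

The cost is governed by the N × N eigenproblem rather than by the feature dimension, which is what makes infinite-dimensional feature spaces tractable.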

There are nonlinear dimensionality reduction techniques that are derived
from PCA. The auto-encoder perspective of PCA that we discussed above can
be used to render PCA as a special case of a deep auto-encoder. In the deep
auto-encoder, both the encoder and the decoder are represented by
multi-layer feedforward neural networks, which themselves are nonlinear
mappings. If we set the activation functions in these neural networks to be
the identity, the model becomes equivalent to PCA. A different approach to
nonlinear dimensionality reduction is the Gaussian Process Latent Variable
Model (GP-LVM) proposed by Lawrence (2005). The GP-LVM starts off with the
latent-variable perspective that we used to derive PPCA and replaces the
linear relationship between the latent variables z and the observations x
with a Gaussian process (GP). Instead of estimating the parameters of the
mapping (as we do in PPCA), the GP-LVM marginalizes out the model
parameters and makes point estimates of the latent variables z. Similar to
Bayesian PCA, the Bayesian GP-LVM proposed by Titsias and Lawrence (2010)
maintains a distribution on the latent variables z and uses approximate
inference to integrate them out as well.

Draft (2018-07-10) from Mathematics for Machine Learning. Errata and feedback to https://mml-book.com.
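
The remark about identity activations can be illustrated with a small sketch. With identity activations, a multi-layer encoder and decoder collapse into a single matrix whose rank is at most the bottleneck width M, so its reconstruction error cannot be lower than that of PCA with M components. The layer sizes, random weights, and synthetic data below are assumptions for this example, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 6, 300, 2

# Synthetic, centered data (assumption: columns of X are data points).
X = rng.normal(size=(D, N)) * np.array([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])[:, None]
X = X - X.mean(axis=1, keepdims=True)

# Deep auto-encoder with identity activations and random (untrained) weights.
W1, W2 = rng.normal(size=(4, D)), rng.normal(size=(M, 4))   # encoder layers
W3, W4 = rng.normal(size=(4, M)), rng.normal(size=(D, 4))   # decoder layers

# Without nonlinearities the whole network is a single linear map A.
A = W4 @ W3 @ W2 @ W1
rank_A = np.linalg.matrix_rank(A)

# PCA reconstruction with the top-M left singular vectors of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
B = U[:, :M]
err_pca = np.linalg.norm(X - B @ B.T @ X) ** 2
err_linear_ae = np.linalg.norm(X - A @ X) ** 2

print(rank_A, err_pca, err_linear_ae)
```

Setting the encoder to B^T and the decoder to B attains the PCA error, so a linear auto-encoder trained to minimize the squared reconstruction error can at best match PCA.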

© 2018 Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong. To be published by Cambridge University Press.