Checklist
Chapter 1 Introduction
Chapter 2 Probability Distributions
Chapter 3 Linear Models for Regression
Chapter 4 Linear Models for Classification
Chapter 5 Neural Networks
Chapter 6 Kernel Methods
Chapter 7 Sparse Kernel Machines
Chapter 8 Graphical Models
Chapter 9 Mixture Models and EM
Chapter 10 Approximate Inference
Chapter 11 Sampling Methods
Chapter 12 Continuous Latent Variables
Chapter 13 Sequential Data
Chapter 14 Combining Models
Checklist
Frequentist vs. Bayesian treatment of the main models, and the corresponding solution technique:
- Linear regression: closed-form solution
- Logistic regression: IRLS; Bayesian logistic regression: Laplace approximation
- Neural networks (regression, classification): gradient descent; Bayesian neural networks: Laplace approximation
- Gaussian processes (regression, classification): closed-form solution for regression, Laplace approximation for classification
- Mixture models: EM; Bayesian mixture models: variational inference
- Probabilistic PCA: closed-form solution or EM
- Hidden Markov model: EM
- Linear dynamical system: EM
Bayesian treatment and the evidence approximation:
- Fully Bayesian: marginalize with respect to the hyper-parameters as well as the parameters; analytically intractable. For curve fitting with p(w|α) = N(w|0, α^{-1}I) and p(t|x, w, β) = N(t|y(x,w), β^{-1}), the predictive distribution is
  p(t|t) = ∫∫∫ p(t|w, β) p(w|t, α, β) p(α, β|t) dw dα dβ.
- Empirical Bayes / type-2 maximum likelihood / evidence approximation: set the hyper-parameters to the values α*, β* that maximize the marginal likelihood, then marginalize over w only:
  p(t|t) ≈ p(t|t, α*, β*) = ∫ p(t|w, β*) p(w|t, α*, β*) dw.
Optimization/approximation techniques:
- Linear / quadratic / convex optimization
- Lagrange multipliers
- Gradient descent
- Newton iteration
- Laplace approximation
- Expectation Maximization (EM) for latent variable models
- Variational inference
- Expectation Propagation (EP)
- MCMC / Gibbs sampling

Latent variable models: GMM, HMM, LDS.
Chapter 1 Introduction
1. Bayesian interpretation of probability
Probability is interpreted as a degree of belief, i.e. as a quantification of uncertainty. Cox showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. We can therefore use the machinery of probability theory to describe the uncertainty in model parameters.
2. Parameter estimation: frequentist vs. Bayesian
Frequentist: the model parameter w is a fixed unknown quantity, and an estimator (e.g. maximum likelihood) is derived from the likelihood.
Bayesian: w is a random variable with prior probability p(w); observing the data D converts the prior into the posterior via Bayes' theorem:
  p(w|D) = p(D|w)p(w) / p(D) = p(D|w)p(w) / ∫ p(D|w)p(w) dw.
Example: for classes C_1,...,C_k we start from the prior probabilities P(C_1),...,P(C_k); after observing x they are revised into the posteriors P(C_1|x),...,P(C_k|x).
3. Criticisms of each viewpoint
Bayesian: the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs (e.g. conjugate priors).
Frequentist: the over-fitting problem can be understood as a general property of maximum likelihood.
4. Controlling over-fitting (frequentist remedies)
1) Regularization: add a penalty term to the error function. An L2 regularizer gives ridge regression; an L1 regularizer gives lasso regression. The penalty acts as a shrinkage method: it reduces the values of the coefficients, as in the sketch below.
2) Cross-validation: use a validation set to select the model complexity.
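A minimal sketch (synthetic data and a polynomial design matrix are assumed, chosen only for illustration) of how the L2 penalty shrinks the coefficients of a least-squares fit as λ grows:

```python
# Ridge regression on a noisy sinusoid: larger lambda -> smaller coefficients.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)   # noisy targets

degree = 9
Phi = np.vander(x, degree + 1, increasing=True)              # polynomial design matrix

for lam in [0.0, 1e-4, 1e-1]:
    # Ridge solution: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
    w = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)
    print(f"lambda={lam:g}  max|w|={np.abs(w).max():.2f}")
```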
5. Bayesian methods and marginalization
Marginalization lies at the heart of Bayesian methods. In a fully Bayesian procedure, making predictions or comparing different models requires marginalizing (summing or integrating) over the whole of parameter space. When this marginalization cannot be done analytically, two families of approximations are used:
- sampling methods such as Markov chain Monte Carlo (Monte Carlo methods are flexible and apply to a wide range of models, but are computationally intensive and mainly used for small-scale problems);
- deterministic approximations such as variational Bayes and expectation propagation, which scale to large-scale applications.
6. Curve fitting: three levels of treatment
1) MLE: maximize the likelihood function with respect to w; a point estimate.
2) MAP (a "poor man's Bayes"): introduce a prior and maximize the posterior probability; with a Gaussian prior, w_MAP is equivalent to MLE with an L2 penalty added to the log likelihood; still a point estimate.
3) Fully Bayesian approach: apply the sum and product rules (the machinery for manipulating degrees of belief) to obtain the predictive distribution by marginalizing (summing or integrating) over the whole of parameter space:
  p(t | x, X, t) = ∫ p(t | x, w) p(w | X, t) dw,
where x is the new input and X, t are the training inputs and labels. Every value of w contributes to the prediction, weighted by its posterior probability; no point estimate of w is required. This is marginalization.
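For the conjugate Gaussian case the marginalization over w can be done exactly; the following is a minimal sketch of the resulting predictive distribution (the prior precision α, noise precision β, basis degree and data are illustrative assumptions, not values from the text):

```python
# Bayesian linear (polynomial) regression: posterior over w and predictive distribution.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, degree = 2.0, 25.0, 5

x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=beta ** -0.5, size=20)
Phi = np.vander(x, degree + 1, increasing=True)

# Posterior p(w | X, t) = N(m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(degree + 1) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive p(t* | x*, X, t): mean phi^T m_N, variance 1/beta + phi^T S_N phi
x_new = np.linspace(0.0, 1.0, 5)
phi_new = np.vander(x_new, degree + 1, increasing=True)
mean = phi_new @ m_N
var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi_new, S_N, phi_new)
for xv, m, v in zip(x_new, mean, var):
    print(f"x={xv:.2f}  mean={m:+.3f}  std={np.sqrt(v):.3f}")
```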
8. Inference vs. decision
Solving a classification problem can be broken into an inference stage and a decision stage: in the inference stage we determine the posterior probabilities, and in the decision stage we use these posterior probabilities to make optimal class assignments.
9. Three approaches to the decision problem
1) Discriminant function: map inputs x directly into decisions; the discriminant function combines the inference and decision stages into one.
2) Discriminative model: solve the inference problem of determining the posterior class probabilities P(C_k | x), then solve the decision problem of assigning each new x to a class.
3) Generative model: explicitly or implicitly model the distribution of inputs as well as outputs, i.e. the class-conditional densities P(x | C_k) and priors (or the joint distribution P(x, C_k)); obtain the posterior from Bayes' theorem and then solve the decision problem.
10. Relative merits
A generative model estimates the joint distribution merely in order to make a classification decision, which can be wasteful of computational resources and excessively demanding of data; when only the decision is required (e.g. a financial application that only needs a discriminant function), the simpler approaches suffice. On the other hand, having the posterior probabilities is useful, for example:
(1) to compensate for unbalanced class priors (e.g. X-ray screening where 99.9% of cases are normal);
(2) to combine independent sources of information: with an X-ray image x_I and blood data x_B that are conditionally independent given the class C_k,
  P(C_k | x_I, x_B) ∝ P(x_I, x_B | C_k) P(C_k) = P(x_I | C_k) P(x_B | C_k) P(C_k) ∝ P(C_k | x_I) P(C_k | x_B) / P(C_k).
11. Minimizing expected loss
The risk of taking action a_i given x is
  R(a_i | x) = Σ_{j=1}^{c} λ(a_i | C_j) P(C_j | x),
where λ(a_i | C_j) is the loss incurred by taking action a_i when the true class is C_j. A decision rule α(x) maps each x to an action, and its overall risk is
  R[α] = ∫ R(α(x) | x) p(x) dx,
which is minimized by choosing, for every x, the action that minimizes R(α(x) | x).
12. Entropy
Entropy is a lower bound on the number of bits needed to transmit the state of a random variable. For a random variable X with distribution p(x):
1) discrete case: H[X] = -Σ_x p(x) log_2 p(x);
2) continuous case (differential entropy): H[x] = -∫ p(x) log_2 p(x) dx; for a Gaussian with variance σ² the differential entropy is (1/2) ln(2πeσ²).
13. Conditional entropy, KL divergence, mutual information
Conditional entropy of a joint distribution p(X, Y):
  H[Y|X] = -∫∫ p(x, y) ln p(y|x) dx dy = ∫ H[Y | X = x] p(x) dx,
i.e. the average over X of the entropy of Y given X = x; it satisfies H[X, Y] = H[Y|X] + H[X].
KL divergence between p(x) and an approximating distribution q(x):
  KL(p‖q) = -∫ p(x) ln{q(x)/p(x)} dx ≥ 0,
with equality iff q(x) = p(x); it measures the additional information needed when q(x) is used in place of p(x), and it is not symmetric.
Mutual information between X and Y is the KL divergence between the joint p(x, y) and the product of marginals p(x)p(y); it vanishes iff X and Y are independent:
  I[X; Y] = KL(p(x, y) ‖ p(x)p(y)) = H[X] - H[X|Y] = H[Y] - H[Y|X].
14. See PRML Exercise 1.14.
15. See PRML Exercise 1.32: for a random vector x with differential entropy H[x] and a nonsingular linear transformation y = Ax, the entropy transforms as H[y] = H[x] + ln |det A|.
Feature extraction: a special form of dimensionality reduction; the input data are transformed into a reduced representation set of features (a feature vector). Example: real-time face detection in a high-resolution video stream, where processing the huge number of pixels per second directly is infeasible, so features are extracted to speed up computation. Methods: PCA, kernel PCA, manifold learning.
Feature selection: in statistics, the most popular form of feature selection is stepwise regression, a greedy algorithm that adds the best feature (or deletes the worst feature) at each round; the main control issue is deciding when to stop the iteration. In machine learning, this is typically done by cross-validation.
Unsupervised learning: the training data consist of a set of input vectors without any corresponding target values. Tasks include clustering, density estimation (determining the distribution of data within the input space) and visualization (projecting the data from a high-dimensional space down to two or three dimensions).
Chapter 2 Probability Distributions

1. The likelihood p(D|w)
Frequentist: w is a fixed quantity and p(D|w) is a function of w to be maximized.
Bayesian: w is a random variable; p(D|w) expresses how probable the observed data are for each setting of w and is combined with the prior through Bayes' theorem.
2. Conjugate prior: leads to a posterior distribution having the same functional form as the prior.
Distribution and its conjugate prior:
- Bernoulli: Beta distribution
- Multinomial: Dirichlet distribution
- Gaussian (mean, known precision): Gaussian distribution
- Gaussian (precision, known mean): Gamma distribution
- Gaussian (mean and precision): Gaussian-Gamma distribution
4. Conjugate prior for the multinomial distribution
Given a data set D of N observations of a multinomial variable with K states, the likelihood is
  p(D | μ) = Π_{k=1}^{K} μ_k^{m_k},
where m_k is the number of observations of state k. By inspection of this functional form, the conjugate prior must satisfy
  p(μ | α) ∝ Π_{k=1}^{K} μ_k^{α_k - 1},
with hyper-parameters α = (α_1, ..., α_K); normalizing gives the Dirichlet distribution
  Dir(μ | α) = Γ(α_1 + ... + α_K) / (Γ(α_1) ··· Γ(α_K)) Π_{k=1}^{K} μ_k^{α_k - 1}.
Multiplying the Dirichlet prior by the multinomial likelihood yields a posterior that is again a Dirichlet, i.e. it has the same functional form as the prior.
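A minimal sketch of this conjugate update (the prior parameters and counts are arbitrary illustrative values): the posterior Dirichlet parameters are simply the prior parameters plus the observed counts.

```python
# Dirichlet-multinomial conjugate update: posterior = Dir(alpha0 + counts).
import numpy as np

alpha0 = np.array([1.0, 1.0, 1.0])          # symmetric Dirichlet prior, K = 3
counts = np.array([12, 3, 5])               # m_k: observed counts of each state

alpha_post = alpha0 + counts                 # Dir(mu | alpha0 + m)
mean_post = alpha_post / alpha_post.sum()    # E[mu_k] under the posterior

print("posterior parameters:", alpha_post)
print("posterior mean of mu:", np.round(mean_post, 3))
```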
5. Mathematical background; sampling a Gaussian
The main tools are linear algebra, matrix theory and multivariate calculus. To draw a sample of a Gaussian random vector with covariance A, compute the Cholesky decomposition A = GG^T, then (1) draw z from a standard Gaussian and (2) return μ + Gz.
6. Multimodal distributions
A single standard distribution cannot represent a multimodal density; mixtures are more flexible. Two routes:
(1) introducing discrete latent variables, e.g. Gaussian mixture models;
(2) introducing continuous latent variables, e.g. linear dynamical systems.
7. Derivatives of vector- and matrix-valued functions
(1) f: R → R: the ordinary derivative.
(2) f: R^n → R: the gradient, ∂f/∂x = (∂f/∂x_1, ..., ∂f/∂x_n)^T.
(3) f: R^n → R^m: the Jacobian, (∂f/∂x)_{ij} = ∂f_i/∂x_j.
Useful identities: (∂(Ax)/∂x)_{ij} = A_{ij}, i.e. ∂(Ax)/∂x = A, and ∂(x^T A x)/∂x = (A + A^T)x.
8. Exponential family
  p(x | η) = h(x) g(η) exp{η^T u(x)},
where η is the natural parameter. Maximum likelihood gives the condition
  -∇ ln g(η_ML) = (1/N) Σ_{n=1}^{N} u(x_n) = E[u(x)],
so the solution depends on the data only through the sufficient-statistic vector Σ_n u(x_n).
9. Noninformative priors
The prior is intended to have as little influence on the posterior distribution as possible. Such priors often cannot be normalized (they are improper). Standard examples: translation-invariant priors (location parameters) and scale-invariant priors (scale parameters).
Density estimation from a local region: let R be a small region of volume V around x and draw N points from p(x). The probability mass of R is P ≈ p(x)·V; the number K of points falling in R is distributed as Bin(K | N, P), so for large N, K ≈ N·P. Combining these gives the estimate p(x) = K / (N·V).
Kernel density estimation. Define the Parzen window (hypercube kernel)
  k(u) = 1 if |u_i| ≤ 1/2 for all i = 1, ..., D, and 0 otherwise,
so that k((x - x_n)/h) = 1 exactly when x_n lies inside a hypercube of side h centred on x. The number of points inside the hypercube is
  K = Σ_{n=1}^{N} k((x - x_n)/h),
and substituting into p(x) = K/(N·V) with V = h^D gives the kernel density estimate. The hypercube kernel introduces artificial discontinuities; a smoother choice is the Gaussian kernel, which gives
  p(x) = (1/N) Σ_{n=1}^{N} (1/(2πh²)^{D/2}) exp{-‖x - x_n‖² / (2h²)}.
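A minimal one-dimensional sketch of the Gaussian kernel density estimate above (the bimodal sample data and bandwidth h are arbitrary choices for illustration):

```python
# Gaussian KDE: p(x) = (1/N) sum_n N(x | x_n, h^2) for 1-D data.
import numpy as np

rng = np.random.default_rng(2)
samples = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(1.0, 1.0, 200)])

def kde(x, data, h):
    # average of Gaussian kernels centred on the data points
    diff = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * diff ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-4, 4, 9)
print(np.round(kde(grid, samples, h=0.3), 3))
```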
k-nearest-neighbour density estimation fixes K instead of V: grow the region around x until it contains exactly K points and use its volume V in p(x) = K/(N·V). Applied to classification, kNN amounts to a MAP decision based on the estimated posteriors, i.e. assign x to the class that is most frequent among its K nearest neighbours.
Chapter 3 Linear Models for Regression

1. Sum-of-squares error, MLE and MAP
  E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}².
MLE: assuming the target t is y(x, w) plus Gaussian noise, maximizing the likelihood with respect to w is equivalent to minimizing E(w).
MAP: with a Gaussian prior on w, maximizing the posterior is equivalent to minimizing the sum of squares of the errors with a quadratic regularizer,
  E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}² + (λ/2) ‖w‖².
2. Linear models
"Linear" refers to linearity in the parameters, not necessarily in the input variables. The linear basis function model is
  y(x, w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x) = w^T φ(x).
3. Decision theory for regression
With squared loss L(t, y(x)) = {y(x) - t}², minimizing the expected loss E[L] with respect to y(x) gives
  y(x) = ∫ t p(t|x) dt = E_t[t|x],
the conditional expectation of t given x. Writing h(x) = E[t|x], the expected loss decomposes as
  E[L] = ∫ {y(x) - h(x)}² p(x) dx + ∫∫ {h(x) - t}² p(x, t) dx dt;
the second term arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss, independent of the choice of y(x).
4. Bias-variance decomposition
Averaging over data sets D, the first term decomposes further into bias² and variance:
  (bias)² = ∫ {E_D[y(x; D)] - h(x)}² p(x) dx,
  variance = ∫ E_D[{y(x; D) - E_D[y(x; D)]}²] p(x) dx,
  noise = ∫∫ {h(x) - t}² p(x, t) dx dt.
5. Bias-variance in practice
In a curve-fitting experiment with L data sets, each of N data points, fit a model y^{(l)}(x) to data set D^{(l)} and define the average model
  ȳ(x) = (1/L) Σ_{l=1}^{L} y^{(l)}(x).
Bias and variance are then estimated as
  (bias)² = (1/N) Σ_{n=1}^{N} {ȳ(x_n) - h(x_n)}²,
  variance = (1/N) Σ_{n=1}^{N} (1/L) Σ_{l=1}^{L} {y^{(l)}(x_n) - ȳ(x_n)}².
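A minimal simulation of these estimates, assuming a sinusoidal ground truth h(x) and ridge-regularized polynomial fits (all settings are illustrative assumptions):

```python
# Estimate bias^2 and variance by fitting the same model class to L independent data sets.
import numpy as np

rng = np.random.default_rng(3)
L, N, degree, lam = 100, 25, 9, 1e-3
x_test = np.linspace(0.0, 1.0, 50)
h = np.sin(2 * np.pi * x_test)                       # ground-truth function h(x)
Phi_test = np.vander(x_test, degree + 1, increasing=True)

preds = np.empty((L, x_test.size))
for l in range(L):
    x = rng.uniform(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
    Phi = np.vander(x, degree + 1, increasing=True)
    w = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)
    preds[l] = Phi_test @ w                          # y^(l)(x) on the test grid

y_bar = preds.mean(axis=0)                           # average model
bias2 = np.mean((y_bar - h) ** 2)
variance = np.mean((preds - y_bar) ** 2)
print(f"bias^2 = {bias2:.4f}   variance = {variance:.4f}")
```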
6. Bayesian model comparison
(1) The posterior over models is p(M_i | D) ∝ p(M_i) p(D | M_i), where p(M_i) allows us to express a preference for different models.
(2) p(D | M_i) is the model evidence, also called the marginal likelihood,
  p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw,
obtained by marginalizing over the parameters w.
(3) Model averaging vs. model selection: in model averaging the predictive distribution mixes the models, weighted by their posteriors,
  p(t | x, D) = Σ_i p(t | x, M_i, D) p(M_i | D),
whereas model selection simply uses the single most probable model.
7. Bayesian treatment and the evidence approximation
Fully Bayesian: marginalize with respect to the hyper-parameters as well as the parameters; analytically intractable. For curve fitting with p(w|α) = N(w|0, α^{-1}I) and p(t|x, w, β) = N(t|y(x,w), β^{-1}), the predictive distribution is
  p(t|t) = ∫∫∫ p(t|w, β) p(w|t, α, β) p(α, β|t) dw dα dβ.
Empirical Bayes / type-2 maximum likelihood / evidence approximation: set the hyper-parameters to the values α*, β* that maximize the marginal likelihood, then marginalize over w only:
  p(t|t) ≈ p(t|t, α*, β*) = ∫ p(t|w, β*) p(w|t, α*, β*) dw.
8. Marginalizing over a latent variable
  p(y|x) = ∫ p(y|x, z) p(z|x) dz,
since
  p(y|x) = p(x, y)/p(x) = ∫ p(x, y, z) dz / p(x) = ∫ p(y|x, z) p(z|x) p(x) dz / p(x) = ∫ p(y|x, z) p(z|x) dz.
Chapter 4 Linear Models for Classification

1. Basic concepts
Linearly separable: the classes can be separated exactly by a (D-1)-dimensional linear decision surface.
Coding scheme: 1-of-K binary coding; for class i the K-component target vector has its i-th element equal to 1 and the rest 0.
Feature vector: the D-dimensional input x is often replaced by a fixed nonlinear transformation φ(x), the feature vector.
2. Generalized linear model (GLM): an activation function acting on a linear function of the feature variables,
  y(x) = f(w^T φ(x) + w_0),
where φ(x) is the feature vector (possibly x itself) and f is the activation function (its inverse is called the link function).
(1) If f is a nonlinear function, the GLM is a classification model: the decision surfaces are still linear in φ(x), i.e. w^T φ(x) + w_0 = const, but because of f the model is no longer linear in the parameters w (in contrast to polynomial regression, where the polynomial terms act as basis functions and y remains linear in w).
(2) If f is the identity function, the GLM reduces to a linear regression model.
Examples of classification GLMs:
(1) logistic regression, where f is the logistic sigmoid;
(2) probit regression, where the activation function f is the probit function.
3. Three approaches to classification (recap)
(1) Discriminant function: map x directly to a class label.
(2) Generative model: model the joint distribution p(x, C_k), or equivalently the class-conditional distributions p(x | C_k) together with the priors, and obtain the posterior p(C_k | x) from Bayes' theorem.
(3) Discriminative model: model the posterior p(C_k | x) directly, e.g. with the GLM
  p(C_k | x) = f(w^T φ(x) + w_0),
whose parameters are determined from the training data.
4. Discriminant functions
K-class linear discriminant: assign x to class k if y_k(x) > y_j(x) for all j ≠ k; the boundary between classes k and j is y_k(x) = y_j(x), i.e. (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0. The resulting decision regions are convex, which avoids the ambiguous regions produced by combining several two-class discriminants.
Fisher's linear discriminant: project the D-dimensional input down to 1 dimension and choose the projection that maximizes the Fisher criterion (ratio of between-class to within-class variance); classify by thresholding the projected value.
Perceptron: y(x) = f(w^T φ(x) + w_0) with f a step function; targets are t_n = +1 for class 1 and t_n = -1 for class 2. A correctly classified x_n satisfies w^T φ(x_n) t_n > 0 and a misclassified one satisfies w^T φ(x_n) t_n < 0, so the perceptron criterion sums -w^T φ(x_n) t_n over the misclassified patterns:
  E_P(w) = -Σ_{n ∈ M} w^T φ(x_n) t_n.
5. Generative models for classification
A generative model fits the class-conditional distributions of the inputs, whereas a discriminative model only fits what is needed to make the decision. For two classes:
(1) The posterior can be written as
  p(C_1 | x) = p(x|C_1)p(C_1) / (p(x|C_1)p(C_1) + p(x|C_2)p(C_2)) = σ(a) = 1/(1 + exp(-a)),
  with a = ln[ p(x|C_1)p(C_1) / (p(x|C_2)p(C_2)) ].
(2) If the class-conditional densities are Gaussian with a shared covariance, p(x | C_k) = N(x | μ_k, Σ), then
  p(C_1 | x) = σ(w^T x + w_0),
  with w = Σ^{-1}(μ_1 - μ_2) and w_0 = -(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln(p(C_1)/p(C_2)).
(3) If the two classes have different covariance matrices, the quadratic terms in x no longer cancel and the argument of the sigmoid becomes a quadratic rather than a linear function w^T x + w_0 of x.
(4) MLE: write p(C_1) = π and p(C_2) = 1 - π, and let {x_n, t_n} with t_n = 1 when x_n ∈ C_1 and t_n = 0 otherwise. Then
  p(x_n, C_1) = π N(x_n | μ_1, Σ),  p(x_n, C_2) = (1 - π) N(x_n | μ_2, Σ),
and the likelihood is
  p(t | π, μ_1, μ_2, Σ) = Π_{n=1}^{N} [π N(x_n | μ_1, Σ)]^{t_n} [(1 - π) N(x_n | μ_2, Σ)]^{1-t_n}.
Maximizing it gives closed-form estimates of π, μ_1, μ_2 and Σ. Substituting these class-conditional estimates into (2) shows that the resulting posterior has exactly the GLM (logistic) form, which motivates modelling the posterior directly with a logistic GLM.
6. Discriminative model: logistic regression
Fitting the posterior directly requires far fewer adaptive parameters than the generative approach, whose parameter count grows quadratically with D.
The logistic sigmoid σ(a) = 1/(1 + exp(-a)) is a smoothed step function; the softmax
  p(C_k | φ) = exp(a_k) / Σ_j exp(a_j),  a_k = w_k^T φ,
is a smoothed max function and generalizes the logistic sigmoid to K classes.
Two-class logistic regression: given training data {φ_n, t_n} with D-dimensional feature vectors φ_n and t_n ∈ {0, 1}, the likelihood is
  p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1-t_n},  y_n = p(C_1 | φ_n) = σ(a_n),  a_n = w^T φ_n.
The negative logarithm of the likelihood is the cross-entropy error function
  E(w) = -ln p(t | w) = -Σ_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) },
with gradient
  ∇E(w) = Σ_{n=1}^{N} (y_n - t_n) φ_n.
Setting the gradient to zero gives no closed-form solution, because y_n depends nonlinearly on w through the logistic function. However, E(w) is convex and has a unique minimum, which can be found by gradient descent or, more efficiently, by IRLS (iteratively reweighted least squares), i.e. Newton iteration using the Hessian of E(w):
  w^{(new)} = w^{(old)} - H^{-1} ∇E(w).
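A minimal NumPy sketch of IRLS for two-class logistic regression (the Gaussian-blob data and the small jitter added to the Hessian are illustrative assumptions, not part of the text):

```python
# Logistic regression trained by IRLS: grad = Phi^T (y - t), H = Phi^T R Phi, R = diag(y(1-y)).
import numpy as np

rng = np.random.default_rng(4)
N = 200
X = np.vstack([rng.normal([-1, -1], 1.0, (N // 2, 2)),
               rng.normal([+1, +1], 1.0, (N // 2, 2))])
t = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])
Phi = np.hstack([np.ones((N, 1)), X])                # bias feature + inputs

w = np.zeros(Phi.shape[1])
for _ in range(10):                                  # a few Newton steps usually suffice
    y = 1.0 / (1.0 + np.exp(-Phi @ w))               # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)
    R = y * (1.0 - y)
    H = Phi.T @ (Phi * R[:, None]) + 1e-6 * np.eye(Phi.shape[1])   # jitter for stability
    w = w - np.linalg.solve(H, grad)

acc = np.mean((Phi @ w > 0) == (t == 1))
print("weights:", np.round(w, 3), " training accuracy:", acc)
```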
7. Laplace approximation
Approximate a distribution p(z) = (1/Z) f(z), where f(z) can be evaluated but the normalizer Z is unknown, by a Gaussian q(z) centred on a mode z_0 of f(z). A second-order Taylor expansion of ln f(z) around the mode gives
  ln f(z) ≈ ln f(z_0) - (A/2)(z - z_0)²,  A = - d²/dz² ln f(z) |_{z = z_0},
so that
  f(z) ≈ f(z_0) exp{-(A/2)(z - z_0)²}
and the normalized Gaussian approximation is
  q(z) = (A/2π)^{1/2} exp{-(A/2)(z - z_0)²},
which is then used in place of p(z).
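A minimal one-dimensional sketch of the recipe above: find the mode by Newton's method, then read off A from the curvature (the unnormalized Gamma-like density f(z) = z^{a-1} exp(-bz) and the values of a, b are illustrative assumptions):

```python
# Laplace approximation: q(z) = N(z | z0, 1/A), A = -(ln f)''(z0).
import numpy as np

a, b = 5.0, 2.0
d1 = lambda z: (a - 1) / z - b            # first derivative of ln f(z)
d2 = lambda z: -(a - 1) / z ** 2          # second derivative of ln f(z)

z0 = 1.0
for _ in range(20):                        # Newton iterations for the mode
    z0 = z0 - d1(z0) / d2(z0)

A = -d2(z0)
print(f"mode z0 = {z0:.3f} (exact (a-1)/b = {(a-1)/b:.3f}),  q(z) = N({z0:.3f}, {1/A:.3f})")
```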
8. Bayesian logistic regression: predictive distribution
  p(C_1 | φ, t) = ∫ p(C_1 | φ, w) p(w | t) dw = ∫ σ(w^T φ) p(w | t) dw,
a Bayesian prediction in which the posterior p(w | t) is replaced by its Gaussian (Laplace) approximation.
9. The likelihood function
Wikipedia: the likelihood of a set of parameter values given some observed outcomes is equal to the probability of those observed outcomes given those parameter values.
Generative model: for {x_n, t_n} with t_n = 1 when x_n ∈ C_1 and t_n = 0 otherwise,
  p(x_n, C_1) = π N(x_n | μ_1, Σ),  p(x_n, C_2) = (1 - π) N(x_n | μ_2, Σ),
  p(t, X | π, μ_1, μ_2, Σ) = Π_{n=1}^{N} [π N(x_n | μ_1, Σ)]^{t_n} [(1 - π) N(x_n | μ_2, Σ)]^{1-t_n};
the observed outcomes are the pairs (x_n, t_n), so the factors are joint densities.
Discriminative model (logistic regression):
  p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1-t_n},  y_n = p(C_1 | φ_n) = σ(a_n),  a_n = w^T φ_n;
here the inputs x_n are treated as given, the observed outcomes are only the labels t_n, and each factor is the conditional probability p(t_n | x_n) built from the posterior y_n = p(C_1 | φ_n).
Chapter 5 Neural Networks

1. From generalized linear models to neural networks
  y(x) = f(w^T φ(x) + w_0),
where φ(x) is the feature vector and f the activation function (its inverse is the link function).
(1) If f is a nonlinear function, the GLM is a classification model; y(x) is interpreted as a posterior probability, so this is a discriminative model. With f the logistic sigmoid we obtain logistic regression and the cross-entropy error function.
(2) If f is the identity function, the GLM is a regression model; the Gaussian-noise probabilistic interpretation of y(x) leads to the sum-of-squares error function.
2. Neural networks
A neural network makes the basis functions themselves adaptive: each of the M hidden units is itself a generalized linear model, a nonlinear function of a linear combination of the inputs.
3. Error functions for neural networks
For regression, the sum-of-squares error
  E(w) = (1/2) Σ_{n=1}^{N} ‖y(x_n, w) - t_n‖².
For classification (1-of-K coding), the cross-entropy error
  E(w) = -Σ_{n=1}^{N} Σ_{k=1}^{K} t_{nk} ln y_k(x_n, w),
where t_{nk} = 1 if x_n belongs to class k and 0 otherwise, and y_k(x_n, w) = p(t_k = 1 | x_n). In both cases the error decomposes over data points, E(w) = Σ_n E_n(w).
4. Gradient descent training
Given the observation data, minimize E(w) with respect to w by gradient descent.
(1) Batch gradient descent:
  w^{(τ+1)} = w^{(τ)} - η ∇E(w^{(τ)}),
with learning rate η > 0; ∇E is the batch error-function gradient, computed from the whole data set.
(2) On-line gradient descent (sequential or stochastic gradient descent):
  w^{(τ+1)} = w^{(τ)} - η ∇E_n(w^{(τ)}),
updating on one observation at a time, chosen either by cycling through the data or by random selection with replacement.
5. Error backpropagation
Forward propagation: for hidden unit j,
  a_j = Σ_i w_{ji} z_i,  z_j = h(a_j),
where the z_i are the activations of the previous layer (the inputs for the first hidden layer). Define the error signal
  δ_j ≡ ∂E_n/∂a_j.
Then for any weight
  ∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji}) = δ_j z_i.
For output units, δ_k follows directly from the error function (e.g. δ_k = y_k - t_k for sum-of-squares or cross-entropy with the canonical output activation). For hidden units, the chain rule gives the backpropagation formula
  δ_j = ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j) = h'(a_j) Σ_k w_{kj} δ_k,
since a_k = Σ_j w_{kj} h(a_j) implies ∂a_k/∂a_j = w_{kj} h'(a_j). The δ's are therefore propagated backwards from the output layer through the hidden layers, and all derivatives ∂E_n/∂w_{ji} are obtained at a cost that is O(W) in the number of weights.
Training procedure:
(1) initialize w;
(2) for the current w, run forward propagation and backpropagation to obtain ∇E(w);
(3) update w by gradient descent;
(4) repeat (2)-(3) until convergence.
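A minimal NumPy sketch of the forward and backward passes above for a network with one tanh hidden layer and a linear output, trained by batch gradient descent (the toy regression data, layer width and learning rate are illustrative assumptions):

```python
# One-hidden-layer MLP trained by backpropagation on a toy regression task.
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, (100, 1))
T = np.sin(np.pi * X) + 0.05 * rng.standard_normal((100, 1))

M, eta = 8, 0.05
W1 = rng.standard_normal((1, M)) * 0.5; b1 = np.zeros(M)
W2 = rng.standard_normal((M, 1)) * 0.5; b2 = np.zeros(1)

for epoch in range(2000):
    # forward propagation: a_j = sum_i w_ji z_i,  z_j = h(a_j)
    A1 = X @ W1 + b1
    Z1 = np.tanh(A1)
    Y = Z1 @ W2 + b2
    # backpropagation: output deltas, then delta_j = h'(a_j) sum_k w_kj delta_k
    d2 = (Y - T) / len(X)                 # delta_k for sum-of-squares error
    d1 = (1.0 - Z1 ** 2) * (d2 @ W2.T)    # h'(a) = 1 - tanh(a)^2
    # gradients dEn/dw_ji = delta_j z_i, then gradient-descent update
    W2 -= eta * Z1.T @ d2; b2 -= eta * d2.sum(axis=0)
    W1 -= eta * X.T @ d1;  b1 -= eta * d1.sum(axis=0)

print("final training error:", float(0.5 * np.sum((Y - T) ** 2)))
```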
6. Optimization techniques (overview)
- Linear and quadratic programming, convex optimization
- Combinatorial optimization
- Probabilistic optimization: genetic algorithms, simulated annealing, particle swarm optimization, ant colony optimization
- Calculus of variations
- Numerical optimization: gradient descent, conjugate gradient, Newton's method
7. Jacobian and Hessian
The Jacobian J_{ki} = ∂y_k/∂x_i measures the sensitivity of the network outputs to the inputs and can be computed by a backpropagation-style error propagation started from the output layer. The Hessian is the matrix of second derivatives of the error (for f: R^n → R); the product v^T H = v^T ∇(∇E(w)) can be evaluated in O(W) operations without forming H explicitly.
9. Regularization of neural networks (frequentist methods)
(1) Weight decay: as in regression, add a penalty on w. A plain quadratic regularizer is not consistent with the linear-transformation invariance of the network mapping (rescaling or shifting the inputs can be absorbed into the first-layer weights, e.g. w_{ji} → (1/a) w_{ji}, w_{j0} → w_{j0} - (b/a) Σ_i w_{ji}); a consistent, equivalent regularizer treats the weight groups separately,
  (λ_1/2) Σ_{w ∈ W_1} w² + (λ_2/2) Σ_{w ∈ W_2} w².
(2) Early stopping: monitor the error on a validation set and stop training when it starts to rise.
10. Invariances
(1) Data replication / augmentation: add transformed copies of the training inputs with unchanged labels, so the network learns the invariance; this modifies the data.
(2) Regularization that penalizes changes in the model output when the input is transformed, e.g. tangent propagation; this modifies the error function.
(3) Pre-processing: feature extraction of transformation-invariant features.
(4) Build the invariance properties into the structure of the network, e.g. convolutional neural networks, which give the model a transformation-invariant structure.
11. Bayesian neural networks
The network output is y(x, w). Place a Gaussian prior over the weights,
  p(w | α) = N(w | 0, α^{-1} I),
and a Gaussian conditional distribution for the target given input x,
  p(t | x, w, β) = N(t | y(x, w), β^{-1}).
The posterior is
  p(w | D, α, β) ∝ p(w | α) p(D | w, β) = p(w | α) Π_{n=1}^{N} N(t_n | y(x_n, w), β^{-1}),
which is non-Gaussian because y(x, w) is nonlinear in w; it is approximated by a Gaussian q(w | D) via the Laplace approximation. The predictive distribution for a new input x,
  p(t | x, D) = ∫ p(t | x, w) p(w | D, α, β) dw,
is then evaluated using q(w | D), and the hyper-parameters α, β are set by maximizing the evidence.
Chapter 6 Kernel Methods

1. Modelling: optimization vs. approximation
Frequentist (ML) methods reduce learning to optimization: a point estimate of the parameters is produced by an estimator, so the key computational step is optimization. Bayesian methods avoid point estimates and marginalize over the parameters; since the marginalization rarely has an analytical solution, the key step is approximation, either analytical (Taylor expansion, Laplace approximation, variational Bayes) or sampling-based.
2. Kernels
A kernel corresponds to an inner product in a (possibly nonlinear) feature space: for a feature-space mapping φ(x) of the input x,
  k(x, x') = φ(x)^T φ(x').
Special cases: stationary kernels k(x, x') = k(x - x') and radial basis kernels k(x, x') = k(‖x - x'‖).
Ways to construct kernels:
- choose a feature mapping φ and form the inner product;
- construct k directly and verify it is a valid kernel (the Gram matrix must be positive semidefinite);
- build new kernels from existing kernels (sums, products, composition with functions; the validity of the Gaussian kernel can be shown this way);
- define kernels over non-vectorial objects, e.g. for sets A_1, A_2: k(A_1, A_2) = 2^{|A_1 ∩ A_2|}, which is an inner product in a suitable feature (vector) space.
3. Kernels from generative models
Generative vs. discriminative: a generative model can deal naturally with missing data, and an HMM can handle sequences of varying length, whereas a discriminative model generally gives better performance on discriminative tasks. One way to combine the strengths of both is to use a generative model to define a kernel and then use that kernel in a discriminative model. Examples:
- k(x, x') = Σ_i p(x | i) p(x' | i) p(i), summing over a latent variable i;
- the Fisher kernel: for a parametric generative model p(x | θ) with parameter vector θ, define the Fisher score g(θ, x) = ∇_θ ln p(x | θ) (a vector, even for scalar x) and build the kernel from inner products of the score vectors.
4. Gaussian processes
A Gaussian process is a distribution over functions y(x) such that for any finite set of points x_1, ..., x_N the values (y(x_1), ..., y(x_N)) have a joint Gaussian distribution. The prior mean is usually taken to be 0, and the process is specified by its covariance function,
  E[y(x_n) y(x_m)] = k(x_n, x_m).
5. Gaussian processes for regression
Add observation noise to the function values y(x_n): p(t_n | y_n) = N(t_n | y_n, β^{-1}), so for N observations
  p(t | y) = N(t | y, β^{-1} I_N),
where t = (t_1, ..., t_N)^T, y = (y_1, ..., y_N)^T and I_N is the N×N identity. The Gaussian process prior gives
  p(y) = N(y | 0, K),
with K the kernel (Gram) matrix. Hence p(t) is Gaussian with covariance K + β^{-1} I_N, and the predictive distribution p(t_{N+1} | t) has an analytical (Gaussian) solution. The dominant cost is inverting an N×N matrix, O(N³) for training; each prediction then costs O(N) for the mean and O(N²) for the variance.
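A minimal NumPy sketch of GP regression with a squared-exponential kernel (the kernel hyper-parameters, noise precision and toy data are illustrative assumptions):

```python
# GP regression: mean k*^T (K + I/beta)^{-1} t, variance c - k*^T (K + I/beta)^{-1} k*.
import numpy as np

def kernel(a, b, theta=1.0, ell=0.3):
    return theta * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(6)
beta = 25.0
x = rng.uniform(0, 1, 12)
t = np.sin(2 * np.pi * x) + rng.normal(scale=beta ** -0.5, size=12)

C_N = kernel(x, x) + np.eye(len(x)) / beta        # K + beta^{-1} I
alpha = np.linalg.solve(C_N, t)

x_star = np.linspace(0, 1, 5)
k_star = kernel(x, x_star)                        # N x N*
mean = k_star.T @ alpha
var = (kernel(x_star, x_star).diagonal() + 1.0 / beta
       - np.einsum('ij,ji->i', k_star.T, np.linalg.solve(C_N, k_star)))
print(np.round(mean, 3), np.round(np.sqrt(var), 3))
```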
6. Learning the kernel hyper-parameters of a Gaussian process
Rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data, e.g.
  k(x_n, x_m) = θ_0 exp{-(θ_1/2) ‖x_n - x_m‖²} + θ_2 + θ_3 x_n^T x_m.
Since p(t | θ) is Gaussian, the log marginal likelihood ln p(t | θ) can be written down and maximized with respect to θ (type-2 maximum likelihood); alternatively a prior p(θ) can be introduced and θ marginalized.
7. Gaussian processes for classification
For a binary target, p(t | a) = σ(a)^t (1 - σ(a))^{1-t}, where a(x) is given a Gaussian process prior,
  p(a_{N+1}) = N(a_{N+1} | 0, C_{N+1}),
with covariance built from the kernel plus a jitter term,
  C(x_n, x_m) = k(x_n, x_m) + ν δ_{nm},  ν > 0,
which ensures C is positive definite. The predictive distribution is
  p(t_{N+1} = 1 | t_N) = ∫ p(t_{N+1} = 1 | a_{N+1}) p(a_{N+1} | t_N) da_{N+1} = ∫ σ(a_{N+1}) p(a_{N+1} | t_N) da_{N+1},
with
  p(a_{N+1} | t_N) = ∫ p(a_{N+1} | a_N) p(a_N | t_N) da_N.
The posterior p(a_N | t_N) is non-Gaussian and is approximated, e.g. by the Laplace approximation; the kernel (model) parameters are then set by maximizing the approximate marginal likelihood.
8. Useful Gaussian identities
Frequently needed results: the marginal and conditional distributions of a joint Gaussian, the convolution of two Gaussians, and the convolution of a logistic sigmoid with a Gaussian (evaluated by approximating the logistic with the probit function, i.e. the Gaussian CDF, whose convolution with a Gaussian is available in closed form).
Convolution of two Gaussians, by marginalization: introduce x and integrate it out,
  ∫ N(y - x; 0, Σ_1) N(x; μ_2, Σ_2) dx = (N(0, Σ_1) * N(μ_2, Σ_2))(y) = N(y; μ_2, Σ_1 + Σ_2).
Equivalently, via characteristic functions: the characteristic function of N(μ, Σ) is exp{i t^T μ - (1/2) t^T Σ t}, so the product of the characteristic functions of N(μ_1, Σ_1) and N(μ_2, Σ_2) is exp{i t^T (μ_1 + μ_2) - (1/2) t^T (Σ_1 + Σ_2) t}, the characteristic function of N(μ_1 + μ_2, Σ_1 + Σ_2); hence the convolution N(μ_1, Σ_1) * N(μ_2, Σ_2) = N(μ_1 + μ_2, Σ_1 + Σ_2).
Chapter 7 Sparse Kernel Machines

2. SVM overview
The SVM is a discriminant function: it maps inputs directly to a class label and provides no posterior probabilities (unlike the RVM, which is a discriminative probabilistic model). The SVM is a sparse model: predictions depend only on a subset of the training data, the support vectors, whereas a Gaussian process needs all the training data to make a prediction. SVMs are used for classification, regression and novelty detection.
3. SVM modelling (linearly separable case)
Given N data points {x_n, t_n} with t_n ∈ {-1, +1}:
(1) Model: y(x) = w^T φ(x) + b, with decision boundary y(x) = 0. If the data are linearly separable, every point satisfies t_n y(x_n) > 0, and its distance from the boundary is t_n y(x_n)/‖w‖. The SVM chooses the boundary that maximizes the margin, i.e. the distance to the closest point:
  argmax_{w,b} { (1/‖w‖) min_n [ t_n y(x_n) ] }  s.t.  t_n y(x_n) > 0, n = 1, ..., N.
(2) Canonical form: the margin is unchanged under the rescaling w → κw, b → κb, so the scale can be fixed by requiring t_n y(x_n) = 1 for the closest points; the constraints then become t_n y(x_n) ≥ 1 and maximizing 1/‖w‖ is equivalent to
  argmin_{w,b} (1/2)‖w‖²  s.t.  t_n y(x_n) ≥ 1, n = 1, ..., N,
a quadratic programming problem with inequality constraints.
(3) Dual problem: introduce the Lagrange function
  L(w, b, a) = (1/2)‖w‖² - Σ_{n=1}^{N} a_n { t_n (w^T φ(x_n) + b) - 1 },
with Lagrange multipliers a_n ≥ 0. Setting the derivatives of L(w, b, a) with respect to w and b to zero gives
  w = Σ_n a_n t_n φ(x_n),  Σ_n a_n t_n = 0,
and substituting back yields the dual problem in a alone:
  maximize  L̃(a) = Σ_n a_n - (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m)
  s.t.  a_n ≥ 0, n = 1, ..., N,  Σ_n a_n t_n = 0.
Solving for a gives w = Σ_n a_n t_n φ(x_n) and hence y(x) = w^T φ(x) + b.
(4) KKT conditions and prediction: the solution satisfies
  a_n ≥ 0,  t_n y(x_n) - 1 ≥ 0,  a_n { t_n y(x_n) - 1 } = 0,
so for every data point either a_n = 0 or t_n y(x_n) = 1. Prediction uses
  y(x) = Σ_n a_n t_n k(x, x_n) + b,
to which only the points with a_n ≠ 0 (the support vectors) contribute.
4. Overlapping classes (soft margin)
Allow margin violations with slack variables ξ_n ≥ 0 and minimize C Σ_n ξ_n + (1/2)‖w‖². The dual Lagrangian has the same form,
  L̃(a) = Σ_n a_n - (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m),
but with box constraints 0 ≤ a_n ≤ C, n = 1, ..., N, and Σ_n a_n t_n = 0. Points with a_n = 0 do not contribute to the prediction; points with a_n > 0 are support vectors. If 0 < a_n < C then ξ_n = 0 and x_n lies exactly on the margin; if a_n = C then x_n lies inside the margin and may be correctly classified or misclassified.
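A minimal sketch of this sparsity in practice, assuming scikit-learn is available (the two-blob data, kernel and C value are illustrative assumptions): only the support vectors are retained for prediction.

```python
# Soft-margin kernel SVM: the fitted model stores only the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = np.vstack([rng.normal([-1, -1], 0.7, (100, 2)), rng.normal([1, 1], 0.7, (100, 2))])
t = np.array([-1] * 100 + [+1] * 100)

clf = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X, t)
print("training points:", len(X), " support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, t))
```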
5. Multiclass SVM
How best to extend the SVM to more than two classes remains, in practice, an open problem; heuristic combinations of two-class SVMs are commonly used.
6. Single-class SVM
An unsupervised learning problem related to probability density estimation: rather than modelling the full density of the data, the goal is to find a smooth boundary enclosing a region of high density, chosen so that a fixed fraction of the data (between 0 and 1) falls outside the region. The computation is carried out in feature space.
7. SVM for regression
A first attempt requires every target to lie within ε of the prediction,
  min (1/2)‖w‖²  s.t.  |y(x_n) - t_n| < ε, n = 1, ..., N,
the analogue of hard-margin modelling: (1) a solution need not exist, and (2) controlling complexity/overfitting is difficult. Instead, use the ε-insensitive error function (the ε-tube)
  E_ε(y(x) - t) = 0 if |y(x) - t| < ε, and |y(x) - t| - ε otherwise,
and solve (optimization problem II)
  min C Σ_{n=1}^{N} E_ε(y(x_n) - t_n) + (1/2)‖w‖²,  C > 0.
Introducing slack variables ξ_n ≥ 0 and ξ̂_n ≥ 0 for the two sides of the tube,
  t_n ≤ y(x_n) + ε + ξ_n,  t_n ≥ y(x_n) - ε - ξ̂_n,
gives the equivalent optimization problem III:
  min C Σ_{n=1}^{N} (ξ_n + ξ̂_n) + (1/2)‖w‖²
  s.t.  t_n ≤ y(x_n) + ε + ξ_n,  t_n ≥ y(x_n) - ε - ξ̂_n,  ξ_n ≥ 0,  ξ̂_n ≥ 0,  n = 1, ..., N,
solved by introducing Lagrange multipliers for the constraints and using the KKT conditions. Problems II and III are equivalent: at the optimum ξ_n (resp. ξ̂_n) equals the ε-insensitive error on the corresponding side and is 0 when the point lies inside the tube, so the penalty terms of II and III coincide.
8. Relevance vector machine (RVM)
For regression,
  y(x, w) = w^T φ(x),  p(t | x, w, β) = N(t | y(x, w), β^{-1}),
with a separate precision hyper-parameter for each weight,
  p(w | α) = Π_{i=1}^{M} N(w_i | 0, α_i^{-1}).
(With a single shared precision, p(w | α) = N(w | 0, α^{-1} I), this would be ordinary Bayesian linear regression; the individual α_i are the key to the sparsity of the RVM.)
As a Bayesian model, the RVM would ideally marginalize over everything to obtain the predictive distribution; in practice the hyper-parameters are set by maximizing the marginal likelihood. Given N inputs X = {x_n} and targets t = (t_1, ..., t_N)^T,
  p(t | X, α, β) = ∫ p(t | X, w, β) p(w | α) dw,
where p(t | X, w, β) is the likelihood; this is maximized with respect to α and β, giving α*, β*. Many α_i are driven to infinity, so the corresponding weights concentrate at zero and the associated basis functions are pruned, producing a sparse model. For a new x the predictive distribution is
  p(t | x, X, t, α*, β*) = ∫ p(t | x, w, β*) p(w | X, t, α*, β*) dw,
where the posterior p(w | X, t, α*, β*) over w is Gaussian.
For classification, y(x, w) = σ(w^T φ(x)) = p(C_1 | x, w), again with prior p(w | α) = Π_i N(w_i | 0, α_i^{-1}), and the Laplace approximation is used.
9. Model comparison: computational cost
- SVM (frequentist sparse model): roughly O(N²) training (approximate)
- Gaussian process: O(N³) training, O(N²) per predictive distribution
- Frequentist linear regression: O(M²(N + M))
- Bayesian linear regression: O(M²(N + M)) training, O(M²) per predictive distribution
- RVM (Bayesian sparse model)
- Frequentist logistic regression / Bayesian logistic regression
- Frequentist neural network / Bayesian neural network
where N = number of training data points, M = number of basis functions, V = number of support vectors.
Chapter 8 Graphical Models

1. Axes for characterizing models
Parametric vs. non-parametric; frequentist vs. Bayesian; discriminative vs. generative.
2. Joint distribution and graph
A PGM associates a joint distribution P with a graph G. Three questions arise:
(1) Which conditional-independence statements can be read off G? For directed graphs they are given by d-separation, for undirected graphs by graph separation; this defines the set I(G), with I_{d-sep}(G) = I(G) and I_{u-sep}(G) = I(G).
(2) Do the statements in I(G) hold in P?
(3) Does G capture all independencies that hold in P?
For (2) and (3): in general we only require I(G) ⊆ I(P), i.e. every independence asserted by G holds in P, in which case G is called an I-map of P; P may satisfy additional independencies that G does not express.
3. Factorization and I-maps
Given a distribution P and a graph G, G is an I-map of P if I(G) ⊆ I(P).
Factorization ↔ I-map:
- Bayesian network: P factorizes over G ⇔ G is an I-map of P.
- Markov network: if P is a Gibbs distribution that factorizes over G, then G is an I-map of P; conversely, if P is a positive distribution and G is an I-map of P, then P is a Gibbs distribution that factorizes over G (Hammersley-Clifford).
Factorization over a Bayesian network G: p(x) = Π_k p(x_k | pa_k), a product over the K variables of the conditional of each variable given its parents.
Factorization over a Markov network G:
  p(x) = (1/Z) Π_C ψ_C(X_C),  ψ_C(X_C) ≥ 0,
a normalized product of potential functions over the cliques C of G.
Conditional independence and factorization are thus two equivalent views of the graph structure.
4. From a joint distribution to a graph
(1) Given P, find a graph G that is an I-map of P. The complete graph is a trivial I-map (it asserts no independencies, I(G) = ∅), so we look for a minimal I-map: an I-map from which no edge can be removed without destroying the I-map property.
(2) Constructing a minimal I-map over a set of random variables:
- Markov network: for every pair x_i, x_j, test whether x_i ⊥ x_j | x \ {x_i, x_j} holds in the joint distribution, and add an edge when it does not; this requires on the order of C(K, 2) tests for K random variables.
- Bayesian network: fix an ordering of the variables; for each x_i find a minimal subset of {x_1, ..., x_{i-1}} such that p(x_i | x_1, ..., x_{i-1}) equals p(x_i given that subset), and make it the parent set of x_i. Different orderings give different minimal I-maps.
5. Local and global independencies of a graph
Bayesian network G:
- Local independencies I_ℓ(G): each x_k is independent of its non-descendants given its parents pa_k.
- Global independencies I(G): all statements implied by d-separation.
- The two are equivalent: I_ℓ(G) and I(G) impose the same constraints, and both hold in every joint distribution P that factorizes over G.
Markov network G:
- Pairwise independence I_p(G): non-adjacent x_i, x_j are independent given all remaining variables.
- Local (Markov-blanket) independence I_ℓ(G): each x_k is independent of the rest given its Markov blanket (its neighbours).
- Global independence I(G): X ⊥ Y | Z whenever Z separates X from Y in G, i.e. every path from X to Y passes through Z.
- As sets of statements I_p(G) ⊆ I_ℓ(G) ⊆ I(G); a distribution satisfying the global property satisfies the others, and for a positive distribution P the three properties are equivalent.
6. Why probabilistic graphical models
Working directly with the joint distribution of many random variables is infeasible: the direct representation of the joint distribution grows exponentially with the number of variables. A PGM encodes the joint distribution compactly through a graph, and inference (computing marginals and conditionals) can then be carried out by working on the graph.
Chapter 9 Mixture Models and EM

1. The EM algorithm in general
With observed data X, latent variables Z and parameters θ, the observed-data likelihood is
  p(X | θ) = Σ_Z p(X, Z | θ);
maximizing p(X | θ) directly is hard, whereas the complete-data likelihood p(X, Z | θ) is easy to work with, which motivates EM. For any distribution q(Z) over the latent variables,
  ln p(X | θ) = L(q, θ) + KL(q ‖ p),
where
  L(q, θ) = Σ_Z q(Z) ln{ p(X, Z | θ) / q(Z) },
  KL(q ‖ p) = -Σ_Z q(Z) ln{ p(Z | X, θ) / q(Z) } ≥ 0,
so L(q, θ) is a lower bound on ln p(X | θ).
EM alternates:
(1) E step: with θ fixed at θ^{old}, set q(Z) = p(Z | X, θ^{old}); this makes KL(q‖p) = 0, so the bound touches ln p(X | θ^{old}).
(2) M step: with q(Z) fixed, maximize L(q, θ) with respect to θ to obtain θ^{new}.
The E step leaves ln p(X | θ) unchanged while raising the bound to meet it, and the M step increases the bound, so ln p(X | θ^{new}) ≥ ln p(X | θ^{old}); EM increases the likelihood at every iteration until it reaches a (local) maximum.
In the M step, substituting q(Z) = p(Z | X, θ^{old}) into L(q, θ) gives
  L(q, θ) = Σ_Z p(Z | X, θ^{old}) ln p(X, Z | θ) - Σ_Z p(Z | X, θ^{old}) ln p(Z | X, θ^{old}) = Q(θ, θ^{old}) + const,
where
  Q(θ, θ^{old}) = Σ_Z p(Z | X, θ^{old}) ln p(X, Z | θ)
is the expectation of the complete-data log likelihood ln p(X, Z | θ) under the posterior over Z. So the E step evaluates what is needed to form Q(θ, θ^{old}), and the M step maximizes Q(θ, θ^{old}) with respect to θ.
2. Gaussian mixture model (GMM) and EM
  p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k),
which is not a member of the exponential family. Introduce a latent 1-of-K random vector z with
  p(z) = Π_{k=1}^{K} π_k^{z_k},  0 ≤ π_k ≤ 1,  Σ_k π_k = 1,
  p(x | z_k = 1) = N(x | μ_k, Σ_k),  i.e.  p(x | z) = Π_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k},
so that
  p(x) = Σ_z p(z) p(x | z) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k),
with parameters π = {π_k}, μ = {μ_k}, Σ = {Σ_k}.
EM maximizes p(X | π, μ, Σ). The observed-data log likelihood,
  ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln{ Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) },
contains a log of a sum, whereas the complete-data likelihood
  p(X, Z | π, μ, Σ) = Π_{n=1}^{N} Π_{k=1}^{K} π_k^{z_{nk}} N(x_n | μ_k, Σ_k)^{z_{nk}}
has a log that is a simple sum,
  ln p(X, Z | π, μ, Σ) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_{nk} { ln π_k + ln N(x_n | μ_k, Σ_k) }.
The expected complete-data log likelihood under the posterior over Z is
  E_Z[ln p(X, Z | π, μ, Σ)] = Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) { ln π_k + ln N(x_n | μ_k, Σ_k) },
where the responsibilities are
  γ(z_{nk}) = E[z_{nk}] = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j).
M step: maximizing the expected complete-data log likelihood with respect to π, μ, Σ gives
  μ_k = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) x_n,
  Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) (x_n - μ_k)(x_n - μ_k)^T,
  π_k = N_k / N,  with  N_k = Σ_{n=1}^{N} γ(z_{nk}).
E step: given the current parameters π^{old}, μ^{old}, Σ^{old}, evaluate the responsibilities
  γ(z_{nk}) = p(z_{nk} = 1 | x_n) = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j),
the posterior probability that component k generated x_n. EM alternates the two steps until the parameters (or the log likelihood) converge.
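A minimal NumPy sketch of these E and M steps for a one-dimensional mixture of two Gaussians (the synthetic data and the initialization are illustrative assumptions):

```python
# EM for a 1-D Gaussian mixture: E step computes responsibilities, M step re-estimates parameters.
import numpy as np

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(1.5, 1.0, 250)])
K, N = 2, len(x)

pi = np.full(K, 1.0 / K)
mu = rng.choice(x, K)
var = np.full(K, x.var())

def normal(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for it in range(100):
    # E step: gamma_nk = pi_k N(x_n | mu_k, var_k) / sum_j pi_j N(x_n | mu_j, var_j)
    dens = pi[None, :] * normal(x[:, None], mu[None, :], var[None, :])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate mixing coefficients, means and variances
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
    pi = Nk / N

print("pi :", np.round(pi, 3))
print("mu :", np.round(mu, 3))
print("var:", np.round(var, 3))
```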
3. GMM and k-means
k-means can be derived as a limiting case of EM for a GMM. Clustering with k-means makes a hard assignment of data points to clusters, whereas the GMM makes a soft assignment through the responsibilities γ(z_{nk}). Take shared isotropic covariances εI; then
  γ(z_{nk}) = π_k exp{-‖x_n - μ_k‖²/(2ε)} / Σ_{j=1}^{K} π_j exp{-‖x_n - μ_j‖²/(2ε)}.
As ε → 0, the term with the smallest ‖x_n - μ_{j*}‖² dominates the denominator, so
  lim_{ε→0} γ(z_{nk}) = 1 if k = j* and 0 otherwise,
i.e. a hard assignment; the EM update for μ_k then reduces to the k-means update (the mean of the points assigned to cluster k). In this limit the expected complete-data log likelihood becomes
  E_Z[ln p(X, Z | μ, Σ, π)] → -(1/2ε) Σ_{n=1}^{N} Σ_{k=1}^{K} r_{nk} ‖x_n - μ_k‖² + const,
so maximizing it is equivalent to minimizing the k-means distortion measure.
Chapter 10 Approximate Inference

1. Variational inference
In the fully Bayesian setting all unknowns (the parameters as well as the latent variables) are random variables, collected into Z; the observed data are X. As in EM, for any q(Z),
  ln p(X) = L(q) + KL(q ‖ p),
with
  L(q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ,
  KL(q ‖ p) = -∫ q(Z) ln{ p(Z | X) / q(Z) } dZ.
If q(Z) could be set equal to the true posterior p(Z | X), the KL term would vanish; but p(Z | X) is intractable, so q(Z) is restricted to a tractable family and KL(q‖p) is minimized within that family, which (since ln p(X) does not depend on q) is the same as maximizing the lower bound L(q), built from the tractable joint distribution p(X, Z). The family must be tractable yet flexible.
2. Factorized (mean-field) approximation
  q(Z) = Π_{i=1}^{M} q_i(Z_i),
where the Z_i are disjoint groups of variables. Maximizing L(q) with respect to one factor q_j while holding the others fixed:
  L(q) = ∫ Π_i q_i { ln p(X, Z) - Σ_i ln q_i } dZ
       = ∫ q_j ln p̃(X, Z_j) dZ_j - ∫ q_j ln q_j dZ_j + const
       = -KL(q_j ‖ p̃(X, Z_j)) + const,
where ln p̃(X, Z_j) = E_{i≠j}[ln p(X, Z)] + const and E_{i≠j} denotes the expectation over Π_{i≠j} q_i(Z_i). Hence the optimal factor is
  ln q_j*(Z_j) = E_{i≠j}[ln p(X, Z)] + const,
and the factors are updated in turn, each using the current values of the other M-1 factors, until convergence.
3. Variational Bayesian GMM
The likelihood is p(X | Z, μ, Λ) = Π_n Π_k N(x_n | μ_k, Λ_k^{-1})^{z_{nk}}, and the Bayesian model places priors on everything:
  p(Z | π) = Π_n Π_k π_k^{z_{nk}},
  p(π) = Dir(π | α_0),  α_0 = (α_0, ..., α_0),
  p(μ, Λ) = p(μ | Λ) p(Λ) = Π_{k=1}^{K} N(μ_k | m_0, (β_0 Λ_k)^{-1}) W(Λ_k | W_0, ν_0)  (one Gaussian-Wishart factor per component),
so the joint distribution is
  p(X, Z, π, μ, Λ) = p(X | Z, μ, Λ) p(Z | π) p(π) p(μ | Λ) p(Λ).
The exact posterior p(Z, π, μ, Λ | X) is intractable, so variational inference uses a variational distribution that factorizes between the latent variables and the parameters (all treated as random variables),
  q(Z, π, μ, Λ) = q(Z) q(π, μ, Λ).
Updating q(Z): maximizing L(q) gives
  ln q*(Z) = E_{π,μ,Λ}[ln p(X, Z, π, μ, Λ)] + const = E_π[ln p(Z | π)] + E_{μ,Λ}[ln p(X | Z, μ, Λ)] + const
           = Σ_{n=1}^{N} Σ_{k=1}^{K} z_{nk} ln ρ_{nk} + const,
with
  ln ρ_{nk} = E[ln π_k] + (1/2) E[ln |Λ_k|] - (D/2) ln(2π) - (1/2) E_{μ_k,Λ_k}[(x_n - μ_k)^T Λ_k (x_n - μ_k)],
so q*(Z) has the same functional form as p(Z | π), with responsibilities proportional to ρ_{nk}.
Updating q(π, μ, Λ): maximizing L(q) gives
  ln q*(π, μ, Λ) = ln p(π) + Σ_{k=1}^{K} ln p(μ_k, Λ_k) + E_Z[ln p(Z | π)] + Σ_n Σ_k E[z_{nk}] ln N(x_n | μ_k, Λ_k^{-1}) + const,
which factorizes further,
  q(Z, π, μ, Λ) = q(Z) q(π, μ, Λ) = q(Z) q(π) q(μ, Λ) = q(Z) q(π) Π_{k=1}^{K} q(μ_k, Λ_k),
with q(π) a Dirichlet and each q(μ_k, Λ_k) a Gaussian-Wishart, approximating p(Z, π, μ, Λ | X).
A useful property of the Bayesian GMM: with K chosen larger than necessary, the variational treatment automatically prunes surplus mixture components; the expected mixing coefficients E[π_k] of unused components are driven towards 0, so the effective number of components K* is determined by the data.
Predictive distribution: in the Bayesian treatment, the prediction for a new observation x̂ is
  p(x̂ | X) = Σ_{ẑ} ∫∫∫ p(x̂ | ẑ, μ, Λ) p(ẑ | π) p(π, μ, Λ | X) dπ dμ dΛ,
approximated by replacing the posterior p(π, μ, Λ | X) with q*(π, μ, Λ); the result p(x̂ | X) is a mixture of K Student-t distributions.
4. Expectation propagation (EP)
EP minimizes the reverse KL divergence KL(p ‖ q) with q restricted to the exponential family; then
  KL(p ‖ q) = -ln g(η) - η^T E_{p(z)}[u(z)] + const,
as a function of the natural parameters η, and minimizing it amounts to moment matching: the expected sufficient statistics under q are set equal to those under p.
Setting: the joint distribution of data and parameters is a product of factors f_i(θ),
  p(D, θ) = Π_i f_i(θ),  p(θ | D) = (1/p(D)) Π_i f_i(θ),  p(D) = ∫ Π_i f_i(θ) dθ
(for an i.i.d. model, f_i(θ) = p(x_i | θ) and f_0(θ) = p(θ)). EP approximates the posterior p(θ | D) by
  q(θ) = (1/Z) Π_i f̃_i(θ),
where each approximate factor f̃_i(θ) lies in the exponential family, so q(θ) does too. The factors are refined in turn; to update f̃_j(θ):
(1) remove it from the approximation, q^{\j}(θ) = q(θ) / f̃_j(θ);
(2) combine it with the true factor and normalize, (1/Z_j) f_j(θ) q^{\j}(θ), with Z_j = ∫ f_j(θ) q^{\j}(θ) dθ;
(3) find the revised q^{new}(θ) by minimizing KL( (1/Z_j) f_j(θ) q^{\j}(θ) ‖ q^{new}(θ) ) with respect to the natural parameters, i.e. by moment matching;
(4) set the revised factor to f̃_j(θ) = K q^{new}(θ) / q^{\j}(θ), with K = Z_j.
Iterate over the factors until convergence.
Chapter 11 Sampling Methods

1. The transformation method
To sample from a distribution with CDF F(x): draw y uniformly from (0, 1) and return F^{-1}(y); the result is distributed according to F(x).
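A minimal sketch of this inverse-CDF recipe for an exponential distribution (the rate λ is an arbitrary illustrative choice), where F^{-1}(y) = -ln(1 - y)/λ:

```python
# Transformation method: sample Exp(lam) from uniform draws via the inverse CDF.
import numpy as np

rng = np.random.default_rng(9)
lam = 2.0
y = rng.uniform(0.0, 1.0, 100000)
z = -np.log(1.0 - y) / lam
print("sample mean:", z.mean(), " (theory 1/lam =", 1.0 / lam, ")")
```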
2. Proposal distributions
When p(z) is hard to sample from directly, introduce a simpler proposal distribution q(z).
Rejection sampling (single variable): suppose p(z) = (1/Z_p) p̃(z), where p̃(z) can be evaluated at any z but the normalizer Z_p is unknown; e.g. the Gamma distribution Gam(z | a, b) = b^a z^{a-1} exp(-bz) / Γ(a). Choose a constant k such that k q(z) ≥ p̃(z) for all z. Draw z_0 from q(z) and u_0 uniformly from [0, k q(z_0)]; reject the sample if u_0 > p̃(z_0), otherwise accept it. The accepted values are distributed according to p(z). The method becomes inefficient when k q(z) is a poor envelope (high rejection rate), especially in high dimensions.
Importance sampling: estimates expectations E[f] under p(z) without sampling from p(z) itself (and does not produce samples from p(z)). If we could draw L samples from p(z) we would use E[f] ≈ (1/L) Σ_l f(z^{(l)}); instead draw the samples {z^{(l)}} from a proposal q(z):
  E[f] = ∫ p(z) f(z) dz = ∫ (p(z)/q(z)) f(z) q(z) dz ≈ (1/L) Σ_{l=1}^{L} (p(z^{(l)})/q(z^{(l)})) f(z^{(l)}),
where r_l = p(z^{(l)})/q(z^{(l)}) are the importance weights.
If p(z) = p̃(z)/Z_p and q(z) = q̃(z)/Z_q can only be evaluated up to their normalizers,
  E[f] = ∫ p(z) f(z) dz = (Z_q/Z_p) ∫ (p̃(z)/q̃(z)) f(z) q(z) dz ≈ (Z_q/Z_p) (1/L) Σ_l r̃_l f(z^{(l)}),  r̃_l = p̃(z^{(l)})/q̃(z^{(l)}),
and the ratio of normalizers can be estimated from the same samples,
  Z_p/Z_q = (1/Z_q) ∫ p̃(z) dz = ∫ (p̃(z)/q̃(z)) q(z) dz ≈ (1/L) Σ_l r̃_l,
so finally
  E[f] ≈ Σ_{l=1}^{L} w_l f(z^{(l)}),  w_l = r̃_l / Σ_m r̃_m = (p̃(z^{(l)})/q̃(z^{(l)})) / Σ_m p̃(z^{(m)})/q̃(z^{(m)}).
The weights w_l correct for the fact that the samples z^{(l)} come from q(z) rather than from p(z).
Sampling-importance-resampling (SIR): rejection sampling needs the constant k, and importance sampling does not produce samples from p(z). SIR also uses a proposal distribution q(z) but avoids choosing k: draw L samples {z^{(l)}: l = 1, ..., L} from q(z), compute the importance weights w_l as above, then resample L values from {z^{(l)}} with probabilities w_l. As L → ∞ the resampled values are distributed according to p(z).
3. Monte Carlo EM
When the E step of EM is intractable, Q(θ, θ^{old}) = Σ_Z p(Z | X, θ^{old}) ln p(X, Z | θ) can be approximated by sampling: draw latent-variable samples {Z^{(l)}: l = 1, ..., L} from p(Z | X, θ^{old}) and use
  Q̃(θ, θ^{old}) = (1/L) Σ_l ln p(Z^{(l)}, X | θ),
then carry out the M step by maximizing Q̃(θ, θ^{old}) with respect to θ.
4. Markov chains
A Markov chain is a series of random variables {z^{(m)}: m = 1, ..., M} in which each variable depends only on its predecessor, with transition probabilities
  T_m(z^{(m)}, z^{(m+1)}) = p(z^{(m+1)} | z^{(m)}).
A distribution p*(z) is invariant for the chain if it is unchanged by one transition step; a sufficient condition is detailed balance,
  p*(z) T(z, z') = p*(z') T(z', z).
MCMC methods construct a Markov chain whose invariant distribution is the target p(z).
5. Metropolis-Hastings
Unlike rejection sampling and importance sampling, Metropolis-Hastings only needs p(z) up to a normalizer, p̃(z). At step τ, draw a candidate z* from the proposal q(z | z^{(τ)}) and accept it with probability
  A(z*, z^{(τ)}) = min{ 1, p̃(z*) q(z^{(τ)} | z*) / ( p̃(z^{(τ)}) q(z* | z^{(τ)}) ) };
then
  z^{(τ+1)} = z* if accepted, and z^{(τ+1)} = z^{(τ)} if rejected.
The sequence {z^{(τ)}: τ = 1, ...} forms a Markov chain whose invariant distribution is p(z).
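A minimal NumPy sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal (the unnormalized two-component target, step size and burn-in are illustrative assumptions):

```python
# Random-walk Metropolis-Hastings targeting an unnormalized density p_tilde(z).
import numpy as np

rng = np.random.default_rng(10)
p_tilde = lambda z: np.exp(-0.5 * (z + 2) ** 2) + 0.5 * np.exp(-0.5 * ((z - 2) / 0.7) ** 2)

z, step, samples = 0.0, 1.0, []
for tau in range(20000):
    z_star = z + step * rng.standard_normal()             # symmetric proposal q(z*|z)
    A = min(1.0, p_tilde(z_star) / p_tilde(z))            # acceptance probability
    if rng.uniform() < A:
        z = z_star                                        # accept; otherwise keep the old z
    samples.append(z)

samples = np.array(samples[2000:])                        # discard burn-in
print("mean:", samples.mean().round(3), " std:", samples.std().round(3))
```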
6. Gibbs sampling
Gibbs sampling is a special case of Metropolis-Hastings for sampling from p(z) = p(z_1, ..., z_M). Cycle through the variables {z_i: i = 1, ..., M} (or pick them at random), at each step replacing one variable by a sample from its conditional given all the others:
  sample z_1^{(τ+1)} ~ p(z_1 | z_2^{(τ)}, ..., z_M^{(τ)}),
  sample z_2^{(τ+1)} ~ p(z_2 | z_1^{(τ+1)}, z_3^{(τ)}, ..., z_M^{(τ)}),
  ...
  sample z_M^{(τ+1)} ~ p(z_M | z_1^{(τ+1)}, ..., z_{M-1}^{(τ+1)}).
Viewed as Metropolis-Hastings with proposal q(z* | z) = p(z_k* | z_{\k}) and z_{\k}* = z_{\k}, the acceptance probability is
  A(z*, z) = p(z*) q(z | z*) / ( p(z) q(z* | z) ) = 1,
so every step is accepted.
Chapter 12 Continuous Latent Variables

1. Principal component analysis (PCA)
Maximum-variance formulation: project the data onto a direction u_1; the variance of the projected data is
  (1/N) Σ_{n=1}^{N} { u_1^T x_n - u_1^T x̄ }² = u_1^T S u_1,
where x̄ = (1/N) Σ_n x_n and S is the sample covariance matrix
  S = (1/N) Σ_{n=1}^{N} (x_n - x̄)(x_n - x̄)^T.
Maximizing the projected variance subject to ‖u_1‖ = 1 gives the leading eigenvector of S, and the first M eigenvectors span the principal subspace. Equivalently, PCA minimizes the average squared projection (reconstruction) error (1/N) Σ_n ‖x_n - x̃_n‖², where x̃_n is the projection of x_n onto the subspace.
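A minimal NumPy sketch of this eigen-decomposition view of PCA (the synthetic 3-D Gaussian data and the choice M = 2 are illustrative assumptions):

```python
# PCA: eigen-decomposition of the sample covariance S, projection onto the top-M eigenvectors.
import numpy as np

rng = np.random.default_rng(11)
X = rng.multivariate_normal([0, 0, 0], [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=500)

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)              # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]
M = 2
U = eigvecs[:, order[:M]]                          # principal directions u_1, ..., u_M
Z = (X - x_bar) @ U                                # projected data

print("explained variance ratio:", np.round(eigvals[order[:M]] / eigvals.sum(), 3))
```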
2. Probabilistic PCA (PPCA)
A latent variable z spans the principal-component subspace, and an observed data point x is generated by
  p(z) = N(z | 0, I),
  p(x | z) = N(x | Wz + μ, σ²I),
where the columns of W span the principal-component subspace. This is a linear-Gaussian model, so
  p(x) = N(x | μ, C),  C = W W^T + σ²I  (D×D),
  p(z | x) = N(z | M^{-1} W^T (x - μ), σ² M^{-1}),  M = W^T W + σ²I  (M×M).
p(x) is unchanged under a rotation of the latent space: for any orthogonal matrix R, replacing z by Rz (equivalently W by W̃ = WR) gives W̃W̃^T = W R R^T W^T = W W^T, so the solution for W is not unique. The parameters have a closed-form solution (via the eigen-decomposition of S) and can also be found by EM.
3. Factor analysis
The same model as PPCA except that the output noise covariance is a general diagonal matrix Ψ:
  p(z) = N(z | 0, I),  p(x | z) = N(x | Wz + μ, Ψ).
Factor analysis uses the diagonal matrix Ψ where PPCA uses the isotropic matrix σ²I.
4. Kernel PCA
Standard PCA projects onto the principal-component subspace with a linear projection; kernel PCA makes the projection nonlinear in the original space. Map each data point with a feature map φ(x) into an M-dimensional feature space and perform PCA there. Assuming first that the mapped data have zero mean, Σ_n φ(x_n) = 0, the sample covariance in feature space is
  C = (1/N) Σ_{n=1}^{N} φ(x_n) φ(x_n)^T,
and PCA requires its eigenvectors v_i. These can be written as combinations of the φ(x_n), which leads to an eigenvector equation for the kernel matrix K_{nm} = φ(x_n)^T φ(x_m) = k(x_n, x_m); the projection of a point is then
  y_i(x) = φ(x)^T v_i = Σ_{n=1}^{N} a_{in} k(x, x_n).
When Σ_n φ(x_n) ≠ 0 the feature vectors must first be centralized, φ̃(x_n) = φ(x_n) - (1/N) Σ_l φ(x_l), and the centred Gram matrix K̃_{nm} = φ̃(x_n)^T φ̃(x_m), expressible in terms of K_{nm}, is used in the eigenvector equation.
5. Nonlinear latent variable models
Independent component analysis uses a factorized non-Gaussian latent distribution p(z) = Π_j p(z_j). An autoassociative neural network is trained to reproduce its input at its output by minimizing the sum-of-squares error
  E(w) = (1/2) Σ_n ‖ y(x_n, w) - x_n ‖²;
with a narrow hidden layer it performs a (possibly nonlinear) dimensionality reduction.
Chapter 13 Sequential Data

1. Hidden Markov models (HMM)
The latent variables z_n use 1-of-K coding. Transition probabilities:
  p(z_n | z_{n-1}, A) = Π_{k=1}^{K} Π_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}};
initial state distribution:
  p(z_1 | π) = Π_{k=1}^{K} π_k^{z_{1k}},  Σ_k π_k = 1,  0 ≤ π_k ≤ 1;
emission probabilities:
  p(x_n | z_n, φ) = Π_{k=1}^{K} p(x_n | φ_k)^{z_{nk}}.
The joint distribution is
  p(X, Z | θ) = p(z_1 | π) [ Π_{n=2}^{N} p(z_n | z_{n-1}, A) ] [ Π_{m=1}^{N} p(x_m | z_m, φ) ],
with X = {x_n: n = 1..N}, Z = {z_n: n = 1..N} and θ = {π, A, φ} the HMM parameters.
HMM learning (maximum likelihood, EM): given observations X = {x_n: n = 1..N}, learn θ = {π, A, φ}. Because of the latent variables, the observation likelihood
  p(X | θ) = Σ_Z p(X, Z | θ)
cannot be maximized directly, so EM is used. The E step requires the posterior marginals
  γ(z_n) = p(z_n | X, θ^{old}),  ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, θ^{old}),
computed by the forward-backward algorithm; since the HMM unrolls into a tree-structured factor graph, this is an instance of the sum-product algorithm on the PGM.
HMM inference (Viterbi): besides marginals and conditionals, a common inference problem for HMMs is the most probable sequence of latent states,
  Z* = arg max_Z p(Z | X, θ),
given the observations X = {x_n: n = 1..N}; it is solved by the Viterbi algorithm, an instance of the max-sum algorithm on the PGM.
2. Linear dynamical systems (LDS)
The LDS has the same PGM as the HMM, but its latent variables are continuous and the transition and emission probabilities are linear-Gaussian:
  p(z_n | z_{n-1}) = N(z_n | A z_{n-1}, Γ),
  p(x_n | z_n) = N(x_n | C z_n, Σ),
  p(z_1) = N(z_1 | μ_0, V_0).
Learning is again carried out by EM.
Two recurring themes across these models: kernelize (replace inner products with kernel functions) and probabilize (recast a model in probabilistic form).
Chapter 14 Combining Models

3. Decision trees
A tree partitions the input space into regions R_τ and fits a simple model in each. For regression, the prediction in region R_τ is the mean y_τ = (1/N_τ) Σ_{x_n ∈ R_τ} t_n, and the quality of the region is measured by the sum-of-squares error
  Q_τ(T) = Σ_{x_n ∈ R_τ} { t_n - y_τ }².
For classification, with p_{τk} the proportion of points of class k in region R_τ, common measures are the cross-entropy
  Q_τ(T) = -Σ_k p_{τk} ln p_{τk}
and the Gini index
  Q_τ(T) = Σ_k p_{τk} (1 - p_{τk}).
Trees are grown greedily and then pruned by minimizing
  C(T) = Σ_{τ=1}^{|T|} Q_τ(T) + λ|T|,
which trades residual error against tree size.
4. Mixture models
Unconditional mixture model, e.g. the Gaussian mixture model (GMM):
  p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k),  parameters π, μ, Σ.
Conditional mixture model, e.g. a mixture of linear regression models:
  p(t | x, θ) = Σ_{k=1}^{K} π_k N(t | w_k^T φ(x), β^{-1}),  θ = {W, π, β},
trained by EM. In the mixture-of-experts model the mixing coefficients themselves depend on the input, p(t | x) = Σ_k π_k(x) p_k(t | x), where each p_k(t | x) is an expert and the π_k(x) are gating functions.
Boosting and AdaBoost
Boosting combines M base models y_m(x) trained in sequence, each on a weighted version of the data set in which the weight w_n^{(m)} of each data point depends on the performance of the previous model y_{m-1}(x); initially w_n^{(1)} = 1/N. In AdaBoost, with targets t_n ∈ {-1, +1}, base classifier y_m(x) is trained to minimize the weighted error
  J_m = Σ_{n=1}^{N} w_n^{(m)} I(y_m(x_n) ≠ t_n),
its weighted error rate is
  ε_m = Σ_n w_n^{(m)} I(y_m(x_n) ≠ t_n) / Σ_n w_n^{(m)},
its voting coefficient is α_m = ln{(1 - ε_m)/ε_m}, and the data weights are updated by w_n^{(m+1)} = w_n^{(m)} exp{α_m I(y_m(x_n) ≠ t_n)}, so misclassified data points receive larger weight before the next model is trained. The combined model is
  Y(x) = sign( Σ_{m=1}^{M} α_m y_m(x) ).
AdaBoost can be interpreted as the sequential optimization of an exponential error function over {x_n, t_n},
  E = Σ_{n=1}^{N} exp{ -t_n f_M(x_n) },  f_M(x) = (1/2) Σ_{m=1}^{M} α_m y_m(x),
where at each stage only the newest α_m and y_m(x) are optimized while the earlier terms of the combined model are held fixed.
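A minimal NumPy sketch of AdaBoost with decision stumps on one-dimensional data (the data, the stump threshold grid and the number of rounds are illustrative assumptions):

```python
# AdaBoost with decision stumps: weighted error, alpha_m = ln((1-eps)/eps), re-weighting.
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(-3, 3, 200)
t = np.where(np.abs(x) < 1.5, 1, -1)                       # labels not separable by one stump

w = np.full(len(x), 1.0 / len(x))
stumps = []                                                 # (threshold, sign, alpha)
for m in range(20):
    best = None
    for thr in np.linspace(-3, 3, 61):
        for s in (+1, -1):
            pred = s * np.where(x > thr, 1, -1)
            eps = w[pred != t].sum()
            if best is None or eps < best[0]:
                best = (eps, thr, s)
    eps, thr, s = best
    eps = max(eps, 1e-10)                                   # guard against a perfect stump
    alpha = np.log((1 - eps) / eps)
    pred = s * np.where(x > thr, 1, -1)
    w *= np.exp(alpha * (pred != t))                        # up-weight misclassified points
    w /= w.sum()
    stumps.append((thr, s, alpha))

F = sum(a * s * np.where(x > thr, 1, -1) for thr, s, a in stumps)
print("training accuracy:", np.mean(np.sign(F) == t))
```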