
PRML

Notes on Pattern Recognition and Machine Learning (Bishop)


Version 1.0
Jian Xiao

Checklist
Chapter 1 Introduction
Chapter 2 Probability Distribution
Chapter 3 Linear Models for Regression
Chapter 4 Linear Models for Classification
Chapter 5 Neural Networks
Chapter 6 Kernel methods
Chapter 7 Sparse Kernel Machine
Chapter 8 Graphical Models
Chapter 9 Mixture Models and EM
Chapter 10 Approximate Inference
Chapter 11 Sampling Method
Chapter 12 Continuous Latent Variables
Chapter 13 Sequential Data
Chapter 14 Combining Models

iamxiaojian@gmail.com

Checklist
Frequentist-Bayesian

Frequentist | Bayesian | Solution method
Linear basis function regression | Bayesian linear basis function regression | closed-form solution
Logistic regression | Bayesian logistic regression | IRLS; Laplace approximation
Neural network (for regression, classification) | Bayesian neural network (for regression, classification) | gradient descent; Laplace approximation
SVM (for regression, classification) | RVM (for regression, classification) | Laplace approximation
Gaussian mixture model | Bayesian Gaussian mixture model | EM; variational inference
Probabilistic PCA | Bayesian probabilistic PCA | closed-form solution, EM; Laplace approximation
Hidden Markov model | Bayesian hidden Markov model | EM
Linear dynamic system | Bayesian linear dynamic system | EM

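Bayesian
Fully Bayesian: marginalize with respect to hyper-parameters as well as parameters. This is usually analytically intractable.
Curve fitting example, with $p(w|\alpha) = N(w|0, \alpha^{-1}I)$ and $p(t|x,w,\beta) = N(t|y(x,w), \beta^{-1})$:

$$p(t|\mathbf{t}) = \iiint p(t|w,\beta)\, p(w|\mathbf{t},\alpha,\beta)\, p(\alpha,\beta|\mathbf{t})\, dw\, d\alpha\, d\beta$$

Empirical Bayes / type 2 maximum likelihood / evidence approximation: maximize the marginal likelihood to obtain hyper-parameter values $\alpha^*, \beta^*$, then fix the hyper-parameters at $\alpha^*, \beta^*$ and marginalize over w only:

$$p(t|\mathbf{t}) \approx p(t|\mathbf{t},\alpha^*,\beta^*) = \int p(t|w,\beta^*)\, p(w|\mathbf{t},\alpha^*,\beta^*)\, dw$$

MAP (poor man's Bayesian): no marginalization, only a point estimate.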

PRML Bayesian Empirical Bayesian

Optimization/approximation
Linear/Quadratic/Convex optimization
Lagrange multiplier
Gradient descent
Newton iteration
Laplace approximation
Expectation Maximization (EM)/latent variable model
Variational inference
Expectation Propagation (EP)
MCMC/Gibbs sampling

Latent variable model

 | Discrete latent variable | Continuous latent variable
Independent latent variables | GMM | Probabilistic PCA/ICA/Factor Analysis
Latent variables form a Markov chain | HMM | LDS

Objective function/Error function/Estimator

Likelihood: MLE estimator
Marginal likelihood: empirical Bayes/evidence approximation (hyper-parameter estimation)
Sum-of-squares error: regression
Posterior: MAP
Negative log likelihood/cross-entropy: logistic regression
Exponential error: AdaBoost
Hinge error: SVM

Chapter 1 Introduction
1. Bayesian interpretation of probability

Probability is a degree of belief: it quantifies uncertainty.
Cox showed that if numerical values are used to represent degrees of belief, then a simple set of
axioms encoding common sense properties of such beliefs leads uniquely to a set of rules for
manipulating degrees of belief that are equivalent to the sum and product rules of probability.
We can therefore use the machinery of probability theory to describe the uncertainty in
model parameters.
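2. Model parameters: Frequentist vs. Bayesian
Frequentist: the model parameter w is a fixed quantity whose value is determined by an estimator; the most widely used estimator is maximum likelihood.
Bayesian: w is uncertain, and the uncertainty is expressed by a prior probability p(w).

With w fixed, the Frequentist considers the distribution of possible data sets D.

Bayesian: there is only a single data set D, namely the one that is actually observed.
Before the observation D, the belief about w is the prior probability; after it, the belief is P(w|D).
Bayes' theorem converts a prior probability into a posterior probability by incorporating the evidence provided by the observed data, the likelihood P(D|w), which expresses how probable the observed data set is for different settings of the parameter vector w:

$$p(w|D) = \frac{p(D|w)\,p(w)}{p(D)} = \frac{p(D|w)\,p(w)}{\int p(D|w)\,p(w)\,dw}$$

p(D) is the normalization constant that makes p(w|D) integrate to one.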

C1,,Ck P(C1)
P(Ck)

x P(C1|x)P(Ck|x)
P(C1)=P(C1|x)P(Ck)=P(Ck|x)
P(C1)
P(Ck)

3. Criticisms of the Bayesian and Frequentist approaches
Bayesian: the prior distribution is often selected on the basis of mathematical
convenience rather than as a reflection of any prior beliefs (e.g. a conjugate prior).
Frequentist: the over-fitting problem can be understood as a general property of
maximum likelihood.

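4. Handling over-fitting
Frequentist approaches to over-fitting:
1) Regularization: add a penalty term to the error function.
L2 regularizer: ridge regression
L1 regularizer: Lasso regression
Adding a penalty is a shrinkage method: it reduces the value of the coefficients.
2) Cross-validation: hold out part of the data for validation.

Cross-validation is also used for model selection: the validation data decide which model is kept.

Bayesian approach to over-fitting: the prior probability.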

5. Bayesian methods and marginalization
Marginalization lies at the heart of Bayesian methods.
In a full Bayesian procedure, making predictions or comparing different models
requires us to marginalize (sum or integrate)
over the whole of parameter space.
Two families of methods carry out the marginalization:
Sampling, e.g. Markov chain Monte Carlo. Monte Carlo methods are flexible and apply to a wide range of
models, but they are computationally intensive and mainly suit small-scale problems.
Deterministic approximation, e.g. variational Bayes and expectation propagation, which scale to
large-scale applications.

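6. Curve fitting: three approaches
1) MLE: maximize the likelihood function to obtain a point estimate of w.
2) MAP (poor man's Bayes): introduce a prior probability and maximize the posterior probability to obtain
w_MAP. Compared with MLE, the objective is the likelihood function plus an L2 penalty (for a Gaussian prior).
The result is still a point estimate.
3) Fully Bayesian approach: apply the sum rule and product rule consistently, using the
machinery of probability to manipulate degrees of belief. The predictive
distribution is obtained by marginalizing (sum or integrate) over the whole of parameter space w:

$$p(t|x,\mathbf{X},\mathbf{t}) = \int p(t|x,w)\, p(w|\mathbf{X},\mathbf{t})\, dw$$

x is the new input, X the training inputs and t their labels.
w is described by a probability distribution
and is removed from the prediction by marginalization.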

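7. The three foundations of PRML: probability theory, decision theory and information theory

8. Inference and decision
A solution separates into an inference stage and a decision stage. The inference stage computes the
posterior probability; the decision stage uses the posterior probability to make
optimal class assignments.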

9. Three approaches to the decision problem
1) Discriminant function: map inputs x directly into decisions. A discriminant function
merges inference and decision into a single step.
2) Discriminative model: first solve the inference problem of determining the posterior class
probabilities P(Ck|x), then solve the decision problem of assigning each new x to a
class.
3) Generative model: explicitly or implicitly model the distribution of inputs as well as outputs.
Model the class-conditional density P(x|Ck) or the joint distribution P(x, Ck), derive the
posterior, then solve the decision problem.

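10. Comparing the three approaches
Generative model: if we only want to make classification decisions, modelling the joint distribution is
wasteful of computational resources and excessively demanding of data.

Discriminant function: the simplest approach, but there are powerful reasons for
wanting to compute the posterior probabilities:
(1) Minimizing risk. When the loss matrix changes over time (e.g. in a financial application), a model with
posteriors only needs to re-solve the decision problem of minimizing expected loss, whereas a discriminant
function must be retrained.
(2) Compensating for class priors. In the X-ray example, 99.9% of the examples may belong to one class.
We can train on an artificially balanced data set to obtain P(Ck|x), then compensate for the
effects of the modification to the training data: divide the P(Ck|x) from the balanced data set by the
balanced-set prior P(Ck), multiply by the true prior P(Ck), and renormalize.
This is not possible with a discriminant function.
(3) Combining models.
X-ray images xI and blood data xB are heterogeneous
information. Rather than feeding both into a single input, build one system to interpret the X-ray images
and a different one to interpret the blood data:

$$P(C_k|x_I,x_B) \propto P(x_I,x_B|C_k)\,P(C_k) \propto P(x_I|C_k)\,P(x_B|C_k)\,P(C_k) \propto \frac{P(C_k|x_I)\,P(C_k|x_B)}{P(C_k)}$$

This assumes xI and xB are conditionally independent given Ck.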

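11. Criteria for making decisions

1) Minimizing the misclassification rate
2) Minimizing the expected loss

Given a loss function and a set of actions $A = \{a_1, \ldots, a_p\}$, a decision rule $\alpha(x)$ maps each x to an action in A:

$$\alpha(x) = \arg\min_{a_i \in A} R(a_i|x)$$

where $R(a_i|x)$ is the conditional risk

$$R(a_i|x) = \sum_{j=1}^{c} \lambda(a_i|\omega_j)\, P(\omega_j|x)$$

$\lambda(a_i|\omega_j)$ is the loss (risk) of taking action $a_i$ when the true class is $\omega_j$, and
$P(\omega_j|x)$ is the posterior probability that x belongs to class $\omega_j$.
The overall risk of the decision rule $\alpha(x)$ is

$$R[\alpha] = \int R(\alpha(x)|x)\, p(x)\, dx$$

which is minimized by minimizing $R(\alpha(x)|x)$ for every x.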
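The minimum-risk rule above can be sketched in a few lines. This is a minimal illustration, not taken from the notes: the loss matrix and posterior values are made-up numbers, and the function names `conditional_risk` and `bayes_action` are hypothetical.

```python
def conditional_risk(loss, posterior):
    """R(a_i | x) = sum_j loss[i][j] * P(class j | x), for every action i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posterior)) for row in loss]


def bayes_action(loss, posterior):
    """Return the index of the action with the smallest conditional risk."""
    risks = conditional_risk(loss, posterior)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# Assumed example: classes are (healthy, cancer), actions are (say healthy, say cancer).
loss = [
    [0.0, 1000.0],  # say "healthy": very costly if the truth is cancer
    [1.0, 0.0],     # say "cancer": mildly costly if the truth is healthy
]
posterior = [0.99, 0.01]  # P(healthy | x), P(cancer | x)

action, risks = bayes_action(loss, posterior)
print(action, risks)
```

With these numbers the most probable class is "healthy", yet the minimum-risk action is "say cancer", which shows how minimizing expected loss can differ from minimizing the misclassification rate.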

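12. Entropy
Entropy is a lower bound on the number of bits needed to transmit the state of a random
variable.
For a random variable X taking the value x with probability p(x):
1) Information content of x: $-\log_2 p(x)$

2) Entropy: $-\int p(x)\log_2 p(x)\,dx$
With base 2 the unit is bits; with base e the unit is nats.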

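13. Conditional entropy, relative entropy and mutual information
Conditional entropy: given a joint distribution p(X,Y), H[Y|X] is

$$H[Y|X] = \int H(Y|X=x)\, p(X=x)\, dx = -\iint p(x,y)\log p(y|x)\, dx\, dy$$

H[Y|X] is the average over X=x of the entropy of Y:

$$H(Y|X=x) = -\int p(y|x)\log p(y|x)\, dy$$

Relative entropy: if the unknown p(x) is modelled by an approximating q(x), then coding with
q(x) instead of p(x) costs additional information on average.
This extra cost is the KL divergence KL(p||q):

$$KL(p\|q) = \Big(-\int p(x)\ln q(x)\,dx\Big) - \Big(-\int p(x)\ln p(x)\,dx\Big) = -\int p(x)\ln\Big\{\frac{q(x)}{p(x)}\Big\}dx$$

Mutual information: X and Y are independent when p(x, y) = p(x)p(y); mutual information measures how far they are from independent:

$$I[x,y] = KL\big(p(x,y)\,\|\,p(x)p(y)\big) = H[x] - H[x|y] = H[y] - H[y|x]$$

It is the reduction in uncertainty about x from knowing y (or about y from knowing x).

Because -ln x is convex, Jensen's inequality gives KL(p||q) >= 0.
KL(p||q) = 0 if and only if p(x) = q(x) for all x, because -ln x is strictly convex.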
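The identities above can be checked numerically for discrete distributions. The sketch below is an illustration under assumptions: the 2x2 joint table is invented, sums replace the integrals, and `kl` assumes q is non-zero wherever p is non-zero.

```python
import math


def entropy(p):
    """H[p] = -sum p ln p (in nats); zero-probability states contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl(p, q):
    """KL(p || q) = sum p ln(p / q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I[x, y] = KL(p(x, y) || p(x) p(y)) for a joint table joint[i][j]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


# Assumed joint distribution of two correlated binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

h_x = entropy(px)
h_x_given_y = entropy([v for row in joint for v in row]) - entropy(py)  # H[x,y] - H[y]
mi = mutual_information(joint)
print(mi, h_x - h_x_given_y)  # the two values agree
```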

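14. Exercise 1.14

15. Exercise 1.32
If x has entropy H[x], then under the linear transformation y = Ax the entropy becomes
H[y] = H[x] + ln|A|, where |A| is the determinant of the nonsingular matrix A.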

Feature extraction: a special form of dimensionality reduction in which the input data are transformed into
a reduced representation set of features (a feature vector).
Example: real-time face detection in a high-resolution video stream must handle huge
numbers of pixels per second, so features are extracted to speed up computation.
Methods: PCA, kernel PCA, manifold learning

Feature selection: variable selection, feature reduction, attributes selection


Feature selection algorithms typically fall into two categories, feature ranking and
subset selection.
Feature ranking ranks the features by a metric and eliminates all features that do not achieve an
adequate score.
Subset selection searches the set of possible features for the optimal subset.
metricfeature subset

In statistics, the most popular form of feature selection is stepwise regression. It is a greedy
algorithm that adds the best feature (or deletes the worst feature) at each round. The main
control issue is deciding when to stop the iteration.
In machine learning, this is typically done by cross-validation.

Unsupervised learning: the training data consists of a set of input vectors without any
corresponding target values.
Clustering, density estimation (determine the distribution of data within the input space),
visualization (project the data from a high-dimensional space down to two or three dimensions)

p(D|w)
Frequentistw p(D|w)
Bayesianw p(D|w)

Chapter 2 Probability Distribution


1. Parametric and nonparametric methods
Parametric method: assume a specific functional form for the distribution.
Nonparametric method: the form of the distribution typically depends on the size of the data set. Such
models still contain parameters, but these control the model complexity rather than the form of the
distribution.

2. Conjugate prior: leads to a posterior distribution having the same functional form as the prior

Distribution | Conjugate prior
Bernoulli | Beta distribution
Multinomial | Dirichlet distribution
Gaussian, variance given, mean unknown | Gaussian distribution
Gaussian, mean given, variance unknown | Gamma distribution (over the precision)
Gaussian, both mean and variance unknown | Gaussian-Gamma distribution (over the mean and precision)

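3. Conjugate priors and sequential Bayesian inference

With a conjugate prior, Bayesian inference can be done sequentially: after each observation we compute the posterior,
and that posterior becomes the prior for the next observation. Because posterior and prior share the same functional form, every update has the same shape.
This suits a stream of data and real-time
learning.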
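As a small sketch of the sequential update, consider the Bernoulli/Beta pair from the table of conjugate priors. The prior counts and the data stream below are assumed values for illustration, and `update` is a hypothetical helper name.

```python
def update(a, b, x):
    """One sequential Bayesian step: Beta(a, b) prior, Bernoulli observation x in {0, 1}.

    The posterior is Beta(a + x, b + 1 - x), which serves as the next prior.
    """
    return a + x, b + (1 - x)


a, b = 2, 2                        # assumed prior Beta(2, 2)
stream = [1, 0, 1, 1, 0, 1, 1, 1]  # assumed observations arriving one at a time

for x in stream:
    a, b = update(a, b, x)         # yesterday's posterior is today's prior

posterior_mean = a / (a + b)

# Processing all data in one batch gives the same posterior.
a_batch = 2 + sum(stream)
b_batch = 2 + len(stream) - sum(stream)
print((a, b), (a_batch, b_batch), posterior_mean)
```

The sequential and batch results coincide, which is why no past data need to be stored when the prior is conjugate.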

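4. Conjugate prior for the multinomial
Let D be N observations of a multinomial variable with K states:

$$p(D|\mu) = \prod_{k=1}^{K} \mu_k^{m_k}$$

$m_k$ is the number of the N observations in state k.
The prior, with hyper-parameter $\alpha$, takes the conjugate form

$$p(\mu|\alpha) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

Normalized, this is the Dirichlet distribution:

$$\mathrm{Dir}(\mu|\alpha) = \frac{\Gamma(\alpha_1 + \dots + \alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

The Dirichlet is the conjugate prior of the Multinomial: the posterior keeps the same functional
form as the prior.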

5.
Linear Algebra + Matrix Theory, Multivariate Calculus

R
random vector
A G A = GG

probability density function

probability density function


conditional Gaussian distribution
completing the square
inverse of partitioned matrix
marginal Gaussian distribution

(1)
(2)
flexible
multimodal function
multimodal function
(1) introducing discrete latent variables Gaussian Mixtures model
(2) introducing continuous latent variables Linear Dynamic System

6. Linear Gaussian Model

Given a Gaussian distribution p(x) and a Gaussian p(y|x), where p(y|x) has a mean that is a linear function of x and a covariance
independent of x, find p(y) and p(x|y).

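7. Derivatives
(1) $f: R \to R$: the ordinary derivative
(2) $f: R^n \to R$: gradient $\big(\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n}\big)^T$
(3) $f: R \to R^n$: $\big(\frac{\partial f_1(x)}{\partial x}, \ldots, \frac{\partial f_n(x)}{\partial x}\big)^T$
(4) $f: R^n \to R^m$: Jacobian matrix $\big(\frac{\partial f_i}{\partial x_j}\big)_{ij}$
(5) $f: R \to \{A_{m\times n}: R^n \to R^m\}$, derivative of a matrix with respect to a scalar: $\big(\frac{\partial A_{ij}}{\partial x}\big)_{ij}$
(6) $f: \{A_{m\times n}: R^n \to R^m\} \to R$, derivative of a scalar with respect to a matrix: $\big(\frac{\partial x}{\partial A_{ij}}\big)_{ij}$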

Ax x x Ax x

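8. The Exponential Family

$$p(x|\eta) = h(x)\, g(\eta) \exp\{\eta^T u(x)\}$$

$\eta$ is the natural parameter.

MLE: taking the gradient of the normalization condition $g(\eta)\int h(x)\exp\{\eta^T u(x)\}\,dx = 1$ gives

$$-\nabla \ln g(\eta) = E[u(x)]$$

and the maximum likelihood estimate satisfies

$$-\nabla \ln g(\eta_{ML}) = \frac{1}{N}\sum_{n=1}^{N} u(x_n)$$

where the right-hand side is the sufficient statistic vector.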

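9. Noninformative priors
The prior is intended to have as little influence on the posterior distribution as possible.

Such a prior may not normalize, in which case it is called improper.
Two examples: translation invariant and scale invariant priors.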

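10. Nonparametric methods

Given a data set D of N points drawn from an unknown p(x), estimate p(x)
(unsupervised learning).
Consider a small region R containing x, with probability mass P.

Of the N points, the number K that fall in R follows Bin(K|N,P).
If R is small, p(x) is roughly constant over R, so P = p(x) * V, where V is the volume of R.
If N is large, Bin(K|N,P) is sharply peaked around N*P, so K = N * P.
Hence within R, p(x) = K / (N * V).

Kernel density estimator (Parzen window): fix V and determine K from the data, counting points with a smoothing kernel function.

Kernel function:

$$k(u) = \begin{cases} 1, & |u_i| \le 1/2,\ i = 1,\ldots,D \\ 0, & \text{otherwise} \end{cases}$$

$k\big(\frac{x - x_n}{h}\big) = 1$ if $x_n$ lies inside the hypercube of side h centred on x.

The number of the N points inside the hypercube is

$$K = \sum_{n=1}^{N} k\Big(\frac{x - x_n}{h}\Big)$$

Substituting into p(x) = K / (N * V), with V the volume of the hypercube, gives the density estimate.
This kernel function produces artificial discontinuities; a smoother choice is the
Gaussian:

$$p(x) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\Big\{-\frac{\|x - x_n\|^2}{2h^2}\Big\}$$

Choice of h: in regions of high data density, a large h may
lead to over-smoothing and washing out of structure that might otherwise be extracted from the
data; a small h may lead to noisy estimates elsewhere. The best h may depend on location.

kNN: fix K and determine V from the data.
kNN for classification: kNN estimates the posterior and classifies by MAP.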
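Both estimators follow from p(x) = K / (N * V). The sketch below is a one-dimensional illustration under assumptions: the data points and the width h are invented, the Gaussian kernel formula is used with D = 1, and for kNN the "volume" is taken as the length of the interval centred on x that reaches the K-th nearest point.

```python
import math


def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate in 1-D: average of N Gaussians of width h."""
    norm = 1.0 / math.sqrt(2.0 * math.pi * h * h)
    return sum(norm * math.exp(-((x - xn) ** 2) / (2.0 * h * h)) for xn in data) / len(data)


def knn_density(x, data, k):
    """kNN density estimate in 1-D: fix K, let V grow until it holds K points.

    Note: V is zero (and the estimate undefined) if x coincides with its k-th neighbour.
    """
    dists = sorted(abs(x - xn) for xn in data)
    volume = 2.0 * dists[k - 1]          # interval [x - r, x + r] with r = k-th distance
    return k / (len(data) * volume)


data = [-1.0, -0.8, 0.1, 0.9, 1.2]       # assumed sample
for h in (0.05, 0.3, 2.0):                # small h: noisy, large h: over-smoothed
    print(h, gaussian_kde(0.0, data, h))
print(knn_density(0.0, data, 2))
```

The kernel estimate integrates to one by construction; the kNN estimate in general does not, which is a known property of that method.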

Chapter 3 Linear Models for Regression


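1. Curve fitting
Given N training points,
minimize the sum of squares of error between y(x,w) and the targets with respect to w,
where x is the input variable:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2$$

MLE: assume t is Gaussian distributed around y(x, w); maximizing the likelihood with respect to
w gives the same solution.
MAP: compared with MLE, a prior on w is added, which is equivalent to
minimizing the sum of squares of error with a regularizer added to E(w):

$$E(w) = \frac{1}{2}\sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{\lambda}{2}\|w\|^2$$

MLE is equivalent to minimizing the sum of squares of error.

MAP is equivalent to minimizing the sum of squares of error with regularizer (MAP with Gaussian prior).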
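To make the MLE/MAP contrast concrete, here is a minimal sketch for a straight-line model y(x, w) = w0 + w1 x. It is an illustration, not the notes' method: the data are invented, the regularizer penalizes both w0 and w1 as in the formula above, and the 2x2 system (Phi^T Phi + lambda I) w = Phi^T t is solved by hand to stay within the standard library.

```python
def fit_line(xs, ts, lam=0.0):
    """Minimize 1/2 sum (w0 + w1 x_n - t_n)^2 + lam/2 (w0^2 + w1^2).

    lam = 0 gives the least-squares (MLE) solution, lam > 0 the regularized (MAP) one.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    st = sum(ts)
    sxt = sum(x * t for x, t in zip(xs, ts))
    # Entries of the symmetric matrix Phi^T Phi + lam I, with Phi rows (1, x_n).
    a11, a12, a22 = n + lam, sx, sxx + lam
    det = a11 * a22 - a12 * a12
    w0 = (a22 * st - a12 * sxt) / det
    w1 = (a11 * sxt - a12 * st) / det
    return w0, w1


xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]          # assumed noise-free targets t = 1 + 2x

w_mle = fit_line(xs, ts)           # recovers (1, 2)
w_map = fit_line(xs, ts, lam=1.0)  # coefficients are shrunk towards zero
print(w_mle, w_map)
```

The regularized solution has a smaller norm than the least-squares one, which is the shrinkage effect of the Gaussian prior.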

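2. Linear model
The simplest linear model is linear in the input variable; the
linear basis function model generalizes it:

$$y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = w^T \phi(x)$$

$\phi_j(x)$ are the basis functions; y(x, w) is linear in w. Common basis functions:

Polynomial, Gaussian, Logistic sigmoid function, Fourier basis, wavelets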

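3. Loss function for regression: expected loss and the regression function

For input x, predicting y(x) when the true value is t incurs the loss
L(t, y(x)). The expected loss is

$$E[L] = \iint L(t, y(x))\, p(x, t)\, dx\, dt$$

For regression the usual choice is the squared loss

$$L(t, y(x)) = \{y(x) - t\}^2$$

Minimizing the expected loss E[L] with respect to y(x) gives

$$y(x) = \int t\, p(t|x)\, dt = E_t[t|x]$$

the conditional expectation of t given x.

Expanding the squared loss L:

$$L(t, y(x)) = \{y(x) - t\}^2 = \{y(x) - E[t|x] + E[t|x] - t\}^2$$

$$= \{y(x) - E[t|x]\}^2 + 2\{y(x) - E[t|x]\}\{E[t|x] - t\} + \{E[t|x] - t\}^2$$

Substituting into E[L], the cross-term vanishes:

$$E[L] = \int \{y(x) - E[t|x]\}^2 p(x)\, dx + \iint \{E[t|x] - t\}^2 p(x, t)\, dx\, dt$$

Writing h(x) = E[t|x]:

$$E[L] = \int \{y(x) - h(x)\}^2 p(x)\, dx + \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt$$

The first term is the only part of E[L] that depends on y(x), and it vanishes when y(x) = h(x).

The second term of E[L],

$$\iint \{h(x) - t\}^2 p(x, t)\, dx\, dt$$

is independent of y(x). It arises from the intrinsic noise on the data and represents the minimum achievable value of the expected
loss.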

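4. Frequentist view of model complexity: the bias-variance trade-off

Consider a linear basis function model.
(1) Frequentist: from a data set D we obtain a point estimate of w.
(2) A different data set D gives a different w and hence a different y(x), so
y(x) depends on D; write it y(x; D).
(3) Thought experiment: draw many data sets independently from p(t, x),
each data set of size N.
(4) For a particular data set D, the first term of the expected loss involves y(x; D):

$$\{y(x; D) - h(x)\}^2 = \{y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x)\}^2$$

$$= \{y(x; D) - E_D[y(x; D)]\}^2 + \{E_D[y(x; D)] - h(x)\}^2 + 2\{y(x; D) - E_D[y(x; D)]\}\{E_D[y(x; D)] - h(x)\}$$

Take the expectation of this expression with respect to D (the cross-term vanishes):

$$E_D[\{y(x; D) - h(x)\}^2] = \{E_D[y(x; D)] - h(x)\}^2 + E_D[\{y(x; D) - E_D[y(x; D)]\}^2]$$

The first term is the squared bias: how far the average prediction over all data sets is from the desired regression
function h(x), that is, how far the average model is from the best model.
The second term is the variance:
how much y(x; D) for an individual data set varies around E_D[y(x; D)],
that is, how sensitive y(x; D) is to the choice of D.

(5) Substitute the expected loss term $E_D[\{y(x; D) - h(x)\}^2]$ for $\{y(x) - h(x)\}^2$ in

$$E[L] = \int \{y(x) - h(x)\}^2 p(x)\, dx + \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt$$

to obtain

Expected loss (E[L]) = (bias)^2 + variance + noise

$$(\text{bias})^2 = \int \{E_D[y(x; D)] - h(x)\}^2 p(x)\, dx$$

$$\text{variance} = \int E_D[\{y(x; D) - E_D[y(x; D)]\}^2]\, p(x)\, dx$$

$$\text{noise} = \iint \{h(x) - t\}^2 p(x, t)\, dx\, dt$$

The goal of regression is to find the y(x) that minimizes E[L], and E[L] decomposes into
bias, variance and noise. The noise is independent of y(x), so
regression can only trade off
bias against variance. A flexible
model has low bias (the average model is close to the best model) but high variance (a single
model is sensitive to its data set D); a rigid model has high bias and low variance.
The best model balances bias and variance.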

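5. Bias-variance example
Curve fitting with L data sets of N data points each.

Let $y^{(l)}(x)$ be the model fitted to data set $D^{(l)}$. The average model is

$$\bar{y}(x) = \frac{1}{L}\sum_{l=1}^{L} y^{(l)}(x)$$

and the bias and variance are estimated by

$$(\text{bias})^2 = \frac{1}{N}\sum_{n=1}^{N} \{\bar{y}(x_n) - h(x_n)\}^2$$

$$\text{variance} = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{L}\sum_{l=1}^{L} \{y^{(l)}(x_n) - \bar{y}(x_n)\}^2$$

These let us compare the bias and variance of different models.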
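The thought experiment can be simulated directly. The following sketch rests on assumptions not in the notes: h(x) = sin(2 pi x), Gaussian noise, a fixed grid of inputs, and a deliberately rigid straight-line model fitted by regularized least squares (the helper `fit_line` is hypothetical). It estimates (bias)^2 and variance with the averaged formulas above.

```python
import math
import random


def h(x):
    """Assumed true regression function."""
    return math.sin(2.0 * math.pi * x)


def fit_line(xs, ts, lam):
    """Regularized least-squares fit of y = w0 + w1 x (2x2 normal equations)."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    st, sxt = sum(ts), sum(x * t for x, t in zip(xs, ts))
    a11, a12, a22 = n + lam, sx, sxx + lam
    det = a11 * a22 - a12 * a12
    return (a22 * st - a12 * sxt) / det, (a11 * sxt - a12 * st) / det


def bias_variance(lam, L=100, N=25, noise=0.3, seed=0):
    """Fit L independent data sets and return (bias^2, variance, mean squared error to h)."""
    rng = random.Random(seed)
    xs = [n / (N - 1) for n in range(N)]
    preds = []
    for _ in range(L):
        ts = [h(x) + rng.gauss(0.0, noise) for x in xs]
        w0, w1 = fit_line(xs, ts, lam)
        preds.append([w0 + w1 * x for x in xs])
    avg = [sum(p[n] for p in preds) / L for n in range(N)]          # average model
    bias2 = sum((avg[n] - h(xs[n])) ** 2 for n in range(N)) / N
    var = sum(sum((p[n] - avg[n]) ** 2 for p in preds) / L for n in range(N)) / N
    mse = sum(sum((p[n] - h(xs[n])) ** 2 for p in preds) / L for n in range(N)) / N
    return bias2, var, mse


for lam in (0.0, 1.0, 100.0):
    b2, v, mse = bias_variance(lam)
    print(lam, b2, v, mse)   # mse equals b2 + v up to rounding
```

Increasing the regularization strength makes the model more rigid, so the estimated variance falls; the printed mean squared error always equals the sum of the two terms, which is the decomposition derived above.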

6. Bayesian view of model complexity: model evidence/marginal likelihood

(1) The Bayesian approach avoids over-fitting by marginalizing over the model parameters
instead of making point estimates of their values.
(2) Given candidate models $\{M_i : i = 1, \ldots, L\}$ and a data set D, Bayesian model
comparison uses

$$p(M_i|D) \propto p(M_i)\, p(D|M_i)$$

The prior $p(M_i)$ allows us to express a preference for different models;
$p(D|M_i)$ is the model evidence, also called the marginal
likelihood.
(3) Model averaging vs. model selection
Model averaging gives the predictive distribution

$$p(t|x, D) = \sum_{i=1}^{L} p(t|x, M_i, D)\, p(M_i|D)$$

Model selection is an approximation to
model averaging:
use only the single most probable model, which for equal priors is the one with the largest model
evidence.
(4) Model evidence
(marginal likelihood):

$$p(D|M_i) = \int p(D|w, M_i)\, p(w|M_i)\, dw$$

w is marginalized out.

From a sampling point of view, $M_i$ plays the role of a hyper-parameter and w of a parameter.
A model is specified by its hyper-parameters: in curve fitting with M basis functions,
M is a hyper-parameter.
Given model M, marginalizing over w yields the
model evidence: the probability that model M generates the data D.

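7. Bayesian
Fully Bayesian: marginalize with respect to hyper-parameters as well as parameters. This is usually analytically intractable.
Curve fitting example, with $p(w|\alpha) = N(w|0, \alpha^{-1}I)$ and $p(t|x,w,\beta) = N(t|y(x,w), \beta^{-1})$:

$$p(t|\mathbf{t}) = \iiint p(t|w,\beta)\, p(w|\mathbf{t},\alpha,\beta)\, p(\alpha,\beta|\mathbf{t})\, dw\, d\alpha\, d\beta$$

Empirical Bayes / type 2 maximum likelihood / evidence approximation: maximize the marginal likelihood to obtain hyper-parameter values $\alpha^*, \beta^*$, then fix the hyper-parameters at $\alpha^*, \beta^*$ and marginalize over w only:

$$p(t|\mathbf{t}) \approx p(t|\mathbf{t},\alpha^*,\beta^*) = \int p(t|w,\beta^*)\, p(w|\mathbf{t},\alpha^*,\beta^*)\, dw$$

MAP (poor man's Bayesian): no marginalization, only a point estimate.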

$$p(y|x) = \int p(y|x, z)\, p(z|x)\, dz$$

Derivation:

$$p(y|x) = \frac{p(x, y)}{p(x)} = \frac{\int p(x, y, z)\, dz}{p(x)} = \frac{\int p(y|x, z)\, p(z|x)\, p(x)\, dz}{p(x)} = \int p(y|x, z)\, p(z|x)\, dz$$

Chapter 4 Linear Models for Classification


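1. Basic concepts
Hyperplane: in a D-dimensional Euclidean space, a flat subspace of dimension D-1.

Linearly separable: the classes can be separated exactly by hyperplanes in the D-dimensional input space.
Coding scheme: in the 1-of-K binary coding scheme the target is a vector of length K; for class i,
element i is 1 and all other elements are 0.
Feature vector: a D-dimensional input x passed through a fixed nonlinear transformation
gives the feature vector $\phi(x)$.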

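2. Generalized Linear Model: an activation function acting on a linear function of the feature
variables.

$$y(x) = f(w^T \phi(x) + w_0)$$

Generalized Linear Model (GLM)

$\phi(x)$ is the feature vector of x.
f is the activation function; its inverse $f^{-1}$ is the link function.
(1) If f is a nonlinear function, the GLM is a classification model.
(2) If f is the identity function, the GLM is a regression model.

Difference between a classification GLM and the linear regression model:

in regression,

$y(x) = w^T \phi(x) + w_0$ is linear in w, although it may be nonlinear in x
(for example with a polynomial function basis). In a GLM the activation function makes
y nonlinear in w as well.
Examples of GLMs:
(1) Logistic regression model: f is the logistic sigmoid
(2) Probit regression: the activation function f is the probit function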

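3. Three approaches to classification
As in Chapter 1, there are three ways to solve the decision problem:
(1) Discriminant function: assign
x directly to a class.
(2) Generative model: model the joint distribution $p(x, C_k)$ or the class-conditional
distribution $p(x|C_k)$, then derive the posterior $p(C_k|x)$.
(3) Discriminative model: model the posterior $p(C_k|x)$ directly, for example with a GLM:

$$p(C_k|x) = f(w^T \phi(x) + w_0)$$

Use the training data to
infer the GLM parameter w, obtain $p(C_k|x)$, then move on to the
decision stage. Examples: Logistic Regression and Probit regression.
Depending on how w is inferred, there is a Frequentist Logistic Regression and a Bayesian
Logistic Regression; the same holds for Probit regression.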

4. Linear discriminant function

For a linear discriminant the decision surface is a hyperplane. It assigns
x directly to a class: a non-probabilistic method.

4.1 From binary classification to multiclass classification

One-versus-the-rest: build K two-class discriminants, each separating one class from the rest.

One-versus-one: build K(K-1)/2 binary discriminants, one for every pair of the K classes.

Both constructions leave ambiguous regions between the decision regions.

A single K-class discriminant: use K linear functions, one per class,

$$y_k(x) = w_k^T x + w_{k0}$$

The decision boundary between class k and class j is

$$y_k(x) = y_j(x) \iff (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0$$

Assign x to class k if $y_k(x) > y_j(x)$ for all $j \ne k$. With a K-class discriminant the
decision regions have no ambiguous region.

Three ways to learn a linear discriminant function: least squares, Fisher's linear
discriminant, and the perceptron algorithm.

4.2 least squares


regression sum-of-squares error

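4.3 Fisher's linear discriminant

View classification as dimensionality reduction: project the D-dimensional x onto a scalar $y = w^T x$,
that is, from D dimensions to 1.
Projecting to 1 dimension loses information, so classes that are separated in D dimensions may overlap in 1 dimension.
The Fisher criterion selects the projection that keeps the classes well separated in 1 dimension:
a large separation between the projected class means and a small variance within each class.

Maximizing the Fisher criterion gives w. To obtain a decision hyperplane we additionally choose a threshold
y0 on the 1-dimensional projection.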

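4.4 Perceptron algorithm
The perceptron is a Generalized Linear Model

$$y(x) = f(w^T \phi(x) + w_0)$$

whose activation function f is the step function sign(x).

Perceptron criterion:
code class 1 as $t_n = +1$ and class 2 as $t_n = -1$. For every $x_n$ we want
$w^T x_n > 0$ for class 1 and $w^T x_n < 0$ for class 2, that is,
$w^T x_n t_n > 0$ for all $x_n$. A misclassified pattern $x_n$ has $w^T x_n t_n < 0$.
The perceptron criterion sums $-w^T x_n t_n$ over the misclassified patterns:

$$E_P(w) = -\sum_{n \in M} w^T x_n t_n$$

where M is the set of misclassified patterns.

There is no closed-form solution; minimize with stochastic gradient descent.

Limitation: the perceptron does not generalize readily to
K>2 classes.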
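A minimal sketch of stochastic gradient descent on the perceptron criterion follows. It is an illustration under assumptions: the six training points are invented and linearly separable, a constant feature 1 is prepended to carry the bias, and a pattern lying exactly on the boundary is treated as misclassified so that training can start from w = 0.

```python
import random


def train_perceptron(xs, ts, epochs=100, eta=1.0, seed=0):
    """Perceptron learning: for each misclassified pattern, w <- w + eta * t_n * phi_n."""
    rng = random.Random(seed)
    w = [0.0] * (len(xs[0]) + 1)                       # w[0] is the bias weight
    data = [([1.0] + list(x), t) for x, t in zip(xs, ts)]
    for _ in range(epochs):
        rng.shuffle(data)
        mistakes = 0
        for phi, t in data:
            if t * sum(wi * pi for wi, pi in zip(w, phi)) <= 0:   # misclassified
                w = [wi + eta * t * pi for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:                              # every pattern satisfies w^T phi t > 0
            break
    return w


def predict(w, x):
    """Class +1 if w^T phi(x) > 0, otherwise -1."""
    return 1 if sum(wi * pi for wi, pi in zip(w, [1.0] + list(x))) > 0 else -1


xs = [(2.0, 1.0), (3.0, 2.0), (2.5, 3.0), (-1.0, -1.0), (-2.0, 0.0), (-1.5, -2.0)]
ts = [1, 1, 1, -1, -1, -1]
w = train_perceptron(xs, ts)
print(w, [predict(w, x) for x in xs])
```

For linearly separable data the updates stop after finitely many mistakes; for non-separable data the loop above simply ends after `epochs` passes without converging.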

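5. Generative model
Model the class-conditional distribution of the input, derive the posterior, then,
as with a discriminative model, use the posterior to make decisions.
For 2 classes the posterior
$p(C_k|x)$ is:

(1)
$$p(C_1|x) = \frac{p(x|C_1)\,p(C_1)}{p(x|C_1)\,p(C_1) + p(x|C_2)\,p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

$$a = \ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}$$

(2) Assume the class-conditional distribution is Gaussian and that all

class-conditional distributions share the same covariance matrix:

$$p(x|C_k) = N(x|\mu_k, \Sigma)$$

(3) Substituting the class-conditional distribution:

$$p(C_1|x) = \sigma\Big(\ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}\Big) = \sigma(w^T x + w_0)$$

$$w = \Sigma^{-1}(\mu_1 - \mu_2)$$

$$w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}$$

If each class k had its own covariance, the terms quadratic in x would not cancel.

With a shared covariance for the class-conditional distributions,
$\ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}$ is a linear function of x, namely $w^T x + w_0$.

(4) MLE
Let $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$. For data $\{x_n, t_n\}$, $t_n = 1$ if $x_n$ belongs to $C_1$, otherwise
$t_n = 0$.

$$p(x_n, C_1) = \pi\, N(x_n|\mu_1, \Sigma)$$

$$p(x_n, C_2) = (1 - \pi)\, N(x_n|\mu_2, \Sigma)$$

$$p(\mathbf{t}|\pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} [\pi\, N(x_n|\mu_1, \Sigma)]^{t_n}\, [(1 - \pi)\, N(x_n|\mu_2, \Sigma)]^{1 - t_n}$$

The parameters found by MLE are substituted into (3).

Summary: assume a form for the class-conditional distribution,
fit the class-conditional distribution by MLE,

and the resulting posterior is a GLM with Logistic activation.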

6. Discriminant model
D generative model D

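6.1 Logistic function and Softmax function

Logistic: $\sigma(a) = \frac{1}{1 + \exp(-a)}$, a smooth version of the step function

Softmax: $s(a_k; a_1, \ldots, a_n) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$, a smooth version of the max function

Softmax is the multiclass generalization of the logistic function.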
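The relation between the two functions can be verified numerically: for two classes, the softmax probability of class 1 equals the logistic function of the difference of the two activations. This sketch is an illustration; subtracting the maximum inside `softmax` is a standard numerical-stability step, not something stated in the notes.

```python
import math


def logistic(a):
    """sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def softmax(a):
    """s(a_k) = exp(a_k) / sum_j exp(a_j); the max is subtracted to avoid overflow."""
    m = max(a)
    e = [math.exp(ai - m) for ai in a]
    s = sum(e)
    return [ei / s for ei in e]


a1, a2 = 1.3, -0.4
print(softmax([a1, a2])[0], logistic(a1 - a2))   # identical values
print(softmax([10.0, 0.0, -10.0]))               # close to a one-hot "max" indicator
```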

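6.2 Logistic regression

In this Generalized Linear Model the posterior can be achieved by a Softmax
transformation of a linear function of the feature variables:

$$p(C_k|\phi) = s(a_k; a_1, \ldots, a_n) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = w_k^T \phi$$

For 2 classes the Softmax reduces to the logistic function.
For 2 classes, let the training data be $\{\phi_n, t_n\}$, where $\phi_n$ is a
D-dimensional feature vector and $t_n \in \{0, 1\}$. The likelihood is

$$p(\mathbf{t}|w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$

$$y_n = p(C_1|\phi_n) = \sigma(a_n), \qquad a_n = w^T \phi_n$$

The negative logarithm of the likelihood is the cross-entropy error function:

$$E(w) = -\ln p(\mathbf{t}|w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\}$$

Once w is found, $p(C_k|\phi)$ is known: inference is complete and we can make decisions.

The multiclass case is analogous.
If the activation function is the probit function, the model is probit regression. The probit
function is the CDF (Cumulative Distribution Function) of the Gaussian and has a shape similar to the logistic.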
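6.3 Iterative reweighted least squares (IRLS)
Setting the gradient of E(w) to 0:

$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\phi_n = 0$$

$\nabla E(w) = 0$ has no closed-form solution, because
the logistic function makes $y_n$ nonlinear in w.
However, E(w) is convex, so E(w) has a unique minimum and

the w solving $\nabla E(w) = 0$ can be found iteratively.

IRLS applies Newton-Raphson iteration to

$\nabla E(w) = 0$,
using the Hessian H of E(w):

$$w^{(new)} = w^{(old)} - H^{-1}\nabla E(w)$$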
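A small sketch of the Newton-Raphson update for two-class logistic regression is given below. The assumptions are mine: the feature vector is phi = (1, x) for a scalar input, the data are invented and deliberately not linearly separable (otherwise the maximum likelihood weights diverge), and the standard Hessian H = sum_n y_n (1 - y_n) phi_n phi_n^T is inverted by hand as a 2x2 matrix.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gradient(w, xs, ts):
    """Gradient of the cross-entropy error: sum_n (y_n - t_n) phi_n with phi_n = (1, x_n)."""
    g0 = g1 = 0.0
    for x, t in zip(xs, ts):
        y = sigmoid(w[0] + w[1] * x)
        g0 += y - t
        g1 += (y - t) * x
    return g0, g1


def newton_step(w, xs, ts):
    """One IRLS / Newton-Raphson update: w_new = w_old - H^{-1} grad E(w_old)."""
    g0, g1 = gradient(w, xs, ts)
    h00 = h01 = h11 = 0.0
    for x in xs:
        y = sigmoid(w[0] + w[1] * x)
        r = y * (1.0 - y)                 # the per-pattern weight that gives IRLS its name
        h00 += r
        h01 += r * x
        h11 += r * x * x
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det      # first component of H^{-1} g
    d1 = (h00 * g1 - h01 * g0) / det      # second component of H^{-1} g
    return w[0] - d0, w[1] - d1


xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ts = [0, 0, 1, 0, 1, 1]                   # overlapping classes, so a finite optimum exists
w = (0.0, 0.0)
for _ in range(25):
    w = newton_step(w, xs, ts)
print(w, gradient(w, xs, ts))             # gradient is numerically zero at convergence
```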
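7. Laplace approximation
Find a Gaussian q(z) that approximates $p(z) = \frac{1}{Z} f(z)$, where Z is the unknown normalization constant of f(z).

q(z) is centred on a mode $z_0$ of f(z). A Taylor expansion of ln f(z) around $z_0$ gives

$$\ln f(z) \approx \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$$

$$A = -\frac{d^2}{dz^2} \ln f(z)\Big|_{z = z_0}$$

so f(z) is approximated by

$$f(z) \approx f(z_0)\exp\Big\{-\frac{A}{2}(z - z_0)^2\Big\}$$

and, after normalization,

$$q(z) = \Big(\frac{A}{2\pi}\Big)^{1/2}\exp\Big\{-\frac{A}{2}(z - z_0)^2\Big\}$$

which approximates p(z).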
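The procedure can be sketched for a one-dimensional example. The choice f(z) = exp(-z^2/2) * sigmoid(20z + 4) is an assumed illustration; the mode is found by Newton's method on ln f(z) using hand-derived first and second derivatives, and A is the negative second derivative at the mode.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def d_log_f(z):
    """d/dz ln f(z) for the assumed f(z) = exp(-z^2 / 2) * sigmoid(20 z + 4)."""
    return -z + 20.0 * (1.0 - sigmoid(20.0 * z + 4.0))


def d2_log_f(z):
    """d^2/dz^2 ln f(z); always negative here, so ln f is concave."""
    s = sigmoid(20.0 * z + 4.0)
    return -1.0 - 400.0 * s * (1.0 - s)


def laplace(z=0.0, iters=50):
    """Return the mode z0 of f and the precision A of the Gaussian approximation."""
    for _ in range(iters):
        z -= d_log_f(z) / d2_log_f(z)     # Newton step towards d/dz ln f = 0
    return z, -d2_log_f(z)


def q(z, z0, A):
    """Laplace approximation q(z) = sqrt(A / 2 pi) * exp(-A (z - z0)^2 / 2)."""
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)


z0, A = laplace()
print(z0, A, q(z0, z0, A))
```

The approximation matches f only locally around the mode; it says nothing about mass far from z0, which is the usual caveat of the method.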

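8. Bayesian Logistic Regression

For 2 classes and a new feature vector $\phi$, the
predictive distribution is

$$p(C_1|\phi, \mathbf{t}) = \int p(C_1|\phi, w)\, p(w|\mathbf{t})\, dw = \int \sigma(w^T \phi)\, p(w|\mathbf{t})\, dw$$

The Bayesian treatment must
marginalize over parameter space, and with the logistic
function this integral over
p(w|t) is intractable.
Approximate p(w|t) by a Gaussian approximation q(w) obtained with the Laplace
approximation: find the stationary point of ln p(w|t) and the Hessian matrix there.
With the prior $p(w) = N(w|m_0, S_0)$, maximizing ln p(w|t) gives $m_{MAP}$, and the
Hessian matrix gives $S_N$, so $q(w) = N(w|m_{MAP}, S_N)$.
Replace p(w|t) by q(w) in the marginalization; the integral of a logistic against a
Gaussian is then approximated to obtain the predictive distribution.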


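Note on the likelihood function
Wikipedia: "The likelihood of a set of parameter values given some observed outcomes is
equal to the probability of those observed outcomes given those parameter values."

The question is what counts as the observed outcome.

Two likelihoods appear in PRML.
Generative model for classification, with $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$.

For data $\{x_n, t_n\}$, $t_n = 1$ if $x_n$ belongs to $C_1$, otherwise $t_n = 0$.

$$p(x_n, C_1) = \pi\, N(x_n|\mu_1, \Sigma)$$

$$p(x_n, C_2) = (1 - \pi)\, N(x_n|\mu_2, \Sigma)$$

$$p(\mathbf{t}|\pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} [\pi\, N(x_n|\mu_1, \Sigma)]^{t_n}\, [(1 - \pi)\, N(x_n|\mu_2, \Sigma)]^{1 - t_n}$$

Here the likelihood is built from the joint distributions $p(x_n, C_1)$ and $p(x_n, C_2)$:
the observed outcome is the pair $\{x_n, t_n\}$, so the input $x_n$ is modelled together with
$t_n$.

Logistic regression, with training data $\{\phi_n, t_n\}$, where $\phi_n$ is a D-dimensional
feature vector and $t_n \in \{0, 1\}$:

$$p(\mathbf{t}|w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$

$$y_n = p(C_1|\phi_n) = \sigma(a_n), \qquad a_n = w^T \phi_n$$

Here the likelihood is built from the posterior distribution $y_n = p(C_1|\phi_n)$:
the observed outcome is $t_n$ given the input $x_n$, that is,
$p(t_n|x_n)$, and the input itself is not modelled.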

Chapter 5 Neural Networks


1. Generalized Linear Model: an activation function acting on a linear function of the feature
variables

$$y(x) = f(w^T \phi(x) + w_0)$$

$\phi(x)$ is the feature vector of x; f is the activation
function and its inverse $f^{-1}$ is the link function.
(1) If f is a nonlinear function, the GLM is a classification model and y(x) is a
posterior probability, so this Generalized Linear Model is a discriminative model. When
f is the logistic function it is logistic regression, trained with the cross-entropy error function.
(2) If f is the identity function, the GLM is a regression model; here y(x) is not interpreted as a
probability, and training uses the sum-of-squares error function.

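2. Neural network
A neural network extends the Generalized Linear Model by nesting nonlinear functions:

$$y_k(x, w) = f\Big(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\Big(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\Big)\Big)$$

$y_k(x, w)$ is the k-th output: for classification it relates to class k, for regression to the
k-th target component. M is the number of hidden layer units, D the dimension of x, and h a nonlinear
function. f is linear for regression and a nonlinear function for classification.
The index-0 terms absorb the biases.
When h is the logistic function, each hidden unit has the form of a logistic regression.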

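3. Neural network error functions
For regression, the sum-of-squares error function:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N} \|y(x_n, w) - t_n\|^2$$

$y(x_n, w)$ is the K-dimensional output vector, $x_n$ the D-dimensional input vector, $t_n$ the target
vector (the same dimension K as the output vector),
and w the weight vector.
For classification, the negative logarithm of the likelihood function:

$$E(w) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_k(x_n, w) = -\ln\prod_{n=1}^{N}\prod_{k=1}^{K} [y_k(x_n, w)]^{t_{nk}}$$

The target $t_n$ of $x_n$ uses the 1-of-K coding scheme: if the input belongs to class k then
$t_{nk} = 1$, otherwise 0. The k-th output is $y_k(x_n, w) = p(t_k = 1|x_n)$.

Two properties of E(w):

(1) It is a sum over input vectors:

$$E(w) = \sum_{n=1}^{N} E_n(w)$$

(2) Both the sum-of-squares error function and the cross-entropy derive from a likelihood
function.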

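4. Gradient descent
Given the observation data, minimize E(w) with respect to w by gradient descent.

(1) Off-line gradient descent (batch gradient descent): steepest descent, and the more efficient
conjugate gradients and quasi-Newton methods. Steepest descent:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E(w^{(\tau)})$$

$\eta > 0$ is the learning rate; $\nabla E(w^{(\tau)})$ is the gradient of the batch error function,
computed from all the observation data.
(2) On-line gradient descent (sequential gradient descent, stochastic gradient descent):

$$w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_n(w^{(\tau)})$$

Each update uses one observation, chosen by cycling through the data or by random selection with
replacement.

On-line methods handle data redundancy efficiently and have a chance of escaping from local
minima.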
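5. Computing the gradient of E_n: error back-propagation
Gradient descent needs the derivative of $E_n$ with respect to every weight.

Let $a_j$ be the input to hidden layer unit j:

$$a_j = \sum_i w_{ji} z_i$$

where $z_i$ is the activation of unit i (in the input layer) that feeds unit j, and

$$z_j = h(a_j)$$

Define $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$. Then

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i$$

Let k index the output layer units. By the chain rule,

$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j} = \sum_k \delta_k \frac{\partial a_k}{\partial a_j}$$

Since $a_k = \sum_i w_{ki} h(a_i)$,

$$\frac{\partial a_k}{\partial a_j} = \frac{\partial\big(\sum_i w_{ki} h(a_i)\big)}{\partial a_j} = w_{kj} h'(a_j)$$

and therefore

$$\delta_j = h'(a_j)\sum_k w_{kj}\delta_k$$

Once the output layer $\delta_k$ are known, $\delta_j$ follows, and with it $\frac{\partial E_n}{\partial w_{ji}}$.
For the weights between the hidden layer and the output layer:

$$\frac{\partial E_n}{\partial w_{kj}} = \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{kj}} = \delta_k z_j$$

Error back-propagation: the $\delta_k$ of the output layer are propagated backwards to give the $\delta_j$ of the hidden layer.
For W weights the cost is
O(W).
Ignoring biases, the number of weights is O(N*M + M*K).
BP steps: (1) with the current w, compute all a and z
(evaluation, forward propagation); (2) compute $\delta_k$ at the outputs; (3) back-propagate to obtain
$\delta_j$; (4) compute the derivatives of E(w).
BP only computes the gradient of E(w) with respect to w;
gradient descent then uses it to minimize E(w).
Training a network with BP:
(1) initialize w
(2) with the current w, use BP to compute the gradient of E(w)
(3) update w by gradient descent
(4) repeat (2) and (3) until w converges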
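The back-propagation equations can be sketched for a tiny two-layer network. The setup is assumed for illustration: tanh hidden units, linear outputs, the sum-of-squares error for a single pattern (so the output deltas are y_k - t_k), and no bias terms. The weights, input and target are arbitrary numbers.

```python
import math


def forward(w1, w2, x):
    """Forward propagation: a_j = sum_i w_ji x_i, z_j = tanh(a_j), y_k = sum_j w_kj z_j."""
    a = [sum(wji * xi for wji, xi in zip(row, x)) for row in w1]
    z = [math.tanh(aj) for aj in a]
    y = [sum(wkj * zj for wkj, zj in zip(row, z)) for row in w2]
    return z, y


def error(w1, w2, x, t):
    """E_n = 1/2 * sum_k (y_k - t_k)^2 for one pattern."""
    _, y = forward(w1, w2, x)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))


def backprop(w1, w2, x, t):
    """Return dE_n/dw1 and dE_n/dw2 using the delta recursion."""
    z, y = forward(w1, w2, x)
    delta_k = [yk - tk for yk, tk in zip(y, t)]              # output deltas
    # delta_j = h'(a_j) * sum_k w_kj delta_k, with h = tanh so h'(a_j) = 1 - z_j^2
    delta_j = [
        (1.0 - z[j] ** 2) * sum(w2[k][j] * delta_k[k] for k in range(len(w2)))
        for j in range(len(z))
    ]
    grad_w2 = [[dk * zj for zj in z] for dk in delta_k]      # dE/dw_kj = delta_k z_j
    grad_w1 = [[dj * xi for xi in x] for dj in delta_j]      # dE/dw_ji = delta_j x_i
    return grad_w1, grad_w2


def gradient_descent(w1, w2, x, t, eta=0.1, steps=200):
    """Repeat: compute the gradient with BP, then move the weights against it."""
    for _ in range(steps):
        g1, g2 = backprop(w1, w2, x, t)
        w1 = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(w1, g1)]
        w2 = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(w2, g2)]
    return w1, w2


w1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]   # 3 hidden units, 2 inputs
w2 = [[0.3, -0.1, 0.2]]                       # 1 output
x, t = [1.0, 0.5], [0.7]

before = error(w1, w2, x, t)
w1_new, w2_new = gradient_descent(w1, w2, x, t)
after = error(w1_new, w2_new, x, t)
print(before, after)
```

A useful sanity check in practice is to compare the back-propagated derivatives with central finite differences of `error`; the two should agree to several digits.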

6. Neural network evaluation: forward propagation

Evaluation: given w and a D-dimensional input vector x, compute the
output vector
by forward propagation:

$$a_j = \sum_i w_{ji} z_i$$

$$z_j = h(a_j)$$

The cost is O(W).

7. Optimization theory and method


machine learning modeling + optimization
statistical

optimization techniques
Linear and Quadratic Programming, Convex Optimization
Combinatorial Optimization
Probabilistic Optimization: Genetic Algorithm, Simulated Annealing, Particle Swarm
Optimization, Ant Colony Optimization;
Calculus of variation;
Numerical Optimization: Gradient descent, Conjugate gradient, Newton method

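8. Other uses of error propagation: the Hessian of E_n and the Jacobian J

Jacobian J:

$$J_{ki} = \frac{\partial y_k}{\partial x_i}$$

the derivative of the output vector with respect to the input vector, for $f: R^n \to R^m$.

Hessian: the second derivatives of $f: R^n \to R$.

Both can be evaluated by error propagation
starting from the output layer.

Product of the Hessian with a vector:

$v^T H = v^T \nabla(\nabla E(w))$, which can be evaluated in O(W).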
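9. Regularization of neural networks
Frequentist methods for controlling complexity:
(1) Regularizer: as in regression, add a penalty on w. The simple
quadratic regularizer is not consistent with linear transformation invariance.
(2) Early stopping: stop training when the error on a validation set starts to rise.

10. Neural networks and invariance

Linear transformation invariance, translation invariance and scale invariance.

Linear transformation invariance: an error function with the regularizer $\frac{\lambda}{2} w^T w$ suffers from
inconsistency. If the input data are transformed as $x_i \to \tilde{x}_i = a x_i + b$, the network mapping is unchanged when the weights are transformed as

$$w_{ji} \to \tilde{w}_{ji} = \frac{1}{a} w_{ji}, \qquad w_{j0} \to \tilde{w}_{j0} = w_{j0} - \frac{b}{a}\sum_i w_{ji}$$

A regularizer should treat w and $\tilde{w}$ as
equivalent solutions (consistency). A
regularizer with this property is

$$\frac{\lambda_1}{2}\sum_{w \in W_1} w^2 + \frac{\lambda_2}{2}\sum_{w \in W_2} w^2$$

where $W_1$ and $W_2$ are the first-layer and second-layer weights, excluding biases.

Translation invariance and scale invariance:
the output of the neural network should not change when the input undergoes a translation (scale).
There are 4 approaches:

(1) Data replication: add transformed copies to the training set (augmented data) with the same
label,
so that the network can learn the invariance.
This approach modifies the data.
(2) Regularization: penalize changes in the model when the input is transformed,
as in tangent propagation. This approach modifies the error function.
(3) Pre-processing: extract transformation-invariant features
(feature extraction of transformation-invariant features).
(4) Structural invariance: build the invariance properties into the
structure of a neural network, as in a convolutional neural network, a model with
transformation-invariant structure.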
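11. Bayesian Neural Networks
Consider a regression network with a single output.
The prior over the network weights w is

$$p(w|\alpha) = N(w|0, \alpha^{-1} I)$$

For an input x, the conditional distribution of the target is $p(t|x, w, \beta) = N(t|y(x, w), \beta^{-1})$,

where y(x, w) is the network function. For N observations $\{x_n\}$ with target values $D = \{t_n\}$, the posterior is

$$p(w|D, \alpha, \beta) \propto p(w|\alpha)\, p(D|w, \beta) = p(w|\alpha)\prod_{n=1}^{N} N(t_n|y(x_n, w), \beta^{-1})$$

For a new input x, the predictive distribution is

$$p(t|x, D) = \int p(t|x, w)\, p(w|D, \alpha, \beta)\, dw$$

Three approximations are needed:
(1) The posterior is not Gaussian, so a Laplace approximation $q(w|D)$ is used. The predictive
distribution becomes

$p(t|x, D) \approx \int p(t|x, w)\, q(w|D)\, dw$, but p(t|x,w) depends on the network function y(x, w), so the integral is still
analytically intractable.
(2) Expand the network function y(x, w) as a Taylor series so that $p(t|x, w, \beta)$ becomes a
linear-Gaussian model and p(t|x, D) can be evaluated.
(3) For the hyper-parameters $\alpha, \beta$, the predictive distribution should also
marginalize over them. Instead, use the empirical Bayesian approach: maximize the marginal likelihood to obtain a
point estimate of the hyper-parameters.

The above describes the Bayesian Neural Network for regression.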

Machine learning = Modeling + optimization/approximation


ML
modeling SVMSVM

modeling
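ML needs optimization or approximation. The Frequentist approach makes a
point estimate through an estimator, and
evaluating the estimator is an optimization problem.
The Bayesian approach makes no point estimate and must marginalize instead; when the
marginalization has no analytical solution, approximation is required:
analytical approximation (Taylor expansion, Laplace approximation, Variational Bayes)
or sampling approximation.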

Chapter 6 Kernel methods


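1. Two kinds of model
Models that use the training data only to learn parameters and discard the
training data at prediction time:
linear basis function model, generalized linear model, neural network.
Models whose prediction still needs the training data:
kNN and Gaussian process keep all the training
data; SVM keeps only the Support Vectors.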

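2. Kernel
Given a non-linear feature space mapping $\phi(x)$ of the input x, the kernel
function is

$$k(x, x') = \phi(x)^T \phi(x')$$

Two special kinds of kernel, the stationary kernel and the homogeneous kernel:

$k(x, x') = k(x - x')$ and $k(x, x') = k(\|x - x'\|)$; the latter is also called a
radial basis kernel.

Kernel construction: either choose $\phi$ and derive the kernel,
or define the kernel k directly and verify that it can be written as

$k(x, x') = \phi(x)^T \phi(x')$. Example: $k(x, z) = (x^T z)^2$ with x and z 2-dimensional.

k is a valid kernel because $k(x, z) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)(z_1^2, \sqrt{2} z_1 z_2, z_2^2)^T$.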
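The feature-map identity for this kernel is easy to confirm numerically. The two test vectors below are arbitrary; `phi` implements the explicit feature map written above.

```python
import math


def poly_kernel(x, z):
    """k(x, z) = (x^T z)^2 for 2-dimensional inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map (x1^2, sqrt(2) x1 x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


x, z = (1.5, -0.5), (2.0, 3.0)
lhs = poly_kernel(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, rhs)   # equal up to rounding
```

The kernel needs one 2-dimensional inner product, while the explicit route builds 3-dimensional feature vectors first; for higher polynomial degrees and dimensions the gap grows quickly.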

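New kernels can be constructed from existing ones: sums and products of valid
kernels are again valid kernels, and the Gaussian kernel can be shown to be a valid kernel in this way.

Kernels are not restricted to vectors of real numbers.

For a set A, let P(A) be the set of all subsets of A. A valid
kernel for $A_1, A_2 \in P(A)$ is

$$k(A_1, A_2) = 2^{|A_1 \cap A_2|}$$

which corresponds to an inner product in a feature space defined over
P(A).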

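3. Kernels from generative models
Generative model vs. discriminative model: a generative model can deal naturally with missing
data and, as with an HMM, handle sequences of varying length; a discriminative model
generally gives better performance on discriminative tasks.
To combine the generative model and the discriminative model, use the generative model
to define a kernel and use that kernel in a discriminative model.

Two ways to define a kernel from a generative model:

(1) $k(x, x') = p(x)\, p(x')$ is a kernel, where p(x) is a
generative model of the input.
It is valid because p(x) >= 0 can be seen as a feature map from the D-dimensional x to
1 dimension. Sums of such kernels are
kernels, which gives the extended kernel

$$k(x, x') = \sum_i p(x|i)\, p(x'|i)\, p(i)$$

where i is a latent variable.
(2) Fisher kernel: for a parametric generative model $p(x|\theta)$ with
parameter vector $\theta$, the Fisher score is $g(\theta, x) = \nabla_\theta \ln p(x|\theta)$, the gradient of a scalar
with respect to a vector, itself a vector.

Fisher kernel: $k(x, x') = g(\theta, x)^T F^{-1} g(\theta, x')$, where F is the Fisher information matrix

$$F = E_x[g(\theta, x)\, g(\theta, x)^T]$$

the expectation of a vector outer product

under $p(x|\theta)$. In practice the expectation defining F is replaced by a sample average.

Relation between the Fisher score and the Fisher information
matrix: the Fisher score is a random vector and F is its covariance matrix, so the Fisher kernel is an
inner product of Fisher scores in the Mahalanobis sense. Dropping F gives the Euclidean version of the kernel:

$$k(x, x') = g(\theta, x)^T g(\theta, x')$$

The Fisher kernel is used in document retrieval. A kernel measures the similarity of two input
vectors.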
4. Gaussian process
A Gaussian process is defined as a probability distribution over functions y(x) such that
the set of values of y(x) evaluated at an arbitrary set of points $x_1, \ldots, x_N$ jointly have a Gaussian
distribution.

For any $x_1, \ldots, x_N$, the vector $(y(x_1), \ldots, y(x_N))$ follows a
Gaussian distribution. With no prior knowledge of the mean value of y(x), it is set to
0, so the Gaussian process is specified by the covariance of $(y(x_1), \ldots, y(x_N))$.
For any pair $x_n, x_m$,

the covariance of $(y(x_n), y(x_m))$ is an entry of the covariance of $(y(x_1), \ldots, y(x_N))$ and is given by a kernel:

$$E[y(x_n)\, y(x_m)] = k(x_n, x_m)$$
5. Gaussian process for regression
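Taking observation noise into account, write $y_n = y(x_n)$ and let the observed target be $t_n$, with $p(t_n | y_n) = N(t_n | y_n, \beta^{-1})$. For N observations with independent noise the joint distribution is Gaussian:

$p(t | y) = N(t | y, \beta^{-1} I_N)$

where $t = (t_1, \dots, t_N)^T$, $y = (y_1, \dots, y_N)^T$ and $I_N$ is the N x N identity matrix.
By the definition of a Gaussian process, $y = (y_1, \dots, y_N)^T$ is Gaussian:

$p(y) = N(y | 0, K)$

where $K$ is the Gram matrix defined by the kernel, $K_{nm} = k(x_n, x_m)$.
The marginal $p(t)$ of $t = (t_1, \dots, t_N)^T$ is therefore also Gaussian.

Prediction: given observations $t = (t_1, \dots, t_N)^T$ and a new input $x_{N+1}$, predict the target $t_{N+1}$, i.e. evaluate $p(t_{N+1} | t)$. Both $p(t)$ and $p(t_1, \dots, t_N, t_{N+1})$ are Gaussian, so by the conditional probability of a Gaussian, $p(t_{N+1} | t)$ has an analytical solution.

Cost: training involves inverting an N x N matrix, $O(N^3)$, and each prediction costs $O(N^2)$. A linear basis function model with M basis functions costs $O(M^3)$ and $O(M^2)$ respectively. When the number of basis functions M is smaller than N, the linear basis function model is cheaper than the $O(N^3)$ / $O(N^2)$ of the Gaussian process. The advantage of the Gaussian process is that it can use kernels that correspond to an infinite number of basis functions (e.g. Gaussian basis functions).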
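The analytical predictive distribution can be sketched in a few lines. This is a minimal illustration, assuming 1-D inputs, a unit-amplitude Gaussian kernel, and the standard PRML results $m = k^T C_N^{-1} t$ and $\sigma^2 = c - k^T C_N^{-1} k$ with $C_N = K + \beta^{-1} I_N$; the names `rbf_kernel`, `gp_predict` and the toy data are assumptions for the example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Gaussian kernel exp(-(x - x')^2 / (2 l^2)) between 1-D arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, t_train, x_new, beta=100.0):
    """Mean and variance of p(t_{N+1} | t) at each point of x_new."""
    C = rbf_kernel(x_train, x_train) + np.eye(len(x_train)) / beta  # C_N
    k = rbf_kernel(x_train, x_new)        # N x (number of new points)
    c = 1.0 + 1.0 / beta                  # k(x, x) + 1/beta for this kernel
    alpha = np.linalg.solve(C, t_train)   # O(N^3) solve, done once
    mean = k.T @ alpha
    var = c - np.sum(k * np.linalg.solve(C, k), axis=0)
    return mean, var

# Toy data (hypothetical): three noise-free samples of sin(x).
x_train = np.array([-1.0, 0.0, 1.0])
t_train = np.sin(x_train)
mean, var = gp_predict(x_train, t_train, np.array([0.0, 5.0]))
```

The predictive variance is small near the training inputs and returns to the prior value $1 + \beta^{-1}$ far away from them.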

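6. Learning the kernel hyper-parameters of a Gaussian process
Rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data.
A commonly used kernel is

$k(x_n, x_m) = \theta_0 \exp\{-\frac{\theta_1}{2} \|x_n - x_m\|^2\} + \theta_2 + \theta_3 x_n^T x_m$

which combines a Gaussian kernel, a constant and a linear kernel.

Given observations $t = (t_1, \dots, t_N)^T$, the task is to infer $\theta = (\theta_0, \dots, \theta_3)$. Choosing $\theta$ amounts to choosing the kernel, i.e. kernel selection. $\theta$ is a hyper-parameter and is set by empirical Bayes: maximize the marginal likelihood $p(t | \theta)$. Because $p(t | \theta)$ is Gaussian, $\ln p(t | \theta)$ has an explicit form and can be maximized with respect to $\theta$ (MLE).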

7. Gaussian process for classification


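In a Gaussian process the output y(x) at an input point x is a real number with a Gaussian distribution. For classification the output must lie in (0, 1), so the GP output is passed through a logistic sigmoid.

For binary classification with target variable $t \in \{0, 1\}$, define a Gaussian process over $a(x)$ and apply the logistic function, $y = \sigma(a)$. Then

$p(t | a) = \sigma(a)^t (1 - \sigma(a))^{1-t}$

The Gaussian process prior over $a_{N+1} = (a(x_1), \dots, a(x_{N+1}))^T$ is

$p(a_{N+1}) = N(a_{N+1} | 0, C_{N+1})$

where $C_{N+1}$ is built from the kernel:

$C(x_n, x_m) = k(x_n, x_m) + \nu \delta_{nm}$, $\nu > 0$

The term $\nu \delta_{nm}$ is added to the kernel so that $C$ is positive definite.

Predictive distribution:

$p(t_{N+1} = 1 | t_N) = \int p(t_{N+1} = 1 | a_{N+1}) p(a_{N+1} | t_N) \, da_{N+1}$

where $p(t_{N+1} = 1 | a_{N+1}) = \sigma(a_{N+1})$ and

$p(a_{N+1} | t_N) = \int p(a_{N+1}, a_N | t_N) \, da_N = \int p(a_{N+1} | a_N) p(a_N | t_N) \, da_N$

By the Gaussian process, $p(a_{N+1} | a_N)$ is a conditional Gaussian. $p(a_N | t_N)$ is not Gaussian, so it is approximated by a Gaussian using the Laplace approximation. $p(a_{N+1} | t_N)$ is then the convolution of 2 Gaussians and hence Gaussian, and $p(t_{N+1} | t_N)$ is the convolution of a logistic and a Gaussian, which is approximated in turn.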

model parameters
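(1) Properties of the Gaussian distribution
The marginal Gaussian distribution, the conditional Gaussian distribution, the convolution of two Gaussians, and the convolution of logistic and Gaussian.
The logistic can be approximated by a probit, which is a Gaussian CDF, so that the convolution of logistic and Gaussian is reduced to a convolution of two Gaussians.

Convolution of two Gaussians

(i) Marginalization is a convolution:

$\int N(y; x, \Sigma_1) N(x; \mu_2, \Sigma_2) \, dx = \int N(y - x; 0, \Sigma_1) N(x; \mu_2, \Sigma_2) \, dx$

Marginalizing over $x$: let $f_{\Sigma_1}(y - x) = N(y - x; 0, \Sigma_1)$ and $g_{\mu_2, \Sigma_2}(x) = N(x; \mu_2, \Sigma_2)$. Then

$\int f_{\Sigma_1}(y - x) g_{\mu_2, \Sigma_2}(x) \, dx = (f_{\Sigma_1} * g_{\mu_2, \Sigma_2})(y)$

which is the convolution of two Gaussian densities.

(ii) The convolution of the Gaussian densities $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ is

$(N(\mu_1, \Sigma_1) * N(\mu_2, \Sigma_2))(t) = \int N(t - z; \mu_1, \Sigma_1) N(z; \mu_2, \Sigma_2) \, dz$

so by (i)

$\int N(y; x, \Sigma_1) N(x; \mu_2, \Sigma_2) \, dx = (N(0, \Sigma_1) * N(\mu_2, \Sigma_2))(y)$

i.e. the marginalization is the convolution of $N(0, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$.

(iii) The characteristic function of the Gaussian $N(\mu, \Sigma)$ is $\exp(i t^T \mu - \frac{1}{2} t^T \Sigma t)$. The characteristic function of a convolution is the product of the characteristic functions, so for $N(\mu_1, \Sigma_1) * N(\mu_2, \Sigma_2)$ it is

$\exp(i t^T (\mu_1 + \mu_2) - \frac{1}{2} t^T (\Sigma_1 + \Sigma_2) t)$

which is the characteristic function of $N(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2)$. Hence the marginalization gives

$\int N(y; x, \Sigma_1) N(x; \mu_2, \Sigma_2) \, dx = (N(0, \Sigma_1) * N(\mu_2, \Sigma_2))(y) = N(y; \mu_2, \Sigma_1 + \Sigma_2)$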

(2) Taylor expansion

(3) Laplace approximation


Gaussian

Chapter 7 Sparse Kernel Machine


1. Lagrange multipliers and the KKT conditions
See PRML and Andrew Ng's lecture notes.

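2. SVM
The SVM is a discriminant function: it maps an input directly to a decision. The RVM, by contrast, is a probabilistic discriminative model.
The SVM is a sparse model: predictions depend only on a subset of the training data, the support vectors. A Gaussian process, by contrast, needs all the training data for prediction.
The SVM can be used for classification, regression and novelty detection.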

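3. SVM modeling: the linearly separable case
Training data: N points $\{x_n, t_n\}$ with $t_n \in \{-1, 1\}$.
(1) Modeling

The model is $y(x) = w^T \phi(x) + b$, with $y(x) = 0$ as the decision boundary. The SVM chooses the decision boundary with the largest margin, where the margin is the smallest distance from the decision boundary to any of the N training points.

Once the decision boundary $y(x) = 0$ is fixed by $w$ and $b$, the SVM prediction for a new $x$ is the sign of $y(x)$.
(2) Mathematical formulation of the model
The distance from $x_n$ to the hyperplane $y(x) = 0$ is $|y(x_n)| / \|w\|$. In the linearly separable case $t_n y(x_n) > 0$ for every n, so the distance is

$\frac{t_n y(x_n)}{\|w\|}$

Maximizing the margin:

$\arg\max_{w,b} \{ \frac{1}{\|w\|} \min_n [t_n y(x_n)] \}$
s.t. $t_n y(x_n) > 0$, $n = 1, \dots, N$

Simplifying: rescaling $w \to \kappa w$, $b \to \kappa b$ leaves the distance $t_n y(x_n) / \|w\|$ from any $x_n$ to $y(x) = 0$ unchanged, so we may set $t_n y(x_n) = 1$ for the point closest to the boundary. Every point then satisfies $t_n y(x_n) \ge 1$, $n = 1, \dots, N$, and the problem becomes $\arg\max_{w,b} \{ \frac{1}{\|w\|} \}$, equivalently

$\arg\min_{w,b} \{ \frac{1}{2} \|w\|^2 \}$
s.t. $t_n y(x_n) \ge 1$, $n = 1, \dots, N$

This is a quadratic programming problem with inequality constraints.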

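Lagrange function:

$L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \{ t_n (w^T \phi(x_n) + b) - 1 \}$

with Lagrange multipliers $a_n \ge 0$.
$L(w, b, a)$ is minimized with respect to $w, b$ and maximized with respect to $a$. Setting the derivatives with respect to $w$ and $b$ to 0 gives

$w = \sum_{n=1}^{N} a_n t_n \phi(x_n)$, $\sum_{n=1}^{N} a_n t_n = 0$

Substituting back into $L(w, b, a)$ eliminates $w, b$ and gives the dual:

$\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n t_n a_m t_m k(x_n, x_m)$
s.t. $a_n \ge 0$, $n = 1, \dots, N$
$\sum_{n=1}^{N} a_n t_n = 0$

$\tilde{L}(a)$ is maximized with respect to $a$. Substituting $w = \sum_{n=1}^{N} a_n t_n \phi(x_n)$ into $y(x) = w^T \phi(x) + b$ gives

$y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b$

where the kernel is $k(x, x') = \phi(x)^T \phi(x')$.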

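(3) Sparsity
The KKT conditions are

$a_n \ge 0$
$t_n y(x_n) - 1 \ge 0$
$a_n \{ t_n y(x_n) - 1 \} = 0$

By the third condition, for every n either $a_n = 0$ or $t_n y(x_n) = 1$.

In the prediction $y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b$, a point $x_n$ with $a_n = 0$ plays no role. The points with $a_n \ne 0$ are the support vectors. They satisfy $t_n y(x_n) = 1$, i.e. they are the points closest to the boundary $y(x) = 0$.
This is why the SVM is a sparse model.
(4) Solving the SVM optimization problem: SMO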
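The KKT conditions and the sparsity of the solution can be checked on a toy problem. The sketch below assumes a 1-D linear kernel and a hand-derived dual solution for three hypothetical training points (optimal primal solution $w = 1$, $b = 0$); it does not run SMO.

```python
def linear_kernel(x, z):
    """k(x, z) = x * z for scalar inputs (identity feature map)."""
    return x * z

# Toy training set and its dual solution (assumed values for illustration).
X = [-1.0, 1.0, 3.0]
T = [-1, 1, 1]
A = [0.5, 0.5, 0.0]   # Lagrange multipliers a_n; sum_n a_n t_n = 0
B = 0.0               # bias b

def y(x):
    """y(x) = sum_n a_n t_n k(x, x_n) + b."""
    return sum(a * t * linear_kernel(x, xn) for a, t, xn in zip(A, T, X)) + B

# The three KKT conditions, evaluated for every training point.
kkt = [
    (a >= 0, t * y(xn) - 1 >= -1e-12, abs(a * (t * y(xn) - 1)) < 1e-12)
    for a, t, xn in zip(A, T, X)
]
# Points with a_n > 0 are the support vectors; they sit on the margin.
support_vectors = [xn for a, xn in zip(A, X) if a > 0]
```

Only the two points at distance 1 from the boundary have nonzero multipliers; the point at $x = 3$ has $a_n = 0$ and does not enter the prediction.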

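4. SVM: the overlapping case
When the class distributions overlap, exact separation of the training data is impossible or leads to overfitting.
Each training point $x_n$ is given a slack variable $\xi_n \ge 0$: $\xi_n = 0$ for points on or inside the correct margin boundary, and $\xi_n = |t_n - y(x_n)|$ otherwise. The constraint $t_n y(x_n) \ge 1$ is relaxed to $t_n y(x_n) \ge 1 - \xi_n$.

$\arg\min_{w,b} \{ \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n \}$
s.t. $t_n y(x_n) \ge 1 - \xi_n$, $\xi_n \ge 0$

$C > 0$ controls the trade-off between the slack penalty and the margin: the overlapping case is handled by allowing some points to violate the margin, and $C$ limits overfitting.
The Lagrange function is minimized with respect to $w$, $b$ and $\xi_n$. There are two sets of Lagrange multipliers, $a_n$ and $\mu_n$. After elimination the dual Lagrangian depends only on the multipliers $a_n$:

$\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n t_n a_m t_m k(x_n, x_m)$
s.t. $0 \le a_n \le C$, $n = 1, \dots, N$
$\sum_{n=1}^{N} a_n t_n = 0$

The constraints $0 \le a_n \le C$ are known as box constraints.
Points with $a_n = 0$ do not contribute to prediction; points with $a_n \ne 0$ are the support vectors.
If $0 < a_n < C$ then $\xi_n = 0$ and $x_n$ lies on the margin. If $a_n = C$ then $x_n$ lies inside the margin, and it is correctly classified if $\xi_n \le 1$ or misclassified if $\xi_n > 1$.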

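5. Multiclass SVM
Applying the SVM to multiclass problems is still an open problem.

A common approach is one-versus-the-rest: train K separate SVMs, each separating one class from the remaining classes.

Problems: ambiguous regions (inconsistent decisions between the K classifiers), and imbalanced training sets, since "the rest" contains far more points than "the one".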

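6. Single-class SVM
An unsupervised learning problem related to probability density estimation. Instead of modelling the density of data, the aim is to find a smooth boundary enclosing a region of high density. The boundary is chosen so that a point drawn from the distribution falls inside the region with a fixed probability between 0 and 1.

Two approaches are described in PRML: find a hyperplane in feature space that separates all but a fixed fraction of the training data from the origin, or find the smallest sphere in feature space that contains all but a fixed fraction of the data.

Single-class classification is also known as outlier detection, novelty detection or concept learning.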
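7. SVM for regression

SVM regression looks for a function y that is as flat as possible and whose value at each training input $x_n$ stays within $\epsilon$ of the target value:

$|y(x_n) - t_n| < \epsilon$

A flat model has low complexity, so flatness guards against overfitting and plays the role of a regularizer.
Optimization problem I:

$\min \{ \frac{1}{2} \|w\|^2 \}$
s.t. $|y(x_n) - t_n| < \epsilon$, $n = 1, \dots, N$

This modeling is analogous to the hard margin SVM. It has two problems: (1) the constraints may be infeasible, and (2) forcing every point to satisfy them raises the complexity and risks overfitting.
The fix is to allow points outside the $\epsilon$-tube but penalize them, as in the soft-margin SVM. Points inside the tube incur no error:

$E_\epsilon(y(x) - t) = 0$, if $|y(x) - t| < \epsilon$
$E_\epsilon(y(x) - t) = |y(x) - t| - \epsilon$, otherwise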
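Optimization problem II:

$\min \{ C \sum_{n=1}^{N} E_\epsilon(y(x_n) - t_n) + \frac{1}{2} \|w\|^2 \}$, $C > 0$

As in the SVM for classification, introduce slack variables, two for each $x_n$: $\xi_n \ge 0$ and $\hat{\xi}_n \ge 0$. If $t_n > y(x_n) + \epsilon$ then $\xi_n > 0$ and $\hat{\xi}_n = 0$; if $t_n < y(x_n) - \epsilon$ then $\hat{\xi}_n > 0$ and $\xi_n = 0$; if $x_n$ lies inside the tube both are 0. The constraints become

$t_n \le y(x_n) + \epsilon + \xi_n$
$t_n \ge y(x_n) - \epsilon - \hat{\xi}_n$

and the penalty is $\xi_n + \hat{\xi}_n$. Optimization problem III:

$\min \{ C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2} \|w\|^2 \}$
s.t. $t_n \le y(x_n) + \epsilon + \xi_n$
$t_n \ge y(x_n) - \epsilon - \hat{\xi}_n$
$\xi_n \ge 0$
$\hat{\xi}_n \ge 0$
$n = 1, \dots, N$

The problem is solved for $\xi_n$, $\hat{\xi}_n$ and $w$ with Lagrange multipliers and the KKT conditions.
Relationship between optimization problems III, II and I:

If $\xi_n = 0$ and $\hat{\xi}_n = 0$ for every $x_n$, all points lie inside the tube and the solution satisfies the constraints of optimization problem I.

For optimization problems II and III, compare the penalty terms. If $\xi_n > 0$ (or $\hat{\xi}_n > 0$), the KKT conditions give $t_n = y(x_n) + \epsilon + \xi_n$ (or $t_n = y(x_n) - \epsilon - \hat{\xi}_n$), so $E_\epsilon(y(x_n) - t_n) = \xi_n + \hat{\xi}_n$ with the other slack variable equal to 0. If $\xi_n = 0$ and $\hat{\xi}_n = 0$, both penalties are 0. The penalty terms agree in every case, so optimization problem II and optimization problem III are equivalent.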
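The equality between the $\epsilon$-insensitive error and the sum of the two slack variables can be checked directly. This is a small sketch; the function names and the three (prediction, target) pairs are assumptions for the example.

```python
def eps_insensitive(y, t, eps):
    """E_eps(y - t): zero inside the tube, |y - t| - eps outside."""
    return max(0.0, abs(y - t) - eps)

def slacks(y, t, eps):
    """Smallest slack variables satisfying the constraints of problem III."""
    xi = max(0.0, t - (y + eps))       # target above the tube
    xi_hat = max(0.0, (y - eps) - t)   # target below the tube
    return xi, xi_hat

eps = 0.1
# (prediction, target): inside the tube, above it, below it.
cases = [(0.0, 0.05), (0.0, 0.5), (0.0, -0.3)]
results = [(eps_insensitive(y, t, eps), sum(slacks(y, t, eps)))
           for y, t in cases]
```

For each case at most one slack variable is nonzero, and the two penalties coincide.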

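8. Relevance Vector Machine (RVM)

The RVM is a sparse Bayesian model.
The SVM is a discriminant function, whereas the RVM is a probabilistic discriminative model.
The RVM is typically sparser than the SVM, so prediction is faster.
The SVM complexity parameter C has to be set by cross-validation.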
by definitionSVM
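Like the SVM, the RVM can be used for both regression and classification.

RVM for regression

The RVM is a linear basis function model with a particular prior on $w$:

$y(x, w) = w^T \phi(x)$

$p(t | x, w, \beta) = N(t | y(x, w), \beta^{-1})$

$p(w | \alpha) = \prod_{i=1}^{M} N(w_i | 0, \alpha_i^{-1})$

Compared with the prior $p(w | \alpha) = N(w | 0, \alpha^{-1} I)$ of Bayesian linear regression, the RVM gives each component of $w$ its own hyper-parameter $\alpha_i$.
As a Bayesian model, the RVM marginalizes over $w$ to obtain the predictive distribution, and sets the hyper-parameters by maximizing the marginal likelihood.
Given N observations $x_n$ collected in $X$, with target values $t = (t_1, \dots, t_N)^T$, the marginal likelihood is

$p(t | X, \alpha, \beta) = \int p(t | X, w, \beta) p(w | \alpha) \, dw$

where $p(t | X, w, \beta)$ is the likelihood. Maximizing with respect to $\alpha$ and $\beta$ gives $\alpha^*$, $\beta^*$.

For a new $x$ the predictive distribution is

$p(t | x, X, t, \alpha^*, \beta^*) = \int p(t | x, w, \beta^*) p(w | X, t, \alpha^*, \beta^*) \, dw$

where $p(w | X, t, \alpha^*, \beta^*)$ is the posterior of $w$.

Sparsity of the RVM: when the marginal likelihood is maximized with respect to $\alpha$, many $\alpha_i$ go to infinity. The prior $N(w_i | 0, \alpha_i^{-1})$ then concentrates at 0, and so does the posterior of the corresponding $w_i$, so those components of $w$ are effectively 0.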

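RVM for classification

For binary classification, $y(x, w)$ is passed through a logistic sigmoid:

$y(x, w) = \sigma(w^T \phi(x)) = p(C_1 | x, w)$

$p(w | \alpha) = \prod_{i=1}^{M} N(w_i | 0, \alpha_i^{-1})$

The marginal likelihood for classification is no longer analytically tractable. As in Bayesian Logistic Regression, the Laplace approximation is used, both to obtain $\alpha^*$ and for the marginalization in the predictive distribution.
The difference between the RVM and Bayesian Logistic Regression lies in the prior: Bayesian Logistic Regression uses $p(w) = N(w | m_0, S_0)$ with fixed parameters, whereas the RVM prior has one hyper-parameter per weight, learned as $\alpha^*$.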

Model | Time for Training | Time for Prediction | Remark
SVM (Frequentist Sparse Model) | N^2 approx. | |
RVM (Bayesian Sparse Model) | | |
Gaussian Process | N^3 | N^2 |
Frequentist linear basis function model for regression | M^2(N+M) | |
Bayesian linear basis function model for regression | M^2(N+M) (get the necessities of the predictive distribution) | M^2 (specialize the predictive distribution for given data point) |
Frequentist Logistic regression | | |
Bayesian Logistic regression | | |
Frequentist Neural network | | |
Bayesian Neural network | | |

N: number of training data
M: number of basis functions
V: number of support vectors

Parametric/Non-parametric

Frequentist/Bayesian

Discriminative/Generative

Chapter 8 Graphical Models


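1. Probabilistic Graphical Model (PGM)
A PGM uses a graph to represent a joint distribution. There are two main kinds.

(1) Bayesian network: a directed graph. Each node $x_i$ has a set of parents $pa_i$; in the joint distribution, given $pa_i$, $x_i$ is conditionally independent of its non-descendants.

(2) Markov network: an undirected graph. If there is no edge between $(x_i, x_j)$, then in the joint distribution $x_i$ and $x_j$ are conditionally independent given all the other variables $x \setminus \{x_i, x_j\}$.

These statements are the semantics that link a graph to a joint distribution: a graph is a way of modeling (encoding) the conditional independence structure of a joint distribution.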

2. joint distributiongraph
PGM joint distribution Pgraph G
joint distribution

(1) G

Bayesian network descendant


I (G ) Markov network xi , x j
I p (G )
I(G) G I(P) joint distribution P
I(G) I (G ) I p (G )

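For a Bayesian network, the independencies can also be read off by D-separation; denote the resulting set $I_{d\text{-}sep}(G)$.

D-separation: let A, B, C be arbitrary non-intersecting sets of nodes from $x = \{x_1, \dots, x_K\}$. To decide whether A and B are conditionally independent given C, consider every path from a node in A to a node in B. A path is blocked if it includes a node such that either

(i) the arrows on the path meet head-to-tail or tail-to-tail at the node, and the node is in C; or

(ii) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in C.
If every path from A to B is blocked, A and B are d-separated by C, and A is conditionally independent of B given C.
If at least one path is not blocked, the graph does not imply this independence.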

Markov network U-separation I u sep (G )


U-separation G X Y Z
X Y Z X Y Z

D-separation U-separation
statement joint distribution P

I d sep (G ) = I (G ) I u sep (G ) = I (G )

(2) G P
P G I (G ) I ( P ) G
P I-map
(3) G P
(2)(3) I (G ) I ( P )

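I-map: if $I(G) \subseteq I(P)$, the graph is an I-map of the distribution (every independence implied by the graph holds in the distribution).

D-map: if $I(G) \supseteq I(P)$, the graph is a D-map of the distribution (every independence of the distribution is reflected in the graph).
Perfect map: if $I(G) = I(P)$, the graph is both an I-map and a D-map of the distribution, and is called a perfect map.
Let D be the set of distributions that have a Bayesian network as a perfect map, and U the set of distributions that have a Markov network as a perfect map. Neither set contains the other: $D \not\subseteq U$ and $U \not\subseteq D$.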

3. PGM
P PGM GG P

G P I-map G
I(G) P I(P)

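Relationship between factorization and I-map:
Bayesian network: P factorizes over G if and only if G is an I-map of P.
Markov network: if P is a Gibbs distribution that factorizes over G, then G is an I-map of P.
Conversely, if P is a positive distribution and G is an I-map of P, then P is a Gibbs distribution that factorizes over G.

Definition of "P factorizes over G":
Bayesian network: P factorizes over G if

$p(x) = \prod_{k=1}^{K} p(x_k | pa_k)$

where $pa_k$ are the parents of $x_k$ in G and K is the number of random variables.
Markov network: P factorizes over G if

$p(x) = \frac{1}{Z} \prod_C \psi_C(X_C)$

where each $X_C$ is a clique of G, $\psi_C(X_C) \ge 0$ is a potential function, and Z normalizes the product of potential functions so that $p(x)$ is a distribution.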

Conditional Independence
Factorization

4. joint distributiongraph

(1) P graph
P G G P I-map P I-map
G P I-map I (G ) I ( P ) G
P I-map P trivial
I (G ) P I-map
(2) I-map P
P I-map G I(G)
I-map I-map minimal I-map


random variable
Bayesian network Markov network
Markov network xi , x j

x \{xi , x j } joint
distribution CK2 K random variable
I-map
Bayesian network xi p ( xi | x1 ,..., xi 1 )
x j {x1 ,..., xi 1} joint distribution

p ( xi | x1 ,..., xi 1 ) = p ( xi |{x1 ,..., xi 1} \ x j ) x j {x1 ,..., xi 1}


pai {x1 ,..., xi 1} pai
xi joint distribution
I-map xi condition

{x1 ,..., xi 1} graph DAG Bayesian network

I-map
I-map

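(3) Different graphs as I-maps of the same P

Different graphs can encode exactly the same set of conditional independences. Graphs that encode the same conditional independences are called I-equivalent graphs.
How to tell whether two graphs are I-equivalent, for DAGs:
Skeleton: the undirected graph obtained by dropping the edge directions.
Immorality: a head-to-head structure X → Z ← Y is an immorality if there is no edge between X and Y.
Two graphs G1 and G2 are I-equivalent if and only if G1 and G2 have the same skeleton and the same set of immoralities.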

5. graph
Local independence global independences

Bayesian network G
Local independence xk pak xk descendants
Local independence I (G )
Global independence D-separation Global
independence I (G )
I (G ) I (G ) I ( P ) P joint distributionI (G ) I (G )
I (G ) I (G )
I (G ) G conditional independence
I (G )

Markov network G
Local independence pairwise independence xi , x j
I p (G ) Markov blanket
independence xk Markov blanket Markov
blanket I (G )
Global independence G X Y
Z X Y Z X Y Z
I (G )
I p (G ) I (G ) I (G ) P positive
distribution
I (G )

6. PGM
PGM
Joint distribution random variable
direct representation random variables joint
distribution joint distribution graph
work on the graph graph inference marginal conditional
PGM

joint distribution I-map


graph joint distribution I-map work on
this graph

Chapter 9 Mixture Models and EM


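1. EM: a Frequentist method for Maximum Likelihood
Consider a probabilistic model with observed variables X and latent variables Z, with joint distribution $p(X, Z | \theta)$. The Frequentist approach maximizes the likelihood of the observed data:

$p(X | \theta) = \sum_Z p(X, Z | \theta)$

Optimizing $p(X | \theta)$ directly is hard, but the complete-data likelihood $p(X, Z | \theta)$ is easy to optimize.
EM introduces a distribution $q(Z)$ over the latent variables. For any $q(Z)$, $\ln p(X | \theta)$ decomposes as

$\ln p(X | \theta) = L(q, \theta) + KL(q \| p)$

$L(q, \theta) = \sum_Z q(Z) \ln \{ \frac{p(X, Z | \theta)}{q(Z)} \}$

$KL(q \| p) = -\sum_Z q(Z) \ln \{ \frac{p(Z | X, \theta)}{q(Z)} \}$

The EM algorithm:
(1) E step: with $\theta^{old}$ fixed, maximize $L(q, \theta^{old})$ with respect to $q(Z)$. The solution is $q(Z) = p(Z | X, \theta^{old})$.
(2) M step: with $q(Z) = p(Z | X, \theta^{old})$ fixed, maximize $L(q, \theta)$ with respect to $\theta$ to obtain $\theta^{new}$.

Why $\ln p(X | \theta)$ increases: in the E step $\ln p(X | \theta^{old})$ does not depend on $q(Z)$, and the chosen $q(Z)$ makes the KL term 0, so $L(q, \theta^{old}) = \ln p(X | \theta^{old})$. In the M step $L(q, \theta^{new}) \ge L(q, \theta^{old})$. Since $q(Z)$ was set in the E step to $p(Z | X, \theta^{old})$, which in general differs from $p(Z | X, \theta^{new})$, we have $KL(q \| p) \ge 0$ at $\theta^{new}$. Hence $\ln p(X | \theta^{new}) \ge \ln p(X | \theta^{old})$, and each EM iteration does not decrease the likelihood.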

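Notes on EM:
(1) In the E step $\ln p(X | \theta)$ does not depend on $q(Z)$, so maximizing $L(q, \theta)$ with respect to $q(Z)$ is the same as driving $KL(q \| p)$ to 0, which gives $q(Z) = p(Z | X, \theta)$.
(2) In the M step, $L(q, \theta)$ is maximized with respect to $\theta$ with $q(Z)$ fixed. Substituting $q(Z) = p(Z | X, \theta^{old})$:

$L(q, \theta) = \sum_Z p(Z | X, \theta^{old}) \ln p(X, Z | \theta) - \sum_Z p(Z | X, \theta^{old}) \ln p(Z | X, \theta^{old})$
$= \sum_Z p(Z | X, \theta^{old}) \ln p(X, Z | \theta) + const$

so maximizing $L(q, \theta)$ with respect to $\theta$ is equivalent to maximizing

$Q(\theta, \theta^{old}) = \sum_Z p(Z | X, \theta^{old}) \ln p(X, Z | \theta)$

$Q(\theta, \theta^{old})$ is the expectation of the complete-data log likelihood $\ln p(X, Z | \theta)$ with respect to the posterior $p(Z | X, \theta^{old})$ of Z given X. The M step therefore maximizes the expected complete-data log likelihood.
In summary: the E step computes $Q(\theta, \theta^{old})$, and the M step maximizes $Q(\theta, \theta^{old})$ with respect to $\theta$.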

EM observed variables Frequentist


Bayesian EM

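2. Gaussian Mixture Model (GMM)

Gaussian mixture distribution:

$p(x) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma_k)$

The mixture is not a member of the Exponential Family.
Introduce a random vector z with a 1-of-K representation:

$p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$, with $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$

Let $p(x | z_k = 1) = N(x | \mu_k, \Sigma_k)$, so that the distribution of x given z is

$p(x | z) = \prod_{k=1}^{K} N(x | \mu_k, \Sigma_k)^{z_k}$

Marginalizing over z recovers the mixture:

$p(x) = \sum_z p(z) p(x | z) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma_k)$

GMM: N observations of x are assumed to be drawn from a Gaussian mixture distribution.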

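From the EM viewpoint, z is the latent variable and x is the observed variable. For N data points, X denotes the observed $x_n$ and Z the corresponding latent $z_n$.
Given the observations, assumed to come from a Gaussian mixture distribution, the task is inference of the parameters

$\pi = \{\pi_k : k = 1, \dots, K\}$, $\mu = \{\mu_k : k = 1, \dots, K\}$, $\Sigma = \{\Sigma_k : k = 1, \dots, K\}$

Without EM one would maximize $p(X | \pi, \mu, \Sigma)$ directly. For the GMM,

$\ln p(X | \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \{ \sum_{k=1}^{K} \pi_k N(x_n | \mu_k, \Sigma_k) \}$

The ln acts on a sum of Gaussians, so setting the derivatives to 0 does not give a closed form solution.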


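The complete-data log likelihood of the GMM:

$p(X, Z | \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} N(x_n | \mu_k, \Sigma_k)^{z_{nk}}$

$\ln p(X, Z | \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \{ \ln \pi_k + \ln N(x_n | \mu_k, \Sigma_k) \}$

The complete-data log likelihood is easy to maximize.

Z is a latent variable, however, so the complete-data log likelihood cannot be evaluated. EM instead takes the expectation of this likelihood over Z, using the posterior of Z:

$p(Z | X, \mu, \Sigma, \pi) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} [\pi_k N(x_n | \mu_k, \Sigma_k)]^{z_{nk}}$

which follows from p(z) and p(x|z).

The expectation of the complete-data log likelihood over Z is

$E_Z[\ln p(X, Z | \mu, \Sigma, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \{ \ln \pi_k + \ln N(x_n | \mu_k, \Sigma_k) \}$, with $\gamma(z_{nk}) = E[z_{nk}]$

where the expectation of $z_{nk}$ under the posterior of Z is

$\gamma(z_{nk}) = E[z_{nk}] = \frac{\pi_k N(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n | \mu_j, \Sigma_j)}$

Maximizing $E_Z[\ln p(X, Z | \mu, \Sigma, \pi)]$ with respect to $\mu, \Sigma, \pi$ gives

$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) x_n$

$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T$

$\pi_k = \frac{N_k}{N}$

$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$

EM for the GMM: given $\mu^{old}, \Sigma^{old}, \pi^{old}$,

E step: compute the posterior expectation $\gamma(z_{nk})$ of each $z_{nk}$. This fixes the distribution over Z.

M step: with $\gamma(z_{nk}) = E[z_{nk}]$ fixed, maximize $E_Z[\ln p(X, Z | \mu, \Sigma, \pi)]$ with respect to $\mu, \Sigma, \pi$ to obtain $\mu^{new}, \Sigma^{new}, \pi^{new}$.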


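GMM and clustering: a GMM can be used for clustering into K clusters.
For a data point x, the 1-of-K vector z indicates which component generated x, i.e. which entry of z is 1.
$\gamma(z_{nk})$ is the posterior probability of that assignment:

$\gamma(z_{nk}) = p(z_{nk} = 1 | x_n) = \frac{p(z_{nk} = 1) p(x_n | z_{nk} = 1)}{\sum_{j=1}^{K} p(z_{nj} = 1) p(x_n | z_{nj} = 1)} = \frac{\pi_k N(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n | \mu_j, \Sigma_j)}$

$\gamma(z_{nk})$ is the responsibility that component k takes for $x_n$, i.e. the probability that $x_n$ belongs to cluster k. Once the GMM parameters $\mu, \Sigma, \pi$ have been fitted, the clustering is read off from $\gamma(z_{nk})$.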
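The E step and M step for a GMM can be sketched for the one-dimensional case, where each covariance reduces to a scalar variance. This is a minimal illustration rather than a robust implementation: the function names, the toy data and the initial values are assumptions, and there is no safeguard against collapsing variances.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(x | mu, var) for scalar x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, pis, n_iter=20):
    """Run EM on 1-D data; returns fitted parameters and the log likelihoods."""
    K, N = len(mus), len(xs)
    history = []
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk) under the current parameters.
        gamma, loglik = [], 0.0
        for x in xs:
            w = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            s = sum(w)
            loglik += math.log(s)
            gamma.append([wk / s for wk in w])
        history.append(loglik)
        # M step: closed-form updates for mu_k, variance_k and pi_k.
        for k in range(K):
            Nk = sum(g[k] for g in gamma)
            mus[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / Nk
            variances[k] = sum(g[k] * (x - mus[k]) ** 2
                               for g, x in zip(gamma, xs)) / Nk
            pis[k] = Nk / N
    return mus, variances, pis, history

# Toy data (hypothetical): two well-separated groups around -2 and +2.
xs = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
mus, variances, pis, history = em_gmm_1d(xs, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

The recorded log likelihood never decreases from one iteration to the next, and the two means move to the centres of the two groups.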

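3. Relationship between GMM and k-means
k-means can be seen as a limiting case of the GMM.
For clustering, k-means makes a hard assignment of data points to clusters, whereas the GMM makes a soft assignment through $\gamma(z_{nk})$.
Consider a GMM in which every component has covariance $\Sigma_k = \epsilon I$. The responsibilities become

$\gamma(z_{nk}) = \frac{\pi_k \exp\{ -\|x_n - \mu_k\|^2 / 2\epsilon \}}{\sum_{j=1}^{K} \pi_j \exp\{ -\|x_n - \mu_j\|^2 / 2\epsilon \}}$

Among the K terms, let $j^*$ be the component for which $\|x_n - \mu_j\|^2$ is smallest. As $\epsilon \to 0$, the term $\exp\{ -\|x_n - \mu_{j^*}\|^2 / 2\epsilon \}$ goes to 0 most slowly, so $\gamma(z_{nk}) \to 0$ for $k \ne j^*$ and $\gamma(z_{nk}) \to 1$ for $k = j^*$:

$\lim_{\epsilon \to 0} \gamma(z_{nk}) = 0$ if $k \ne j^*$, and $1$ if $k = j^*$

so $(\lim_{\epsilon \to 0} \gamma(z_{n1}), \dots, \lim_{\epsilon \to 0} \gamma(z_{nK}))$ is a 1-of-K vector and the GMM makes a hard assignment.
In the limit $\epsilon \to 0$, $N_k$ is the number of points assigned to cluster k and the update for $\mu_k$ is the mean of those points, which is the k-means update.

K-means estimates only the cluster means, not the cluster covariances.

The expected complete-data log likelihood in the limit:

$E_Z[\ln p(X, Z | \mu, \Sigma, \pi)] \to -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \|x_n - \mu_k\|^2 + const$

so maximizing $E_Z[\ln p(X, Z | \mu, \Sigma, \pi)]$ for the GMM is equivalent to minimizing the k-means distortion.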
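The passage from soft to hard assignment as $\epsilon \to 0$ can be seen numerically. The sketch below assumes scalar data and a shared variance $\epsilon$; the function name and the test values are illustrative.

```python
import math

def responsibilities(x, mus, pis, eps):
    """gamma(z_nk) for a GMM whose components all have variance eps."""
    # Work with logs and subtract the maximum to avoid underflow for small eps.
    logits = [math.log(p) - (x - m) ** 2 / (2 * eps) for p, m in zip(pis, mus)]
    top = max(logits)
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    return [wi / total for wi in w]

mus, pis = [0.0, 1.0], [0.5, 0.5]
soft = responsibilities(0.4, mus, pis, eps=10.0)   # broad components
hard = responsibilities(0.4, mus, pis, eps=1e-3)   # nearly k-means
```

With a large $\epsilon$ the point at 0.4 is shared almost equally between the two components; with a small $\epsilon$ it is assigned essentially entirely to the nearer mean.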

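Recognizing p(x) from the form of ln p(x), where x is a random variable (in a Bayesian treatment the parameters are random variables too):

If ln p(x) is a quadratic function of x, then p(x) is Gaussian.

If ln p(x) is a linear function of x and ln x, then p(x) is a Gamma distribution.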

Chapter 10 Approximate Inference


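1. Approximation
The central task in a probabilistic model is, given the observation X, to evaluate the posterior P(Z|X) of the latent variables Z and expectations with respect to P(Z|X). When P(Z|X) is analytically intractable, an approximation is needed.
On latent variables: in a Bayesian treatment the parameters are random variables, and in a Probabilistic Graphical Model the parameters appear as nodes just like latent variables. In this chapter the parameters are therefore treated as latent variables.
Approximation methods are either deterministic or stochastic. The deterministic ones include the Laplace approximation and variational inference; the stochastic ones are MCMC sampling methods.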
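2. Variational inference
Consider a probabilistic model P(X, Z) with observed variables $X = \{x_1, \dots, x_N\}$ and latent variables $Z = \{z_1, \dots, z_N\}$.
The goal is an approximation to P(Z|X) and to the model evidence P(X).

For any $q(Z)$, ln P(X) decomposes as

$\ln p(X) = L(q) + KL(q \| p)$

$L(q) = \int q(Z) \ln \{ \frac{p(X, Z)}{q(Z)} \} \, dZ$

$KL(q \| p) = -\int q(Z) \ln \{ \frac{p(Z | X)}{q(Z)} \} \, dZ$

To make $q(Z)$ approximate P(Z|X) we would minimize KL(q||p) with respect to q(Z). Because P(Z|X) is intractable, KL(q||p) cannot be minimized directly. The joint distribution P(X, Z) is available, however, and $\ln p(X)$ does not depend on q(Z), so minimizing KL(q||p) is equivalent to maximizing L(q).
q(Z) is restricted to a family that is tractable yet flexible. A common choice is the factorized form

$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$

where the $Z_i$ are disjoint groups that partition Z.

q(Z) is called the variational distribution.

Substituting the factorized q(Z) into L(q) and isolating one factor $q_j$:

$L(q) = \int \prod_i q_i \{ \ln p(X, Z) - \sum_i \ln q_i \} \, dZ$
$= \int q_j \ln \tilde{p}(X, Z_j) \, dZ_j - \int q_j \ln q_j \, dZ_j + const$
$= -KL(q_j \| \tilde{p}(X, Z_j)) + const$

where $\ln \tilde{p}(X, Z_j) = E_{i \ne j}[\ln p(X, Z)] + const$ and $E_{i \ne j}$ is the expectation with respect to $\prod_{i \ne j} q_i(Z_i)$.

Maximizing L(q) with respect to $q_j$ is equivalent to minimizing $KL(q_j \| \tilde{p}(X, Z_j))$ with respect to $q_j$, and the KL divergence is minimized when the two distributions are equal:

$\ln q_j^*(Z_j) = E_{i \ne j}[\ln p(X, Z)] + const$

$\ln q_j^*(Z_j)$ depends on the other M-1 factors $q_i(Z_i)$, $i \ne j$.

Variational inference therefore maximizes L(q) over q(Z) iteratively: initialize every $q_i(Z_i)$, then cycle through the factors, replacing each $q_i(Z_i)$ by $q_i^*(Z_i)$.
Convergence is guaranteed because L(q) is convex with respect to each factor $q_i(Z_i)$.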

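3. Variational inference for the Bayesian Gaussian Mixture Model (Bayesian GMM)

Bayesian GMM
In the GMM the likelihood is

$p(X | Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K} N(x_n | \mu_k, \Lambda_k^{-1})^{z_{nk}}$

The Bayesian treatment regards the parameters as random variables with priors, chosen to be conjugate priors:

$p(Z | \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}$

$p(\pi) = Dir(\pi | \alpha_0)$, $\alpha_0 = (\alpha_0, \dots, \alpha_0)$

$p(\mu, \Lambda) = p(\mu | \Lambda) p(\Lambda) = \prod_{k=1}^{K} N(\mu_k | m_0, (\beta_0 \Lambda_k)^{-1}) W(\Lambda_k | W_0, \nu_0)$

The conjugate prior for each component is Gaussian-Wishart.

Following the PGM of these random variables, the joint distribution is

$p(X, Z, \pi, \mu, \Lambda) = p(X | Z, \mu, \Lambda) p(Z | \pi) p(\pi) p(\mu | \Lambda) p(\Lambda)$

The target is the posterior $p(Z, \pi, \mu, \Lambda | X)$, approximated by a variational distribution $q(Z, \pi, \mu, \Lambda)$ with the factorization

$q(Z, \pi, \mu, \Lambda) = q(Z) q(\pi, \mu, \Lambda)$

Here the parameters are random variables and are treated just like latent variables.
Variational inference
Optimizing q(Z) to maximize L(q):

$\ln q^*(Z) = E_{\pi, \mu, \Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + const$
$= E_\pi[\ln p(Z | \pi)] + E_{\mu, \Lambda}[\ln p(X | Z, \mu, \Lambda)] + const$

where the factors of the joint distribution $p(X, Z, \pi, \mu, \Lambda)$ that do not involve Z are absorbed into const. Therefore

$\ln q^*(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + const$

$\ln \rho_{nk} = E[\ln \pi_k] + \frac{1}{2} E[\ln |\Lambda_k|] - \frac{D}{2} \ln(2\pi) - \frac{1}{2} E_{\mu_k, \Lambda_k}[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)]$

Optimizing $q(\pi, \mu, \Lambda)$ to maximize L(q):

$\ln q^*(\pi, \mu, \Lambda) = \ln p(\pi) + \sum_{k=1}^{K} \ln p(\mu_k, \Lambda_k) + E_Z[\ln p(Z | \pi)] + \sum_{n=1}^{N} \sum_{k=1}^{K} E[z_{nk}] \ln N(x_n | \mu_k, \Lambda_k^{-1}) + const$

$\ln q^*(\pi, \mu, \Lambda)$ depends on Z only through expectations, and it separates into terms involving $\pi$ and terms involving each $(\mu_k, \Lambda_k)$, so

$q(Z, \pi, \mu, \Lambda) = q(Z) q(\pi, \mu, \Lambda) = q(Z) q(\pi) q(\mu, \Lambda) = q(Z) q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k)$

Variational inference alternates between the two updates: $q(Z)$, and $q(\pi, \mu, \Lambda)$ in the form of $q(\pi)$ and the K factors $q(\mu_k, \Lambda_k)$, each increasing L(q). The resulting $q^*(Z) q^*(\pi, \mu, \Lambda)$ approximates $p(Z, \pi, \mu, \Lambda | X)$.
Choosing K in the Bayesian GMM: K controls the number of mixing coefficients. $q^*(\pi)$ is a Dirichlet over the K mixing coefficients, and for components that explain little of the data $E[\pi_k] \to 0$. Starting from a GMM with a large K, variational inference effectively removes the components with $E[\pi_k] \to 0$ and leaves an effective number of components K*.

Predictive distribution:
In the Bayesian treatment, for a new $\hat{x}$ the predictive distribution is

$p(\hat{x} | X) = \sum_{\hat{z}} \int\int\int p(\hat{x} | \hat{z}, \mu, \Lambda) p(\hat{z} | \pi) p(\pi, \mu, \Lambda | X) \, d\pi \, d\mu \, d\Lambda$

Replacing $p(\pi, \mu, \Lambda | X)$ by $q^*(\pi, \mu, \Lambda)$, $p(\hat{x} | X)$ is a mixture of K Student's t distributions.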

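4. Expectation Propagation

Variational inference minimizes $KL(q(Z) \| p(Z | X))$ with respect to $q(Z)$ so that $q(Z)$ approximates $p(Z | X)$. Expectation propagation uses the reversed form of the KL divergence: it minimizes KL(p||q) with respect to q, with p fixed.
Let q(z) be a member of the exponential family, $q(z) = h(z) g(\eta) \exp\{ \eta^T u(z) \}$. Then

$KL(p \| q) = -\ln g(\eta) - \eta^T E_{p(z)}[u(z)] + const$

where const is independent of the natural parameter $\eta$.

Setting the gradient of the KL divergence with respect to $\eta$ to zero:

$-\nabla \ln g(\eta) = E_{p(z)}[u(z)]$

Since q(z) is in the exponential family, $-\nabla \ln g(\eta) = E_{q(z)}[u(z)]$, and therefore $E_{p(z)}[u(z)] = E_{q(z)}[u(z)]$. The optimal q(z) matches the expected sufficient statistics of p(z): this is moment matching.
Now consider a probabilistic model whose joint distribution is $p(D, \theta) = \prod_i f_i(\theta)$, where D is the observed data and $\theta$ the latent variables. The posterior and the model evidence are

$p(\theta | D) = \frac{1}{p(D)} \prod_i f_i(\theta)$, $p(D) = \int \prod_i f_i(\theta) \, d\theta$

For an i.i.d. model, $f_i(\theta) = p(x_i | \theta)$ and $f_0(\theta) = p(\theta)$.

The posterior $p(\theta | D)$ is approximated by

$q(\theta) = \frac{1}{Z} \prod_i \tilde{f}_i(\theta)$

with one factor $\tilde{f}_i(\theta)$ for each factor $f_i(\theta)$ of the model. Each $\tilde{f}_i(\theta)$ is restricted to the exponential family, so $q(\theta)$ is also in the exponential family.

Ideally one would minimize $KL(p(\theta | D) \| q(\theta))$ with respect to $q(\theta)$, which is intractable. Instead, each factor $\tilde{f}_j(\theta)$ of $q(\theta)$ is refined in turn. Remove $\tilde{f}_j(\theta)$ from $q(\theta)$ and put the true factor $f_j(\theta)$ in its place:

$q^{\setminus j}(\theta) = \frac{q(\theta)}{\tilde{f}_j(\theta)}$, $\frac{1}{Z_j} f_j(\theta) q^{\setminus j}(\theta)$, $Z_j = \int f_j(\theta) q^{\setminus j}(\theta) \, d\theta$

The revised $\tilde{f}_j(\theta)$ is found by minimizing $KL( \frac{1}{Z_j} f_j(\theta) q^{\setminus j}(\theta) \| q^{new}(\theta) )$ with respect to $q^{new}(\theta)$. Since $q^{new}(\theta)$ is in the exponential family this is done by moment matching, and then

$\tilde{f}_j^{new}(\theta) = K \frac{q^{new}(\theta)}{q^{\setminus j}(\theta)}$, $K = Z_j$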

Chapter 11 Sampling Method


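Assume a source of random numbers uniformly distributed over (0, 1) is available. The task is to generate samples from a given distribution p(z).

1. Basic sampling method
Let F(x) be the cumulative distribution function of the target distribution and let y be uniformly distributed over (0, 1). Then

$F^{-1}(y)$ is distributed according to F(x).

This requires that $F^{-1}(y)$ can be evaluated.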
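As a minimal sketch of this method, take the exponential distribution, whose CDF $F(x) = 1 - \exp(-\lambda x)$ has the inverse $F^{-1}(y) = -\ln(1 - y) / \lambda$. The function name, the rate and the seed below are assumptions for the example.

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """Draw n samples from Exp(lam) by applying F^{-1} to uniform numbers."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - y lies in (0, 1] and the log is finite.
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0, n=20000)
sample_mean = sum(samples) / len(samples)   # should be close to 1 / lam
```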

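2. Sampling with a proposal distribution
When p(z) cannot be sampled directly, samples are drawn from a simpler distribution q(z), called the proposal distribution.
Rejection sampling
Consider a single-variable distribution $p(z) = \frac{1}{Z_p} \tilde{p}(z)$, where $\tilde{p}(z)$ is easy to evaluate for any z but $Z_p$ is unknown. For example, for

$Gam(z | a, b) = \frac{b^a z^{a-1} \exp(-bz)}{\Gamma(a)}$

we can take $\tilde{p}(z) = z^{a-1} \exp(-bz)$ and $Z_p = \Gamma(a) / b^a$.

So p(z) can be evaluated at any z up to the normalization constant $Z_p$, without sampling z and without knowing $Z_p$.
Rejection sampling needs, besides p(z), a proposal distribution q(z) that is easy to sample from and a constant k such that $k q(z) \ge \tilde{p}(z)$ for all z.
Procedure: draw a sample $z_0$ from q(z), then draw $u_0$ uniformly from $[0, k q(z_0)]$, so that the pair $(z_0, u_0)$ is uniform under the curve kq(z). If $u_0 > \tilde{p}(z_0)$, reject $z_0$; otherwise accept it. The accepted $z_0$ are distributed as p(z).

k should be as small as possible: the larger k is, the more often $z_0$ is rejected and the less efficient rejection sampling becomes.

The inefficiency of rejection sampling comes from the samples discarded when $u_0 > \tilde{p}(z_0)$.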
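The procedure can be sketched for a simple target. The example below assumes the unnormalized density $\tilde{p}(z) = z(1 - z)$ on [0, 1] (a Beta(2, 2) up to normalization), a uniform proposal $q(z) = 1$, and $k = 0.25$, which is the maximum of $\tilde{p}$; all names and values are illustrative.

```python
import random

def p_tilde(z):
    """Unnormalized target density on [0, 1]."""
    return z * (1.0 - z)

def rejection_sample(n, k=0.25, seed=0):
    """Return n accepted samples and the observed acceptance rate."""
    rng = random.Random(seed)
    samples, proposals = [], 0
    while len(samples) < n:
        z0 = rng.random()             # z0 ~ q(z) = Uniform(0, 1)
        u0 = rng.uniform(0.0, k)      # u0 ~ Uniform[0, k q(z0)], with q(z0) = 1
        proposals += 1
        if u0 <= p_tilde(z0):         # accept; otherwise z0 is discarded
            samples.append(z0)
    return samples, n / proposals

samples, acceptance_rate = rejection_sample(5000)
sample_mean = sum(samples) / len(samples)
```

For this target the expected acceptance rate is $\int \tilde{p}(z) \, dz / k = (1/6) / 0.25 = 2/3$, which shows directly how a larger k would waste more proposals.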

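Importance sampling
Often the goal is not to draw samples z from p(z) but to evaluate the expectation of a function f(z) under p(z).
With L samples drawn from p(z):

$E[f] \approx \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$

When p(z) cannot be sampled, use a proposal distribution q(z) as in rejection sampling, and draw the L samples from q(z):

$E[f] = \int f(z) p(z) \, dz = \int f(z) \frac{p(z)}{q(z)} q(z) \, dz \approx \frac{1}{L} \sum_{l=1}^{L} \frac{p(z^{(l)})}{q(z^{(l)})} f(z^{(l)})$

The ratios $r_l = \frac{p(z^{(l)})}{q(z^{(l)})}$ are called importance weights.

If p(z) can only be evaluated up to normalization, $p(z) = \tilde{p}(z) / Z_p$ with normalization constant $Z_p$, and likewise $q(z) = \tilde{q}(z) / Z_q$:

$E[f] = \int f(z) p(z) \, dz = \frac{Z_q}{Z_p} \int f(z) \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z) \, dz \approx \frac{Z_q}{Z_p} \frac{1}{L} \sum_{l=1}^{L} \tilde{r}_l f(z^{(l)})$

$\tilde{r}_l = \frac{\tilde{p}(z^{(l)})}{\tilde{q}(z^{(l)})}$

The ratio of normalization constants is estimated from the same samples:

$\frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(z) \, dz = \int \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z) \, dz \approx \frac{1}{L} \sum_{l=1}^{L} \tilde{r}_l$

so that

$E[f] \approx \sum_{l=1}^{L} w_l f(z^{(l)})$

where the $z^{(l)}$ are the L samples from q(z) and

$w_l = \frac{\tilde{r}_l}{\sum_{m=1}^{L} \tilde{r}_m} = \frac{\tilde{p}(z^{(l)}) / q(z^{(l)})}{\sum_m \tilde{p}(z^{(m)}) / q(z^{(m)})}$

The weights $w_l$ require p(z) only through $\tilde{p}(z)$.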
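The self-normalized estimator $E[f] \approx \sum_l w_l f(z^{(l)})$ can be sketched as follows. The example assumes an unnormalized standard Gaussian target, a wider Gaussian proposal (standard deviation 2), and $f(z) = z^2$, whose true expectation is 1; the function name and these choices are illustrative.

```python
import math
import random

def self_normalized_is(f, p_tilde, q_sample, q_tilde, L, rng):
    """Estimate E_p[f] with weights w_l = r_l / sum_m r_m, r_l = p~/q~."""
    zs = [q_sample(rng) for _ in range(L)]
    r = [p_tilde(z) / q_tilde(z) for z in zs]
    total = sum(r)
    return sum(rl / total * f(z) for rl, z in zip(r, zs))

rng = random.Random(0)
estimate = self_normalized_is(
    f=lambda z: z * z,
    p_tilde=lambda z: math.exp(-0.5 * z * z),      # N(0, 1) up to a constant
    q_sample=lambda g: g.gauss(0.0, 2.0),          # draw from N(0, 2^2)
    q_tilde=lambda z: math.exp(-z * z / 8.0),      # N(0, 2^2) up to a constant
    L=20000,
    rng=rng,
)
```

Neither normalization constant is needed, because any constant factor in the ratios cancels when the weights are normalized.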

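Sampling-importance-resampling (SIR)
Rejection sampling needs a suitable constant k, and importance sampling gives expectations but not samples from p(z). Sampling-importance-resampling (SIR) also uses a proposal distribution q(z) but does not need k:
(1) draw L samples $\{ z^{(l)} : l = 1, \dots, L \}$ from q(z);
(2) compute the weights $w_l$ as in importance sampling;
(3) draw L samples from the discrete set $\{ z^{(l)} \}$ with probabilities $w_l$.

The resulting L samples are approximately distributed as p(z), and the approximation becomes exact as L goes to infinity.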

MCMC

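3. Monte Carlo EM algorithm

The E step of EM evaluates the expectation of the complete-data log likelihood over the latent variables:

$Q(\theta, \theta^{old}) = \int p(Z | X, \theta^{old}) \ln p(X, Z | \theta) \, dZ$

When this cannot be done analytically, it is approximated by sampling. Draw samples $\{ Z^{(l)} : l = 1, \dots, L \}$ of the latent variables from $p(Z | X, \theta^{old})$ and approximate $Q(\theta, \theta^{old})$ by

$Q(\theta, \theta^{old}) \approx \frac{1}{L} \sum_{l=1}^{L} \ln p(Z^{(l)}, X | \theta)$

The M step is as in standard EM: maximize $Q(\theta, \theta^{old})$ with respect to $\theta$.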

4. Markov chain
Markov chain: a series of random variables $\{ z^{(l)} : l = 1, \dots, M \}$ such that

$p(z^{(m+1)} | z^{(1)}, \dots, z^{(m)}) = p(z^{(m+1)} | z^{(m)})$ for all m.

Transition probability: $T_m(z^{(m)}, z^{(m+1)}) = p(z^{(m+1)} | z^{(m)})$

Homogeneous: if the transition probability $T_m$ is the same for all m, the Markov chain is homogeneous.

Invariant distribution: $p^*(z)$ is an invariant distribution of the Markov chain if

$p^*(z) = \sum_{z'} T(z', z) p^*(z')$

where T is the transition probability of a homogeneous Markov chain.

Detailed balance: a sufficient condition for $p^*(z)$ to be an invariant distribution of the Markov chain is

$p^*(z) T(z, z') = p^*(z') T(z', z)$
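5. Metropolis-Hastings
Like rejection sampling and importance sampling, Metropolis-Hastings handles a p(z) that cannot be sampled directly by sampling from a proposal. Here the proposal depends on the current state: a candidate $z^*$ is drawn from $q(z | z^{(\tau)})$ and accepted with probability $A(z^*, z^{(\tau)})$:

$z^{(\tau+1)} = z^*$ if accepted, $z^{(\tau+1)} = z^{(\tau)}$ if rejected

$A(z^*, z^{(\tau)}) = \min \{ 1, \frac{\tilde{p}(z^*) q(z^{(\tau)} | z^*)}{\tilde{p}(z^{(\tau)}) q(z^* | z^{(\tau)})} \}$

where $\tilde{p}(z)$ is p(z) up to normalization and can be evaluated for any z.

The samples form a Markov chain $\{ z^{(\tau)} : \tau = 1, \dots \}$.

p(z) satisfies detailed balance with respect to this chain, so p(z) is an invariant distribution of the Markov chain, and for large $\tau$ the samples $z^{(\tau)}$ are distributed as p(z).
For a continuous state space a common choice is a Gaussian centred on the current state as the proposal distribution. Its variance is a trade-off: with a small variance most proposals are accepted but the chain moves slowly through the state space, while with a large variance most proposals are rejected.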
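A minimal sketch of the algorithm, assuming a Gaussian proposal centred on the current state. That proposal is symmetric, so the q terms in the acceptance probability cancel. The target (a standard Gaussian given through its unnormalized log density), the step size, the burn-in length and the function name are assumptions for the example.

```python
import math
import random

def metropolis(log_p_tilde, z0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    z, chain = z0, []
    for _ in range(n_samples):
        z_star = rng.gauss(z, step)                   # candidate from q(z | z_tau)
        log_a = log_p_tilde(z_star) - log_p_tilde(z)  # q terms cancel (symmetric)
        if rng.random() < math.exp(min(0.0, log_a)):  # accept with prob. min(1, ratio)
            z = z_star
        chain.append(z)                               # on rejection, repeat old state
    return chain

chain = metropolis(lambda z: -0.5 * z * z, z0=3.0, n_samples=20000)
kept = chain[1000:]                                   # discard burn-in
mean = sum(kept) / len(kept)
var = sum((z - mean) ** 2 for z in kept) / len(kept)
```

Changing `step` shows the trade-off: very small steps are almost always accepted but explore slowly, and very large steps are mostly rejected.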

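6. Gibbs sampling
Gibbs sampling is a special case of Metropolis-Hastings for $p(z) = p(z_1, \dots, z_M)$.
Initialize $\{ z_i : i = 1, \dots, M \}$ as the sample for $\tau = 1$. Given the sample at step $\tau$, the sample at step $\tau + 1$ is obtained by:

- Sample $z_1^{(\tau+1)} \sim p(z_1 | z_2^{(\tau)}, z_3^{(\tau)}, \dots, z_M^{(\tau)})$
- Sample $z_2^{(\tau+1)} \sim p(z_2 | z_1^{(\tau+1)}, z_3^{(\tau)}, \dots, z_M^{(\tau)})$
- ...
- Sample $z_j^{(\tau+1)} \sim p(z_j | z_1^{(\tau+1)}, \dots, z_{j-1}^{(\tau+1)}, z_{j+1}^{(\tau)}, \dots, z_M^{(\tau)})$
- ...
- Sample $z_M^{(\tau+1)} \sim p(z_M | z_1^{(\tau+1)}, z_2^{(\tau+1)}, \dots, z_{M-1}^{(\tau+1)})$

Gibbs sampling as Metropolis-Hastings: when sampling $z_k$, the move from $z$ to $z^*$ uses the proposal $q_k(z^* | z) = p(z_k^* | z_{\setminus k})$, where $z_{\setminus k}$ denotes $\{ z_i : i = 1, \dots, M \}$ without $z_k$. Only $z_k$ changes in the move, so $z_{\setminus k}^* = z_{\setminus k}$. The Metropolis-Hastings acceptance probability is

$A(z^*, z) = \frac{p(z^*) q_k(z | z^*)}{p(z) q_k(z^* | z)} = \frac{p(z_k^* | z_{\setminus k}^*) p(z_{\setminus k}^*) p(z_k | z_{\setminus k}^*)}{p(z_k | z_{\setminus k}) p(z_{\setminus k}) p(z_k^* | z_{\setminus k})} = 1$

so in Gibbs sampling, viewed as Metropolis-Hastings, every proposal is accepted.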
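A short sketch of Gibbs sampling for M = 2, assuming a zero-mean bivariate Gaussian target with unit variances and correlation $\rho$, for which each conditional is $N(\rho z_{other}, 1 - \rho^2)$. The function name, $\rho = 0.8$ and the seed are assumptions for the example.

```python
import math
import random

def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
    """Alternate draws from p(z1 | z2) and p(z2 | z1); every draw is kept."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)    # conditional standard deviation
    z1, z2, chain = 0.0, 0.0, []
    for _ in range(n_samples):
        z1 = rng.gauss(rho * z2, sd)   # z1 ~ p(z1 | z2)
        z2 = rng.gauss(rho * z1, sd)   # z2 ~ p(z2 | z1), using the new z1
        chain.append((z1, z2))
    return chain

chain = gibbs_bivariate_gaussian(rho=0.8, n_samples=20000)
n = len(chain)
m1 = sum(a for a, _ in chain) / n
m2 = sum(b for _, b in chain) / n
cov = sum((a - m1) * (b - m2) for a, b in chain) / n
v1 = sum((a - m1) ** 2 for a, _ in chain) / n
v2 = sum((b - m2) ** 2 for _, b in chain) / n
sample_corr = cov / math.sqrt(v1 * v2)
```

The sample correlation of the chain approaches $\rho$, although no draw is ever made from the joint distribution itself.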

Chapter 12 Continuous Latent Variables


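1. Principal Component Analysis (PCA)
PCA is an unsupervised learning method used for dimensionality reduction, feature extraction, data visualization and lossy data compression.
Maximum Variance Subspace
Given N observations $\{ x_n : n = 1, \dots, N \}$ of dimension D, find a subspace of dimension M < D such that the variance of the projected data of the N data points is maximized.
For M > 1 the directions are found one at a time: first the direction $u_1$ of maximum variance, then the direction $u_2$ that maximizes the variance of the projected data among directions orthogonal to $u_1$, and so on until M directions are found.
Formulation: for a direction $u_1$, the variance of the projected data over the N observations is

$\frac{1}{N} \sum_{n=1}^{N} \{ u_1^T x_n - u_1^T \bar{x} \}^2 = u_1^T S u_1$

where $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$ and S is the sample covariance matrix

$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$

Maximize $u_1^T S u_1$ with respect to $u_1$. Without a constraint $\|u_1\|$ would grow without bound, so impose $u_1^T u_1 = 1$.
Introducing a Lagrange multiplier $\lambda_1$ gives $S u_1 = \lambda_1 u_1$, so $\lambda_1$ is an eigenvalue of S with eigenvector $u_1$, and $u_1^T S u_1 = \lambda_1$. The variance $u_1^T S u_1$ is largest when $\lambda_1$ is the largest eigenvalue of S and $u_1$ the corresponding eigenvector.
Similarly $u_2$ and the remaining directions are eigenvectors of S: the M-dimensional subspace is spanned by the eigenvectors with the M largest eigenvalues.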
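The eigenvector formulation can be sketched with NumPy. This is a minimal illustration: the function name `pca` and the toy data set (five points lying exactly on the line $x_2 = 2 x_1$) are assumptions for the example.

```python
import numpy as np

def pca(X, M):
    """Return the M largest eigenvalues of S and their eigenvectors (columns)."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]               # sample covariance, 1/N convention
    eigvals, eigvecs = np.linalg.eigh(S)     # symmetric S; ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]    # indices of the M largest
    return eigvals[order], eigvecs[:, order]

X = np.array([[-2.0, -4.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
lams, U = pca(X, M=1)
u1 = U[:, 0]                                 # first principal direction
```

Here $S = [[2, 4], [4, 8]]$, so the largest eigenvalue is 10 and $u_1$ is parallel to $(1, 2)$, up to sign; the projected variance $u_1^T S u_1$ equals that eigenvalue.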

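Minimum Projection Error

Let $\tilde{x}_n$ be the projection of $x_n$ onto the subspace. The subspace is chosen to minimize the mean squared distance between $x_n$ and $\tilde{x}_n$.
Formulation: as in the Maximum Variance Subspace, consider $u_1$ first (with the data assumed centred for simplicity). Minimize

$\frac{1}{N} \sum_{n=1}^{N} \| x_n - (u_1^T x_n) u_1 \|^2$ subject to $u_1^T u_1 = 1$

This gives the same solution as the Maximum Variance Subspace.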
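2. Probabilistic PCA (PPCA)
The latent variable z lives in the principal-component subspace, and the observed data x is a data point:

$p(z) = N(z | 0, I)$
$p(x | z) = N(x | Wz + \mu, \sigma^2 I)$

where the columns of W span the principal-component subspace.
This is a linear-Gaussian model, so

$p(x) = N(x | \mu, C)$
$p(z | x) = N(z | M^{-1} W^T (x - \mu), \sigma^2 M^{-1})$

where $C = W W^T + \sigma^2 I$ is D x D and $M = W^T W + \sigma^2 I$ is M x M.
$p(x)$ is invariant to a rotation of the latent space: replacing z by Rz, with R an orthogonal matrix, is equivalent to $\tilde{W} = WR$, and $p(x)$ depends on W only through

$\tilde{W} \tilde{W}^T = W R R^T W^T = W W^T$

Given an observed data set, the PPCA parameters $W, \mu, \sigma^2$ can be estimated by MLE, which has a closed-form solution. EM can also be used.

Placing a prior over $W, \mu, \sigma^2$ gives Bayesian PPCA, which no longer has a closed-form solution and is fitted with an EM-like procedure.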
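3. Factor analysis
The model is the same as PPCA except for the noise covariance:

$p(z) = N(z | 0, I)$

$p(x | z) = N(x | Wz + \mu, \Psi)$

where $\Psi$ is a diagonal matrix. Factor analysis generalizes PPCA: PPCA restricts $\Psi$ to the isotropic matrix $\sigma^2 I$.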
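4. Kernel PCA
PCA is a linear projection onto the principal-component subspace. Kernel PCA is a nonlinear projection.
A nonlinear feature map $\phi(x)$ takes each data point into an M-dimensional feature space, and PCA is performed in the feature space. The linear projection in feature space is a nonlinear projection in the original space.
Assume first that the data points have mean 0 in feature space, $\sum_n \phi(x_n) = 0$. The M x M sample covariance is

$C = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T$

PCA solves the eigenvector equation of C:

$C v_i = \lambda_i v_i$, $i = 1, \dots, M$

To avoid working explicitly in the M-dimensional feature space, the eigenvector equation is rewritten with the kernel trick:

$K a_i = \lambda_i N a_i$, with $K_{nm} = k(x_n, x_m) = \phi(x_n)^T \phi(x_m)$

K is N-by-N, so this is an N-dimensional eigenvector equation that never refers to the feature space directly.
With K in place of the feature space, the projection of a data point x onto the eigenvector $v_i$ is

$y_i(x) = \phi(x)^T v_i = \sum_{n=1}^{N} a_{in} k(x, x_n)$

When $\sum_n \phi(x_n) \ne 0$ the data must be centralized:

$\tilde{\phi}(x_n) = \phi(x_n) - \frac{1}{N} \sum_{l=1}^{N} \phi(x_l)$

$\tilde{K}_{nm} = \tilde{\phi}(x_n)^T \tilde{\phi}(x_m)$ replaces $K_{nm} = \phi(x_n)^T \phi(x_m)$. $\tilde{K}$ can be expressed purely in terms of K, and the eigenvector equation is solved with $\tilde{K}_{nm}$.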

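5. Nonlinear latent variable models

In PPCA the latent variable is continuous; in the GMM and the HMM the latent variable is discrete.
In PPCA the observed variable depends on the latent variable through a linear-Gaussian model, and the latent variable is Gaussian. The models below relax these assumptions.
Independent component analysis
The observed variable is a linear function of the latent variable, but the latent variable is not Gaussian and its distribution factorizes:

$p(z) = \prod_{j=1}^{M} p(z_j)$

Autoassociative neural network

Also called an auto-encoder: a neural network used for unsupervised learning. The NN has D inputs, D outputs and M (< D) hidden units, and is trained to reproduce its input by minimizing

$E(w) = \frac{1}{2} \sum_n \| y(x_n, w) - x_n \|^2$

If the hidden units have a linear activation function, minimizing $E(w)$ makes the network project the N data points of dimension D onto the M-dimensional subspace spanned by the first M principal components.

Even if the hidden units of a single hidden layer have a nonlinear activation function, the solution is still the M-dimensional principal-component space. Nonlinear principal component analysis requires additional hidden layers.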

Chapter 13 Sequential Data


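1. Hidden Markov Model (HMM)
Definition of the HMM
Each observation $x_n$ has a discrete latent variable $z_n$, and the latent variables form a Markov chain. The HMM can be drawn as a Bayesian Network.
The latent Markov chain is homogeneous, with transition probability $p(z_n | z_{n-1})$. For every n, $z_n$ takes one of K states and is coded as a 1-of-K vector, so the transition probability $p(z_n | z_{n-1})$ is a K-by-K matrix A, whose entry $A_{jk}$ is the probability of moving from state j to state k.
The distributions of the HMM:

$p(z_n | z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}$

$p(z_1 | \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}}$, with $\sum_{k=1}^{K} \pi_k = 1$, $0 \le \pi_k \le 1$

$p(x_n | z_n, \phi) = \prod_{k=1}^{K} p(x_n | \phi_k)^{z_{nk}}$ (emission probability)

The likelihood function of the HMM

Given the observations $\{ x_n : n = 1, \dots, N \}$ of an HMM: the usual likelihood function is the product over all data points of the probability evaluated at each data point, but the HMM observations are not independent. The joint distribution instead follows the Bayesian Network:

$p(X, Z | \theta) = p(z_1 | \pi) [ \prod_{n=2}^{N} p(z_n | z_{n-1}, A) ] [ \prod_{m=1}^{N} p(x_m | z_m, \phi) ]$

where $X = \{ x_n : n = 1, \dots, N \}$, $Z = \{ z_n : n = 1, \dots, N \}$, and $\theta = \{ \pi, A, \phi \}$ are the parameters of the HMM.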
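HMM learning: Maximum Likelihood with EM

Given $X = \{ x_n : n = 1, \dots, N \}$, estimate the HMM parameters $\theta = \{ \pi, A, \phi \}$. This is learning from the observations.
For MLE the latent variables must be marginalized out of the likelihood of the observations,

$p(X | \theta) = \sum_Z p(X, Z | \theta)$

and this likelihood is maximized with EM.

The E step of EM for the HMM requires inference of the local posterior marginals for latent variables:

$\gamma(z_n) = p(z_n | X, \theta^{old})$

$\xi(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, \theta^{old})$

These marginals are computed by the Forward-backward algorithm, which is the sum-product algorithm of PGM applied to the HMM factor graph.
HMM inference: Viterbi
Besides the inference of marginal and conditional distributions, the HMM has a second inference problem: given $X = \{ x_n : n = 1, \dots, N \}$, find the most probable sequence of latent states,

$Z^* = \arg\max_Z p(Z | X, \theta)$

The max-sum algorithm of PGM applied to the HMM is the Viterbi algorithm.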
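The Viterbi recursion can be sketched for a discrete-emission HMM. The sketch works in log space and assumes all probabilities are strictly positive; the two-state parameters, the emission matrix `B` (a table form of the emission probability) and the observation sequence are hypothetical values chosen for the example.

```python
import math

def viterbi(pi, A, B, obs):
    """Most probable state sequence and its joint log probability ln p(X, Z*)."""
    K = len(pi)
    # Initialization: ln p(z_1) + ln p(x_1 | z_1).
    logd = [math.log(pi[k]) + math.log(B[k][obs[0]]) for k in range(K)]
    back = []
    for o in obs[1:]:
        prev, ptr, logd = logd, [], []
        for k in range(K):
            # Best predecessor state j for state k at this step.
            j_best = max(range(K), key=lambda j: prev[j] + math.log(A[j][k]))
            ptr.append(j_best)
            logd.append(prev[j_best] + math.log(A[j_best][k]) + math.log(B[k][o]))
        back.append(ptr)
    # Backtrack from the best final state.
    k = max(range(K), key=lambda j: logd[j])
    best_logp, path = logd[k], [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1], best_logp

pi = [0.6, 0.4]                          # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]             # A[j][k] = p(state k | state j)
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # B[k][o] = p(observation o | state k)
obs = [0, 1, 2]
path, logp = viterbi(pi, A, B, obs)
```

Since $p(Z | X, \theta) \propto p(X, Z | \theta)$ for fixed X, maximizing the joint probability over Z gives the same sequence $Z^*$.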
2. Linear Dynamical System (LDS)
The LDS has the same PGM as the HMM, but the latent variables of the LDS are continuous,
and both the transition probability and the emission probability are linear-Gaussian:

$p(z_n | z_{n-1}) = N(z_n | A z_{n-1}, \Gamma)$
$p(x_n | z_n) = N(x_n | C z_n, \Sigma)$
$p(z_1) = N(z_1 | \mu_0, V_0)$

Kernelize
Probabilistize

Chapter 14 Combining Models


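1. Model combination
There are two kinds: model averaging and model selection.
Model averaging: Boosting, mixture of linear regression, mixture of logistic regression, GMM.
Model selection: Decision tree.
2. Bayesian model averaging vs. combined model
Given a data set D, Bayesian model averaging assumes that the whole of D is generated by a single model, with uncertainty about which one. A combined model allows different data points in D to be generated by different models. The GMM is an example: D is generated by K Gaussian models.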
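3. Decision tree
Decision tree algorithms include CART (Classification And Regression Tree), ID3 and C4.5.
Why a decision tree is a combined model:
PRML treats the decision tree as a combined model: the input space is split into regions, each region has its own model, and a data point is handled by the regression/classification model of its region. For a given data point, the path through the decision tree from root to leaf is a model selection: it picks the model (leaf) responsible for that data point.
Training a decision tree
Training set D: inputs $X = \{ x_n : n = 1, \dots, N \}$ with labels $t = \{ t_n : n = 1, \dots, N \}$, for regression or classification.
Index the leaves by $\tau = 1, \dots, |T|$, where leaf $\tau$ corresponds to the input space region $R_\tau$ containing $N_\tau$ data points.
For regression, the model in $R_\tau$ is a constant, and the optimal prediction in $R_\tau$ is

$\hat{y}_\tau = \frac{1}{N_\tau} \sum_{x_n \in R_\tau} t_n$

with sum-of-squares error

$Q_\tau(T) = \sum_{x_n \in R_\tau} \{ t_n - \hat{y}_\tau \}^2$

For classification, let $p_{\tau k}$ be the proportion of data points in region $R_\tau$ assigned to class k:

$Q_\tau(T) = -\sum_k p_{\tau k} \ln p_{\tau k}$ (cross entropy)

$Q_\tau(T) = \sum_k p_{\tau k} (1 - p_{\tau k})$ (Gini index)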
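The two classification criteria are easy to compute from the class labels that fall in a region. The sketch below is illustrative; the function names and the label lists are assumptions for the example.

```python
import math

def class_proportions(labels, K):
    """p_k: fraction of the labels in a region that belong to class k."""
    n = len(labels)
    return [sum(1 for t in labels if t == k) / n for k in range(K)]

def cross_entropy(p):
    """-sum_k p_k ln p_k, with 0 ln 0 taken as 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def gini(p):
    """sum_k p_k (1 - p_k)."""
    return sum(pk * (1.0 - pk) for pk in p)

mixed = class_proportions([0, 0, 1, 1], K=2)   # evenly mixed region
pure = class_proportions([1, 1, 1, 1], K=2)    # single-class region
```

Both measures are zero for a pure region and largest for an evenly mixed one, which is why they are used to score candidate splits.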

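To avoid over-fitting (with one leaf per data point the training error is 0), the criterion penalizes the model complexity of the decision tree, measured by the number of leaves. For regression/classification:

$C(T) = \sum_{\tau=1}^{|T|} Q_\tau(T) + \lambda |T|$

Training a decision tree minimizes C(T).

Growing a decision tree
The tree is grown greedily from the tree root down to the tree leaves.
Each data point has D features (a D-dimensional data point). At each node, choose a feature and a threshold for that feature. Comparing the feature with the threshold splits the data points at the node between two children, each of which is split again on some feature and threshold, until the data points reach a leaf.
Properties of the decision tree
Advantage: human interpretability.
Disadvantage: the learned tree structure is sensitive to the data set. A small change in the training data can give a very different tree structure.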

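4. Mixture Model
Unconditional Mixture Model, e.g. the Gaussian Mixture Model (GMM):

$p(x) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma_k)$, with parameters $\pi, \mu, \Sigma$

Conditional Mixture Model, e.g. Mixtures of linear regression models, in which each component is a linear regression model:

$p(t | \theta) = \sum_{k=1}^{K} \pi_k N(t | w_k^T \phi, \beta^{-1})$, $\theta = \{ W, \pi, \beta \}$

This is a conditional mixture because the distribution of t is conditioned on the input. The parameters $\theta = \{ W, \pi, \beta \}$ include mixing coefficients that do not depend on the input.

Mixture of Experts: an extension of Mixtures of linear regression models in which the mixing coefficients are themselves functions of the input:

$p(t | x) = \sum_{k=1}^{K} \pi_k(x) p_k(t | x)$

where the $p_k(t | x)$ are called experts.

Hierarchical Mixture of Experts: each $p_k(t | x)$ of a mixture of experts is itself a mixture of experts.
As with the Unconditional Mixture Model (GMM), there is no closed-form solution, and these models are fitted with EM.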
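5. AdaBoost
AdaBoost is a model combination method of the model averaging kind. It trains M models $\{ y_m(x) : m = 1, \dots, M \}$ and combines them into the model

$Y_M(x) = sgn( \sum_{m=1}^{M} \alpha_m y_m(x) )$

Each $y_m(x)$ is a binary classifier and sgn is the sign function.

The M models are trained sequentially: the training of $y_m(x)$ depends on the performance of $y_{m-1}(x)$.
Each data point has a weight $w_n^{(m)}$; for m = 1 the weights are initialized to 1/N.
$y_m(x)$ is trained by minimizing

$J_m = \sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \ne t_n)$

where $t_n \in \{ -1, +1 \}$ and I is the indicator function.

The weighted error rate of $y_m(x)$ is

$\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \ne t_n)}{\sum_{n=1}^{N} w_n^{(m)}}$

and the weight of $y_m(x)$ in the combination is

$\alpha_m = \ln \{ \frac{1 - \epsilon_m}{\epsilon_m} \}$

After $y_m(x)$ is trained, the data point weights are updated so that misclassified points count more:

$w_n^{(m+1)} = w_n^{(m)} \exp\{ \alpha_m I(y_m(x_n) \ne t_n) \}$

The final AdaBoost model is $Y_M(x) = sgn( \sum_{m=1}^{M} \alpha_m y_m(x) )$.

Interpretation: AdaBoost minimizes the exponential error function

$E = \sum_{n=1}^{N} \exp\{ -t_n f_m(x_n) \}$, $t_n \in \{ -1, +1 \}$

over the training set $\{ x_n, t_n \}$, where

$f_m(x) = \frac{1}{2} \sum_{l=1}^{m} \alpha_l y_l(x)$

is the combined model. AdaBoost is a sequential optimization of this error.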
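The updates above can be sketched with decision stumps as the base classifiers. This is a minimal illustration: the stump form, the candidate thresholds and the toy 1-D data set are assumptions, and the code assumes $0 < \epsilon_m < 1$ so that $\alpha_m$ is finite.

```python
import math

def stump_predict(x, thr, sign):
    """Decision stump: returns `sign` if x > thr, otherwise -sign."""
    return sign if x > thr else -sign

def adaboost(X, T, M):
    """Train M stumps sequentially; returns a list of (alpha, thr, sign)."""
    N = len(X)
    w = [1.0 / N] * N                                   # w_n^(1) = 1/N
    thresholds = [x - 0.5 for x in sorted(set(X))] + [max(X) + 0.5]
    model = []
    for _ in range(M):
        # Choose the stump minimizing J_m = sum_n w_n I(y_m(x_n) != t_n).
        err, thr, sign = min(
            (sum(wn for wn, x, t in zip(w, X, T)
                 if stump_predict(x, th, s) != t), th, s)
            for th in thresholds for s in (1, -1))
        eps = err / sum(w)                              # weighted error rate
        alpha = math.log((1.0 - eps) / eps)
        # Increase the weights of the misclassified points only.
        w = [wn * math.exp(alpha) if stump_predict(x, thr, sign) != t else wn
             for wn, x, t in zip(w, X, T)]
        model.append((alpha, thr, sign))
    return model

def predict(model, x):
    """Y_M(x) = sgn(sum_m alpha_m y_m(x))."""
    s = sum(alpha * stump_predict(x, thr, sign) for alpha, thr, sign in model)
    return 1 if s >= 0 else -1

# Toy data (hypothetical): no single stump classifies every point correctly.
X = [0, 1, 2, 3, 4, 5]
T = [1, 1, -1, -1, 1, 1]
model = adaboost(X, T, M=3)
predictions = [predict(model, x) for x in X]
```

On this data each individual stump misclassifies at least two points, while the weighted combination of three stumps classifies all six correctly.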
