Checklist
Chapter 1 Introduction
Chapter 2 Probability Distributions
Chapter 3 Linear Models for Regression
Chapter 4 Linear Models for Classification
Chapter 5 Neural Networks
Chapter 6 Kernel Methods
Chapter 7 Sparse Kernel Machines
Chapter 8 Graphical Models
Chapter 9 Mixture Models and EM
Chapter 10 Approximate Inference
Chapter 11 Sampling Methods
Chapter 12 Continuous Latent Variables
Chapter 13 Sequential Data
Chapter 14 Combining Models
Checklist
Frequentist vs. Bayesian treatment of the main models, and the corresponding solution technique:
- Linear regression: closed-form solution
- Logistic regression: IRLS; Bayesian logistic regression: Laplace approximation
- Neural networks (regression, classification): gradient descent; Bayesian neural networks: Laplace approximation
- Gaussian processes (regression, classification): closed-form solution for regression, Laplace approximation for classification
- Mixture models: EM; Bayesian mixture models: variational inference
- Probabilistic PCA: closed-form solution or EM
- Hidden Markov model: EM
- Linear dynamical system: EM
Bayesian treatment and the evidence approximation:
- Fully Bayesian: marginalize with respect to the hyper-parameters as well as the parameters; analytically intractable. For curve fitting with p(w|α) = N(w|0, α^{-1}I) and p(t|x, w, β) = N(t|y(x,w), β^{-1}), the predictive distribution is
  p(t|t) = ∫∫∫ p(t|w, β) p(w|t, α, β) p(α, β|t) dw dα dβ.
- Empirical Bayes / type-2 maximum likelihood / evidence approximation: set the hyper-parameters to the values α*, β* that maximize the marginal likelihood, then marginalize over w only:
  p(t|t) ≈ p(t|t, α*, β*) = ∫ p(t|w, β*) p(w|t, α*, β*) dw.
Optimization/approximation techniques:
- Linear / quadratic / convex optimization
- Lagrange multipliers
- Gradient descent
- Newton iteration
- Laplace approximation
- Expectation Maximization (EM) for latent variable models
- Variational inference
- Expectation Propagation (EP)
- MCMC / Gibbs sampling

Latent variable models: GMM, HMM, LDS.
Chapter 1 Introduction
1. Bayesian interpretation of probability
Probability is interpreted as a degree of belief, i.e. as a quantification of uncertainty. Cox showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. We can therefore use the machinery of probability theory to describe the uncertainty in model parameters.
2. Parameter estimation: frequentist vs. Bayesian
Frequentist: the model parameter w is a fixed unknown quantity, and an estimator (e.g. maximum likelihood) is derived from the likelihood.
Bayesian: w is a random variable with prior probability p(w); observing the data D converts the prior into the posterior via Bayes' theorem:
  p(w|D) = p(D|w)p(w) / p(D) = p(D|w)p(w) / ∫ p(D|w)p(w) dw.
Example: for classes C_1,...,C_k we start from the prior probabilities P(C_1),...,P(C_k); after observing x they are revised into the posteriors P(C_1|x),...,P(C_k|x).
3. Criticisms of each viewpoint
Bayesian: the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs (e.g. conjugate priors).
Frequentist: the over-fitting problem can be understood as a general property of maximum likelihood.
4. Controlling over-fitting (frequentist remedies)
1) Regularization: add a penalty term to the error function. An L2 regularizer gives ridge regression; an L1 regularizer gives lasso regression. The penalty acts as a shrinkage method: it reduces the values of the coefficients, as in the sketch below.
2) Cross-validation: use a validation set to select the model complexity.
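A minimal sketch (synthetic data and a polynomial design matrix are assumed, chosen only for illustration) of how the L2 penalty shrinks the coefficients of a least-squares fit as λ grows:

```python
# Ridge regression on a noisy sinusoid: larger lambda -> smaller coefficients.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)   # noisy targets

degree = 9
Phi = np.vander(x, degree + 1, increasing=True)              # polynomial design matrix

for lam in [0.0, 1e-4, 1e-1]:
    # Ridge solution: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
    w = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)
    print(f"lambda={lam:g}  max|w|={np.abs(w).max():.2f}")
```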
5. Bayesian methods and marginalization
Marginalization lies at the heart of Bayesian methods. In a fully Bayesian procedure, making predictions or comparing different models requires marginalizing (summing or integrating) over the whole of parameter space. When this marginalization cannot be done analytically, two families of approximations are used:
- sampling methods such as Markov chain Monte Carlo (Monte Carlo methods are flexible and apply to a wide range of models, but are computationally intensive and mainly used for small-scale problems);
- deterministic approximations such as variational Bayes and expectation propagation, which scale to large-scale applications.
6. Curve fitting: three levels of treatment
1) MLE: maximize the likelihood function with respect to w; a point estimate.
2) MAP (a "poor man's Bayes"): introduce a prior and maximize the posterior probability; with a Gaussian prior, w_MAP is equivalent to MLE with an L2 penalty added to the log likelihood; still a point estimate.
3) Fully Bayesian approach: apply the sum and product rules (the machinery for manipulating degrees of belief) to obtain the predictive distribution by marginalizing (summing or integrating) over the whole of parameter space:
  p(t | x, X, t) = ∫ p(t | x, w) p(w | X, t) dw,
where x is the new input and X, t are the training inputs and labels. Every value of w contributes to the prediction, weighted by its posterior probability; no point estimate of w is required. This is marginalization.
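For the conjugate Gaussian case the marginalization over w can be done exactly; the following is a minimal sketch of the resulting predictive distribution (the prior precision α, noise precision β, basis degree and data are illustrative assumptions, not values from the text):

```python
# Bayesian linear (polynomial) regression: posterior over w and predictive distribution.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, degree = 2.0, 25.0, 5

x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=beta ** -0.5, size=20)
Phi = np.vander(x, degree + 1, increasing=True)

# Posterior p(w | X, t) = N(m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(degree + 1) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive p(t* | x*, X, t): mean phi^T m_N, variance 1/beta + phi^T S_N phi
x_new = np.linspace(0.0, 1.0, 5)
phi_new = np.vander(x_new, degree + 1, increasing=True)
mean = phi_new @ m_N
var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi_new, S_N, phi_new)
for xv, m, v in zip(x_new, mean, var):
    print(f"x={xv:.2f}  mean={m:+.3f}  std={np.sqrt(v):.3f}")
```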
8. Inference vs. decision
Solving a classification problem can be broken into an inference stage and a decision stage: in the inference stage we determine the posterior probabilities, and in the decision stage we use these posterior probabilities to make optimal class assignments.
9. Three approaches to the decision problem
1) Discriminant function: map inputs x directly into decisions; the discriminant function combines the inference and decision stages into one.
2) Discriminative model: solve the inference problem of determining the posterior class probabilities P(C_k | x), then solve the decision problem of assigning each new x to a class.
3) Generative model: explicitly or implicitly model the distribution of inputs as well as outputs, i.e. the class-conditional densities P(x | C_k) and priors (or the joint distribution P(x, C_k)); obtain the posterior from Bayes' theorem and then solve the decision problem.
10. Relative merits
A generative model estimates the joint distribution merely in order to make a classification decision, which can be wasteful of computational resources and excessively demanding of data; when only the decision is required (e.g. a financial application that only needs a discriminant function), the simpler approaches suffice. On the other hand, having the posterior probabilities is useful, for example:
(1) to compensate for unbalanced class priors (e.g. X-ray screening where 99.9% of cases are normal);
(2) to combine independent sources of information: with an X-ray image x_I and blood data x_B that are conditionally independent given the class C_k,
  P(C_k | x_I, x_B) ∝ P(x_I, x_B | C_k) P(C_k) = P(x_I | C_k) P(x_B | C_k) P(C_k) ∝ P(C_k | x_I) P(C_k | x_B) / P(C_k).
11. Minimizing expected loss
The risk of taking action a_i given x is
  R(a_i | x) = Σ_{j=1}^{c} λ(a_i | C_j) P(C_j | x),
where λ(a_i | C_j) is the loss incurred by taking action a_i when the true class is C_j. A decision rule α(x) maps each x to an action, and its overall risk is
  R[α] = ∫ R(α(x) | x) p(x) dx,
which is minimized by choosing, for every x, the action that minimizes R(α(x) | x).
12. Entropy
Entropy is a lower bound on the number of bits needed to transmit the state of a random variable. For a random variable X with distribution p(x):
1) discrete case: H[X] = -Σ_x p(x) log_2 p(x);
2) continuous case (differential entropy): H[x] = -∫ p(x) log_2 p(x) dx; for a Gaussian with variance σ² the differential entropy is (1/2) ln(2πeσ²).
13. Conditional entropy, KL divergence, mutual information
Conditional entropy of a joint distribution p(X, Y):
  H[Y|X] = -∫∫ p(x, y) ln p(y|x) dx dy = ∫ H[Y | X = x] p(x) dx,
i.e. the average over X of the entropy of Y given X = x; it satisfies H[X, Y] = H[Y|X] + H[X].
KL divergence between p(x) and an approximating distribution q(x):
  KL(p‖q) = -∫ p(x) ln{q(x)/p(x)} dx ≥ 0,
with equality iff q(x) = p(x); it measures the additional information needed when q(x) is used in place of p(x), and it is not symmetric.
Mutual information between X and Y is the KL divergence between the joint p(x, y) and the product of marginals p(x)p(y); it vanishes iff X and Y are independent:
  I[X; Y] = KL(p(x, y) ‖ p(x)p(y)) = H[X] - H[X|Y] = H[Y] - H[Y|X].
14. See PRML Exercise 1.14.
15. See PRML Exercise 1.32: for a random vector x with differential entropy H[x] and a nonsingular linear transformation y = Ax, the entropy transforms as H[y] = H[x] + ln |det A|.
Feature extraction: a special form of dimensionality reduction; the input data are transformed into a reduced representation set of features (a feature vector). Example: real-time face detection in a high-resolution video stream, where processing the huge number of pixels per second directly is infeasible, so features are extracted to speed up computation. Methods: PCA, kernel PCA, manifold learning.
Feature selection: in statistics, the most popular form of feature selection is stepwise regression, a greedy algorithm that adds the best feature (or deletes the worst feature) at each round; the main control issue is deciding when to stop the iteration. In machine learning, this is typically done by cross-validation.
Unsupervised learning: the training data consist of a set of input vectors without any corresponding target values. Tasks include clustering, density estimation (determining the distribution of data within the input space) and visualization (projecting the data from a high-dimensional space down to two or three dimensions).
Chapter 2 Probability Distributions

1. The likelihood p(D|w)
Frequentist: w is a fixed quantity and p(D|w) is a function of w to be maximized.
Bayesian: w is a random variable; p(D|w) expresses how probable the observed data are for each setting of w and is combined with the prior through Bayes' theorem.
2. Conjugate prior: leads to a posterior distribution having the same functional form as the prior.
Distribution and its conjugate prior:
- Bernoulli: Beta distribution
- Multinomial: Dirichlet distribution
- Gaussian (mean, known precision): Gaussian distribution
- Gaussian (precision, known mean): Gamma distribution
- Gaussian (mean and precision): Gaussian-Gamma distribution
4. Conjugate prior for the multinomial distribution
Given a data set D of N observations of a multinomial variable with K states, the likelihood is
  p(D | μ) = Π_{k=1}^{K} μ_k^{m_k},
where m_k is the number of observations of state k. By inspection of this functional form, the conjugate prior must satisfy
  p(μ | α) ∝ Π_{k=1}^{K} μ_k^{α_k - 1},
with hyper-parameters α = (α_1, ..., α_K); normalizing gives the Dirichlet distribution
  Dir(μ | α) = Γ(α_1 + ... + α_K) / (Γ(α_1) ··· Γ(α_K)) Π_{k=1}^{K} μ_k^{α_k - 1}.
Multiplying the Dirichlet prior by the multinomial likelihood yields a posterior that is again a Dirichlet, i.e. it has the same functional form as the prior.
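A minimal sketch of this conjugate update (the prior parameters and counts are arbitrary illustrative values): the posterior Dirichlet parameters are simply the prior parameters plus the observed counts.

```python
# Dirichlet-multinomial conjugate update: posterior = Dir(alpha0 + counts).
import numpy as np

alpha0 = np.array([1.0, 1.0, 1.0])          # symmetric Dirichlet prior, K = 3
counts = np.array([12, 3, 5])               # m_k: observed counts of each state

alpha_post = alpha0 + counts                 # Dir(mu | alpha0 + m)
mean_post = alpha_post / alpha_post.sum()    # E[mu_k] under the posterior

print("posterior parameters:", alpha_post)
print("posterior mean of mu:", np.round(mean_post, 3))
```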
5. Mathematical background; sampling a Gaussian
The main tools are linear algebra, matrix theory and multivariate calculus. To draw a sample of a Gaussian random vector with covariance A, compute the Cholesky decomposition A = GG^T, then (1) draw z from a standard Gaussian and (2) return μ + Gz.
6. Multimodal distributions
A single standard distribution cannot represent a multimodal density; mixtures are more flexible. Two routes:
(1) introducing discrete latent variables, e.g. Gaussian mixture models;
(2) introducing continuous latent variables, e.g. linear dynamical systems.
7. Derivatives of vector- and matrix-valued functions
(1) f: R → R: the ordinary derivative.
(2) f: R^n → R: the gradient, ∂f/∂x = (∂f/∂x_1, ..., ∂f/∂x_n)^T.
(3) f: R^n → R^m: the Jacobian, (∂f/∂x)_{ij} = ∂f_i/∂x_j.
Useful identities: (∂(Ax)/∂x)_{ij} = A_{ij}, i.e. ∂(Ax)/∂x = A, and ∂(x^T A x)/∂x = (A + A^T)x.
8. Exponential family
  p(x | η) = h(x) g(η) exp{η^T u(x)},
where η is the natural parameter. Maximum likelihood gives the condition
  -∇ ln g(η_ML) = (1/N) Σ_{n=1}^{N} u(x_n) = E[u(x)],
so the solution depends on the data only through the sufficient-statistic vector Σ_n u(x_n).
9. Noninformative priors
The prior is intended to have as little influence on the posterior distribution as possible. Such priors often cannot be normalized (they are improper). Standard examples: translation-invariant priors (location parameters) and scale-invariant priors (scale parameters).
Density estimation from a local region: let R be a small region of volume V around x and draw N points from p(x). The probability mass of R is P ≈ p(x)·V; the number K of points falling in R is distributed as Bin(K | N, P), so for large N, K ≈ N·P. Combining these gives the estimate p(x) = K / (N·V).
Kernel density estimation. Define the Parzen window (hypercube kernel)
  k(u) = 1 if |u_i| ≤ 1/2 for all i = 1, ..., D, and 0 otherwise,
so that k((x - x_n)/h) = 1 exactly when x_n lies inside a hypercube of side h centred on x. The number of points inside the hypercube is
  K = Σ_{n=1}^{N} k((x - x_n)/h),
and substituting into p(x) = K/(N·V) with V = h^D gives the kernel density estimate. The hypercube kernel introduces artificial discontinuities; a smoother choice is the Gaussian kernel, which gives
  p(x) = (1/N) Σ_{n=1}^{N} (1/(2πh²)^{D/2}) exp{-‖x - x_n‖² / (2h²)}.
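A minimal one-dimensional sketch of the Gaussian kernel density estimate above (the bimodal sample data and bandwidth h are arbitrary choices for illustration):

```python
# Gaussian KDE: p(x) = (1/N) sum_n N(x | x_n, h^2) for 1-D data.
import numpy as np

rng = np.random.default_rng(2)
samples = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(1.0, 1.0, 200)])

def kde(x, data, h):
    # average of Gaussian kernels centred on the data points
    diff = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * diff ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-4, 4, 9)
print(np.round(kde(grid, samples, h=0.3), 3))
```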
k-nearest-neighbour density estimation fixes K instead of V: grow the region around x until it contains exactly K points and use its volume V in p(x) = K/(N·V). Applied to classification, kNN amounts to a MAP decision based on the estimated posteriors, i.e. assign x to the class that is most frequent among its K nearest neighbours.
Chapter 3 Linear Models for Regression

1. Sum-of-squares error, MLE and MAP
  E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}².
MLE: assuming the target t is y(x, w) plus Gaussian noise, maximizing the likelihood with respect to w is equivalent to minimizing E(w).
MAP: with a Gaussian prior on w, maximizing the posterior is equivalent to minimizing the sum of squares of the errors with a quadratic regularizer,
  E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}² + (λ/2) ‖w‖².
2. Linear models
"Linear" refers to linearity in the parameters, not necessarily in the input variables. The linear basis function model is
  y(x, w) = w_0 + Σ_{j=1}^{M-1} w_j φ_j(x) = w^T φ(x).
3. Decision theory for regression
With squared loss L(t, y(x)) = {y(x) - t}², minimizing the expected loss E[L] with respect to y(x) gives
  y(x) = ∫ t p(t|x) dt = E_t[t|x],
the conditional expectation of t given x. Writing h(x) = E[t|x], the expected loss decomposes as
  E[L] = ∫ {y(x) - h(x)}² p(x) dx + ∫∫ {h(x) - t}² p(x, t) dx dt;
the second term arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss, independent of the choice of y(x).
4. Bias-variance decomposition
Averaging over data sets D, the first term decomposes further into bias² and variance:
  (bias)² = ∫ {E_D[y(x; D)] - h(x)}² p(x) dx,
  variance = ∫ E_D[{y(x; D) - E_D[y(x; D)]}²] p(x) dx,
  noise = ∫∫ {h(x) - t}² p(x, t) dx dt.
5. Bias-variance in practice
In a curve-fitting experiment with L data sets, each of N data points, fit a model y^{(l)}(x) to data set D^{(l)} and define the average model
  ȳ(x) = (1/L) Σ_{l=1}^{L} y^{(l)}(x).
Bias and variance are then estimated as
  (bias)² = (1/N) Σ_{n=1}^{N} {ȳ(x_n) - h(x_n)}²,
  variance = (1/N) Σ_{n=1}^{N} (1/L) Σ_{l=1}^{L} {y^{(l)}(x_n) - ȳ(x_n)}².
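A minimal simulation of these estimates, assuming a sinusoidal ground truth h(x) and ridge-regularized polynomial fits (all settings are illustrative assumptions):

```python
# Estimate bias^2 and variance by fitting the same model class to L independent data sets.
import numpy as np

rng = np.random.default_rng(3)
L, N, degree, lam = 100, 25, 9, 1e-3
x_test = np.linspace(0.0, 1.0, 50)
h = np.sin(2 * np.pi * x_test)                       # ground-truth function h(x)
Phi_test = np.vander(x_test, degree + 1, increasing=True)

preds = np.empty((L, x_test.size))
for l in range(L):
    x = rng.uniform(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
    Phi = np.vander(x, degree + 1, increasing=True)
    w = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)
    preds[l] = Phi_test @ w                          # y^(l)(x) on the test grid

y_bar = preds.mean(axis=0)                           # average model
bias2 = np.mean((y_bar - h) ** 2)
variance = np.mean((preds - y_bar) ** 2)
print(f"bias^2 = {bias2:.4f}   variance = {variance:.4f}")
```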
6. Bayesian model comparison
(1) The posterior over models is p(M_i | D) ∝ p(M_i) p(D | M_i), where p(M_i) allows us to express a preference for different models.
(2) p(D | M_i) is the model evidence, also called the marginal likelihood,
  p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw,
obtained by marginalizing over the parameters w.
(3) Model averaging vs. model selection: in model averaging the predictive distribution mixes the models, weighted by their posteriors,
  p(t | x, D) = Σ_i p(t | x, M_i, D) p(M_i | D),
whereas model selection simply uses the single most probable model.
7. Bayesian treatment and the evidence approximation
Fully Bayesian: marginalize with respect to the hyper-parameters as well as the parameters; analytically intractable. For curve fitting with p(w|α) = N(w|0, α^{-1}I) and p(t|x, w, β) = N(t|y(x,w), β^{-1}), the predictive distribution is
  p(t|t) = ∫∫∫ p(t|w, β) p(w|t, α, β) p(α, β|t) dw dα dβ.
Empirical Bayes / type-2 maximum likelihood / evidence approximation: set the hyper-parameters to the values α*, β* that maximize the marginal likelihood, then marginalize over w only:
  p(t|t) ≈ p(t|t, α*, β*) = ∫ p(t|w, β*) p(w|t, α*, β*) dw.
8. Marginalizing over a latent variable
  p(y|x) = ∫ p(y|x, z) p(z|x) dz,
since
  p(y|x) = p(x, y)/p(x) = ∫ p(x, y, z) dz / p(x) = ∫ p(y|x, z) p(z|x) p(x) dz / p(x) = ∫ p(y|x, z) p(z|x) dz.
Chapter 4 Linear Models for Classification

1. Basic concepts
Linearly separable: the classes can be separated exactly by a (D-1)-dimensional linear decision surface.
Coding scheme: 1-of-K binary coding; for class i the K-component target vector has its i-th element equal to 1 and the rest 0.
Feature vector: the D-dimensional input x is often replaced by a fixed nonlinear transformation φ(x), the feature vector.
2. Generalized linear model (GLM): an activation function acting on a linear function of the feature variables,
  y(x) = f(w^T φ(x) + w_0),
where φ(x) is the feature vector (possibly x itself) and f is the activation function (its inverse is called the link function).
(1) If f is a nonlinear function, the GLM is a classification model: the decision surfaces are still linear in φ(x), i.e. w^T φ(x) + w_0 = const, but because of f the model is no longer linear in the parameters w (in contrast to polynomial regression, where the polynomial terms act as basis functions and y remains linear in w).
(2) If f is the identity function, the GLM reduces to a linear regression model.
Examples of classification GLMs:
(1) logistic regression, where f is the logistic sigmoid;
(2) probit regression, where the activation function f is the probit function.
3. Three approaches to classification (recap)
(1) Discriminant function: map x directly to a class label.
(2) Generative model: model the joint distribution p(x, C_k), or equivalently the class-conditional distributions p(x | C_k) together with the priors, and obtain the posterior p(C_k | x) from Bayes' theorem.
(3) Discriminative model: model the posterior p(C_k | x) directly, e.g. with the GLM
  p(C_k | x) = f(w^T φ(x) + w_0),
whose parameters are determined from the training data.
4. Discriminant functions
K-class linear discriminant: assign x to class k if y_k(x) > y_j(x) for all j ≠ k; the boundary between classes k and j is y_k(x) = y_j(x), i.e. (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0. The resulting decision regions are convex, which avoids the ambiguous regions produced by combining several two-class discriminants.
Fisher's linear discriminant: project the D-dimensional input down to 1 dimension and choose the projection that maximizes the Fisher criterion (ratio of between-class to within-class variance); classify by thresholding the projected value.
Perceptron: y(x) = f(w^T φ(x) + w_0) with f a step function; targets are t_n = +1 for class 1 and t_n = -1 for class 2. A correctly classified x_n satisfies w^T φ(x_n) t_n > 0 and a misclassified one satisfies w^T φ(x_n) t_n < 0, so the perceptron criterion sums -w^T φ(x_n) t_n over the misclassified patterns:
  E_P(w) = -Σ_{n ∈ M} w^T φ(x_n) t_n.
5. Generative models for classification
A generative model fits the class-conditional distributions of the inputs, whereas a discriminative model only fits what is needed to make the decision. For two classes:
(1) The posterior can be written as
  p(C_1 | x) = p(x|C_1)p(C_1) / (p(x|C_1)p(C_1) + p(x|C_2)p(C_2)) = σ(a) = 1/(1 + exp(-a)),
  with a = ln[ p(x|C_1)p(C_1) / (p(x|C_2)p(C_2)) ].
(2) If the class-conditional densities are Gaussian with a shared covariance, p(x | C_k) = N(x | μ_k, Σ), then
  p(C_1 | x) = σ(w^T x + w_0),
  with w = Σ^{-1}(μ_1 - μ_2) and w_0 = -(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln(p(C_1)/p(C_2)).
(3) If the two classes have different covariance matrices, the quadratic terms in x no longer cancel and the argument of the sigmoid becomes a quadratic rather than a linear function w^T x + w_0 of x.
(4) MLE: write p(C_1) = π and p(C_2) = 1 - π, and let {x_n, t_n} with t_n = 1 when x_n ∈ C_1 and t_n = 0 otherwise. Then
  p(x_n, C_1) = π N(x_n | μ_1, Σ),  p(x_n, C_2) = (1 - π) N(x_n | μ_2, Σ),
and the likelihood is
  p(t | π, μ_1, μ_2, Σ) = Π_{n=1}^{N} [π N(x_n | μ_1, Σ)]^{t_n} [(1 - π) N(x_n | μ_2, Σ)]^{1-t_n}.
Maximizing it gives closed-form estimates of π, μ_1, μ_2 and Σ. Substituting these class-conditional estimates into (2) shows that the resulting posterior has exactly the GLM (logistic) form, which motivates modelling the posterior directly with a logistic GLM.
6. Discriminative model: logistic regression
Fitting the posterior directly requires far fewer adaptive parameters than the generative approach, whose parameter count grows quadratically with D.
The logistic sigmoid σ(a) = 1/(1 + exp(-a)) is a smoothed step function; the softmax
  p(C_k | φ) = exp(a_k) / Σ_j exp(a_j),  a_k = w_k^T φ,
is a smoothed max function and generalizes the logistic sigmoid to K classes.
Two-class logistic regression: given training data {φ_n, t_n} with D-dimensional feature vectors φ_n and t_n ∈ {0, 1}, the likelihood is
  p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1-t_n},  y_n = p(C_1 | φ_n) = σ(a_n),  a_n = w^T φ_n.
The negative logarithm of the likelihood is the cross-entropy error function
  E(w) = -ln p(t | w) = -Σ_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) },
with gradient
  ∇E(w) = Σ_{n=1}^{N} (y_n - t_n) φ_n.
Setting the gradient to zero gives no closed-form solution, because y_n depends nonlinearly on w through the logistic function. However, E(w) is convex and has a unique minimum, which can be found by gradient descent or, more efficiently, by IRLS (iteratively reweighted least squares), i.e. Newton iteration using the Hessian of E(w):
  w^{(new)} = w^{(old)} - H^{-1} ∇E(w).
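A minimal NumPy sketch of IRLS for two-class logistic regression (the Gaussian-blob data and the small jitter added to the Hessian are illustrative assumptions, not part of the text):

```python
# Logistic regression trained by IRLS: grad = Phi^T (y - t), H = Phi^T R Phi, R = diag(y(1-y)).
import numpy as np

rng = np.random.default_rng(4)
N = 200
X = np.vstack([rng.normal([-1, -1], 1.0, (N // 2, 2)),
               rng.normal([+1, +1], 1.0, (N // 2, 2))])
t = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])
Phi = np.hstack([np.ones((N, 1)), X])                # bias feature + inputs

w = np.zeros(Phi.shape[1])
for _ in range(10):                                  # a few Newton steps usually suffice
    y = 1.0 / (1.0 + np.exp(-Phi @ w))               # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)
    R = y * (1.0 - y)
    H = Phi.T @ (Phi * R[:, None]) + 1e-6 * np.eye(Phi.shape[1])   # jitter for stability
    w = w - np.linalg.solve(H, grad)

acc = np.mean((Phi @ w > 0) == (t == 1))
print("weights:", np.round(w, 3), " training accuracy:", acc)
```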
7. Laplace approximation
Approximate a distribution p(z) = (1/Z) f(z), where f(z) can be evaluated but the normalizer Z is unknown, by a Gaussian q(z) centred on a mode z_0 of f(z). A second-order Taylor expansion of ln f(z) around the mode gives
  ln f(z) ≈ ln f(z_0) - (A/2)(z - z_0)²,  A = - d²/dz² ln f(z) |_{z = z_0},
so that
  f(z) ≈ f(z_0) exp{-(A/2)(z - z_0)²}
and the normalized Gaussian approximation is
  q(z) = (A/2π)^{1/2} exp{-(A/2)(z - z_0)²},
which is then used in place of p(z).
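A minimal one-dimensional sketch of the recipe above: find the mode by Newton's method, then read off A from the curvature (the unnormalized Gamma-like density f(z) = z^{a-1} exp(-bz) and the values of a, b are illustrative assumptions):

```python
# Laplace approximation: q(z) = N(z | z0, 1/A), A = -(ln f)''(z0).
import numpy as np

a, b = 5.0, 2.0
d1 = lambda z: (a - 1) / z - b            # first derivative of ln f(z)
d2 = lambda z: -(a - 1) / z ** 2          # second derivative of ln f(z)

z0 = 1.0
for _ in range(20):                        # Newton iterations for the mode
    z0 = z0 - d1(z0) / d2(z0)

A = -d2(z0)
print(f"mode z0 = {z0:.3f} (exact (a-1)/b = {(a-1)/b:.3f}),  q(z) = N({z0:.3f}, {1/A:.3f})")
```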
8. Bayesian logistic regression: predictive distribution
  p(C_1 | φ, t) = ∫ p(C_1 | φ, w) p(w | t) dw = ∫ σ(w^T φ) p(w | t) dw,
a Bayesian prediction in which the posterior p(w | t) is replaced by its Gaussian (Laplace) approximation.
9. The likelihood function
Wikipedia: the likelihood of a set of parameter values given some observed outcomes is equal to the probability of those observed outcomes given those parameter values.
Generative model: for {x_n, t_n} with t_n = 1 when x_n ∈ C_1 and t_n = 0 otherwise,
  p(x_n, C_1) = π N(x_n | μ_1, Σ),  p(x_n, C_2) = (1 - π) N(x_n | μ_2, Σ),
  p(t, X | π, μ_1, μ_2, Σ) = Π_{n=1}^{N} [π N(x_n | μ_1, Σ)]^{t_n} [(1 - π) N(x_n | μ_2, Σ)]^{1-t_n};
the observed outcomes are the pairs (x_n, t_n), so the factors are joint densities.
Discriminative model (logistic regression):
  p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1-t_n},  y_n = p(C_1 | φ_n) = σ(a_n),  a_n = w^T φ_n;
here the inputs x_n are treated as given, the observed outcomes are only the labels t_n, and each factor is the conditional probability p(t_n | x_n) built from the posterior y_n = p(C_1 | φ_n).
Chapter 5 Neural Networks

1. From generalized linear models to neural networks
  y(x) = f(w^T φ(x) + w_0),
where φ(x) is the feature vector and f the activation function (its inverse is the link function).
(1) If f is a nonlinear function, the GLM is a classification model; y(x) is interpreted as a posterior probability, so this is a discriminative model. With f the logistic sigmoid we obtain logistic regression and the cross-entropy error function.
(2) If f is the identity function, the GLM is a regression model; the Gaussian-noise probabilistic interpretation of y(x) leads to the sum-of-squares error function.
2. Neural networks
A neural network makes the basis functions themselves adaptive: each of the M hidden units is itself a generalized linear model, a nonlinear function of a linear combination of the inputs.
3. Error functions for neural networks
For regression, the sum-of-squares error
  E(w) = (1/2) Σ_{n=1}^{N} ‖y(x_n, w) - t_n‖².
For classification (1-of-K coding), the cross-entropy error
  E(w) = -Σ_{n=1}^{N} Σ_{k=1}^{K} t_{nk} ln y_k(x_n, w),
where t_{nk} = 1 if x_n belongs to class k and 0 otherwise, and y_k(x_n, w) = p(t_k = 1 | x_n). In both cases the error decomposes over data points, E(w) = Σ_n E_n(w).
4. Gradient descent training
Given the observation data, minimize E(w) with respect to w by gradient descent.
(1) Batch gradient descent:
  w^{(τ+1)} = w^{(τ)} - η ∇E(w^{(τ)}),
with learning rate η > 0; ∇E is the batch error-function gradient, computed from the whole data set.
(2) On-line gradient descent (sequential or stochastic gradient descent):
  w^{(τ+1)} = w^{(τ)} - η ∇E_n(w^{(τ)}),
updating on one observation at a time, chosen either by cycling through the data or by random selection with replacement.
5. Error backpropagation
Forward propagation: for hidden unit j,
  a_j = Σ_i w_{ji} z_i,  z_j = h(a_j),
where the z_i are the activations of the previous layer (the inputs for the first hidden layer). Define the error signal
  δ_j ≡ ∂E_n/∂a_j.
Then for any weight
  ∂E_n/∂w_{ji} = (∂E_n/∂a_j)(∂a_j/∂w_{ji}) = δ_j z_i.
For output units, δ_k follows directly from the error function (e.g. δ_k = y_k - t_k for sum-of-squares or cross-entropy with the canonical output activation). For hidden units, the chain rule gives the backpropagation formula
  δ_j = ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j) = h'(a_j) Σ_k w_{kj} δ_k,
since a_k = Σ_j w_{kj} h(a_j) implies ∂a_k/∂a_j = w_{kj} h'(a_j). The δ's are therefore propagated backwards from the output layer through the hidden layers, and all derivatives ∂E_n/∂w_{ji} are obtained at a cost that is O(W) in the number of weights.
Training procedure:
(1) initialize w;
(2) for the current w, run forward propagation and backpropagation to obtain ∇E(w);
(3) update w by gradient descent;
(4) repeat (2)-(3) until convergence.
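A minimal NumPy sketch of the forward and backward passes above for a network with one tanh hidden layer and a linear output, trained by batch gradient descent (the toy regression data, layer width and learning rate are illustrative assumptions):

```python
# One-hidden-layer MLP trained by backpropagation on a toy regression task.
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, (100, 1))
T = np.sin(np.pi * X) + 0.05 * rng.standard_normal((100, 1))

M, eta = 8, 0.05
W1 = rng.standard_normal((1, M)) * 0.5; b1 = np.zeros(M)
W2 = rng.standard_normal((M, 1)) * 0.5; b2 = np.zeros(1)

for epoch in range(2000):
    # forward propagation: a_j = sum_i w_ji z_i,  z_j = h(a_j)
    A1 = X @ W1 + b1
    Z1 = np.tanh(A1)
    Y = Z1 @ W2 + b2
    # backpropagation: output deltas, then delta_j = h'(a_j) sum_k w_kj delta_k
    d2 = (Y - T) / len(X)                 # delta_k for sum-of-squares error
    d1 = (1.0 - Z1 ** 2) * (d2 @ W2.T)    # h'(a) = 1 - tanh(a)^2
    # gradients dEn/dw_ji = delta_j z_i, then gradient-descent update
    W2 -= eta * Z1.T @ d2; b2 -= eta * d2.sum(axis=0)
    W1 -= eta * X.T @ d1;  b1 -= eta * d1.sum(axis=0)

print("final training error:", float(0.5 * np.sum((Y - T) ** 2)))
```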
6. Optimization techniques (overview)
- Linear and quadratic programming, convex optimization
- Combinatorial optimization
- Probabilistic optimization: genetic algorithms, simulated annealing, particle swarm optimization, ant colony optimization
- Calculus of variations
- Numerical optimization: gradient descent, conjugate gradient, Newton's method
7. Jacobian and Hessian
The Jacobian J_{ki} = ∂y_k/∂x_i measures the sensitivity of the network outputs to the inputs and can be computed by a backpropagation-style error propagation started from the output layer. The Hessian is the matrix of second derivatives of the error (for f: R^n → R); the product v^T H = v^T ∇(∇E(w)) can be evaluated in O(W) operations without forming H explicitly.
9. Regularization of neural networks (frequentist methods)
(1) Weight decay: as in regression, add a penalty on w. A plain quadratic regularizer is not consistent with the linear-transformation invariance of the network mapping (rescaling or shifting the inputs can be absorbed into the first-layer weights, e.g. w_{ji} → (1/a) w_{ji}, w_{j0} → w_{j0} - (b/a) Σ_i w_{ji}); a consistent, equivalent regularizer treats the weight groups separately,
  (λ_1/2) Σ_{w ∈ W_1} w² + (λ_2/2) Σ_{w ∈ W_2} w².
(2) Early stopping: monitor the error on a validation set and stop training when it starts to rise.
10. Invariances
(1) Data replication / augmentation: add transformed copies of the training inputs with unchanged labels, so the network learns the invariance; this modifies the data.
(2) Regularization that penalizes changes in the model output when the input is transformed, e.g. tangent propagation; this modifies the error function.
(3) Pre-processing: feature extraction of transformation-invariant features.
(4) Build the invariance properties into the structure of the network, e.g. convolutional neural networks, which give the model a transformation-invariant structure.
11. Bayesian neural networks
The network output is y(x, w). Place a Gaussian prior over the weights,
  p(w | α) = N(w | 0, α^{-1} I),
and a Gaussian conditional distribution for the target given input x,
  p(t | x, w, β) = N(t | y(x, w), β^{-1}).
The posterior is
  p(w | D, α, β) ∝ p(w | α) p(D | w, β) = p(w | α) Π_{n=1}^{N} N(t_n | y(x_n, w), β^{-1}),
which is non-Gaussian because y(x, w) is nonlinear in w; it is approximated by a Gaussian q(w | D) via the Laplace approximation. The predictive distribution for a new input x,
  p(t | x, D) = ∫ p(t | x, w) p(w | D, α, β) dw,
is then evaluated using q(w | D), and the hyper-parameters α, β are set by maximizing the evidence.
Chapter 6 Kernel Methods

1. Modelling: optimization vs. approximation
Frequentist (ML) methods reduce learning to optimization: a point estimate of the parameters is produced by an estimator, so the key computational step is optimization. Bayesian methods avoid point estimates and marginalize over the parameters; since the marginalization rarely has an analytical solution, the key step is approximation, either analytical (Taylor expansion, Laplace approximation, variational Bayes) or sampling-based.
2. Kernels
A kernel corresponds to an inner product in a (possibly nonlinear) feature space: for a feature-space mapping φ(x) of the input x,
  k(x, x') = φ(x)^T φ(x').
Special cases: stationary kernels k(x, x') = k(x - x') and radial basis kernels k(x, x') = k(‖x - x'‖).
Ways to construct kernels:
- choose a feature mapping φ and form the inner product;
- construct k directly and verify it is a valid kernel (the Gram matrix must be positive semidefinite);
- build new kernels from existing kernels (sums, products, composition with functions; the validity of the Gaussian kernel can be shown this way);
- define kernels over non-vectorial objects, e.g. for sets A_1, A_2: k(A_1, A_2) = 2^{|A_1 ∩ A_2|}, which is an inner product in a suitable feature (vector) space.
3. Kernels from generative models
Generative vs. discriminative: a generative model can deal naturally with missing data, and an HMM can handle sequences of varying length, whereas a discriminative model generally gives better performance on discriminative tasks. One way to combine the strengths of both is to use a generative model to define a kernel and then use that kernel in a discriminative model. Examples:
- k(x, x') = Σ_i p(x | i) p(x' | i) p(i), summing over a latent variable i;
- the Fisher kernel: for a parametric generative model p(x | θ) with parameter vector θ, define the Fisher score g(θ, x) = ∇_θ ln p(x | θ) (a vector, even for scalar x) and build the kernel from inner products of the score vectors.
4. Gaussian processes
A Gaussian process is a distribution over functions y(x) such that for any finite set of points x_1, ..., x_N the values (y(x_1), ..., y(x_N)) have a joint Gaussian distribution. The prior mean is usually taken to be 0, and the process is specified by its covariance function,
  E[y(x_n) y(x_m)] = k(x_n, x_m).
5. Gaussian processes for regression
Add observation noise to the function values y(x_n): p(t_n | y_n) = N(t_n | y_n, β^{-1}), so for N observations
  p(t | y) = N(t | y, β^{-1} I_N),
where t = (t_1, ..., t_N)^T, y = (y_1, ..., y_N)^T and I_N is the N×N identity. The Gaussian process prior gives
  p(y) = N(y | 0, K),
with K the kernel (Gram) matrix. Hence p(t) is Gaussian with covariance K + β^{-1} I_N, and the predictive distribution p(t_{N+1} | t) has an analytical (Gaussian) solution. The dominant cost is inverting an N×N matrix, O(N³) for training; each prediction then costs O(N) for the mean and O(N²) for the variance.
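A minimal NumPy sketch of GP regression with a squared-exponential kernel (the kernel hyper-parameters, noise precision and toy data are illustrative assumptions):

```python
# GP regression: mean k*^T (K + I/beta)^{-1} t, variance c - k*^T (K + I/beta)^{-1} k*.
import numpy as np

def kernel(a, b, theta=1.0, ell=0.3):
    return theta * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(6)
beta = 25.0
x = rng.uniform(0, 1, 12)
t = np.sin(2 * np.pi * x) + rng.normal(scale=beta ** -0.5, size=12)

C_N = kernel(x, x) + np.eye(len(x)) / beta        # K + beta^{-1} I
alpha = np.linalg.solve(C_N, t)

x_star = np.linspace(0, 1, 5)
k_star = kernel(x, x_star)                        # N x N*
mean = k_star.T @ alpha
var = (kernel(x_star, x_star).diagonal() + 1.0 / beta
       - np.einsum('ij,ji->i', k_star.T, np.linalg.solve(C_N, k_star)))
print(np.round(mean, 3), np.round(np.sqrt(var), 3))
```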
6. Learning the kernel hyper-parameters of a Gaussian process
Rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data, e.g.
  k(x_n, x_m) = θ_0 exp{-(θ_1/2) ‖x_n - x_m‖²} + θ_2 + θ_3 x_n^T x_m.
Since p(t | θ) is Gaussian, the log marginal likelihood ln p(t | θ) can be written down and maximized with respect to θ (type-2 maximum likelihood); alternatively a prior p(θ) can be introduced and θ marginalized.
7. Gaussian processes for classification
For a binary target, p(t | a) = σ(a)^t (1 - σ(a))^{1-t}, where a(x) is given a Gaussian process prior,
  p(a_{N+1}) = N(a_{N+1} | 0, C_{N+1}),
with covariance built from the kernel plus a jitter term,
  C(x_n, x_m) = k(x_n, x_m) + ν δ_{nm},  ν > 0,
which ensures C is positive definite. The predictive distribution is
  p(t_{N+1} = 1 | t_N) = ∫ p(t_{N+1} = 1 | a_{N+1}) p(a_{N+1} | t_N) da_{N+1} = ∫ σ(a_{N+1}) p(a_{N+1} | t_N) da_{N+1},
with
  p(a_{N+1} | t_N) = ∫ p(a_{N+1} | a_N) p(a_N | t_N) da_N.
The posterior p(a_N | t_N) is non-Gaussian and is approximated, e.g. by the Laplace approximation; the kernel (model) parameters are then set by maximizing the approximate marginal likelihood.
8. Useful Gaussian identities
Frequently needed results: the marginal and conditional distributions of a joint Gaussian, the convolution of two Gaussians, and the convolution of a logistic sigmoid with a Gaussian (evaluated by approximating the logistic with the probit function, i.e. the Gaussian CDF, whose convolution with a Gaussian is available in closed form).
Convolution of two Gaussians, by marginalization: introduce x and integrate it out,
  ∫ N(y - x; 0, Σ_1) N(x; μ_2, Σ_2) dx = (N(0, Σ_1) * N(μ_2, Σ_2))(y) = N(y; μ_2, Σ_1 + Σ_2).
Equivalently, via characteristic functions: the characteristic function of N(μ, Σ) is exp{i t^T μ - (1/2) t^T Σ t}, so the product of the characteristic functions of N(μ_1, Σ_1) and N(μ_2, Σ_2) is exp{i t^T (μ_1 + μ_2) - (1/2) t^T (Σ_1 + Σ_2) t}, the characteristic function of N(μ_1 + μ_2, Σ_1 + Σ_2); hence the convolution N(μ_1, Σ_1) * N(μ_2, Σ_2) = N(μ_1 + μ_2, Σ_1 + Σ_2).
Chapter 7 Sparse Kernel Machines

2. SVM overview
The SVM is a discriminant function: it maps inputs directly to a class label and provides no posterior probabilities (unlike the RVM, which is a discriminative probabilistic model). The SVM is a sparse model: predictions depend only on a subset of the training data, the support vectors, whereas a Gaussian process needs all the training data to make a prediction. SVMs are used for classification, regression and novelty detection.
3. SVM modelling (linearly separable case)
Given N data points {x_n, t_n} with t_n ∈ {-1, +1}:
(1) Model: y(x) = w^T φ(x) + b, with decision boundary y(x) = 0. If the data are linearly separable, every point satisfies t_n y(x_n) > 0, and its distance from the boundary is t_n y(x_n)/‖w‖. The SVM chooses the boundary that maximizes the margin, i.e. the distance to the closest point:
  argmax_{w,b} { (1/‖w‖) min_n [ t_n y(x_n) ] }  s.t.  t_n y(x_n) > 0, n = 1, ..., N.
(2) Canonical form: the margin is unchanged under the rescaling w → κw, b → κb, so the scale can be fixed by requiring t_n y(x_n) = 1 for the closest points; the constraints then become t_n y(x_n) ≥ 1 and maximizing 1/‖w‖ is equivalent to
  argmin_{w,b} (1/2)‖w‖²  s.t.  t_n y(x_n) ≥ 1, n = 1, ..., N,
a quadratic programming problem with inequality constraints.
(3) Dual problem: introduce the Lagrange function
  L(w, b, a) = (1/2)‖w‖² - Σ_{n=1}^{N} a_n { t_n (w^T φ(x_n) + b) - 1 },
with Lagrange multipliers a_n ≥ 0. Setting the derivatives of L(w, b, a) with respect to w and b to zero gives
  w = Σ_n a_n t_n φ(x_n),  Σ_n a_n t_n = 0,
and substituting back yields the dual problem in a alone:
  maximize  L̃(a) = Σ_n a_n - (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m)
  s.t.  a_n ≥ 0, n = 1, ..., N,  Σ_n a_n t_n = 0.
Solving for a gives w = Σ_n a_n t_n φ(x_n) and hence y(x) = w^T φ(x) + b.
(4) KKT conditions and prediction: the solution satisfies
  a_n ≥ 0,  t_n y(x_n) - 1 ≥ 0,  a_n { t_n y(x_n) - 1 } = 0,
so for every data point either a_n = 0 or t_n y(x_n) = 1. Prediction uses
  y(x) = Σ_n a_n t_n k(x, x_n) + b,
to which only the points with a_n ≠ 0 (the support vectors) contribute.
4. Overlapping classes (soft margin)
Allow margin violations with slack variables ξ_n ≥ 0 and minimize C Σ_n ξ_n + (1/2)‖w‖². The dual Lagrangian has the same form,
  L̃(a) = Σ_n a_n - (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m),
but with box constraints 0 ≤ a_n ≤ C, n = 1, ..., N, and Σ_n a_n t_n = 0. Points with a_n = 0 do not contribute to the prediction; points with a_n > 0 are support vectors. If 0 < a_n < C then ξ_n = 0 and x_n lies exactly on the margin; if a_n = C then x_n lies inside the margin and may be correctly classified or misclassified.
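A minimal sketch of this sparsity in practice, assuming scikit-learn is available (the two-blob data, kernel and C value are illustrative assumptions): only the support vectors are retained for prediction.

```python
# Soft-margin kernel SVM: the fitted model stores only the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = np.vstack([rng.normal([-1, -1], 0.7, (100, 2)), rng.normal([1, 1], 0.7, (100, 2))])
t = np.array([-1] * 100 + [+1] * 100)

clf = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X, t)
print("training points:", len(X), " support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, t))
```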
5. Multiclass SVM
How best to extend the SVM to more than two classes remains, in practice, an open problem; heuristic combinations of two-class SVMs are commonly used.
6. Single-class SVM
An unsupervised learning problem related to probability density estimation: rather than modelling the full density of the data, the goal is to find a smooth boundary enclosing a region of high density, chosen so that a fixed fraction of the data (between 0 and 1) falls outside the region. The computation is carried out in feature space.
7. SVM for regression
A first attempt requires every target to lie within ε of the prediction,
  min (1/2)‖w‖²  s.t.  |y(x_n) - t_n| < ε, n = 1, ..., N,
the analogue of hard-margin modelling: (1) a solution need not exist, and (2) controlling complexity/overfitting is difficult. Instead, use the ε-insensitive error function (the ε-tube)
  E_ε(y(x) - t) = 0 if |y(x) - t| < ε, and |y(x) - t| - ε otherwise,
and solve (optimization problem II)
  min C Σ_{n=1}^{N} E_ε(y(x_n) - t_n) + (1/2)‖w‖²,  C > 0.
Introducing slack variables ξ_n ≥ 0 and ξ̂_n ≥ 0 for the two sides of the tube,
  t_n ≤ y(x_n) + ε + ξ_n,  t_n ≥ y(x_n) - ε - ξ̂_n,
gives the equivalent optimization problem III:
  min C Σ_{n=1}^{N} (ξ_n + ξ̂_n) + (1/2)‖w‖²
  s.t.  t_n ≤ y(x_n) + ε + ξ_n,  t_n ≥ y(x_n) - ε - ξ̂_n,  ξ_n ≥ 0,  ξ̂_n ≥ 0,  n = 1, ..., N,
solved by introducing Lagrange multipliers for the constraints and using the KKT conditions. Problems II and III are equivalent: at the optimum ξ_n (resp. ξ̂_n) equals the ε-insensitive error on the corresponding side and is 0 when the point lies inside the tube, so the penalty terms of II and III coincide.
8. Relevance vector machine (RVM)
For regression,
  y(x, w) = w^T φ(x),  p(t | x, w, β) = N(t | y(x, w), β^{-1}),
with a separate precision hyper-parameter for each weight,
  p(w | α) = Π_{i=1}^{M} N(w_i | 0, α_i^{-1}).
(With a single shared precision, p(w | α) = N(w | 0, α^{-1} I), this would be ordinary Bayesian linear regression; the individual α_i are the key to the sparsity of the RVM.)
As a Bayesian model, the RVM would ideally marginalize over everything to obtain the predictive distribution; in practice the hyper-parameters are set by maximizing the marginal likelihood. Given N inputs X = {x_n} and targets t = (t_1, ..., t_N)^T,
  p(t | X, α, β) = ∫ p(t | X, w, β) p(w | α) dw,
where p(t | X, w, β) is the likelihood; this is maximized with respect to α and β, giving α*, β*. Many α_i are driven to infinity, so the corresponding weights concentrate at zero and the associated basis functions are pruned, producing a sparse model. For a new x the predictive distribution is
  p(t | x, X, t, α*, β*) = ∫ p(t | x, w, β*) p(w | X, t, α*, β*) dw,
where the posterior p(w | X, t, α*, β*) over w is Gaussian.
For classification, y(x, w) = σ(w^T φ(x)) = p(C_1 | x, w), again with prior p(w | α) = Π_i N(w_i | 0, α_i^{-1}), and the Laplace approximation is used.
9. Model comparison: computational cost
- SVM (frequentist sparse model): roughly O(N²) training (approximate)
- Gaussian process: O(N³) training, O(N²) per predictive distribution
- Frequentist linear regression: O(M²(N + M))
- Bayesian linear regression: O(M²(N + M)) training, O(M²) per predictive distribution
- RVM (Bayesian sparse model)
- Frequentist logistic regression / Bayesian logistic regression
- Frequentist neural network / Bayesian neural network
where N = number of training data points, M = number of basis functions, V = number of support vectors.
Chapter 8 Graphical Models

1. Axes for characterizing models
Parametric vs. non-parametric; frequentist vs. Bayesian; discriminative vs. generative.
2. Joint distribution and graph
A PGM associates a joint distribution P with a graph G. Three questions arise:
(1) Which conditional-independence statements can be read off G? For directed graphs they are given by d-separation, for undirected graphs by graph separation; this defines the set I(G), with I_{d-sep}(G) = I(G) and I_{u-sep}(G) = I(G).
(2) Do the statements in I(G) hold in P?
(3) Does G capture all independencies that hold in P?
For (2) and (3): in general we only require I(G) ⊆ I(P), i.e. every independence asserted by G holds in P, in which case G is called an I-map of P; P may satisfy additional independencies that G does not express.
3. Factorization and I-maps
Given a distribution P and a graph G, G is an I-map of P if I(G) ⊆ I(P).
Factorization ↔ I-map:
- Bayesian network: P factorizes over G ⇔ G is an I-map of P.
- Markov network: if P is a Gibbs distribution that factorizes over G, then G is an I-map of P; conversely, if P is a positive distribution and G is an I-map of P, then P is a Gibbs distribution that factorizes over G (Hammersley-Clifford).
Factorization over a Bayesian network G: p(x) = Π_k p(x_k | pa_k), a product over the K variables of the conditional of each variable given its parents.
Factorization over a Markov network G:
  p(x) = (1/Z) Π_C ψ_C(X_C),  ψ_C(X_C) ≥ 0,
a normalized product of potential functions over the cliques C of G.
Conditional independence and factorization are thus two equivalent views of the graph structure.
4. From a joint distribution to a graph
(1) Given P, find a graph G that is an I-map of P. The complete graph is a trivial I-map (it asserts no independencies, I(G) = ∅), so we look for a minimal I-map: an I-map from which no edge can be removed without destroying the I-map property.
(2) Constructing a minimal I-map over a set of random variables:
- Markov network: for every pair x_i, x_j, test whether x_i ⊥ x_j | x \ {x_i, x_j} holds in the joint distribution, and add an edge when it does not; this requires on the order of C(K, 2) tests for K random variables.
- Bayesian network: fix an ordering of the variables; for each x_i find a minimal subset of {x_1, ..., x_{i-1}} such that p(x_i | x_1, ..., x_{i-1}) equals p(x_i given that subset), and make it the parent set of x_i. Different orderings give different minimal I-maps.
5. Local and global independencies of a graph
Bayesian network G:
- Local independencies I_ℓ(G): each x_k is independent of its non-descendants given its parents pa_k.
- Global independencies I(G): all statements implied by d-separation.
- The two are equivalent: I_ℓ(G) and I(G) impose the same constraints, and both hold in every joint distribution P that factorizes over G.
Markov network G:
- Pairwise independence I_p(G): non-adjacent x_i, x_j are independent given all remaining variables.
- Local (Markov-blanket) independence I_ℓ(G): each x_k is independent of the rest given its Markov blanket (its neighbours).
- Global independence I(G): X ⊥ Y | Z whenever Z separates X from Y in G, i.e. every path from X to Y passes through Z.
- As sets of statements I_p(G) ⊆ I_ℓ(G) ⊆ I(G); a distribution satisfying the global property satisfies the others, and for a positive distribution P the three properties are equivalent.
6. Why probabilistic graphical models
Working directly with the joint distribution of many random variables is infeasible: the direct representation of the joint distribution grows exponentially with the number of variables. A PGM encodes the joint distribution compactly through a graph, and inference (computing marginals and conditionals) can then be carried out by working on the graph.
Chapter 9 Mixture Models and EM

1. The EM algorithm in general
With observed data X, latent variables Z and parameters θ, the observed-data likelihood is
  p(X | θ) = Σ_Z p(X, Z | θ);
maximizing p(X | θ) directly is hard, whereas the complete-data likelihood p(X, Z | θ) is easy to work with, which motivates EM. For any distribution q(Z) over the latent variables,
  ln p(X | θ) = L(q, θ) + KL(q ‖ p),
where
  L(q, θ) = Σ_Z q(Z) ln{ p(X, Z | θ) / q(Z) },
  KL(q ‖ p) = -Σ_Z q(Z) ln{ p(Z | X, θ) / q(Z) } ≥ 0,
so L(q, θ) is a lower bound on ln p(X | θ).
EM alternates:
(1) E step: with θ fixed at θ^{old}, set q(Z) = p(Z | X, θ^{old}); this makes KL(q‖p) = 0, so the bound touches ln p(X | θ^{old}).
(2) M step: with q(Z) fixed, maximize L(q, θ) with respect to θ to obtain θ^{new}.
The E step leaves ln p(X | θ) unchanged while raising the bound to meet it, and the M step increases the bound, so ln p(X | θ^{new}) ≥ ln p(X | θ^{old}); EM increases the likelihood at every iteration until it reaches a (local) maximum.
In the M step, substituting q(Z) = p(Z | X, θ^{old}) into L(q, θ) gives
  L(q, θ) = Σ_Z p(Z | X, θ^{old}) ln p(X, Z | θ) - Σ_Z p(Z | X, θ^{old}) ln p(Z | X, θ^{old}) = Q(θ, θ^{old}) + const,
where
  Q(θ, θ^{old}) = Σ_Z p(Z | X, θ^{old}) ln p(X, Z | θ)
is the expectation of the complete-data log likelihood ln p(X, Z | θ) under the posterior over Z. So the E step evaluates what is needed to form Q(θ, θ^{old}), and the M step maximizes Q(θ, θ^{old}) with respect to θ.
2. Gaussian mixture model (GMM) and EM
  p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k),
which is not a member of the exponential family. Introduce a latent 1-of-K random vector z with
  p(z) = Π_{k=1}^{K} π_k^{z_k},  0 ≤ π_k ≤ 1,  Σ_k π_k = 1,
  p(x | z_k = 1) = N(x | μ_k, Σ_k),  i.e.  p(x | z) = Π_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k},
so that
  p(x) = Σ_z p(z) p(x | z) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k),
with parameters π = {π_k}, μ = {μ_k}, Σ = {Σ_k}.
EM maximizes p(X | π, μ, Σ). The observed-data log likelihood,
  ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln{ Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) },
contains a log of a sum, whereas the complete-data likelihood
  p(X, Z | π, μ, Σ) = Π_{n=1}^{N} Π_{k=1}^{K} π_k^{z_{nk}} N(x_n | μ_k, Σ_k)^{z_{nk}}
has a log that is a simple sum,
  ln p(X, Z | π, μ, Σ) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_{nk} { ln π_k + ln N(x_n | μ_k, Σ_k) }.
The expected complete-data log likelihood under the posterior over Z is
  E_Z[ln p(X, Z | π, μ, Σ)] = Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) { ln π_k + ln N(x_n | μ_k, Σ_k) },
where the responsibilities are
  γ(z_{nk}) = E[z_{nk}] = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j).
M step: maximizing the expected complete-data log likelihood with respect to π, μ, Σ gives
  μ_k = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) x_n,
  Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) (x_n - μ_k)(x_n - μ_k)^T,
  π_k = N_k / N,  with  N_k = Σ_{n=1}^{N} γ(z_{nk}).
E step: given the current parameters π^{old}, μ^{old}, Σ^{old}, evaluate the responsibilities
  γ(z_{nk}) = p(z_{nk} = 1 | x_n) = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j),
the posterior probability that component k generated x_n. EM alternates the two steps until the parameters (or the log likelihood) converge.
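A minimal NumPy sketch of these E and M steps for a one-dimensional mixture of two Gaussians (the synthetic data and the initialization are illustrative assumptions):

```python
# EM for a 1-D Gaussian mixture: E step computes responsibilities, M step re-estimates parameters.
import numpy as np

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(1.5, 1.0, 250)])
K, N = 2, len(x)

pi = np.full(K, 1.0 / K)
mu = rng.choice(x, K)
var = np.full(K, x.var())

def normal(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for it in range(100):
    # E step: gamma_nk = pi_k N(x_n | mu_k, var_k) / sum_j pi_j N(x_n | mu_j, var_j)
    dens = pi[None, :] * normal(x[:, None], mu[None, :], var[None, :])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate mixing coefficients, means and variances
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
    pi = Nk / N

print("pi :", np.round(pi, 3))
print("mu :", np.round(mu, 3))
print("var:", np.round(var, 3))
```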
3. GMM and k-means
k-means can be derived as a limiting case of EM for a GMM. Clustering with k-means makes a hard assignment of data points to clusters, whereas the GMM makes a soft assignment through the responsibilities γ(z_{nk}). Take shared isotropic covariances εI; then
  γ(z_{nk}) = π_k exp{-‖x_n - μ_k‖²/(2ε)} / Σ_{j=1}^{K} π_j exp{-‖x_n - μ_j‖²/(2ε)}.
As ε → 0, the term with the smallest ‖x_n - μ_{j*}‖² dominates the denominator, so
  lim_{ε→0} γ(z_{nk}) = 1 if k = j* and 0 otherwise,
i.e. a hard assignment; the EM update for μ_k then reduces to the k-means update (the mean of the points assigned to cluster k). In this limit the expected complete-data log likelihood becomes
  E_Z[ln p(X, Z | μ, Σ, π)] → -(1/2ε) Σ_{n=1}^{N} Σ_{k=1}^{K} r_{nk} ‖x_n - μ_k‖² + const,
so maximizing it is equivalent to minimizing the k-means distortion measure.
Chapter 10 Approximate Inference

1. Variational inference
In the fully Bayesian setting all unknowns (the parameters as well as the latent variables) are random variables, collected into Z; the observed data are X. As in EM, for any q(Z),
  ln p(X) = L(q) + KL(q ‖ p),
with
  L(q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ,
  KL(q ‖ p) = -∫ q(Z) ln{ p(Z | X) / q(Z) } dZ.
If q(Z) could be set equal to the true posterior p(Z | X), the KL term would vanish; but p(Z | X) is intractable, so q(Z) is restricted to a tractable family and KL(q‖p) is minimized within that family, which (since ln p(X) does not depend on q) is the same as maximizing the lower bound L(q), built from the tractable joint distribution p(X, Z). The family must be tractable yet flexible.
2. Factorized (mean-field) approximation
  q(Z) = Π_{i=1}^{M} q_i(Z_i),
where the Z_i are disjoint groups of variables. Maximizing L(q) with respect to one factor q_j while holding the others fixed:
  L(q) = ∫ Π_i q_i { ln p(X, Z) - Σ_i ln q_i } dZ
       = ∫ q_j ln p̃(X, Z_j) dZ_j - ∫ q_j ln q_j dZ_j + const
       = -KL(q_j ‖ p̃(X, Z_j)) + const,
where ln p̃(X, Z_j) = E_{i≠j}[ln p(X, Z)] + const and E_{i≠j} denotes the expectation over Π_{i≠j} q_i(Z_i). Hence the optimal factor is
  ln q_j*(Z_j) = E_{i≠j}[ln p(X, Z)] + const,
and the factors are updated in turn, each using the current values of the other M-1 factors, until convergence.
3. Variational Bayesian GMM
The likelihood is p(X | Z, μ, Λ) = Π_n Π_k N(x_n | μ_k, Λ_k^{-1})^{z_{nk}}, and the Bayesian model places priors on everything:
  p(Z | π) = Π_n Π_k π_k^{z_{nk}},
  p(π) = Dir(π | α_0),  α_0 = (α_0, ..., α_0),
  p(μ, Λ) = p(μ | Λ) p(Λ) = Π_{k=1}^{K} N(μ_k | m_0, (β_0 Λ_k)^{-1}) W(Λ_k | W_0, ν_0)  (one Gaussian-Wishart factor per component),
so the joint distribution is
  p(X, Z, π, μ, Λ) = p(X | Z, μ, Λ) p(Z | π) p(π) p(μ | Λ) p(Λ).
The exact posterior p(Z, π, μ, Λ | X) is intractable, so variational inference uses a variational distribution that factorizes between the latent variables and the parameters (all treated as random variables),
  q(Z, π, μ, Λ) = q(Z) q(π, μ, Λ).
Updating q(Z): maximizing L(q) gives
  ln q*(Z) = E_{π,μ,Λ}[ln p(X, Z, π, μ, Λ)] + const = E_π[ln p(Z | π)] + E_{μ,Λ}[ln p(X | Z, μ, Λ)] + const
           = Σ_{n=1}^{N} Σ_{k=1}^{K} z_{nk} ln ρ_{nk} + const,
with
  ln ρ_{nk} = E[ln π_k] + (1/2) E[ln |Λ_k|] - (D/2) ln(2π) - (1/2) E_{μ_k,Λ_k}[(x_n - μ_k)^T Λ_k (x_n - μ_k)],
so q*(Z) has the same functional form as p(Z | π), with responsibilities proportional to ρ_{nk}.
Updating q(π, μ, Λ): maximizing L(q) gives
  ln q*(π, μ, Λ) = ln p(π) + Σ_{k=1}^{K} ln p(μ_k, Λ_k) + E_Z[ln p(Z | π)] + Σ_n Σ_k E[z_{nk}] ln N(x_n | μ_k, Λ_k^{-1}) + const,
which factorizes further,
  q(Z, π, μ, Λ) = q(Z) q(π, μ, Λ) = q(Z) q(π) q(μ, Λ) = q(Z) q(π) Π_{k=1}^{K} q(μ_k, Λ_k),
with q(π) a Dirichlet and each q(μ_k, Λ_k) a Gaussian-Wishart, approximating p(Z, π, μ, Λ | X).
A useful property of the Bayesian GMM: with K chosen larger than necessary, the variational treatment automatically prunes surplus mixture components; the expected mixing coefficients E[π_k] of unused components are driven towards 0, so the effective number of components K* is determined by the data.
Predictive distribution: in the Bayesian treatment, the prediction for a new observation x̂ is
  p(x̂ | X) = Σ_{ẑ} ∫∫∫ p(x̂ | ẑ, μ, Λ) p(ẑ | π) p(π, μ, Λ | X) dπ dμ dΛ,
approximated by replacing the posterior p(π, μ, Λ | X) with q*(π, μ, Λ); the result p(x̂ | X) is a mixture of K Student-t distributions.
4. Expectation propagation (EP)
EP minimizes the reverse KL divergence KL(p ‖ q) with q restricted to the exponential family; then
  KL(p ‖ q) = -ln g(η) - η^T E_{p(z)}[u(z)] + const,
as a function of the natural parameters η, and minimizing it amounts to moment matching: the expected sufficient statistics under q are set equal to those under p.
Setting: the joint distribution of data and parameters is a product of factors f_i(θ),
  p(D, θ) = Π_i f_i(θ),  p(θ | D) = (1/p(D)) Π_i f_i(θ),  p(D) = ∫ Π_i f_i(θ) dθ
(for an i.i.d. model, f_i(θ) = p(x_i | θ) and f_0(θ) = p(θ)). EP approximates the posterior p(θ | D) by
  q(θ) = (1/Z) Π_i f̃_i(θ),
where each approximate factor f̃_i(θ) lies in the exponential family, so q(θ) does too. The factors are refined in turn; to update f̃_j(θ):
(1) remove it from the approximation, q^{\j}(θ) = q(θ) / f̃_j(θ);
(2) combine it with the true factor and normalize, (1/Z_j) f_j(θ) q^{\j}(θ), with Z_j = ∫ f_j(θ) q^{\j}(θ) dθ;
(3) find the revised q^{new}(θ) by minimizing KL( (1/Z_j) f_j(θ) q^{\j}(θ) ‖ q^{new}(θ) ) with respect to the natural parameters, i.e. by moment matching;
(4) set the revised factor to f̃_j(θ) = K q^{new}(θ) / q^{\j}(θ), with K = Z_j.
Iterate over the factors until convergence.
Chapter 11 Sampling Methods

1. The transformation method
To sample from a distribution with CDF F(x): draw y uniformly from (0, 1) and return F^{-1}(y); the result is distributed according to F(x).
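A minimal sketch of this inverse-CDF recipe for an exponential distribution (the rate λ is an arbitrary illustrative choice), where F^{-1}(y) = -ln(1 - y)/λ:

```python
# Transformation method: sample Exp(lam) from uniform draws via the inverse CDF.
import numpy as np

rng = np.random.default_rng(9)
lam = 2.0
y = rng.uniform(0.0, 1.0, 100000)
z = -np.log(1.0 - y) / lam
print("sample mean:", z.mean(), " (theory 1/lam =", 1.0 / lam, ")")
```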
2. Proposal distributions
When p(z) is hard to sample from directly, introduce a simpler proposal distribution q(z).
Rejection sampling (single variable): suppose p(z) = (1/Z_p) p̃(z), where p̃(z) can be evaluated at any z but the normalizer Z_p is unknown; e.g. the Gamma distribution Gam(z | a, b) = b^a z^{a-1} exp(-bz) / Γ(a). Choose a constant k such that k q(z) ≥ p̃(z) for all z. Draw z_0 from q(z) and u_0 uniformly from [0, k q(z_0)]; reject the sample if u_0 > p̃(z_0), otherwise accept it. The accepted values are distributed according to p(z). The method becomes inefficient when k q(z) is a poor envelope (high rejection rate), especially in high dimensions.
Importance sampling: estimates expectations E[f] under p(z) without sampling from p(z) itself (and does not produce samples from p(z)). If we could draw L samples from p(z) we would use E[f] ≈ (1/L) Σ_l f(z^{(l)}); instead draw the samples {z^{(l)}} from a proposal q(z):
  E[f] = ∫ p(z) f(z) dz = ∫ (p(z)/q(z)) f(z) q(z) dz ≈ (1/L) Σ_{l=1}^{L} (p(z^{(l)})/q(z^{(l)})) f(z^{(l)}),
where r_l = p(z^{(l)})/q(z^{(l)}) are the importance weights.
If p(z) = p̃(z)/Z_p and q(z) = q̃(z)/Z_q can only be evaluated up to their normalizers,
  E[f] = ∫ p(z) f(z) dz = (Z_q/Z_p) ∫ (p̃(z)/q̃(z)) f(z) q(z) dz ≈ (Z_q/Z_p) (1/L) Σ_l r̃_l f(z^{(l)}),  r̃_l = p̃(z^{(l)})/q̃(z^{(l)}),
and the ratio of normalizers can be estimated from the same samples,
  Z_p/Z_q = (1/Z_q) ∫ p̃(z) dz = ∫ (p̃(z)/q̃(z)) q(z) dz ≈ (1/L) Σ_l r̃_l,
so finally
  E[f] ≈ Σ_{l=1}^{L} w_l f(z^{(l)}),  w_l = r̃_l / Σ_m r̃_m = (p̃(z^{(l)})/q̃(z^{(l)})) / Σ_m p̃(z^{(m)})/q̃(z^{(m)}).
The weights w_l correct for the fact that the samples z^{(l)} come from q(z) rather than from p(z).
Sampling-importance-resampling (SIR): rejection sampling needs the constant k, and importance sampling does not produce samples from p(z). SIR also uses a proposal distribution q(z) but avoids choosing k: draw L samples {z^{(l)}: l = 1, ..., L} from q(z), compute the importance weights w_l as above, then resample L values from {z^{(l)}} with probabilities w_l. As L → ∞ the resampled values are distributed according to p(z).
3. Monte Carlo EM
When the E step of EM is intractable, Q(θ, θ^{old}) = Σ_Z p(Z | X, θ^{old}) ln p(X, Z | θ) can be approximated by sampling: draw latent-variable samples {Z^{(l)}: l = 1, ..., L} from p(Z | X, θ^{old}) and use
  Q̃(θ, θ^{old}) = (1/L) Σ_l ln p(Z^{(l)}, X | θ),
then carry out the M step by maximizing Q̃(θ, θ^{old}) with respect to θ.
4. Markov chains
A Markov chain is a series of random variables {z^{(m)}: m = 1, ..., M} in which each variable depends only on its predecessor, with transition probabilities
  T_m(z^{(m)}, z^{(m+1)}) = p(z^{(m+1)} | z^{(m)}).
A distribution p*(z) is invariant for the chain if it is unchanged by one transition step; a sufficient condition is detailed balance,
  p*(z) T(z, z') = p*(z') T(z', z).
MCMC methods construct a Markov chain whose invariant distribution is the target p(z).
5. Metropolis-Hastings
Unlike rejection sampling and importance sampling, Metropolis-Hastings only needs p(z) up to a normalizer, p̃(z). At step τ, draw a candidate z* from the proposal q(z | z^{(τ)}) and accept it with probability
  A(z*, z^{(τ)}) = min{ 1, p̃(z*) q(z^{(τ)} | z*) / ( p̃(z^{(τ)}) q(z* | z^{(τ)}) ) };
then
  z^{(τ+1)} = z* if accepted, and z^{(τ+1)} = z^{(τ)} if rejected.
The sequence {z^{(τ)}: τ = 1, ...} forms a Markov chain whose invariant distribution is p(z).
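A minimal NumPy sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal (the unnormalized two-component target, step size and burn-in are illustrative assumptions):

```python
# Random-walk Metropolis-Hastings targeting an unnormalized density p_tilde(z).
import numpy as np

rng = np.random.default_rng(10)
p_tilde = lambda z: np.exp(-0.5 * (z + 2) ** 2) + 0.5 * np.exp(-0.5 * ((z - 2) / 0.7) ** 2)

z, step, samples = 0.0, 1.0, []
for tau in range(20000):
    z_star = z + step * rng.standard_normal()             # symmetric proposal q(z*|z)
    A = min(1.0, p_tilde(z_star) / p_tilde(z))            # acceptance probability
    if rng.uniform() < A:
        z = z_star                                        # accept; otherwise keep the old z
    samples.append(z)

samples = np.array(samples[2000:])                        # discard burn-in
print("mean:", samples.mean().round(3), " std:", samples.std().round(3))
```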
6. Gibbs sampling
Gibbs sampling is a special case of Metropolis-Hastings for sampling from p(z) = p(z_1, ..., z_M). Cycle through the variables {z_i: i = 1, ..., M} (or pick them at random), at each step replacing one variable by a sample from its conditional given all the others:
  sample z_1^{(τ+1)} ~ p(z_1 | z_2^{(τ)}, ..., z_M^{(τ)}),
  sample z_2^{(τ+1)} ~ p(z_2 | z_1^{(τ+1)}, z_3^{(τ)}, ..., z_M^{(τ)}),
  ...
  sample z_M^{(τ+1)} ~ p(z_M | z_1^{(τ+1)}, ..., z_{M-1}^{(τ+1)}).
Viewed as Metropolis-Hastings with proposal q(z* | z) = p(z_k* | z_{\k}) and z_{\k}* = z_{\k}, the acceptance probability is
  A(z*, z) = p(z*) q(z | z*) / ( p(z) q(z* | z) ) = 1,
so every step is accepted.
Chapter 12 Continuous Latent Variables

1. Principal component analysis (PCA)
Maximum-variance formulation: project the data onto a direction u_1; the variance of the projected data is
  (1/N) Σ_{n=1}^{N} { u_1^T x_n - u_1^T x̄ }² = u_1^T S u_1,
where x̄ = (1/N) Σ_n x_n and S is the sample covariance matrix
  S = (1/N) Σ_{n=1}^{N} (x_n - x̄)(x_n - x̄)^T.
Maximizing the projected variance subject to ‖u_1‖ = 1 gives the leading eigenvector of S, and the first M eigenvectors span the principal subspace. Equivalently, PCA minimizes the average squared projection (reconstruction) error (1/N) Σ_n ‖x_n - x̃_n‖², where x̃_n is the projection of x_n onto the subspace.
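A minimal NumPy sketch of this eigen-decomposition view of PCA (the synthetic 3-D Gaussian data and the choice M = 2 are illustrative assumptions):

```python
# PCA: eigen-decomposition of the sample covariance S, projection onto the top-M eigenvectors.
import numpy as np

rng = np.random.default_rng(11)
X = rng.multivariate_normal([0, 0, 0], [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=500)

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)              # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]
M = 2
U = eigvecs[:, order[:M]]                          # principal directions u_1, ..., u_M
Z = (X - x_bar) @ U                                # projected data

print("explained variance ratio:", np.round(eigvals[order[:M]] / eigvals.sum(), 3))
```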
2. Probabilistic PCA (PPCA)
A latent variable z spans the principal-component subspace, and an observed data point x is generated by
  p(z) = N(z | 0, I),
  p(x | z) = N(x | Wz + μ, σ²I),
where the columns of W span the principal-component subspace. This is a linear-Gaussian model, so
  p(x) = N(x | μ, C),  C = W W^T + σ²I  (D×D),
  p(z | x) = N(z | M^{-1} W^T (x - μ), σ² M^{-1}),  M = W^T W + σ²I  (M×M).
p(x) is unchanged under a rotation of the latent space: for any orthogonal matrix R, replacing z by Rz (equivalently W by W̃ = WR) gives W̃W̃^T = W R R^T W^T = W W^T, so the solution for W is not unique. The parameters have a closed-form solution (via the eigen-decomposition of S) and can also be found by EM.
3. Factor analysis
The same model as PPCA except that the output noise covariance is a general diagonal matrix Ψ:
  p(z) = N(z | 0, I),  p(x | z) = N(x | Wz + μ, Ψ).
Factor analysis uses the diagonal matrix Ψ where PPCA uses the isotropic matrix σ²I.
4. Kernel PCA
Standard PCA projects onto the principal-component subspace with a linear projection; kernel PCA makes the projection nonlinear in the original space. Map each data point with a feature map φ(x) into an M-dimensional feature space and perform PCA there. Assuming first that the mapped data have zero mean, Σ_n φ(x_n) = 0, the sample covariance in feature space is
  C = (1/N) Σ_{n=1}^{N} φ(x_n) φ(x_n)^T,
and PCA requires its eigenvectors v_i. These can be written as combinations of the φ(x_n), which leads to an eigenvector equation for the kernel matrix K_{nm} = φ(x_n)^T φ(x_m) = k(x_n, x_m); the projection of a point is then
  y_i(x) = φ(x)^T v_i = Σ_{n=1}^{N} a_{in} k(x, x_n).
When Σ_n φ(x_n) ≠ 0 the feature vectors must first be centralized, φ̃(x_n) = φ(x_n) - (1/N) Σ_l φ(x_l), and the centred Gram matrix K̃_{nm} = φ̃(x_n)^T φ̃(x_m), expressible in terms of K_{nm}, is used in the eigenvector equation.
5. Nonlinear latent variable models
Independent component analysis uses a factorized non-Gaussian latent distribution p(z) = Π_j p(z_j). An autoassociative neural network is trained to reproduce its input at its output by minimizing the sum-of-squares error
  E(w) = (1/2) Σ_n ‖ y(x_n, w) - x_n ‖²;
with a narrow hidden layer it performs a (possibly nonlinear) dimensionality reduction.
Chapter 13 Sequential Data

1. Hidden Markov models (HMM)
The latent variables z_n use 1-of-K coding. Transition probabilities:
  p(z_n | z_{n-1}, A) = Π_{k=1}^{K} Π_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}};
initial state distribution:
  p(z_1 | π) = Π_{k=1}^{K} π_k^{z_{1k}},  Σ_k π_k = 1,  0 ≤ π_k ≤ 1;
emission probabilities:
  p(x_n | z_n, φ) = Π_{k=1}^{K} p(x_n | φ_k)^{z_{nk}}.
The joint distribution is
  p(X, Z | θ) = p(z_1 | π) [ Π_{n=2}^{N} p(z_n | z_{n-1}, A) ] [ Π_{m=1}^{N} p(x_m | z_m, φ) ],
with X = {x_n: n = 1..N}, Z = {z_n: n = 1..N} and θ = {π, A, φ} the HMM parameters.
HMM learning (maximum likelihood, EM): given observations X = {x_n: n = 1..N}, learn θ = {π, A, φ}. Because of the latent variables, the observation likelihood
  p(X | θ) = Σ_Z p(X, Z | θ)
cannot be maximized directly, so EM is used. The E step requires the posterior marginals
  γ(z_n) = p(z_n | X, θ^{old}),  ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, θ^{old}),
computed by the forward-backward algorithm; since the HMM unrolls into a tree-structured factor graph, this is an instance of the sum-product algorithm on the PGM.
HMM inference (Viterbi): besides marginals and conditionals, a common inference problem for HMMs is the most probable sequence of latent states,
  Z* = arg max_Z p(Z | X, θ),
given the observations X = {x_n: n = 1..N}; it is solved by the Viterbi algorithm, an instance of the max-sum algorithm on the PGM.
2. Linear dynamical systems (LDS)
The LDS has the same PGM as the HMM, but its latent variables are continuous and the transition and emission probabilities are linear-Gaussian:
  p(z_n | z_{n-1}) = N(z_n | A z_{n-1}, Γ),
  p(x_n | z_n) = N(x_n | C z_n, Σ),
  p(z_1) = N(z_1 | μ_0, V_0).
Learning is again carried out by EM.
Two recurring themes across these models: kernelize (replace inner products with kernel functions) and probabilize (recast a model in probabilistic form).
Chapter 14 Combining Models

3. Decision trees
A tree partitions the input space into regions R_τ and fits a simple model in each. For regression, the prediction in region R_τ is the mean y_τ = (1/N_τ) Σ_{x_n ∈ R_τ} t_n, and the quality of the region is measured by the sum-of-squares error
  Q_τ(T) = Σ_{x_n ∈ R_τ} { t_n - y_τ }².
For classification, with p_{τk} the proportion of points of class k in region R_τ, common measures are the cross-entropy
  Q_τ(T) = -Σ_k p_{τk} ln p_{τk}
and the Gini index
  Q_τ(T) = Σ_k p_{τk} (1 - p_{τk}).
Trees are grown greedily and then pruned by minimizing
  C(T) = Σ_{τ=1}^{|T|} Q_τ(T) + λ|T|,
which trades residual error against tree size.
4. Mixture models
Unconditional mixture model, e.g. the Gaussian mixture model (GMM):
  p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k),  parameters π, μ, Σ.
Conditional mixture model, e.g. a mixture of linear regression models:
  p(t | x, θ) = Σ_{k=1}^{K} π_k N(t | w_k^T φ(x), β^{-1}),  θ = {W, π, β},
trained by EM. In the mixture-of-experts model the mixing coefficients themselves depend on the input, p(t | x) = Σ_k π_k(x) p_k(t | x), where each p_k(t | x) is an expert and the π_k(x) are gating functions.
Boosting and AdaBoost
Boosting combines M base models y_m(x) trained in sequence, each on a weighted version of the data set in which the weight w_n^{(m)} of each data point depends on the performance of the previous model y_{m-1}(x); initially w_n^{(1)} = 1/N. In AdaBoost, with targets t_n ∈ {-1, +1}, base classifier y_m(x) is trained to minimize the weighted error
  J_m = Σ_{n=1}^{N} w_n^{(m)} I(y_m(x_n) ≠ t_n),
its weighted error rate is
  ε_m = Σ_n w_n^{(m)} I(y_m(x_n) ≠ t_n) / Σ_n w_n^{(m)},
its voting coefficient is α_m = ln{(1 - ε_m)/ε_m}, and the data weights are updated by w_n^{(m+1)} = w_n^{(m)} exp{α_m I(y_m(x_n) ≠ t_n)}, so misclassified data points receive larger weight before the next model is trained. The combined model is
  Y(x) = sign( Σ_{m=1}^{M} α_m y_m(x) ).
AdaBoost can be interpreted as the sequential optimization of an exponential error function over {x_n, t_n},
  E = Σ_{n=1}^{N} exp{ -t_n f_M(x_n) },  f_M(x) = (1/2) Σ_{m=1}^{M} α_m y_m(x),
where at each stage only the newest α_m and y_m(x) are optimized while the earlier terms of the combined model are held fixed.
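A minimal NumPy sketch of AdaBoost with decision stumps on one-dimensional data (the data, the stump threshold grid and the number of rounds are illustrative assumptions):

```python
# AdaBoost with decision stumps: weighted error, alpha_m = ln((1-eps)/eps), re-weighting.
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(-3, 3, 200)
t = np.where(np.abs(x) < 1.5, 1, -1)                       # labels not separable by one stump

w = np.full(len(x), 1.0 / len(x))
stumps = []                                                 # (threshold, sign, alpha)
for m in range(20):
    best = None
    for thr in np.linspace(-3, 3, 61):
        for s in (+1, -1):
            pred = s * np.where(x > thr, 1, -1)
            eps = w[pred != t].sum()
            if best is None or eps < best[0]:
                best = (eps, thr, s)
    eps, thr, s = best
    eps = max(eps, 1e-10)                                   # guard against a perfect stump
    alpha = np.log((1 - eps) / eps)
    pred = s * np.where(x > thr, 1, -1)
    w *= np.exp(alpha * (pred != t))                        # up-weight misclassified points
    w /= w.sum()
    stumps.append((thr, s, alpha))

F = sum(a * s * np.where(x > thr, 1, -1) for thr, s, a in stumps)
print("training accuracy:", np.mean(np.sign(F) == t))
```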