
Model Assessment & Selection
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Outline
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods
Bias, Variance & Model Complexity

[Figure: behavior of training and test error as model complexity varies]

Bias, Variance & Model Complexity
• The standard for model assessment: the generalization performance of a learning method
  – Model: $X \to Y$, with $Y = f(X) + \varepsilon$
  – Prediction model: $\hat{f}(X)$
  – Loss function:
    $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$  (squared error)
    $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$  (absolute error)
Bias, Variance & Model Complexity
• Errors: training error and generalization error
  Training error: $\overline{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$
  Generalization error: $Err = E[L(Y, \hat{f}(X))]$
• Typical loss functions for a categorical response $G$:
  0-1 loss: $L(G, \hat{G}(X)) = I(G \neq \hat{G}(X))$
  log-likelihood: $L(G, \hat{p}(X)) = -2 \sum_{k=1}^{K} I(G = k) \log \hat{p}_k(X) = -2 \log \hat{p}_G(X)$
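As a concrete illustration, a minimal sketch of these two classification losses (assuming Python/NumPy; `g` holds integer class labels in {0, ..., K-1}, `g_hat` predicted labels, and `p_hat` an N x K matrix of fitted class probabilities):

```python
import numpy as np

def zero_one_loss(g, g_hat):
    """0-1 loss: I(G != Ghat(X)), one value per observation."""
    return (np.asarray(g) != np.asarray(g_hat)).astype(float)

def log_likelihood_loss(g, p_hat):
    """-2 log p_hat_G(X): the log-likelihood (deviance) loss per observation."""
    p = np.clip(p_hat, 1e-12, None)              # guard against log(0)
    return -2.0 * np.log(p[np.arange(len(g)), g])
```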
Bias-Variance Decomposition
• Basic model: $Y = f(X) + \varepsilon$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$
• The expected prediction error of a regression fit $\hat{f}(X)$ at $X = x_0$:
  $Err(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0]$
  $\qquad = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2$
  $\qquad = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$
  $\qquad = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$
• The more complex the model, the lower the (squared) bias but the higher the variance.
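To make the decomposition concrete, here is a minimal Monte Carlo sketch (assuming Python/NumPy; the sine target, noise level, and cubic polynomial fit are illustrative assumptions, not from the slides). It refits on fresh training sets and splits the error at a point x0 into the three terms:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # assumed true regression function
sigma = 0.3                                 # noise sd; sigma^2 is irreducible
x0, N, reps, degree = 0.5, 50, 2000, 3      # test point, sample size, repeats

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, N)
    y = f(x) + rng.normal(0, sigma, N)      # Y = f(X) + eps
    coef = np.polyfit(x, y, degree)         # fit a cubic polynomial
    preds[r] = np.polyval(coef, x0)         # f_hat(x0) for this training set

bias2 = (preds.mean() - f(x0)) ** 2         # [E f_hat(x0) - f(x0)]^2
var = preds.var()                           # E[f_hat(x0) - E f_hat(x0)]^2
print(f"sigma^2={sigma**2:.4f}  bias^2={bias2:.5f}  var={var:.5f}")
print(f"Err(x0) ~ {sigma**2 + bias2 + var:.4f}")
```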
Bias-Variance Decomposition
• For the k-NN regression fit, the prediction error is
  $Err(x_0) = E[(Y - \hat{f}_k(x_0))^2 \mid X = x_0]$
  $\qquad = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k} \sum_{j=1}^{k} f(x_{(j)})\Big]^2 + \sigma_\varepsilon^2 / k$
• For the linear model fit $\hat{f}_p(x) = \hat{\beta}^T x$:
  $Err(x_0) = E[(Y - \hat{f}_p(x_0))^2 \mid X = x_0]$
  $\qquad = \sigma_\varepsilon^2 + [f(x_0) - E\hat{f}_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2$
  where $h(x_0) = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} x_0$ is the vector of weights that produces the fit at $x_0$.
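The variance term $\sigma_\varepsilon^2 / k$ above is easy to check numerically, since the k-NN fit averages the k nearest noisy responses (a small NumPy sketch; sigma and k are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, k, reps = 0.5, 7, 100_000
# Holding the neighbor locations fixed, only the noise in the k responses
# varies; the k-NN prediction is their average, so its variance is sigma^2/k.
noise = rng.normal(0, sigma, size=(reps, k))
knn_preds = noise.mean(axis=1)
print(knn_preds.var(), sigma**2 / k)        # both approximately 0.0357
```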

Bias-Variance Decomposition
• The average (in-sample) error of the linear model:
  $\frac{1}{N} \sum_{i=1}^{N} Err(x_i) = \sigma_\varepsilon^2 + \frac{1}{N} \sum_{i=1}^{N} [f(x_i) - E\hat{f}(x_i)]^2 + \frac{p}{N} \sigma_\varepsilon^2$
  – The model complexity is directly related to the number of parameters p.
• For ridge regression, the averaged squared bias decomposes as
  $E_{x_0}[f(x_0) - E\hat{f}_\alpha(x_0)]^2 = E_{x_0}[f(x_0) - \beta_*^T x_0]^2 + E_{x_0}[\beta_*^T x_0 - E\hat{\beta}_\alpha^T x_0]^2$
  $\qquad = [\text{Average Model Bias}]^2 + [\text{Average Estimation Bias}]^2$
  where $\beta_*$ denotes the best-fitting linear approximation to f.

Bias-Variance Decomposition
• Schematic of the behavior of bias and variance:
  [Figure: the truth and its closest fit in population lie in MODEL SPACE; a regularized fit lies in a RESTRICTED MODEL SPACE. The gaps mark the model bias, the estimation bias, and the estimation variance around a realization of the closest fit.]
Optimism of the Training Error Rate
• Training error < true error
  Training error: $\overline{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$
  True error: $Err = E[L(Y, \hat{f}(X))]$
• Err is the extra-sample error.
• The in-sample error:
  $Err_{in} = \frac{1}{N} \sum_{i=1}^{N} E_{\mathbf{y}} E_{Y^{new}} L(Y_i^{new}, \hat{f}(x_i))$
• Optimism: $op \equiv Err_{in} - E_{\mathbf{y}}[\overline{err}]$
Optimism of the Training Error Rate
• For squared error, 0-1, and other loss functions, quite generally:
  $op = \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$
  $Err_{in} = E_{\mathbf{y}}[\overline{err}] + \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$
• Optimism grows as the number of inputs or basis functions increases, and shrinks as the number of training samples increases.
• If $\hat{y}_i$ is obtained by a linear fit with d inputs or basis functions, a simplification is:
  $\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = d\, \sigma_\varepsilon^2$, so
  $Err_{in} = E_{\mathbf{y}}[\overline{err}] + 2 \frac{d}{N} \sigma_\varepsilon^2$
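A simulation sketch of this simplification (assuming Python/NumPy; the Gaussian design, d = 5, and sigma = 1 are arbitrary). It estimates the sum of covariances for a least-squares fit and compares it with d times the noise variance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 100, 5, 1.0
X = rng.normal(size=(N, d))                  # fixed design with d inputs
f_true = X @ rng.normal(size=d)              # true mean function (linear here)

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix: y_hat = H y
reps = 5000
ys = f_true + rng.normal(0, sigma, size=(reps, N))  # fresh noise per repeat
yhats = ys @ H.T                             # least-squares fits, row by row

# Monte Carlo estimate of sum_i Cov(yhat_i, y_i); theory gives d * sigma^2
cov_sum = sum(np.cov(yhats[:, i], ys[:, i])[0, 1] for i in range(N))
print(f"sum of covariances ~ {cov_sum:.2f}; d * sigma^2 = {d * sigma**2}")
```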
Estimates of In-Sample Prediction Error
• The general form of the in-sample estimate is
  $\widehat{Err}_{in} = \overline{err} + \hat{op}$
• When d parameters are fit under squared error loss:
  $C_p$ statistic: $C_p = \overline{err} + 2 \frac{d}{N} \hat{\sigma}_\varepsilon^2$
• A log-likelihood function can be used to estimate $Err_{in}$: as $N \to \infty$,
  $-2\, E[\log \mathrm{Pr}_{\hat{\theta}}(Y)] \approx -\frac{2}{N} E[\mathrm{loglik}] + 2 \frac{d}{N}$, where $\mathrm{loglik} = \sum_{i=1}^{N} \log \mathrm{Pr}_{\hat{\theta}}(y_i)$
  – This relationship leads to the Akaike Information Criterion.
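A direct transcription of the Cp formula (a sketch assuming NumPy; sigma2_hat must be supplied, typically the mean squared error of a low-bias, full model):

```python
import numpy as np

def cp_statistic(y, y_hat, d, sigma2_hat):
    """Cp = training error + 2*(d/N)*sigma2_hat, for squared error loss."""
    N = len(y)
    err_bar = np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)
    return err_bar + 2 * d / N * sigma2_hat
```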
Akaike Information Criterion
• The Akaike Information Criterion is a similar but more generally applicable estimate of $Err_{in}$.
• For a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$:
  $AIC(\alpha) = \overline{err}(\alpha) + 2 \frac{d(\alpha)}{N} \hat{\sigma}_\varepsilon^2$
  $\overline{err}(\alpha)$: the training error; $d(\alpha)$: the number of parameters
• $AIC(\alpha)$ provides an estimate of the test error curve; we choose the tuning parameter $\hat{\alpha}$ that minimizes it, giving the final model $f_{\hat{\alpha}}(x)$.
Akaike Information Criterion
• For the logistic regression model, using the binomial log-likelihood:
  $AIC = -\frac{2}{N} E[\mathrm{loglik}] + 2 \frac{d}{N}$
• For the Gaussian model, the AIC statistic equals the $C_p$ statistic:
  $AIC = C_p = \overline{err} + 2 \frac{d}{N} \hat{\sigma}_\varepsilon^2$
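A sketch of the binomial-log-likelihood version (assuming NumPy; y is binary in {0, 1} and p_hat holds fitted class-1 probabilities from a logistic model, clipped to avoid log(0)):

```python
import numpy as np

def aic_binomial(y, p_hat, d):
    """AIC = -(2/N) * loglik + 2*d/N under the binomial log-likelihood."""
    p = np.clip(p_hat, 1e-12, 1 - 1e-12)     # numerical guard
    N = len(y)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -2.0 * loglik / N + 2.0 * d / N
```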

Akaike Information Criterion
• Phoneme recognition example: the coefficient function is expanded in M basis functions,
  $\beta(f) = \sum_{k=1}^{M} \theta_k h_k(f), \qquad d(\alpha) = M$
  and AIC is used to choose M.
Effective number of parameters
• A linear fitting method:
  $\hat{\mathbf{y}} = \mathbf{S} \mathbf{y}$, where $\mathbf{S}$ is an $N \times N$ matrix depending on the inputs $x_i$ but not on the $y_i$
• Effective number of parameters:
  $d(\mathbf{S}) = \mathrm{trace}(\mathbf{S})$
  – If $\mathbf{S}$ is an orthogonal projection matrix onto a basis set spanned by M features, then $\mathrm{trace}(\mathbf{S}) = M$.
  – $\mathrm{trace}(\mathbf{S})$ is the correct quantity to replace d in the $C_p$ statistic.
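For example, ridge regression is a linear fitting method with smoother matrix $\mathbf{S} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T$; a NumPy sketch of its effective degrees of freedom (the random design here is an arbitrary assumption):

```python
import numpy as np

def ridge_effective_df(X, lam):
    """d(S) = trace(S), with S = X (X'X + lam*I)^{-1} X' for ridge regression."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return float(np.trace(S))

X = np.random.default_rng(2).normal(size=(100, 10))
print(ridge_effective_df(X, 0.0))     # ~10: projection, trace(S) = p
print(ridge_effective_df(X, 50.0))    # < 10: shrinkage lowers the effective df
```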
Bayesian Approach & BIC
• The Bayesian Information Criterion (BIC):
  $BIC = -2\, \mathrm{loglik} + (\log N)\, d$
• Gaussian model:
  – variance $\sigma_\varepsilon^2$ assumed known
  – then $-2\, \mathrm{loglik} = \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2 / \sigma_\varepsilon^2 = N \cdot \overline{err} / \sigma_\varepsilon^2$ (up to a constant)
  – so $BIC = \frac{N}{\sigma_\varepsilon^2} \Big[\overline{err} + (\log N) \frac{d}{N} \sigma_\varepsilon^2\Big]$
  – BIC is proportional to AIC ($C_p$), with the factor 2 replaced by $\log N$
  – For $N > e^2 \approx 7.4$, BIC penalizes complex models more heavily and tends to select simpler ones.
Bayesian Model Selection
• BIC is derived from Bayesian model selection.
• Candidate models $M_m$ with parameters $\theta_m$ and a prior distribution $\mathrm{Pr}(\theta_m \mid M_m)$.
• Posterior probability:
  $\mathrm{Pr}(M_m \mid \mathbf{Z}) \propto \mathrm{Pr}(M_m) \cdot \mathrm{Pr}(\mathbf{Z} \mid M_m)$
  $\qquad \propto \mathrm{Pr}(M_m) \int \mathrm{Pr}(\mathbf{Z} \mid \theta_m, M_m)\, \mathrm{Pr}(\theta_m \mid M_m)\, d\theta_m$
  – $\mathbf{Z}$ represents the training data $\{x_i, y_i\}_1^N$
Bayesian Model Selection
• Compare two models $M_m$ and $M_\ell$:
  $\dfrac{\mathrm{Pr}(M_m \mid \mathbf{Z})}{\mathrm{Pr}(M_\ell \mid \mathbf{Z})} = \dfrac{\mathrm{Pr}(M_m)}{\mathrm{Pr}(M_\ell)} \cdot \dfrac{\mathrm{Pr}(\mathbf{Z} \mid M_m)}{\mathrm{Pr}(\mathbf{Z} \mid M_\ell)}$
• If the posterior odds are greater than 1, choose model m; otherwise choose model $\ell$.
• Bayes factor:
  $BF(\mathbf{Z}) = \dfrac{\mathrm{Pr}(\mathbf{Z} \mid M_m)}{\mathrm{Pr}(\mathbf{Z} \mid M_\ell)}$
  – The contribution of the data to the posterior odds.
Bayesian Model Selection
• If the prior over models is uniform, $\mathrm{Pr}(M_m)$ is constant, and a Laplace approximation gives
  $\log \mathrm{Pr}(\mathbf{Z} \mid M_m) = \log \mathrm{Pr}(\mathbf{Z} \mid \hat{\theta}_m, M_m) - \frac{d_m}{2} \log N + O(1)$
  where $\hat{\theta}_m$ is the maximum likelihood estimate of the model parameters and $d_m$ is the model dimension.
• This yields $BIC = -2 \log \mathrm{Pr}(\mathbf{Z} \mid \hat{\theta}_m, M_m) + d_m \log N$, so minimizing BIC is equivalent to maximizing the (approximate) posterior model probability, as the sketch below illustrates.
• Advantage: if the set of candidate models contains the true model, the probability that BIC selects the correct one tends to one as the sample size tends to infinity.
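Given BIC values for a set of candidate models, approximate posterior probabilities follow from $e^{-BIC_m / 2}$; a small sketch (NumPy; the BIC values in the example are hypothetical):

```python
import numpy as np

def posterior_from_bic(bic):
    """Approximate Pr(M_m | Z) = exp(-BIC_m/2) / sum_l exp(-BIC_l/2)."""
    bic = np.asarray(bic, dtype=float)
    w = np.exp(-0.5 * (bic - bic.min()))   # shift by the min for stability
    return w / w.sum()

print(posterior_from_bic([210.3, 208.1, 215.7]))   # hypothetical BIC values
```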
Minimum Description Length (MDL)
• Origin: optimal coding
• Messages: z1 z2 z3 z4
• Code 1:   0  10  110  111
• Code 2:  110  10  111  0
• Principle: use the shortest codes for the most frequently sent messages
• Probability of sending message $z_i$: $\mathrm{Pr}(z_i)$
• Shannon's theorem says to use code lengths $l_i = -\log_2 \mathrm{Pr}(z_i)$

Minimum Description Length (MDL)
• The expected message length satisfies
  $E(\text{Length}) \ge -\sum_i \mathrm{Pr}(z_i) \log_2 \mathrm{Pr}(z_i)$
  with equality when $p_i = A^{-l_i}$.
• Example: $\mathrm{Pr}(z_i) = 1/2,\ 1/4,\ 1/8,\ 1/8$ gives lengths 1, 2, 3, 3, which is exactly Code 1 above.
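A two-line check of Shannon's code lengths for the probabilities above (NumPy sketch):

```python
import numpy as np

probs = np.array([1/2, 1/4, 1/8, 1/8])
lengths = -np.log2(probs)                       # -> [1, 2, 3, 3], i.e. Code 1
print(lengths, "expected length:", probs @ lengths, "bits")   # 1.75 bits
```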

Model Selection via MDL
• Model M with parameters $\theta$; input/output data $\mathbf{Z} = (\mathbf{X}, \mathbf{y})$
• Assume the conditional probability of the outputs under the model is $\mathrm{Pr}(\mathbf{y} \mid \theta, M, \mathbf{X})$. The message length is
  $\text{length} = -\log \mathrm{Pr}(\mathbf{y} \mid \theta, M, \mathbf{X}) - \log \mathrm{Pr}(\theta \mid M)$
  – $-\log \mathrm{Pr}(\mathbf{y} \mid \theta, M, \mathbf{X})$: the cost of transmitting the discrepancy between the model and the actual target values
  – $-\log \mathrm{Pr}(\theta \mid M)$: the average code length for transmitting the model parameters

Model Selection via MDL
• If $y \sim N(\theta, \sigma^2)$ with parameter $\theta \sim N(0, 1)$, then
  $\text{Length} = \text{const} + \log \sigma + \frac{(y - \theta)^2}{2\sigma^2} + \frac{\theta^2}{2}$
• A smaller $\sigma$ gives a shorter message length.
• MDL principle: choose the model (M with parameters $\theta$) that minimizes the total description length
  $\text{length} = -\log \mathrm{Pr}(\mathbf{y} \mid \theta, M, \mathbf{X}) - \log \mathrm{Pr}(\theta \mid M)$

Vapnik-Chervonenkis Dimension
• Question: how should we choose the number of parameters d of a model?
• This number is meant to measure the complexity of the model.
• The VC dimension is an important index of model complexity.
• Consider a class of indicator functions $\{f(x, \alpha)\}$, $x \in \mathbb{R}^p$, with parameter vector $\alpha$.
• Example: $\alpha = (\alpha_0, \alpha_1)$, linear indicator functions $f = I(\alpha_0 + \alpha_1^T x > 0)$; complexity: p + 1 parameters.
• But for $x \in \mathbb{R}$, what about $f(x, \alpha) = I(\sin(\alpha x) > 0)$? Here f has only a single parameter, yet it turns out to be far more complex.
VC Dimension
• The VC dimension of the class $\{f(x, \alpha)\}$ is defined as the largest number of points (in some configuration) that can be shattered by members of $\{f(x, \alpha)\}$.
• The class of lines in the plane has VC dimension 3.
• $\sin(\alpha x)$ has infinite VC dimension.
  [Figure: a rapidly oscillating $\sin(\alpha x)$ on [0, 1], taking values in [-1, 1]; a suitable frequency $\alpha$ can shatter arbitrarily many points.]

VC Dimension
• The VC dimension of a class of real-valued functions $\{g(x, \alpha)\}$ is defined as the VC dimension of the indicator class $\{I(g(x, \alpha) - \beta > 0)\}$.
• The VC dimension provides a bound on the generalization error.
• Let h be the VC dimension of $\{g(x, \alpha)\}$ and N the sample size. With probability at least $1 - \eta$:
  $Err \le \overline{err} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \frac{4\, \overline{err}}{\epsilon}}\Big)$  (binary classification)
  $Err \le \dfrac{\overline{err}}{(1 - c\sqrt{\epsilon})_+}$  (regression)
  where $\epsilon = a_1 \dfrac{h[\log(a_2 N / h) + 1] - \log(\eta / 4)}{N}$
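A sketch of the classification bound (assuming NumPy; a1 = 4 and a2 = 2 are worst-case constant choices, eta is the confidence level, and the numbers in the example are hypothetical):

```python
import numpy as np

def vc_bound_classification(err_bar, h, N, eta=0.05, a1=4.0, a2=2.0):
    """Upper bound on Err from training error err_bar and VC dimension h."""
    eps = a1 * (h * (np.log(a2 * N / h) + 1) - np.log(eta / 4)) / N
    return err_bar + eps / 2 * (1 + np.sqrt(1 + 4 * err_bar / eps))

print(vc_bound_classification(err_bar=0.10, h=10, N=10_000))  # ~0.18
```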
Cross-Validation
• Split the data into K roughly equal parts (here K = 5):

  1      2      3           4      5
  Train  Train  Validation  Train  Train

• For each part k, fit the model to the other K - 1 parts and compute its prediction error on part k; average over k = 1, ..., K, as in the sketch below.
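A minimal K-fold cross-validation sketch (assuming Python/NumPy; the least-squares fitter and squared-error loss are stand-ins for any fit/predict/loss triple):

```python
import numpy as np

def k_fold_cv_error(X, y, K, fit, predict, loss):
    """Average validation loss over K folds."""
    N = len(y)
    perm = np.random.default_rng(0).permutation(N)   # shuffle before splitting
    folds = np.array_split(perm, K)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(perm, val_idx)      # the other K-1 parts
        model = fit(X[train_idx], y[train_idx])
        errs.append(np.mean(loss(y[val_idx], predict(model, X[val_idx]))))
    return float(np.mean(errs))

# Stand-ins: ordinary least squares with squared-error loss
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
loss = lambda y, yhat: (y - yhat) ** 2
```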

Bootstrap Methods
• Basic idea: randomly draw datasets with replacement from the training data, each the same size as the original training set.
• This produces B bootstrap datasets.
• How can we use these datasets for prediction-error estimation?
• If $\hat{f}^{*b}(x_i)$ is the predicted value at $x_i$ from the model fitted to the b-th bootstrap dataset, the bootstrap estimate of error is
  $Err_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i))$
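A sketch of Err_boot (assuming NumPy and the same generic fit/predict/loss stand-ins as in the cross-validation sketch):

```python
import numpy as np

def bootstrap_error(X, y, B, fit, predict, loss, seed=0):
    """Err_boot: average loss of B bootstrap fits, evaluated on the full data."""
    rng = np.random.default_rng(seed)
    N = len(y)
    total = 0.0
    for _ in range(B):
        idx = rng.integers(0, N, size=N)     # draw N indices with replacement
        model = fit(X[idx], y[idx])          # fit on the bootstrap dataset
        total += np.mean(loss(y, predict(model, X)))  # score on original data
    return total / B
```

Because each bootstrap sample shares observations with the evaluation set, Err_boot tends to be optimistic; leave-one-out variants score each observation only with fits that did not include it.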

Bootstrap Methods
• Schematic of the bootstrap process: repeated resampling experiments
  [Figure: bootstrap samples $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$ drawn from the training sample $Z = (z_1, z_2, \ldots, z_N)$, each yielding a statistic $S(Z^{*b})$.]

Summary
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods