
Model Assessment & Selection
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Outline
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods
Bias, Variance & Model Complexity

[Figure: behavior of training and test error as model complexity varies]

Bias, Variance & Model Complexity
• The standard for model assessment: the generalization performance of a learning method
  – Model: $X \to Y$, with $Y = f(X) + \varepsilon$
  – Prediction model: $\hat{f}(X)$
  – Loss function:
    $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$  (squared error)
    $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$  (absolute error)
Bias, Variance & Model Complexity
• Errors: training error and generalization error
  Training error: $\overline{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$
  Generalization error: $Err = E[L(Y, \hat{f}(X))]$
• Typical loss functions for a categorical response $G$:
  0-1 loss: $L(G, \hat{G}(X)) = I(G \neq \hat{G}(X))$
  log-likelihood: $L(G, \hat{p}(X)) = -2 \sum_{k=1}^{K} I(G = k) \log \hat{p}_k(X) = -2 \log \hat{p}_G(X)$
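As a concrete illustration, a minimal sketch of these two classification losses (assuming Python/NumPy; `g` holds integer class labels in {0, ..., K-1}, `g_hat` predicted labels, and `p_hat` an N x K matrix of fitted class probabilities):

```python
import numpy as np

def zero_one_loss(g, g_hat):
    """0-1 loss: I(G != Ghat(X)), one value per observation."""
    return (np.asarray(g) != np.asarray(g_hat)).astype(float)

def log_likelihood_loss(g, p_hat):
    """-2 log p_hat_G(X): the log-likelihood (deviance) loss per observation."""
    p = np.clip(p_hat, 1e-12, None)              # guard against log(0)
    return -2.0 * np.log(p[np.arange(len(g)), g])
```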
Bias-Variance Decomposition
• Basic model: $Y = f(X) + \varepsilon$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$
• The expected prediction error of a regression fit $\hat{f}(X)$ at $X = x_0$:
  $Err(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0]$
  $\qquad = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2$
  $\qquad = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$
  $\qquad = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$
• The more complex the model, the lower the (squared) bias but the higher the variance.
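To make the decomposition concrete, here is a minimal Monte Carlo sketch (assuming Python/NumPy; the sine target, noise level, and cubic polynomial fit are illustrative assumptions, not from the slides). It refits on fresh training sets and splits the error at a point x0 into the three terms:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # assumed true regression function
sigma = 0.3                                 # noise sd; sigma^2 is irreducible
x0, N, reps, degree = 0.5, 50, 2000, 3      # test point, sample size, repeats

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, N)
    y = f(x) + rng.normal(0, sigma, N)      # Y = f(X) + eps
    coef = np.polyfit(x, y, degree)         # fit a cubic polynomial
    preds[r] = np.polyval(coef, x0)         # f_hat(x0) for this training set

bias2 = (preds.mean() - f(x0)) ** 2         # [E f_hat(x0) - f(x0)]^2
var = preds.var()                           # E[f_hat(x0) - E f_hat(x0)]^2
print(f"sigma^2={sigma**2:.4f}  bias^2={bias2:.5f}  var={var:.5f}")
print(f"Err(x0) ~ {sigma**2 + bias2 + var:.4f}")
```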
Bias-Variance Decomposition
• For the k-NN regression fit, the prediction error is
  $Err(x_0) = E[(Y - \hat{f}_k(x_0))^2 \mid X = x_0]$
  $\qquad = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k} \sum_{j=1}^{k} f(x_{(j)})\Big]^2 + \sigma_\varepsilon^2 / k$
• For the linear model fit $\hat{f}_p(x) = \hat{\beta}^T x$:
  $Err(x_0) = E[(Y - \hat{f}_p(x_0))^2 \mid X = x_0]$
  $\qquad = \sigma_\varepsilon^2 + [f(x_0) - E\hat{f}_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2$
  where $h(x_0) = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} x_0$ is the vector of weights that produces the fit at $x_0$.
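The variance term $\sigma_\varepsilon^2 / k$ above is easy to check numerically, since the k-NN fit averages the k nearest noisy responses (a small NumPy sketch; sigma and k are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, k, reps = 0.5, 7, 100_000
# Holding the neighbor locations fixed, only the noise in the k responses
# varies; the k-NN prediction is their average, so its variance is sigma^2/k.
noise = rng.normal(0, sigma, size=(reps, k))
knn_preds = noise.mean(axis=1)
print(knn_preds.var(), sigma**2 / k)        # both approximately 0.0357
```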

Bias-Variance Decomposition
• The average (in-sample) error of the linear model:
  $\frac{1}{N} \sum_{i=1}^{N} Err(x_i) = \sigma_\varepsilon^2 + \frac{1}{N} \sum_{i=1}^{N} [f(x_i) - E\hat{f}(x_i)]^2 + \frac{p}{N} \sigma_\varepsilon^2$
  – The model complexity is directly related to the number of parameters p.
• For ridge regression, the averaged squared bias decomposes as
  $E_{x_0}[f(x_0) - E\hat{f}_\alpha(x_0)]^2 = E_{x_0}[f(x_0) - \beta_*^T x_0]^2 + E_{x_0}[\beta_*^T x_0 - E\hat{\beta}_\alpha^T x_0]^2$
  $\qquad = [\text{Average Model Bias}]^2 + [\text{Average Estimation Bias}]^2$
  where $\beta_*$ denotes the best-fitting linear approximation to f.

Bias-Variance Decomposition
• Schematic of the behavior of bias and variance:
  [Figure: the truth and its closest fit in population lie in MODEL SPACE; a regularized fit lies in a RESTRICTED MODEL SPACE. The gaps mark the model bias, the estimation bias, and the estimation variance around a realization of the closest fit.]
Optimism of the Training Error Rate
• Training error < true error
  Training error: $\overline{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$
  True error: $Err = E[L(Y, \hat{f}(X))]$
• Err is the extra-sample error.
• The in-sample error:
  $Err_{in} = \frac{1}{N} \sum_{i=1}^{N} E_{\mathbf{y}} E_{Y^{new}} L(Y_i^{new}, \hat{f}(x_i))$
• Optimism: $op \equiv Err_{in} - E_{\mathbf{y}}[\overline{err}]$
Optimism of the Training Error Rate
• For squared error, 0-1, and other loss functions, quite generally:
  $op = \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$
  $Err_{in} = E_{\mathbf{y}}[\overline{err}] + \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$
• Optimism grows as the number of inputs or basis functions increases, and shrinks as the number of training samples increases.
• If $\hat{y}_i$ is obtained by a linear fit with d inputs or basis functions, a simplification is:
  $\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = d\, \sigma_\varepsilon^2$, so
  $Err_{in} = E_{\mathbf{y}}[\overline{err}] + 2 \frac{d}{N} \sigma_\varepsilon^2$
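A simulation sketch of this simplification (assuming Python/NumPy; the Gaussian design, d = 5, and sigma = 1 are arbitrary). It estimates the sum of covariances for a least-squares fit and compares it with d times the noise variance:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 100, 5, 1.0
X = rng.normal(size=(N, d))                  # fixed design with d inputs
f_true = X @ rng.normal(size=d)              # true mean function (linear here)

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix: y_hat = H y
reps = 5000
ys = f_true + rng.normal(0, sigma, size=(reps, N))  # fresh noise per repeat
yhats = ys @ H.T                             # least-squares fits, row by row

# Monte Carlo estimate of sum_i Cov(yhat_i, y_i); theory gives d * sigma^2
cov_sum = sum(np.cov(yhats[:, i], ys[:, i])[0, 1] for i in range(N))
print(f"sum of covariances ~ {cov_sum:.2f}; d * sigma^2 = {d * sigma**2}")
```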
Estimates of In-Sample Prediction Error
• The general form of the in-sample estimate is
  $\widehat{Err}_{in} = \overline{err} + \hat{op}$
• When d parameters are fit under squared error loss:
  $C_p$ statistic: $C_p = \overline{err} + 2 \frac{d}{N} \hat{\sigma}_\varepsilon^2$
• A log-likelihood function can be used to estimate $Err_{in}$: as $N \to \infty$,
  $-2\, E[\log \mathrm{Pr}_{\hat{\theta}}(Y)] \approx -\frac{2}{N} E[\mathrm{loglik}] + 2 \frac{d}{N}$, where $\mathrm{loglik} = \sum_{i=1}^{N} \log \mathrm{Pr}_{\hat{\theta}}(y_i)$
  – This relationship leads to the Akaike Information Criterion.
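A direct transcription of the Cp formula (a sketch assuming NumPy; sigma2_hat must be supplied, typically the mean squared error of a low-bias, full model):

```python
import numpy as np

def cp_statistic(y, y_hat, d, sigma2_hat):
    """Cp = training error + 2*(d/N)*sigma2_hat, for squared error loss."""
    N = len(y)
    err_bar = np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)
    return err_bar + 2 * d / N * sigma2_hat
```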
Akaike Information Criterion
• The Akaike Information Criterion is a similar but more generally applicable estimate of $Err_{in}$.
• For a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$:
  $AIC(\alpha) = \overline{err}(\alpha) + 2 \frac{d(\alpha)}{N} \hat{\sigma}_\varepsilon^2$
  $\overline{err}(\alpha)$: the training error; $d(\alpha)$: the number of parameters
• $AIC(\alpha)$ provides an estimate of the test error curve; we choose the tuning parameter $\hat{\alpha}$ that minimizes it, giving the final model $f_{\hat{\alpha}}(x)$.
Akaike Information Criterion
• For the logistic regression model, using the binomial log-likelihood:
  $AIC = -\frac{2}{N} E[\mathrm{loglik}] + 2 \frac{d}{N}$
• For the Gaussian model, the AIC statistic equals the $C_p$ statistic:
  $AIC = C_p = \overline{err} + 2 \frac{d}{N} \hat{\sigma}_\varepsilon^2$
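A sketch of the binomial-log-likelihood version (assuming NumPy; y is binary in {0, 1} and p_hat holds fitted class-1 probabilities from a logistic model, clipped to avoid log(0)):

```python
import numpy as np

def aic_binomial(y, p_hat, d):
    """AIC = -(2/N) * loglik + 2*d/N under the binomial log-likelihood."""
    p = np.clip(p_hat, 1e-12, 1 - 1e-12)     # numerical guard
    N = len(y)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -2.0 * loglik / N + 2.0 * d / N
```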

Akaike Information Criterion
• Phoneme recognition example: the coefficient function is expanded in M basis functions,
  $\beta(f) = \sum_{k=1}^{M} \theta_k h_k(f), \qquad d(\alpha) = M$
  and AIC is used to choose M.
Effective number of parameters
• A linear fitting method:
  $\hat{\mathbf{y}} = \mathbf{S} \mathbf{y}$, where $\mathbf{S}$ is an $N \times N$ matrix depending on the inputs $x_i$ but not on the $y_i$
• Effective number of parameters:
  $d(\mathbf{S}) = \mathrm{trace}(\mathbf{S})$
  – If $\mathbf{S}$ is an orthogonal projection matrix onto a basis set spanned by M features, then $\mathrm{trace}(\mathbf{S}) = M$.
  – $\mathrm{trace}(\mathbf{S})$ is the correct quantity to replace d in the $C_p$ statistic.
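For example, ridge regression is a linear fitting method with smoother matrix $\mathbf{S} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T$; a NumPy sketch of its effective degrees of freedom (the random design here is an arbitrary assumption):

```python
import numpy as np

def ridge_effective_df(X, lam):
    """d(S) = trace(S), with S = X (X'X + lam*I)^{-1} X' for ridge regression."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return float(np.trace(S))

X = np.random.default_rng(2).normal(size=(100, 10))
print(ridge_effective_df(X, 0.0))     # ~10: projection, trace(S) = p
print(ridge_effective_df(X, 50.0))    # < 10: shrinkage lowers the effective df
```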
Bayesian Approach & BIC
• The Bayesian Information Criterion (BIC):
  $BIC = -2\, \mathrm{loglik} + (\log N)\, d$
• Gaussian model:
  – variance $\sigma_\varepsilon^2$ assumed known
  – then $-2\, \mathrm{loglik} = \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2 / \sigma_\varepsilon^2 = N \cdot \overline{err} / \sigma_\varepsilon^2$ (up to a constant)
  – so $BIC = \frac{N}{\sigma_\varepsilon^2} \Big[\overline{err} + (\log N) \frac{d}{N} \sigma_\varepsilon^2\Big]$
  – BIC is proportional to AIC ($C_p$), with the factor 2 replaced by $\log N$
  – For $N > e^2 \approx 7.4$, BIC penalizes complex models more heavily and tends to select simpler ones.
Bayesian Model Selection
• BIC is derived from Bayesian model selection.
• Candidate models $M_m$ with parameters $\theta_m$ and a prior distribution $\mathrm{Pr}(\theta_m \mid M_m)$.
• Posterior probability:
  $\mathrm{Pr}(M_m \mid \mathbf{Z}) \propto \mathrm{Pr}(M_m) \cdot \mathrm{Pr}(\mathbf{Z} \mid M_m)$
  $\qquad \propto \mathrm{Pr}(M_m) \int \mathrm{Pr}(\mathbf{Z} \mid \theta_m, M_m)\, \mathrm{Pr}(\theta_m \mid M_m)\, d\theta_m$
  – $\mathbf{Z}$ represents the training data $\{x_i, y_i\}_1^N$
Bayesian Model Selection
• Compare two models $M_m$ and $M_\ell$:
  $\dfrac{\mathrm{Pr}(M_m \mid \mathbf{Z})}{\mathrm{Pr}(M_\ell \mid \mathbf{Z})} = \dfrac{\mathrm{Pr}(M_m)}{\mathrm{Pr}(M_\ell)} \cdot \dfrac{\mathrm{Pr}(\mathbf{Z} \mid M_m)}{\mathrm{Pr}(\mathbf{Z} \mid M_\ell)}$
• If the posterior odds are greater than 1, choose model m; otherwise choose model $\ell$.
• Bayes factor:
  $BF(\mathbf{Z}) = \dfrac{\mathrm{Pr}(\mathbf{Z} \mid M_m)}{\mathrm{Pr}(\mathbf{Z} \mid M_\ell)}$
  – The contribution of the data to the posterior odds.
Bayesian Model Selection
• If the prior over models is uniform, $\mathrm{Pr}(M_m)$ is constant, and a Laplace approximation gives
  $\log \mathrm{Pr}(\mathbf{Z} \mid M_m) = \log \mathrm{Pr}(\mathbf{Z} \mid \hat{\theta}_m, M_m) - \frac{d_m}{2} \log N + O(1)$
  where $\hat{\theta}_m$ is the maximum likelihood estimate of the model parameters and $d_m$ is the model dimension.
• This yields $BIC = -2 \log \mathrm{Pr}(\mathbf{Z} \mid \hat{\theta}_m, M_m) + d_m \log N$, so minimizing BIC is equivalent to maximizing the (approximate) posterior model probability, as the sketch below illustrates.
• Advantage: if the set of candidate models contains the true model, the probability that BIC selects the correct one tends to one as the sample size tends to infinity.
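Given BIC values for a set of candidate models, approximate posterior probabilities follow from $e^{-BIC_m / 2}$; a small sketch (NumPy; the BIC values in the example are hypothetical):

```python
import numpy as np

def posterior_from_bic(bic):
    """Approximate Pr(M_m | Z) = exp(-BIC_m/2) / sum_l exp(-BIC_l/2)."""
    bic = np.asarray(bic, dtype=float)
    w = np.exp(-0.5 * (bic - bic.min()))   # shift by the min for stability
    return w / w.sum()

print(posterior_from_bic([210.3, 208.1, 215.7]))   # hypothetical BIC values
```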
Minimum Description Length (MDL)
• Origin: optimal coding
• Messages: z1 z2 z3 z4
• Code 1:   0  10  110  111
• Code 2:  110  10  111  0
• Principle: use the shortest codes for the most frequently sent messages
• Probability of sending message $z_i$: $\mathrm{Pr}(z_i)$
• Shannon's theorem says to use code lengths $l_i = -\log_2 \mathrm{Pr}(z_i)$

Minimum Description Length (MDL)
• The expected message length satisfies
  $E(\text{Length}) \ge -\sum_i \mathrm{Pr}(z_i) \log_2 \mathrm{Pr}(z_i)$
  with equality when $p_i = A^{-l_i}$.
• Example: $\mathrm{Pr}(z_i) = 1/2,\ 1/4,\ 1/8,\ 1/8$ gives lengths 1, 2, 3, 3, which is exactly Code 1 above.
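A two-line check of Shannon's code lengths for the probabilities above (NumPy sketch):

```python
import numpy as np

probs = np.array([1/2, 1/4, 1/8, 1/8])
lengths = -np.log2(probs)                       # -> [1, 2, 3, 3], i.e. Code 1
print(lengths, "expected length:", probs @ lengths, "bits")   # 1.75 bits
```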

Model Selection via MDL
• Model M with parameters $\theta$; input/output data $\mathbf{Z} = (\mathbf{X}, \mathbf{y})$
• Assume the conditional probability of the outputs under the model is $\mathrm{Pr}(\mathbf{y} \mid \theta, M, \mathbf{X})$. The message length is
  $\text{length} = -\log \mathrm{Pr}(\mathbf{y} \mid \theta, M, \mathbf{X}) - \log \mathrm{Pr}(\theta \mid M)$
  – $-\log \mathrm{Pr}(\mathbf{y} \mid \theta, M, \mathbf{X})$: the cost of transmitting the discrepancy between the model and the actual target values
  – $-\log \mathrm{Pr}(\theta \mid M)$: the average code length for transmitting the model parameters

Model Selection via MDL
• If $y \sim N(\theta, \sigma^2)$ with parameter $\theta \sim N(0, 1)$, then
  $\text{Length} = \text{const} + \log \sigma + \frac{(y - \theta)^2}{2\sigma^2} + \frac{\theta^2}{2}$
• A smaller $\sigma$ gives a shorter message length.
• MDL principle: choose the model (M with parameters $\theta$) that minimizes the total description length
  $\text{length} = -\log \mathrm{Pr}(\mathbf{y} \mid \theta, M, \mathbf{X}) - \log \mathrm{Pr}(\theta \mid M)$

Vapnik-Chervonenkis Dimension
• Question: how should we choose the number of parameters d of a model?
• This number is meant to measure the complexity of the model.
• The VC dimension is an important index of model complexity.
• Consider a class of indicator functions $\{f(x, \alpha)\}$, $x \in \mathbb{R}^p$, with parameter vector $\alpha$.
• Example: $\alpha = (\alpha_0, \alpha_1)$, linear indicator functions $f = I(\alpha_0 + \alpha_1^T x > 0)$; complexity: p + 1 parameters.
• But for $x \in \mathbb{R}$, what about $f(x, \alpha) = I(\sin(\alpha x) > 0)$? Here f has only a single parameter, yet it turns out to be far more complex.
VC Dimension
• The VC dimension of the class $\{f(x, \alpha)\}$ is defined as the largest number of points (in some configuration) that can be shattered by members of $\{f(x, \alpha)\}$.
• The class of lines in the plane has VC dimension 3.
• $\sin(\alpha x)$ has infinite VC dimension.
  [Figure: a rapidly oscillating $\sin(\alpha x)$ on [0, 1], taking values in [-1, 1]; a suitable frequency $\alpha$ can shatter arbitrarily many points.]

VC Dimension
• The VC dimension of a class of real-valued functions $\{g(x, \alpha)\}$ is defined as the VC dimension of the indicator class $\{I(g(x, \alpha) - \beta > 0)\}$.
• The VC dimension provides a bound on the generalization error.
• Let h be the VC dimension of $\{g(x, \alpha)\}$ and N the sample size. With probability at least $1 - \eta$:
  $Err \le \overline{err} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \frac{4\, \overline{err}}{\epsilon}}\Big)$  (binary classification)
  $Err \le \dfrac{\overline{err}}{(1 - c\sqrt{\epsilon})_+}$  (regression)
  where $\epsilon = a_1 \dfrac{h[\log(a_2 N / h) + 1] - \log(\eta / 4)}{N}$
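A sketch of the classification bound (assuming NumPy; a1 = 4 and a2 = 2 are worst-case constant choices, eta is the confidence level, and the numbers in the example are hypothetical):

```python
import numpy as np

def vc_bound_classification(err_bar, h, N, eta=0.05, a1=4.0, a2=2.0):
    """Upper bound on Err from training error err_bar and VC dimension h."""
    eps = a1 * (h * (np.log(a2 * N / h) + 1) - np.log(eta / 4)) / N
    return err_bar + eps / 2 * (1 + np.sqrt(1 + 4 * err_bar / eps))

print(vc_bound_classification(err_bar=0.10, h=10, N=10_000))  # ~0.18
```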
Cross-Validation
• Split the data into K roughly equal parts (here K = 5):

  1      2      3           4      5
  Train  Train  Validation  Train  Train

• For each part k, fit the model to the other K - 1 parts and compute its prediction error on part k; average over k = 1, ..., K, as in the sketch below.
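A minimal K-fold cross-validation sketch (assuming Python/NumPy; the least-squares fitter and squared-error loss are stand-ins for any fit/predict/loss triple):

```python
import numpy as np

def k_fold_cv_error(X, y, K, fit, predict, loss):
    """Average validation loss over K folds."""
    N = len(y)
    perm = np.random.default_rng(0).permutation(N)   # shuffle before splitting
    folds = np.array_split(perm, K)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(perm, val_idx)      # the other K-1 parts
        model = fit(X[train_idx], y[train_idx])
        errs.append(np.mean(loss(y[val_idx], predict(model, X[val_idx]))))
    return float(np.mean(errs))

# Stand-ins: ordinary least squares with squared-error loss
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
loss = lambda y, yhat: (y - yhat) ** 2
```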

Bootstrap Methods
• Basic idea: randomly draw datasets with replacement from the training data, each the same size as the original training set.
• This produces B bootstrap datasets.
• How can we use these datasets for prediction-error estimation?
• If $\hat{f}^{*b}(x_i)$ is the predicted value at $x_i$ from the model fitted to the b-th bootstrap dataset, the bootstrap estimate of error is
  $Err_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i))$
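A sketch of Err_boot (assuming NumPy and the same generic fit/predict/loss stand-ins as in the cross-validation sketch):

```python
import numpy as np

def bootstrap_error(X, y, B, fit, predict, loss, seed=0):
    """Err_boot: average loss of B bootstrap fits, evaluated on the full data."""
    rng = np.random.default_rng(seed)
    N = len(y)
    total = 0.0
    for _ in range(B):
        idx = rng.integers(0, N, size=N)     # draw N indices with replacement
        model = fit(X[idx], y[idx])          # fit on the bootstrap dataset
        total += np.mean(loss(y, predict(model, X)))  # score on original data
    return total / B
```

Because each bootstrap sample shares observations with the evaluation set, Err_boot tends to be optimistic; leave-one-out variants score each observation only with fits that did not include it.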

Bootstrap Methods
• Schematic of the bootstrap process: repeated resampling experiments
  [Figure: bootstrap samples $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$ drawn from the training sample $Z = (z_1, z_2, \ldots, z_N)$, each yielding a statistic $S(Z^{*b})$.]

Summary
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods