Model Assessment & Selection
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Outline
• Bias, Variance and Model Complexity
• The Bias-Variance Decomposition
• Optimism of the Training Error Rate
• Estimates of In-Sample Prediction Error
• The Effective Number of Parameters
• The Bayesian Approach and BIC
• Minimum Description Length
• Vapnik-Chervonenkis Dimension
• Cross-Validation
• Bootstrap Methods
2018/10/25 Model Assessment & Selection 2
Bias, Variance & Model Complexity
• Expected prediction error at x_0 under squared-error loss:

\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + [\mathrm{E}\hat f(x_0) - f(x_0)]^2 + \mathrm{E}[\hat f(x_0) - \mathrm{E}\hat f(x_0)]^2
\phantom{\mathrm{Err}(x_0)} = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0))
\phantom{\mathrm{Err}(x_0)} = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}

• For the k-nearest-neighbor fit:

\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{j=1}^{k} f(x_{(j)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}
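A quick Monte Carlo sketch of the k-NN decomposition above. The setup is hypothetical (true function f(x) = sin(2πx), σ_ε = 0.3, k = 5): refitting on many fresh samples lets us estimate the bias² and variance of f̂(x_0) at a point, and compare the variance against σ_ε²/k.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)   # assumed true regression function

sigma, k, x0, N, reps = 0.3, 5, 0.5, 100, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, N)
    y = f(x) + rng.normal(0, sigma, N)
    nn = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest neighbors of x0
    preds[r] = y[nn].mean()               # k-NN estimate f_hat(x0)

bias2 = (preds.mean() - f(x0)) ** 2       # squared bias at x0
var = preds.var()                         # variance at x0, roughly sigma^2 / k
print(bias2, var, sigma ** 2 / k)
```

The variance comes out close to σ_ε²/k = 0.018 (slightly larger, because the neighbor locations themselves vary between samples), while the bias at this point is tiny since f is nearly odd around x_0 = 0.5.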
• For the linear model fit \hat f_p(x) = \hat\beta^T x:

\mathrm{Err}(x_0) = E[(Y - \hat f_p(x_0))^2 \mid X = x_0]
\phantom{\mathrm{Err}(x_0)} = \sigma_\varepsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2

where h(x_0) = X(X^T X)^{-1} x_0 is the vector of linear weights producing \hat f_p(x_0). Averaged over the sample values x_i:

\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N}[f(x_i) - E\hat f(x_i)]^2 + \frac{p}{N}\,\sigma_\varepsilon^2
– The model complexity is directly related to the
number of parameters p.
• For ridge regression the squared bias decomposes as:

E_{x_0}\big[f(x_0) - E\hat f(x_0)\big]^2 = E_{x_0}\big[f(x_0) - \beta_*^T x_0\big]^2 + E_{x_0}\big[\beta_*^T x_0 - E\hat\beta^T x_0\big]^2

i.e. average [Model Bias]^2 plus average [Estimation Bias]^2, where \beta_* is the best-fitting linear approximation to f.
• Optimism of the training error, for d parameters under squared-error loss:

E_y(\mathrm{Err}_{in}) = E_y(\overline{\mathrm{err}}) + 2\,\frac{d}{N}\,\sigma_\varepsilon^2
Estimates of In-Sample Prediction Error
• The general form of the in-sample estimates is
\widehat{\mathrm{Err}}_{in} = \overline{\mathrm{err}} + \hat\omega
• When d parameters are fit under squared-error loss, the C_p statistic:

C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2
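A minimal numpy sketch of computing C_p for an ordinary least-squares fit. The data-generating setup is hypothetical; σ̂_ε² is estimated from the residuals of the full (low-bias) fit, as is customary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 5
X = rng.normal(size=(N, p))
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0])     # true coefficients (assumed)
y = X @ beta + rng.normal(0, 1.0, N)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
err_bar = np.mean(resid ** 2)                  # training error err-bar
sigma2_hat = resid @ resid / (N - p)           # unbiased noise-variance estimate
d = p                                          # number of fitted parameters
Cp = err_bar + 2 * d / N * sigma2_hat          # C_p = err_bar + 2 (d/N) sigma_hat^2
print(err_bar, Cp)
```

The 2(d/N)σ̂² term charges the training error for its optimism, so C_p always exceeds err-bar.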
• Use a log-likelihood function to estimate Err_in. As N \to \infty:

-2\,E[\log \Pr_{\hat\theta}(Y)] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}

where \mathrm{loglik} = \sum_{i=1}^{N} \log \Pr_{\hat\theta}(y_i).
• Example (basis-function expansion): the fitted coefficient function uses M basis functions,

\beta(f) = \sum_{k=1}^{M} \theta_k h_k(f), \qquad d(\theta) = M
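Under a Gaussian likelihood the quantities above give the familiar AIC = −(2/N)·loglik + 2d/N. A sketch on hypothetical data, selecting a polynomial degree by AIC:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 150
x = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, N)   # truly linear data (assumed)

def aic(deg):
    """AIC per observation for a polynomial fit of given degree (Gaussian model)."""
    coefs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coefs, x)
    sigma2 = np.mean(resid ** 2)             # MLE of the noise variance
    loglik = -N / 2 * (np.log(2 * np.pi * sigma2) + 1)
    d = deg + 1                              # number of fitted coefficients
    return -2 / N * loglik + 2 * d / N

scores = {deg: aic(deg) for deg in range(6)}
best = min(scores, key=scores.get)           # degree with the smallest AIC
print(scores, best)
```

With linear data, the degree-0 model pays a large likelihood cost, while higher degrees buy only a small likelihood gain against the 2/N-per-parameter penalty.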
Effective number of parameters
• A linear fitting method:
\hat y = S y, where S is an N \times N matrix depending on the inputs x_i but not on the y_i
• Effective number of parameters:
d(S) = \mathrm{trace}(S)
– If S is an orthogonal projection matrix onto a basis set spanned by M features, then \mathrm{trace}(S) = M.
– \mathrm{trace}(S) is therefore the correct quantity to replace d in the C_p statistic.
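As a sketch (ridge regression on hypothetical data; `effective_df` is an illustrative name), the effective number of parameters can be computed directly as the trace of the smoother matrix S = X(XᵀX + λI)⁻¹Xᵀ:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 10
X = rng.normal(size=(N, p))

def effective_df(lam):
    """trace of the ridge smoother S = X (X'X + lam*I)^{-1} X'."""
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

df0 = effective_df(0.0)       # lam = 0: orthogonal projection, trace = p
df_mid = effective_df(1.0)    # shrinks as lam grows
df_big = effective_df(1e6)    # approaches 0 for very heavy shrinkage
print(df0, df_mid, df_big)
```

Equivalently, trace(S) = Σ_j d_j²/(d_j² + λ) in terms of the singular values d_j of X, which makes the monotone decrease in λ explicit.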
Bayesian Approach & BIC
• The Bayesian Information Criterion (BIC)
\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d
• Gaussian model:
– with known variance \sigma_\varepsilon^2,
– then -2\,\mathrm{loglik} = C + \sum_{i=1}^{N}(y_i - \hat f(x_i))^2/\sigma_\varepsilon^2 = C + N\,\overline{\mathrm{err}}/\sigma_\varepsilon^2
– so

\mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\Big[\overline{\mathrm{err}} + (\log N)\,\frac{d}{N}\,\sigma_\varepsilon^2\Big]
– BIC is proportional to AIC (C_p), with the factor 2 replaced by \log N.
– For N > e^2 \approx 7.4, BIC tends to favor simpler models, penalizing complex ones more heavily than AIC.
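A sketch comparing the two criteria on hypothetical data (polynomial fits under a Gaussian likelihood). Since BIC's per-parameter penalty log N exceeds AIC's 2 for N > e², BIC can never pick a larger model than AIC on the same fits:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, N)   # assumed data-generating model

def ic(deg):
    """Return (AIC, BIC) for a Gaussian polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, deg)
    sigma2 = np.mean((y - np.polyval(coefs, x)) ** 2)
    loglik = -N / 2 * (np.log(2 * np.pi * sigma2) + 1)
    d = deg + 1
    aic = -2 * loglik + 2 * d            # AIC: penalty 2 per parameter
    bic = -2 * loglik + np.log(N) * d    # BIC: penalty log(500) ~ 6.2 per parameter
    return aic, bic

degs = range(1, 10)
best_aic = min(degs, key=lambda m: ic(m)[0])
best_bic = min(degs, key=lambda m: ic(m)[1])
print(best_aic, best_bic)
```

Because −2·loglik is nonincreasing over nested polynomial fits, a larger per-parameter penalty can only move the minimizer toward smaller models, so best_bic ≤ best_aic.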
Bayesian Model Selection
• BIC derived from Bayesian Model Selection
• Candidate models M_m with model parameters \theta_m and a prior distribution \Pr(\theta_m \mid M_m)
• Posterior probability:

\Pr(M_m \mid Z) \propto \Pr(M_m) \cdot \Pr(Z \mid M_m)

• The model minimizing BIC is equivalent to the model maximizing the (approximate) posterior probability.
• Advantage: if the candidate set contains the true model, then as the sample size tends to infinity, the probability that BIC selects the correct model tends to one.
Minimum Description Length (MDL)
• Origin: optimal coding
• Messages: z1 z2 z3 z4
• Code 1: 0 10 110 111
• Code 2: 110 10 111 0
• Principle: use the shortest codes for the most frequent messages
• Probability of sending message z_i: \Pr(z_i)
• Shannon's theorem says to use code lengths l_i = -\log_2 \Pr(z_i)
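For instance, with the probabilities 1/2, 1/4, 1/8, 1/8 (an assumed assignment that matches Code 1's lengths of 1, 2, 3, 3 bits), the Shannon lengths and the expected message length work out as:

```python
import math

# Assumed probabilities for the four messages z1..z4
probs = [1/2, 1/4, 1/8, 1/8]
lengths = [-math.log2(p) for p in probs]              # Shannon code lengths
entropy = sum(p * l for p, l in zip(probs, lengths))  # expected message length
print(lengths, entropy)   # → [1.0, 2.0, 3.0, 3.0] 1.75
```

Code 1 achieves this 1.75-bit expected length exactly; Code 2, which gives the rarest message the shortest codeword, is strictly worse on average.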
• Assume a model M with parameters \theta gives conditional output probabilities \Pr(y \mid \theta, M, X). The description length is

\mathrm{length} = -\log \Pr(y \mid \theta, M, X) - \log \Pr(\theta \mid M)

• MDL principle: choose the model M and parameters \theta that minimize this length.
• Example: f \in \{0, 1\}, the linear indicator functions f(x, \alpha) = I(\alpha_0 + \alpha_1^T x > 0)
• Complexity of f: p + 1 parameters
• The class of straight lines in the plane has VC dimension 3.
• The class sin(ax) has infinite VC dimension.
[Figure: plot of sin(ax) over [0, 1], with values ranging from -1 to 1]
• VC-dimension-based bounds (holding with probability 1 - \eta):

\mathrm{Err}_T \le \overline{\mathrm{err}} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \frac{4\,\overline{\mathrm{err}}}{\epsilon}}\Big) \quad \text{(binary classification)}

\mathrm{Err}_T \le \frac{\overline{\mathrm{err}}}{(1 - c\sqrt{\epsilon})_+} \quad \text{(regression)}

where \epsilon = a_1 \dfrac{h[\log(a_2 N/h) + 1] - \log(\eta/4)}{N}.
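Plugging numbers into the classification bound shows how it loosens with the VC dimension h and tightens with N. The constants a1 = 4, a2 = 2 below are the worst-case choices sometimes cited; everything else is a hypothetical example:

```python
import math

def vc_bound(err_bar, N, h, eta=0.05, a1=4, a2=2):
    """Upper bound on Err for binary classification, holding w.p. >= 1 - eta."""
    eps = a1 * (h * (math.log(a2 * N / h) + 1) - math.log(eta / 4)) / N
    return err_bar + eps / 2 * (1 + math.sqrt(1 + 4 * err_bar / eps))

b_small = vc_bound(0.1, N=1000, h=5)    # low-complexity class
b_large = vc_bound(0.1, N=1000, h=50)   # high-complexity class: looser bound
print(b_small, b_large)
```

With a training error of 0.1, the bound roughly triples when h goes from 5 to 50 at fixed N, illustrating the complexity penalty.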
Cross-Validation
[Figure: the data split into K = 5 parts; each part in turn is held out for validation while the remaining parts train the model]
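A compact sketch of K-fold cross-validation with K = 5 (the ridge-regression fit and all data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, K, lam = 100, 8, 5, 1.0
X = rng.normal(size=(N, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, N)

idx = rng.permutation(N)                  # shuffle before splitting
folds = np.array_split(idx, K)            # partition indices into K folds
cv_errors = []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    # ridge fit on the K-1 training folds
    A = X[train].T @ X[train] + lam * np.eye(p)
    beta = np.linalg.solve(A, X[train].T @ y[train])
    # squared-error loss on the held-out fold
    cv_errors.append(np.mean((y[test] - X[test] @ beta) ** 2))

cv = np.mean(cv_errors)                    # CV estimate of prediction error
print(cv_errors, cv)
```

Each observation is predicted exactly once by a model that never saw it, so the average is an (approximately unbiased, slightly pessimistic) estimate of prediction error.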
• Bootstrap estimate of prediction error:

\widehat{\mathrm{Err}}_{boot} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\big(y_i, \hat f^{*b}(x_i)\big)
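The estimate above, as a sketch with squared-error loss and a hypothetical straight-line fit:

```python
import numpy as np

rng = np.random.default_rng(6)
N, B = 80, 50
x = rng.uniform(0, 1, N)
y = 2 * x + rng.normal(0, 0.4, N)          # assumed data-generating model

losses = []
for b in range(B):
    boot = rng.integers(0, N, N)           # bootstrap sample Z*b (with replacement)
    # least-squares line f_hat^{*b} fit on the bootstrap sample
    coefs = np.polyfit(x[boot], y[boot], 1)
    # evaluate on ALL original points, as the double sum prescribes
    losses.append(np.mean((y - np.polyval(coefs, x)) ** 2))

err_boot = np.mean(losses)                 # (1/B)(1/N) double sum of losses
print(err_boot)
```

Note that each bootstrap training set shares roughly 63% of its points with the evaluation set, which is why this estimator tends to be optimistic and motivates the leave-one-out variants.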
[Figure: bootstrap sampling — from the training sample Z = (z_1, z_2, \ldots, z_N), draw bootstrap datasets Z^{*1}, Z^{*2}, \ldots, Z^{*B} and compute the statistic S(Z^{*b}) on each]