Validation
A tutorial
Preprocessing
Lutgarde Buydens
Multivariate Regression
[Figure: Raw data. Spectra plotted against wavenumber (cm-1), 2000 to 14000.]
X (p columns): spectral variables, analytical measurements
Y (k columns): class information, concentration, ...
$y = b_0 + b_1 x_1 + e$
$b_0$: intercept
$b_1$: slope
Multiple Linear Regression
$Y = \hat{Y} + E$
MLR maximizes $r(y, \hat{y})$
12/9/2013
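$y = Xb + e$, with $X = [\,1 \;\; x_1 \;\; x_2 \;\cdots\; x_p\,]$ ($n$ rows, $p+1$ columns) and $b = (b_0, b_1, \ldots, b_p)^T$
$b = (X^T X)^{-1} X^T y$
Disadvantages: $(X^T X)^{-1}$
  requires $n \ge p + 1$
  becomes unstable when variables are collinear, $r(x_1, x_2) \to 1$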
Example with $r(x_1, x_2) \to 1$:

      Set A            Set B
   x1      x2       x1      x2        y
 -1.01   -0.99    -1.01   -0.99    -1.89
  3.23    3.25     3.23    3.25    10.33
  5.49    5.55     5.49    5.55    19.09
  0.23    0.21     0.23    0.23     2.19
 -2.87   -2.91    -2.87   -2.91    -8.09
  3.67    3.76     3.67    3.76    11.29

$y = b_1 x_1 + b_2 x_2 + e$

MLR      b1      b2      R2
Set A   10.3   -6.92    0.98
Set B    2.96    0.28    0.98
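The instability can be checked numerically. The following is a minimal sketch, assuming NumPy and a least-squares fit of the no-intercept model $y = b_1 x_1 + b_2 x_2 + e$; the function and variable names are illustrative and not from the slides.

```python
import numpy as np

# Data from the Set A / Set B example
y = np.array([-1.89, 10.33, 19.09, 2.19, -8.09, 11.29])
x1 = np.array([-1.01, 3.23, 5.49, 0.23, -2.87, 3.67])
x2_a = np.array([-0.99, 3.25, 5.55, 0.21, -2.91, 3.76])
x2_b = x2_a.copy()
x2_b[3] = 0.23  # the only value that differs between Set A and Set B


def mlr(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 (no intercept)."""
    X = np.column_stack([x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b


b_a = mlr(x1, x2_a, y)
b_b = mlr(x1, x2_b, y)
r_a = np.corrcoef(x1, x2_a)[0, 1]  # correlation between the two variables

print("Set A: b1, b2 =", np.round(b_a, 2))
print("Set B: b1, b2 =", np.round(b_b, 2))
print("r(x1, x2) in Set A =", round(r_a, 4))
```

Changing a single x2 value by 0.02 moves both coefficients substantially, which is the practical meaning of an ill-conditioned $(X^T X)^{-1}$.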
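Remedy for the $(X^T X)^{-1}$ requirement $n \ge p + 1$: dimension reduction
  Variable Selection
  Latent variables (PCR, PLS)
[Diagram: PCR. Step 1 (PCA): X (n rows, p columns) is reduced to scores T (n rows, a columns). Step 2 (MLR): y is regressed on T, giving a1, a2, ..., aa. Step 3: the coefficients are converted to b0, b1, ..., bp for the original variables x1, ..., xp.]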
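Step 0: Mean-center X
Step 1: Perform PCA: $X = T P^T$; keep $a$ components, $X^* = (T P^T)^*$
Step 2: Perform MLR: $Y = T A$, $A = (T^T T)^{-1} T^T Y$
Step 3: Calculate B
$Y = X^* B = (T P^T) B$, so $A = P^T B$
With all components kept, $P P^T = I$ and $B = (P P^T)^{-1} P A = P A$
With dimension reduction ($a < p$), $P P^T$ is singular; use $B = P A$
Calculate the $b_0$'s: $b_0 = \bar{y} - \bar{\hat{y}}$, with $\bar{\hat{y}} = \bar{x}^T b$
[Figure: PC1 direction in the (x1, x2) plane.]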
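Steps 0 to 3 can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a single response y; `pcr_fit` is a hypothetical helper name, and PCA is done here by an SVD of the centered X, which is one common way to obtain P.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal component regression; returns intercept b0 and coefficients b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean             # Step 0: mean-center
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                     # Step 1: loadings (p x a)
    T = Xc @ P                                  #         scores   (n x a)
    A = np.linalg.solve(T.T @ T, T.T @ yc)      # Step 2: A = (T'T)^-1 T'y
    b = P @ A                                   # Step 3: B = P A
    b0 = y_mean - x_mean @ b                    # intercept from the means
    return b0, b


# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=30)

# With all components kept, PCR coincides with ordinary MLR
b0_full, b_full = pcr_fit(X, y, n_components=3)
ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(30), X]), y, rcond=None)
print(np.round(np.r_[b0_full, b_full], 4))
print(np.round(ols, 4))
```

Keeping fewer components than variables is what removes the unstable directions; the full-component case is shown only because it gives an easy check against MLR.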
Calculate the cross-validation RMSE for different numbers of PCs:
$\mathrm{RMSECV} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
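A possible way to compute RMSECV as a function of the number of PCs is sketched below, assuming NumPy and leave-one-out cross-validation. `pcr_fit` and `rmsecv_loo` are hypothetical helpers, and the data are synthetic with two underlying factors.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Compact PCR: mean-center, PCA by SVD, regress y on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    P = np.linalg.svd(Xc, full_matrices=False)[2][:n_components].T
    T = Xc @ P
    b = P @ np.linalg.solve(T.T @ T, T.T @ (y - y_mean))
    return y_mean - x_mean @ b, b


def rmsecv_loo(X, y, n_components):
    """Leave each sample out once, predict it, and pool the squared errors."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b0, b = pcr_fit(X[keep], y[keep], n_components)
        press += (y[i] - (b0 + X[i] @ b)) ** 2
    return np.sqrt(press / n)


# Synthetic data (assumption): six variables driven by two latent factors
rng = np.random.default_rng(1)
n = 40
l1 = 3.0 * rng.normal(size=n)
l2 = rng.normal(size=n)
X = np.column_stack([l1, l1, l1, l2, l2, l2]) + 0.05 * rng.normal(size=(n, 6))
y = l1 + 2.0 * l2 + 0.1 * rng.normal(size=n)

rmsecv = [rmsecv_loo(X, y, a) for a in range(1, 5)]
for a, value in enumerate(rmsecv, start=1):
    print(f"{a} PC(s): RMSECV = {value:.3f}")
```

The number of PCs is then chosen where RMSECV is lowest or stops decreasing meaningfully.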
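PLS
[Diagram: PLS. X (n rows, p columns) is reduced to latent-variable scores (n rows, a columns). Phase 2 (MLR): Y (n rows, k columns) is regressed on the scores, giving a1, a2, ..., aa. Phase 3: the coefficients are converted to b0, b1, ..., bp for the original variables x1, ..., xp.]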
Sequential Algorithm: Latent variables and their scores are calculated sequentially
Step 0: Mean-center X
Step 1: Calculate w
Calculate LV1 = $w_1$, the direction that maximizes the covariance between X and Y, by an SVD of $X^T Y$:
$(X^T Y)_{p \times k} = W_{p \times a} D_{a \times a} Z^T_{a \times k}$
$w_1$ = first column of W
Step 2: $t_{n \times 1} = X_{n \times p}\, w_{p \times 1}$
Use PC: maximizes the variance in X
Use LV: maximizes the covariance between X and y, $\mathrm{cov}(Xw, y) = \sqrt{\mathrm{var}(Xw)}\,\sqrt{\mathrm{var}(y)}\;\mathrm{cor}(Xw, y)$
[Figure: PC1 and LV1 (w) directions in the (x1, x2) plane.]
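Steps 0 to 2 for the first latent variable can be sketched as follows, assuming NumPy. `first_latent_variable` is a hypothetical name, the data are synthetic, and the later latent variables of the sequential algorithm are not covered here.

```python
import numpy as np


def first_latent_variable(X, Y):
    """Return the first PLS weight vector w1 and its scores t1.

    X is n x p, Y is n x k (use y[:, None] for a single response).
    """
    Xc = X - X.mean(axis=0)                                   # Step 0
    W, d, Zt = np.linalg.svd(Xc.T @ Y, full_matrices=False)   # Step 1: SVD of X'Y
    w1 = W[:, 0]                                              # first column of W
    t1 = Xc @ w1                                              # Step 2: t = X w
    return w1, t1


# Synthetic data (assumption): variable 0 has the largest variance,
# but y depends on variable 1
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4)) * np.array([5.0, 1.0, 1.0, 1.0])
y = X[:, 1] + 0.1 * rng.normal(size=50)

w1, t1 = first_latent_variable(X, y[:, None])

# Compare with PC1, which only looks at the variance in X
Xc = X - X.mean(axis=0)
pc1 = np.linalg.svd(Xc, full_matrices=False)[2][0]
cov_lv = abs(np.cov(t1, y)[0, 1])
cov_pc = abs(np.cov(Xc @ pc1, y)[0, 1])
print("w1  =", np.round(w1, 3))
print("pc1 =", np.round(pc1, 3))
print("|cov(t, y)|: LV1 =", round(cov_lv, 3), " PC1 =", round(cov_pc, 3))
```

Because the columns of the centered X sum to zero, centering Y as well would not change $X^T Y$, so only X is centered in this sketch.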
Results for Set A and Set B, $y = b_1 x_1 + b_2 x_2 + e$:

         Set A            Set B
        b1      b2       b1      b2
MLR    10.3    -6.92     2.96    0.28
PCR     1.60    1.62     1.60    1.62
PLS     1.60    1.62     1.60    1.62
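The stability of the latent-variable models can be checked with a one-component fit. The sketch below assumes NumPy and, as an assumption of this example, works on the uncentered data because the model $y = b_1 x_1 + b_2 x_2 + e$ has no intercept; the helper names are illustrative.

```python
import numpy as np

y = np.array([-1.89, 10.33, 19.09, 2.19, -8.09, 11.29])
x1 = np.array([-1.01, 3.23, 5.49, 0.23, -2.87, 3.67])
x2_a = np.array([-0.99, 3.25, 5.55, 0.21, -2.91, 3.76])
x2_b = x2_a.copy()
x2_b[3] = 0.23  # Set B differs from Set A in this value only


def one_component_pcr(X, y):
    """Regress y on the scores of the first principal component."""
    p = np.linalg.svd(X, full_matrices=False)[2][0]  # first loading vector
    t = X @ p
    return p * (t @ y) / (t @ t)


def one_component_pls(X, y):
    """Regress y on the scores of the first latent variable, w ~ X'y."""
    w = X.T @ y
    w = w / np.linalg.norm(w)
    t = X @ w
    return w * (t @ y) / (t @ t)


results = {}
for name, x2 in (("Set A", x2_a), ("Set B", x2_b)):
    X = np.column_stack([x1, x2])
    results[name] = (one_component_pcr(X, y), one_component_pls(X, y))
    print(name, "PCR:", np.round(results[name][0], 2),
          "PLS:", np.round(results[name][1], 2))
```

With one component, both methods give nearly the same coefficients for the two sets, in contrast to MLR.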
VALIDATION
A Biased Approach
Error is biased!
The samples are also used to build the model
Validation
Split the full data set into a training set and a test set
Build model on the training set: b0 ... bp
Apply the entire model procedure to the test set: RMSEP
Several ways:
  One large test set
  Leave one out and repeat: LOO
  Leave n objects out and repeat: LNO
  ...
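The bias of the training-set error can be illustrated with a small simulation. This is a sketch under stated assumptions: NumPy, synthetic data with 15 variables of which only 3 matter, and plain MLR as the model; none of the names or numbers come from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test, p = 20, 200, 15
b_true = np.zeros(p)
b_true[:3] = [1.0, -2.0, 0.5]  # only three variables carry signal


def simulate(n):
    """Draw n samples from the assumed linear model with unit noise."""
    X = rng.normal(size=(n, p))
    return X, X @ b_true + rng.normal(size=n)


def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])


def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))


X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# Build the model on the training set only
b, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

rmse_train = rmse(y_train, add_intercept(X_train) @ b)  # biased: same samples
rmsep = rmse(y_test, add_intercept(X_test) @ b)         # independent test set
print(f"RMSE on training samples: {rmse_train:.2f}")
print(f"RMSEP on test set:        {rmsep:.2f}")
```

The error computed on the samples used for fitting is far too optimistic here, because the model has almost as many coefficients as training samples.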
Cross-validation
Cross-validation: an example
The data
Split data into training set and validation set
Split data again into training set and validation set
Until all samples have been in the validation set once
Common: Leave-One-Out (LOO)
Cross-validation: a warning
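The data
[Diagram: X and y data matrices; dimension labels on the slide: 65, 1102, 13, 1.]
Composition: NaOH (wt%), NaOCl (wt%), Na2CO3 (wt%)
Each mixture is measured at Temperature (°C): 15, 21, 27, 34, 40
Composition values per mixture (rows with fewer than three values have blank cells, so their column assignment is uncertain):
18.99
9.15   9.99   0.15
15.01  4.01
9.34   5.96   3.97
16.02  2.01   1.00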
Validation
Full data set: training set and test set
Through validation:
1) Divide training set: cross-validation
2) Build model: b0 ... bp
Test set: RMSEP
Double cross-validation
The data
[Diagram: an outer loop (CV2) splits the data into a training set and a test set. An inner loop (CV1) cross-validates within the training set for 1 LV, 2 LVs and 3 LVs. The number of LVs with the lowest RMSECV is used to build the model (b0 ... bp), which is applied to the test set to give RMSEP.]
Repeat procedure
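The two nested loops can be sketched as below. This is a minimal illustration, assuming NumPy; PCR is swapped in for PLS to keep the code short (the number of PCs plays the role of the number of LVs), the inner loop is leave-one-out, the outer loop uses 5 folds, and all helper names and the synthetic data are assumptions of this example.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Compact PCR: mean-center, PCA by SVD, regress y on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    P = np.linalg.svd(Xc, full_matrices=False)[2][:n_components].T
    T = Xc @ P
    b = P @ np.linalg.solve(T.T @ T, T.T @ (y - y_mean))
    return y_mean - x_mean @ b, b


def rmsecv_loo(X, y, n_components):
    """Inner loop (CV1): leave-one-out RMSECV within the training set."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b0, b = pcr_fit(X[keep], y[keep], n_components)
        press += (y[i] - (b0 + X[i] @ b)) ** 2
    return np.sqrt(press / n)


def double_cv(X, y, max_components, n_outer=5, seed=0):
    """Outer loop (CV2): every sample is in the test set exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_outer)
    sq_err, chosen = [], []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        scores = [rmsecv_loo(X[train], y[train], a)
                  for a in range(1, max_components + 1)]
        best = int(np.argmin(scores)) + 1            # lowest RMSECV
        b0, b = pcr_fit(X[train], y[train], best)    # build model on training set
        sq_err.extend((y[test] - (b0 + X[test] @ b)) ** 2)
        chosen.append(best)
    return np.sqrt(np.mean(sq_err)), chosen


# Synthetic data (assumption): six variables driven by two latent factors
rng = np.random.default_rng(5)
n = 40
l1 = 3.0 * rng.normal(size=n)
l2 = rng.normal(size=n)
X = np.column_stack([l1, l1, l1, l2, l2, l2]) + 0.05 * rng.normal(size=(n, 6))
y = l1 + 2.0 * l2 + 0.1 * rng.normal(size=n)

rmsep, chosen = double_cv(X, y, max_components=4)
print("components chosen per outer fold:", chosen)
print(f"RMSEP = {rmsep:.3f}")
```

The test samples of each outer fold never take part in choosing the number of components, so the pooled RMSEP is not biased by that choice.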
PLS: an example
In this way:
[Figure: Raw data and mean-centered data; absorbance (a.u.) vs. wavenumber (cm-1).]
[Figure: RMSECV vs. number of LVs.]
[Figure: Regression coefficients vs. wavenumber (cm-1), shown with the raw data (absorbance, a.u.).]
Why Pre-Processing ?
Data Artefacts
Baseline correction
Alignment
Scatter correction
Noise removal
Scaling, Normalisation
Transformation
...
Other
[Figure: original spectrum and distorted versions (offset, offset + slope); intensity vs. wavelength (a.u.).]
[Figure: NaOH, predicted vs. NaOH, true.]
Pre-Processing Methods
STEP 1: (7x) BASELINE
  No baseline correction
  (3x) Detrending, polynomial order (2-3-4)
  (2x) Derivatisation (1st, 2nd)
  AsLS
STEP 2: (10x) SCATTER
  No scatter correction
  SNV
  MSC
  ...
STEP 3: (10x) NOISE
  No noise removal
  ...
STEP 4: (7x) SCALING & TRANSFORMATIONS
  Meancentering
  Autoscaling
  Range scaling
  Pareto scaling
  Poisson scaling
  Level scaling
  Log scaling
Also named on the slide: OSC, DOSC
Data artefacts named on the slides: offset, slope, scatter, missing values, outliers
[Figure: spectra (intensity (a.u.) vs. wavelength) showing the original and the artefacts: offset, offset + slope, multiplicative, offset + slope + multiplicative.]
Pre-Processing Results
Complexity of the model: number of LVs
Classification accuracy (%)
[Figure: classification accuracy (%) and number of LVs for the pre-processing combinations; labelled entries include raw data, autoscaling, range scaling, log transformation.]
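A few of the methods listed under STEP 1, STEP 2 and STEP 4 have short standard definitions. The sketch below assumes NumPy and spectra stored as rows of a matrix; the function names are illustrative, and the remaining methods (AsLS, MSC, OSC and others) are not covered.

```python
import numpy as np


def detrend(X, order=2):
    """Baseline: subtract a polynomial of the given order fitted to each spectrum."""
    idx = np.arange(X.shape[1])
    return np.array([row - np.polyval(np.polyfit(idx, row, order), idx)
                     for row in X])


def snv(X):
    """Scatter: standard normal variate, center and scale each spectrum (row)."""
    return ((X - X.mean(axis=1, keepdims=True))
            / X.std(axis=1, ddof=1, keepdims=True))


def mean_center(X):
    """Scaling: subtract the column (variable) means."""
    return X - X.mean(axis=0)


def autoscale(X):
    """Scaling: mean-center and divide each variable by its standard deviation."""
    return mean_center(X) / X.std(axis=0, ddof=1)


def pareto_scale(X):
    """Scaling: mean-center and divide by the square root of the standard deviation."""
    return mean_center(X) / np.sqrt(X.std(axis=0, ddof=1))


# Synthetic spectra (assumption): one peak with offset and multiplicative effects
wl = np.linspace(200, 1600, 71)
peak = np.exp(-0.5 * ((wl - 900) / 60) ** 2)
gains = np.array([0.8, 0.9, 1.0, 1.1, 1.2])
offsets = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
spectra = gains[:, None] * peak + offsets[:, None]

corrected = snv(spectra)  # offset and multiplicative effects are removed
print("max difference between spectra before SNV:", np.ptp(spectra, axis=0).max())
print("max difference between spectra after SNV: ", np.ptp(corrected, axis=0).max())
```

SNV works per spectrum and needs no reference, whereas the scaling methods work per variable and depend on the whole data set, so in a validation setting they belong inside the model procedure applied to each training set.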
SOFTWARE
PLS Toolbox (Eigenvector Inc.)
www.eigenvector.com
For use in MATLAB (or standalone!)
XLSTAT-PLS (XLSTAT)
www.xlstat.com
For use in Microsoft Excel