
Partial Least Squares: a tutorial
Lutgarde Buydens

Overview:
- Multivariate regression
  - Multiple Linear Regression (MLR)
  - Principal Component Regression (PCR)
  - Partial Least Squares (PLS)
- Validation
- Preprocessing

Multivariate Regression

[Figure: raw data matrix X of NIR spectra; absorbance vs. wavenumber (cm-1), 2000-14000 cm-1]

- Rows: cases, observations
  - analytical observations of different samples
  - experimental runs
  - persons
  - ...
- Columns: variables, classes, tags
  - X: independent variables (will always be available), e.g. p spectral variables, analytical measurements
  - Y: dependent variables (to be predicted later from X), e.g. class information (k classes), concentration, ...

Y = f(X): predict Y from X
- MLR: Multiple Linear Regression
- PCR: Principal Component Regression
- PLS: Partial Least Squares

MLR: Multiple Linear Regression

From univariate to Multiple Linear Regression (MLR):
- Univariate: $y = b_0 + b_1 x_1 + e$, with intercept $b_0$ and slope $b_1$; least-squares regression finds the line that maximizes $r(y, \hat{y})$.
- Multiple: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + e$, a (hyper)plane in the space of $x_1, \dots, x_p$; in matrix form $Y = \hat{Y} + E$.

In matrix notation (with a column of ones in X carrying the intercept, so b has $p+1$ elements):

$Y_{n \times k} = X_{n \times p} B_{p \times k} + E_{n \times k}$, and for a single response $b = (X^T X)^{-1} X^T y$

Disadvantages: the inverse $(X^T X)^{-1}$ exists and is stable only if
- $n \geq p + 1$ (at least as many samples as coefficients), and
- the X-variables are uncorrelated; with $r(x_1, x_2) \approx 1$ the solution becomes unstable.
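As an illustration (not from the slides), a minimal numpy sketch of the least-squares solution $b = (X^T X)^{-1} X^T y$ on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 50 samples, p = 3 uncorrelated x-variables
n, p = 50, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, -0.8, 0.3]) + 0.1 * rng.normal(size=n)

# Append a column of ones so that b[0] is the intercept b0
X1 = np.column_stack([np.ones(n), X])

# Least-squares solution b = (X^T X)^{-1} X^T y
# (np.linalg.solve is used instead of an explicit inverse for stability)
b = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(b)  # close to [2.0, 1.5, -0.8, 0.3]
```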

MLR: Multiple Linear Regression

Disadvantages: $(X^T X)^{-1}$ requires uncorrelated X-variables. With $r(x_1, x_2) \approx 1$, MLR fits a plane through a line!

Example: two data sets for the model $y_{n \times 1} = X_{n \times p} b_{p \times 1} + e_{n \times 1}$, here $y = b_1 x_1 + b_2 x_2 + e$. Set B differs from Set A in a single value of $x_2$ (0.21 vs. 0.23):

        Set A            Set B
x1      x2       x1      x2       y
-1.01   -0.99    -1.01   -0.99    -1.89
3.23    3.25     3.23    3.25     10.33
5.49    5.55     5.49    5.55     19.09
0.23    0.21     0.23    0.23     2.19
-2.87   -2.91    -2.87   -2.91    -8.09
3.67    3.76     3.67    3.76     11.29

MLR results:

       Set A            Set B
b1     10.3             2.96
b2     -6.92            0.28
R2     0.98             0.98

Both fits reach $R^2 = 0.98$, yet a change of 0.02 in one x-value completely changes the regression coefficients: with nearly collinear x-variables the MLR solution is unstable.
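The instability is easy to reproduce; a sketch with the Set A / Set B data from the table:

```python
import numpy as np

# The two data sets from the slides: Set B differs from Set A only in the
# fourth value of x2 (0.21 -> 0.23), yet the coefficients change completely.
x1   = np.array([-1.01, 3.23, 5.49, 0.23, -2.87, 3.67])
x2_A = np.array([-0.99, 3.25, 5.55, 0.21, -2.91, 3.76])
x2_B = np.array([-0.99, 3.25, 5.55, 0.23, -2.91, 3.76])
y    = np.array([-1.89, 10.33, 19.09, 2.19, -8.09, 11.29])

for label, x2 in [("Set A", x2_A), ("Set B", x2_B)]:
    X = np.column_stack([x1, x2])              # model y = b1*x1 + b2*x2 (no intercept)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(label, np.round(b, 2))               # Set A: ~[10.3, -6.92]; Set B: ~[2.96, 0.28]
```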

PCR: Principal Component Regression

Two routes to dimension reduction: variable selection, or latent variables (PCR, PLS). PCR uses the scores (projections) on latent variables that explain maximal variance in X.

- Step 0: mean-center the data.
- Step 1: perform PCA on the original X: $X = T P^T$; keeping $a$ components gives the reconstruction $X^* = (T P^T)^*$.
- Step 2: use the orthogonal PC scores $T$ ($n \times a$) as independent variables in an MLR model: $Y = T A$, with $A = (T^T T)^{-1} T^T Y$.
- Step 3: calculate the b-coefficients from the a-coefficients. MLR on the reconstructed $X^*$ means $Y = X^* B = (T P^T) B$, so $A = P^T B$, and because the loadings $P$ are orthonormal, $B = P A$. Finally compute the intercepts $b_0$ from the means of y and X.
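A minimal numpy sketch of Steps 0-3 (an illustration of the algorithm above, using SVD for the PCA step):

```python
import numpy as np

def pcr(X, y, n_components):
    """Principal Component Regression: PCA on X, then MLR on the scores."""
    # Step 0: mean-center
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Step 1: PCA via SVD, X = U S Vt; scores T = U S, loadings P = Vt.T
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = (U * S)[:, :n_components]          # n x a score matrix
    P = Vt[:n_components].T                # p x a loading matrix

    # Step 2: MLR on the orthogonal scores: a = (T^T T)^{-1} T^T y
    a = np.linalg.solve(T.T @ T, T.T @ yc)

    # Step 3: back to the original variables: b = P a, b0 from the means
    b = P @ a
    b0 = y_mean - x_mean @ b
    return b0, b
```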

PCR: Principal Component Regression

Optimal number of PCs: calculate the cross-validation RMSE for different numbers of PCs and pick the minimum:

$RMSECV = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$

PLS: Partial Least Squares Regression

Overview (three phases):
- Phase 1: PLS computes the score matrix $T$ ($n \times a$) from $X$ ($n \times p$) and $Y$ ($n \times k$) together.
- Phase 2: MLR of Y on the scores gives the coefficients $a_1, \dots, a_a$.
- Phase 3: back-calculation gives $b_0, b_1, \dots, b_p$ in terms of the original variables.

PLS: Partial Least Squares Regression (Projection to Latent Structure)

Phase 1: calculate new independent variables (T)
- Sequential algorithm: latent variables and their scores are calculated one at a time.
- Step 0: mean-center X.
- Step 1: calculate the weight vector $w_1$ (= LV1) that maximizes the covariance between X and Y, via an SVD of $X^T Y$:
  $(X^T Y)_{p \times k} = W_{p \times a} D_{a \times a} Z^T_{a \times k}$, with $w_1$ the 1st column of $W$.

PCR vs. PLS:
- PCR uses PCs: PC1 maximizes the variance in X.
- PLS uses LVs: LV1 ($w_1$) maximizes the covariance between X and y. Since $\mathrm{cov}(t, y) = \sqrt{\mathrm{var}(t)}\,\sqrt{\mathrm{var}(y)}\,\mathrm{cor}(t, y)$ for the scores $t = Xw$, this is a compromise between explaining variance in X and correlating with y.

Phase 1, continued:
- Step 2: calculate $t_1$, the scores (projections) of X on $w_1$: $t_{n \times 1} = X_{n \times p} w_{p \times 1}$.
- Repeat sequentially (after removing from X the part explained by $t_1$) to obtain $w_2, t_2$, and so on, until $a$ latent variables have been extracted.
- Phases 2 and 3 then proceed as in PCR: MLR of y on the scores $t_1, \dots, t_a$ gives $a_1, \dots, a_a$, and back-calculation gives $b_0, b_1, \dots, b_p$.
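A compact numpy sketch of a common sequential PLS1 variant (deflation of X after each component; for a single y, the SVD of $X^T y$ reduces to normalizing $X^T y$). An illustration under those assumptions, not the slides' own code:

```python
import numpy as np

def pls(X, y, n_components):
    """PLS1 regression: sequential extraction of latent variables."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # Step 0: mean-center
    W, T, P = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc                          # Step 1: direction maximizing cov(X, y)
        w /= np.linalg.norm(w)
        t = Xc @ w                             # Step 2: scores = projection of X on w
        p = Xc.T @ t / (t @ t)                 # loading used for deflation
        Xc = Xc - np.outer(t, p)               # deflate X, continue with the residual
        W.append(w); T.append(t); P.append(p)
    T = np.column_stack(T)
    a = np.linalg.solve(T.T @ T, T.T @ yc)     # Phase 2: MLR on the scores
    W, P = np.column_stack(W), np.column_stack(P)
    b = W @ np.linalg.solve(P.T @ W, a)        # Phase 3: b in the original variables
    b0 = y_mean - x_mean @ b
    return b0, b
```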

PLS: Partial Least Squares Regression

Optimal number of LVs: calculate the cross-validation RMSE for different numbers of LVs and pick the minimum:

$RMSECV = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$

MLR, PCR, PLS compared on the Set A / Set B example above (model $y = b_1 x_1 + b_2 x_2 + e$; the two sets differ in a single x-value):

       Set A            Set B
       b1      b2       b1      b2
MLR    10.3    -6.92    2.96    0.28
PCR    1.60    1.62     1.60    1.62
PLS    1.60    1.62     1.60    1.62

Unlike MLR, PCR and PLS give essentially the same coefficients for both sets: the latent-variable methods are stable under near-collinearity.

VALIDATION

Estimating prediction error: the RMSE on new data is the common measure for prediction error.

Basic principle: test how well your model works with new data it has not seen yet!

A Biased Approach

The prediction error computed on the samples the model was built on is biased:
- these samples were also used to build the model;
- the model is therefore biased towards accurate prediction of these specific samples.

Validation: Basic Principle

Test how well your model works with new data it has not seen yet: split the data into a training set and a test set. Several ways:
- one large test set
- leave one out and repeat (LOO)
- leave n objects out and repeat (LnO)
- ...
Apply the entire modelling procedure to the test set as well.

Training and test sets
- Build the model ($b_0, \dots, b_p$) on the training set; compute the RMSEP on the test set.
- The test set should be representative of the training set; a random choice is often the best, but check for extremely unlucky divisions.
- Apply the whole procedure to the test and validation sets.
- Remark: for the final model, use the whole data set.

Cross-validation
- Simplest case: Leave-One-Out (LOO; segment = 1 sample); normally 10-20% is left out at a time (LnO).
- Remark: for the final model, use the whole data set.
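As a sketch of the train/test procedure (hypothetical data and split; `pls` is the sketch function defined earlier, not a library routine):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 60 samples, 200 spectral variables
X = rng.normal(size=(60, 200))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=60)

# Random split: ~75% training, ~25% test (check it is not an unlucky division)
idx = rng.permutation(len(y))
train, test = idx[:45], idx[45:]

b0, b = pls(X[train], y[train], n_components=3)    # build model on training set only
y_pred = b0 + X[test] @ b

rmsep = np.sqrt(np.mean((y[test] - y_pred) ** 2))  # prediction error on unseen data
print(f"RMSEP = {rmsep:.3f}")
```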

Cross-validation: an example
- Split the data into a training set and a validation set.
- Build a model on the training set and predict the left-out samples.
- Split the data again into a (new) training set and validation set; repeat until all samples have been in the validation set once.
- Common: Leave-One-Out (LOO).
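A minimal LOO cross-validation loop around the earlier `pls` sketch (again an illustration, not the slides' own code):

```python
import numpy as np

def rmsecv_loo(X, y, n_components):
    """Leave-One-Out cross-validation: each sample is predicted once
    by a model built on all the other samples."""
    residuals = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i             # leave sample i out
        b0, b = pls(X[mask], y[mask], n_components)
        residuals.append(y[i] - (b0 + X[i] @ b))  # predict the left-out sample
    return np.sqrt(np.mean(np.square(residuals)))
```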

Cross-validation: a warning

Data: 13 x 5 = 65 NIR spectra (1102 wavelengths)
- 13 samples: different compositions of NaOH, NaOCl and Na2CO3 (wt%)
- 5 temperatures: each sample measured at 15, 21, 27, 34 and 40 °C

[Table: the 13 compositions in wt% NaOH / NaOCl / Na2CO3, e.g. 9.15 / 9.99 / 0.15 and 16.02 / 2.01 / 1.00, each measured at all five temperatures]

X: 65 x 1102 (spectra); y: 65 x 1 (concentration).

Leave SAMPLE out: all 5 spectra of a sample (one per temperature) must be left out together; otherwise near-replicates of each validation spectrum remain in the training set and the cross-validation error is over-optimistic.
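A sketch of this leave-sample-out scheme, where a `groups` vector (an assumption of this illustration) marks which spectra are replicates of the same sample:

```python
import numpy as np

def rmsecv_leave_sample_out(X, y, groups, n_components):
    """Cross-validation with whole samples (all replicate spectra,
    e.g. the 5 temperatures) left out together."""
    residuals = []
    for g in np.unique(groups):
        test = groups == g                        # all spectra of this sample
        b0, b = pls(X[~test], y[~test], n_components)
        residuals.extend(y[test] - (b0 + X[test] @ b))
    return np.sqrt(np.mean(np.square(residuals)))

# For the 13 x 5 design above: groups = np.repeat(np.arange(13), 5)
```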

Validation: selection of the number of LVs

Through validation: choose the number of LVs that gives the model with the lowest prediction error. The test set used to assess the final model cannot be used for this!

1) Determine the number of LVs on the training set: divide the training set further, or use cross-validation on it.
2) Build the model ($b_0, \dots, b_p$) and assess it on the held-out test set: RMSEP.

Remark: for the final model, use the whole data set.
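A sketch of step 1, reusing the `rmsecv_loo` function from the cross-validation sketch above:

```python
import numpy as np

# Choose the number of LVs by cross-validation on the training set only;
# the test set stays untouched until the final assessment.
def select_n_lv(X_train, y_train, max_lv=10):
    rmsecv = [rmsecv_loo(X_train, y_train, a) for a in range(1, max_lv + 1)]
    return int(np.argmin(rmsecv)) + 1             # number of LVs with lowest RMSECV
```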

Double cross-validation

1) Determine the number of LVs in an inner cross-validation loop (CV 2).
2) Build and assess the model in an outer cross-validation loop (CV 1): RMSEP.

Procedure:
- Split the data into a training set and a validation set. The validation set is used later to assess model performance!
- Apply cross-validation on the rest: split the training set into a (new) training set and a test set, fit models with 1 LV, 2 LVs, 3 LVs, ..., and keep the number of LVs with the lowest RMSECV.
- Repeat the procedure until all samples have been in the validation set once.

In this way:
- the number of LVs is determined using samples not used to build the model;
- the prediction error is also determined using samples the model has not seen before.

Remark: for the final model, use the whole data set.
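A sketch of the full double cross-validation with LOO in both loops, reusing the `pls` and `select_n_lv` sketches from above (computationally naive, but it shows the nesting):

```python
import numpy as np

def double_cv(X, y, max_lv=10):
    """Double cross-validation: the inner loop picks the number of LVs,
    the outer loop estimates the prediction error on unseen samples."""
    residuals = []
    for i in range(len(y)):                        # outer loop: LOO over samples
        mask = np.arange(len(y)) != i
        a = select_n_lv(X[mask], y[mask], max_lv)  # inner loop: CV on the rest only
        b0, b = pls(X[mask], y[mask], a)
        residuals.append(y[i] - (b0 + X[i] @ b))
    return np.sqrt(np.mean(np.square(residuals)))  # RMSEP-like error estimate
```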

PLS: an example

Raw + mean-centered data
[Figure: raw and mean-centered NIR spectra; absorbance (a.u.) vs. wavenumber (cm-1), 2000-14000 cm-1]

RMSECV vs. number of LVs
[Figure: RMSECV values for the prediction of NaOH vs. the number of LVs (1-10)]

Regression coefficients
[Figure: raw spectra and the PLS regression coefficients vs. wavenumber (cm-1), 3000-13000 cm-1]

True vs. predicted
[Figure: predicted vs. true NaOH (wt%), values roughly 10-18, points close to the diagonal]

Why Pre-Processing?

Data artefacts:
- offset
- slope
- scatter
- other: missing values, outliers

Remedies:
- baseline correction
- alignment
- scatter correction
- noise removal
- scaling, normalisation
- transformation
- ...

[Figure: a spectrum with simulated artefacts (original, offset, offset + slope, multiplicative, offset + slope + multiplicative); intensity (a.u.) vs. wavelength (a.u.), 200-1600]

Pre-Processing Methods

All reasonable combinations of four sequential pre-processing steps were compared (4914 combinations):

STEP 1: BASELINE (7x)
- no baseline correction
- (3x) detrending, polynomial order 2-3-4
- (2x) derivatisation (1st, 2nd)
- AsLS

STEP 2: SCATTER (10x)
- no scatter correction
- SNV
- (3x) RNV (15, 25, 35)%
- MSC
- (4x) scaling to: mean, median, max, L2 norm

STEP 3: NOISE (10x)
- no noise removal
- (9x) Savitzky-Golay smoothing (window: 5, 9, 11 points; polynomial order: 2, 3, 4)

STEP 4: SCALING & TRANSFORMATION (7x)
- mean-centering
- autoscaling
- range scaling
- Pareto scaling
- Poisson scaling
- level scaling
- log transformation

Supervised pre-processing methods: OSC, DOSC.

Pre-Processing Results
Evaluated on the complexity of the model (number of LVs) and the classification accuracy (%).
[Figure: classification accuracy (%) vs. model complexity (number of LVs) for the raw data and all pre-processing combinations]

J. Engel et al., TrAC 2013
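As an illustration of a few of these steps, a small sketch combining SNV scatter correction, Savitzky-Golay smoothing (scipy) and mean-centering; the chosen window and order are arbitrary examples:

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(X):
    """Standard Normal Variate: per spectrum (row), subtract the mean and
    divide by the standard deviation (scatter correction)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def preprocess(X):
    # STEP 2: scatter correction (SNV)
    X = snv(X)
    # STEP 3: noise removal (Savitzky-Golay, window 9 points, order 3)
    X = savgol_filter(X, window_length=9, polyorder=3, axis=1)
    # STEP 4: mean-centering (column-wise, over the samples)
    return X - X.mean(axis=0)
```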


SOFTWARE

- PLS Toolbox (Eigenvector Inc.), www.eigenvector.com, for use in MATLAB (or standalone!)
- XLSTAT-PLS (XLSTAT), www.xlstat.com, for use in Microsoft Excel
- Package pls for R, free software, http://cran.r-project.org
