
Pre-whitening of data by covariance-weighted pre-processing

Harald Martens1*, Martin Høy2, Barry M. Wise3, Rasmus Bro1 and Per B. Brockhoff4

1 Department of Food and Dairy Science, Royal Veterinary and Agricultural University, DK-1958 Frederiksberg C, Denmark
2 Institute of Chemistry, Norwegian University of Science and Technology, N-7491 Trondheim, Norway
3 Eigenvector Research Inc., Manson, WA, USA
4 Department of Mathematics and Physics, Royal Veterinary and Agricultural University, DK-1871 Frederiksberg C, Denmark

*Correspondence to: H. Martens, Department of Food and Dairy Science, Royal Veterinary and Agricultural University, DK-1958 Frederiksberg C, Denmark. E-mail: Harald.Martens@mail.tele.dk

Received 7 May 2001; Revised 9 September 2002; Accepted 22 November 2002

J. Chemometrics 2003; 17: 153-165. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.780
A data pre-processing method is presented for multichannel `spectra' from process spectrophotometers and other multichannel instruments. It may be seen as a `pre-whitening' of the spectra, and serves to make the instrument `blind' to certain interferants while retaining its analyte sensitivity. Thereby the instrument selectivity may be improved already prior to multivariate calibration. The result is a reduced need for process perturbation or sample spiking just to generate calibration samples that span the unwanted interferants. The method consists of shrinking the multidimensional data space of the spectra in the off-axis dimensions corresponding to the spectra of these interferants. A `nuisance' covariance matrix S is first constructed, based on prior knowledge or estimates of the major interferants' spectra, and the scaling matrix G = S^{-1/2} is defined. The pre-processing then consists of multiplying each input spectrum by G. When these scaled spectra are analysed in conventional chemometrics software by PCA, PCR, PLSR, curve resolution, etc., the modelling becomes simpler, because it does not have to account for variations in the unwanted interferants. The obtained model parameters may finally be descaled by G^{-1} for graphical interpretation. The pre-processing method is illustrated by the use of prior spectroscopic knowledge to simplify the multivariate calibration of a fibre optical vis/NIR process analyser. The 48-dimensional spectral space, corresponding to the 48 instrument wavelength channels used, is shrunk in two of its dimensions, defined by the known spectra of two major interferants. Successful multivariate calibration could then be obtained, based on a very small calibration sample set. The paper then shows the pre-whitening used for reducing the number of bilinear PLSR components in multivariate calibration models. The nuisance covariance S is either based on prior knowledge of the interferants' spectra or based on estimating the interferants' spectral subspace from the calibration data at hand. The relationship of the pre-processing to weighted and generalized least squares from classical statistics is outlined. Copyright © 2003 John Wiley & Sons, Ltd.
KEYWORDS: pre-whitening; covariance; weighted; preprocessing; GLS; prior knowledge; process; multivariate
calibration
1. INTRODUCTION
1.1. Reducing unwanted effects
Classical chemical modelling, where prior knowledge is
used to formulate mathematical models based on causal/
mechanistic/first-principles theory, has problems when the
a priori knowledge is erroneous or incomplete. On the other
hand, data-driven explorative modelling, such as multi-
variate regression of one set of variables Y on another set of
variables X, has problems if the available data are inade-
quate. Sometimes, purely data-driven modelling requires
large amounts of input data for estimation of parameters that
one already knows.
The goal of the present covariance-weighted pre-proces-
sing technique is to maintain the flexibility of the data-driven
`soft modelling', but to reduce the requirements for
empirical calibration data, by including quantitative prior
knowledge in the modelling. If successful, this should
reduce the existing prerequisite for spanning all relevant
types of variation by the calibration samples, a requirement
that has made multivariate calibration of process analysers
expensive and cumbersome. It should also decrease the total
number of calibration samples needed, as fewer statistical
model parameters are to be estimated. Finally, the pre-
processing technique is intended to simplify the interpreta-
tion of the results, by reducing the dimensionality of the
resulting models. Another interpretational purpose is to
separate the effects of prior knowledge, graphically and
conceptually, from the unknown empirical structures in the
data. The present pre-processing version of statistical
covariance weighting [1-3] is intended to form a versatile
link between chemometrical `soft' modelling and traditional
analytical chemistry.
In signal processing, similar techniques have been used for
preconditioning data to remove known systematic noise
covariance patterns in the time domain prior to statistical
modelling, and it is then called `pre-whitening' [4-6]. In
chemometrics and related areas the use of assumptions or
prior knowledge about the residual distributions has been
presented earlier, introducing new algorithms for fitting the
models [7-9] according to a `maximum likelihood' criterion.
In the present paper the pre-processing concept is employed
for reduction of the impact of unwanted chemical inter-
ferants or physical effects in the samples or the instrument. It
also outlines the converse pre-colouring of multivariate data
to enhance the bilinear modelling.
The covariance-weighted pre-processing is illustrated
here for spectroscopic data, but the method is expected to
be generally applicable for multivariable measurements.
1.2. Shrinking and expanding the X-space: an
overview
The methodology in this paper is most easily understood if
envisioned geometrically in terms of an idealized version of
the conventional chemical mixture model. Assume that we
want to find how to estimate the concentrations of one or
more analytes from input spectra X in mixtures that also
contain irrelevant interferants (irrelevant chemical constitu-
ents or other undesired sources of variation in X). According
to the theoretical linear model for chemical mixtures, the
input spectra X contain additive contributions from desired
analytes, undesired interferants and random noise:
X = CK' + DL' + E    (1)
where C represents the concentrations (i.e. Y ~C) of the
desired analytes with spectra K, D represents the concentra-
tions of one or more undesired interferants with spectra L,
and E represents the unknown residuals in X, ideally
assumed to be random, identically distributed and indepen-
dent of C and D as well as of K and L.
If it is known a priori that some variables (columns) in X
have exceptionally high error levels or are irrelevant for
determination of the analyte(s), then it is common to
eliminate them. Alternatively, one may scale them down to
an error level more similar to that of the others (or even
lower if they are suspected to be irrelevant for determining
the analyte). This reduces the risk that their effects will
dominate the subsequent data modelling, by contaminating
e.g. the first few components from X (PCs, latent variables)
obtained by bilinear modelling methods such as principal
component analysis (PCA) or partial least squares regression
(PLSR). Thereby the PCs may become more useful for
graphical interpretation and statistical regression (calibra-
tion) of Y.
Likewise, variables in X known to be particularly precise
or particularly Y-relevant may be scaled up to increase the
chance that they will be represented in the first few useful
PCs from X.
Geometrically, this down- or upscaling of variables may
be seen as shrinking or expanding the column space of X
along its individual axes, the individual X-variables. The
present pre-processing allows shrinkage or expansion in any
direction in the X-space, including off-axis directions corre-
sponding to linear combinations of the X-variables. By
shrinking the column space in X in the direction of known
interferant spectra L and/or by expansion of this column
space in the directions of known analyte spectra K, the
chance of being able to pick up the relevant analyte
information in the first few PCs, before irrelevant inter-
ference information and random noise effects, is thereby
enhanced. The covariance-weighted pre-processing, com-
bined with subsequent least squares-based linear or bilinear
modelling, is related to the classical statistical methods of
weighted least squares and generalized least squares [1].
Likewise, a priori known (or estimated) information about
the individual samples' analyte concentrations C and inter-
ferant concentrations D may be used for shrinking and
expanding the row space of X, for similar reasons.
2. THEORY
2.1. Notation
Matrices are written as upper-case boldface, e.g. X, vectors as
lower-case boldface, e.g. s, and scalars and index symbols as
ordinary italics, e.g. i = 1,2,...,N. In particular, let matrix X
represent a table of input data having N rows (objects,
samples) and K columns (X-variables). For the sake of
generality, Y and the corresponding model parameters are
written as matrices, even in the case of having only J = 1
Y-variable. For simplicity, the variables (columns in X and Y)
are assumed to be mean-centred.
2.2. Variance- and covariance-weighted
pre-processing
2.2.1. Unweighted multivariate modelling
Some of the most commonly used multivariate methods in
chemometrics, such as PCA, PCR and PLSR, implicitly or
explicitly define latent variables from eigenanalysis. In the
case of PLSR, each latent variable's score vector is defined as
the first eigenvector of XX'YY', after deflating X and Y for
previous latent variables [10]. In PCA/PCR the score vectors
are simply obtained from the dominant eigenvectors of XX'.
In order to handle missing values, software systems such
as The Unscrambler, SIMCA, etc. use the NIPALS algorithm
to find these eigenvectors from PLSR or PCA. For each new
latent variable, this involves iterative use of a series of simple
linear ordinary least squares (OLS) regressions over objects
and over variables, between X or Y and preliminary versions
of the latent variable.
2.2.2. Weighting the variables
To control the impact of the different variables in the
eigenanalyses, these software systems include a weighting-
based pre-processing step, to balance the relevance and noise
levels of the different variables. This weighting may be
written as

X = X_Input G    (2)

where G (K × K) is a scaling matrix. In this conventional weighting, G is diagonal, with scaling elements that are the inverse of a predefined standard deviation s (K × 1). In the commonly used standardization, vector s is defined as the total initial standard deviation s_0 of the K variables in the set of available objects. However, it is also possible, and statistically more optimal, to define s as the standard uncertainty of the different variables, i.e. the expected standard deviation of their errors.
More formally, the scaling matrix G may be seen as the inverse square root of the diagonal variance elements in matrix S:

G = S^{-1/2}    (3)

Defining S = diag(s^2) and replacing X by X_Input G = X_Input S^{-1/2} in the PCA and PLSR definitions shows that the pre-processing of the X-variables is equivalent (see Appendix I) to defining the score vectors as eigenvectors of X_Input S^{-1} X'_Input in PCA/PCR and of X_Input S^{-1} X'_Input YY' in PLSR (after deflation). In the NIPALS estimation algorithm it may equivalently be attained by using weighted least squares (WLS) in the repeated regression over X-variables that defines each score vector.
If the errors in different X-variables are correlated, S
becomes a covariance matrix with non-zero off-diagonal
elements. From more or less approximate prior knowledge
about this uncertainty covariance, Equation (3) may still be
used for defining the pre-processing. Equation (2) then
yields a covariance-weighted pre-processing of the input
data. The equivalent NIPALS algorithm then requires
generalized least squares (GLS) regression [1,2] over the
X-variables to estimate the score vectors. Further details of
the relationship between classical GLS and the present use of
covariance-weighted pre-processing for `pre-whitening' of
spectral data are given in Appendix II. This also shows the
converse object weighting to remove correlated errors
between objects.
2.2.3. Definition of the pre-processing weights G
A practical implementation of Equation (3) is based on eigenanalysis of the uncertainty variance-covariance matrix S in terms of its eigenvectors V and eigenvalues λ:

SV = V diag(λ)    (4a)

The covariance weighting matrix is here defined as

G = V diag(λ^{-1/2}) V'    (4b)

The chosen symmetrical definition of G is not mandatory as long as GG' = S^{-1}, but it simplifies the visual interpretation of the weighted model parameters and residuals.
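Equations (4a) and (4b) translate directly into a few lines of linear algebra. The following minimal sketch (in Python/NumPy, whereas the original analysis was carried out in Matlab) computes the symmetric weighting matrix G from a given nuisance covariance S; the function name and the synthetic test matrix are illustrative only, and S is assumed to be symmetric positive definite, as it is when built from Equation (6f).

```python
import numpy as np

def covariance_weighting_matrix(S):
    """Symmetric scaling matrix G = V diag(lambda^(-1/2)) V' (Equations (4a)-(4b)).

    Assumes S is symmetric positive definite, as it is when built from
    Equation (6f), S = d^2 L L' + I.
    """
    lam, V = np.linalg.eigh(np.asarray(S, dtype=float))  # S V = V diag(lambda)
    return (V * lam ** -0.5) @ V.T                       # V diag(lambda^(-1/2)) V'

# Sanity check on an arbitrary positive definite S: G G' should equal S^(-1)
rng = np.random.default_rng(0)
A = rng.standard_normal((48, 48))
S = A @ A.T + 48 * np.eye(48)
G = covariance_weighting_matrix(S)
assert np.allclose(G @ G.T, np.linalg.inv(S))
```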
2.2.4. Deweighting the model parameters
The loadings P and residuals E of the X-variables, obtained from the bilinear model of the mean-centred, weighted X-data,

X = TP' + E    (5a)

may be descaled to fit the model of the mean-centred, unweighted data, i.e.

X_Input = T P'_Input + E_Input    (5b)

If G is symmetrical and has full rank (see below), the inversion of Equation (2) gives

X_Descaled = X_Input = X G^{-1}    (5c)

Likewise,

E_Descaled = E G^{-1}    (5d)

and

P_Descaled = G^{-1} P    (5e)
This simplifies the graphical interpretation of the X-loadings.
In regression methods such as PCR and PLSR the mean-centred, reduced-rank linear regression model summary, based on the scaled X-variables, may be written as

Y = X B_A + F_A    (5f)

where the regression coefficient parameter matrix B_A (K × J) uses A latent variables and F_A (N × J) represents residuals. B_A may be seen as linear combinations of orthogonal X-loadings (PCR) or orthogonal loading-like loading weights (PLSR). For graphical interpretation, B_A may therefore be descaled in analogy to Equation (5e) as

B_{A,Descaled} = G^{-1} B_A    (5g)

On the other hand, the regression coefficients suitable for prediction of the Y-variables directly from the unweighted X-variables,

Ŷ_A = X_Input B_{A,ForInput}    (5h)

may be obtained by inserting Equation (2) into Equation (5f), yielding

B_{A,ForInput} = G B_A    (5i)
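The descaling relations (5d), (5e), (5g) and (5i) amount to simple matrix products once G is available. The sketch below, with placeholder random spectra and a diagonal G standing in for Equation (3), is only meant to show where the scaling and descaling steps enter a bilinear (PCA-type) decomposition; it is not the authors' software, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X_input = rng.standard_normal((23, 48))       # placeholder spectra (N x K), not the litmus data

s_unc = np.ones(48)                           # assumed standard uncertainties of the 48 channels
s_unc[29:] = 4.0                              # e.g. channels #30-48 downweighted, as in Section 4.4
G = np.diag(1.0 / s_unc)                      # diagonal special case of Equation (3)
G_inv = np.linalg.inv(G)

Xc = X_input - X_input.mean(axis=0)           # mean-centre
X = Xc @ G                                    # Equation (2): weighted data

# Bilinear model X = T P' + E with A components, here via SVD (PCA)
A = 3
U, sv, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :A] * sv[:A]                         # scores
P = Vt[:A].T                                  # loadings (K x A)
E = X - T @ P.T                               # residuals in the weighted space

# Descaling for interpretation (Equations (5d) and (5e))
E_descaled = E @ G_inv
P_descaled = G_inv @ P

# For a coefficient matrix B_A estimated from the weighted X:
#   B_for_input = G @ B_A        (Equation (5i): apply directly to X_input)
#   B_descaled  = G_inv @ B_A    (Equation (5g): for graphical interpretation)
```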
2.2.5. Definition of the uncertainty covariance S from prior knowledge
In the situation with undesired interferants outlined in Equation (1), it is natural to define S from the unwanted contribution DL' + E. The spectra L of the interferants (the undesired variation patterns) may sometimes be assumed known, while their concentrations D are unknown. The formally correct definition could then be

S = L cov(D) L' + cov(E)    (6a)

where cov(D) represents the expected variance-covariance of the interferant concentrations and cov(E) represents the covariance of other, unidentified error patterns plus the variance of random i.i.d. noise. In practice, the variation in interferant concentrations may be difficult to specify and may e.g. be replaced by the approximation

cov(D) = d^2 I    (6b)
where d^2 is the expected average variance of the interferants' concentrations; intercorrelations between the interferants' concentrations are assumed to be negligible. The scalar d is given in the unit of interferant concentrations. Moreover, it may often be adequate to assume that the errors in E are uncorrelated, i.e.

cov(E) = diag(s^2)    (6c)

Thereby Equation (6a) simplifies to

S = d^2 LL' + diag(s^2)    (6d)

If all the X-variables have about the same uncertainty variance s^2, i.e.

cov(E) = s^2 I    (6e)

this leads to a further simplification. With the expected average interferant concentration variance d^2 being a general scaling factor determining the contribution of the interferant spectra, this further simplifies the definition of S to

S = d^2 LL' + I    (6f)

By defining the scaling factor d sufficiently large, the pre-processing X = X_Input S^{-1/2} (Equations (2) and (3)) in effect can make the subsequent least squares-based modelling of X completely insensitive (`blind') to signal variations caused by the unknown interferant concentrations. Only the net analyte signal, obtained as the residual after projecting K (Equation (1)) on L, will remain in X, together with unmodelled variations and measurement noise.
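A compact way to implement this knowledge-based version of the pre-whitening is to build S from the known interferant spectra via Equation (6f), take its inverse square root via Equations (4a)-(4b), and multiply the spectra by the result. The following Python/NumPy function is a minimal sketch of that sequence; the function name, argument names and the default d2 = 100 (the strongest shrinkage used in Section 4.3) are illustrative choices, not part of the original method description.

```python
import numpy as np

def gls_prewhiten(X_input, L, d2=100.0):
    """Pre-whitening against known interferant spectra (Equations (2), (3) and (6f)).

    X_input : (N, K) measured spectra
    L       : (K, n_int) known interferant spectra, one column per interferant
    d2      : assumed average interferant concentration variance d^2

    Returns the weighted spectra X = X_input G and the weighting matrix G.
    """
    K = X_input.shape[1]
    S = d2 * (L @ L.T) + np.eye(K)       # Equation (6f)
    lam, V = np.linalg.eigh(S)           # Equation (4a)
    G = (V * lam ** -0.5) @ V.T          # Equation (4b): G = S^(-1/2)
    return X_input @ G, G
```

For a unit-norm interferant direction the corresponding eigenvalue of S is d^2 + 1, so multiplication by G shrinks that direction by roughly a factor 1/d, while directions orthogonal to the columns of L are left essentially untouched.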
2.2.6. Definition of the uncertainty covariance S from previous residuals
When explicit prior knowledge about the spectrum of the individual interferants in L is lacking, the required information may instead be defined from spectral modelling residuals in previous calibration data. If X and Y data from a previous relevant set of M objects are available, the spectral residuals D in these data may be obtained after projection of X on the J known constituent concentrations Y:

D = [I - Y(Y'Y)^{-1}Y'] X    (7a)

These residuals D may then be used for estimating the future error covariance matrix S, by defining L in Equation (6f) as e.g. the first few (A) principal components of D, obtained by singular value decomposition of D:

USV' = D    (7b)

In the notation of e.g. Matlab the subspace of the interferants may be defined as

L = V(:, 1:A) S(1:A, 1:A)    (7c)
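Equations (7a)-(7c) can likewise be written out in a few lines. The sketch below estimates the interferant subspace L from a previous data set (X_prev, Y_prev); it assumes both blocks are mean-centred and that A residual components are enough to span the interferants, and the names are again illustrative.

```python
import numpy as np

def interferant_subspace(X_prev, Y_prev, A=2):
    """Estimate the interferant subspace L from previous data (Equations (7a)-(7c)).

    X_prev : (M, K) spectra and Y_prev : (M, J) known constituent concentrations,
    both assumed mean-centred; A is the number of residual PCs retained.
    """
    M = X_prev.shape[0]
    # Equation (7a): residuals after projecting X on Y over the M objects
    proj_Y = Y_prev @ np.linalg.solve(Y_prev.T @ Y_prev, Y_prev.T)
    D = (np.eye(M) - proj_Y) @ X_prev
    # Equations (7b)-(7c): dominant right singular subspace of D,
    # scaled by the corresponding singular values
    U, sv, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:A].T * sv[:A]
```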
2.2.7. Definition of the uncertainty covariance S from the data at hand
Equations (7a)-(7c) may alternatively be based on the X and
Y data at hand in the actual set of N calibration samples,
instead of on previous data. However, care must then be
taken to avoid overfitting. For instance, if cross-validation
and jackknifing are to be used for statistical assessment of a
calibration model, S may e.g. have to be re-estimated within
each cross-validation segment.
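The caution about overfitting can be made concrete with a cross-validation loop in which S is rebuilt from the training segment only. The sketch below uses a ridge-regularized least squares fit on the whitened data purely as a stand-in for the PLSR used later in the paper; the segment-wise re-estimation of D, L, S and G is the point being illustrated, and all names and default values are assumptions.

```python
import numpy as np

def loo_rmsep_refit_S(X, y, A_int=2, d2=100.0, ridge=1e-6):
    """Leave-one-out CV in which the nuisance covariance S is re-estimated
    from the training segment only (one way to respect Section 2.2.7).

    A ridge-regularized least squares fit on the whitened data stands in for
    the PLSR used in the paper; all names and default values are illustrative.
    """
    N, K = X.shape
    press = 0.0
    for i in range(N):
        train = np.arange(N) != i
        Xt, yt = X[train], y[train]
        xm, ym = Xt.mean(axis=0), yt.mean()
        Xc, yc = Xt - xm, yt - ym
        # Interferant subspace and S from the training segment only (Eqs (7a)-(7c), (6f))
        D = Xc - np.outer(yc, yc @ Xc) / (yc @ yc)
        U, sv, Vt = np.linalg.svd(D, full_matrices=False)
        L = Vt[:A_int].T * sv[:A_int]
        S = d2 * (L @ L.T) + np.eye(K)
        lam, V = np.linalg.eigh(S)
        G = (V * lam ** -0.5) @ V.T
        # Fit on whitened training data, predict the left-out sample
        Xw = Xc @ G
        b = np.linalg.solve(Xw.T @ Xw + ridge * np.eye(K), Xw.T @ yc)
        y_hat = ym + ((X[i] - xm) @ G) @ b
        press += (y[i] - y_hat) ** 2
    return np.sqrt(press / N)
```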
3. MATERIALS AND METHODS
3.1. Input data
The data set used for illustrating the pre-processing has been
chosen for its simplicity, in order to make the method clear.
The data [11] concern the determination of the protonated
state of a chemical dye, litmus.
3.2. Methods
Transmitted light spectra T were measured remotely by fibre optics in an industrial process spectrophotometer (Guided Wave Model 200). The transmittance spectra were converted into absorbance (here referred to as `optical density' (OD)) spectra and collected in K = 48 wavelength channels between about 400 and 700 nm. These OD spectra were termed X_Input, available for a total of 23 samples.
The samples contain different known concentrations [11] of protonated (red-coloured) litmus, which is the analyte to be calibrated for here, Y = [protonated litmus]. In addition, the samples have various unknown concentration variations of two interferants, unprotonated (blue-coloured) litmus (due to varying pH) and white zinc oxide powder. The data were analysed in Matlab Version 5.3 (The MathWorks, Inc.) using the first author's software.
4. RESULTS
4.1. Previous results for the same data
Without any interferants the OD data are expected to
increase proportionally with the concentration of the red-
coloured analyte, Y = [protonated litmus], at each wave-
length k where the analyte absorbs light, x_Input,k, k = 1,2,...,K.
However, the two interferants (blue litmus, white powder)
generate selectivity problems: strongly varying but un-
known levels of one or both of the interferants make it
impossible to determine the analyte by conventional
univariate calibration based on a single wavelength channel.
Such selectivity problems may be removed by multi-
variate calibration [2], without knowing anything about the
spectral characteristics of the pure analyte and the inter-
ferants, and without even knowing the concentrations of the
interferants in the calibration samples, as demonstrated for
these data in References [2,11]. However, this requires that
the calibration sample set spans not only the analyte's
concentration but also each of the interferants' concentra-
tions. The present paper shows how additional spectral
information about the interferants may be used to filter out
their effects by shrinking the X-space, to the extent that they
do not have to be modelled and therefore not even spanned
by the calibration set.
4.2. Input data
The two full curves in Figure 1 show the known
interference structures in the present application example:
the instrument responses L = [l_1, l_2] (crosses) of the two interferants, represented by their OD spectra at K = 48 wavelength channels in the visible wavelength range. These
will be used for shrinking the column space of X_Input in interferant directions by covariance-weighted pre-processing.
Curve 1 shows that the chemical interferant, unprotonated litmus (blue transparent solution, 15 g l^{-1}, measured at pH 9), has a spectrum l_1 with a broad but clear peak starting already at the lowest wavelength recorded (416 nm, X-variable #1), with a maximum at about 560 nm (~#20) and with little OD beyond 700 nm (~#37).
Curve 2 shows that the physical interferant, ZnO (white suspension, 6 g l^{-1}), displays a relatively flat OD spectrum l_2, with slightly higher OD at the lower wavelengths, as expected for a light-scattering powder. The purpose of the pre-processing in the present illustration is to reduce the impact of these two interferants on the OD spectra of future unknown samples.
The corresponding spectrum of the analyte itself (dotted
curve 3 in Figure 1), with its absorbance maximum near
500 nm (~X-variable #11), is from now on considered
unknown. It was only included in Figure 1 to help illustrate
how the subsequent covariance-weighted pre-processing
works.
Figure 2 presents the input mixture spectra, and then
demonstrates the effect of increasingly covariance-weighted
pre-processing of these spectra. Figure 2(a) shows the OD
spectra of the N = 23 aqueous samples with various
concentration levels of dissolved litmus, various pH levels
and various amounts of suspended ZnO powder. These
spectra, termed X_Input, are shown as a function of wavelength channel k = 1,2,...,48.
The characteristic contributions from the two known interferants are clearly visible: varying levels of the peak of the unprotonated blue litmus around 560 nm (X-variable #20), as well as the general baseline variation due to ZnO. In addition, a peak near X-variable #11 (500 nm) may also be observed, which reflects the `unknown' spectrum of the analyte itself (curve 3 in Figure 1, obtained at low pH). A few samples show evidence of both species of litmus; they represent purple-coloured solutions at neutral pH.
The analytical question here is to what extent the effects of
the two interferants in Figure 2(a) (unprotonated litmus and
ZnO powder) can be removed by pre-processing of these
data X_Input in preparation for the calibration w.r.t. Y, the
analyte concentration (protonated litmus).
The full curve in Figure 2(b) reflects the fit of the 23 input spectra X_Input in Figure 2(a) to their first principal component score vectors t_a, a = 1-4, in terms of the average squared multiple correlation, i.e. the cumulative fraction of the X-variance explained by the first four components from PCA of the mean-centred input spectra X = X_Input. The
superimposed broken curves show the corresponding fit of
Y to these principal component score vectors, i.e. the fit of Y
in PCR. The figure shows that the mean-centred but
otherwise untreated input spectra have three major principal
components (as expected with one analyte and two
independent interferants). The third PCR component is most
strongly correlated with Y; the first two components appear
mainly to span the variation of the two irrelevant inter-
ferants.
Figure 1. Known optical density (OD) spectra L = [l_1, l_2] of two interferants, and analyte spectrum K. Curve 1: blue (unprotonated) litmus (l_1, 15 g in 1000 ml of water, pH 9). Curve 2: white powder (l_2, 6 g ZnO in 1000 ml of water). The crosses mark the 48 wavelength channels to be used as X-variables. Curve 3: red (protonated) litmus (K, 15 g in 1000 ml of water, pH 4), regarded as unknown and included here only for illustration of the methodology.
4.3. Increasing degree of shrinkage of input data
The rest of Figure 2 illustrates how the spectra X look after increased downscaling of the two known interferants' impact in the pre-processing X = X_Input G = X_Input S^{-1/2} (Equations (2) and (3)). The error covariance matrix S was here defined by the simplified expression in Equation (6f) as an increasingly weighted sum of the covariance d^2 LL' (where L = [l_1, l_2] from Figure 1) plus a constant noise variance, diag(s^2) = I.
The scalar d^2 determines the degree of shrinkage. The four rows in Figure 2 represent four increasing degrees of shrinkage, d^2 = 0, 0.1, 1 and 100. This may be thought of as four different subjective judgements of the relevance of the two interferants. The left side of the figure shows a gradual simplification of the X-data, until with d^2 = 100 (Figure 2(g)) only one systematic pattern of variation is clearly discernible from the random measurement noise.
The right side of the figure confirms this: as the contributions from the two interferants are diminished, the ability of the remaining absorbance variation in X to describe the analyte Y increases. Without any shrinkage of the interferants' absorbance contributions (d^2 = 0), three PCs were required to describe both X and Y. Already at d^2 = 1, most of the variation in Y is described after only one PC. With d^2 = 100 the first PC gives a more or less complete description of X as well (Figure 2(h)). Equivalently (see Appendix I), this means that X_Input S^{-1} X'_Input has only one large eigenvalue.
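The behaviour summarized in Figure 2 (one dominant eigenvalue once d^2 is large) can be reproduced qualitatively on simulated data, since the litmus spectra themselves are not tabulated here. In the sketch below the interferant and analyte spectra are random stand-ins, so only the qualitative pattern, not the numbers, should be compared with the figure.

```python
import numpy as np

rng = np.random.default_rng(2)
K, N = 48, 23
L = rng.standard_normal((K, 2))            # stand-ins for the two interferant spectra l1, l2
k_analyte = rng.standard_normal(K)         # stand-in analyte spectrum
C = rng.uniform(0.0, 1.0, N)               # analyte concentrations
D = rng.uniform(0.0, 1.0, (N, 2))          # interferant concentrations
X_input = np.outer(C, k_analyte) + D @ L.T + 0.01 * rng.standard_normal((N, K))
Xc = X_input - X_input.mean(axis=0)

for d2 in (0.0, 0.1, 1.0, 100.0):
    S = d2 * (L @ L.T) + np.eye(K)                       # Equation (6f)
    lam, V = np.linalg.eigh(S)
    G = (V * lam ** -0.5) @ V.T                          # Equation (4b)
    ev = np.linalg.svd(Xc @ G, compute_uv=False) ** 2    # eigenvalues of X_Input S^(-1) X_Input'
    print(f"d2 = {d2:6.1f}  first four eigenvalue shares:", np.round(ev[:4] / ev.sum(), 3))
```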
4.4. A priori information for OLS, WLS and GLS pre-processing
Figure 3 compares the pre-processing parameters in conventional unweighted linear regression (here termed `OLS'), in the pre-processing with diagonal S, as used in e.g. most chemometric software (here termed `WLS'), and in the new covariance-weighted pre-processing (here termed `GLS'; see Appendix II). The left subplots show the uncertainty information assumed available a priori in each of the three cases. The right subplots illustrate the effect of the pre-processing for three arbitrary X-variables (out of 48), namely #10, 20 and 30, for all the samples.
In the top row (`OLS') there is no prior information used (in Equation (6d), diag(s) = I and d^2 = 0). The variation in all three directions #10, 20 and 30 is seen to be the information that we expect from Figure 2(a).
Figure 2. Effect of increasing degree of GLS shrinkage of input data. Left: GLS pre-processed input data X = X_Input G, where X_Input is the input spectra (a). Right: cumulative fit (fraction of explained variance, R^2) of X (crosses, full line) and Y (circles, broken line) as a function of PCA component a = 1-4. Rows 1-4: covariance scaling factors d^2 = 0, 0.1, 1 and 100 respectively (Equation (6d)).
The X-variables from wavelength channel #30 onwards represent mostly baseline information. In order to visualize the effect of the WLS pre-processing available in most chemometrics software packages today, we make the subjective assumption that the baseline channels from #30 onwards contain mainly irrelevant noise (we ignore that the X-data in this region may carry useful baseline information). Therefore we a priori ascribe relative standard uncertainty s_k = 1 for X-variables k = 1-29, but increase this to s_k = 4 for k = 30-48, and use these expected noise levels as s in Equation (6d). For this WLS pre-processing, the covariance shrinkage factor is still defined as d = 0. The vertical variation in X-variable #30 is seen to have been reduced in Figure 3(d) compared with Figure 3(b), but otherwise the sample configuration is unchanged and the cloud of sample points still spans three dimensions.
In the third row (`GLS') we additionally employ the spectral background knowledge about the two interferants from Figure 1, l_1 and l_2, with shrinkage factor d^2 = 100. We retain the value of s from the WLS case to illustrate how the variance diag(s^2) and the covariance d^2 LL' in Equation (6d) can be used at the same time. The cloud of sample points in Figure 3(f) now spans mainly a single dimension: variations in net analyte signal. Many of the interferant effects have been removed already during pre-processing.
4.5. Calibration based on very few samples
In this subsection we illustrate one possible use of pre-
whitening: the removal of interference effects not seen in the
calibration sample set. Conventional cross-validated PLSR is
used as the calibration method.
In regression-based multivariate calibration, all the inter-
ference phenomena that may occur in future samples have to
be represented in the calibration sample set, with sufficient
clarity and sufficiently independent of the other types of
variations. Sometimes that is difficult to attain, for economic
or practical reasons, for instance when calibrating an
industrial process spectrophotometer. The covariance-
weighted pre-processing method allows interference phe-
nomena with known spectra L to be corrected for at the pre-
processing stage, so that they do not have to be spanned in
the calibration set.
The first column of subplots in Figure 4 shows the original
absorbance spectra X_Input. The second column of subplots in
Figure 4 shows the spectra after pre-processing by the three
methods illustrated in Figure 3 for three of the X-variables.
Figure 3. Comparison of OLS, WLS and GLS pre-processing. Top (a,b), OLS; middle (c,d), WLS; bottom (e,f), GLS. Left: information available a priori. Right: data plotted in 3D for X-variables #10, 20 and 30. Each point represents one sample's spectrum.
Calibration set. The three densely dotted curves in Figure 4(a) represent N = 3 objects that together are here regarded as if they were the only samples available with both X- and Y-data. This tiny calibration sample set has relative analyte concentrations Y = [0.009, 0.365, 0.679]'. Test set. For the sake of illustration, the thin curves in Figure 4 represent the remaining 20 objects, which will now be treated as a new, future set, for which Y is to be predicted from their spectra X. These input data are the same for the OLS, WLS and GLS cases (rows 1, 2 and 3 in Figure 4).
The three densely dotted curves were used as X in calibration against Y, with the model parameters estimated by PLSR. In all three cases (OLS, WLS and GLS), the PLSR model with one PC appeared to perform best in the small calibration set, because the calibration samples only spanned the analyte variation and no interferants. The linear regression coefficient vector B_{A=1} gave a more or less equally `perfect' fit in the N = 3 calibration samples for all three pre-processing methods, as evidenced by the three dots along the `ideal' diagonal (middle column of subplots in Figure 4).
The analyte concentrations in the remaining 20 `unknown' samples, Ŷ_A, were now predicted from their spectra, using the `optimal' calibration model B_{A=1}. The circles in the middle column of subplots in Figure 4 show that the OLS and WLS calibration models gave bad Y-predictions in the new, independent samples, while the GLS calibration model gave good predictions. The reason is that variations in the input spectra due to varying, uncontrolled levels of the two interferants were not seen in the calibration set and hence were left unchecked by the conventional unweighted and variance-weighted cases (OLS and WLS). In contrast, the damaging effects of the interferants on the predictive ability of the calibration model were more or less eliminated by the covariance-weighted pre-processing (GLS).
The two rightmost columns of subplots in Figure 4 show the X-residuals after the one-dimensional PLSR model, in terms of the scaled residuals E (obtained after projection of X on the first PC t_1) and their descaled version E_Descaled (Equation (5d)) respectively. This shows that the unmodelled interference information was clearly visible for the
Figure 4. Calibration with very few samples. Top (a-1 to a-5), OLS; middle (b-1 to b-5), WLS; bottom (c-1 to c-5), GLS. Column 1: input data X_Input of three calibration samples (densely dotted) and 20 unknown test samples. Column 2: scaled spectra for regression modelling, X = X_OLS, X_WLS or X_GLS. Column 3: Y-values predicted from the optimal models, ŷ_{i,A=1} (ordinate), vs measured values y_i (abscissa); target line ŷ_{i,A=1} = y_i. Column 4: spectral residuals E from the one-PC PLSR model of the scaled X-data. Column 5: spectral residuals after descaling by Equation (5d), E_Descaled.
new unknown samples, both for the OLS/WLS and GLS cases. In the GLS case, E was very low (Figure 4(c-4)) compared with the scaled X-data (Figure 4(c-2)), even for the 20 `new' samples. However, after descaling, the characteristic signals of the two unmodelled interferants became clearly visible in the residual spectra E_Descaled (Figure 4(c-5)). These residuals may be submitted to a second bilinear modelling, yielding a second set of score vectors and residual variances, for outlier analysis, etc.
In summary, the pre-processing in this case allowed us to
make a valid calibration model with a small and otherwise
inadequate calibration set, in spite of a glaring lack of
interferant variability between the calibration objects. This
illustrates that shrinking away interference effects in the
X-space by pre-whitening makes it possible to use fewer
calibration samples, and in particular fewer Y-data, and
hence to get cheaper and simpler calibration models.
4.6. Calibration based on many samples
The next two figures illustrate another advantage of pre-
whitening: the ability to reduce the required dimensionality
of the calibration model for a given set of calibration
samples. The main purpose of this reduction is to simplify
model interpretation, with a possible enhancement of the
predictive performance. In this case all the available objects
from Figure 2(a) are used as calibration samples (N = 23). The
same parameter sets (termed OLS, WLS and GLS) were used
as in the last example, and PLSR was again used for
developing the calibration models.
Full leave-one-out cross-validation was used for assessing
the models in terms of their optimal rank A and their root
mean square error of prediction in Y, RMSEP(Y)_A. The input spectra of the calibration samples now represent all N (3 + 20 = 23) curves displayed in the first column of subplots in Figure 4. The three full curves in Figure 5 show the predictive ability of the OLS, WLS and GLS cases, in terms of the cross-validated RMSEP(Y)_A vs A = 0, 1, 2,...,6.
(The dotted curve will be discussed later.)
The figure first of all shows that while the OLS and WLS
models require at least A = 3 PCs to reach acceptably low
predictive error, the GLS model did so with only A = 1 PC.
Moreover, a slight improvement in predictive ability was
attained: using two PCs, the GLS case gives a lower
predictive error than the OLS/WLS cases gave with three
or more PCs.
Finally, Figure 6 illustrates the effect of rescaling and
descaling of the model parameters, in this case of the
estimated regression coefficient vector at the lowest accep-
table rank, for OLS, WLS and GLS. The OLS solution is
superimposed on the WLS and GLS solutions as a dotted
line, for comparison.
The left column of subplots shows B_A, as obtained from bilinear PLSR at the optimal number of PCs (A), based on the scaled X-variables in the OLS, WLS and GLS cases. The three ways of pre-processing may be seen to yield somewhat different scaled regression coefficients. Moreover, while the OLS and WLS solutions required A = 3 PCs, the GLS solution required only A = 1 PC.
The middle column shows the rescaled coefficient spectrum B_{A,ForInput} (Equation (5i)), suitable for application
Figure 5. Calibration based on all samples: predictive performance after OLS, WLS and GLS pre-processing. Prediction error of y, estimated by full leave-one-out cross-validation, from PLSR modelling of X = X_Input G with G = S^{-1/2}. Squares: OLS; S = I (no pre-processing). Circles: WLS; S diagonal (variance weighting). Triangles: knowledge-based GLS; S defined from the two known interferant spectra l_1 and l_2. Dotted curve: data-based GLS; S defined from spectral residuals after projection of X_Input on y.
directly to the input X-variables. Again the OLS solution is superimposed on the WLS and GLS solutions (dotted line). The scaling of the individual X-variables in vector B_{A,ForInput} is independent of the pre-processing of the X-variables, so the only difference between the solutions is due to the impact of the pre-processing on the estimation process itself. Figures 6(e) and 6(h) show that the downweighting of the X-variables from channel #30 onwards has rendered the other channels more important for separating the baseline variations due to the turbidity from the blue-coloured interferant and the red-coloured analyte. The wavelength channels just below #30, with low absorbance at the end of interferant spectrum l_1 (Figure 1), are given higher relative importance in the modelling. This confirms that in a rank-reduced calibration model such as the present low-rank PLSR modelling, there are several almost equivalent ways to combine the 48 input variables in order to attain the desired selectivity enhancement.
The right column of subplots in Figure 6 shows the descaled coefficient spectrum B_{A,Descaled} (Equation (5g)), suitable for graphical interpretation, with the OLS solution again superimposed (dotted line). Now the obvious effect of e.g. the sharp downweighting of the X-variables from channel #30 onwards has been removed.
The three solutions are qualitatively similar: they have
positive values below about channel #15, as expected from
the spectral characteristic of the analyte red litmus, and
negative values at higher wavelength channels in order to
compensate for the possible presence of the interferants blue
litmus and white ZnO. However, quantitatively, the three
solutions are somewhat different. This shows that with
different pre-processing methods the PLSR models needed
to describe different Y-relevant patterns of variation in the
data in order to attain the desired selectivity.
4.6.1. Definition of the uncertainty covariance S from the calibration data at hand
The dotted curve in Figure 5 represented the results when the interferant spectra L (Figure 1) were considered unknown, and instead estimated from the X- and Y-data of the 23 samples in the actual calibration data set at hand. As before, leave-one-out cross-validation was employed, with re-estimation of the spectral interferant covariance S for each cross-validation segment. The figure shows that the pre-whitening based on the estimated spectral residual matrix D (Equation (7a)) with its dominant subspace L (Equations (7b) and (7c), using A = 2 PCs) gives almost as simple modelling as the one based on prior knowledge of the two interferants' individual spectra L = [l_1, l_2]: in both cases the number of PLSR components required is reduced, because the model does not have to span these major interferants. However, the prediction error is now slightly higher. A possible reason for this is that the former, knowledge-based pre-processing used the known spectra L as additional independent information in estimating S, while the latter, data-driven pre-processing had no such extra information available.
Figure 6. Calibration based on all samples: regression coefficients estimated, rescaled and descaled. Top, OLS (A = 3 PCs); middle, WLS (A = 3 PCs); bottom, GLS (A = 1 PC). Left: coefficients B̂_A obtained from the scaled spectra X. Middle: rescaled coefficients B̂_{A,ForInput} (Equation (5i)), applicable directly to the unscaled input spectra X_Input. Right: descaled coefficients B̂_{A,Descaled} (Equation (5g)); weighting effects removed. Dotted curves: OLS estimate B̂_{A=3} from (a), for comparison.
5. DISCUSSION
Figure 4 demonstrated an ability of the covariance-weighted
pre-processing to give good predictive ability even for new
samples with interferants not present in the calibration set. This
may become important in e.g. calibrating industrial process
analysers, when it is difficult to perturb the actual process
enough to get a sufficiently informative calibration sample
set. By introducing prior knowledge about known inter-
ferants' spectral signatures, the interferants can be compen-
sated for already in a pre-processing filtering step, and thus
do not have to vary in the calibration set.
Figure 5 demonstrated that the covariance-weighted `GLS'
pre-processing yielded calibration models with lower rank
than those from the conventional `OLS' and `WLS' methods.
High-dimensional models are generally cumbersome to
interpret graphically, so that is an advantage. Moreover, as
long as the uncertainty covariance S represents prior
knowledge, a slight improvement in prediction ability may
be expected, because the subsequent calibration then
requires fewer statistical parameters to be estimated from
the available N calibration data.
5.1. Comparison with other methods
The covariance-weighted pre-processing based on prior
known spectra L has the advantage of reducing interference
without consuming degrees of freedom from the available,
often expensive Y-data. In that respect it resembles spectral
interference subtraction (SIS) [12]. If, instead, S is estimated
from the available data [X, Y] at hand, the pre-processing has
some similarity to so-called orthogonal signal correction
(OSC) [13] and direct orthogonalization (DO) [14]. Extended
multiplicative signal correction (EMSC) [12,15] has similar
properties to SIS and covariance-weighted pre-processing,
but allows for removal of both additive and multiplicative
effects.
There is one major difference in how the covariance-
weighted pre-processing and the set of OSC, DO, SIS and
EMSC methods attempt to reduce the interference effects in
X
Input
. The latter methods subtract the effects in one way or
another. In contrast, the new covariance-weighted pre-
processing is based on shrinking by division (i.e. multi-
plication by the inverse of S; see Equations (2) and (3)). The
full consequences of this distinction are not yet clear.
However, it may be noted that DO [14] is particularly
similar to the data-driven estimation of interferant subspace
L (Equations (7a)-(7c); Figure 5, dotted line), even though it
employs subtraction instead of inverted scaling to eliminate
the effect of the interferants.
5.2. Pre-colouring the spectra
Instead of just shrinking the X-space in particularly
undesired or irrelevant directions, one may also reformulate
the covariance-weighted pre-processing to expand the
X-space in directions known to be particularly desired or
relevant. For instance, after having contracted the X-space to
filter out irrelevant or detrimental interferants, the X-space
could then be expanded in the dimension of the analyte's
spectrum (curve 3, Figure 1), to enhance this desired type of
variation over e.g. random measurement noise in the
subsequent multivariate subspace analysis. Preliminary
Monte Carlo simulations (not shown here) indicate this to
have some statistical advantage.
The pre-processing has been used for pre-whitening
spectral X-variables in this paper. However, it may equally
well be applied to the set of Y-variables. Appendix I outlines
various equivalent alternatives for integrating the interferant
covariance matrix S into the actual estimators in PCA/PCR
and PLSR, instead of using S^{-1/2} for pre-processing. When
prior knowledge is available about the objects at hand, the
pre-processing may also then be used, in a bilinear analogy
to the conventional GLS estimator (Appendix II).
It should be noted that after covariance-weighted pre-
processing to remove all major interferants, the remaining
spectra mainly show the net signal of the analyte plus
random noise (see Figure 2(g)). Of course, if the spectrum of
the analyte, K (Equation (1)), is a linear combination of the
spectra L of the interferants, the covariance-weighted pre-
processing will filter out the analyte effect too; the remaining
net analyte signal is zero. Thus the usual requirement in
quantitative analysis, that the analyte spectrum has to be
linearly independent of the major interferant spectra,
remains valid.
6. CONCLUSIONS
A method has been presented for covariance-weighted pre-
processing of multivariate input data. It facilitates the use of
prior knowledge about undesired (and desired) structures
that are expected to vary in the input data. Its purpose is to
reduce the complexity of the ensuing model and to improve
its predictive ability. The method was illustrated for
reducing the effect of spectral variations due to known
interferants' known spectra.
In general, multivariate calibration by low-rank regres-
sion, using e.g. PCR or PLSR, has proven highly effective for
solving selectivity problems in complex systems. Many
unidentified interference problems can even be dealt with, as
long as they are spanned well in the calibration sample set
and picked up clearly by the multichannel instrument.
However, the present combination of prior knowledge
and empirical calibration data may simplify calibration,
because already known parameters do not have to be
estimated statistically from the calibration data. The final
statistical regression stage in the calibration process could
then primarily be used for finding and correcting unknown
or unexpected phenomena in the data. Thereby calibration of
multichannel instruments may become less expensive and
time-consuming, and easier to understand.
APPENDIX I. EIGENVECTOR EXPRESSIONS FOR COVARIANCE-WEIGHTED PRE-PROCESSING

In PCA, each latent variable (PC) is an eigenvector of XX' (after suitable mean centring). If the score vector for an individual PC, t, is scaled to t't = 1, this may be written as tλ = (XX')t. Inserting X = X_Input S^{-1/2} (Equations (2) and (3)) into this eigenvalue expression yields the covariance-weighted expression tλ = (X_Input S^{-1} X'_Input)t.
Equivalently, t is then a right-hand singular vector of X_Input S^{-1/2}.
Conversely, if the PCA loading vector p is scaled to p'p = 1, then pλ = (X'X)p. Inserting X = X_Input S^{-1/2} gives pλ = (S^{-1/2} X'_Input X_Input S^{-1/2})p; p is then a left-hand singular vector of X_Input S^{-1/2}.
In PLSR, each component is an eigenvector of the X-Y covariance structure [10]. For instance, with orthonormal scores, t is defined by tλ = (XX'YY')t (after suitable deflation for previous components). With X = X_Input S^{-1/2} this gives the expression tλ = (X_Input S^{-1} X'_Input YY')t. Conversely, the orthonormal loading weight w for each component, used for defining t = Xw (after suitable deflation of X for previous components), is defined by wλ = (X'YY'X)w. Covariance-weighted pre-processing is equivalent to defining wλ = (S^{-1/2} X'_Input YY' X_Input S^{-1/2})w, or w as the first left-hand singular vector of S^{-1/2} X'_Input Y.
Hence the PCA/PCR and PLSR solutions may be obtained either by covariance-weighted pre-processing X = X_Input S^{-1/2} followed by standard OLS-based software for PCA/PCR or PLSR, or by eigenvector decomposition of cross-product matrices weighted by S^{-1}. The latter is analogous to generalized least squares (GLS) regression.
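The equivalence stated above is easy to verify numerically: PCA of the pre-processed X = X_Input S^{-1/2} and eigenanalysis of X_Input S^{-1} X'_Input give the same leading score direction. The following check uses random data and an arbitrary positive definite S; it is an illustration, not part of the original paper.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 23, 48
X_in = rng.standard_normal((N, K))
A = rng.standard_normal((K, K))
S = A @ A.T + K * np.eye(K)                  # an arbitrary positive definite uncertainty covariance

lam, V = np.linalg.eigh(S)
S_inv_half = (V * lam ** -0.5) @ V.T         # S^(-1/2)
X = X_in @ S_inv_half                        # covariance-weighted pre-processing

# Route 1: first score direction from ordinary PCA (SVD) of the pre-processed X
t1 = np.linalg.svd(X, full_matrices=False)[0][:, 0]

# Route 2: first eigenvector of X_in S^(-1) X_in'
w, Q = np.linalg.eigh(X_in @ np.linalg.inv(S) @ X_in.T)
t2 = Q[:, -1]                                # eigh returns eigenvalues in ascending order

# Same direction up to sign
assert np.allclose(abs(t1 @ t2), 1.0, atol=1e-6)
```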
APPENDIX II. GLS AND COVARIANCE-
WEIGHTED PRE-PROCESSING
The relationship between generalized least squares (GLS)
regression and covariance-weighted pre-processing will be
demonstrated here. In weighted least squares (WLS) the
regressor-regressor and regressor-regressand cross-product matrices are modified by the inverse error covariance matrix S^{-1}. When S has off-diagonal elements, this approach is
called `GLS' in some statistical literature [2]. The terms `WLS'
and `GLS' are therefore employed here to distinguish purely
variance-based weighting from covariance-based weighting.
In some other statistical literature the WLS and GLS terms
are used more interchangeably. More details are given in
Reference [1].
II.1. Regression over objects
In the conventional OLS case the input data for one or more regressands, Y_Input (N × J) = [y_Input,j, j = 1,2,...,J], are modelled by projection on one or more regressors, X_Input (N × K) = [x_Input,k, k = 1,2,...,K], over a set of N objects, according to the linear model Y_Input = X_Input B + F_Input (ignoring the mean centring). To estimate the regression coefficients B (K × J), the conventional estimator fits each regressand y_Input (N × 1) individually to X_Input by minimizing f'_Input f_Input. This yields the conventional full-rank OLS estimator B̂ = (X'_Input X_Input)^{-1} X'_Input Y_Input.
If the correlation pattern between the response errors in the N objects, S_N (N × N), is known, the GLS estimator B̂ = (X'_Input S_N^{-1} X_Input)^{-1} X'_Input S_N^{-1} Y_Input yields better estimates, because it minimizes f'_Input S_N^{-1} f_Input for each regressand, i.e. the importance of the correlated error pattern is downweighted.
Equivalently, the pre-whitening operators X = S_N^{-1/2} X_Input and Y = S_N^{-1/2} Y_Input allow the model to be rewritten as Y = XB + F. The same GLS estimator may now be rewritten as B̂ = (X'X)^{-1} X'Y, which shows that covariance-weighted pre-processing allows the GLS estimation of B to be performed by conventional OLS tools. This was here shown for full-rank OLS/GLS regression, but it is equally applicable for regression methods that handle collinear X-variables, such as ridge regression and the bilinear methods PCR and PLSR.
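A numerical check of this object-wise equivalence, with random data standing in for a real calibration set, might look as follows; the two routes (classical GLS and pre-whitening followed by OLS) agree to machine precision because W'W = S_N^{-1}. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, J = 30, 5, 2
X_in = rng.standard_normal((N, K))
B_true = rng.standard_normal((K, J))
A = rng.standard_normal((N, N))
S_N = A @ A.T + N * np.eye(N)                                  # error covariance between the N objects
F = rng.multivariate_normal(np.zeros(N), S_N, size=J).T        # correlated errors, (N, J)
Y_in = X_in @ B_true + 0.1 * F

# Classical GLS estimator
S_inv = np.linalg.inv(S_N)
B_gls = np.linalg.solve(X_in.T @ S_inv @ X_in, X_in.T @ S_inv @ Y_in)

# Pre-whitening route: X = S_N^(-1/2) X_in, Y = S_N^(-1/2) Y_in, then plain OLS
lam, V = np.linalg.eigh(S_N)
W = (V * lam ** -0.5) @ V.T                                    # S_N^(-1/2)
Xw, Yw = W @ X_in, W @ Y_in
B_ols = np.linalg.solve(Xw.T @ Xw, Xw.T @ Yw)

assert np.allclose(B_gls, B_ols)
```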
II.2. Regression over X-variables
The converse case is traditional direct multivariate calibration or multicomponent curve resolution according to Beer's law. Here each spectrum x_Input (1 × K) in the matrix X_Input = [x_Input,k; k = 1,2,...,K] is modelled by a set of J known analyte spectra K_Input (K × J) in the linear regression model X_Input = C K'_Input + E_Input, where C (N × J) is the matrix of unknown analyte concentrations and E_Input (N × K) is the matrix of spectral residuals (ignoring baseline offsets). When the constituent spectrum matrix K_Input has full column rank, the OLS estimator minimizes e_Input e'_Input for each row in X_Input, yielding Ĉ = X_Input K_Input (K'_Input K_Input)^{-1}.
If the correlation pattern between the response errors in the K X-variables, S (K × K), is known, then the GLS estimator minimizes e_Input S^{-1} e'_Input and yields Ĉ = X_Input S^{-1} K_Input (K'_Input S^{-1} K_Input)^{-1}.
The equivalent covariance-weighted pre-processing solution for curve resolution pre-whitens the spectra [X; K'] = [X_Input; K'_Input] S^{-1/2}, thereby shrinking away the noise correlations between the X-variables. The model may then be written as X = CK' + E, and the GLS concentration estimate may be obtained by Ĉ = XK(K'K)^{-1}, i.e. by an OLS expression.
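The same identity holds for the curve resolution case: whitening both the mixture spectra and the constituent spectra by S^{-1/2} and then applying the OLS expression reproduces the GLS concentration estimates. A minimal check with simulated data (illustrative names and sizes) is sketched below.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, J = 10, 48, 3
K_in = rng.standard_normal((K, J))                        # 'known' constituent spectra (K x J)
C_true = rng.uniform(0.0, 1.0, (N, J))
A = rng.standard_normal((K, K))
S = A @ A.T + K * np.eye(K)                               # error covariance between the K channels
E = 0.05 * rng.multivariate_normal(np.zeros(K), S, size=N)
X_in = C_true @ K_in.T + E

# Classical GLS estimate of the concentrations
S_inv = np.linalg.inv(S)
C_gls = X_in @ S_inv @ K_in @ np.linalg.inv(K_in.T @ S_inv @ K_in)

# Pre-whitening route: whiten the spectra and the constituent spectra, then use the OLS expression
lam, V = np.linalg.eigh(S)
G = (V * lam ** -0.5) @ V.T                               # S^(-1/2)
Xw, Kw = X_in @ G, G @ K_in
C_ols = Xw @ Kw @ np.linalg.inv(Kw.T @ Kw)

assert np.allclose(C_gls, C_ols)
```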
In summary, prior knowledge about the uncertainty covariances S may be used to improve linear regression. In Appendix I the same was shown for bilinear regressions. In both cases, one may either analyse the input data directly by GLS or GLS-like expressions, involving S^{-1}, or perform covariance-weighted pre-processing of the input data by S^{-1/2}, followed by OLS or OLS-like expressions, as illustrated in this paper.
REFERENCES
1. Read BC. Weighted least squares. In Encyclopedia of Statistical Sciences, vol. 9, Kotz S, Johnson NL (eds). Wiley Interscience, J. Wiley & Sons Inc: New York, 1988; 576-578.
2. Martens H, Naes T. Multivariate Calibration. Wiley: Chichester, 1989.
3. Gower JC. Generalised canonical analysis. In Multiway Data Analysis, Coppi R, Bolasco S (eds). Elsevier: Amsterdam, 1989; 221-232.
4. Bullmore E, Long C, Suckling J, Fadili J, Calvert G, Zelaya F, Carpenter A, Brammer M. Colored noise and computational inference in neurophysiological (fMRI) time series analysis: resampling methods in time and wavelet domains. Human Brain Mapp. 2001; 12: 61-78.
5. De Lathauwer L, de Moor B, Vandewalle J. An introduction to independent component analysis. J. Chemometrics 2000; 14: 123-149.
6. Kuldvee R, Kaljurand M, Smit HC. Improvement of signal-to-noise ratio of electropherograms and analysis reproducibility with digital signal processing and multiple injections. J. High Resol. Chromatogr. 1998; 21: 169-174.
7. Wentzell PD, Andrews DT, Kowalski BR. Maximum likelihood multivariate calibration. Anal. Chem. 1997; 69: 2299-2311.
8. Wentzell PD, Lohnes MT. Maximum likelihood principal component analysis with correlated measurement errors: theoretical and practical considerations. Chemometrics Intell. Lab. Syst. 1999; 45: 65-85.
9. Paatero P, Tapper U. Positive matrix factorisation: a non-negative factor model with optimal utilisation of error estimates of data values. Environmetrics 1994; 5: 111-126.
10. Høskuldsson A. PLS regression methods. J. Chemometrics 1988; 2: 211-228.
11. Martens H, Martens M. Multivariate Analysis of Quality. An Introduction. Wiley: Chichester, 2001.
12. Martens H, Stark E. Extended multiplicative signal correction and spectral interference subtraction: new pre-processing methods for near infrared spectroscopy. J. Pharmaceut. Biomed. Anal. 1991; 9: 625-635.
13. Wold S, Antti H, Lindgren F, Öhman J. Orthogonal signal correction of near-infrared spectra. Chemometrics Intell. Lab. Syst. 1998; 44: 175-185.
14. Andersson CA. Direct orthogonalization. Chemometrics Intell. Lab. Syst. 1999; 47: 51-63.
15. Martens H, Pram Nielsen J, Balling Engelsen S. Light scattering and light absorbance separated by extended multiplicative signal correction (EMSC). Application to NIT analysis of powder mixtures. Anal. Chem. 2003; 75: 394-404.
