Gábor Péceli
2017.
Measurement Theory: Lecture 1, 08.02.2017.
0. The subject Measurement Theory is related to Data Science, which is a very hot topic for every engineering discipline. The reason is that, due to the revolution of sensors, an incredible amount of measured data is available, and these data must be processed in a clever way. During the lectures we will discuss information-processing methods to make decisions, compute estimates and model our environment. We also keep an eye on the real-time implementation aspects of these techniques when building subsystems of our CPS, IoT and Industry 4.0 applications and their adaptive operating mechanisms.
1. Reminder: Within the subject Measurement Technology (VIMIA206), which is a BSc
subject, the foundations of measurement theory were already introduced. The key concepts
and terms to remember were the following:
- Measurement and modelling
- Model-fitting
- Measurement errors (modelling, transfer- and instrument error)
- error propagation (application of the differential)
- measuring structures (serial, parallel, feedback)
- transducer errors (null-point, load, temperature, calibration, etc. errors)
- accuracy of the devices (analogue, digital; component and frequency dependence)
- Basic measuring methods (difference, direct and indirect comparison, substitution,
swapping or Gauss method)
- Evaluation of measurement sequence (calculation of measurement error, uncertainty)
- Uncertainty calculation based on GUM (Guide to the Expression of Uncertainty in
Measurement)
The above concepts and topics are prerequisites; within the subject Measurement Theory only model-fitting will be revisited.
2. Measurement procedure: part of the cognition process, which contributes to the
improvement of our a priori knowledge. The improvement can be either higher precision, or
new, additional information. Figure 1 helps the interpretation of this procedure. During the measurement procedure we would like to tackle phenomena of the real world around us. This tackling is preferably based on features which, in a sense, show some stability. Obviously, such features are also abstractions. A significant role is played by the
- state variables (x), the time-dependent changes of which can be related to energy
processes (voltage, pressure, temperature, speed, etc.)
- parameters (a), which characterize the intensity relations of the interactions, and the
- structures (S), which describe the relations of the system's components.
Figure 1: the measurement procedure — real-world features reach the observer through a (noisy) channel, which introduces uncertainty.
The Space of reality is such an abstraction where the values of the investigated features correspond to one point of the space. Before the measurement, the coordinates of this point are unknown. During measurements we try to determine/measure these coordinates; however, as is well known, this can be performed only approximately, due to measurement errors. A further difficulty is the limited availability of the feature to be measured, e.g. it may be impossible to measure directly where the feature is. Therefore some kind of mapping is unavoidable. This mapping is called observation. The path between the feature to be measured and the observation is called the measuring channel.
Observation in case of deterministic measuring channel: Figure 2 presents a discrete-time
observer as an illustrative example. The equations describing the reality and the observation
are as follows:
x(n+1) = A x(n) ,  (1)
y(n) = C x(n) ,  (2)
Figure 2: discrete-time observer. Reality: x(n+1) = A x(n), y(n) = C x(n); observer copy: x̂(n+1) = A x̂(n) + G e(n), ŷ(n) = C x̂(n); correction signal e(n) = y(n) − ŷ(n).
where x(n) is an N-dimensional state vector, A is an N×N state-transition matrix, y(n) is an M-dimensional observation vector, and the observation matrix C is M×N
dimensional. Our goal is the estimation of the state vector x(n). This is fulfilled by the observer,
which tries to provide a copy of the reality in such a way that a correction/learning/adaptive
mechanism forces the observer to follow the reality. After convergence, the result of the
measuring procedure, x̂(n), can be read from the observer. Within the observer, we apply the
copy of the state transition and the observation equations:
x̂(n+1) = A x̂(n) + G e(n) ,  (3)
ŷ(n) = C x̂(n) ,  (4)
where the correction matrix G is of N×M dimension, and e(n) = y(n) − ŷ(n). The matrix G is set in such a way that x̂(n) → x(n). The difference of (1) and (3):
x(n+1) − x̂(n+1) = A x(n) − A x̂(n) − G e(n) = (A − GC)(x(n) − x̂(n)) .  (5)
Introducing ε(n+1) = x(n+1) − x̂(n+1) and the notation F = A − GC, the state-transition equation of the so-called error system is:
ε(n+1) = F ε(n) .  (6)
The correction matrix G should be designed in such a way that ε(n) → 0 as n → ∞, in favor of which preferably ‖ε(n+1)‖ < ‖ε(n)‖ for all n, i.e. F reduces the length of the vector ε(n) in every step, which means it is contractive.
Remarks:
1. The inequality for the error vector ε(n) is to be interpreted for its length (norm), and in the scalar case for its absolute value.
2. Obviously, it is not necessary to require monotonicity of the reduction process to force the error to zero; only the stability of the error system is needed, i.e. its convergence to zero in case of zero excitation. This can be interpreted in such a way that the error system dissipates its internal energy in order to reach the stable state. If dissipation is present in every step, then the reduction of the error-vector length will be monotonic.
Special cases:
1. F = A − GC = 0. In this case G = AC⁻¹. This is possible if C is square, i.e. the observation has as many components as the state vector. In this case it is obvious that we can calculate the state vector in one step, without iteration. This means that the observer, and within the observer the copy of the system, will follow the observed (physical) system after a single step.
2. F^N = (A − GC)^N = 0. In this case the error system converges in N steps:
x(N) − x̂(N) = (A − GC)^N (x(0) − x̂(0)) = 0  (7)
Matrices F having the property F^N = 0 are the so-called non-derogatory nilpotent matrices. The eigenvalues of such matrices are all zero. Systems which can be characterized by such state-transition matrices produce finite impulse response (these are the so-called FIR systems), since the initial error disappears in N steps. (Comment: if F^M = 0, where M < N, then F is a derogatory nilpotent matrix. In this case convergence is achieved in less than N steps.)
3. If F^N = (A − GC)^N ≠ 0, then for a stable error system the length of its state vector will decay exponentially. Such an error system is stable if all its eigenvalues are within the unit circle. Systems which can be characterized by such state-transition matrices produce infinite impulse response (these are the so-called IIR systems), since the initial error disappears only in an infinite number of steps.
Examples:
Example 1: Given A = [[1, 0], [0, −1]] and C = [[1, 0], [0, 1]]. How to set G? G = AC⁻¹ = A = [[1, 0], [0, −1]].
Example 2: Given A = [[1, 0], [0, −1]] and C = [1 1]. How to set G = [g0, g1]ᵀ?
GC = [[g0, g0], [g1, g1]], so A − GC = [[1−g0, −g0], [−g1, −1−g1]]. G is given by (A − GC)² = 0:
(A − GC)² = [[1 − 2g0 + g0² + g0g1, g0² + g0g1], [g1² + g0g1, 1 + 2g1 + g1² + g0g1]] = [[0, 0], [0, 0]] .
By substituting the expressions of the minor diagonal (g0² + g0g1 = 0 and g1² + g0g1 = 0) into the expressions of the main diagonal we receive 1 − 2g0 = 0 and 1 + 2g1 = 0, from which g0 = 0.5 and g1 = −0.5.
Checking by replacement:
[[0.5, −0.5], [0.5, −0.5]] · [[0.5, −0.5], [0.5, −0.5]] = [[0, 0], [0, 0]] .
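The deadbeat behavior of Example 2 can also be checked numerically. Below is a minimal sketch (NumPy) using the matrices of the example; the true initial state chosen here is an arbitrary assumption:

```python
import numpy as np

# Deadbeat observer sketch for Example 2:
# A = [[1, 0], [0, -1]], C = [1 1], G = [0.5, -0.5]^T, so F = A - GC is nilpotent.
A = np.array([[1.0, 0.0], [0.0, -1.0]])
C = np.array([[1.0, 1.0]])           # 1x2 observation matrix
G = np.array([[0.5], [-0.5]])        # 2x1 correction matrix

F = A - G @ C
assert np.allclose(F @ F, 0)         # F^2 = 0: the error system is nilpotent

x = np.array([[2.0], [-3.0]])        # true (unknown) state, illustrative values
xh = np.zeros((2, 1))                # observer state, arbitrary initial guess

for n in range(3):
    e = C @ x - C @ xh               # e(n) = y(n) - y^(n)
    x, xh = A @ x, A @ xh + G @ e    # reality (1) and observer (3)
    print(n, np.linalg.norm(x - xh))
# after N = 2 steps the observer state coincides with the true state
```

Note that the error norm is allowed to grow in the first step (convergence need not be monotonic, cf. Remark 2), yet it is exactly zero after two steps.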
Example 3: Let's calculate the eigenvalues of A − GC using the results of Example 2:
det(λI − (A − GC)) = det [[λ − 0.5, 0.5], [−0.5, λ + 0.5]] = (λ − 0.5)(λ + 0.5) + 0.25 = λ² − 0.25 + 0.25 = λ² = 0 .
Both eigenvalues are zero.
Comments:
1. This attribute is universally valid for systems capable of converging in a limited number of steps, starting from an arbitrary initial state.
2. The transfer function of such systems is such a degenerate rational function, which has poles only in the origin:
H(z) = a1 z⁻¹ + a2 z⁻² + … + aN z⁻N = (aN + a_{N−1} z + a_{N−2} z² + … + a1 z^{N−1}) / z^N  (8)
These are often called Finite Impulse Response (FIR) filters. The time-domain equivalent of (8):
y(n) = a1 x(n−1) + a2 x(n−2) + … + aN x(n−N) ,  (9)
where, for the real-time computability of (9), only previous samples of x(n) are considered.
3. In Example 3 the condition valid for the eigenvalues can also be used for the determination of g0 and g1:
det(λI − (A − GC)) = det [[λ − 1 + g0, g0], [g1, λ + 1 + g1]] = λ² + λ(g0 + g1) + (g0 − g1 − 1) = λ² = 0 .
From this: g0 + g1 = 0 and g0 − g1 = 1, hence g0 = 0.5 and g1 = −0.5.
Observation in case of a noisy channel: In this case our expectation is not ε(n) → 0 as n → ∞, but E[ε(n) εᵀ(n)] → min. With this definition, the state equation (6) of the error system is replaced by:
E[ε(n+1) εᵀ(n+1)] = F E[ε(n) εᵀ(n)] Fᵀ  (10)
This error-covariance matrix plays a central role in the operation of the Kalman predictor and filter. (R. E. Kalman was a famous scientist of Hungarian origin.)
Remarks:
1. Both models seen in Figure 2 can have a common additional input excitation signal. Since these models are linear, due to the superposition principle the observer will converge in this case as well.
2. The observer in Figure 2 is called a Luenberger observer. According to Luenberger, almost any system is an observer. The condition for being able to serve as an observer is simply that the observer should be faster than the observed system; otherwise it is not possible to follow the changes of the observed system properly.
3. In the case of an impedance-measuring bridge, the unknown branch, containing the impedance to be measured, is the physical model of the reality, while the branch containing the balancing components corresponds to the tunable model within the observer. The tuning of the bridge, i.e. the observer mechanism, is performed by the operator based on the difference of the voltages at the dividing points of the branches.
Modelling noisy channels: To describe random events we use random variables and stochastic processes. The random variable x(ξ) is such a rule or function which assigns real numbers to the events of the random event space. If we make a histogram of the occurrence of the samples
of a random variable, then we get a statistical characterization. This leads to the probability density function (PDF) f(u), see Figure 3,
Figure 3: histogram of the (relative) occurrence of the samples, leading to the PDF f(u).
the integral of which (the distribution function) tells what is the probability that the random variable is not larger than u. The stochastic process x(t, ξ) is such a function which assigns a time-function to the events of the random event space, see Figure 4. The values of these functions at a given time instant (e.g. at t0) represent a random variable.
Figure 4: a stochastic process x(t, ξ) — realizations x(t, ξi); the values x(t0, ξ) taken at a fixed time instant t0 form a random variable.
[Figure: source → noisy channel → observation z → decision]
Figure 6: conditional densities f(z|H0) and f(z|H1) along the z axis; the decision threshold separates the acceptance regions Z0 and Z1; the tail areas give the miss probability PM and the false-alarm probability PF.
Please note that in the case of the first two terms the outcome corresponds to hypothesis H0
while in the case of the second two terms to H1. The minimum of (12) can be obtained by
selecting a proper threshold. Let's denote the range of acceptance by Zi. For these ranges let's calculate the probabilities of occurrence:
P(Hi|Hj) = ∫_{Zi} f(z|Hj) dz ,  (13)
Since the two acceptance ranges cover the complete event space, the integrals of the density
functions above one of the acceptance ranges can be replaced with one minus the integrals
above the other acceptance range. By replacing the integrals above Z1 with integrals above Z0,
(14) can be written in the following form:
R = C10 P0 + C11 P1 + ∫_{Z0} [ P0 (C00 − C10) f(z|H0) + P1 (C01 − C11) f(z|H1) ] dz .  (15)
Let's suppose that C10 > C00 and C01 > C11, and consider the decision threshold value as an independent input variable of the single-variable integral (15). The minimum of (15) is reached if the decision threshold along the z axis is set to the value where
P0 (C10 − C00) f(z|H0) = P1 (C01 − C11) f(z|H1)  (16)
holds. If we deviate from the threshold value which meets (16), either to the left or to the right,
the mean risk R, i.e. the value of (15), will increase. This is illustrated by Figure 7, which demonstrates how the value of (15) behaves if the threshold value is shifted to the right or to the left of the optimum setting.
Figure 7: the weighted densities P0(C10 − C00) f(z|H0) and P1(C01 − C11) f(z|H1); their intersection gives the optimal decision threshold between Z0 and Z1.
Rewriting (16):
f(z|H1) / f(z|H0) = P0 (C10 − C00) / (P1 (C01 − C11)) ,  (17)
i.e. the ratio of the two conditional density functions is an a priori given constant value; denote this threshold by η. (In (17) z takes the value of the decision threshold.) If we put the recently measured value into the equation
Λ(z) = f(z|H1) / f(z|H0) ,  (18)
which is the so-called likelihood-ratio function, and if Λ(z) > η, then the decision is H1; if Λ(z) < η, then the decision is H0. Written in a concise way:
Λ(z) ≷ η  (H1 if greater, H0 if smaller)  (19)
Measurement Theory: Lecture 2, 15.02.2017.
can be made using the a posteriori probabilities. This special case is called maximum a
posteriori (MAP) decision.
2. Instead of (19), the test λ(z) = ln Λ(z) ≷ ln η (H1 if greater, H0 if smaller), the so-called log-likelihood-ratio test, can also be used.
Example 1: Detection of a constant signal through a noisy channel: Is the signal present or not? Let's suppose that the observations are independent random variables with Gaussian distribution, zero mean and variance σn². Hypothesis H0 is that the observation takes the value z_k = n_k, i.e. the signal is absent, only the actual sample of the noise is observed. Hypothesis H1 is that the observation takes the value z_k = a + n_k, i.e. the signal is present, and the sum of the signal and the noise is observed. Index k runs through 0, 1, …, N−1, i.e. we consider N samples simultaneously. For the decision the joint conditional probability density functions will be applied. In case of a single observation the conditional density functions are as follows:
f(z|H0) = (1/(√(2π) σn)) exp(−z_k²/(2σn²)) ,  and  f(z|H1) = (1/(√(2π) σn)) exp(−(z_k − a)²/(2σn²))  (20)
Since the observations are independent, the joint density function of the N observations is the product of the single density functions. The actual form of (19):
Λ(z) = Π_{k=0}^{N−1} f(z_k|H1)/f(z_k|H0) = Π_{k=0}^{N−1} exp(−(z_k − a)²/(2σn²)) / exp(−z_k²/(2σn²)) ≷ η ,  (21)
(H1 if greater, H0 if smaller), or the log-likelihood ratio:
ln Λ(z) = λ(z) = Σ_{k=0}^{N−1} [ −(z_k − a)²/(2σn²) + z_k²/(2σn²) ] = (a/σn²) Σ_{k=0}^{N−1} z_k − N a²/(2σn²) ≷ ln η  (22)
After reordering:
(1/N) Σ_{k=0}^{N−1} z_k ≷ (σn²/(N a)) ln η + a/2  (23)
According to (23), for the test the mean of the observed values is to be calculated and compared to a threshold. The block diagram of the decision-making device can be seen in Figure 8.
Figure 8: z_k → averager (1/N)Σ(·) → threshold detector → decision: signal present / signal absent.
Remarks:
1. If η = 1, then ln η = 0, therefore (1/N) Σ_{k=0}^{N−1} z_k ≷ a/2, i.e. the decision threshold is the half of the constant signal value. This is achieved e.g. if P0 = P1 = 0.5, C00 = C11 and C10 = C01.
2. If η < 1, then ln η < 0; in this case the decision threshold will be smaller than the half of the constant signal value. This can be achieved e.g. if P0 < P1, C00 = C11 and C10 = C01. In this case the probability of the occurrence of the signal is higher, therefore the threshold will be lower; otherwise it will be higher.
3. Note: what is the effect of the noise variance, of the number of observations, and of the signal level itself in expression (23)?
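The decision rule (23) is easy to try out in a short Monte-Carlo sketch. The values of a, σn, N and the trial count below are illustrative assumptions, with η = 1 so that the threshold is a/2:

```python
import numpy as np

# Monte-Carlo sketch of the decision rule (23) with eta = 1, i.e. threshold a/2.
# a, sigma_n, N and the trial count are illustrative assumptions.
rng = np.random.default_rng(0)
a, sigma_n, N, trials = 1.0, 1.0, 25, 2000

def detect(z):                       # decide "signal present" iff the sample mean > a/2
    return z.mean() > a / 2

miss = sum(not detect(a + sigma_n * rng.standard_normal(N)) for _ in range(trials)) / trials
false_alarm = sum(detect(sigma_n * rng.standard_normal(N)) for _ in range(trials)) / trials
print(miss, false_alarm)             # both error rates stay small at this SNR
```

Increasing N or a, or decreasing σn, shrinks both error rates, in line with Remark 3.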
Example 2: Detection of a changing-magnitude signal through a noisy channel: Is the signal present or not? Let's suppose that the observations are independent random variables with Gaussian distribution, zero mean and variance σn². Hypothesis H0 is that the observation takes the value z_k = n_k, i.e. the signal is absent, only the actual sample of the noise is observed. Hypothesis H1 is that the observation takes the value z_k = a_k + n_k, i.e. the signal is present, and the sum of the signal and the noise is observed. Index k runs through 0, 1, …, N−1, i.e. we consider N samples simultaneously. For the decision the joint conditional probability density functions will be applied. In case of a single observation the conditional density functions are as follows:
f(z|H0) = (1/(√(2π) σn)) exp(−z_k²/(2σn²)) ,  and  f(z|H1) = (1/(√(2π) σn)) exp(−(z_k − a_k)²/(2σn²))  (24)
Since the observations are independent, the joint density function of the N observations is the product of the single density functions. The actual form of (19):
Λ(z) = Π_{k=0}^{N−1} f(z_k|H1)/f(z_k|H0) = Π_{k=0}^{N−1} exp(−(z_k − a_k)²/(2σn²)) / exp(−z_k²/(2σn²)) ≷ η ,  (25)
(H1 if greater, H0 if smaller), or the log-likelihood ratio:
ln Λ(z) = λ(z) = Σ_{k=0}^{N−1} [ −(z_k − a_k)²/(2σn²) + z_k²/(2σn²) ] = (1/σn²) Σ_{k=0}^{N−1} a_k z_k − (1/(2σn²)) Σ_{k=0}^{N−1} a_k² ≷ ln η  (26)
After reordering:
(1/N) Σ_{k=0}^{N−1} a_k z_k ≷ (σn²/N) ln η + (1/(2N)) Σ_{k=0}^{N−1} a_k²  (27)
According to (27), for the test the weighted mean of the observed values is to be calculated and compared to a threshold. The a priori known signal samples are the weights. The block diagram of the decision-making device can be seen in Figure 9.
Figure 9: z_k → multiplication by the known samples a_k → averager (1/N)Σ(·) → threshold detector → decision: signal present / signal absent.
Remarks:
1. Note that (23) can be easily derived from (27), if all the signal samples are equal.
2. The signal-weighting described by (27) is called matched filtering.
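A sketch of the matched-filter test (27); the signal shape s used for a_k, the noise level and η are illustrative assumptions:

```python
import numpy as np

# Matched-filter detector sketch for (27); the signal shape, noise level and
# eta below are illustrative assumptions, not values from the lecture.
rng = np.random.default_rng(1)
N, sigma_n, eta, trials = 32, 1.0, 1.0, 1000
s = np.sin(2 * np.pi * np.arange(N) / 8)           # known signal samples a_k

threshold = sigma_n**2 / N * np.log(eta) + (s @ s) / (2 * N)

def detect(z):                                     # H1 iff (1/N) sum a_k z_k > threshold
    return (s @ z) / N > threshold

miss = sum(not detect(s + sigma_n * rng.standard_normal(N)) for _ in range(trials)) / trials
fa = sum(detect(sigma_n * rng.standard_normal(N)) for _ in range(trials)) / trials
print(miss, fa)
```

With all a_k equal, the weighted mean reduces to the plain averager of Example 1, as Remark 1 states.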
Example 3: Detection of a random-magnitude signal through a noisy channel: Is the signal present or not? Let's suppose that both the signal and the noise are discrete-time, stationary white stochastic processes with Gaussian distribution, zero mean and variances σa² and σn², respectively. Hypothesis H0 is that the observation takes the value z_k = n_k, i.e. the signal is absent, only the actual sample of the noise is observed. Hypothesis H1 is that the observation takes the value z_k = a_k + n_k, i.e. the signal is present, and the sum of the signal and the noise is observed. Index k runs through 0, 1, …, N−1, i.e. we consider N samples simultaneously. For the decision the joint conditional probability density functions will be applied. In case of a single observation the conditional density functions are as follows:
f(z|H0) = (1/(√(2π) σn)) exp(−z_k²/(2σn²)) ,  and  f(z|H1) = (1/(√(2π(σa² + σn²)))) exp(−z_k²/(2(σa² + σn²)))  (28)
Since the observations are independent, the joint density function of the N observations is the product of the single density functions. The actual form of (19):
Λ(z) = Π_{k=0}^{N−1} f(z_k|H1)/f(z_k|H0) = Π_{k=0}^{N−1} (σn/√(σa² + σn²)) exp(−z_k²/(2(σa² + σn²))) / exp(−z_k²/(2σn²)) ≷ η ,
(H1 if greater, H0 if smaller), or the log-likelihood ratio:
ln Λ(z) = λ(z) = Σ_{k=0}^{N−1} [ −z_k²/(2(σa² + σn²)) + z_k²/(2σn²) ] − (N/2) ln((σa² + σn²)/σn²) = (σa²/(2σn²(σa² + σn²))) Σ_{k=0}^{N−1} z_k² − (N/2) ln((σa² + σn²)/σn²) ≷ ln η  (29)
After reordering:
(1/N) Σ_{k=0}^{N−1} z_k² ≷ (2σn²(σa² + σn²)/σa²) [ (1/N) ln η + (1/2) ln((σa² + σn²)/σn²) ]  (30)
According to (30), for the test the mean of the squared sample values is to be calculated and compared to a threshold. The block diagram of the decision-making device can be seen in Figure 10.
Remarks:
1. If η = 1 and σa² >> σn², then (1/N) Σ_{k=0}^{N−1} z_k² ≷ 2σn² ln(σa/σn).
2. An alternative form of (30): (1/N) Σ_{k=0}^{N−1} z_k² ≷ 2σn² (1 + σn²/σa²) [ (1/2) ln(1 + σa²/σn²) + (1/N) ln η ], where the effect of …
Figure 10: z_k → squarer (·)² → averager (1/N)Σ(·) → threshold detector → decision: signal present / signal absent.
Figure 11: conditional densities f(z|H0) and f(z|H1); the acceptance region Z0 lies between the two decision thresholds, with the H1 region Z1 on both sides.
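The energy detector (30) can be sketched the same way as the previous detectors; σa², σn², N and η below are illustrative assumptions:

```python
import numpy as np

# Energy-detector sketch for (30); sigma_a^2, sigma_n^2, N and eta are
# illustrative assumptions, not values from the lecture.
rng = np.random.default_rng(2)
sa2, sn2, N, eta, trials = 4.0, 1.0, 64, 1.0, 1000

thr = (2 * sn2 * (sa2 + sn2) / sa2) * (0.5 * np.log((sa2 + sn2) / sn2) + np.log(eta) / N)

def detect(z):                      # H1 iff the mean of the squared samples exceeds thr
    return np.mean(z**2) > thr

miss = sum(not detect(rng.normal(0, np.sqrt(sa2 + sn2), N)) for _ in range(trials)) / trials
fa = sum(detect(rng.normal(0, np.sqrt(sn2), N)) for _ in range(trials)) / trials
print(miss, fa)
```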
Example 4: Bayesian decision with discrete probabilities. A student needs to make a decision on which course to take, based only on his/her first lecture. Define 3 categories of courses: good, fair, bad. From his previous experience, he knows:
P(good) = 0.2, P(fair) = 0.4 and P(bad) = 0.4.
These are a priori probabilities. The student also knows the class conditionals: how much the impressions from the lectures coincide with the categories. These are the conditional probabilities which correspond to the channel characteristics:
P(interesting|good) = 0.8, P(interesting|fair) = 0.5, P(interesting|bad) = 0.1,
P(boring|good) = 0.2, P(boring|fair) = 0.5, P(boring|bad) = 0.9.
The cost/loss/risk function values:
C(taking|good) = 0, C(taking|fair) = 5, C(taking|bad) = 10,
C(not_taking|good) = 20, C(not_taking|fair) = 5, C(not_taking|bad) = 0.
The student wants to make an optimal decision; therefore he/she needs to minimize the conditional risk. (The condition is the impression obtained at the first lecture.) The risk values are as follows:
R(taking|interesting), R(taking|boring),
R(not_taking|interesting), R(not_taking|boring).
Let's calculate the first value: R(taking|interesting) =
= C(taking|good) P(good|interesting) + C(taking|fair) P(fair|interesting) + C(taking|bad) P(bad|interesting)
The conditional probabilities here are the so-called a posteriori probabilities, which can be calculated using the Bayes theorem:
P(good|interesting) = P(interesting|good) P(good) / P(interesting) ,
P(fair|interesting) = P(interesting|fair) P(fair) / P(interesting) ,
Here the factors of the numerator are known; only P(interesting) is to be calculated:
P(interesting) = P(interesting|good) P(good) + P(interesting|fair) P(fair) + P(interesting|bad) P(bad) = 0.8·0.2 + 0.5·0.4 + 0.1·0.4 = 0.4
Thus:
P(boring) = 1 − 0.4 = 0.6 , P(good|interesting) = 0.4 , P(fair|interesting) = 0.5 , P(bad|interesting) = 0.1 .
If after the first lecture the impression of the student is that the lecture is interesting, then he/she will compare the risk values R(taking|interesting) and R(not_taking|interesting):
R(taking|interesting) = 0·0.4 + 5·0.5 + 10·0.1 = 3.5 ,
R(not_taking|interesting) = 20·0.4 + 5·0.5 + 0·0.1 = 10.5 ,
i.e. the student will take the course, since this decision has the lower risk/cost. Calculate the risk/cost values also for the case when the experience with the first lecture is that it is boring.
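The numbers of Example 4 can be reproduced in a few lines; the dictionaries below simply encode the priors, channel conditionals and costs given above (a sketch):

```python
# Example 4 recomputed: a posteriori probabilities via Bayes' rule, then the
# conditional risks for the two possible decisions.
priors = {"good": 0.2, "fair": 0.4, "bad": 0.4}
p_interesting = {"good": 0.8, "fair": 0.5, "bad": 0.1}   # P(interesting | category)
cost = {"taking": {"good": 0, "fair": 5, "bad": 10},
        "not_taking": {"good": 20, "fair": 5, "bad": 0}}

evidence = sum(p_interesting[c] * priors[c] for c in priors)          # P(interesting) = 0.4
post = {c: p_interesting[c] * priors[c] / evidence for c in priors}   # a posteriori probabilities

risk = {d: sum(cost[d][c] * post[c] for c in priors) for d in cost}
print(post, risk)   # risk: taking 3.5, not_taking 10.5 -> take the course
```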
4. Estimation Theory Basics (Main reference: Fundamentals of Statistical Signal Processing: Estimation Theory, by S. M. Kay, Prentice-Hall, 1993, and the slides Estimation Theory of Alireza Karimi, Laboratoire d'Automatique, MEC2 397, Spring 2011.)
The objective of parameter estimation is to determine an estimator â of an unknown parameter (vector) a. Figure 12 illustrates this objective by indicating what kind of information might help to solve this problem if a proves to be a random value.
Figure 12: the unknown parameter a with a priori density f(a), observed through the channel f(z|a); the observation z and the a posteriori density f(a|z).
Obviously, it might happen that the unknown parameter is a deterministic value; then statistical characterizations of the parameter have no meaning. In the following we will first present methods based on statistical characterization, and afterwards we will continue with the deterministic solutions.
For the following, it might be useful to recall the concepts of: (1) the density function and its relation to measured data; (2) the conditional density function, used to characterize measuring channels; and (3) the expected value, and its calculation based on density functions. To characterize the estimator the following measures are useful (see Figure 13):
Figure 13: the conditional density f(â|a) of the estimator around the true parameter value a, together with f(a|z).
1. Conditional expectation/expected value: E{â|a} = ∫ â f(â|a) dâ  (31)
2. Conditional covariance matrix: cov{â, â|a} = E{(â − E(â|a))(â − E(â|a))ᵀ | a}  (32)
3. Bias: b(a) = E(â|a) − a  (33)
4. Mean Square Error (MSE): E{(â − a)(â − a)ᵀ | a} = cov{â, â|a} + b(a) bᵀ(a)  (34)
Comment: If the probability density function f(a) is known, then we can also calculate the unconditional counterparts of (31)–(34).
I. Bayesian estimation: Let's suppose that the density function f(a) of the observed parameter and the channel characteristics f(z|a) are known. Having the observations, and using the Bayes rule, the so-called a posteriori density function f(a|z) can be derived:
f(a|z) = f(z|a) f(a) / f(z) , where  (37)
f(z) = ∫ f(a) f(z|a) da .  (38)
The idea behind the Bayesian estimator is, that the observations are performed on a given
realization of the parameter; therefore, this additional information might help to sharpen the
estimation.
Figure 14: the a priori density f(a) and the narrower a posteriori density f(a|z).
Figure 14 shows that if the observations carry information about the parameter, then the a posteriori density function will span a narrower range of the possible parameter values. The best estimator is calculated using the a posteriori density function. The concept of "best" is enforced by risk/cost functions:
R(â, a) = E{C(â, a)} = ∫ C(â, a) f(a|z) da ,  (39)
where C(â, a) is the cost function, the widely used alternatives of which are illustrated in Figure 15:
I. Quadratic: C(â, a) = Σ_{i=1}^{m} (â_i − a_i)² = (â − a)ᵀ(â − a)  (41)
II. Absolute: C(â, a) = Σ_{i=1}^{m} |â_i − a_i|  (42)
III. Hit-or-Miss: C(â, a) = 0 if |â_i − a_i| < Δ/2 for all i; 1 otherwise  (43)
Figure 15
Minimum Mean Square Estimator (MMSE):
R(â, a) = E{(â − a)ᵀ(â − a)} = ∫ (â − a)ᵀ(â − a) f(a|z) da  (44)
Setting the derivative with respect to â equal to zero:
∂R/∂â = ∫ 2(â − a) f(a|z) da = 0 .  (45)
Since ∂[(â − a)ᵀ(â − a)]/∂â = 2(â − a), â can be placed in front of the integral in (45), and the integral of the density function is 1, therefore
â_MS = ∫ a f(a|z) da ,  (46)
i.e. using the quadratic cost function, the best estimate is the a posteriori mean value.
Minimum Mean Absolute Error Estimator (the scalar case only):
R(â, a) = E{|â − a|} = ∫_{−∞}^{â} (â − a) f(a|z) da + ∫_{â}^{∞} (a − â) f(a|z) da  (47)
By setting the derivative of the risk function equal to zero and using Leibniz's rule:
∫_{−∞}^{â_ABS} f(a|z) da = ∫_{â_ABS}^{∞} f(a|z) da ,  (49)
i.e.
â_ABS = the median of f(a|z) .  (50)
Median: area to the left = area to the right.
Maximum a posteriori (MAP) estimator:
R(â, a) = ∫_{−∞}^{â−Δ/2} f(a|z) da + ∫_{â+Δ/2}^{∞} f(a|z) da = 1 − ∫_{â−Δ/2}^{â+Δ/2} f(a|z) da .  (51)
If Δ is arbitrarily small, but Δ ≠ 0, the optimal estimate is the location of the maximum of f(a|z), i.e. the mode of the a posteriori density function:
â_MAP = location of the maximum of f(a|z) .  (52)
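The three Bayesian point estimates (46), (50) and (52) can be compared on a discretized posterior. A sketch, where the asymmetric density f(a|z) is an illustrative assumption, not taken from the lecture:

```python
import numpy as np

# MS, ABS and MAP estimates computed from a discretized a posteriori density.
a = np.linspace(-5.0, 5.0, 2001)
da = a[1] - a[0]
f = 0.7 * np.exp(-0.5 * (a - 0.0) ** 2) + 0.3 * np.exp(-0.5 * ((a - 2.0) / 0.5) ** 2)
f /= f.sum() * da                              # normalize to a valid density

a_ms = (a * f).sum() * da                      # (46): posterior mean   -> MMSE
cdf = np.cumsum(f) * da
a_abs = a[np.searchsorted(cdf, 0.5)]           # (50): posterior median -> min. mean abs. error
a_map = a[np.argmax(f)]                        # (52): posterior mode   -> MAP
print(a_ms, a_abs, a_map)                      # three different values for a skewed posterior
```

For a symmetric, unimodal posterior the three estimates would coincide; the skewed mixture above separates them.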
Remarks:
1. The Bayesian estimations are always performed using a posteriori density functions.
2. The MS estimation is linear in the following sense:
If b = A a + c, then b̂_MS = A â_MS + c; furthermore E{a + b|z} = E{a|z} + E{b|z} = â_MS + b̂_MS.
Bayesian estimators in case of Gaussian distributions: Let's suppose that the unknown parameter a and the observation noise have Gaussian distributions. Given E{a} = m_a, cov{a, a} = Σ_aa, E{n} = 0, cov{n, n} = Σ_nn. If everything has Gaussian distribution, then the moments of the a posteriori density function can be given explicitly.
Let's suppose that the noisy observation can be described by the following expression:
z = U a + n ,  (53)
where dim a = p, dim z = q, dim U = q×p. U stands for the so-called observation matrix. The explicit formula of the a posteriori mean value is:
â_MS = E{a|z} = m_a + [Uᵀ Σ_nn⁻¹ U + Σ_aa⁻¹]⁻¹ Uᵀ Σ_nn⁻¹ (z − U m_a) ,  (54)
where the first term is the a priori knowledge and the second term is a correction as a function of z − U m_a.
Example: Resistance measurement. The observed values are z_k = a + n_k, where k = 0, 1, …, N−1, a is the unknown parameter (unknown resistance), and n_k is the sample of the additive noise. Let's suppose that both the resistance of the resistor (which is one element of a large set of resistors) and the observation noise are Gaussian random variables. The observed samples of the noise are uncorrelated. Let's suppose that m_a and σa² are known, the mean value of the noise is m_n = 0, and cov{n_k, n_j} = σn² δ_kj, where δ_kj = 1 if k = j and 0 if k ≠ j. In vector form: z = U a + n, zᵀ = [z_0, z_1, …, z_{N−1}], nᵀ = [n_0, n_1, …, n_{N−1}], Uᵀ = [1 1 … 1], Σ_nn = σn² I. Using (54) we have:
â_MS = m_a + [N/σn² + 1/σa²]⁻¹ (1/σn²) Σ_{k=0}^{N−1} (z_k − m_a) = m_a + ((σa² N/σn²)/(1 + σa² N/σn²)) ((1/N) Σ_{k=0}^{N−1} z_k − m_a)  (55)
Remarks:
1. Based on (55) the estimation can be interpreted in a prediction–correction form, the first term of which is a prediction based on a priori knowledge, which is completed by a correction term introducing new information based on measurements. This correction is proportional to the difference between the mean of the measured values and the expected value of the parameter. The weighting factor, depending on the value of N σa²/σn², is somewhere between zero and one. If σa << σn, then â_MS ≈ m_a; if σa >> σn or N → ∞, then â_MS ≈ (1/N) Σ_{k=0}^{N−1} z_k.
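A short sketch of (55); m_a, σa², σn², N and the random seed below are illustrative assumptions:

```python
import numpy as np

# Sketch of the Bayesian estimate (55): a priori prediction m_a plus a weighted
# correction toward the sample mean. All numeric values are illustrative.
rng = np.random.default_rng(3)
m_a, sa2, sn2, N = 100.0, 4.0, 25.0, 10
a_true = rng.normal(m_a, np.sqrt(sa2))                  # one realization of the resistance
z = a_true + rng.normal(0, np.sqrt(sn2), N)             # noisy observations

w = (sa2 * N / sn2) / (1 + sa2 * N / sn2)               # weighting factor in [0, 1]
a_ms = m_a + w * (z.mean() - m_a)                       # (55)
print(a_true, z.mean(), a_ms)
# a_ms lies between the a priori mean m_a and the sample mean, as Remark 1 states
```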
Measurement Theory: Lecture 3, 22.02.2017.
Here the observations are z_k = a s_k + n_k, k = 0, 1, …, N−1, with known signal samples s_k, and
f(a) = (1/(√(2π) σa)) exp(−(a − m_a)²/(2σa²)) ,  f(z|a) = (1/((2π)^{N/2} σn^N)) exp(−Σ_{k=0}^{N−1} (z_k − a s_k)²/(2σn²)) ,
and the derivatives:
∂ln f(a)/∂a = −(a − m_a)/σa² ,  ∂ln f(z|a)/∂a = (1/σn²) Σ_{k=0}^{N−1} s_k (z_k − a s_k) .  (59)
The sum of the two derivatives, taken at a = â_MAP, must be zero:
−(â_MAP − m_a)/σa² + (1/σn²) Σ_{k=0}^{N−1} s_k (z_k − â_MAP s_k) = 0 , from where â_MAP = (m_a + (σa²/σn²) Σ_{k=0}^{N−1} s_k z_k) / (1 + (σa²/σn²) Σ_{k=0}^{N−1} s_k²) .  (60)
Remarks:
1. Using the MAP estimator the application of (54) could be avoided.
2. The variance of the estimator:
var{ã} = σa² / (1 + (σa²/σn²) Σ_{k=0}^{N−1} s_k²) ,  (61)
3. If s_k = 1 for all k, then we get the result of the previous example.
4. Here we can also identify the matched filter mentioned in Example 2 of the chapter on
decision theory.
5. Obviously, â_MAP = â_MS.
II. Maximum likelihood (ML) estimation (a is stochastic): The a priori probability density function of the value to be measured is unknown. In this case we suppose that the unknown a priori density function spreads widely; therefore the a posteriori density function will coincide with the channel characteristics. The optimal estimate will be the location where the channel characteristics takes its maximum value:
∂f(z|a)/∂a |_{a = â_ML} = 0 , or ∂ln f(z|a)/∂a |_{a = â_ML} = 0  (62)
Remark: Figure 16 illustrates the situation from the viewpoint of expression (37): the numerator of the expression producing the a posteriori density function is the product of the two functions indicated in the figure. Visibly, the location of the maximum can be given by (62).
Figure 16: the channel characteristics f(z|a) and the widely spread a priori density f(a); â_ML is the location of the maximum of f(z|a).
III. Gauss–Markov (GM) estimation: Special case of the maximum likelihood estimation, where the observation noise is Gaussian and the observation is modelled by a linear equation. We take N-dimensional observations; n stands for the N-dimensional noise vector:
E{n} = 0 , cov{n, n} = Σ_nn , f(n) = (1/((2π)^{N/2} |Σ_nn|^{1/2})) exp(−(1/2) nᵀ Σ_nn⁻¹ n) ,  (63)
In the special case of uncorrelated noise samples (Σ_nn = σn² I) and the observation z_k = a + n_k, the channel characteristics is f(z|a) ∝ exp(−Σ_{k=0}^{N−1} (z_k − a)²/(2σn²)). The Gauss–Markov estimate of the parameter a is the location of the maximum of the channel characteristics:
∂ln f(z|a)/∂a |_{a = â_ML = â_GM} = (1/σn²) Σ_{k=0}^{N−1} [z_k − a] = 0 ;
â_ML = â_GM = (1/N) Σ_{k=0}^{N−1} z_k ,  (67)
i.e. if the observation equation is linear and the channel noise is Gaussian, then the best (Gauss–Markov) estimate is the simple mean of the samples.
IV. Minimum Variance Unbiased Estimation (MVUE): In the following, the parameter to be measured is supposed to be deterministic. The estimation problem is the following: given z = {z_k}, k = 0, 1, …, N−1, i.e. N measured values, which depend on an unknown parameter a. Let's determine an estimator of a: â = g(z_0, z_1, …, z_{N−1}), where g is some function. The first step is to find the probability density function (PDF) of the data as a function of a: f(z; a). (This PDF is the channel characteristics.)
Example 1: Consider the problem of a DC level in white Gaussian noise with one observed data point: z_0 = a + n_0, where n_0 has the PDF N(0, σ²). In this case the PDF of z_0 is:
f(z_0; a) = (1/√(2πσ²)) exp[−(1/(2σ²))(z_0 − a)²]
Example 2: Consider a data sequence that can be modeled with a linear trend in white Gaussian noise: z_k = a + b k + n_k, k = 0, 1, …, N−1. Suppose that n_k is uncorrelated with all the other samples, and its PDF is N(0, σ²). Letting θ = [a b]ᵀ and z = [z_0, z_1, …, z_{N−1}]ᵀ, the PDF is:
f(z; θ) = Π_{k=0}^{N−1} f(z_k; θ) = (1/(2πσ²)^{N/2}) exp[−(1/(2σ²)) Σ_{k=0}^{N−1} (z_k − a − b k)²]
Assessing Estimator Performance: Consider the problem of estimating a DC level A in uncorrelated noise:
z_k = A + n_k , k = 0, 1, …, N−1.
Consider the following estimators: Â1 = (1/N) Σ_{k=0}^{N−1} z_k , Â2 = z_0. Suppose that A = 1, and for a given realization Â1 = 0.95, Â2 = 0.98. Which estimator is better? An estimator is a random variable, so its performance can only be described by its PDF or statistically (e.g. by Monte-Carlo simulation).
Unbiased Estimator: An estimator that on the average yields the true value is unbiased. Mathematically: E(Â) = A, for a1 < A < a2. Let's compute the expectation of the two estimators Â1 and Â2:
E(Â1) = (1/N) Σ_{k=0}^{N−1} E(z_k) = (1/N) Σ_{k=0}^{N−1} (A + 0) = A
E(Â2) = E(z_0) = E(A + n_0) = A + 0 = A
Both estimators are unbiased. Which one is better? Let's compute the variance of the two estimators!
var(Â1) = var[(1/N) Σ_{k=0}^{N−1} z_k] = (1/N²) Σ_{k=0}^{N−1} var(z_k) = (1/N²) N σ² = σ²/N
var(Â2) = var(z_0) = σ² > var(Â1)
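The two estimators can be compared by a short Monte-Carlo sketch (illustrative values A = 1, σ = 1, N = 20):

```python
import numpy as np

# Monte-Carlo comparison of A1^ (sample mean) and A2^ (first sample):
# both are unbiased, but var(A1^) = sigma^2/N while var(A2^) = sigma^2.
rng = np.random.default_rng(4)
A, sigma, N, trials = 1.0, 1.0, 20, 5000

z = A + sigma * rng.standard_normal((trials, N))
A1 = z.mean(axis=1)                 # sample-mean estimator, one value per trial
A2 = z[:, 0]                        # first-sample estimator
print(A1.mean(), A2.mean())         # both close to A: unbiased
print(A1.var(), A2.var())           # approx sigma^2/N and sigma^2
```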
Remark: When several unbiased estimators of the same parameter from independent sets of data are available, i.e. â_0, â_1, …, â_{N−1}, a better estimator can be obtained by averaging:
â = (1/N) Σ_{k=0}^{N−1} â_k .
By increasing N the variance will decrease (as N → ∞, var(â) → 0). This is not the case for biased estimators, no matter how many estimators are averaged.
Minimum Variance Criterion: The most logical criterion is the Mean Square Error (MSE/mse):
mse(â) = E[(â − a)²]
Unfortunately, this criterion leads to unrealizable estimators (the estimator will depend on the unknown a). By introducing the expected value of the estimate:
mse(â) = E{[â − E[â] + E[â] − a]²} = E{[â − E[â] + b(a)]²} ,
where b(a) = E[â] − a is defined as the bias of the estimator. Therefore:
mse(â) = E{[â − E(â)]²} + 2 b(a) E[â − E(â)] + b²(a) = var(â) + b²(a) ,
since E[â − E(â)] = 0. Instead of minimizing the MSE we can minimize the variance of the unbiased estimators: Minimum Variance Unbiased Estimator.
Minimum Variance Unbiased Estimator, MVU Estimator: In general, the MVU estimator does not always exist. There may be no unbiased estimator, or none of the unbiased estimators has uniformly minimum variance. There is no known procedure which always leads to the MVU estimator. What can we do?
1. Determine the Cramer-Rao lower bound (CRLB) and check to see if some estimator satisfies
it.
2. Restrict to linear unbiased estimators.
V. Cramer-Rao Lower Bound (CRLB): is a lower bound on the variance of any unbiased estimator,

var(\hat{a}) \geq CRLB(a)

Note that the CRLB is a function of a. It tells us what is the best performance that can be achieved. It may lead us to compute the MVU estimator.
CRLB Theorem: Assume that the PDF p(z; a) satisfies the regularity condition:

E[\partial \ln p(z; a)/\partial a] = 0 for all a.

Then the variance of any unbiased estimator satisfies:

var(\hat{a}) \geq \left(-E\left[\frac{\partial^2 \ln p(z; a)}{\partial a^2}\right]\right)^{-1}

An unbiased estimator that attains the CRLB can be found iff:

\frac{\partial \ln p(z; a)}{\partial a} = I(a)(g(z) - a)
for some functions I(a) and g(z). The estimator is \hat{a} = g(z), and the minimum variance is 1/I(a).
Example 1: Consider the estimation of a DC level in additive white Gaussian noise based on a single measurement: z_0 = a + w_0, where the PDF of w_0 is N(0, \sigma^2):

p(z_0; a) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(z_0 - a)^2\right]

\ln p(z_0; a) = -\ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}(z_0 - a)^2

Then:

\frac{\partial \ln p(z_0; a)}{\partial a} = \frac{1}{\sigma^2}(z_0 - a), \qquad -E\left[\frac{\partial^2 \ln p(z_0; a)}{\partial a^2}\right] = \frac{1}{\sigma^2}

According to the CRLB Theorem:

var(\hat{a}) \geq \sigma^2, \qquad I(a) = \frac{1}{\sigma^2}, \qquad \hat{a} = g(z_0) = z_0.
Example 2: Consider the estimation of a DC level in additive white Gaussian noise based on multiple observations: z_n = a + w_n, n = 0, 1, ..., N-1, where the PDF of w_n is N(0, \sigma^2), and the samples are uncorrelated:

p(z; a) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(z_n - a)^2\right]

Then

\frac{\partial \ln p(z; a)}{\partial a} = \frac{\partial}{\partial a}\left[-\ln\left[(2\pi\sigma^2)^{N/2}\right] - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(z_n - a)^2\right] =

= \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(z_n - a) = \frac{N}{\sigma^2}\left(\frac{1}{N}\sum_{n=0}^{N-1} z_n - a\right)

According to the CRLB Theorem:

var(\hat{a}) \geq \frac{\sigma^2}{N}, \qquad I(a) = \frac{N}{\sigma^2}, \qquad \hat{a} = g(z) = \frac{1}{N}\sum_{n=0}^{N-1} z_n
Compute the CRLB and the MVU estimator that achieves this bound.
Step 1: Compute \ln p(z; a);
Step 2: Compute the Fisher information matrix I(a) = -E\left[\frac{\partial^2 \ln p(z; a)}{\partial a^2}\right] and the covariance matrix of \hat{a}: C(\hat{a}) = I^{-1}(a);
Step 3: Find the MVU estimator g(z) by factoring \frac{\partial \ln p(z; a)}{\partial a} = I(a)[g(z) - a].
These steps in the case of the above model (linear model with WGN):
Step 1: \ln p(z; a) = -\ln\left[(2\pi\sigma^2)^{N/2}\right] - \frac{1}{2\sigma^2}(z - Ua)^T(z - Ua).
Step 2: \frac{\partial \ln p(z; a)}{\partial a} = \frac{1}{2\sigma^2}\left[2U^T z - 2U^T U a\right] = \frac{U^T U}{\sigma^2}\left[(U^T U)^{-1}U^T z - a\right].
Then I(a) = -E\left[\frac{\partial^2 \ln p(z; a)}{\partial a^2}\right] = \frac{U^T U}{\sigma^2}.
I.e. for a linear model with WGN the MVU estimator is: \hat{a} = (U^T U)^{-1}U^T z. This estimator is efficient and attains the CRLB. It is unbiased, which can be seen easily from

E(\hat{a}) = (U^T U)^{-1}U^T E(Ua + w) = a.

The statistical performance of \hat{a} is completely specified, because \hat{a} is a linear transformation of a Gaussian vector and hence has a Gaussian distribution:

\hat{a} \sim N(a, \sigma^2 (U^T U)^{-1})
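The MVU estimator of the linear model can be checked numerically, e.g. on the linear-trend model of Example 2. This is a minimal sketch; the trend values, noise level and N below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 200, 0.5
a_true = np.array([2.0, 0.1])              # assumed [A, B] of z_n = A + B*n + w_n

n = np.arange(N)
U = np.column_stack([np.ones(N), n])        # observation matrix of the linear trend
z = U @ a_true + rng.normal(0.0, sigma, N)

a_hat = np.linalg.solve(U.T @ U, U.T @ z)   # MVU estimate (U^T U)^{-1} U^T z
cov = sigma**2 * np.linalg.inv(U.T @ U)     # its covariance sigma^2 (U^T U)^{-1}
print(a_hat, np.sqrt(np.diag(cov)))
```

The printed standard deviations (square roots of the covariance diagonal) indicate how closely a single realization of the estimate should sit to the true parameters.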
Examples:
1. Curve fitting: Consider fitting the data by a P-th order polynomial function of n:

x_n = a_0 + a_1 n + a_2 n^2 + ... + a_P n^P + w_n,

where w_n is the n-th sample of the noise. We have N samples:

z = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 1 \\ \vdots & & & & \vdots \\ 1 & N-1 & (N-1)^2 & \cdots & (N-1)^P \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_P \end{bmatrix} + \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{N-1} \end{bmatrix} = Ua + w, \qquad \hat{a} = [U^T U]^{-1} U^T z
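A sketch of the polynomial fit (second order for brevity; the coefficients and noise level are illustrative assumptions, and the index is normalized to keep the Vandermonde matrix well conditioned):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
n = np.arange(N) / N                        # normalized "time" axis
a_true = np.array([1.0, -2.0, 3.0])         # assumed a0 + a1*n + a2*n^2

U = np.vander(n, 3, increasing=True)        # columns 1, n, n^2
z = U @ a_true + rng.normal(0.0, 0.1, N)
a_hat = np.linalg.lstsq(U, z, rcond=None)[0]  # solves the normal equations
print(a_hat)
```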
2. Fourier analysis: Consider fitting the data by a sum of harmonic components:

x_n = \sum_{k=1}^{M}\left[a_k \cos\left(\frac{2\pi k n}{N}\right) + b_k \sin\left(\frac{2\pi k n}{N}\right)\right] + w_n

In the z = Ua + w form the columns of U contain the sampled cosine and sine sequences \cos(2\pi k n/N) and \sin(2\pi k n/N), k = 1, ..., M, and the parameter vector is a = [a_1, ..., a_M; b_1, ..., b_M]^T:

\hat{a} = [U^T U]^{-1} U^T z

Note that (U^T U)^{-1} = \frac{2}{N}I, therefore

\hat{a}_k = \frac{2}{N}\sum_{n=0}^{N-1} x_n \cos\left(\frac{2\pi k n}{N}\right), \qquad \hat{b}_k = \frac{2}{N}\sum_{n=0}^{N-1} x_n \sin\left(\frac{2\pi k n}{N}\right).

Remarks:
(1) From the properties of the linear models the estimates are unbiased.
(2) The covariance matrix:

cov(\hat{a}) = \sigma_w^2 [U^T U]^{-1} = \frac{2\sigma_w^2}{N} I

The estimates are Gaussian random variables, and since their covariance matrix is diagonal, the amplitude estimates are independent.
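Because (U^T U)^{-1} = (2/N)I, the least-squares solution collapses to plain correlations; a minimal numerical check (the amplitudes, the harmonic index k and the noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 64, 3
n = np.arange(N)
a3, b3 = 1.5, -0.7                          # assumed true amplitudes of harmonic k
x = (a3 * np.cos(2 * np.pi * k * n / N)
     + b3 * np.sin(2 * np.pi * k * n / N)
     + rng.normal(0.0, 0.1, N))

# LS solution reduces to (2/N)-scaled correlations with the basis sequences
a_hat = (2 / N) * np.sum(x * np.cos(2 * np.pi * k * n / N))
b_hat = (2 / N) * np.sum(x * np.sin(2 * np.pi * k * n / N))
print(a_hat, b_hat)
```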
3. System identification: Consider identification of a Finite Impulse Response (FIR) model a_k, k = 0, 1, ..., P-1, with input x_n and output y_n provided for n = 0, 1, ..., N-1 (x_n = 0 if n < 0):

y_n = \sum_{k=0}^{P-1} a_k x_{n-k} + w_n,

rewritten in the z = Ua + w form:

z = \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{N-1} \end{bmatrix} = \begin{bmatrix} x_0 & 0 & \cdots & 0 \\ x_1 & x_0 & \cdots & 0 \\ x_2 & x_1 & \cdots & 0 \\ \vdots & & & \vdots \\ x_{N-1} & x_{N-2} & \cdots & x_{N-P} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{P-1} \end{bmatrix} + \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{N-1} \end{bmatrix}, \qquad \hat{a} = [U^T U]^{-1} U^T z

Remark: The covariance matrix in case of WGN: cov(\hat{a}) = \sigma_w^2 [U^T U]^{-1}.
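A sketch of the FIR identification step (the taps, input statistics and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 500, 4
a_true = np.array([0.9, -0.4, 0.2, 0.05])   # assumed unknown FIR taps

x = rng.normal(0.0, 1.0, N)                 # excitation
U = np.zeros((N, P))
for k in range(P):
    U[k:, k] = x[:N - k]                    # column k holds x_{n-k}, x_n = 0 for n < 0
z = U @ a_true + rng.normal(0.0, 0.05, N)   # observed noisy output y_n

a_hat = np.linalg.solve(U.T @ U, U.T @ z)
print(a_hat)
```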
Measurement Theory: Lecture 4, 01.03.2017.
\hat{a} = [U^T U]^{-1} U^T z

Note that there is a slight difference if we compare this solution to that of Example 3: through the sliding window we see nonzero values also with negative indices.
It is interesting to investigate the meaning of the matrix [U^T U]^{-1} and that of the vector U^T z. First let's reorder the matrix U^T U as the sum of dyadic products:
\hat{R}_{xx}(k, p) = \frac{1}{N}\sum_{n=0}^{N-1} x_{n-k} x_{n-p}, \qquad k, p = 0, 1, ..., P-1. \qquad (74)

Similarly, for the vector U^T z:

U^T z = \sum_{n=0}^{N-1} \begin{bmatrix} x_n z_n \\ x_{n-1} z_n \\ \vdots \\ x_{n-P+1} z_n \end{bmatrix} \qquad (75)

(75) is such a vector, the normed elements of which estimate the cross-correlation of the discrete sequences x_n and z_n:

\hat{R}_{xz}(k) = \frac{1}{N}\sum_{n=0}^{N-1} x_{n-k} z_n, \qquad k = 0, 1, ..., P-1. \qquad (76)

Let's denote by \hat{R}_{xx} the matrix composed from the elements of (74), and by \hat{R}_{xz} the vector composed from the elements of (76):

\hat{a} = \hat{R}_{xx}^{-1} \hat{R}_{xz} \qquad (77)
Remarks:
(1) Obviously (77) can be considered only a formal rewriting in the case of the actual example; however, later we will see its meaning and importance.
(2) In case of real-time computations, instead of (72) the form y_n = \sum_{k=1}^{P} a_k x_{n-k} is to be used, because to be able to compute the output we need a one-step delay.
E(\hat{a}) = E(h^T z) = h^T E(z) = a h^T s = a, therefore h^T s = 1, where h^T = [h_0, h_1, ..., h_{N-1}].
Minimize var(\hat{a}) = h^T C h subject to h^T s = 1. This constrained optimization can be solved using Lagrangian multipliers. We have to minimize

J = h^T C h + \lambda(h^T s - 1)

\frac{\partial J}{\partial h} = 2Ch + \lambda s = 0, from which h = -\frac{\lambda}{2} C^{-1} s, which can be replaced into the condition term:

s^T h = 1 = -\frac{\lambda}{2} s^T C^{-1} s, \qquad -\frac{\lambda}{2} = \frac{1}{s^T C^{-1} s},

which after replacing into the expression of h gives h = \frac{C^{-1} s}{s^T C^{-1} s}, and finally, using \hat{a} = h^T z, the optimal solution is:

\hat{a} = \frac{s^T C^{-1} z}{s^T C^{-1} s}, \qquad var(\hat{a}) = \frac{1}{s^T C^{-1} s}
Finding the BLUE (vector case):

z = Ua + w

where w is a noise vector with zero mean and covariance matrix C (the PDF of w is arbitrary); then the BLUE of a is:

\hat{a} = (U^T C^{-1} U)^{-1} U^T C^{-1} z,

and the covariance of \hat{a} is:

C(\hat{a}) = (U^T C^{-1} U)^{-1}.

Remark: If the noise is Gaussian, then the BLUE is an MVU estimator.
Example: Consider the problem of a DC level in noise: z_n = a + w_n, where w_n is of unspecified PDF with var(w_n) = \sigma_n^2. Here U = [1, 1, ..., 1]^T = s. The covariance matrix and its inverse are:

C = \begin{bmatrix} \sigma_0^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_{N-1}^2 \end{bmatrix}, \qquad C^{-1} = \begin{bmatrix} \frac{1}{\sigma_0^2} & & 0 \\ & \ddots & \\ 0 & & \frac{1}{\sigma_{N-1}^2} \end{bmatrix}

and hence the BLUE is:

\hat{a} = (U^T C^{-1} U)^{-1} U^T C^{-1} z = \left(\sum_{n=0}^{N-1}\frac{1}{\sigma_n^2}\right)^{-1}\sum_{n=0}^{N-1}\frac{z_n}{\sigma_n^2}

and the minimum covariance:

var(\hat{a}) = (U^T C^{-1} U)^{-1} = \left(\sum_{n=0}^{N-1}\frac{1}{\sigma_n^2}\right)^{-1}
Problems: MVU estimator does not often exist or cannot be found. BLUE is restricted to linear
models.
VII. Maximum likelihood estimation (MLE) (a is deterministic):
- can always be applied if the PDF is known;
- is optimal for large data size;
- is computationally complex and requires numerical methods.
Basic idea: Choose the parameter value that makes the observed data the most likely data to have been observed.
Likelihood Function: the PDF p(z; a) when a is regarded as a variable (not as a parameter).
ML Estimate: the value of a that maximizes the likelihood function.
Procedure: find the log-likelihood function \ln p(z; a); differentiate w.r.t. a, set to zero, and solve for a.
Example: Consider a DC level in WGN with unknown variance: z_n = A + w_n. Suppose that A > 0 and \sigma^2 = A. The PDF is:

p(z; A) = \frac{1}{(2\pi A)^{N/2}} \exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}(z_n - A)^2\right]

\frac{\partial \ln p(z; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(z_n - A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(z_n - A)^2

Setting the derivative to zero and solving for A > 0:

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{4} + \frac{1}{N}\sum_{n=0}^{N-1} z_n^2}
VIII. Least Squares (LS) Estimation: In all the previous methods, we assumed that the
measured signal is the sum of a true signal and a measurement error with known probabilistic
model. In the least squares method, we do not need a probabilistic assumption but only a
deterministic signal model.
z_n = s_n(a) + e_n

where e_n represents the modeling and the measurement errors. The objective is to minimize the LS cost:

J(a) = \sum_{n=0}^{N-1}\left(z_n - s_n(a)\right)^2.
Example: Estimate the DC level of a signal. We observe z_n = A + e_n for n = 0, 1, ..., N-1, and the LS criterion is:

J(A) = \sum_{n=0}^{N-1}(z_n - A)^2

\frac{\partial J(A)}{\partial A} = -2\sum_{n=0}^{N-1}(z_n - A) = 0 \quad\Rightarrow\quad \hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} z_n
\hat{a} = [U^T Q U]^{-1} U^T Q z.

The minimum LS cost:

J_{min}(\hat{a}) = (z - U\hat{a})^T(z - U\hat{a}) = z^T\left[I - U(U^T U)^{-1}U^T\right]z

2. If we take Q = C^{-1}, where C is the covariance of the noise, then the weighted least squares estimator is the BLUE. However, there is no true LS-based reason for this choice.
3. If Q = diag(1/\sigma_0^2, ..., 1/\sigma_{N-1}^2), then we get the Gauss-Markov (GM) estimate of (66), i.e. the GM estimator is a weighted least squares error estimator, where the weights are given by the inverse of the covariance matrix of the noise.
4. Fitting the observation model to data leads us to the general problem of model fitting. One of the simplest cases of model fitting is the regression analysis problem, where, based on independent-variable-value / function-value pairs, an approximating function is composed, typically by fixing the structure of the function in advance and setting its parameters by minimizing some cost function. See, as the simplest case, the linear regression problem.
5. At this point we formally end the introduction of estimation theory basics; however, in the following we will still continue to deal with measurements, i.e. with determining the states and the parameters of different phenomena, which typically involves estimation.
5. Model fitting
In the case of LS estimators, we do not have a priori knowledge about the observations,
therefore what we practically do is nothing else than model fitting.
Regression analysis: In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. Finding this relationship is a special case of model fitting. On Figure 17 the function y = g(u, w) has two types of independent variables: the one denoted by u(n) in many applications can be considered as a discrete input time sequence that can be set/influenced by the operator, while the other, denoted by w(n), cannot be influenced; it is typically unknown noise or disturbance.
Figure 17: the regression scheme; reality y = g(u, w) and the model share the input u(n), their outputs are compared, and the cost function of the difference is minimized.
Remark:
In the following the small n identifies an iteration step, or it is a discrete time index, which sometimes takes the role of real indexing as well. In the following u(n) ≡ u_n and y(n) ≡ y_n are equivalent notations.
For modelling the unknown y = g(u, w) a function \hat{y} = \hat{g}(u) is used, which has the same input and some tunable parameters that are set to minimize some cost function. Typically the mean squared error is used as the cost function:

E\{(y - \hat{y})^T(y - \hat{y})\} \qquad (78)

Regression analysis in case of fully specified statistics: If we know f(u, y), the joint probability density function of u and y, then we face a Bayes estimation problem, the solution of which is the a posteriori expected value:

\hat{g}(u) = E\{y \mid u\} \qquad (79)

The curve [u, \hat{g}(u)] is the so-called regression curve of the variable y with respect to u. If the input is a vector, then we have a regression surface.
Regression analysis with partially specified statistics: We do not know the joint density function, only a limited number of moments.
Linear regression: The function to be fitted is the scalar linear function g(u) = a_0 + a_1 u, whose parameters are to be selected to minimize E\{(y - g(u))^2\}. Let's suppose that the means m_u and m_y and the standard deviations \sigma_u and \sigma_y are known, together with the normalized cross-covariance:

r = \frac{E\{(u - m_u)(y - m_y)\}}{\sigma_u \sigma_y}.

Minimize the cost function

J(a_0, a_1) = E\{(y - a_0 - a_1 u)^2\} = E\{y^2\} + a_0^2 + a_1^2 E\{u^2\} - 2a_0 E\{y\} - 2a_1 E\{uy\} + 2a_0 a_1 E\{u\} \qquad (80)

according to a_0 and a_1:

\frac{\partial J(a_0, a_1)}{\partial a_0} = 2a_0 - 2m_y + 2a_1 m_u = 0, thus a_0 = m_y - a_1 m_u, and \qquad (81)

\frac{\partial J(a_0, a_1)}{\partial a_1} = 2a_1(\sigma_u^2 + m_u^2) - 2(r\sigma_u\sigma_y + m_u m_y) + 2a_0 m_u = 0. By solving this set of equations:

a_0 = m_y - r\frac{\sigma_y}{\sigma_u} m_u, \qquad a_1 = r\frac{\sigma_y}{\sigma_u} \qquad (82)

Remarks:
1. To get (82) we have used the relationships E\{(u - m_u)^2\} = \sigma_u^2 = E\{u^2\} - m_u^2 and E\{(u - m_u)(y - m_y)\} = E\{uy\} - m_u m_y.
2. If we replace the optimum values into (80), we get var\{y - g(u)\} = \sigma_y^2(1 - r^2), which is the variance of the approximation error, the minimum of the cost function. It is interesting to investigate the relations as a function of 0 \leq r \leq 1. If the cross-covariance is 0, then we get a_1 = 0, i.e. a horizontal line, which means that the best estimate of the output is the expected value of the measured values. If the cross-covariance is 1 (100%), then y depends only on u, i.e. it is independent of the noise w.
3. One possible generalization of the linear regression is the polynomial regression:

g(u) = \sum_{k=0}^{N} a_k u^k, \qquad (83)

which has the important property of being linear in its parameters. We prefer models linear in their parameters, because in case of a quadratic cost function, finding the minimum results in solving a set of linear equations.
Linear regression based on measured data: with slight modifications, the procedure above can be performed also in the case of not having a priori information. Then we can use the model y_n = a_0 + a_1 u_n + w_n, i.e. z = Ua + w, as previously:

z = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{N-1} \end{bmatrix} = \begin{bmatrix} 1 & u_0 \\ 1 & u_1 \\ \vdots & \vdots \\ 1 & u_{N-1} \end{bmatrix}\begin{bmatrix} a_0 \\ a_1 \end{bmatrix} + \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{N-1} \end{bmatrix}, \qquad U^T U = \begin{bmatrix} N & \sum_{n=0}^{N-1} u_n \\ \sum_{n=0}^{N-1} u_n & \sum_{n=0}^{N-1} u_n^2 \end{bmatrix}, \qquad U^T z = \begin{bmatrix} \sum_{n=0}^{N-1} y_n \\ \sum_{n=0}^{N-1} u_n y_n \end{bmatrix}

\hat{a}_0 = \frac{\frac{1}{N}\sum_{n=0}^{N-1} u_n^2 \cdot \frac{1}{N}\sum_{n=0}^{N-1} y_n - \frac{1}{N}\sum_{n=0}^{N-1} u_n \cdot \frac{1}{N}\sum_{n=0}^{N-1} u_n y_n}{\frac{1}{N}\sum_{n=0}^{N-1} u_n^2 - \left(\frac{1}{N}\sum_{n=0}^{N-1} u_n\right)^2}, \qquad \hat{a}_1 = \frac{\frac{1}{N}\sum_{n=0}^{N-1} u_n y_n - \frac{1}{N}\sum_{n=0}^{N-1} u_n \cdot \frac{1}{N}\sum_{n=0}^{N-1} y_n}{\frac{1}{N}\sum_{n=0}^{N-1} u_n^2 - \left(\frac{1}{N}\sum_{n=0}^{N-1} u_n\right)^2}.

Remark: In these expressions we can identify the statistical estimates of the moments used in (82): if we rewrite the equations using the differences to the mean values, we reach the complete correspondence. Please do it as an exercise.
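The correspondence can also be checked numerically: the LS solution of z = Ua + w coincides with the moment form (82) evaluated with sample moments (a minimal sketch; the data-generating model below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 1000
u = rng.normal(0.0, 2.0, N)
y = 1.0 + 0.5 * u + rng.normal(0.0, 0.3, N)   # assumed a0 = 1.0, a1 = 0.5

# LS solution of z = U a + w with U = [1, u_n]
U = np.column_stack([np.ones(N), u])
a_ls = np.linalg.solve(U.T @ U, U.T @ y)

# moment form (82) with statistical estimates of m_u, m_y, sigma_u, sigma_y, r
r = np.corrcoef(u, y)[0, 1]
a1 = r * np.std(y) / np.std(u)
a0 = np.mean(y) - a1 * np.mean(u)
print(a_ls, [a0, a1])
```

The two parameter pairs agree to machine precision, since both are algebraic rearrangements of the same normal equations.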
Generalization of the regression scheme: On Figure 18 model fitting is arranged according to the regression scheme. The response y of reality to the input u is approximated by the response \hat{y} of the model, which is adjusted by minimizing some cost function.

Figure 18: generalized regression scheme; reality and the model are driven by the same input u(n), and the difference of their outputs feeds the cost-function minimization.
It is interesting to compare this scheme with that of Figure 2. Let's redraw Figure 2 in the form of Figure 19:

Figure 19: observer-type scheme; reality and the model are compared through the cost function.
The two schemes are similar: in both cases model fitting is performed. In the observer scheme we know the parameters and the system's states are estimated/measured, while in the regression scheme (implicitly) we are familiar with the state and the parameters are estimated/measured. Both schemes are parallel in the sense that the input is fed in a parallel way to the system and its approximator.
Remark: We can fit models also in serial form, when practically the so-called inverse model is fitted in such a way that the input is estimated (see Figure 20).

Figure 20: serial (inverse-model) fitting scheme; the inverse model estimates the input, and the estimate is compared with u(n) through the cost function.

The drawback of this approach is that in case of dynamic systems, due to the system delay, the input estimate should be predicted, or u(n) delayed.
Adaptive linear combinator: Figure 21 presents a widely-used model family fitted into the generalized regression scheme.

Figure 21: adaptive linear combinator; the regression-vector components x_0(n), ..., x_{N-1}(n) are weighted by w_0(n), ..., w_{N-1}(n), the combined output is compared with y(n), and the weights are tuned by the minimization block.

In this model, from the discrete sequence u(n) the sequence of vectors X(n) = [x_0(n)\; x_1(n)\; \cdots\; x_{N-1}(n)]^T, the so-called regression vector, is generated, and then its components are linearly combined to produce the output sequence \hat{y}(n). The most suitable values of the weights W^T(n) = [w_0(n)\; w_1(n)\; \cdots\; w_{N-1}(n)] are derived by minimizing the mean square error:

J(W(n)) = E\{[y(n) - X^T(n)W(n)]^T[y(n) - X^T(n)W(n)]\} =
= E\{y^T(n)y(n)\} - 2W^T(n)E\{X(n)y(n)\} + W^T(n)E\{X(n)X^T(n)\}W(n). \qquad (84)
Let's denote E\{X(n)y(n)\} = P and E\{X(n)X^T(n)\} = R. The cost is minimal if

\frac{\partial J(W(n))}{\partial W(n)} = -2P + 2RW(n) = 0,

thus the best weights are given by the so-called Wiener-Hopf equation:

W^* = R^{-1}P \qquad (85)

Remarks:
1. By replacing (85) into (84):

J_{min} = E\{y^T(n)y(n)\} - P^T R^{-1} P = E\{y^T(n)y(n)\} - P^T W^* \qquad (86)
2. Equation (87) gives the mean squared error as a function of the parameters and of the parameter error. The equation is illustrated by Figure 22.

Figure 22: the paraboloid error surface over the parameter subspace.

At any point of the error surface, the change of the error with respect to the parameter change can be characterized by the gradient of the surface:

\nabla(n) = \frac{\partial J(W(n))}{\partial W(n)} = 2R[W(n) - W^*] = 2RV(n) = 2(RW(n) - P). \qquad (88)

Equation (88) plays an important role in cost function minimization: to find the minimum we will descend on the error surface.
Measurement Theory: Lecture 5, 08.03.2017.
u(n) = \sin\left(\frac{2\pi}{N}n\right), \qquad x_0(n) = u(n) = \sin\left(\frac{2\pi}{N}n\right), \qquad x_1(n) = u(n-1) = \sin\left(\frac{2\pi}{N}(n-1)\right)

Figure 23: the signals of the example (f(u) and the regression-vector components).
W^* = W(n) - \frac{1}{2}R^{-1}\nabla(n). \qquad (92)

If we suppose that our knowledge about the matrix R is not perfect, and therefore the situation is the same with the gradient, (92) can be rewritten into an iterative form, because we are unable to reach the optimum in a single step:

W(n+1) = W(n) - \frac{1}{2}R^{-1}\nabla(n).

With the introduction of the convergence factor 0 < \mu \leq 1 into (92):

W(n+1) = W(n) - \mu R^{-1}\nabla(n).

If the matrices R and P are known, the equations describing the operation of the adaptive combinator are as follows:
paraboloid-form error surface. The axes of such a coordinate system point in the directions of the eigenvectors of the matrix R.

J(W(n)) = J_{min} + (W(n) - W^*)^T R (W(n) - W^*) = J_{min} + V^T(n) R V(n) \qquad (95)

The eigenvalue/eigenvector system of R plays an important role. Let's see this in the case of (90):

R = \begin{bmatrix} 0.5 & 0.5\cos\frac{2\pi}{N} \\ 0.5\cos\frac{2\pi}{N} & 0.5 \end{bmatrix}

The roots of \det[\lambda I - R] = 0 give the eigenvalues:

(\lambda - 0.5)^2 - 0.25\cos^2\frac{2\pi}{N} = \lambda^2 - \lambda + 0.25\sin^2\frac{2\pi}{N} = 0 \qquad (96)

The two roots are:

\lambda_0 = 0.5 + 0.5\cos\frac{2\pi}{N}, \qquad \lambda_1 = 0.5 - 0.5\cos\frac{2\pi}{N} \qquad (97)

The eigenvectors can be derived from the equations RQ_0 = \lambda_0 Q_0 and RQ_1 = \lambda_1 Q_1:

\begin{bmatrix} 0.5 & 0.5\cos\frac{2\pi}{N} \\ 0.5\cos\frac{2\pi}{N} & 0.5 \end{bmatrix}\begin{bmatrix} q_{00} \\ q_{01} \end{bmatrix} = \left(0.5 + 0.5\cos\frac{2\pi}{N}\right)\begin{bmatrix} q_{00} \\ q_{01} \end{bmatrix} \quad\Rightarrow\quad q_{00} = q_{01} \qquad (98)

\begin{bmatrix} 0.5 & 0.5\cos\frac{2\pi}{N} \\ 0.5\cos\frac{2\pi}{N} & 0.5 \end{bmatrix}\begin{bmatrix} q_{10} \\ q_{11} \end{bmatrix} = \left(0.5 - 0.5\cos\frac{2\pi}{N}\right)\begin{bmatrix} q_{10} \\ q_{11} \end{bmatrix} \quad\Rightarrow\quad q_{10} = -q_{11} \qquad (99)

The eigenvectors normed to unit length: Q_0 = \frac{\sqrt{2}}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad Q_1 = \frac{\sqrt{2}}{2}\begin{bmatrix} 1 \\ -1 \end{bmatrix}, see Figure 24. \qquad (100)
Figure 24: the eigenvectors Q_0 and Q_1 in the parameter plane.

In this simple example the eigenvectors are orthogonal, and their angle to the coordinate vectors is 45°. These eigenvectors show those directions in which descent can be performed by changing a single parameter at a time.
In general, \det(R - \lambda I) = 0 gives \lambda_0, \lambda_1, ..., \lambda_{N-1}, and (R - \lambda_n I)Q_n = 0, n = 0, 1, ..., N-1. By ordering the eigenvectors into a matrix:

R[Q_0\; Q_1\; \cdots\; Q_{N-1}] = [Q_0\; Q_1\; \cdots\; Q_{N-1}]\,diag[\lambda_0\; \lambda_1\; \cdots\; \lambda_{N-1}] = Q\Lambda \qquad (101)
RQ = Q\Lambda, or

R = Q\Lambda Q^{-1}, \qquad (102)

which is called the normal form of R. Since R by definition is a symmetric matrix, R = R^T. It is an important property that in this case the eigenvectors are orthogonal: Q_i^T Q_j = 0 if i \neq j, otherwise Q_i^T Q_i = c_i. If Q_i^T Q_i = 1 for all i, then the eigenvectors are orthonormal, and Q^T Q = I, i.e. Q^{-1} = Q^T.
Remarks:
1. Since V^T R V \geq 0 (R is positive semidefinite), its eigenvalues are nonnegative.
2. The eigenvectors of the matrix R designate the principal axes of the error surface:

J(W(n)) = J_{min} + (W(n) - W^*)^T R (W(n) - W^*) = J_{min} + V^T(n) R V(n) = J_{min} + V^T(n) Q\Lambda Q^T V(n) =
= J_{min} + [Q^T V(n)]^T \Lambda [Q^T V(n)] = J_{min} + V'^T(n)\Lambda V'(n), \qquad V'(n) = Q^T V(n). \qquad (103)
Figure 25: the error surface in the rotated (principal-axis) coordinate system.

In this case the optimization can be performed as a sequence of single-variable optimizations. This is illustrated by the following example, where the optimum is approached by descending along the gradient:
Example:
Single variable case: w(n+1) = w(n) - \mu\nabla(n), \nabla(n) = 2\lambda(w(n) - w^*), where \lambda = R, and the parameter error:

w(n+1) - w^* = (1 - 2\mu\lambda)(w(n) - w^*), \quad i.e. \quad V(n+1) = rV(n) = r^{n+1}V(0), \qquad r = 1 - 2\mu\lambda.
The convergence condition:

0 < \mu < \frac{1}{\lambda} \qquad (105)

If 0 < \mu < \frac{1}{2\lambda}, then the iteration procedure is overdamped; if \mu = \frac{1}{2\lambda}, then it is critically damped; while if \frac{1}{2\lambda} < \mu < \frac{1}{\lambda}, then it is underdamped.
Remark: Note that in the single variable case R = \lambda, i.e. the first part of (94) has the form W(n+1) = W(n) - 2\mu\lambda(W(n) - W^*), and after subtracting W^* from both sides, we get the second part of (94).
Multivariable case: V'(n+1) = (I - 2\mu\Lambda)^{n+1}V'(0). In case of applying a single scalar \mu, convergence requires:

0 < \mu < \frac{1}{\lambda_{max}} \qquad (106)

Note that here we have N variables. The steepest descent is achieved along that axis which corresponds to the highest eigenvalue. If the eigenvalues of the matrix R are not known, only its diagonal, then we can use \lambda_{max} \leq \sum_{i=0}^{N-1}\lambda_i = tr[\Lambda] = tr[R], because the eigenvalues are nonnegative; therefore

0 < \mu < \frac{1}{tr[R]}. \qquad (107)
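The gradient descent of (88) with a step size inside the bound (106) can be sketched as follows (the matrix R, the vector P and the factor 0.9 are illustrative assumptions):

```python
import numpy as np

R = np.array([[1.0, 0.4], [0.4, 1.0]])   # assumed correlation matrix
P = np.array([1.0, 0.2])                 # assumed cross-correlation vector
W_star = np.linalg.solve(R, P)           # Wiener-Hopf optimum for reference

mu = 0.9 / max(np.linalg.eigvalsh(R))    # inside the 0 < mu < 1/lambda_max bound
W = np.zeros(2)
for _ in range(200):
    grad = 2.0 * (R @ W - P)             # gradient of the quadratic cost, eq. (88)
    W = W - mu * grad
print(W, W_star)
```

With mu chosen this way every mode contracts by a factor |1 - 2*mu*lambda_i| < 1, so the iteration converges to W* geometrically.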
Remark: If \Lambda were known, i.e. if we had global information about the error surface, then instead of a scalar \mu, a matrix of the form \frac{1}{2}\Lambda^{-1} would be preferable. What we can do is to use the gradient as local information, and follow its direction to reach the minimum of the error surface.
Iterative model fitting methods: In the following we will summarize some classical minimization methods, which are widely used in case of quadratic cost functions and models linear in their parameters. These can be considered also as learning procedures, because they acquire and process information about the actual relations. In our case this information source is the gradient of the error surface, and we step forward accordingly. Obviously, we can use other methods, where e.g. W(n) is selected in a different way, and afterwards we check the error. If the error is smaller, then the selected value is the next proposition, otherwise we ignore it. (Monte-Carlo methods, genetic algorithms.) However, these methods are preferable merely if (1) the cost function is not quadratic, or (2) the model is nonlinear in its parameters. In such cases the error surface is not a paraboloid; it might have local minima, and methods based on local information may stop in one of them.
Iterative model fitting using Newton's method:
This method can be derived from the Wiener-Hopf equation. Here we suppose a priori knowledge of R and P. Since this cannot be expected in practice, this method has only theoretical significance; however, it gives hints for creating approximate solutions. Two types of expressions will be given. The first provides the parameter vector for the next iteration step, while the second gives the relation of the parameter error with respect to the initial error.
the parameter value back and forth on the internal wall of the paraboloid, and it will be unable to descend to a lower point. Therefore, it is worth reducing \mu near the optimum.
The expression of the parameter error can be derived from (114) in such a way that we subtract W^* from both sides, and we suppose that y(n) = X^T(n)W^*. This assumption means that the model fitting perfectly succeeded.

W(n+1) - W^* = W(n) - W^* + 2\mu X(n)[X^T(n)W^* - X^T(n)W(n)] = [I - 2\mu X(n)X^T(n)][W(n) - W^*]

Hence:

V(n+1) = \left[\prod_{i=0}^{n}\left(I - 2\mu X(i)X^T(i)\right)\right]V(0) \qquad (115)

Equation (115) shows how the convergence factor \mu and the regression vector X(n) contribute to the reduction of the parameter error. Obviously the product of matrices should be contractive, i.e. it should reduce the length of the parameter error vector, possibly in every step.
Iterative model fitting using the \alpha-LMS method:
In (114) it might be useful to norm the regression vector X(n), because otherwise the correction of the parameter vector is highly dependent on the signal level. The modified versions of (114) and (115) are:

W(n+1) = W(n) + \frac{\alpha}{X^T(n)X(n)}X(n)e(n) \qquad (116)

V(n+1) = \left[\prod_{i=0}^{n}\left(I - \frac{\alpha}{X^T(i)X(i)}X(i)X^T(i)\right)\right]V(0) \qquad (117)
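The normalized update (116) can be sketched on a short identification task (a minimal illustration; the taps, step size alpha and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
N, taps = 5000, 3
w_true = np.array([0.5, -0.25, 0.1])        # assumed optimal weights
alpha = 0.5

u = rng.normal(0.0, 1.0, N)
W = np.zeros(taps)
for n in range(taps, N):
    X = u[n - taps + 1:n + 1][::-1]         # regression vector [u(n), u(n-1), u(n-2)]
    y = w_true @ X + rng.normal(0.0, 0.01)  # desired signal with small noise
    e = y - W @ X                           # a priori error e(n)
    W = W + alpha * X * e / (X @ X)         # alpha-LMS update, eq. (116)
print(W)
```

Dividing by X^T(n)X(n) makes the step size insensitive to the input signal level, which is the point of the alpha-LMS (normalized LMS) variant.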
Iterative model fitting using the LMS-Newton method:
In principle, the regression vector X(n) in (114) can also be normed by the matrix R. If we were in the particular situation that we are familiar with the matrix R and the gradient is estimated by its instantaneous value, then

W(n+1) = W(n) + 2\mu R^{-1}X(n)e(n) \qquad (118)

V(n+1) = \left[\prod_{i=0}^{n}\left(I - 2\mu R^{-1}X(i)X^T(i)\right)\right]V(0) \qquad (119)

This idea has practical importance if the matrix R is estimated iteratively from our observations.
Iterative model fitting using the LMS-Newton method, R estimated iteratively:

W(n+1) = W(n) + 2\mu\hat{R}^{-1}(n+1)X(n)e(n) \qquad (120)
\hat{R}^{-1}(n+1) = \hat{R}^{-1}(n) - \frac{\hat{R}^{-1}(n)X(n)X^T(n)\hat{R}^{-1}(n)}{1 + X^T(n)\hat{R}^{-1}(n)X(n)} \qquad (121)

Remarks:
1. The matrix inversion lemma: [A + BC]^{-1} = A^{-1} - A^{-1}B[I + CA^{-1}B]^{-1}CA^{-1}. Note: if BC is a dyadic product, as in our case, then the inverse of the matrix sum in brackets on the right-hand side will be a scalar. Here A = \hat{R}(n), BC = X(n)X^T(n).
2. The iteration may be started from \hat{R}(0) = \varepsilon I, where 0 < \varepsilon \leq 1.
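The rank-one inverse update (121) can be verified against direct inversion (a minimal sketch; the dimension, step count and start value epsilon = 0.01 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
dim, steps = 3, 200

Rhat = 0.01 * np.eye(dim)        # R_hat(0) = eps * I
Rinv = np.eye(dim) / 0.01        # its exact inverse
for _ in range(steps):
    X = rng.normal(0.0, 1.0, dim)
    Rhat = Rhat + np.outer(X, X)                 # direct accumulation of dyads
    # rank-one inverse update via the matrix inversion lemma, eq. (121)
    RX = Rinv @ X
    Rinv = Rinv - np.outer(RX, RX) / (1.0 + X @ RX)

print(np.max(np.abs(Rinv - np.linalg.inv(Rhat))))
```

The recursion keeps the inverse up to date in O(N^2) operations per sample instead of the O(N^3) cost of a full inversion.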
Measurement Theory: Lecture 6, 22.03.2017.
- Linear averaging:

\hat{R}(n+1) = \hat{R}(n) + \frac{1}{n+1}\left[X(n)X^T(n) - \hat{R}(n)\right],

i.e. every observed dyad receives equal weight in the running average.
- Exponential averaging:
\bar{x}(n+1) = a\bar{x}(n) + b y(n),

where a and b are positive constants. The frequency-domain behavior can be given with the help of the z-transform: zX(z) = aX(z) + bY(z), from which we can derive the transfer function of the exponential averager:

H(z) = \frac{X(z)}{Y(z)} = \frac{b}{z - a} = \frac{bz^{-1}}{1 - az^{-1}},

which behaves as a lowpass filter, and for a constant input signal after some transients produces a constant value at its output. Typically it is normed to have H(z) = 1 for z = 1. Using this condition: \frac{b}{1-a} = 1, i.e. a = 1 - b. Hence

\bar{x}(n+1) = \bar{x}(n) + b(y(n) - \bar{x}(n)).

In this case the new observed value is multiplied by a constant, in contrast to the linear averaging. The computation \hat{R}(n+1) = a\hat{R}(n) + bX(n)X^T(n) has the same structure as the exponential averaging, with a = 1 - b.
2. Model fitting can be performed basically using one of two approaches:
- first collecting data followed by batch-processing i.e. in an off-line way;
- data processing parallel with the acquisition iteratively or recursively, i.e. in an on-line
way.
3. Concerning the role of model fitting we distinguish two approaches:
- Identification: we try to describe reality with high accuracy;
- Adaptation: we try to follow reality with low delay.
In case of adaptive systems we use basically iterative/recursive methods, while for
identification in principle the two approaches produce the same result.
be solved.

W(n+1) = W(n) - \frac{C(W(n))}{[\nabla C(W(n))]^T \nabla C(W(n))}\nabla C(W(n)) \qquad (123)

This is the so-called Newton-Raphson method. A typical experience is that this method behaves well far from the optimum, while its behavior near the optimum depends on the extent to which the assumption C(W^*) = 0 is fulfilled.
b) If we try to find the minimum of the expanded C(W) by computing its gradient, then the condition \nabla C(W(n+1)) = 0 = \nabla C(W(n)) + H(W(n))(W(n+1) - W(n)) is obtained, which results in the Newton method:

W(n+1) = W(n) - H^{-1}(W(n))\nabla C(W(n)) \qquad (124)

Remark: The multiplier \frac{1}{2} does not appear here, because it is present in (122), unlike in the expressions used in the case of quadratic cost functions.
Adaptive IIR systems
It might be more efficient, although in several respects more problematic, to fit so-called Infinite Impulse Response (IIR) models. These can be implemented in several alternative forms, but in the following we will restrict ourselves to the so-called direct structure. Formally we still apply an adaptive linear combinator; however, in computing the actual estimate the earlier estimates are also considered:

\hat{y}(n) = \sum_{k=0}^{M-1} a_k(n)x(n-k) + \sum_{k=1}^{N-1} b_k(n)\hat{y}(n-k), \qquad (125)

i.e.

W^T(n) = [a_0(n), a_1(n), ..., a_{M-1}(n); b_1(n), b_2(n), ..., b_{N-1}(n)], \qquad (126)

X^T(n) = [x(n), x(n-1), ..., x(n-M+1); \hat{y}(n-1), \hat{y}(n-2), ..., \hat{y}(n-N+1)]. \qquad (127)
If we apply the methods discussed up till now with (126) and (127), then we perform the so-called pseudolinear regression (PLR). In this case we neglect the fact that the regression vector (127) depends on the previous outputs of the fitted model (the adaptive filter), which is an implicit dependence with the consequence that the error surface is not a paraboloid any more. For every gradient-based method there is the real danger of getting stuck in a local minimum.
Equation-Error Formulation:
The implicit dependence mentioned above can be avoided by alternative error formulations. In this paragraph the numerator (N(z)) and the denominator (D(z)) of the transfer function (H(z)) of the fitted adaptive filter will be considered as operators, allowing the simultaneous presence of (discrete) time and frequency in the equations:

H(z) = \frac{N(z)}{D(z)} = \frac{\hat{Y}(z)}{X(z)}, \qquad (128)

and the approximation error, which was up till now minimized in the squared sense, is e(n) = y(n) - \hat{y}(n). Instead of this let's introduce e_e(n) = D(z)e(n), because using (128) this can be written in the following form:

e_e(n) = D(z)e(n) = D(z)y(n) - D(z)\hat{y}(n) = D(z)y(n) - N(z)x(n), \qquad (129)

which is independent of \hat{y}(n). Let's denote N(z) = A(n, z) = \sum_{k=0}^{M-1} a_k(n)z^{-k}, and D(z) = 1 - B(n, z), where B(n, z) = \sum_{k=1}^{N-1} b_k(n)z^{-k}:

e_e(n) = y(n) - \hat{y}_e(n), \qquad (130)

where, using (126), \hat{y}_e(n) = W^T(n)X_e(n). Here (compare it with (127)):

X_e^T(n) = [x(n), x(n-1), ..., x(n-M+1); y(n-1), y(n-2), ..., y(n-N+1)]. \qquad (131)

For (130) the quadratic error surface will be a paraboloid, therefore all methods successful for adaptive linear combinators can be applied to IIR adaptive filters. The block diagram of the method can be seen in Figure 26.
Figure 26: equation-error formulation; the operators A(n, z) and 1 - B(n, z) act on x(n) and y(n) with copied parameters, and the error e_e(n) = D(z)y(n) - N(z)x(n) drives the adaptation.
Remark: In case of noisy observations, distortions (parameter bias) may appear, i.e. the expected value of the parameters may differ from their ideal value. This is because the observation noise is also filtered and takes part in the model fitting.
Output-Error Formulation:
If we want to avoid parameter bias, then the output-error formulation is more advantageous; however, the danger of local minima exists. If we accept this condition, then the different gradient-based methods can also be considered.
1. Instantaneous gradient (LMS-like) methods: here we try to minimize the output error e^2(n) = e_o^2(n). The form of the gradient estimate has the previously discussed structure:
Figure 27: gradient computation for the output-error IIR filter; the derivatives of the output with respect to the parameters are generated by recursive filters of the form 1/(1 - B(n, z)).

This approximate technique is the so-called Recursive Prediction Error (RPE) method. The name refers to the fact that the correction term within the predicted parameter value is computed by a recursive filter. An obvious drawback of this approach is that the number of filters required equals the number of parameters. A further simplification can be seen in Figure 28. This is called the simplified RPE.
Figure 28: simplified RPE; a single recursive filter 1/(1 - B(n, z)) is shared by the components of the regression vector.
2. A further simplification results if we omit the feedback terms from (133). With this step we are back at the pseudolinear regression (PLR) method. Practically, also in this case the LMS or the LMS-Newton methods are applied.
3. As was already mentioned, in the case of output-error adaptive IIR filters the error surface above the subspace of the parameters is not a paraboloid; local minima might occur. Since gradient methods may lead the algorithm into such a local minimum, we might try to transform this surface into another one having a single minimum. This can be achieved with a different error definition:
Filtered-error (FE) Algorithm:

e_f(n) = [1 - C(n, z)]e(n),

where C(n, z) is designed to force \frac{1 - C(n, z)}{1 - B^*(z)} to be Strictly Positive Real (SPR), because in this case global convergence can be assured. B^*(z) is B(n, z) at the optimal parameter setting W^*.
Remark: The SPR property indicates that without excitation the internal energy will be dissipated, and the system will reach its minimum energy state. This corresponds to the convergence of the cost function minimization procedure.
General form of the adaptive IIR filtering algorithms:

W(n+1) = W(n) + 2\mu\hat{R}^{-1}(n+1)X_f(n)e_f(n),

where X_f(n) is the filtered information (or regression) vector and e_f(n) is the filtered error. Names of the algorithms: if X_f(n) = X_e(n) and e_f(n) = e_e(n), then we are speaking about Recursive Least Squares (RLS) algorithms; moreover, if \hat{R}(n+1) = I, then the name is Least Mean Square (LMS) algorithm.
Stability theory based approach: introduction via the LMS method.

W(n+1) = W(n) + 2\mu e(n)X(n), \qquad V(n+1) = [I - 2\mu X(n)X^T(n)]V(n).

This latter is an autonomous system: if V(n) \to 0, then W(n) \to W^*.
Using Lyapunov's method we are looking for an appropriate energy function: G(n) = V^T(n)V(n). We would like to reduce this energy: \Delta G(n+1) = G(n+1) - G(n) < 0 for all n. If G(0) is bounded, then G(n) \to 0.

\Delta G(n+1) = V^T(n+1)V(n+1) - V^T(n)V(n) =
= V^T(n)[I - 2\mu X(n)X^T(n)]^T[I - 2\mu X(n)X^T(n)]V(n) - V^T(n)V(n) = \qquad (136)
= -4\mu e^2(n)\left(1 - \mu X^T(n)X(n)\right).

Since \mu > 0, if 0 < \mu < \frac{1}{X^T(n)X(n)} for every n, then \Delta G < 0, which involves e^2(n) \to 0 and V^T(n)X(n) \to 0.
Remarks:
1. The condition 0 < μ < 1/(Xᵀ(n)X(n)) implicitly guarantees that X(n) is bounded and that 1/[1 + B(n,z)] is stable. (All the poles of the transfer function are located inside the unit circle.)
2. e(n) = (W(n) − W*)ᵀ X(n) can also be zero if the parameter error vector and the regression vector are orthogonal. Obviously, this should be avoided.
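The Lyapunov argument above can be checked numerically. The following sketch (not from the lecture; the system and signals are assumed for illustration) runs the LMS update W(n+1) = W(n) + 2μ e(n) X(n) on a noise-free linear-combinator model, with the step size chosen to satisfy 0 < μ < 1/(Xᵀ(n)X(n)):

```python
# Sketch: LMS identification of a linear combinator, illustrating the
# step-size bound 0 < mu < 1/(X^T X) from the Lyapunov derivation.
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -0.5, 0.25])   # assumed "optimal" parameter vector W*
W = np.zeros(3)                        # adaptive weights W(n)

for n in range(2000):
    X = rng.standard_normal(3)         # regression vector X(n)
    d = w_star @ X                     # desired (noise-free) response
    e = d - W @ X                      # error e(n)
    mu = 0.5 / (X @ X)                 # satisfies 0 < mu < 1/(X^T X)
    W = W + 2 * mu * e * X             # LMS update W(n+1) = W(n) + 2*mu*e(n)*X(n)

print(np.round(W, 3))                  # ≈ [ 1.   -0.5   0.25]
```

With this choice of μ the energy Vᵀ(n)V(n) decreases at every step, so the weights converge to W* regardless of the initial setting.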
Measurement Theory: Lecture 7, 29.03.2017
Here y_0, y_1, ..., y_{N−1} stand for the observed data and w_k, k = 0,1,...,N−1 are the unknown weighting factors. This is illustrated by Figure 29, which indicates that the element/vector x is approximated by a linear combination of the y_k elements within the subspace spanned by these vectors. The best estimate is generated by the orthogonal projection of x onto the subspace.
Figure 29
This strategy results in the very same solution which is provided by the minimization of the squared error:

e = x − x̂,  E{e²} = E{(x − x̂)²} = E{(x − Σ_{k=0}^{N−1} w_k y_k)²}   (137)

∂E{e²}/∂w_j = −2 E{(x − Σ_{k=0}^{N−1} w_k y_k) y_j} = 0,  i.e.  E{e y_j} = 0 for every j.   (138)
This latter is the so-called orthogonality equation which, interpreted with vectors, expresses that the error vector e is orthogonal to every y_j. By reordering (138):

Σ_{k=0}^{N−1} w_k E{y_k y_j} = E{x y_j},  j = 0,1,2,...,N−1,   (139)

where R_yy(k,j) = E{y_k y_j} and P_xy(j) = E{x y_j}.
The minimum mean square error:

E{e²}_min = E{e(x − Σ_{k=0}^{N−1} w_k y_k)} = E{ex} (because E{e y_k} = 0) = E{(x − x̂)x} = E{x²} − Σ_{k=0}^{N−1} w_k E{x y_k} = E{x²} − Σ_{k=0}^{N−1} w_k P_xy(k)   (140)
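The orthogonality equations (139) are a linear system R_yy w = P_xy, so they can be solved directly. A minimal numerical sketch (the values σ_x² = 1, σ_n² = 0.5 and N = 4 are assumed, with the correlations that an observation model y_k = x + n_k produces):

```python
# Sketch: solving the orthogonality (normal) equations (139) numerically,
# then evaluating the minimum error via (140). Values assumed for illustration.
import numpy as np

N, var_x, var_n = 4, 1.0, 0.5
R_yy = var_x * np.ones((N, N)) + var_n * np.eye(N)   # E{y_k y_j}
P_xy = var_x * np.ones(N)                            # E{x y_j}

w = np.linalg.solve(R_yy, P_xy)
print(w)                          # each weight equals 1/(N + var_n/var_x) = 1/4.5
e2_min = var_x - w @ P_xy         # (140): E{e^2}_min = E{x^2} - sum w_k P_xy(k)
print(e2_min)                     # var_x*var_n/(N*var_x + var_n) = 1/9 ≈ 0.1111
```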
For the observations y_k = x + n_k the equation set (139) becomes:

(σ_x² + σ_n²) w_0 + σ_x² w_1 + ... + σ_x² w_{N−1} = σ_x²
σ_x² w_0 + (σ_x² + σ_n²) w_1 + ... + σ_x² w_{N−1} = σ_x²   (142)
...
σ_x² w_0 + σ_x² w_1 + ... + (σ_x² + σ_n²) w_{N−1} = σ_x²

By symmetry all the weights are equal:

w_0 = w_1 = ... = w_{N−1} = σ_x²/(N σ_x² + σ_n²) = 1/(N + (σ_n/σ_x)²),  thus  x̂ = (1/(N + (σ_n/σ_x)²)) Σ_{k=0}^{N−1} y_k,   (143)

and the error:

E{e²}_min = σ_x² (σ_n/σ_x)²/(N + (σ_n/σ_x)²) = σ_n²/(N + (σ_n/σ_x)²).   (144)

Remark: It is interesting to investigate the above result as a function of (σ_n/σ_x)².
Example 2: We take two samples of a linearly increasing function and estimate the slope. We use the estimator x̂ = Σ_{k=0}^{N−1} w_k y_k, where y_k = (k+1)x + n_k, k = 0,1. The correlation matrices:

R_yy(j,k) = E{((j+1)x + n_j)((k+1)x + n_k)} = (j+1)(k+1) E{x²} + E{n_j n_k} = (j+1)(k+1) S + σ_n² δ_{jk},
P_xy(j) = E{x y_j} = E{x((j+1)x + n_j)} = (j+1) E{x²} = (j+1) S,  where S = σ_x² + (E{x})².

Here we obviously do not suppose that the expected value of x would be zero. By replacing the above results into the first equation of (141):

(S + σ_n²) w_0 + 2S w_1 = S
2S w_0 + (4S + σ_n²) w_1 = 2S

from which w_0 = 1/(5 + σ_n²/S) and w_1 = 2/(5 + σ_n²/S). Let σ_n² ≪ S, therefore

w_0 ≅ 1/5,  w_1 ≅ 2/5,  and finally  x̂ = (y_0 + 2y_1)/5,  E{e²} ≅ σ_n²/5.   (145)
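A quick numerical cross-check of Example 2 (a sketch with the assumed values S = 1 and a very small σ_n², so that the exact weights approach 1/5 and 2/5):

```python
# Sketch: numerical check of Example 2 (slope from two samples y_k=(k+1)x+n_k).
# With var_n << S the weights tend to 1/5 and 2/5, i.e. x_hat = (y0 + 2*y1)/5.
import numpy as np

S, var_n = 1.0, 1e-6
R_yy = np.array([[1 * 1 * S + var_n, 1 * 2 * S],
                 [2 * 1 * S, 2 * 2 * S + var_n]])   # (j+1)(k+1)S + var_n*delta_jk
P_xy = np.array([1 * S, 2 * S])                     # (j+1)S

w0, w1 = np.linalg.solve(R_yy, P_xy)
print(round(w0, 3), round(w1, 3))                   # → 0.2 0.4
```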
Remark: An important advantage of a recursive procedure is that it is not necessary to wait for all the data: the estimate is computed continuously, and its quality becomes step-by-step better.
Let us return to the optimal estimator and write the time index n instead of N:

x̂(n) = Σ_{k=0}^{n−1} w_k(n) y(k),  w_k(n) = 1/(n + (σ_n/σ_x)²),  E{e²(n)} = E{(x − x̂(n))²} = σ_n²/(n + (σ_n/σ_x)²),

x̂(n+1) = Σ_{k=0}^{n} w_k(n+1) y(k),  w_k(n+1) = 1/(n + 1 + (σ_n/σ_x)²),  E{e²(n+1)} = E{(x − x̂(n+1))²} = σ_n²/(n + 1 + (σ_n/σ_x)²);

based on the above, an alternative form is w_k(n+1) = E{e²(n+1)}/σ_n². Following the steps of (146) we have:

x̂(n+1) = (1/(n + 1 + (σ_n/σ_x)²)) Σ_{k=0}^{n} y(k)
= ((n + (σ_n/σ_x)²)/(n + 1 + (σ_n/σ_x)²)) x̂(n) + (1/(n + 1 + (σ_n/σ_x)²)) y(n)
= x̂(n) + (1/(n + 1 + (σ_n/σ_x)²)) (y(n) − x̂(n))   (147)
It is interesting to see how the mean square error behaves depending on the amount of data. Based on the previous development:

E{e²(n+1)}/E{e²(n)} = w_k(n+1)/w_k(n) = (n + (σ_n/σ_x)²)/(n + 1 + (σ_n/σ_x)²) = 1/(1 + E{e²(n)}/σ_n²)   (148)

Using this, an alternative form of (147) can be:

x̂(n+1) = (E{e²(n+1)}/E{e²(n)}) x̂(n) + (E{e²(n+1)}/σ_n²) y(n) = x̂(n) + (E{e²(n+1)}/σ_n²)(y(n) − x̂(n))   (149)
Example 3: Given (σ_n/σ_x)² = 2. From the above equations: x̂(1) = y(0)/3, E{e²(1)} = σ_n²/3;
using (148): E{e²(2)} = E{e²(1)}/(1 + E{e²(1)}/σ_n²) = σ_n²/4, thus (149) for n = 1:

x̂(2) = (E{e²(2)}/E{e²(1)}) x̂(1) + (E{e²(2)}/σ_n²) y(1) = (3/4) x̂(1) + (1/4) y(1) = x̂(1) + (1/4)(y(1) − x̂(1)) = (1/4)(y(0) + y(1)),

E{e²(3)} = σ_n²/5,  x̂(3) = (4/5) x̂(2) + (1/5) y(2) = x̂(2) + (1/5)(y(2) − x̂(2)), etc.
Notation for the coefficients of (149): a(n+1) = E{e²(n+1)}/E{e²(n)}, b(n+1) = E{e²(n+1)}/σ_n²,

thus: x̂(n+1) = a(n+1) x̂(n) + b(n+1) y(n)   (150)

Note that a(n+1)/b(n+1) = 1/b(n), and using (148)

a(n+1) = 1/(1 + b(n)) = 1/(1 + b(n+1)/a(n+1)),  thus  a(n+1) = 1 − b(n+1), and

x̂(n+1) = x̂(n) + b(n+1)(y(n) − x̂(n)),   (151)

where b(n+1)(y(n) − x̂(n)) is the correction term.
Equation (151) is the recursive form of the optimum non-recursive estimator: in every step, due to a new measurement, it extends by a new dimension the subspace onto which the vector x is projected.
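The equivalence of the recursive form (151) and the batch optimum (143) can be illustrated numerically. A minimal sketch (the observations and (σ_n/σ_x)² = 2 are assumed for illustration; σ_n² is taken as the unit of the error variance):

```python
# Sketch: recursive estimator (151) with b(n+1) = E{e^2(n+1)}/sigma_n^2,
# checked against the batch optimum x_hat = sum(y_k)/(n + r), r = (sigma_n/sigma_x)^2.
r = 2.0                    # assumed (sigma_n/sigma_x)^2, as in Example 3
y = [1.0, 2.0, 0.5, 1.5]   # assumed observations

x_hat, e2 = 0.0, 1.0 / r   # E{e^2(0)} = sigma_x^2 (in units of sigma_n^2 = 1)
for yn in y:
    e2 = e2 / (1.0 + e2)               # (148): E{e^2(n+1)} = E{e^2(n)}/(1+E{e^2(n)})
    b = e2                             # b(n+1) = E{e^2(n+1)}/sigma_n^2
    x_hat = x_hat + b * (yn - x_hat)   # (151)

print(round(x_hat, 4), round(sum(y) / (len(y) + r), 4))   # both print 0.8333
```

The recursion reproduces the batch solution exactly, measurement by measurement.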
II. Optimal recursive estimator (scalar Kalman filter):
The optimal recursive estimator is based on a more detailed model. It is the simplest state-variable model, the excitation of which is provided by a noise process. It might have a deterministic excitation as well, but thanks to the superposition theorem this can be processed separately.
x(n) = a x(n−1) + w(n−1),

where {w(n)} is a zero-mean white noise process, i.e.:

E{w(n)} = 0,  E{w(n)w(j)} = σ_w² if n = j, and 0 if n ≠ j;  furthermore x(n) = 0 and w(n) = 0 for n < 0.
Remark: This is the so-called first-order autoregressive process: it depends in first order on the value available at the previous time instant.
E{x(n)} = 0,  E{x²(n)} = R_xx(0) = σ_x² = E{a² x²(n−1) + w²(n−1) + 2a x(n−1)w(n−1)}
= a² R_xx(0) + R_ww(0) = a² σ_x² + σ_w²,  from which  R_xx(0) = σ_w²/(1 − a²).   (152)

R_xx(1) = E{x(n)x(n+1)} = E{x(n)(a x(n) + w(n))} = a R_xx(0), because E{x(n)w(n)} = 0.
R_xx(2) = E{x(n)x(n+2)} = E{x(n)(a x(n+1) + w(n+1))} = a² R_xx(0); in general R_xx(k) = a^{|k|} R_xx(0).
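The moments (152) of the first-order autoregressive process can be checked by simulation. A sketch with assumed values a = 0.8 and σ_w² = 1:

```python
# Sketch: empirical check of (152) for the AR(1) process x(n) = a*x(n-1) + w(n-1):
# R_xx(0) = sigma_w^2/(1 - a^2) and R_xx(1) = a*R_xx(0). Values assumed.
import numpy as np

rng = np.random.default_rng(1)
a, var_w, N = 0.8, 1.0, 100_000
x = np.zeros(N)
for n in range(1, N):
    x[n] = a * x[n - 1] + rng.standard_normal()   # w ~ N(0, var_w)

print(round(x.var(), 2))                   # theory: 1/(1 - 0.64) ≈ 2.78
print(round(np.mean(x[:-1] * x[1:]), 2))   # theory: a*R_xx(0) ≈ 2.22
```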
Figure 30
The observation is noisy: for its description an additive noise source is used. Our assumptions concerning this noise are exactly the same as those for w(n). The two noise processes are independent of each other. The recursive estimator is the linear combination of the new observation and the previous estimate (see Figure 31):
Figure 31
x̂(n) = a(n) x̂(n−1) + b(n) y(n)   (154)
Remark: There exists a so-called predictor scheme as well: x̂(n+1) = α(n) x̂(n) + β(n) y(n)   (155)
We are looking for the optimum weights in (154) using the illustration in Figure 29. The approximation error is:

e(n) = x(n) − x̂(n),  E{e²(n)} = E{(x(n) − a(n) x̂(n−1) − b(n) y(n))²}.

Conditions of the optimum:

∂E{e²(n)}/∂a(n) = −2 E{e(n) x̂(n−1)} = 0,  ∂E{e²(n)}/∂b(n) = −2 E{e(n) y(n)} = 0.
These are the so-called orthogonality equations for the Kalman filter:
Measurement Theory: Lecture 8, 06.04.2017
Figure 32
Let us compute b(n):

E{e²(n)} = E{e(n)(x(n) − x̂(n))} = E{e(n)x(n)},

because the right-hand side of x̂(n) = a(n) x̂(n−1) + b(n) y(n) is orthogonal to e(n). Since y(n) = c x(n) + n(n), we have c x(n) = y(n) − n(n), and so

E{e²(n)} = (1/c) E{e(n)(y(n) − n(n))} = −(1/c) E{e(n)n(n)} = (b(n)/c) E{y(n)n(n)} = (b(n)/c) σ_n²,  where from  b(n) = c E{e²(n)}/σ_n².   (159)
Remark: This form is useless, because to compute the mean square of e(n) we already need b(n). We need a form in which only the mean square of e(n−1) or earlier is used. Let us introduce (158) into the mean square error:

E{e²(n)} = E{[x(n) − x̂(n)]²} = E{[x(n) − a x̂(n−1) − b(n)(y(n) − a c x̂(n−1))]²}   (160)

Writing in y(n) = c x(n) + n(n) and x(n) = a x(n−1) + w(n−1):

E{e²(n)} = E{[a x(n−1) + w(n−1) − a x̂(n−1) − b(n)(a c x(n−1) + c w(n−1) + n(n) − a c x̂(n−1))]²}
= E{[a(1 − c b(n)) e(n−1) + (1 − c b(n)) w(n−1) − b(n) n(n)]²}

After some manipulations, and observing that the expected values of the cross-products will be zero:

E{e²(n)} = a²(1 − c b(n))² E{e²(n−1)} + (1 − c b(n))² σ_w² + b²(n) σ_n².
The estimator (see (155)): x̂(n+1) = α(n) x̂(n) + β(n) y(n); with similar manipulations:

x̂(n+1) = a x̂(n) + β(n)(y(n) − c x̂(n))

The derivation consists of steps similar to the case of the recursive filter; the difference is that here we minimize p(n+1) = E{e²(n+1)}. The detailed development will be provided for the vector case.
Summary:
55
Measurement Theory: Lecture 8, 06.04.2017
Applying the system model x(n+1) = a x(n) + w(n) and the observation y(n) = c x(n) + n(n), the optimal recursive predictor is:
Remark: Figure 33 shows the block diagram of the optimal single-step predictor. Note that this scheme corresponds to the observer scheme introduced somewhat earlier.
Figure 33
Example: Scalar Kalman Filter: Let us suppose E{x(n)} = 0, c = 1, and x̂(0) = 0. Then x̂(1) = b(1) y(1). The value of b(1) comes from the orthogonality condition E{[x(1) − x̂(1)] y(1)} = 0, where y(1) = x(1) + n(1) and therefore x̂(1) = b(1)[x(1) + n(1)]. Thus the orthogonality condition:

E{[(1 − b(1)) x(1) − b(1) n(1)][x(1) + n(1)]} = (1 − b(1)) σ_x² − b(1) σ_n² = 0,

i.e. b(1) = σ_x²/(σ_x² + σ_n²). If e.g. σ_n² = σ_w² and a² = 1/2, then σ_x² = σ_w²/(1 − a²) = 2σ_n². Thus b(1) = 2/3.
Using (159), E{e²(1)} = b(1) σ_n² = (2/3) σ_n². Having this we can use (162):

b(2) = (a² E{e²(1)} + σ_w²)/(σ_n² + σ_w² + a² E{e²(1)}) = (1/3 + 1)/(2 + 1/3) = 4/7 ≈ 0.57.

With (159) and (162): E{e²(2)} = (4/7) σ_n² ≈ 0.57 σ_n²,  b(3) = 9/16,  E{e²(3)} = (9/16) σ_n² ≈ 0.562 σ_n².
If we continue this iteration, we reach the steady state where E{e²(k+1)} = E{e²(k)} = p. Using (159) and (162) with this p we get p² + 3p σ_n² − 2σ_n⁴ = 0, which holds for p ≈ 0.56 σ_n². It can be seen that already the third iteration step gives a result relatively close to the steady state.
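The iteration of the example can be reproduced in a few lines. A sketch assuming the example's values (c = 1, a² = 1/2, σ_w² = σ_n² = 1, hence σ_x² = 2):

```python
# Sketch: iterating the scalar Kalman gain of the example: b(n+1) via (162),
# E{e^2(n)} = b(n)*sigma_n^2 via (159) with c = 1. Values from the example.
a2, var_w, var_n = 0.5, 1.0, 1.0

b = 2.0 / (2.0 + var_n)   # b(1) = sigma_x^2/(sigma_x^2 + sigma_n^2) = 2/3
e2 = b * var_n            # (159): E{e^2(1)} = 2/3
for _ in range(20):
    b = (a2 * e2 + var_w) / (a2 * e2 + var_w + var_n)   # (162): 4/7, 9/16, ...
    e2 = b * var_n                                      # (159)

print(round(e2, 4))       # steady state p ≈ 0.5616 (root of p^2 + 3p - 2 = 0)
```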
Vector Kalman Filter:
The system model: x(n) = A x(n−1) + w(n−1), and the observation: y(n) = C x(n) + n(n). Both the system noise and the observation noise have zero mean and are white. Their correlation matrices: Q(n) = E{w(n) wᵀ(n)} (replaces σ_w²), R(n) = E{n(n) nᵀ(n)} (replaces σ_n²). The optimum recursive filter is:
2. The covariance matrix of the system noise is an additive component in (172), which
increases the covariance of the estimate. The mechanism behind this effect is that the
discrete w(n) values perturb x(n+1), i.e. the predicted value.
3. The covariance matrix of the observation noise also increases the covariance of the estimate, because the discrete n(n) values, via the gain G(n), perturb x̂(n+1), i.e. the predicted estimate.
Remark: (172) might have a more condensed form, because by expanding its first component we have:

P(n+1) = A P(n) Aᵀ − A P(n) Cᵀ Gᵀ(n) − G(n) C P(n) Aᵀ + G(n) C P(n) Cᵀ Gᵀ(n) + G(n) R(n) Gᵀ(n) + Q(n).   (173)
If we combine the fourth and the fifth components, and use (171), we have G(n)[C P(n) Cᵀ + R(n)] Gᵀ(n) = A P(n) Cᵀ Gᵀ(n), which equals the second component of (173) with opposite sign. Only the first, third and sixth components remain, thus:

P(n+1) = [A − G(n)C] P(n) Aᵀ + Q(n)   (174)
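The cancellation step leading to (174) can be verified numerically for arbitrary matrices. A sketch (matrix sizes and values are assumed; the gain is taken in the form G(n) = A P(n) Cᵀ[C P(n) Cᵀ + R(n)]⁻¹ used in the step above):

```python
# Sketch: numerical check that the condensed form (174) equals the expanded
# form (173) when the gain satisfies (171). Matrices assumed for illustration.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
C = rng.standard_normal((2, 3))
P = np.eye(3)                                # P(n), symmetric positive definite
Q = 0.1 * np.eye(3)
R = 0.5 * np.eye(2)

G = A @ P @ C.T @ np.linalg.inv(C @ P @ C.T + R)           # (171)
P_expanded = (A @ P @ A.T - A @ P @ C.T @ G.T - G @ C @ P @ A.T
              + G @ C @ P @ C.T @ G.T + G @ R @ G.T + Q)   # (173)
P_condensed = (A - G @ C) @ P @ A.T + Q                    # (174)

print(np.allclose(P_expanded, P_condensed))                # → True
```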
Summary:
Applying the system model x(n+1) = A x(n) + w(n) and the observation y(n) = C x(n) + n(n), the Kalman predictor is:
Remark: The optimal recursive estimator can be used for model fitting as well. The fitted model is again the adaptive linear combinator. By applying the notation used previously, and supposing Q(n) = 0, (175) will have the following form:

W(n+1) = W(n) + G(n)(y(n) − Xᵀ(n) W(n)) = W(n) + G(n) e(n)   (176)
G(n) = P(n) X(n)[Xᵀ(n) P(n) X(n) + R(n)]⁻¹   (177)
P(n+1) = (I − G(n) Xᵀ(n)) P(n)   (178)

where P(n) stands for the covariance matrix of the parameter estimation, P(n) = E{V(n) Vᵀ(n)}, and X(n) is the so-called regression vector. It is worth studying equations (108)-(121).
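Equations (176)-(178) can be run directly as a parameter estimator for the adaptive linear combinator. A minimal sketch (the true parameter vector, noise level and covariance initialization are assumed for illustration):

```python
# Sketch: model fitting with the Kalman-filter form (176)-(178), Q = 0,
# scalar observation y(n) = X^T(n) W* + noise, R(n) = r. Values assumed.
import numpy as np

rng = np.random.default_rng(2)
w_star = np.array([2.0, -1.0])   # assumed true parameters
W = np.zeros(2)
P = 100.0 * np.eye(2)            # large initial parameter-error covariance
r = 0.01                         # observation-noise variance R(n)

for n in range(500):
    X = rng.standard_normal(2)                    # regression vector X(n)
    y = X @ w_star + 0.1 * rng.standard_normal()  # noisy observation
    G = P @ X / (X @ P @ X + r)                   # (177)
    W = W + G * (y - X @ W)                       # (176)
    P = P - np.outer(G, X) @ P                    # (178): (I - G X^T) P

print(np.round(W, 2))            # ≈ [ 2. -1.]
```

With Q(n) = 0 this is exactly the recursive least squares scheme referred to in the text.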
Remark:
It is interesting to compare the observer in Figure 2 with the Kalman predictor in Figure 34.
Figure 34
1. Note that the first component of (172) is rather similar to (5); the difference is only the quadratic nature. Obviously it is also valid here that, in order to reduce the error, the state transition matrix of the error system F(n) = A − G(n)C should be contractive.
2. If the noise processes are stationary, then Q(n)=Q, R(n)=R.
3. Please note how the model of the observed system is built into the observer.