
Measurement Theory

Lecture Notes 1-8

Gábor Péceli

2017.
Measurement Theory: Lecture 1, 08.02.2017.

0. The subject Measurement Theory can be related to Data Science, which is a very hot topic
for every engineering discipline. The reason is that, due to the revolution of sensors, an incredible
amount of measured data is available, and these data must be processed in a very clever
way. During the lectures, we will discuss information processing methods to make decisions,
compute estimates and model our environment. We also keep an eye on the real-time
implementation aspects of these techniques when building subsystems of our CPS, IoT and
Industry 4.0 applications and their adaptive operating mechanisms.
1. Reminder: Within the subject Measurement Technology (VIMIA206), which is a BSc
subject, the foundations of measurement theory were already introduced. The key concepts
and terms to remember were the following:
- Measurement and modelling
- Model-fitting
- Measurement errors (modelling, transfer- and instrument error)
- error propagation (application of the differential)
- measuring structures (serial, parallel, feedback)
- transducer errors (null-point, load, temperature, calibration, etc. errors)
- accuracy of the devices (analogue, digital; component and frequency dependence)
- Basic measuring methods (difference, direct and indirect comparison, substitution,
swapping or Gauss method)
- Evaluation of measurement sequence (calculation of measurement error, uncertainty)
- Uncertainty calculation based on GUM (Guide to the Expression of Uncertainty in
Measurement)
The above concepts and topics are prerequisites; within the subject Measurement Theory
only model-fitting will be revisited.
2. Measurement procedure: part of the cognition process which contributes to the
improvement of our a priori knowledge. The improvement can be either higher precision or
new, additional information. Figure 1 helps the interpretation of this procedure. During the
measurement procedure, we would like to capture phenomena of the real world around us. This
is preferably based on features which, in some sense, show stability.
Obviously, such features are also abstractions. A significant role is played by the
- state variables (x), whose time-dependent changes can be related to energy
processes (voltage, pressure, temperature, speed, etc.),
- parameters (a), which characterize the intensity relations of the interactions, and the
- structures (S), which describe the relations of the system's components.

Figure 1: The space of reality is mapped by the observation process through a (noisy) channel into the space of observations; the inverse observation process maps the observations into the space of decisions and estimations. Uncertainty enters through the channel.


The Space of reality is an abstraction in which the values of the investigated features
correspond to one point of the space. Before the measurement, the coordinates of this point are
unknown. During measurements, we try to determine/measure these coordinates; however, as
is well known, this can be done only approximately, due to measurement errors. A
further difficulty is the limited availability of the feature to be measured, e.g. it may be impossible to
measure directly at the location of the feature. Therefore some kind of mapping is unavoidable. This
mapping is called observation. The path between the feature to be measured and the observation
is called the measuring channel.
Observation in case of deterministic measuring channel: Figure 2 presents a discrete-time
observer as an illustrative example. The equations describing the reality and the observation
are as follows:
x(n+1) = A x(n),    (1)
y(n) = C x(n),    (2)

Figure 2: Discrete-time Luenberger observer. The real system x(n+1) = A x(n), y(n) = C x(n) runs in parallel with its copy x̂(n+1) = A x̂(n) + G e(n), ŷ(n) = C x̂(n); the correction is driven by the error e(n) = y(n) − ŷ(n).
where x(n) is an N-dimensional state vector, A is an N×N dimensional state-transition matrix,
y(n) is an M-dimensional (M ≤ N) observation vector, and the observation matrix C is M×N
dimensional. Our goal is the estimation of the state vector x(n). This is fulfilled by the observer,
which tries to provide a copy of the reality in such a way that a correction/learning/adaptive
mechanism forces the observer to follow the reality. After convergence, the result of the
measuring procedure x̂(n) can be read from the observer. Within the observer, we apply the
copy of the state-transition and observation equations:
x̂(n+1) = A x̂(n) + G e(n),    (3)
ŷ(n) = C x̂(n),    (4)
where the correction matrix G is of N×M dimension, and e(n) = y(n) − ŷ(n). The matrix G is set in
such a way that x̂(n) → x(n). The difference of (1) and (3):
x(n+1) − x̂(n+1) = A x(n) − A x̂(n) − G e(n) = (A − GC)(x(n) − x̂(n)).    (5)
Introducing ε(n+1) = x(n+1) − x̂(n+1) and the notation F = A − GC, the state equation
of the so-called error system is:
ε(n+1) = F ε(n).    (6)
The correction matrix G should be designed in such a way that ε(n) → 0 as n → ∞, in favor of which
preferably ‖ε(n+1)‖ < ‖ε(n)‖ for all n, i.e. F reduces the length of the vector ε(n) in every step, which
means it is contractive.
Remarks:
1. The inequality for the error vector ε(n) is to be interpreted for its length (norm), and in the
scalar case for its absolute value.


2. Obviously, it is not necessary to require monotonicity of the reduction process to force the error
to zero; only the stability of the error system is needed, i.e. its convergence to zero in case
of zero excitation. This can be interpreted in such a way that the error system dissipates its
internal energy in order to reach the stable state. If dissipation is present in every step, then
the reduction of the error-vector length will be monotonic.
Special cases:
1. F = A − GC = 0. In this case G = AC⁻¹. This is possible if C is square (and invertible), i.e. the observation
has as many components as the state vector. In this case, it is obvious that we can calculate
the state vector in one step without iteration. This means that the observer, and within the
observer the copy of the system, will follow the observed (physical) system after a single
step.
2. Fᴺ = (A − GC)ᴺ = 0. In this case the error system converges in N steps:
x(N) − x̂(N) = (A − GC)ᴺ (x(0) − x̂(0)) = 0.    (7)
F matrices having the property Fᴺ = 0 are the so-called non-derogatory nilpotent matrices. The
eigenvalues of such matrices are all zero. Systems which can be characterized by such state-
transition matrices produce a finite impulse response (these are the so-called FIR systems),
since the initial error disappears in N steps. (Comment: if Fᴹ = 0, where M < N, then F is a
derogatory nilpotent matrix. In this case convergence is achieved in fewer than N steps.)
3. If Fᴺ = (A − GC)ᴺ ≠ 0, then for a stable error system the length of its state vector will decay
exponentially. Such an error system is stable if all its eigenvalues are within the unit
circle. Systems which can be characterized by such state-transition matrices produce an
infinite impulse response (these are the so-called IIR systems), since the initial error
disappears only after an infinite number of steps.
Examples:
Example 1: Given A = [[1, 0], [0, −1]] and C = [[1, 0], [0, 1]]. How to set G? Since C = I, G = AC⁻¹ = A.
Example 2: Given A = [[1, 0], [0, −1]] and C = [1  1]. How to set G = [g₀, g₁]ᵀ?
GC = [[g₀, g₀], [g₁, g₁]],   A − GC = [[1 − g₀, −g₀], [−g₁, −1 − g₁]].
G is given by (A − GC)² = 0:
(A − GC)² = [[1 − 2g₀ + g₀² + g₀g₁,  g₀² + g₀g₁], [g₁² + g₀g₁,  1 + 2g₁ + g₁² + g₀g₁]] = [[0, 0], [0, 0]].
By substituting the expressions of the off-diagonal entries into the expressions of the main diagonal
we receive: 1 − 2g₀ = 0 and 1 + 2g₁ = 0, from which g₀ = 0.5 and g₁ = −0.5.
Checking by substitution:
[[0.5, −0.5], [0.5, −0.5]] · [[0.5, −0.5], [0.5, −0.5]] = [[0, 0], [0, 0]].
Example 3: Let's calculate the eigenvalues of A − GC using the results of Example 2:
det[λI − (A − GC)] = det [[λ − 0.5, 0.5], [−0.5, λ + 0.5]] = (λ − 0.5)(λ + 0.5) + 0.25 = λ² − 0.25 + 0.25 = λ² = 0.
Both eigenvalues are zero.
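To make the behaviour of Example 2 concrete, the short Python sketch below (my own illustration, not part of the original notes) simulates the error system ε(n+1) = (A − GC)ε(n) with the values derived above and shows that an arbitrary initial error vanishes after two steps:

```python
import numpy as np

# System and observer gain from Example 2
A = np.array([[1.0, 0.0],
              [0.0, -1.0]])
C = np.array([[1.0, 1.0]])          # 1x2 observation matrix
G = np.array([[0.5], [-0.5]])       # 2x1 correction matrix

F = A - G @ C                        # error-system state-transition matrix
print("F^2 =\n", F @ F)              # nilpotent: F^2 = 0

eps = np.array([[3.0], [-7.0]])      # arbitrary initial estimation error
for n in range(3):
    print(f"||eps({n})|| = {np.linalg.norm(eps):.4f}")
    eps = F @ eps                    # eps(n+1) = F eps(n)
```

The printed norm is zero from the second step on, in agreement with (7).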
Comments:
1. This attribute is universally valid for systems capable of converging in a limited number of
steps, starting from an arbitrary initial state.
2. The transfer function of such systems is a degenerate rational function, which has
poles only in the origin:
H(z) = a₁z⁻¹ + a₂z⁻² + … + a_N z⁻ᴺ = (a_N + a_{N−1}z + a_{N−2}z² + … + a₁z^{N−1}) / zᴺ.    (8)
These are often called Finite Impulse Response (FIR) filters. The time-domain equivalent
of (8):
y(n) = a₁x(n−1) + a₂x(n−2) + … + a_N x(n−N),    (9)
where for the real-time computability of (9) only previous samples of x(n) are considered.
3. In Example 3 the condition valid for the eigenvalues can also be used for the determination
of g₀ and g₁:
det[λI − (A − GC)] = det [[λ − 1 + g₀, g₀], [g₁, λ + 1 + g₁]] = λ² + λ(g₀ + g₁) + g₀ − g₁ − 1 = 0.
From this: g₀ + g₁ = 0 and g₀ − g₁ = 1, hence g₀ = 0.5 and g₁ = −0.5.

Observation in case of a noisy channel: In this case our expectation is not ε(n) → 0, but
E[ε(n)εᵀ(n)] → min as n → ∞. With this definition, the state equation (6) of the error system is
replaced by:
E[ε(n+1)εᵀ(n+1)] = F E[ε(n)εᵀ(n)] Fᵀ.    (10)
This error matrix plays a central role in the operation of the Kalman predictor and filter. (R.E.
Kalman was a famous scientist with Hungarian origin.)
Remarks:
1. Both models seen on Figure 2 can have a common additional input excitation signal. Since
these models are linear, due to the superposition principle, the observer will converge also
in this case.
2. The observer on Figure 2 is called Luenberger observer. According to Luenberger almost
any system is an observer. The condition to be able to serve as an observer is simply that
the observer should be faster than the observed system; otherwise it is not possible to follow
the changes of the observed system properly.
3. In case of an impedance measuring bridge the unknown branch, containing the impedance to
be measured, is the physical model of the reality, while the branch containing the balancing
components corresponds to the tunable model within the observer. The tuning of the bridge,
i.e. the observer mechanism, is performed by the operator based on the difference of the
voltages at the dividing points of the branches.
Modelling noisy channels: To describe random events we use random variables and stochastic
processes. The random variable x(ξ) is a rule or function which assigns real numbers to
the events of the random event space. If we make a histogram of the occurrences of the samples
of a random variable, then we get a statistical characterization. This leads to the Probability
Density Function (PDF) f(x), see Figure 3.

Figure 3: Histogram of the (relative) occurrences of the samples, approximating the PDF.

Its integral is the so-called Probability Distribution Function

F(u) = ∫₋∞ᵘ f(v) dv = P(x ≤ u),    (11)

which tells us the probability that the random variable is not larger than u. The stochastic
process x(t, ξ) is a function which assigns a time function to the events of the random
event space, see Figure 4. The value of these functions at a given time instant (e.g. at t₀)
represents a random variable.

Figure 4: A stochastic process x(t, ξ): each event ξ selects a time function; at a fixed time instant t₀ the values x(t₀, ξ) form a random variable.

3. Decision theory basics


Example: Detection with radar. Binary or two-hypothesis decision: the question of presence
or absence is to be decided. The measuring scheme is given in Figure 5. The channel is noisy,
so from the very same phenomenon we generally get different observations. We have to decide,
based on one or more observations, which of the two possible hypotheses should be
accepted.

Figure 5: Decision scheme: source → (noisy) observation space Z → decision space.


Hypothesis H₀: the (hostile) object is absent.
Hypothesis H₁: the (hostile) object is present.
Possible errors:
We accept H₀, but H₁ is true. The probability of this error is P_M (miss probability);
We accept H₁, but H₀ is true. The probability of this error is P_F (false alarm probability).
As a first step, we take the histograms of the observations for the two cases (absent, present),
and based on these histograms we approximate the conditional probability density functions
f(z|H₀) and f(z|H₁). These functions are called channel characteristics. The condition
indicates the behavior corresponding to one of the two hypotheses. (Note that this is
information/data acquisition, i.e. some kind of learning.) The relation of the two conditional
density functions is given in Figure 6. We are looking for the decision threshold which is
optimal in some sense. Since the density functions overlap, it is not trivial to assign a
threshold. The possible strategy for this assignment depends on the available information.

Figure 6: The conditional densities f(z|H₀) and f(z|H₁) overlap; the decision threshold divides the z axis into the acceptance regions Z₀ and Z₁, with the miss probability P_M and the false alarm probability P_F given by the corresponding tail areas.

Two-hypothesis Bayesian Decision:
Conditions: 1. The so-called a priori probabilities are known: P(H₀) = P₀ and P(H₁) = P₁.
2. The channel characteristics are known: f(z|H₀) and f(z|H₁).
Let's define the costs:
C_ij is the cost of accepting hypothesis H_i while H_j is true.
Let's define the probabilities of occurrence:
P(H_i|H_j), where index i indicates the accepted hypothesis, while index j
denotes the true one. (Here i and j are 0 or 1.)
Our intent is the minimization of the mean risk (or cost):
R = C₀₀P₀P(H₀|H₀) + C₁₀P₀P(H₁|H₀) + C₀₁P₁P(H₀|H₁) + C₁₁P₁P(H₁|H₁).    (12)

Please note that in the case of the first two terms the true hypothesis is H₀,
while in the case of the second two terms it is H₁. The minimum of (12) can be obtained by
selecting a proper threshold. Let's denote the range of acceptance of H_i by Z_i. For these ranges let's
calculate the probabilities of occurrence:


P(H_i|H_j) = ∫_{Z_i} f(z|H_j) dz.    (13)

Using (13) we rewrite (12):

R = C₀₀P₀ ∫_{Z₀} f(z|H₀)dz + C₁₀P₀ ∫_{Z₁} f(z|H₀)dz + C₀₁P₁ ∫_{Z₀} f(z|H₁)dz + C₁₁P₁ ∫_{Z₁} f(z|H₁)dz.    (14)

Since the two acceptance ranges cover the complete event space, the integrals of the density
functions above one of the acceptance ranges can be replaced with one minus the integrals
above the other acceptance range. By replacing the integrals above Z1 with integrals above Z0,
(14) can be written in the following form:
R = C₁₀P₀ + C₁₁P₁ + P₀(C₀₀ − C₁₀) ∫_{Z₀} f(z|H₀)dz + P₁(C₀₁ − C₁₁) ∫_{Z₀} f(z|H₁)dz.    (15)

Let's suppose that C₁₀ > C₀₀ and C₀₁ > C₁₁, and consider the decision threshold value as an
independent input variable of the single-variable integral (15). The minimum of (15) is reached
if the decision threshold along the z axis is set to the value where
P₀(C₁₀ − C₀₀) f(z|H₀) = P₁(C₀₁ − C₁₁) f(z|H₁)    (16)
holds. If we deviate from the threshold value which meets (16), either to the left or to the right,
the mean risk R, i.e. the value of (15), will increase. This is illustrated by Figure 7, which
demonstrates how the value of (15) behaves if the threshold value is shifted to the
right or to the left of the optimum setting.

Figure 7: The weighted densities P₀(C₁₀−C₀₀) f(z|H₀) and P₁(C₀₁−C₁₁) f(z|H₁); moving the decision threshold away from their intersection in either direction increases the risk.

Rewriting (16):
f(z|H₁) / f(z|H₀) = P₀(C₁₀ − C₀₀) / [P₁(C₀₁ − C₁₁)] = λ₀,    (17)
i.e. the ratio of the two conditional density functions is an a priori given constant value. (In (17)
z takes the value of the decision threshold.) If we put the recently measured value into the function
Λ(z) = f(z|H₁) / f(z|H₀),    (18)
which is the so-called likelihood-ratio function, and if Λ(z) > λ₀, then the decision is H₁; if
Λ(z) < λ₀, then the decision is H₀. Written in a concise way:


Λ(z) ≷ λ₀   (decide H₁ if greater, H₀ if smaller).    (19)

This is called Bayesian decision rule or likelihood-ratio test.

Measurement Theory: Lecture 2, 15.02.2017.

3. Decision theory basics (cont.)


Comments on the two-hypothesis Bayesian Decision:
1. If the costs are set in such a way that (17) takes the value P₀/P₁, then the decision rule will
have the form P₁ f(z|H₁) ≷ P₀ f(z|H₀), which is equivalent to P(H₁|z) ≷ P(H₀|z), i.e. the decision
can be made using the a posteriori probabilities. This special case is called maximum a
posteriori (MAP) decision.
2. Instead of (19), the so-called log-likelihood-ratio test λ(z) = ln Λ(z) ≷ ln λ₀ can also be
used.
Example 1: Detection of a constant signal through a noisy channel: is the signal present or not?
Let's suppose that the observations are independent random variables with Gaussian
distribution, zero mean and variance σₙ². Hypothesis H₀ is that the observation takes the value
z_k = n_k, i.e. the signal is absent, and only the actual sample of the noise is observed. Hypothesis H₁
is that the observation takes the value z_k = a + n_k, i.e. the signal is present, and the sum of the
signal and the noise is observed. Index k runs through 0, 1, …, N−1, i.e. we consider N samples
simultaneously. For the decision the joint conditional probability density functions will be
applied. In case of a single observation the conditional density functions are as follows:
f(z|H₀) = (1/(√(2π) σₙ)) exp(−z_k²/(2σₙ²)),   f(z|H₁) = (1/(√(2π) σₙ)) exp(−(z_k − a)²/(2σₙ²)).    (20)
Since the observations are independent, the joint density function of the N observations is the
product of the single densities. The actual form of (19):
Λ(z) = ∏_{k=0}^{N−1} f(z_k|H₁)/f(z_k|H₀) = ∏_{k=0}^{N−1} exp(−(z_k − a)²/(2σₙ²)) / exp(−z_k²/(2σₙ²)) ≷ λ₀,    (21)
or the log-likelihood ratio:
λ(z) = ln Λ(z) = −∑_{k=0}^{N−1} (z_k − a)²/(2σₙ²) + ∑_{k=0}^{N−1} z_k²/(2σₙ²) = (a/σₙ²) ∑_{k=0}^{N−1} z_k − Na²/(2σₙ²) ≷ ln λ₀.    (22)
After reordering:
(1/N) ∑_{k=0}^{N−1} z_k ≷ (σₙ²/(Na)) ln λ₀ + a/2.    (23)

According to (23), for the test the mean of the observed values is to be calculated, and compared
to a threshold. The block diagram of the decision-making device can be seen on Figure 8.

Figure 8: Block diagram of the detector for a constant signal: the mean (1/N)∑ z_k of the samples is formed and fed to a threshold detector; output 1 means "present", 0 means "absent". The threshold depends on a, N, σₙ² and λ₀ according to (23).
Remarks:
1. If λ₀ = 1, then ln λ₀ = 0, therefore (1/N) ∑_{k=0}^{N−1} z_k ≷ a/2, i.e. the decision threshold is half of the
constant signal value. This is achieved e.g. if P₀ = P₁ = 0.5, C₀₀ = C₁₁ and C₁₀ = C₀₁.
2. If λ₀ < 1, then ln λ₀ < 0; in this case the decision threshold will be smaller than half of the
constant signal value. This can be achieved e.g. if P₀ < P₁, C₀₀ = C₁₁ and C₁₀ = C₀₁. In this case
the probability of the occurrence of the signal is higher, therefore the threshold will be lower;
in the opposite case it will be higher.
3. Note the effect of the noise variance, of the number of observations, and of the signal
level itself in expression (23).
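As a quick numerical illustration of Remark 3 (my addition, not part of the original notes; the signal level, noise variance and sample count are arbitrary choices), the following Python sketch simulates the detector of (23) and estimates the false-alarm and miss probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
a, sigma_n, N, lam0 = 1.0, 1.0, 16, 1.0   # assumed signal, noise std, samples, threshold ratio
thr = (sigma_n**2 / (N * a)) * np.log(lam0) + a / 2.0   # decision threshold from (23)

trials = 100_000
z0 = rng.normal(0.0, sigma_n, (trials, N))        # H0: noise only
z1 = a + rng.normal(0.0, sigma_n, (trials, N))    # H1: constant a plus noise

P_F = np.mean(z0.mean(axis=1) > thr)      # false alarm: decide H1 under H0
P_M = np.mean(z1.mean(axis=1) <= thr)     # miss: decide H0 under H1
print(f"threshold = {thr:.3f}, P_F ~ {P_F:.4f}, P_M ~ {P_M:.4f}")
```

Changing N, σₙ or a in the sketch shows directly how the two error probabilities react.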
Example 2: Detection of a signal of changing magnitude through a noisy channel: is the signal
present or not? Let's suppose that the observations are independent random variables with
Gaussian distribution, zero mean and variance σₙ². Hypothesis H₀ is that the observation takes
the value z_k = n_k, i.e. the signal is absent, and only the actual sample of the noise is observed.
Hypothesis H₁ is that the observation takes the value z_k = a_k + n_k, i.e. the signal is present, and
the sum of the signal and the noise is observed. Index k runs through 0, 1, …, N−1, i.e. we
consider N samples simultaneously. For the decision the joint conditional probability density
functions will be applied. In case of a single observation the conditional density functions are
as follows:
f(z|H₀) = (1/(√(2π) σₙ)) exp(−z_k²/(2σₙ²)),   f(z|H₁) = (1/(√(2π) σₙ)) exp(−(z_k − a_k)²/(2σₙ²)).    (24)
Since the observations are independent, the joint density function of the N observations is the
product of the single densities. The actual form of (19):
Λ(z) = ∏_{k=0}^{N−1} f(z_k|H₁)/f(z_k|H₀) = ∏_{k=0}^{N−1} exp(−(z_k − a_k)²/(2σₙ²)) / exp(−z_k²/(2σₙ²)) ≷ λ₀,    (25)
or the log-likelihood ratio:
λ(z) = ln Λ(z) = −∑_{k=0}^{N−1} (z_k − a_k)²/(2σₙ²) + ∑_{k=0}^{N−1} z_k²/(2σₙ²) = (1/σₙ²) ∑_{k=0}^{N−1} z_k a_k − (1/(2σₙ²)) ∑_{k=0}^{N−1} a_k² ≷ ln λ₀.    (26)
After reordering:
(1/N) ∑_{k=0}^{N−1} a_k z_k ≷ (σₙ²/N) ln λ₀ + (1/(2N)) ∑_{k=0}^{N−1} a_k².    (27)


According to (27), for the test the weighted mean of the observed values is to be calculated, and
compared to a threshold. The a priori known signal samples are the weights. The block diagram
of the decision-making device can be seen on Figure 9.
Figure 9: Block diagram of the detector for a known signal shape: the weighted mean (1/N)∑ a_k z_k is formed and fed to a threshold detector; output 1 means "present", 0 means "absent".
Remarks:
1. Note that (23) can easily be derived from (27) if all the signal samples are equal.
2. The signal weighting described by (27) is called matched filtering.
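A minimal Python sketch of the matched-filter statistic in (27) (my own illustration; the signal shape, noise level and threshold ratio are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 32
a = np.sin(2 * np.pi * np.arange(N) / N)        # assumed known signal shape a_k
sigma_n, lam0 = 0.5, 1.0

thr = (sigma_n**2 / N) * np.log(lam0) + np.sum(a**2) / (2 * N)   # threshold from (27)

z = a + rng.normal(0.0, sigma_n, N)             # one noisy observation under H1
statistic = np.mean(a * z)                       # matched-filter (weighted mean) statistic
decision = "H1: present" if statistic > thr else "H0: absent"
print(round(statistic, 4), round(thr, 4), decision)
```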
Example 3: Detection of a random-magnitude signal through a noisy channel: is the signal
present or not? Let's suppose that both the signal and the noise are discrete, stationary stochastic
processes with Gaussian distribution, zero mean, and variances σₐ² and σₙ², respectively.
Hypothesis H₀ is that the observation takes the value z_k = n_k, i.e. the signal is absent, and only the
actual sample of the noise is observed. Hypothesis H₁ is that the observation takes the value
z_k = a_k + n_k, i.e. the signal is present, and the sum of the signal and the noise is observed. Index
k runs through 0, 1, …, N−1, i.e. we consider N samples simultaneously. For the decision the joint
conditional probability density functions will be applied. In case of a single observation the
conditional density functions are as follows:
f(z|H₀) = (1/(√(2π) σₙ)) exp(−z_k²/(2σₙ²)),   f(z|H₁) = (1/(√(2π) √(σₐ² + σₙ²))) exp(−z_k²/(2(σₐ² + σₙ²))).    (28)
Since the observations are independent, the joint density function of the N observations is the
product of the single densities. The actual form of (19):
Λ(z) = ∏_{k=0}^{N−1} f(z_k|H₁)/f(z_k|H₀) = ∏_{k=0}^{N−1} (σₙ/√(σₐ² + σₙ²)) · exp(−z_k²/(2(σₐ² + σₙ²))) / exp(−z_k²/(2σₙ²)) ≷ λ₀,
or the log-likelihood ratio:
λ(z) = ln Λ(z) = −∑_{k=0}^{N−1} z_k²/(2(σₐ² + σₙ²)) + ∑_{k=0}^{N−1} z_k²/(2σₙ²) + (N/2) ln(σₙ²/(σₐ² + σₙ²)) = (σₐ²/(2σₙ²(σₐ² + σₙ²))) ∑_{k=0}^{N−1} z_k² + (N/2) ln(σₙ²/(σₐ² + σₙ²)) ≷ ln λ₀.    (29)
After reordering:
(1/N) ∑_{k=0}^{N−1} z_k² ≷ (2σₙ²(σₐ² + σₙ²)/σₐ²) [ (1/2) ln((σₐ² + σₙ²)/σₙ²) + (1/N) ln λ₀ ].    (30)

According to (30), for the test the mean of the squared sample values is to be calculated, and
compared to a threshold. The block diagram of the decision-making device can be seen on
Figure 10.
Remarks:


1. If λ₀ = 1 and σₐ² = σₙ², then (1/N) ∑_{k=0}^{N−1} z_k² ≷ 2σₙ² ln 2.
2. An alternative form of (30): (1/N) ∑_{k=0}^{N−1} z_k² ≷ 2σₙ² (1 + σₙ²/σₐ²) [ (1/2) ln(1 + σₐ²/σₙ²) + (1/N) ln λ₀ ], where the effect of
the ratio of the variances can be analyzed.


3. Figure 11 shows the location of the acceptance regions. It can be seen that they are
symmetrical, since the result does not depend on the sign of the observed value.

Figure 10: Block diagram of the detector for a random-magnitude signal: the mean of the squared samples (1/N)∑ z_k² is formed and fed to a threshold detector; output 1 means "present", 0 means "absent".
Figure 11: The conditional densities f(z|H₀) and f(z|H₁) and the acceptance regions: the central region around z = 0 belongs to H₀ (Z₀), the two symmetric outer regions belong to H₁ (Z₁).
Example 4: Bayesian decision with discrete probabilities. A student needs to decide which
course to take, based only on his/her first lecture. Define 3 categories of courses:
good, fair, bad. From previous experience, the student knows:
P(good) = 0.2, P(fair) = 0.4 and P(bad) = 0.4.
These are a priori probabilities. The student also knows the class conditionals: how much the
impressions from the lectures coincide with the categories. These are the conditional
probabilities which correspond to the channel characteristics:
P(interesting|good) = 0.8, P(interesting|fair) = 0.5, P(interesting|bad) = 0.1,
P(boring|good) = 0.2, P(boring|fair) = 0.5, P(boring|bad) = 0.9.
The cost/loss/risk function values:
C(taking|good) = 0, C(taking|fair) = 5, C(taking|bad) = 10,
C(not_taking|good) = 20, C(not_taking|fair) = 5, C(not_taking|bad) = 0.
The student wants to make an optimal decision; therefore he/she needs to minimize the
conditional risk. (The condition is the impression got at the first lecture.) The risk values are as
follows:
R(taking|interesting), R(taking|boring),
R(not_taking|interesting), R(not_taking|boring).
Let's calculate the first value: R(taking|interesting) =


= C(taking|good)·P(good|interesting) + C(taking|fair)·P(fair|interesting) + C(taking|bad)·P(bad|interesting).

The conditional probabilities here are the so-called a posteriori probabilities, which can be
calculated using the Bayes theorem:
P(good|interesting) = P(interesting|good)·P(good) / P(interesting),
P(fair|interesting) = P(interesting|fair)·P(fair) / P(interesting),
P(bad|interesting) = P(interesting|bad)·P(bad) / P(interesting).
Here the factors of the numerator are known; only P(interesting) is to be calculated:
P(interesting) = P(interesting|good)P(good) + P(interesting|fair)P(fair) + P(interesting|bad)P(bad)
= 0.8·0.2 + 0.5·0.4 + 0.1·0.4 = 0.4.
Thus:
P(boring) = 1 − 0.4 = 0.6, P(good|interesting) = 0.4, P(fair|interesting) = 0.5, P(bad|interesting) = 0.1.
If after the first lecture the impression of the student is that the lecture is interesting, then he/she
will compare the risk values R(taking|interesting) and R(not_taking|interesting):
R(taking|interesting) = 0·0.4 + 5·0.5 + 10·0.1 = 3.5,
R(not_taking|interesting) = 20·0.4 + 5·0.5 + 0·0.1 = 10.5,
i.e. the student will take the course, since this decision has the lower risk/cost. As an exercise, calculate the
risk/cost values also for the case when the impression from the first lecture is that it is boring.
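The following short Python sketch (an illustration I added; the numbers are those of Example 4) reproduces the posterior and risk calculation and also handles the "boring" case left as an exercise:

```python
prior = {"good": 0.2, "fair": 0.4, "bad": 0.4}
likelihood = {  # P(impression | category)
    "interesting": {"good": 0.8, "fair": 0.5, "bad": 0.1},
    "boring":      {"good": 0.2, "fair": 0.5, "bad": 0.9},
}
cost = {  # C(action | category)
    "taking":     {"good": 0, "fair": 5, "bad": 10},
    "not_taking": {"good": 20, "fair": 5, "bad": 0},
}

for impression in ("interesting", "boring"):
    evidence = sum(likelihood[impression][c] * prior[c] for c in prior)
    posterior = {c: likelihood[impression][c] * prior[c] / evidence for c in prior}
    risk = {a: sum(cost[a][c] * posterior[c] for c in prior) for a in cost}
    best = min(risk, key=risk.get)
    print(impression, {c: round(p, 3) for c, p in posterior.items()}, risk, "->", best)
```

For "interesting" it prints the risks 3.5 vs. 10.5 computed above; for "boring" it shows that not taking the course is the lower-risk decision.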
4. Estimation Theory Basics (Main references: Fundamentals of Statistical Signal Processing:
Estimation Theory, by S. M. Kay, Prentice-Hall, 1993, and the slides Estimation Theory by
Alireza Karimi, Laboratoire d'Automatique, MEC2 397, Spring 2011.)
The objective of parameter estimation is to determine an estimator â of an unknown
parameter (vector) a. Figure 12 illustrates this objective by indicating what kind of information
might help to solve this problem if a proves to be a random value.

Figure 12: Reality → Observations → Estimations: the parameter a with prior density f(a) is mapped through the channel characteristics f(z|a) into the observation z, from which the a posteriori density f(a|z) and the estimate â are formed.


Obviously it might happen that the unknown parameter is a deterministic value, in which case statistical
characterization of the parameter has no meaning. In the following we will first present methods based on
statistical characterizations, and afterwards we will continue with the deterministic solutions.
For what follows, it might be useful to recall the concepts of: (1) the density function and its
relation to measured data; (2) the conditional density function to characterize measuring
channels; and (3) the expected value and its calculation based on density functions. To
characterize the estimator the following measures are useful (see Figure 13):

Figure 13: The prior density f(a), the channel characteristics f(z|a) and the a posteriori density f(a|z).

1. Conditional expectation/expected value: E[â|a] = ∫ â f(â|a) dâ.    (31)
2. Conditional covariance matrix: cov[â, â|a] = E[(â − E(â|a))(â − E(â|a))ᵀ | a].    (32)
3. Conditional bias: b(a) = E[â|a] − a.    (33)
4. Mean Square Error (MSE): E[(â − a)(â − a)ᵀ | a] = cov[â, â|a] + b(a)bᵀ(a).    (34)
Comment: If the probability density function f(a) is known, then we can calculate:
5. Expected value without condition: E(â) = E[E[â|a]].    (35)
6. Covariance matrix without condition: cov[â, â] = E[cov[â, â|a]].    (36)

I. Bayesian estimation: Let's suppose that the density function f(a) of the observed parameter
and the channel characteristics f(z|a) are known. Having the observations, and using the Bayes
rule, the so-called a posteriori density function f(a|z) can be derived:
f(a|z) = f(z|a) f(a) / f(z),    (37)
where
f(z) = ∫ f(a) f(z|a) da.    (38)
The idea behind the Bayesian estimator is that the observations are performed on a given
realization of the parameter; therefore, this additional information might help to sharpen the
estimation.

Figure 14: The a posteriori density f(a|z) is typically narrower than the a priori density f(a).


Figure 14 shows that if the observations carry information about the parameter, then the a
posteriori density function will span a narrower range of the possible parameter values. The
best estimator is calculated using the a posteriori density function. The concept of "best" is
enforced by risk/cost functions:
R(â, a) = E[C(â, a)] = ∫ C(â, a) f(a|z) da,    (39)
the minimum of which is called the Bayes risk/cost:
R_B = min_â R(â, a) = min_â E[C(â, a)].    (40)
Here C(â, a) is the cost function, the widely-used alternatives of which are illustrated in Figure 15:
I. Quadratic: C(â, a) = ∑_{i=1}^{m} (â_i − a_i)² = (â − a)ᵀ(â − a).    (41)
II. Absolute: C(â, a) = ∑_{i=1}^{m} |â_i − a_i|.    (42)
III. Hit-or-Miss: C(â, a) = 0 if |â_i − a_i| ≤ Δ/2, and 1 otherwise.    (43)

Figure 15: The quadratic, absolute and hit-or-miss cost functions.
Minimum Mean Square Error Estimator (MMSE):
R(â, a) = E[(â − a)ᵀ(â − a)] = ∫ (â − a)ᵀ(â − a) f(a|z) da.    (44)
The minimum of (44) can be obtained from:
∂/∂â [ ∫ (â − a)ᵀ(â − a) f(a|z) da ] |_{â = â_MS} = 0.    (45)
Since ∂/∂â (â − a)ᵀ(â − a) = 2(â − a), and in (45) â can be placed in front of the integral, and the
integral of the density function is 1, therefore
â_MS = ∫ a f(a|z) da,    (46)
i.e. using the quadratic cost function, the best estimate is the a posteriori mean value.
Minimum Mean Absolute Error Estimator (the scalar case only):
R(â, a) = E[|â − a|] = ∫_{−∞}^{â} (â − a) f(a|z) da + ∫_{â}^{∞} (a − â) f(a|z) da.    (47)
By setting the derivative of the risk function equal to zero and using Leibniz's rule:
[ ∫_{−∞}^{â} f(a|z) da − ∫_{â}^{∞} f(a|z) da ] |_{â = â_ABS} = 0,    (48)
we get:
∫_{−∞}^{â_ABS} f(a|z) da = ∫_{â_ABS}^{∞} f(a|z) da,    (49)
i.e.
â_ABS = the median of f(a|z).    (50)
Median: area to the left = area to the right.
Maximum a posteriori (MAP) estimator:
R(â, a) = ∫_{−∞}^{â−Δ/2} f(a|z) da + ∫_{â+Δ/2}^{∞} f(a|z) da = 1 − ∫_{â−Δ/2}^{â+Δ/2} f(a|z) da.    (51)
If Δ is arbitrarily small but Δ ≠ 0, the optimal estimate is the location of the maximum of f(a|z),
i.e. the mode of the a posteriori density function:
â_MAP = location of the maximum of f(a|z).    (52)
Remarks:
1. The Bayesian estimates are always computed using the a posteriori density function.
2. The MS estimate is linear in the sense described below:
If b = Aa + c, then b̂_MS = A â_MS + c; furthermore E[a + b|z] = E[a|z] + E[b|z] = â_MS + b̂_MS.
Bayesian estimators in case of Gaussian distributions: Let's suppose that the unknown
parameter a and the observation noise have Gaussian distribution. Given E[a] = ā,
cov[a, a] = Σ_aa, E[n] = 0, cov[n, n] = Σ_nn. If everything has Gaussian distribution, then the
moments of the a posteriori density function can be given explicitly.
Let's suppose that the noisy observation can be described by the following expression:
z = Ua + n,    (53)
where dim a = p, dim z = q, dim U = q×p. U stands for the so-called observation matrix. The explicit
formula of the a posteriori mean value is:
â_MS = E[a|z] = ā + [Uᵀ Σ_nn⁻¹ U + Σ_aa⁻¹]⁻¹ Uᵀ Σ_nn⁻¹ (z − Uā),    (54)
where the first term is the a priori knowledge and the second term is the correction as a function of z.
Remark: â_MS = â_ABS = â_MAP, because the a posteriori density function is Gaussian.


Example 1: The resistance of a resistor is to be measured by measuring the voltage caused by
a known current through the resistor. We take N measurements: V = IR + noise. The
observed values are z_k = a + n_k, where k = 0, 1, …, N−1, a is the unknown parameter (the unknown
resistance), and n_k is a sample of the additive noise. Let's suppose that both the resistance of the
resistor (which is one element of a large set of resistors) and the observation noise are Gaussian
random variables. The observed samples of the noise are uncorrelated. Let's suppose that ā
and σₐ² are known, the mean value of the noise is μₙ = 0, and cov[n_k, n_j] = σₙ² δ_kj, where
δ_kj = 1 if k = j and 0 otherwise. In vector form: z = Ua + n, zᵀ = [z₀, z₁, …, z_{N−1}], nᵀ = [n₀, n₁, …, n_{N−1}],
Uᵀ = [1 1 … 1], Σ_nn = σₙ² I. Using (54) we have:

â_MS = ā + [N/σₙ² + 1/σₐ²]⁻¹ ( (1/σₙ²) ∑_{k=0}^{N−1} z_k − (N/σₙ²) ā ) = ā + (N σₐ²/σₙ²)/(1 + N σₐ²/σₙ²) · ( (1/N) ∑_{k=0}^{N−1} z_k − ā ).    (55)
Remarks:
1. Based on (55) the estimate can be interpreted in a prediction-correction form, the first term
of which is a prediction based on a priori knowledge, completed by a correction term
which introduces new information based on the measurements. The correction is proportional to the
difference between the mean of the measured values and the expected value of the parameter. The
weighting factor, depending on the value of N σₐ²/σₙ², is somewhere between zero and one. If
σₐ ≪ σₙ, then â_MS ≈ ā; if σₐ ≫ σₙ or N → ∞, then â_MS ≈ (1/N) ∑_{k=0}^{N−1} z_k.
2. The variance of the estimation error is the a posteriori covariance:
cov[a, a|z] = Σ_aa|z = [Uᵀ Σ_nn⁻¹ U + Σ_aa⁻¹]⁻¹ = σₐ²/(1 + N σₐ²/σₙ²) = var ã,    (56)
where ã = a − â. Depending on the value of N σₐ²/σₙ², it varies between σₐ² and σₙ²/N.
If σₐ ≪ σₙ, then var ã ≈ σₐ²; if σₐ ≫ σₙ or N → ∞, then var ã ≈ σₙ²/N.


3. The estimate is conditionally biased; only as N → ∞ does b(a) → 0:
b(a) = E[â_MS|a] − a = E[ (ā + (σₐ²/σₙ²) ∑_{k=0}^{N−1} z_k) / (1 + N σₐ²/σₙ²) | a ] − a = (ā − a) / (1 + N σₐ²/σₙ²).    (57)
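To illustrate (55)–(56), here is a small Python sketch I added (the values of ā, σₐ, σₙ and N are arbitrary assumptions): it draws a resistor from the prior, simulates N noisy readings and compares the Bayesian (MMSE) estimate with the plain sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
a_bar, sigma_a = 100.0, 5.0      # prior mean and std of the resistance (ohms)
sigma_n, N = 10.0, 8             # measurement noise std and number of readings

a_true = rng.normal(a_bar, sigma_a)              # one realization of the parameter
z = a_true + rng.normal(0.0, sigma_n, N)         # observations z_k = a + n_k

gamma = N * sigma_a**2 / sigma_n**2
a_ms = a_bar + gamma / (1 + gamma) * (z.mean() - a_bar)   # eq. (55)
var_err = sigma_a**2 / (1 + gamma)                        # eq. (56)

print(f"true a = {a_true:.2f}, sample mean = {z.mean():.2f}, "
      f"MMSE = {a_ms:.2f}, posterior std = {np.sqrt(var_err):.2f}")
```

The MMSE estimate is pulled from the sample mean toward the prior mean ā, exactly by the factor discussed in Remark 1.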

Measurement Theory: Lecture 3, 22.02.2017.

4. Estimation Theory Basics (Cont.)


Example 2: The unknown magnitude of a known signal is to be measured: z_k = a s_k + n_k, k =
0, 1, …, N−1. The unknown magnitude a and the noise samples n_k are Gaussian random variables with
known mean and variance: E[a] = ā, var a = σₐ², E[n_k] = 0, cov[n_i, n_j] = σₙ² δ_ij, cov[a, n_i] = 0
for all i and j.
Let's apply the maximum a posteriori (MAP) estimation method:
∂f(a|z)/∂a |_{a = â_MAP} = 0, and using (37): [∂ ln f(z|a)/∂a + ∂ ln f(a)/∂a] |_{a = â_MAP} = 0.    (58)

Here
f(a) = (1/(√(2π) σₐ)) exp(−(a − ā)²/(2σₐ²)),   f(z|a) = (2πσₙ²)^{−N/2} exp(−∑_{k=0}^{N−1} (z_k − a s_k)²/(2σₙ²)),
and the derivatives:
∂ ln f(a)/∂a = −(a − ā)/σₐ²,   ∂ ln f(z|a)/∂a = (1/σₙ²) ∑_{k=0}^{N−1} s_k (z_k − a s_k).    (59)
By substituting into (58):
[ (1/σₙ²) ∑_{k=0}^{N−1} s_k (z_k − a s_k) − (a − ā)/σₐ² ] |_{a = â_MAP} = 0,   from where   â_MAP = ( ā + (σₐ²/σₙ²) ∑_{k=0}^{N−1} s_k z_k ) / ( 1 + (σₐ²/σₙ²) ∑_{k=0}^{N−1} s_k² ).    (60)
Remarks:
1. Using the MAP estimator the application of (54) could be avoided.
2. The variance of the estimator:
var ã = σₐ² / ( 1 + (σₐ²/σₙ²) ∑_{k=0}^{N−1} s_k² ).    (61)
3. If s_k = 1 for all k, then we get the result of the previous example.
4. Here we can also identify the matched filter mentioned in Example 2 of the chapter on
decision theory.
5. Obviously â_MAP = â_MS.

II. Maximum likelihood (ML) estimation (a is stochastic): The a priori probability density
function of the value to be measured is unknown. In this case we suppose that the unknown a
priori density function spreads widely; therefore the a posteriori density function will coincide
with the channel characteristics. The optimal estimate will be the location where the channel
characteristics takes its maximum value:
∂f(z|a)/∂a |_{a = â_ML} = 0,   or   ∂ ln f(z|a)/∂a |_{a = â_ML} = 0.    (62)
Remark: Figure 16 illustrates the situation from the viewpoint of expression (37): the
numerator of the expression producing the a posteriori density function is the product of the
two functions indicated on the figure. Visibly, the location of the maximum can be given by (62).

Figure 16: The wide prior f(a) (approximately constant) and the channel characteristics f(z|a); their product, and hence the posterior, peaks where f(z|a) peaks.

III. Gauss-Markov (GM) estimation: A special case of maximum likelihood estimation,
where the observation noise is Gaussian and the observation is modelled by a linear equation.
We take N-dimensional observations; n stands for the N-dimensional noise vector:
E[n] = 0,   cov[n, n] = Σ_nn,   f(n) = (2π)^{−N/2} |Σ_nn|^{−1/2} exp(−½ nᵀ Σ_nn⁻¹ n),    (63)
where |Σ_nn| denotes the determinant of the matrix Σ_nn.
The observation equation is z = Ua + n, therefore the channel characteristics
f(z|a) = (2π)^{−N/2} |Σ_nn|^{−1/2} exp(−½ (z − Ua)ᵀ Σ_nn⁻¹ (z − Ua)),    (64)
which takes its maximum at:
∂ ln f(z|a)/∂a |_{a = â_GM} = −∂/∂a [(z − Ua)ᵀ Σ_nn⁻¹ (z − Ua)] |_{a = â_GM} = 0.    (65)
Based on (65): Uᵀ Σ_nn⁻¹ U â − Uᵀ Σ_nn⁻¹ z = 0, from which
â_GM = [Uᵀ Σ_nn⁻¹ U]⁻¹ Uᵀ Σ_nn⁻¹ z,    (66)
if [Uᵀ Σ_nn⁻¹ U]⁻¹ exists.
Remarks:
(1) By placing Σ_aa⁻¹ = 0 into the Bayes estimator (54) we get (66). Σ_aa⁻¹ = 0 means infinite variance.
(2) The Gauss-Markov estimator is unbiased.
Example: z_k = a + n_k, k = 0, 1, …, N−1, are independent observations, E[n_k] = 0,
cov[n_k, n_j] = σₙ² δ_kj. The channel characteristics is f(z|a) = (2πσₙ²)^{−N/2} exp(−∑_{k=0}^{N−1} (z_k − a)²/(2σₙ²)). The
Gauss-Markov estimate of the parameter a is the location of the maximum of the channel
characteristics:
∂ ln f(z|a)/∂a |_{a = â_ML = â_GM} = (1/σₙ²) ∑_{k=0}^{N−1} (z_k − â) = 0;
â_ML = â_GM = (1/N) ∑_{k=0}^{N−1} z_k,    (67)


i.e. if the observation equation is linear, and the channel noise is Gaussian, then the best (Gauss-
Markov) estimate is the simple mean of the samples.
IV. Minimum Variance Unbiased Estimation (MVUE): In the following, the parameter to
be measured is supposed to be deterministic. The estimation problem is the following: given
z = {z_k}, k = 0, 1, …, N−1, i.e. N measured values which depend on an unknown parameter a,
let's determine an estimator of a: â = g(z₀, z₁, …, z_{N−1}), where g is some function. The first
step is to find the probability density function (PDF) of the data as a function of a: f(z; a). (This
PDF is the channel characteristics.)
Example 1: Consider the problem of a DC level in white Gaussian noise with one observed data point:
z₀ = a + n₀, where n₀ has the PDF N(0, σ²). In this case the PDF of z₀:
f(z₀; a) = (1/√(2πσ²)) exp[−(z₀ − a)²/(2σ²)].
Example 2: Consider a data sequence that can be modeled with a linear trend in white Gaussian
noise: z_k = a + bk + n_k, k = 0, 1, …, N−1. Suppose that n_k is uncorrelated with all the
other samples, and its PDF is N(0, σ²). Letting θ = [a  b]ᵀ and z = [z₀, z₁, …, z_{N−1}]ᵀ, the PDF
is:
f(z; θ) = ∏_{k=0}^{N−1} f(z_k; θ) = (2πσ²)^{−N/2} exp[−(1/(2σ²)) ∑_{k=0}^{N−1} (z_k − a − bk)²].
Assessing Estimator Performance: Consider the problem of estimating a DC level a in
uncorrelated noise:
z_k = a + n_k, k = 0, 1, …, N−1.
Consider the following estimators: â₁ = (1/N) ∑_{k=0}^{N−1} z_k and â₂ = z₀. Suppose that a = 1, and in a
particular experiment â₁ = 0.95 and â₂ = 0.98. Which estimator is better? An estimator is a random variable, so its
performance can only be described by its PDF or statistically (e.g. by Monte-Carlo simulation).
Unbiased Estimator: An estimator that on the average yields the true value is unbiased.
Mathematically: E(â) = a (zero bias) for a₁ < a < a₂. Let's compute the expectation of the two
estimators â₁ and â₂:
E(â₁) = (1/N) ∑_{k=0}^{N−1} E(z_k) = (1/N) ∑_{k=0}^{N−1} (a + 0) = a,
E(â₂) = E(z₀) = E(a + n₀) = a + 0 = a.
Both estimators are unbiased. Which one is better? Let's compute the variance of the two
estimators:
var(â₁) = var[(1/N) ∑_{k=0}^{N−1} z_k] = (1/N²) ∑_{k=0}^{N−1} var(z_k) = Nσ²/N² = σ²/N,
var(â₂) = var(z₀) = σ² > var(â₁).
Remark: When several unbiased estimators of the same parameter obtained from independent sets of
data are available, i.e. â₀, â₁, …, â_{N−1}, a better estimator can be obtained by averaging:


â = (1/N) ∑_{i=0}^{N−1} â_i.
Assuming that the estimators have the same variance, we have:
var(â) = (1/N²) ∑_{i=0}^{N−1} var(â_i) = (1/N²) N var(â_i) = var(â_i)/N.
By increasing N, the variance will decrease (if N → ∞, var(â) → 0). This is not the case for biased
estimators, no matter how many estimators are averaged.
Minimum Variance Criterion: The most logical criterion is the Mean Square Error (MSE):
mse(â) = E[(â − a)²].
Unfortunately, this type of criterion leads to unrealizable estimators (the estimator will depend
on the unknown a). By introducing the expected value of the estimate:
mse(â) = E{[â − E(â) + E(â) − a]²} = E{[â − E(â) + b(a)]²},
where b(a) = E(â) − a is defined as the bias of the estimator. Therefore:
mse(â) = E{[â − E(â)]²} + 2b(a)E[â − E(â)] + b²(a) = var(â) + b²(a).
Instead of minimizing the MSE we can minimize the variance of the unbiased estimators: Minimum
Variance Unbiased Estimator.
Minimum Variance Unbiased (MVU) Estimator: In general, the MVU estimator does
not always exist. There may be no unbiased estimator, or none of the unbiased estimators has
uniformly minimum variance. There is no known procedure which always leads to the MVU
estimator. What can we do?
1. Determine the Cramer-Rao lower bound (CRLB) and check to see if some estimator satisfies
it.
2. Restrict to linear unbiased estimators.
V. Cramer-Rao Lower Bound (CRLB): a lower bound on the variance of any unbiased
estimator,
var(â) ≥ CRLB(a).
Note that the CRLB is a function of a. It tells us what is the best performance that can be
achieved. It may lead us to compute the MVU estimator.
CRLB Theorem: Assume that the PDF f(z; a) satisfies the regularity condition:
E[∂ ln f(z; a)/∂a] = 0 for all a.
Then the variance of any unbiased estimator satisfies:
var(â) ≥ 1 / ( −E[∂² ln f(z; a)/∂a²] ).
An unbiased estimator that attains the CRLB can be found iff:
∂ ln f(z; a)/∂a = I(a)(g(z) − a)
for some functions I(a) and g(z). The estimator is â = g(z), and the minimum variance is
1/I(a).
Example 1: Consider the estimation of a DC level in additive white Gaussian noise based on a
single measurement: z₀ = a + n₀, where the PDF of n₀ is N(0, σ²):
f(z₀; a) = (1/√(2πσ²)) exp[−(z₀ − a)²/(2σ²)],
ln f(z₀; a) = −ln√(2πσ²) − (z₀ − a)²/(2σ²).
Then:
∂ ln f(z₀; a)/∂a = (z₀ − a)/σ²,   −E[∂² ln f(z₀; a)/∂a²] = 1/σ².
According to the CRLB theorem:
var(â) ≥ σ²,   I(a) = 1/σ²,   â = g(z₀) = z₀.
Example 2: Consider the estimation of a DC level in additive white Gaussian noise based on
multiple observations: z_k = a + n_k, k = 0, 1, …, N−1, where the PDF of n_k is N(0, σ²)
and the samples are uncorrelated:
f(z; a) = (2πσ²)^{−N/2} exp[−(1/(2σ²)) ∑_{k=0}^{N−1} (z_k − a)²].
Then
∂ ln f(z; a)/∂a = ∂/∂a [ −ln(2πσ²)^{N/2} − (1/(2σ²)) ∑_{k=0}^{N−1} (z_k − a)² ] = (1/σ²) ∑_{k=0}^{N−1} (z_k − a) = (N/σ²) ( (1/N) ∑_{k=0}^{N−1} z_k − a ).
According to the CRLB theorem:
var(â) ≥ σ²/N,   I(a) = N/σ²,   â = g(z) = (1/N) ∑_{k=0}^{N−1} z_k.
The case of Linear Models with White Gaussian Noise (WGN):
If the N observed data samples can be modeled as
z = Ua + w,
where
z is the N×1 dimensional observation vector,
U is the N×M dimensional, known, rank-M observation matrix,
a is the M×1 dimensional vector of parameters to be estimated,
w is the N×1 dimensional noise vector with PDF N(0, σ²I),
then the procedure to be followed is:

Compute the CRLB and the MVU estimator that achieves this bound.
Step 1: Compute f(z; a).
Step 2: Compute I(a) = −E[∂² ln f(z; a)/∂a ∂aᵀ] and the covariance matrix of â: C(â) = I⁻¹(a).
Step 3: Find the MVU estimator g(z) by factoring ∂ ln f(z; a)/∂a = I(a)[g(z) − a].

These steps in the case of the above model (linear model with WGN):
Step 1: f(z; a) = (2πσ²)^{−N/2} exp[−(1/(2σ²)) (z − Ua)ᵀ(z − Ua)].
Step 2: ∂ ln f(z; a)/∂a = −(1/(2σ²)) [−2Uᵀz + 2UᵀUa] = (1/σ²) [Uᵀz − UᵀUa].
Then I(a) = −E[∂² ln f(z; a)/∂a ∂aᵀ] = UᵀU/σ².
Step 3: Find the MVU estimator g(z) by factoring:
∂ ln f(z; a)/∂a = I(a)[g(z) − a] = (UᵀU/σ²) [(UᵀU)⁻¹Uᵀz − a].
Therefore:
â = g(z) = (UᵀU)⁻¹Uᵀz,   C(â) = I⁻¹(a) = σ²(UᵀU)⁻¹.

I.e. for a linear model with WGN the MVU estimator is â = (UᵀU)⁻¹Uᵀz. This estimator is
efficient and attains the CRLB. The estimator is unbiased, as can be seen easily from
E(â) = (UᵀU)⁻¹Uᵀ(Ua + E[w]) = a.
The statistical performance of â is completely specified because â is a linear transformation of
a Gaussian vector and hence has a Gaussian distribution:
â ~ N(a, σ²(UᵀU)⁻¹).
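As a quick numerical check I added (the level, noise and sample count are arbitrary assumptions): for the DC level, U = [1, …, 1]ᵀ, so σ²(UᵀU)⁻¹ = σ²/N, and the sample mean should attain this CRLB. A short Monte-Carlo run confirms it:

```python
import numpy as np

rng = np.random.default_rng(3)
a, sigma, N, trials = 2.0, 1.5, 20, 200_000   # assumed true level, noise std, samples

z = a + rng.normal(0.0, sigma, (trials, N))
a_hat = z.mean(axis=1)                         # sample-mean estimator

print("empirical var :", a_hat.var())
print("CRLB sigma^2/N:", sigma**2 / N)         # the two numbers agree closely
```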

Examples:
1. Curve fitting: Consider fitting the data by a P-th order polynomial function of n:
x_n = a₀ + a₁n + a₂n² + … + a_P n^P + w_n,
where w_n is the n-th sample of the noise. We have N samples, so z = [x₀, x₁, …, x_{N−1}]ᵀ = Ua + w,
where a = [a₀, a₁, …, a_P]ᵀ and the n-th row of U is [1, n, n², …, n^P], n = 0, 1, …, N−1.
The MVU estimate is â = [UᵀU]⁻¹Uᵀz.
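A small Python sketch I added (second-order polynomial with arbitrarily chosen coefficients and noise level) that builds U for the curve-fitting example and computes â = (UᵀU)⁻¹Uᵀz:

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 50, 2
a_true = np.array([1.0, -0.5, 0.05])            # assumed a0, a1, a2
n = np.arange(N)

U = np.vander(n, P + 1, increasing=True)        # rows [1, n, n^2]
z = U @ a_true + rng.normal(0.0, 2.0, N)        # noisy samples x_n

a_hat = np.linalg.solve(U.T @ U, U.T @ z)       # MVU / least-squares estimate
print("true:", a_true, "estimated:", np.round(a_hat, 3))
```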

2. Fourier analysis: Consider the Fourier analysis of the data x_n:
x_n = ∑_{k=1}^{M} a_k cos(2πkn/N) + ∑_{k=1}^{M} b_k sin(2πkn/N) + w_n.
Remark: In this example the DC component is not fitted.
We have N samples: z = [x₀, x₁, …, x_{N−1}]ᵀ = Ua + w, where a = [a₁, …, a_M, b₁, …, b_M]ᵀ and the
n-th row of U is [cos(2πn/N), …, cos(2πMn/N), sin(2πn/N), …, sin(2πMn/N)], n = 0, 1, …, N−1.
â = [UᵀU]⁻¹Uᵀz.
Note that (UᵀU)⁻¹ = (2/N) I, therefore
â_k = (2/N) ∑_{n=0}^{N−1} x_n cos(2πkn/N),   b̂_k = (2/N) ∑_{n=0}^{N−1} x_n sin(2πkn/N).

Remarks:
(1) From the properties of linear models the estimates are unbiased.
(2) The covariance matrix:
cov(â) = σ_w² (UᵀU)⁻¹ = (2σ_w²/N) I.
The estimates are Gaussian random variables, and since their covariance matrix is diagonal, the
amplitude estimates are independent.
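A brief Python check I added (single harmonic, arbitrary amplitudes and noise) that the correlation formulas above recover the Fourier amplitudes:

```python
import numpy as np

rng = np.random.default_rng(5)
N, k = 64, 3
n = np.arange(N)
x = 1.2 * np.cos(2 * np.pi * k * n / N) - 0.7 * np.sin(2 * np.pi * k * n / N) \
    + rng.normal(0.0, 0.1, N)                      # assumed a_k = 1.2, b_k = -0.7

a_k = (2.0 / N) * np.sum(x * np.cos(2 * np.pi * k * n / N))
b_k = (2.0 / N) * np.sum(x * np.sin(2 * np.pi * k * n / N))
print(round(a_k, 3), round(b_k, 3))                # close to 1.2 and -0.7
```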
3. System identification: Consider the identification of a Finite Impulse Response (FIR) model of
order P, with input x_n and output y_n provided for n = 0, 1, …, N−1 (x_n = 0 if n < 0):
y_n = ∑_{k=0}^{P−1} a_k x_{n−k} + w_n,
rewritten in the z = Ua + w form: z = [y₀, y₁, …, y_{N−1}]ᵀ, a = [a₀, a₁, …, a_{P−1}]ᵀ, and the n-th row
of U is [x_n, x_{n−1}, …, x_{n−P+1}], i.e.
U = [[x₀, 0, …, 0], [x₁, x₀, …, 0], [x₂, x₁, …, 0], …, [x_{N−1}, x_{N−2}, …, x_{N−P}]],
â = [UᵀU]⁻¹Uᵀz.
Remark: The covariance matrix in case of WGN: cov(â) = σ_w² [UᵀU]⁻¹.

Measurement Theory: Lecture 4, 01.03.2017.

4. Estimation Theory Basics (cont.)


Examples (cont.):
4. Linear model with colored Gaussian noise: Determine the MVU estimator for the linear
model z = Ua + w in case of colored Gaussian noise w ~ N(0, C). (The covariance matrix
is not diagonal.)
The so-called whitening approach is applied: since C is positive definite, its inverse can be
factored as C⁻¹ = DᵀD, where D is an invertible matrix. This matrix acts as a whitening
transformation for w:
E[(Dw)(Dw)ᵀ] = E[DwwᵀDᵀ] = DCDᵀ = D D⁻¹(Dᵀ)⁻¹ Dᵀ = I,    (68)
i.e. the transformed noise is white and has unit variance. If the observation equation is
transformed by this matrix D, then
z' = Dz = DUa + Dw = U'a + w'.  Here w' = Dw ~ N(0, I),    (69)
i.e. white, and therefore we can compute the MVU estimator as:
â = [U'ᵀU']⁻¹U'ᵀz' = [UᵀDᵀDU]⁻¹UᵀDᵀDz = [UᵀC⁻¹U]⁻¹UᵀC⁻¹z,   C(â) = [UᵀC⁻¹U]⁻¹.    (70)
5. Linear model with known components: Consider the linear model z = Ua + s + w, where
s is a known signal. To determine the MVU estimator let z' = z − s; thus z' = Ua + w is a
standard linear model. The MVU estimator is â = [UᵀU]⁻¹Uᵀ(z − s), and for the case of
WGN: C(â) = σ_w²[UᵀU]⁻¹.
6. Linear model with known components: Consider a DC level and an exponential
in WGN: x_n = A + rⁿ + w_n. Here A is unknown, while r is a known constant. The MVU:
z = [x₀, x₁, …, x_{N−1}]ᵀ = 1·A + [1, r, …, r^{N−1}]ᵀ + w,   Â = (1/N) ∑_{n=0}^{N−1} (x_n − rⁿ),   var Â = σ_w²/N.    (71)
7. Estimation of Moving Average (MA) parameters: The y_n, n = 0, 1, …, N−1, output values are
linear combinations of input values seen through a sliding window:
y_n = ∑_{k=0}^{P−1} a_k x_{n−k}.    (72)
The observation is linear with WGN. Let's estimate the weights a_k: z = Ua + w;
z_n = y_n + w_n, n = 0, 1, …, N−1, zᵀ = [z₀, z₁, …, z_{N−1}], aᵀ = [a₀, a₁, …, a_{P−1}]. The MVU is:
â = [UᵀU]⁻¹Uᵀz.
Note that there is a slight difference if we compare this solution to that of Example 3: through
the sliding window we also see nonzero values with negative indices.
It is interesting to investigate the meaning of the matrix [UᵀU]⁻¹ and of the
vector Uᵀz. First let's reorder the matrix UᵀU as the sum of dyadic products:


UᵀU = ∑_{n=0}^{N−1} X_n X_nᵀ = ∑_{n=0}^{N−1} [[x_n², x_n x_{n−1}, …, x_n x_{n−P+1}], [x_{n−1}x_n, x_{n−1}², …, x_{n−1}x_{n−P+1}], …, [x_{n−P+1}x_n, x_{n−P+1}x_{n−1}, …, x_{n−P+1}²]].    (73)
In (73) X_nᵀ = [x_n, x_{n−1}, …, x_{n−P+1}]. Note that (73) is a matrix whose normalized elements
estimate the autocorrelation values of the discrete sequence x_n:
R̂_xx(k − p) = (1/N) ∑_{n=0}^{N−1} x_{n−k} x_{n−p},   k, p = 0, 1, …, P−1.    (74)
Similarly, for the vector Uᵀz:
Uᵀz = ∑_{n=0}^{N−1} [x_n z_n, x_{n−1} z_n, …, x_{n−P+1} z_n]ᵀ.    (75)
(75) is a vector whose normalized elements estimate the cross-correlation of the
discrete sequences x_n and z_n:
R̂_xz(k) = (1/N) ∑_{n=0}^{N−1} x_{n−k} z_n,   k = 0, 1, …, P−1.    (76)
Let's denote by R̂_xx the matrix composed from the elements of (74), and by R̂_xz the vector composed
from the elements of (76):
â = R̂_xx⁻¹ R̂_xz.    (77)
Remarks:
(1) Obviously (77) can be considered only a formal rewriting for the case of the actual example;
however, later we will see its meaning and importance.
(2) In case of real-time computations, instead of (72) the form y_n = ∑_{k=1}^{P} a_k x_{n−k} is to be used,
because to be able to compute the output we need a one-step delay.
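The sketch below (my illustration, with an assumed 3-tap filter and white input) estimates the MA weights both directly via â = (UᵀU)⁻¹Uᵀz and via the correlation form (77), showing that the two coincide:

```python
import numpy as np

rng = np.random.default_rng(6)
N, P = 400, 3
a_true = np.array([0.8, -0.4, 0.2])              # assumed MA weights
x = rng.normal(0.0, 1.0, N)
xp = np.concatenate([np.zeros(P - 1), x])        # x_n = 0 for n < 0

U = np.column_stack([xp[P - 1 - k : P - 1 - k + N] for k in range(P)])  # rows [x_n, x_{n-1}, x_{n-2}]
z = U @ a_true + rng.normal(0.0, 0.05, N)        # noisy MA output

a_direct = np.linalg.solve(U.T @ U, U.T @ z)     # (U^T U)^{-1} U^T z
Rxx = (U.T @ U) / N                              # estimated autocorrelation matrix, eq. (74)
Rxz = (U.T @ z) / N                              # estimated cross-correlation vector, eq. (76)
a_corr = np.linalg.solve(Rxx, Rxz)               # eq. (77)
print(np.round(a_direct, 3), np.round(a_corr, 3))
```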

VI. Best Linear Unbiased Estimator (BLUE):
- The MVU estimator does not always exist or may be impossible to find.
- The PDF of the data may be unknown.
The BLUE is a suboptimal estimator that:
- restricts the estimate to be linear in the data: â = hᵀz;
- restricts the estimate to be unbiased: E(â) = hᵀE(z) = a;
- minimizes the variance of the estimate;
- needs only the means and covariance of the data (not the PDF). As a result, in general, the
PDF of the estimate cannot be computed.
Finding the BLUE (scalar case):
- Assign to z_k, k = 0, 1, …, N−1, a linear estimator â = ∑_{k=0}^{N−1} h_k z_k = hᵀz, where hᵀ =
[h₀, h₁, …, h_{N−1}].
- Restrict the estimate to be unbiased, i.e. E(â) = ∑_{k=0}^{N−1} h_k E(z_k) = a.
- Minimize the variance: var(â) = E{(â − E(â))²} = E{(hᵀz − E(hᵀz))²} =
= E{hᵀ[z − E(z)][z − E(z)]ᵀh} = hᵀCh.
Example:
Consider the problem of amplitude estimation of a known signal in noise: z_k = a s_k + n_k.
The linear estimator: â = ∑_{k=0}^{N−1} h_k z_k = hᵀz. This is unbiased if
E(â) = hᵀE(z) = a hᵀs = a, therefore hᵀs = 1, where sᵀ = [s₀, s₁, …, s_{N−1}].
Minimize hᵀCh subject to hᵀs = 1. This constrained optimization can be solved using
Lagrange multipliers. We have to minimize
J = hᵀCh + λ(hᵀs − 1):
∂J/∂h = 2Ch + λs = 0, from which h = −(λ/2)C⁻¹s, which can be substituted into the constraint:
hᵀs = 1 = −(λ/2)sᵀC⁻¹s,  so  −λ/2 = 1/(sᵀC⁻¹s), which, after substituting back into the expression of h, gives h =
C⁻¹s/(sᵀC⁻¹s), and finally, using â = hᵀz, the optimal solution is:
â = sᵀC⁻¹z/(sᵀC⁻¹s),   var(â) = 1/(sᵀC⁻¹s).
Finding the BLUE (vector case): If
z = Ua + w,
where w is a noise vector with zero mean and covariance C (the PDF of w is arbitrary), then the BLUE of
a is:
â = (UᵀC⁻¹U)⁻¹UᵀC⁻¹z,
and the covariance of â is:
C(â) = (UᵀC⁻¹U)⁻¹.
Remark: If the noise is Gaussian then the BLUE is an MVU estimator.
Example: Consider the problem of a DC level in noise: z_k = a + n_k, where n_k has unspecified
PDF with var(n_k) = σ_k². Uᵀ = [1, 1, …, 1], so Ua = a·1. The covariance matrix is
C = diag(σ₀², σ₁², …, σ_{N−1}²),   C⁻¹ = diag(1/σ₀², 1/σ₁², …, 1/σ_{N−1}²),
and hence the BLUE is:
â = (UᵀC⁻¹U)⁻¹UᵀC⁻¹z = ( ∑_{k=0}^{N−1} 1/σ_k² )⁻¹ ∑_{k=0}^{N−1} z_k/σ_k²,
and the minimum covariance is:
var(â) = (UᵀC⁻¹U)⁻¹ = ( ∑_{k=0}^{N−1} 1/σ_k² )⁻¹.


Problems: The MVU estimator often does not exist or cannot be found. The BLUE is restricted to linear
models.
VII. Maximum likelihood estimation (MLE) (a is deterministic):
- can always be applied if the PDF is known;
- is optimal for large data size;
- is computationally complex and may require numerical methods.
Basic idea: Choose the parameter value that makes the observed data the most likely data to
have been observed.
Likelihood function: the PDF f(z; a) when a is regarded as a variable (not a parameter).
ML estimate: the value of a that maximizes the likelihood function.
Procedure: find the log-likelihood function ln f(z; a); differentiate w.r.t. a, set to zero and solve
for a.
Example: Consider a DC level in WGN with unknown variance: z_k = A + n_k. Suppose that A > 0
and σ² = A. The PDF is:
f(z; A) = (2πA)^{−N/2} exp[−(1/(2A)) ∑_{k=0}^{N−1} (z_k − A)²].
Taking the derivative of the log-likelihood function, we have:
∂ ln f(z; A)/∂A = −N/(2A) + (1/A) ∑_{k=0}^{N−1} (z_k − A) + (1/(2A²)) ∑_{k=0}^{N−1} (z_k − A)².
What is the CRLB? Does an MVU estimator exist? The MLE is found by setting the above equation
to zero:
Â = −1/2 + √( (1/N) ∑_{k=0}^{N−1} z_k² + 1/4 ).

Example: Consider a DC level in WGN: z_k = a + n_k.
The PDF is:
f(z; a) = (2πσ²)^{−N/2} exp[−(1/(2σ²)) ∑_{k=0}^{N−1} (z_k − a)²].
Taking the derivative of the log-likelihood function, we have:
∂ ln f(z; a)/∂a = (1/σ²) ∑_{k=0}^{N−1} (z_k − a) = 0,
which leads to:
â = (1/N) ∑_{k=0}^{N−1} z_k.
Remark: The result is the same as in the case of Gauss-Markov estimation, which is a special case
of maximum likelihood estimation of stochastic parameters.


VIII. Least Squares (LS) Estimation: In all the previous methods we assumed that the
measured signal is the sum of a true signal and a measurement error with a known probabilistic
model. In the least squares method we do not need a probabilistic assumption, only a
deterministic signal model:
z = s(a) + e,
where e represents the modeling and measurement errors. The objective is to minimize the
LS cost:
J(a) = ∑_{k=0}^{N−1} (z_k − s_k(a))².
Example: Estimate the DC level of a signal. We observe z_k = a + e_k for k = 0, 1, …, N−1,
and the LS criterion is:
J(a) = ∑_{k=0}^{N−1} (z_k − a)²,
∂J(a)/∂a = −2 ∑_{k=0}^{N−1} (z_k − a) = 0   ⇒   â = (1/N) ∑_{k=0}^{N−1} z_k.

Linear Least Squares:
Suppose that the observation model is linear: z = Ua + e. Then
J(a) = (z − Ua)ᵀ(z − Ua) = zᵀz − zᵀUa − aᵀUᵀz + aᵀUᵀUa = zᵀz − 2aᵀUᵀz + aᵀUᵀUa.
The gradient-equals-zero condition: ∂J(a)/∂a |_{a=â} = −2Uᵀz + 2UᵀUâ = 0  ⇒  â = [UᵀU]⁻¹Uᵀz.
The minimum LS cost is:
J_min = (z − Uâ)ᵀ(z − Uâ) = zᵀ(z − U[UᵀU]⁻¹Uᵀz) = zᵀ(z − Uâ).
Remarks:
1. Comparing different estimators for the linear model z = Ua + w:

Estimator — Assumption — Estimate
LSE — no probabilistic assumption — â = [UᵀU]⁻¹Uᵀz
BLUE — w is white with unit variance, unknown PDF — â = [UᵀU]⁻¹Uᵀz
MLE — w is white Gaussian noise — â = [UᵀU]⁻¹Uᵀz
MVUE — w is white Gaussian noise — â = [UᵀU]⁻¹Uᵀz

For MLE and MVUE the PDF of the estimate will be Gaussian.
2. Weighted Linear Least Squares:
The LS criterion can be modified by including a positive definite (symmetric) weighting
matrix Q:
J(a) = (z − Ua)ᵀQ(z − Ua),
which leads to the following estimator:
â = [UᵀQU]⁻¹UᵀQz.
The minimum LS cost:
J_min = (z − Uâ)ᵀQ(z − Uâ) = zᵀ(Q − QU[UᵀQU]⁻¹UᵀQ)z.
3. If we take Q = C⁻¹ = Σ_nn⁻¹, the inverse of the noise covariance matrix, then the weighted least
squares estimator is the BLUE and coincides with the Gauss-Markov (GM) estimate of (66), i.e. the GM
estimator is a weighted least squares estimator whose weights are given by the inverse of the
covariance matrix of the noise. However, there is no purely LS-based reason for this choice.
4. Fitting the observation model to data leads us to the general problem of model fitting. One
of the simplest cases of model fitting is the regression analysis problem, where, based on
independent-variable-value/function-value pairs, an approximating function is composed,
typically by fixing the structure of the function in advance and setting its parameters by
minimizing some cost function. See, as the simplest case, the linear regression problem.
5. At this point we formally end the introduction of estimation theory basics; however, in the
following we will still continue to deal with measurements, i.e. with determining the states and the
parameters of different phenomena, which typically involves estimation.

5. Model fitting
In the case of LS estimators we do not have a priori knowledge about the observations;
therefore what we practically do is nothing else than model fitting.
Regression analysis: In statistical modeling, regression analysis is a statistical process for
estimating the relationships among variables. It includes many techniques for modeling and
analyzing several variables, when the focus is on the relationship between a dependent variable
and one or more independent variables. Finding this relationship is a special case of model
fitting. In Figure 17 the function y = g(u, w) has two types of independent variables: the one
denoted by u(n), which in many applications can be considered as a discrete input time sequence that
can be set/influenced by the operator, and the one denoted by w(n), which cannot be influenced and
is typically unknown noise or disturbance.

Figure 17: The regression scheme: the reality y(n) = g(u(n), w(n)) and the model ŷ(n) = ĝ(u(n)) are driven by the same input u(n); the cost function formed from y(n) − ŷ(n) is minimized by tuning the model.
Remark:
In the following, n identifies an iteration step, or it is a discrete time index,
which sometimes takes the role of ordinary indexing as well. In the following u(n) ≡ u_n and
y(n) ≡ y_n are equivalent notations.


For modelling the unknown y = g(u, w) a function ŷ = ĝ(u) is used, which has the same
input and some tunable parameters that are set to minimize some cost function. Typically the
mean square error is used as the cost function:
E[(y − ŷ)ᵀ(y − ŷ)].    (78)
Regression analysis in case of fully specified statistics: If we know f(u, y), the joint
probability density function of u and y, then we face a Bayes estimation problem, the solution
of which is the a posteriori expected value:
ĝ(u) = E[y|u].    (79)
The curve [u, ĝ(u)] is the so-called regression curve of the variable y with respect to u. If the
input is a vector, then we have a regression surface.
Regression analysis with partially specified statistics: We do not know the joint density
function, only a limited number of moments.
Linear regression: The function to be fitted is the scalar linear function ĝ(u) = a₀ + a₁u, whose
parameters are to be selected to minimize E[(y − ĝ(u))²]. Let's suppose that the means ū and ȳ
and the standard deviations σ_u and σ_y are known, together with the normalized cross-
covariance: ρ = E[(u − ū)(y − ȳ)]/(σ_u σ_y). Minimize the cost function
J(a₀, a₁) = E[(y − a₀ − a₁u)²] = E[y²] + a₀² + a₁²E[u²] − 2a₀E[y] − 2a₁E[uy] + 2a₀a₁E[u]    (80)
with respect to a₀ and a₁:
∂J(a₀, a₁)/∂a₀ = 2a₀ − 2ȳ + 2a₁ū = 0, thus a₀ = ȳ − a₁ū, and    (81)
∂J(a₀, a₁)/∂a₁ = 2a₁(σ_u² + ū²) − 2(ρσ_uσ_y + ūȳ) + 2a₀ū = 0. By solving this set of equations:
a₀ = ȳ − ρ(σ_y/σ_u)ū,   a₁ = ρ(σ_y/σ_u).    (82)
Remarks:
1. To get (82) we used the relationships E[(u − ū)²] = σ_u², i.e. E[u²] = σ_u² + ū², and
E[(u − ū)(y − ȳ)] = E[uy] − ūȳ.
2. If we substitute the optimum values into (80), we get var[y − ĝ(u)] = σ_y²(1 − ρ²), which is the
variance of the approximation error, the minimum of the cost function. It is interesting to
investigate the relations as a function of 0 ≤ ρ ≤ 1. If the cross-covariance is 0, then we
get a₁ = 0, i.e. a horizontal line, which means that the best estimate of the output is the
expected value of the measured values. If the cross-covariance is 1 (100%), then y depends
only on u, i.e. it is independent of the noise w.
3. One possible generalization of the linear regression is the polynomial regression:
ĝ(u) = ∑_{k=0}^{N} a_k u^k,    (83)
which has the important property of being linear in its parameters. We prefer models linear
in their parameters, because in case of a quadratic cost function finding the minimum results
in solving a set of linear equations.

Linear regression based on measured data: With slight modifications, the procedure above
can be performed also in the case of not having a priori information. Then we can use the model
y_n = a₀ + a₁u_n + w_n,   z = Ua + w, as previously, with
z = [y₀, y₁, …, y_{N−1}]ᵀ, a = [a₀, a₁]ᵀ, and the n-th row of U equal to [1, u_n]. Then
UᵀU = [[N, ∑ u_n], [∑ u_n, ∑ u_n²]],   Uᵀz = [∑ y_n, ∑ u_n y_n]ᵀ,
and
[â₀, â₁]ᵀ = [UᵀU]⁻¹Uᵀz = 1/( (1/N)∑ u_n² − ((1/N)∑ u_n)² ) · [ (1/N)∑ u_n² · (1/N)∑ y_n − (1/N)∑ u_n · (1/N)∑ u_n y_n,   (1/N)∑ u_n y_n − (1/N)∑ u_n · (1/N)∑ y_n ]ᵀ,
where all sums run over n = 0, …, N−1.
Remark: In these expressions we can identify the statistical estimates of the moments used in
(82): if we rewrite the equations using the differences from the mean values, we reach
complete correspondence. Please do it as an exercise.
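The following Python sketch (my addition; the data are synthetic with arbitrarily chosen a₀, a₁ and noise) computes â₀, â₁ from the sample moments exactly as above:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 200
u = rng.uniform(0.0, 10.0, N)
y = 2.0 + 0.7 * u + rng.normal(0.0, 1.0, N)      # assumed a0 = 2.0, a1 = 0.7

mu, my = u.mean(), y.mean()
muu, muy = (u * u).mean(), (u * y).mean()
den = muu - mu**2                                 # (1/N)Σu² − ((1/N)Σu)²
a1 = (muy - mu * my) / den
a0 = (muu * my - mu * muy) / den                  # equivalently: my − a1 * mu
print(round(a0, 3), round(a1, 3))
```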
Generalization of the regression scheme: In Figure 18 model fitting is arranged according
to the regression scheme. The response y to the input u is approximated by the response ŷ of
the model, which is adjusted by minimizing some cost function.

Figure 18: Reality and model driven by the same input u(n); the cost function formed from y(n) and ŷ(n) is minimized to adjust the model.

It is interesting to compare this scheme with that of Figure 2. Let's redraw Figure 2 in the form
of Figure 19:

Figure 19: The observer scheme redrawn: reality and model in parallel, with the cost function formed from their outputs driving the model correction.
The two schemes are similar: in both cases model fitting is performed. In the observer scheme
we know the parameters and the system's states are estimated/measured, while in the regression
scheme (implicitly) we are familiar with the state and the parameters are estimated/measured.
Both schemes are parallel in the sense that the input is fed in a parallel way to the system and
to its approximator.


Remark: We can fit models also in serial form, when practically the so-called inverse model is fitted in such a way that the input is estimated (see Figure 20).
Figure 20: Serial (inverse) modelling scheme: Reality is followed by the Inverse Model, and the cost function compares the estimated input with the true input u(n).
The drawback of this approach is that in case of dynamic systems, due to the system delay, the estimated input û(n) should be predicted, or the input u(n) delayed.
Adaptive linear combinator: Figure 21 presents a widely-used model family fitted into the
generalized regression scheme.

Figure 21: Adaptive linear combinator. The input u(n) is mapped to the regression-vector components x_0(n), x_1(n), ..., x_{N-1}(n), which are weighted by w_0(n), ..., w_{N-1}(n) and summed to form ŷ(n); the minimization block tunes the weights using the error y(n) - ŷ(n).
In this model, from the discrete sequence u(n) the sequence of vectors $X(n) = [x_0(n)\; x_1(n)\; \cdots\; x_{N-1}(n)]^T$, the so-called regression vector, is generated, and then its components are linearly combined to produce the output sequence $\hat{y}(n)$. The most suitable values of the weights $W^T(n) = [w_0(n)\; w_1(n)\; \cdots\; w_{N-1}(n)]$ are derived by minimizing the mean square error:
$J(W(n)) = E\{[y(n) - X^T(n)W(n)]^T[y(n) - X^T(n)W(n)]\} = E\{y^T(n)y(n)\} - 2W^T(n)E\{X(n)y(n)\} + W^T(n)E\{X(n)X^T(n)\}W(n)$.  (84)
Let's denote $E\{X(n)y(n)\} = P$ and $E\{X(n)X^T(n)\} = R$. The cost is minimal if
$\frac{\partial J(W(n))}{\partial W(n)} = -2P + 2RW(n) = 0$,
thus the best weights are given by the so-called Wiener-Hopf equation:
$W^* = R^{-1}P$  (85)
Remarks:
1. By substituting (85) into (84):
$J_{min} = E\{y^T(n)y(n)\} - P^T R^{-1} P = E\{y^T(n)y(n)\} - P^T W^*$  (86)
$J(W(n)) = J_{min} + [W(n) - W^*]^T R [W(n) - W^*] = J_{min} + V^T(n) R V(n)$,  (87)
where $V(n) = [W(n) - W^*]$ is the vector of the parameter deviation or parameter error.
2. Equation (87) gives the mean squared error as a function of the parameters and of the parameter error. The equation is illustrated by Figure 22.

Figure 22: The paraboloid error surface over the parameter subspace.

At any point of the error surface, the change of the error with respect to the parameter change
can be characterized by the gradient of the surface:
J (W (n))
(n) 2R[W (n) W * ] 2RV (n) 2( RW (n) P) . (88)
W (n)
Equation (88) plays an important role in cost function minimization: to find the minimum we
will descend on the error surface.

Measurement Theory: Lecture 5, 08.03.2017.

5. Model fitting (cont.)


Example: Let $X^T(n) = [\sin(2\pi n/N)\;\; \sin(2\pi(n-1)/N)]$, i.e. two subsequent samples of a sine wave (see Figure 23). Both the regression and the parameter vectors are two dimensional. Here N denotes the number of signal samples within one single period, and $y(n) = 2\cos(2\pi n/N)$. How to set
$W^T(n) = [w_0(n)\;\; w_1(n)]$  (89)
to achieve minimum mean square error? The matrix R and the vector P can be derived by averaging the sine and cosine products over complete periods (N > 2):
$R = \begin{bmatrix} 0.5 & 0.5\cos\frac{2\pi}{N} \\ 0.5\cos\frac{2\pi}{N} & 0.5 \end{bmatrix}, \quad P = \begin{bmatrix} 0 \\ -\sin\frac{2\pi}{N} \end{bmatrix}.$  (90)
To get (90) the following equalities were used:
$E\{\sin^2(2\pi n/N)\} = E\{\sin^2(2\pi(n-1)/N)\} = 0.5$, $\quad E\{\sin(2\pi n/N)\sin(2\pi(n-1)/N)\} = 0.5\cos\frac{2\pi}{N}$,
$E\{2\sin(2\pi n/N)\cos(2\pi n/N)\} = 0$, $\quad E\{2\sin(2\pi(n-1)/N)\cos(2\pi n/N)\} = -\sin\frac{2\pi}{N}$.
Applying (85):
$R^{-1} = \frac{1}{0.25\sin^2\frac{2\pi}{N}}\begin{bmatrix} 0.5 & -0.5\cos\frac{2\pi}{N} \\ -0.5\cos\frac{2\pi}{N} & 0.5 \end{bmatrix}, \quad W^* = R^{-1}P = \begin{bmatrix} \frac{2}{\tan(2\pi/N)} \\ \frac{-2}{\sin(2\pi/N)} \end{bmatrix}$  (91)
sin( 2 / N )
Remarks:
1. Since with the linear combination of sine samples it is possible to generate cosine samples, for this example $J_{min} = 0$, i.e. on Figure 22 the lowest point of the paraboloid reaches the subspace of the parameters.
2. In this example, we calculate the moving average (see (72)) from samples of known waveforms. Here P = 2, i.e. we have a sliding window of size 2. Averaging over the complete period corresponds to N in calculating the weights of the moving average.
3. $X^T(n)W^* = \frac{2\sin(2\pi n/N)}{\tan(2\pi/N)} - \frac{2\sin(2\pi(n-1)/N)}{\sin(2\pi/N)} = 2\cos(2\pi n/N)$, i.e. we can exactly reproduce y(n).

Figure 23: Generation of the regression vector in the example: the components $x_0(n) = \sin(2\pi n/N)$ and $x_1(n) = \sin(2\pi(n-1)/N)$ are linearly combined to approximate $y(n) = 2\cos(2\pi n/N)$.
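A small numerical check of this example (a sketch only, not from the notes; N = 16 is an arbitrary choice) estimates R and P by averaging over one full period and applies (85):

```python
# Minimal sketch (assumed example): numerical check of the sine example,
# estimating R and P over a complete period and applying (85).
import numpy as np

N = 16
n = np.arange(N)
X = np.stack([np.sin(2 * np.pi * n / N),
              np.sin(2 * np.pi * (n - 1) / N)])      # regression vectors, shape (2, N)
y = 2 * np.cos(2 * np.pi * n / N)                     # desired output

R = X @ X.T / N                 # E{X X^T} estimated over one period
P = X @ y / N                   # E{X y}
W_star = np.linalg.solve(R, P)  # Wiener-Hopf solution (85)

print(W_star)                        # ~[ 2/tan(2*pi/N), -2/sin(2*pi/N) ]
print(np.allclose(X.T @ W_star, y))  # True: J_min = 0 for this example
```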


Towards adaptive processing methods: Based on (85) and (88): $W^* = R^{-1}P$, $\nabla(n) = 2(RW(n) - P)$. If we multiply both sides of this latter by $\frac{1}{2}R^{-1}$, we have:
$W^* = W(n) - \frac{1}{2}R^{-1}\nabla(n)$.  (92)
If our knowledge about the matrix R is not perfect, and therefore the situation is the same with the gradient, (92) can be rewritten into an iterative form, because we are unable to reach the optimum in a single step:
$W(n+1) = W(n) - \frac{1}{2}R^{-1}\nabla(n)$.
With the introduction of the convergence factor $0 < \mu < 1$ into (92):
$W(n+1) = W(n) - \mu R^{-1}\nabla(n)$.  (93)
Remarks:
1. If the matrix R and the gradient are perfectly known, then $\mu = \frac{1}{2}$ provides single-step convergence from any initial parameter value W(n).
2. Since $\nabla(n) = 2R[W(n) - W^*]$, this can be substituted into (93), and subtracting the optimum $W^*$ from both sides of (93) we have:
$W(n+1) - W^* = (1 - 2\mu)(W(n) - W^*)$, i.e. $V(n+1) = (1 - 2\mu)^{n+1}V(0)$,
i.e. the initial error decays exponentially if $|1 - 2\mu| < 1$. The error decreases in a monotonic way if $0 < \mu \le 0.5$, otherwise with alternating sign, in an oscillating way.
3. We will distinguish the different gradient methods of model fitting depending on the a priori knowledge available for using (93).
If the matrices R and P are known, the equations describing the operation of the adaptive combinator are as follows:
$W(n+1) = W(n) - \mu R^{-1}\nabla(n)$ and $V(n+1) = (1 - 2\mu)V(n)$.  (94)


Remarks:
1. In the following we will investigate what we can do if our a priori knowledge concerning the R and P matrices is partial, completely lacking, or can be based on ongoing measurements only. These considerations will be continuously present in our thinking, and they are important for understanding what follows.
2. Note that matrix R provides global information about the error surface, while the gradient $\nabla(n)$ for a given parameter vector W(n) gives only a local characterization. Using gradient methods, we descend on the error surface to reach the position of minimal mean squared error.
Investigation of matrix R: The error surface depends on R. First, we will show under what conditions it is possible to minimize the error by sequentially reaching local minima, changing only one parameter at a time, while the error does not increase. To have this property we need a coordinate system whose axes are in the directions of the principal axes of the paraboloid-shaped error surface. The axes of such a coordinate system point in the directions of the eigenvectors of matrix R.
$J(W(n)) = J_{min} + (W(n) - W^*)^T R(W(n) - W^*) = J_{min} + V^T(n)RV(n)$  (95)
The eigenvalue/eigenvector system of R plays an important role. Let's see this in the case of (90), where
$R = \begin{bmatrix} 0.5 & 0.5\cos\frac{2\pi}{N} \\ 0.5\cos\frac{2\pi}{N} & 0.5 \end{bmatrix}$. The roots of $\det\{\lambda I - R\} = 0$ give the eigenvalues:
$(\lambda - 0.5)^2 - 0.25\cos^2\frac{2\pi}{N} = \lambda^2 - \lambda + 0.25\sin^2\frac{2\pi}{N} = 0$  (96)
The two roots are:
$\lambda_0 = 0.5 + 0.5\cos\frac{2\pi}{N}$, and $\lambda_1 = 0.5 - 0.5\cos\frac{2\pi}{N}$  (97)
The eigenvectors can be derived from the equations $RQ_0 = \lambda_0 Q_0$, $RQ_1 = \lambda_1 Q_1$:
$\begin{bmatrix} 0.5 & 0.5\cos\frac{2\pi}{N} \\ 0.5\cos\frac{2\pi}{N} & 0.5 \end{bmatrix}\begin{bmatrix} q_{00} \\ q_{01} \end{bmatrix} = \left(0.5 + 0.5\cos\frac{2\pi}{N}\right)\begin{bmatrix} q_{00} \\ q_{01} \end{bmatrix} \;\Rightarrow\; q_{00} = q_{01},$  (98)
$\begin{bmatrix} 0.5 & 0.5\cos\frac{2\pi}{N} \\ 0.5\cos\frac{2\pi}{N} & 0.5 \end{bmatrix}\begin{bmatrix} q_{10} \\ q_{11} \end{bmatrix} = \left(0.5 - 0.5\cos\frac{2\pi}{N}\right)\begin{bmatrix} q_{10} \\ q_{11} \end{bmatrix} \;\Rightarrow\; q_{10} = -q_{11}.$  (99)
The eigenvectors normed to unit length: $Q_0 = \begin{bmatrix} \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} \end{bmatrix}$, $Q_1 = \begin{bmatrix} \frac{\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} \end{bmatrix}$, see Figure 24.  (100)
Figure 24: The eigenvectors Q_0 and Q_1 of R drawn in the (w_0, w_1) parameter plane.
In this simple example the eigenvectors are orthogonal, and their angle to the coordinate axes is 45°. These eigenvectors show those directions along which the descent can be performed by changing a single parameter at a time.
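A brief numerical confirmation (a sketch only, not from the notes; N = 16 assumed) of the eigenvalues (97), the 45° eigenvectors (100) and the normal form introduced below as (102):

```python
# Minimal sketch (assumed example): eigen-decomposition of the 2x2 R of (90).
import numpy as np

N = 16
c = 0.5 * np.cos(2 * np.pi / N)
R = np.array([[0.5, c],
              [c, 0.5]])

eigvals, Q = np.linalg.eigh(R)      # R is symmetric, so eigh applies
print(eigvals)                       # 0.5 -/+ 0.5*cos(2*pi/N), cf. (97)
print(Q)                             # columns ~ [1, -1]/sqrt(2) and [1, 1]/sqrt(2)
print(np.allclose(Q @ np.diag(eigvals) @ Q.T, R))   # normal form: R = Q Lambda Q^T
```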
In general, $\det(R - \lambda I) = 0 \Rightarrow \lambda_0, \lambda_1, ..., \lambda_{N-1}$, and $(R - \lambda_n I)Q_n = 0$, $n = 0,1,...,N-1$. By ordering the eigenvectors into a matrix:
$R\underbrace{[Q_0\; Q_1\; \cdots\; Q_{N-1}]}_{Q} = \underbrace{[Q_0\; Q_1\; \cdots\; Q_{N-1}]}_{Q}\underbrace{\mathrm{diag}\{\lambda_0\; \lambda_1\; \cdots\; \lambda_{N-1}\}}_{\Lambda}$  (101)
$RQ = Q\Lambda$, or
$R = Q\Lambda Q^{-1}$,  (102)
which is called the normal form of R. Since R by definition is a symmetric matrix, $R = R^T$. It is an important property that in such a case the eigenvectors are orthogonal: $Q_i^T Q_j = 0$ if $i \ne j$, otherwise $Q_i^T Q_i = c_i$. If $Q_i^T Q_i = 1$ for all i, then the eigenvectors are orthonormal, and $Q^T Q = I$, i.e. $Q^{-1} = Q^T$.

Proof of the orthogonality: By definition $Q_i^T R^T = \lambda_i Q_i^T$ and $RQ_j = \lambda_j Q_j$. Multiplying both sides of the first equation by $Q_j$ from the right, and both sides of the second equation by $Q_i^T$ from the left: $Q_i^T R^T Q_j = \lambda_i Q_i^T Q_j$ and $Q_i^T R Q_j = \lambda_j Q_i^T Q_j$. Since $R = R^T$, the left-hand sides of the two equations are equal, thus $\lambda_i Q_i^T Q_j = \lambda_j Q_i^T Q_j$. Since $\lambda_i \ne \lambda_j$, the equality can hold only if $Q_i^T Q_j = 0$.

Remarks:
1. Since the quadratic form $V^T RV$ is positive (semi)definite, the eigenvalues of R are nonnegative.
2. The eigenvectors of the matrix R designate the principal axes of the error surface:
$J(W(n)) = J_{min} + (W(n) - W^*)^T R(W(n) - W^*) = J_{min} + V^T(n)RV(n) = J_{min} + V^T(n)Q\Lambda Q^T V(n) = J_{min} + \underbrace{[Q^T V(n)]^T}_{V'^T(n)}\Lambda\underbrace{[Q^T V(n)]}_{V'(n)} = J_{min} + V'^T(n)\Lambda V'(n)$.  (103)
Thus $\nabla'(n) = 2\Lambda V'(n) = 2[\lambda_0 v_0'\;\; \lambda_1 v_1'\;\; \cdots\;\; \lambda_{N-1}v_{N-1}']^T$.  (104)
This is illustrated on Figure 25: The coordinates of the parameter-error vectors, which were
transformed by the matrix of the eigenvectors, can be interpreted along the principal axes of the
paraboloid.
Figure 25: Contours of the error surface in the transformed coordinates $v_0'$, $v_1'$ along the principal axes of the paraboloid.
In this case the optimization can be performed as a sequence of single-variable optimizations. This is illustrated by the following example, where the optimum is approached by descending along the gradient:
Example:
Single-variable case: $w(n+1) = w(n) + \mu(-\nabla(n))$, $\nabla(n) = 2\lambda(w(n) - w^*)$, where $R = \lambda$, and the parameter error:
$w(n+1) - w^* = (1 - 2\mu\lambda)(w(n) - w^*)$, i.e. $V(n+1) = rV(n) = r^{n+1}V(0)$.
Convergence requires $|r| = |1 - 2\mu\lambda| < 1$. Thus:

$0 < \mu < \frac{1}{\lambda}$  (105)
If $0 < \mu < \frac{1}{2\lambda}$, then the iteration procedure is overdamped; if $\mu = \frac{1}{2\lambda}$, then it is critically damped; while if $\frac{1}{2\lambda} < \mu < \frac{1}{\lambda}$, then it is underdamped.
Remark: Note that in the single-variable case $R = \lambda$, i.e. the first part of (94) has the form $W(n+1) = W(n) - 2\mu(W(n) - W^*)$, and after subtracting $W^*$ from both sides we get the second part of (94).
Multi-variable case: $V'(n+1) = (I - 2\mu\Lambda)^{n+1}V'(0)$. In case of applying a single scalar $\mu$, convergence requires the relationship:
$0 < \mu < \frac{1}{\lambda_{max}}$  (106)
Note that here we have N variables. The steepest descent is achieved along the axis which corresponds to the highest eigenvalue. If the eigenvalues of matrix R are not known, only its diagonal, then we can use $\lambda_{max} \le \sum_{i=0}^{N-1}\lambda_i = \mathrm{tr}[\Lambda] = \mathrm{tr}[R]$, because the eigenvalues are nonnegative; therefore
$0 < \mu < \frac{1}{\mathrm{tr}[R]}$.  (107)
Remark: If $\Lambda$ were known, i.e. if we had global information about the error surface, then instead of a scalar $\mu$ the matrix $\frac{1}{2}\Lambda^{-1}$ would be preferable. What we can do is to use the gradient as local information, and follow its direction to reach the minimum of the error surface.
Iterative model fitting methods: In the following we summarize some classical minimization methods, which are widely used in the case of quadratic cost functions and models linear in their parameters. These can also be considered learning procedures, because they acquire and process information about the actual relations. In our case this information source is the gradient of the error surface, and we step forward accordingly. Obviously, we could use other methods, where e.g. W(n) is selected in a different way, and afterwards we check the error: if the error is smaller, then the selected value is the next proposition, otherwise we ignore it (Monte-Carlo methods, genetic algorithms). However, these methods are preferable only if (1) the cost function is not quadratic, or (2) the model is nonlinear in its parameters. In such cases the error surface is not a paraboloid, it might have local minima, and methods based on local information may stop in one of them.
Iterative model fitting using Newton's method:
This method can be derived from the Wiener-Hopf equation. Here we suppose a priori knowledge of R and P. Since this cannot be expected in practice, the method has only theoretical significance; however, it gives hints for creating approximate solutions. Two types of expressions will be given: the first provides the parameter vector for the next iteration step, while the second gives the relation of the parameter error to the initial error.


$W(n+1) = W(n) - \mu R^{-1}\nabla(n)$,  (108)
$V(n+1) = (1 - 2\mu)^{n+1}V(0)$.  (109)
It can be seen that for $\mu = 0.5$ convergence in one step is possible.
Iterative model fitting using the steepest descent method:
This is a method of practical importance, which does not require a priori knowledge of the matrices R and P, but uses the gradient, which must be estimated from local information:
$\hat{\nabla}(n) = \widehat{\frac{\partial J(W(n))}{\partial W(n)}} \approx \frac{\partial J(W(n))}{\partial W(n)} = \nabla(n)$  (110)
This means that we have to measure the changes of J(W(n)) corresponding to small changes in W(n), and compute the gradient estimate:
$W(n+1) = W(n) - \mu\hat{\nabla}(n)$  (111)
$V'(n+1) = (I - 2\mu\Lambda)^{n+1}V'(0)$  (112)
Remarks:
1. The descent along the gradient is more spectacular if we apply a coordinate system whose axes coincide with the principal axes of the paraboloid.
2. The result of descending along the gradient obviously does not depend on the applied coordinate system.
Iterative model fitting using the instantaneous derivative (LMS method):
(LMS: Least Mean Squares.) We start from the instantaneous error:
$\hat{J}(W(n)) = [y(n) - X^T(n)W(n)]^T[y(n) - X^T(n)W(n)] = e^T(n)e(n)$.
We estimate the gradient in the following way:
$\hat{\nabla}(n) = \frac{\partial\hat{J}(W(n))}{\partial W(n)} = -2X(n)y(n) + 2X(n)X^T(n)W(n) = -2X(n)e(n)$  (113)
Thus
$W(n+1) = W(n) + 2\mu X(n)e(n)$.  (114)
This expression is widely used, especially in the case of larger parameter vectors. The convergence factor should be carefully set, typically to a small value, because (113) is a rough approximation: it is a function of the actual y(n), X(n) values, while the true gradient would be the expected value of (113). If we apply a small convergence factor, then we make many small steps, and therefore we are able to make acquaintance with many y(n), X(n) values, which finally results in an averaging effect in (114). The expected value of (113):
$E\{\hat{\nabla}(n)\} = -2E\{X(n)y(n)\} + 2E\{X(n)X^T(n)\}W(n) = 2(RW(n) - P) = \nabla(n)$,
hence $\hat{\nabla}(n)$ can be interpreted as an unbiased instantaneous estimate of $\nabla(n)$.
Remarks:
1. In the early days of neural networks the LMS method was widely used to train networks using adaptive linear combinators in front of the nonlinear system component.
2. It is a general experience that using sufficiently small $\mu$ values the optimal parameter vector can be approximated quite well; in case of a higher $\mu$ the remaining parameter error will be larger. The reason behind this phenomenon is that in the neighborhood of the minimum, if the correction term is not small enough, the instantaneous derivative forces the parameter value back and forth on the internal wall of the paraboloid, and it is unable to descend to a lower point. Therefore, it is worth reducing $\mu$ near the optimum.
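The following sketch (not part of the notes) runs the LMS iteration (114) on the sine example of (89)-(91); the step size and the number of iterations are arbitrary illustrative choices.

```python
# Minimal sketch (assumed example): LMS update (114) on the sine example;
# with a small step size the weights approach W* = R^{-1} P of (91).
import numpy as np

N = 16
mu = 0.05
W = np.zeros(2)

for n in range(5000):
    X = np.array([np.sin(2 * np.pi * n / N),
                  np.sin(2 * np.pi * (n - 1) / N)])   # regression vector
    y = 2 * np.cos(2 * np.pi * n / N)                  # desired output
    e = y - X @ W                                      # instantaneous error
    W = W + 2 * mu * X * e                             # LMS update (114)

print(W)   # ~[ 2/tan(2*pi/N), -2/sin(2*pi/N) ]
```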
The expression of the parameter error can be derived from (114) in such a way that we subtract $W^*$ from both sides, and we suppose that $y(n) = X^T(n)W^*$. This assumption means that the model fitting can perfectly succeed.
$W(n+1) - W^* = W(n) - W^* + 2\mu X(n)[X^T(n)W^* - X^T(n)W(n)] = [I - 2\mu X(n)X^T(n)][W(n) - W^*]$
Hence:
$V(n+1) = \left[\prod_{i=0}^{n}(I - 2\mu X(i)X^T(i))\right]V(0)$  (115)
Equation (115) shows how the convergence factor and the regression vector X(n) contribute to the reduction of the parameter error. Obviously the product of matrices should be contractive, i.e. it should reduce the length of the parameter error vector, possibly in every step.
Iterative model fitting using the α-LMS method:
In (114) it might be useful to norm the regression vector X(n), because otherwise the correction of the parameter vector is highly dependent on the signal level. The modified versions of (114) and (115) are:
$W(n+1) = W(n) + \frac{\alpha}{X^T(n)X(n)}X(n)e(n)$  (116)
$V(n+1) = \left[\prod_{i=0}^{n}\left(I - \frac{\alpha}{X^T(i)X(i)}X(i)X^T(i)\right)\right]V(0)$  (117)
Iterative model fitting using the LMS-Newton method:
In principle, norming the regression vector X(n) in (114) is possible also by the matrix R. If we were in the particular situation that matrix R is known while the gradient is estimated by its instantaneous value, then
$W(n+1) = W(n) + 2\mu R^{-1}X(n)e(n)$  (118)
$V(n+1) = \left[\prod_{i=0}^{n}(I - 2\mu R^{-1}X(i)X^T(i))\right]V(0)$  (119)
This idea has practical importance if matrix R is estimated iteratively from our observations.
Iterative model fitting using the LMS-Newton method, R estimated iteratively:
$W(n+1) = W(n) + 2\mu\hat{R}^{-1}(n+1)X(n)e(n)$  (120)
where $\hat{R}(n+1) = \alpha\hat{R}(n) + \beta X(n)X^T(n)$, $0 < \alpha < 1$, $0 < \beta < 1$, $\alpha + \beta = 1$, $\alpha \approx 0.9...0.99$. The inverse of $\hat{R}(n+1)$ can be calculated using the so-called matrix inversion lemma:
$\hat{R}^{-1}(n+1) = \frac{1}{\alpha}\left[\hat{R}^{-1}(n) - \frac{\beta\hat{R}^{-1}(n)X(n)X^T(n)\hat{R}^{-1}(n)}{\alpha + \beta X^T(n)\hat{R}^{-1}(n)X(n)}\right]$  (121)
Remarks:
1. The matrix inversion lemma: $[A + BC]^{-1} = A^{-1} - A^{-1}B[I + CA^{-1}B]^{-1}CA^{-1}$. Note: if BC is a dyadic product, as it is in our case, then the bracketed matrix sum to be inverted on the right-hand side is a scalar. Here $A = \alpha\hat{R}(n)$, $BC = \beta X(n)X^T(n)$.
2. The iteration may be started from $\hat{R}(0) = \delta I$, where $0 < \delta \le 1$.
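A possible sketch of this recursion (not from the notes; the true parameters, the noise level and the constants mu, alpha, beta below are illustrative assumptions) propagates the inverse of R̂(n) directly with the matrix inversion lemma (121) and applies the update (120):

```python
# Minimal sketch (assumed example): LMS-Newton with iteratively estimated R,
# propagating R^{-1} via the matrix inversion lemma (121).
import numpy as np

rng = np.random.default_rng(2)
W_true = np.array([1.5, -0.7])        # assumed "reality" to be identified
mu, alpha, beta = 0.05, 0.95, 0.05

W = np.zeros(2)
R_inv = np.eye(2)                     # R_hat(0) = I (delta = 1 assumed)

for n in range(2000):
    X = rng.standard_normal(2)                        # regression vector
    y = X @ W_true + 0.05 * rng.standard_normal()     # noisy observation
    e = y - X @ W                                      # a priori error
    # Matrix inversion lemma (121) for R_hat(n+1) = alpha*R_hat(n) + beta*X*X^T
    Rx = R_inv @ X
    R_inv = (R_inv - beta * np.outer(Rx, Rx) / (alpha + beta * X @ Rx)) / alpha
    W = W + 2 * mu * R_inv @ X * e                     # update (120)

print(W)   # close to W_true
```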

Measurement Theory: Lecture 6, 22.03.2017.

5. Model fitting (cont.)


Remarks:
1. Iterative/recursive methods: real-time signal processing is typically performed using recursive procedures. These methods produce an output for every new input, and this output is reused in the next step (a small numerical sketch follows these remarks).
- Simple averaging: the linear average of the observations is computed to produce the estimate of the expected value:
$\bar{x}(n) = \frac{1}{n}\sum_{k=0}^{n-1}y(k)$, $\quad\bar{x}(n+1) = \frac{1}{n+1}\sum_{k=0}^{n}y(k) = \frac{n}{n+1}\bar{x}(n) + \frac{1}{n+1}y(n) = \bar{x}(n) + \frac{1}{n+1}[y(n) - \bar{x}(n)]$
- Recursive estimation of the autocorrelation matrix:
$\hat{R}(n) = \frac{1}{n}\sum_{k=0}^{n-1}X(k)X^T(k)$, $\quad\hat{R}(n+1) = \frac{1}{n+1}\sum_{k=0}^{n}X(k)X^T(k) = \frac{n}{n+1}\hat{R}(n) + \frac{1}{n+1}X(n)X^T(n) = \hat{R}(n) + \frac{1}{n+1}[X(n)X^T(n) - \hat{R}(n)]$.
- Exponential averaging:
$\bar{x}(n+1) = a\bar{x}(n) + by(n)$,
where a and b are positive constants. The frequency-domain behaviour can be given with the help of the z-transform: $z\bar{X}(z) = a\bar{X}(z) + bY(z)$, from which we can derive the transfer function of the exponential averager:
$H(z) = \frac{\bar{X}(z)}{Y(z)} = \frac{b}{z - a} = \frac{bz^{-1}}{1 - az^{-1}}$,
which behaves as a lowpass filter, and for a constant input signal produces, after some transients, a constant value at its output. Typically it is normed to have $H(z) = 1$ for z = 1. Using this condition: $\frac{b}{1-a} = 1$, i.e. $a = 1 - b$. Hence
$\bar{x}(n+1) = \bar{x}(n) + b(y(n) - \bar{x}(n))$.
In this case the new observed value is multiplied by a constant, in contrast to the linear averaging. The above computation $\hat{R}(n+1) = \alpha\hat{R}(n) + \beta X(n)X^T(n)$ has the same structure as the exponential averaging, with $\alpha = a = 1 - b$ and $\beta = b$.
2. Model fitting can be performed basically using one of two approaches:
- first collecting data followed by batch processing, i.e. in an off-line way;
- data processing in parallel with the acquisition, iteratively or recursively, i.e. in an on-line way.
3. Concerning the role of model fitting we distinguish two approaches:
- Identification: we try to describe reality with high accuracy;
- Adaptation: we try to follow reality with low delay.
In case of adaptive systems we basically use iterative/recursive methods, while for identification in principle the two approaches produce the same result.
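The sketch below (not part of the notes; the constant signal level, the noise and the averaging constant b are arbitrary assumptions) contrasts the growing-memory recursive average with the exponential averager discussed above.

```python
# Minimal sketch (assumed example): recursive simple averaging versus
# exponential averaging of a noisy constant level.
import numpy as np

rng = np.random.default_rng(3)
y = 1.0 + 0.3 * rng.standard_normal(1000)    # noisy observations of a constant

b = 0.02                  # exponential averaging constant (a = 1 - b)
x_rec, x_exp = 0.0, 0.0
for n, yn in enumerate(y):
    x_rec = x_rec + (yn - x_rec) / (n + 1)   # simple averaging, gain 1/(n+1)
    x_exp = x_exp + b * (yn - x_exp)         # exponential averaging, constant gain b

print(x_rec, x_exp)   # both close to 1.0; the exponential one can track changes
```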


If the cost function is not quadratic: its Taylor series expansion.
If for some reason the cost function is not quadratic, then we can attempt the application of a Taylor series expansion in the neighborhood of a given parameter value W(n):
$C(y, \hat{y}) = C(y, \hat{y}(W)) = C(W) \approx C(W(n)) + [\nabla C(W(n))]^T[W - W(n)] + \frac{1}{2}[W - W(n)]^T H(W(n))[W - W(n)] + ...$  (122)
where $H(W(n)) = \frac{\partial^2 C(W)}{\partial W^2}\Big|_{W = W(n)}$ is the second derivative (the Hessian), which plays the same role as matrix R.
Iterative model fitting based on the Taylor series expansion of the cost function:
a) A possible alternative is to use only the first two terms of (122), and to suppose that $C(W(n+1)) = 0$. In this case the equation $C(W(n)) + [\nabla C(W(n))]^T[W(n+1) - W(n)] = 0$ should be solved:
$W(n+1) = W(n) - \frac{C(W(n))}{[\nabla C(W(n))]^T\nabla C(W(n))}\nabla C(W(n))$  (123)
This is the so-called Newton-Raphson method. A typical experience is that this method behaves well far from the optimum, while its behavior near the optimum depends on the extent to which the supposition $C(W^*) = 0$ is fulfilled.
b) If we try to find the minimum of the expanded $C(W)$ by computing its gradient, then the condition $\nabla C(W(n+1)) = 0 \approx \nabla C(W(n)) + H(W(n))(W(n+1) - W(n))$ is obtained, which results in the Newton method:
$W(n+1) = W(n) - H^{-1}(W(n))\nabla C(W(n))$  (124)
Remark: The multiplier 2 does not appear here because of the factor $\frac{1}{2}$ present in (122), unlike in the expressions used in the case of quadratic cost functions.
Adaptive IIR systems
It might be more efficient, however in several respects more problematic, to fit so-called Infinite Impulse Response (IIR) models. These can be implemented in several alternative forms, but in the following we restrict ourselves to the so-called direct structure. Formally we still apply an adaptive linear combinator, however in computing the actual estimate the earlier estimates are also considered:
$\hat{y}(n) = \sum_{k=0}^{M-1}a_k(n)x(n-k) + \sum_{k=1}^{N-1}b_k(n)\hat{y}(n-k)$,  (125)
i.e.
$W^T(n) = [a_0(n), a_1(n), ..., a_{M-1}(n);\; b_1(n), b_2(n), ..., b_{N-1}(n)]$,  (126)
$X^T(n) = [x(n), x(n-1), ..., x(n-M+1);\; \hat{y}(n-1), \hat{y}(n-2), ..., \hat{y}(n-N+1)]$.  (127)
If we apply the methods discussed up to now with (126) and (127), then we perform the so-called pseudolinear regression (PLR). In this case we neglect the fact that the regression vector (127) depends on the previous outputs of the fitted model (the adaptive filter), which is an implicit dependence with the consequence that the error surface is not a paraboloid any more. For every gradient-based method there is a real danger of local minima.


Equation-Error Formulation:
The implicit dependence mentioned above can be avoided by alternative error formulations. In this paragraph the numerator $N(z)$ and the denominator $D(z)$ of the transfer function $H(z)$ of the fitted adaptive filter are considered as operators, allowing the simultaneous presence of (discrete) time and frequency in the equations:
$H(z) = \frac{N(z)}{D(z)} = \frac{\hat{Y}(z)}{X(z)}$,  (128)
and the approximation error, which up to now was minimized in the squared sense, is $e(n) = y(n) - \hat{y}(n)$. Instead of this let's introduce $e_e(n) = D(z)e(n)$, because using (128) this can be written in the following form:
$e_e(n) = D(z)e(n) = D(z)y(n) - D(z)\hat{y}(n) = D(z)y(n) - N(z)x(n)$,  (129)
which is independent of $\hat{y}(n)$. Let's denote
$N(z) = A(n,z) = \sum_{k=0}^{M-1}a_k(n)z^{-k}$, and $D(z) = 1 - B(n,z)$, where $B(n,z) = \sum_{k=1}^{N-1}b_k(n)z^{-k}$:
$e_e(n) = y(n) - B(n,z)y(n) - A(n,z)x(n) = y(n) - \hat{y}_e(n)$,  (130)
where, using (126), $\hat{y}_e(n) = W^T(n)X_e(n)$. Here (compare it with (127)):
$X_e^T(n) = [x(n), x(n-1), ..., x(n-M+1);\; y(n-1), y(n-2), ..., y(n-N+1)]$.  (131)
For (130) the quadratic error surface will be a paraboloid, therefore all methods successful for adaptive linear combinators can be applied to IIR adaptive filters. The block diagram of the method can be seen on Figure 26.

Figure 26: Equation-error adaptive IIR scheme: A(n,z) is driven by x(n) and B(n,z) by the measured y(n); their sum ŷ_e(n) is compared with y(n) to form e_e(n), and the adapted parameters are copied into the recursive filter 1/(1 - B(n,z)).
Remark: In case of noisy observations distortions (parameter bias) may appear, i.e. the expected
value of the parameters may differ from their ideal value. This is because the observation noise
is also filtered and takes part in the model fitting.
Output-Error Formulation:
If we try to avoid parameter bias, then the output-error formulation is more advantageous, however the danger of local minima exists. If we accept this condition, then the different gradient-based methods can also be considered.
1. Instantaneous gradient (LMS-like) methods: here we try to minimize the output error $e^2(n) = e_o^2(n)$. The gradient estimate has the previously discussed structure:


$\hat{\nabla}(n) = -2e(n)\frac{\partial\hat{y}(n)}{\partial W(n)} = -2e(n)\nabla_W\hat{y}(n)$,  (132)
where $\nabla_W^T\hat{y}(n) = \left[\frac{\partial\hat{y}(n)}{\partial a_0(n)}, \frac{\partial\hat{y}(n)}{\partial a_1(n)}, ..., \frac{\partial\hat{y}(n)}{\partial a_{M-1}(n)};\; \frac{\partial\hat{y}(n)}{\partial b_1(n)}, \frac{\partial\hat{y}(n)}{\partial b_2(n)}, ..., \frac{\partial\hat{y}(n)}{\partial b_{N-1}(n)}\right]$, and:
$\frac{\partial\hat{y}(n)}{\partial a_k(n)} = x(n-k) + \sum_{i=1}^{N-1}b_i(n)\frac{\partial\hat{y}(n-i)}{\partial a_k(n)}$, $\quad\frac{\partial\hat{y}(n)}{\partial b_k(n)} = \hat{y}(n-k) + \sum_{i=1}^{N-1}b_i(n)\frac{\partial\hat{y}(n-i)}{\partial b_k(n)}$.  (133)
From (133) it can be seen what the implicit dependence really means: due to the infinite memory of the IIR filters, the actual parameter sensitivity is influenced by all previous input data. Moreover, the computation of (133) is rather demanding, so some simplifications are unavoidable.
If the convergence factor applied is small enough (i.e. small $\mu$), then we can apply the following approximations:
$\frac{\partial\hat{y}(n-i)}{\partial a_k(n)} \approx \frac{\partial\hat{y}(n-i)}{\partial a_k(n-i)}$, and $\frac{\partial\hat{y}(n-i)}{\partial b_k(n)} \approx \frac{\partial\hat{y}(n-i)}{\partial b_k(n-i)}$.  (134)
(134) is advantageous, because the gradient approximation can be calculated by filters: the inputs of the filters are $x(n-k)$ and $\hat{y}(n-k)$, their outputs are $\frac{\partial\hat{y}(n)}{\partial a_k(n)}$ and $\frac{\partial\hat{y}(n)}{\partial b_k(n)}$, and we need as many filters as there are tunable parameters. The block diagram of the filters can be seen on Figure 27.

Figure 27: Gradient filters of the RPE method: copies of 1/(1 - B(n,z)) driven by x(n-k) and ŷ(n-k) produce the sensitivities ∂ŷ(n)/∂a_k(n) and ∂ŷ(n)/∂b_k(n).
This approximate technique is the so-called Recursive Prediction Error (RPE) method. The name refers to the fact that the correction term within the predicted parameter value is computed by a recursive filter. An obvious drawback of this approach is that the number of filters required equals the number of parameters. A further simplification can be seen on Figure 28. This is called simplified RPE.
Figure 28: Simplified RPE: x(n) and ŷ(n) are each filtered once by 1/(1 - B(n,z)), and delayed versions of the filtered signals approximate the sensitivities ∂ŷ(n)/∂a_k(n) and ∂ŷ(n)/∂b_k(n).


2. A further simplification is to omit the feedback terms from (133). With this step we are back at the pseudolinear regression (PLR) method. Practically, also in this case the LMS or the LMS-Newton methods are applied.
3. As it was already mentioned, in the case of output-error adaptive IIR filters the error surface above the subspace of the parameters is not a paraboloid, and local minima might occur. Since gradient methods may lead the algorithm into such a local minimum, we might try to transform this surface into another one having a single minimum. This can be achieved with a different error definition:
Filtered-error (FE) Algorithm:
$e(n) \to e_f(n) = [1 - C(n,z)]e(n)$, where $C(n,z)$ is designed to force $\frac{1 - C(n,z)}{1 - B^*(z)}$ to be Strictly Positive Real (SPR), because in this case global convergence can be assured. $B^*(z)$ is $B(n,z)$ at the optimal parameter setting $W^*$.
Remark: The SPR property indicates that without excitation the internal energy is dissipated, and the system reaches its minimum energy state. This corresponds to the convergence of the cost function minimization procedure.
General form of the adaptive IIR filtering algorithms:
$W(n+1) = W(n) + 2\mu\hat{R}^{-1}(n+1)X_f(n)e_f(n)$,  (135)
where $X_f(n)$ is the filtered information (or regression) vector and $e_f(n)$ is the filtered error. Names of the algorithms: if $X_f(n) = X_e(n)$, $e_f(n) = e_e(n)$, then we speak about Recursive Least Squares (RLS) algorithms; moreover, if $\hat{R}(n+1) = I$, then the name is Least Mean Square (LMS) algorithm.
Stability theory based approach: introduction via the LMS method.
$W(n+1) = W(n) + 2\mu e(n)X(n)$, $\quad V(n+1) = [I - 2\mu X(n)X^T(n)]V(n)$. This latter is an autonomous system: $V(n) \to 0$, thus $W(n) \to W^*$.
Using Lyapunov's method we are looking for an appropriate energy function:
$G(n) = V^T(n)V(n)$. We would like to reduce this energy: $\Delta G(n+1) = G(n+1) - G(n) \le 0$ for all n. If G(0) is bounded, then $G(n) \to 0$.
$\Delta G(n+1) = V^T(n+1)V(n+1) - V^T(n)V(n) = V^T(n)[I - 2\mu X(n)X^T(n)]^T[I - 2\mu X(n)X^T(n)]V(n) - V^T(n)V(n) = -4\mu e^2(n)(1 - \mu X^T(n)X(n))$.  (136)
Since $\mu > 0$, if $0 < \mu < \frac{1}{X^T(n)X(n)}$ for every n, then $\Delta G \le 0$, which involves $e^2(n) \to 0$ and $V^T(n)X(n) \to 0$.
Remarks:
1. The condition $0 < \mu < \frac{1}{X^T(n)X(n)}$ implicitly guarantees that X(n) is bounded, and that $1/[1 - B(n,z)]$ is stable. (All the poles of the transfer function are located inside the unit circle.)
2. $e(n) = (W^* - W(n))^T X(n) = -V^T(n)X(n)$ can also be zero if the parameter error vector and the regression vector are orthogonal. Obviously, this situation should be avoided.

Measurement Theory: Lecture 7, 29.03.2017

6. Basics of Filtering Theory


I. Optimal non-recursive estimator (scalar Wiener filter)
In the following, we consider the linear combinator structure again; obviously, the obtained result will be the same, however the problem setting is somewhat different and there are new elements in the interpretation as well. We are looking for the best estimator of x using the so-called linear batch processor:
$\hat{x} = \sum_{k=0}^{N-1} w_k y_k$
Here $y_0, y_1, ..., y_{N-1}$ stand for the observed data, and $w_k$, $k = 0,1,...,N-1$, are the unknown weighting factors. This is illustrated by Figure 29, where it is indicated that the element/vector x is approximated by a linear combination of the $y_k$ elements within the subspace spanned by them. The best estimate is generated by the orthogonal projection of x onto this subspace.

Figure 29: The estimate x̂ as the orthogonal projection of x onto the subspace spanned by the observations y_k.
This strategy results in the very same solution which is provided by the minimization of the squared error:
$e = x - \hat{x}$, $\quad E\{e^2\} = E\{(x - \hat{x})^2\} = E\{(x - \sum_{k=0}^{N-1}w_k y_k)^2\}$  (137)
$\frac{\partial E\{e^2\}}{\partial w_j} = -2E\{(x - \sum_{k=0}^{N-1}w_k y_k)y_j\} = 0$, i.e. $E\{ey_j\} = 0$ for all j.  (138)
This latter is the so-called orthogonality equation, which, interpreted in terms of vectors, expresses that the error vector e is orthogonal to every $y_j$. By reordering (138):
$\sum_{k=0}^{N-1}w_k\underbrace{E\{y_k y_j\}}_{R_{yy}(k,j)} = \underbrace{E\{xy_j\}}_{P_{xy}(j)}$, $\quad j = 0,1,2,...,N-1$,  (139)
where $R_{yy}(k,j)$ denotes the element of the autocorrelation matrix $R_{yy}$ indexed by (k,j), while $P_{xy}(j)$ denotes the j-th element of the cross-correlation vector. This notation explicitly indicates the quantities whose correlation is represented. The remaining error at the optimum setting:
$E\{e^2\}_{min} = E\{e(x - \sum_{k=0}^{N-1}w_k y_k)\}\Big|_{min} = E\{ex\}\Big|_{min} = E\{(x - \hat{x})x\} = E\{x^2\} - \sum_{k=0}^{N-1}w_k E\{xy_k\} = E\{x^2\} - \sum_{k=0}^{N-1}w_k P_{xy}(k)$  (140)


All these equations using matrix notation (where $W^T = [w_0\; w_1\; \cdots\; w_{N-1}]$):
$R_{yy}W = P_{xy}$, $\quad W = R_{yy}^{-1}P_{xy}$, $\quad E\{e^2\}_{min} = E\{x^2\} - P_{xy}^T R_{yy}^{-1}P_{xy}$  (141)

Example 1: We have noisy observations of x: $y_k = x + n_k$. The properties of the noise: $E\{xn_j\} = 0$ for all j, $E\{n_j n_k\} = \sigma_n^2\delta_{jk}$. The properties of the signal: $E\{x\} = 0$, $E\{x^2\} = \sigma_x^2$.
The correlation matrices: $R_{yy}(k,j) = E\{y_k y_j\} = E\{(x + n_k)(x + n_j)\} = \sigma_x^2 + \sigma_n^2\delta_{kj}$, $\quad P_{xy}(j) = E\{xy_j\} = \sigma_x^2$. The first equation of (141) in detailed form:
$(\sigma_x^2 + \sigma_n^2)w_0 + \sigma_x^2 w_1 + \cdots + \sigma_x^2 w_{N-1} = \sigma_x^2$
$\sigma_x^2 w_0 + (\sigma_x^2 + \sigma_n^2)w_1 + \cdots + \sigma_x^2 w_{N-1} = \sigma_x^2$  (142)
$\vdots$
$\sigma_x^2 w_0 + \sigma_x^2 w_1 + \cdots + (\sigma_x^2 + \sigma_n^2)w_{N-1} = \sigma_x^2$
After the summation of these N equations:
$(\sigma_n^2 + N\sigma_x^2)\sum_{k=0}^{N-1}w_k = N\sigma_x^2$, i.e. $\sum_{k=0}^{N-1}w_k = \frac{N\sigma_x^2}{\sigma_n^2 + N\sigma_x^2}$, which can be substituted back into every row:
$w_0 = w_1 = \cdots = w_{N-1} = \frac{\sigma_x^2}{\sigma_n^2 + N\sigma_x^2} = \frac{1}{N + (\frac{\sigma_n}{\sigma_x})^2}$, thus $\hat{x} = \frac{1}{N + (\frac{\sigma_n}{\sigma_x})^2}\sum_{k=0}^{N-1}y_k$, and the error:  (143)
$E\{e^2\}_{min} = \sigma_x^2 - \frac{N\sigma_x^2}{N + (\frac{\sigma_n}{\sigma_x})^2} = \frac{\sigma_n^2}{N + (\frac{\sigma_n}{\sigma_x})^2}$  (144)
Remark: It is interesting to investigate the above result as a function of the ratio $(\frac{\sigma_n}{\sigma_x})^2$.
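A quick numerical check of Example 1 (a sketch only, not from the notes; N and the variances are arbitrary choices) solves (141) directly:

```python
# Minimal sketch (assumed example): the batch Wiener estimator (141) for Example 1.
import numpy as np

N = 5
var_x, var_n = 1.0, 0.5

R_yy = var_x * np.ones((N, N)) + var_n * np.eye(N)   # E{y_k y_j} = var_x + var_n*delta
P_xy = var_x * np.ones(N)                            # E{x y_j} = var_x

W = np.linalg.solve(R_yy, P_xy)
mse = var_x - P_xy @ W                               # E{e^2}_min from (141)

print(W)      # all weights equal 1/(N + var_n/var_x), cf. (143)
print(mse)    # var_n / (N + var_n/var_x), cf. (144)
```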
Example 2: We take two samples of a linearly increasing function and estimate its slope x. We use the estimator $\hat{x} = \sum_{k=0}^{N-1}w_k y_k$, where $y_k = (k+1)x + n_k$, k = 0, 1. The correlation matrices:
$R_{yy}(j,k) = E\{((j+1)x + n_j)((k+1)x + n_k)\} = (j+1)(k+1)E\{x^2\} + E\{n_j n_k\} = (j+1)(k+1)S + \sigma_n^2\delta_{jk}$,
$P_{xy}(j) = E\{xy_j\} = E\{x((j+1)x + n_j)\} = (j+1)E\{x^2\} = (j+1)S$, where $S = E\{x^2\} = \sigma_x^2 + (E\{x\})^2$.
Here we obviously do not suppose that the expected value of x is zero. By substituting the above results into the first equation of (141):
$(S + \sigma_n^2)w_0 + 2Sw_1 = S$, $\quad 2Sw_0 + (4S + \sigma_n^2)w_1 = 2S$, $\quad\Rightarrow\quad w_0 = \frac{1}{5 + \frac{\sigma_n^2}{S}}$, $w_1 = \frac{2}{5 + \frac{\sigma_n^2}{S}}$.
Let's suppose that $\sigma_n^2 \ll S$; therefore
$w_0 \approx \frac{1}{5}$, $w_1 \approx \frac{2}{5}$, and finally $\hat{x} \approx \frac{y_0 + 2y_1}{5}$, $\quad E\{e^2\} \approx \frac{\sigma_n^2}{5}$  (145)


Recursive estimator from the optimal non-recursive estimator:
Introductory example: simple averaging.
Remark: We can assign the discrete time (n = 1, 2, ...) to the iteration index.
$\hat{x}(n) = \frac{1}{n}\sum_{k=0}^{n-1}y(k)$, $\quad\hat{x}(n+1) = \frac{1}{n+1}\sum_{k=0}^{n}y(k) = \frac{1}{n+1}\sum_{k=0}^{n-1}y(k) + \frac{1}{n+1}y(n) = \frac{n}{n+1}\hat{x}(n) + \frac{1}{n+1}y(n) = \hat{x}(n) + \frac{1}{n+1}(y(n) - \hat{x}(n))$  (146)
Remark: An important advantage of a recursive procedure is that it is not necessary to wait for all the data: the estimate is computed continuously, while its quality becomes better step by step.
Let's return to the optimal estimator (the equal weights of Example 1) and write the time index n instead of N:
$\hat{x}(n) = \sum_{k=0}^{n-1}w_k(n)y(k)$, $\quad w_k(n) = \frac{1}{n + \gamma}$, where $\gamma = \left(\frac{\sigma_n}{\sigma_x}\right)^2$, and $E\{e^2(n)\} = E\{(x - \hat{x}(n))^2\} = \frac{\sigma_n^2}{n + \gamma}$;
based on the above, an alternative form is $w_k(n) = \frac{E\{e^2(n)\}}{\sigma_n^2}$. This can be repeated for n+1:
$\hat{x}(n+1) = \sum_{k=0}^{n}w_k(n+1)y(k)$, $\quad w_k(n+1) = \frac{1}{n+1+\gamma}$, $\quad E\{e^2(n+1)\} = E\{(x - \hat{x}(n+1))^2\} = \frac{\sigma_n^2}{n+1+\gamma}$;
based on the above, an alternative form is $w_k(n+1) = \frac{E\{e^2(n+1)\}}{\sigma_n^2}$. Following the steps of (146) we have:
$\hat{x}(n+1) = \frac{1}{n+1+\gamma}\sum_{k=0}^{n-1}y(k) + \frac{1}{n+1+\gamma}y(n) = \frac{n+\gamma}{n+1+\gamma}\hat{x}(n) + \frac{1}{n+1+\gamma}y(n) = \hat{x}(n) + \frac{1}{n+1+\gamma}(y(n) - \hat{x}(n))$  (147)
It is interesting to see how the mean square error behaves depending on the amount of data. Based on the previous development:
$\frac{E\{e^2(n+1)\}}{E\{e^2(n)\}} = \frac{w_k(n+1)}{w_k(n)} = \frac{n+\gamma}{n+1+\gamma} = \frac{1}{1 + \frac{1}{n+\gamma}} = \frac{1}{1 + \frac{E\{e^2(n)\}}{\sigma_n^2}}$  (148)
Using this, an alternative form of (147) can be:
$\hat{x}(n+1) = \frac{E\{e^2(n+1)\}}{E\{e^2(n)\}}\hat{x}(n) + \frac{E\{e^2(n+1)\}}{\sigma_n^2}y(n) = \hat{x}(n) + \frac{E\{e^2(n+1)\}}{\sigma_n^2}(y(n) - \hat{x}(n))$  (149)
E{e 2 (n)} n2 n2
Example 3: Given $\sigma_n^2 = 2\sigma_x^2$, i.e. $\gamma = 2$. From the above equations: $\hat{x}(1) = \frac{y(0)}{3}$, $E\{e^2(1)\} = \frac{\sigma_n^2}{3}$; using (148): $E\{e^2(2)\} = \frac{E\{e^2(1)\}}{1 + \frac{E\{e^2(1)\}}{\sigma_n^2}} = \frac{\sigma_n^2}{4}$, thus (149) for n = 1:
$\hat{x}(2) = \frac{\sigma_n^2/4}{\sigma_n^2/3}\hat{x}(1) + \frac{\sigma_n^2/4}{\sigma_n^2}y(1) = \frac{3}{4}\hat{x}(1) + \frac{1}{4}y(1) = \hat{x}(1) + \frac{1}{4}(y(1) - \hat{x}(1)) = \frac{1}{4}(y(0) + y(1))$,
$E\{e^2(3)\} = \frac{\sigma_n^2}{5}$, $\quad\hat{x}(3) = \frac{4}{5}\hat{x}(2) + \frac{1}{5}y(2) = \hat{x}(2) + \frac{1}{5}(y(2) - \hat{x}(2))$, etc.
Notation: denote the coefficients of (149) by $a(n+1) = \frac{E\{e^2(n+1)\}}{E\{e^2(n)\}}$ and $b(n+1) = \frac{E\{e^2(n+1)\}}{\sigma_n^2}$,
thus: $\hat{x}(n+1) = a(n+1)\hat{x}(n) + b(n+1)y(n)$  (150)
Note that $\frac{a(n+1)}{b(n+1)} = \frac{1}{b(n)}$, and using (148) $a(n+1) = \frac{1}{1 + b(n)}$, thus $a(n+1) = 1 - b(n+1)$, i.e. $a(n+1) + b(n+1) = 1$, and
$\hat{x}(n+1) = \hat{x}(n) + b(n+1)\underbrace{(y(n) - \hat{x}(n))}_{\text{correction term}}$  (151)
Equation (151) is the recursive form of the optimum non-recursive estimator, which in every step, due to a new measurement, extends by a new dimension the subspace onto which the vector x is projected.
II. Optimal recursive estimator (scalar Kalman filter):
The optimal recursive estimator is based on a more detailed model. It is the simplest state-variable model, the excitation of which is provided by a noise process. It might have a deterministic excitation as well, but thanks to the superposition theorem that can be processed separately.
$x(n) = ax(n-1) + w(n-1)$,
where {w(n)} is a zero-mean white noise process, i.e.:
$E\{w(n)\} = 0$, $E\{w(n)w(j)\} = \begin{cases}\sigma_w^2, & n = j\\ 0, & n \ne j\end{cases}$, and $x(n) = 0$, $w(n) = 0$ for $n < 0$.
Remark: This is the so-called first-order autoregressive process: it depends in first order on the value available at the previous time instant.
$E\{x(n)\} = 0$, $E\{x^2(n)\} = R_{xx}(0) = \sigma_x^2 = E\{a^2x^2(n-1) + w^2(n-1) + 2ax(n-1)w(n-1)\} = a^2R_{xx}(0) + R_{ww}(0)$, from which $R_{xx}(0) = \sigma_x^2 = \frac{\sigma_w^2}{1 - a^2}$.  (152)
$R_{xx}(1) = E\{x(n+1)x(n)\} = E\{(ax(n) + w(n))x(n)\} = aR_{xx}(0)$, because $E\{x(n)w(n)\} = 0$.
$R_{xx}(2) = E\{x(n+2)x(n)\} = E\{(ax(n+1) + w(n+1))x(n)\} = a^2R_{xx}(0)$, and in general


$R_{xx}(j) = E\{x(n)x(n-j)\} = R_{xx}(-j) = a^{|j|}R_{xx}(0)$.  (153)
The linear model of observation can be seen on Figure 30.
Figure 30: The linear model of the observation: the first-order AR signal model x(n) = ax(n-1) + w(n-1) (unit delay with feedback a) followed by an additive observation noise source.
The observation is noisy: for its description, an additive noise source is used. Our assumptions concerning this noise are exactly the same as those for w(n), and the two noise processes are independent of each other. The recursive estimator is the linear combination of the new observation and the previous estimate (see Figure 31):
Figure 31: The recursive estimator: the new observation y(n), weighted by b(n), is combined with the previous estimate x̂(n-1), weighted by a(n).
$\hat{x}(n) = a(n)\hat{x}(n-1) + b(n)y(n)$  (154)
Remark: There exists a so-called predictor scheme as well:
$\hat{x}(n+1) = \alpha(n)\hat{x}(n) + \beta(n)y(n)$.  (155)
We are looking for the optimum weights in (154) using the illustration in Figure 29. The approximation error is:
$e(n) = x(n) - \hat{x}(n)$, $\quad E\{e^2(n)\} = E\{(x(n) - a(n)\hat{x}(n-1) - b(n)y(n))^2\}$.
Conditions of the optimum:
$\frac{\partial E\{e^2(n)\}}{\partial a(n)} = -2E\{e(n)\hat{x}(n-1)\} = 0$, $\quad\frac{\partial E\{e^2(n)\}}{\partial b(n)} = -2E\{e(n)y(n)\} = 0$.
These are the so-called orthogonality equations for the Kalman filter:
$E\{e(n)\hat{x}(n-1)\} = 0$, $\quad E\{e(n)y(n)\} = 0$  (156)
These equations can be interpreted also using Figure 29: the error vector e(n) is orthogonal to the vectors $\hat{x}(n-1)$ and y(n), and to the subspace generated by them.

Measurement Theory: Lecture 8, 06.04.2017

6. Basics of Filtering Theory (cont.):

Let's compute a(n):
Using (156): $E\{(x(n) - a(n)\hat{x}(n-1) - b(n)y(n))\hat{x}(n-1)\} = 0$. Write in here $a(n)x(n-1) - a(n)x(n-1) = 0$ and $y(n) = cx(n) + n(n)$, i.e.:
$E\{(a(n)(x(n-1) - \hat{x}(n-1)) - a(n)x(n-1) + (1 - cb(n))x(n) - b(n)n(n))\hat{x}(n-1)\} = 0$,
and after some manipulations:
$a(n)E\{[x(n-1) - e(n-1)]\hat{x}(n-1)\} = E\{[(1 - cb(n))x(n) - b(n)n(n)]\hat{x}(n-1)\}$  (157)
Here $E\{e(n-1)\hat{x}(n-1)\} = 0$, since the elements of $\hat{x}(n-1) = a(n-1)\hat{x}(n-2) + b(n-1)y(n-1)$ are orthogonal to e(n-1). Similarly, the observation noise sample n(n) is uncorrelated with $\hat{x}(n-1)$. Thus (157) takes the form:
$a(n)E\{x(n-1)\hat{x}(n-1)\} = (1 - cb(n))E\{x(n)\hat{x}(n-1)\}$.
Introducing $x(n) = ax(n-1) + w(n-1)$ and $E\{w(n-1)\hat{x}(n-1)\} = 0$, we have:
$a(n)E\{x(n-1)\hat{x}(n-1)\} = a(1 - cb(n))E\{x(n-1)\hat{x}(n-1)\}$, i.e. $a(n) = a(1 - cb(n))$, and finally
$\hat{x}(n) = a\hat{x}(n-1) + b(n)(y(n) - ca\hat{x}(n-1))$  (158)
Remark: It can be seen that the model of the observed system becomes part of the estimator. The second component on the right-hand side of (158) is a correction term, which is proportional to the difference of the measured value y(n) and its estimate $\hat{y}(n) = ac\hat{x}(n-1)$. The block diagram corresponding to the estimator given by (158) can be seen on Figure 32.

Figure 32: Block diagram of the scalar Kalman filter (158).

Let's compute b(n):
$E\{e^2(n)\} = E\{e(n)(x(n) - \hat{x}(n))\} = E\{e(n)x(n)\}$,
because the right-hand side of $\hat{x}(n) = a(n)\hat{x}(n-1) + b(n)y(n)$ is orthogonal to e(n). Since
$y(n) = cx(n) + n(n)$, we have $cx(n) = y(n) - n(n)$.
Using this: $cE\{e(n)x(n)\} = E\{e(n)(y(n) - n(n))\} = -E\{e(n)n(n)\}$, i.e. $E\{e^2(n)\} = -\frac{1}{c}E\{e(n)n(n)\}$.
After some manipulations:
$E\{e^2(n)\} = -\frac{1}{c}E\{(x(n) - \hat{x}(n))n(n)\} = -\frac{1}{c}E\{(x(n) - a(n)\hat{x}(n-1) - b(n)y(n))n(n)\}$;
since x(n) and $\hat{x}(n-1)$ are uncorrelated with n(n), therefore:
$E\{e^2(n)\} = \frac{b(n)}{c}E\{y(n)n(n)\} = \frac{b(n)}{c}\sigma_n^2$, from which $b(n) = c\frac{E\{e^2(n)\}}{\sigma_n^2}$  (159)


Remark: In itself this form is useless, because to be able to compute the mean square of e(n) we already need b(n). We need a form in which only the mean square error of e(n-1) or earlier appears. Let's substitute (158) into the mean square error:
$E\{e^2(n)\} = E\{[x(n) - \hat{x}(n)]^2\} = E\{[x(n) - a\hat{x}(n-1) - b(n)(y(n) - ac\hat{x}(n-1))]^2\}$  (160)
Write in $y(n) = cx(n) + n(n)$ and $x(n) = ax(n-1) + w(n-1)$:
$E\{e^2(n)\} = E\{[ax(n-1) + w(n-1) - a\hat{x}(n-1) - b(n)(acx(n-1) + cw(n-1) + n(n) - ac\hat{x}(n-1))]^2\} = E\{[a(1 - cb(n))e(n-1) + (1 - cb(n))w(n-1) - b(n)n(n)]^2\}$
After some manipulations, and observing that the expected values of the cross-products are zero:
$E\{e^2(n)\} = a^2(1 - cb(n))^2E\{e^2(n-1)\} + (1 - cb(n))^2\sigma_w^2 + b^2(n)\sigma_n^2$  (161)
From (159) $E\{e^2(n)\} = \frac{b(n)}{c}\sigma_n^2$. Writing this into (161):
$\frac{b(n)}{c}(1 - cb(n))\sigma_n^2 = (1 - cb(n))^2[a^2E\{e^2(n-1)\} + \sigma_w^2]$. This is already suitable for computing b(n):
$b(n)[a^2c^2E\{e^2(n-1)\} + c^2\sigma_w^2 + \sigma_n^2] = c[a^2E\{e^2(n-1)\} + \sigma_w^2]$,
$b(n) = \frac{c[a^2E\{e^2(n-1)\} + \sigma_w^2]}{a^2c^2E\{e^2(n-1)\} + c^2\sigma_w^2 + \sigma_n^2}$  (162)
Remark: If $\sigma_n \to 0$, then $b(n) \to \frac{1}{c}$, and thus $E\{e^2(n)\} \to 0$.
Summary:
Applying the system model $x(n) = ax(n-1) + w(n-1)$ and the observation $y(n) = cx(n) + n(n)$, the optimal recursive filter is:
$\hat{x}(n) = a\hat{x}(n-1) + b(n)(y(n) - ac\hat{x}(n-1))$
$b(n) = cp(n)[c^2p(n) + \sigma_n^2]^{-1}$, where
$p(n) = a^2E\{e^2(n-1)\} + \sigma_w^2$  (163)
$E\{e^2(n)\} = p(n) - cb(n)p(n)$
$p(n+1) = a^2(1 - cb(n))p(n) + \sigma_w^2$
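A minimal simulation sketch of the recursion (163) (not from the notes; a, c, the noise variances and the initialization p(0) = σ_x² are illustrative assumptions):

```python
# Minimal sketch (assumed example): the scalar Kalman filter recursion (163)
# applied to a first-order AR signal observed in white noise.
import numpy as np

rng = np.random.default_rng(4)
a, c = 0.9, 1.0
var_w, var_n = 0.2, 1.0

x = 0.0
x_hat, p = 0.0, var_w / (1 - a**2)   # p(0) set to the signal variance (assumption)

err = []
for _ in range(200):
    x = a * x + np.sqrt(var_w) * rng.standard_normal()            # system model
    y = c * x + np.sqrt(var_n) * rng.standard_normal()            # noisy observation

    b = c * p / (c**2 * p + var_n)                 # gain b(n)
    x_hat = a * x_hat + b * (y - a * c * x_hat)    # filter update (158)
    e2 = p - c * b * p                             # E{e^2(n)}
    p = a**2 * e2 + var_w                          # p(n+1)
    err.append((x - x_hat)**2)

print(np.mean(err), e2)   # empirical squared error vs. the computed E{e^2(n)}
```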


Optimal recursive predictor:
The estimator (see (155)): $\hat{x}(n+1) = \alpha(n)\hat{x}(n) + \beta(n)y(n)$; with similar manipulations:
$\hat{x}(n+1) = a\hat{x}(n) + \beta(n)(y(n) - c\hat{x}(n))$
The derivation consists of steps similar to those of the recursive filter; the difference is that here we minimize $p(n+1) = E\{e^2(n+1)\}$. The detailed development will be provided for the vector case.
Summary:

Applying the system model $x(n+1) = ax(n) + w(n)$ and the observation $y(n) = cx(n) + n(n)$, the optimal recursive predictor is:
$\hat{x}(n+1) = a\hat{x}(n) + \beta(n)(y(n) - c\hat{x}(n))$
$\beta(n) = acp(n)[c^2p(n) + \sigma_n^2]^{-1}$  (164)
$p(n+1) = a(a - c\beta(n))p(n) + \sigma_w^2$
Remark: Figure 33 shows the block diagram of the optimal single-step predictor. Note that this scheme corresponds to the observer scheme introduced earlier.

Figure 33: Block diagram of the optimal single-step (Kalman) predictor (164).
Example (scalar Kalman filter): Let's suppose that $E\{x(n)\} = 0$, c = 1, and $\hat{x}(0) = 0$, so that $\hat{x}(1) = b(1)y(1)$. The value of b(1) comes from the orthogonality condition
$E\{[x(1) - \hat{x}(1)]y(1)\} = 0$, where $y(1) = x(1) + n(1)$ and therefore $\hat{x}(1) = b(1)[x(1) + n(1)]$. Thus the orthogonality condition:
$E\{[(1 - b(1))x(1) - b(1)n(1)][x(1) + n(1)]\} = (1 - b(1))\sigma_x^2 - b(1)\sigma_n^2 = 0$,
i.e. $b(1) = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_n^2}$. If e.g. $\sigma_n^2 = \sigma_w^2$ and $a = \frac{1}{\sqrt{2}}$, then $\sigma_x^2 = \frac{\sigma_w^2}{1 - a^2} = 2\sigma_n^2$, and thus $b(1) = \frac{2}{3}$. Using (159): $E\{e^2(1)\} = \frac{2}{3}\sigma_n^2$. Having this we can use (162):
$b(2) = \frac{a^2E\{e^2(1)\} + \sigma_w^2}{a^2E\{e^2(1)\} + \sigma_w^2 + \sigma_n^2} = \frac{\frac{1}{3} + 1}{\frac{1}{3} + 2} = \frac{4}{7} \approx 0.57$.
With (159) and (162): $E\{e^2(2)\} = \frac{4}{7}\sigma_n^2 \approx 0.57\sigma_n^2$, $\quad b(3) = \frac{9}{16}$, $\quad E\{e^2(3)\} = \frac{9}{16}\sigma_n^2 \approx 0.562\sigma_n^2$.
If we continue this iteration, we reach a steady state where $E\{e^2(k+1)\} = E\{e^2(k)\} = p$. Using (159) and (162) with this p we get $p^2 + 3p\sigma_n^2 - 2\sigma_n^4 = 0$, which holds for $p \approx 0.56\sigma_n^2$. It can be seen that the third iteration step already gives a result quite close to the steady state.
Vector Kalman filter:
The system model: $x(n) = Ax(n-1) + w(n-1)$, and the observation: $y(n) = Cx(n) + n(n)$. Both the system noise and the observation noise have zero mean and are white. Their correlation matrices: $Q(n) = E\{w(n)w^T(n)\}$ (replaces $\sigma_w^2$), $R(n) = E\{n(n)n^T(n)\}$ (replaces $\sigma_n^2$). The optimal recursive filter is:
$\hat{x}(n) = A\hat{x}(n-1) + K(n)(y(n) - CA\hat{x}(n-1))$
$K(n) = P_1(n)C^T[CP_1(n)C^T + R(n)]^{-1}$, where
$P_1(n) = AP(n-1)A^T + Q(n-1)$  (165)
$P(n) = P_1(n) - K(n)CP_1(n) = (I - K(n)C)P_1(n)$
$P_1(n+1) = A(I - K(n)C)P_1(n)A^T + Q(n)$
K(n) denotes the so-called Kalman gain, P(n) is the covariance matrix of the error, and $P_1(n)$ is an auxiliary matrix to simplify the computations.
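The following NumPy sketch (not from the notes; the matrices A, C, Q, R and the initial P₁ are arbitrary illustrative choices) implements one possible realization of the recursion (165):

```python
# Minimal sketch (assumed example): the vector Kalman filter (165) for a
# simple 2-state model observed through its first state.
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[1.0, 1.0],
              [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)           # E{w w^T}
R = np.array([[0.5]])          # E{n n^T}

x = np.zeros(2)
x_hat = np.zeros(2)
P1 = np.eye(2)                 # P1(0): initial error covariance (assumption)

for n in range(100):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)        # system
    y = C @ x + rng.multivariate_normal(np.zeros(1), R)        # observation

    K = P1 @ C.T @ np.linalg.inv(C @ P1 @ C.T + R)             # Kalman gain K(n)
    x_hat = A @ x_hat + K @ (y - C @ A @ x_hat)                # filter update
    P1 = A @ (np.eye(2) - K @ C) @ P1 @ A.T + Q                # P1(n+1)

print(x, x_hat)   # the estimate tracks the state
```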
Vector Kalman predictor:
The system model: $x(n+1) = Ax(n) + w(n)$, and the observation: $y(n) = Cx(n) + n(n)$. Both the system noise and the observation noise have zero mean and are white. Their correlation matrices: $Q(n) = E\{w(n)w^T(n)\}$ (replaces $\sigma_w^2$), $R(n) = E\{n(n)n^T(n)\}$ (replaces $\sigma_n^2$). The optimal recursive predictor is:
$\hat{x}(n+1) = A\hat{x}(n) + G(n)(y(n) - C\hat{x}(n))$,  (166)
and we are looking for the G(n) which minimizes the following covariance matrix:
$P(n+1) = E\{[x(n+1) - \hat{x}(n+1)][x(n+1) - \hat{x}(n+1)]^T\} = E\{e(n+1)e^T(n+1)\}$  (167)
Let's compute the derivative of (167) with respect to G(n). The condition of the optimum is:
$\frac{\partial P(n+1)}{\partial G(n)} = -2E\{[x(n+1) - \hat{x}(n+1)][y(n) - C\hat{x}(n)]^T\} = 0$.  (168)
If we substitute the equations of the system model and the predictor into (168), then:
$E\{[Ax(n) + w(n) - A\hat{x}(n) - G(n)(Cx(n) + n(n) - C\hat{x}(n))][Cx(n) + n(n) - C\hat{x}(n)]^T\} = E\{[(A - G(n)C)e(n) + w(n) - G(n)n(n)][Ce(n) + n(n)]^T\} = [A - G(n)C]P(n)C^T - G(n)R(n) = 0$  (169)
In (169) we utilized our a priori knowledge concerning:
$E\{e(n)n^T(n)\} = E\{w(n)e^T(n)\} = E\{n(n)e^T(n)\} = E\{w(n)n^T(n)\} = 0$  (170)
The predictor gain from (169):
$G(n) = AP(n)C^T[CP(n)C^T + R(n)]^{-1}$.  (171)
With this, the covariance matrix (167) of the estimation error becomes:
$P(n+1) = E\{[(A - G(n)C)e(n) + w(n) - G(n)n(n)][(A - G(n)C)e(n) + w(n) - G(n)n(n)]^T\} = [A - G(n)C]P(n)[A - G(n)C]^T + Q(n) + G(n)R(n)G^T(n)$.  (172)
Remark: From (172) it can be seen that the covariance matrix of the estimator changes as a result of three effects:
1. Multiplication from the left and from the right by $[A - G(n)C]$ and its transpose, respectively: the expected effect is error reduction. Note that $[A - G(n)C]$ is preferably of contractive nature, as it was the case with observers; the only difference is that here the error is propagated in a quadratic sense.
2. The covariance matrix of the system noise is an additive component in (172), which increases the covariance of the estimate. The mechanism behind this effect is that the discrete w(n) values perturb x(n+1), i.e. the predicted value.
3. The covariance matrix of the observation noise also increases the covariance of the estimate, because the discrete n(n) values, via G(n), perturb $\hat{x}(n+1)$, i.e. the predicted estimate.
Remark: (172) can be given a condensed form, because by expanding its first component we have:
$P(n+1) = AP(n)A^T - AP(n)C^TG^T(n) - G(n)CP(n)A^T + G(n)CP(n)C^TG^T(n) + G(n)R(n)G^T(n) + Q(n)$.  (173)
If we combine the fourth and the fifth components and use (171), we have:
$G(n)[CP(n)C^T + R(n)]G^T(n) = AP(n)C^TG^T(n)$, which equals the second component of (173) with opposite sign. Only the first, third and sixth components remain, thus:
$P(n+1) = [A - G(n)C]P(n)A^T + Q(n)$  (174)
Summary:
Applying the system model $x(n+1) = Ax(n) + w(n)$ and the observation $y(n) = Cx(n) + n(n)$, the Kalman predictor is:
$\hat{x}(n+1) = A\hat{x}(n) + G(n)(y(n) - C\hat{x}(n))$
$G(n) = AP(n)C^T[CP(n)C^T + R(n)]^{-1}$  (175)
$P(n+1) = (A - G(n)C)P(n)A^T + Q(n)$
Remark: The optimal recursive estimator can be used for model fitting as well. The fitted model is again the adaptive linear combinator. Applying the notation used previously (the state is the parameter vector W(n), so that A = I and C = X^T(n)) and supposing Q(n) = 0, (175) takes the following form:
$W(n+1) = W(n) + G(n)(y(n) - X^T(n)W(n)) = W(n) + G(n)e(n)$  (176)
$G(n) = P(n)X(n)[X^T(n)P(n)X(n) + R(n)]^{-1}$  (177)
$P(n+1) = (I - G(n)X^T(n))P(n)$  (178)
where P(n) stands for the covariance matrix of the parameter estimation error, $P(n) = E\{V(n)V^T(n)\}$, and X(n) is the so-called regression vector. It is worth comparing these with equations (108)-(121).
Remark:
It is interesting to compare the observer on Figure 2 with the Kalman predictor on Figure 34.

Figure 34: Block diagram of the vector Kalman predictor (175).


1. Note that the first component of (172) is rather similar to (5), the difference is only the
quadratic nature. Obviously here it also valid that to reduce error, the state transition matrix
of the error systems F (n) A G (n)C should be contractive.
2. If the noise processes are stationary, then Q(n)=Q, R(n)=R.
3. Please note how the model of the observed system is built into the observer.
