
1/28

Review of Probability
2/28
Random Variable
! Definition
Numerical characterization of outcome of a random
event
!Examples
1) Number on rolled dice
2) Temperature at specified time of day
3) Stock Market at close
4) Height of wheel going over a rocky road
3/28
Random Variable
!Non-examples
1) Heads or Tails on coin
2) Red or Black ball from urn
! Basic Idea: we don't know how to completely determine what value will occur.
We can only specify probabilities of RV values occurring.
(But we can make the non-examples above into RVs, e.g., by mapping Heads/Tails or Red/Black to numbers.)
4/28
Two Types of Random Variables
Random Variable
  Discrete RV: number on a die, stock prices
  Continuous RV: temperature, wheel height
5/28
Given Continuous RV X
What is the probability that X = x_o?
Oddity: P(X = x_o) = 0 ... otherwise the probability would sum to infinity.
Need to think of the Probability Density Function (PDF) p_X(x) of RV X:

P(x_o < X < x_o + \Delta) = \int_{x_o}^{x_o+\Delta} p_X(x)\,dx      (= the shown area under p_X(x))

PDF for Continuous RV
6/28
Most Commonly Used PDF: Gaussian
A RV X with the following PDF is called a Gaussian RV:

p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-m)^2/(2\sigma^2)}

m & σ are parameters of the Gaussian PDF:
m = Mean of RV X
σ = Standard Deviation of RV X (Note: σ > 0)
σ² = Variance of RV X
Notation: When X has a Gaussian PDF we say X ~ N(m, σ²)
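To make the formula concrete, here is a minimal sketch (assuming NumPy/SciPy are available; the values m = 1 and sigma = 2 are illustrative assumptions, not from the notes) that evaluates the Gaussian PDF both directly and with scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

m, sigma = 1.0, 2.0                       # assumed example mean and std. dev.
x = np.linspace(m - 4*sigma, m + 4*sigma, 5)

# Direct evaluation of p_X(x) = (1/sqrt(2*pi*sigma^2)) * exp(-(x-m)^2/(2*sigma^2))
p_direct = np.exp(-(x - m)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Same PDF via SciPy's N(m, sigma^2) object
p_scipy = norm(loc=m, scale=sigma).pdf(x)

print(np.allclose(p_direct, p_scipy))     # True
```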
7/28
" Generally: take the noise to be Zero Mean
p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/(2\sigma^2)}

Zero-Mean Gaussian PDF
8/28
[Figure: two zero-mean Gaussian PDFs p_X(x) vs. x]
Small σ → Small Variability (Small Uncertainty)
Large σ → Large Variability (Large Uncertainty)
Area within ±1σ of the mean (x = m) = 0.683 = 68.3%
Effect of Variance on Gaussian PDF
9/28
"Central Limit theorem (CLT)
The sum of N independent RVs has a PDF that tends to be Gaussian as N → ∞.
" So What! Here is what: Electronic systems generate internal noise due to the random motion of electrons in electronic components. The noise is the result of summing the random effects of lots of electrons.
CLT applies ⇒ Gaussian Noise
Why Is Gaussian Used?
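A quick numerical illustration of the CLT statement above; this is a sketch (the choices of 30 summed uniform RVs and 100,000 trials are arbitrary assumptions) showing that the sum behaves like a Gaussian, including the 68.3% one-sigma fact from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 30, 100_000                    # arbitrary choices for the illustration

# Sum N independent uniform(-0.5, 0.5) RVs, many times
sums = rng.uniform(-0.5, 0.5, size=(trials, N)).sum(axis=1)

# Each uniform has mean 0 and variance 1/12, so the sum has mean 0, variance N/12
print(sums.mean(), sums.var())             # ~0 and ~N/12

# Fraction of outcomes within one std. dev. of the mean -> ~0.683 if Gaussian
sigma = np.sqrt(N / 12)
print(np.mean(np.abs(sums) < sigma))       # ~0.68
```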
10/28
Describes probabilities of joint events concerning X and Y. For example, the probability that X lies in interval [a,b] and Y lies in interval [c,d] is given by:

Pr\{(a < X < b) \text{ and } (c < Y < d)\} = \int_a^b\!\!\int_c^d p_{XY}(x,y)\,dy\,dx

[Figure: surface plot of the joint PDF p_XY(x,y); graph from B. P. Lathi's book: Modern Digital & Analog Communication Systems]
Joint PDF of RVs X and Y
11/28
When you have two RVs we often ask: What is the PDF of Y if X is constrained to take on a specific value?
In other words: What is the PDF of Y conditioned on the fact that X is constrained to take on a specific value?
Ex.: Husband's salary Y conditioned on wife's salary X = $100K?
First find all wives who make EXACTLY $100K... how are their husbands' salaries distributed?
This depends on the joint PDF because there are two RVs... but it should only depend on the slice of the joint PDF at X = $100K.
Now... we have to adjust this to account for the fact that the joint PDF (even its slice) reflects how likely it is that X = $100K will occur (e.g., if X = 10^5 is unlikely then p_XY(10^5, y) will be small); so if we divide by p_X(10^5) we adjust for this.
Conditional PDF of Two RVs
12/28
Conditional PDF (cont.)
Thus, the conditional PDFs are defined as ("slice and normalize"):

p_{Y|X}(y|x) = \frac{p_{XY}(x,y)}{p_X(x)},\ \ p_X(x) \neq 0;\qquad 0 \text{ otherwise}      (x is held fixed)

p_{X|Y}(x|y) = \frac{p_{XY}(x,y)}{p_Y(y)},\ \ p_Y(y) \neq 0;\qquad 0 \text{ otherwise}      (y is held fixed)

[Figure: the conditional PDF as a normalized slice of the joint PDF (y is held fixed); graph from B. P. Lathi's book: Modern Digital & Analog Communication Systems]
13/28
Independence should be thought of as saying that neither RV impacts the other statistically; thus, the values that one will likely take should be irrelevant to the value that the other has taken.
In other words: conditioning doesn't change the PDF!!!

p_{Y|X}(y\,|\,X = x) = \frac{p_{XY}(x,y)}{p_X(x)} = p_Y(y)
\qquad
p_{X|Y}(x\,|\,Y = y) = \frac{p_{XY}(x,y)}{p_Y(y)} = p_X(x)
Independent RVs
14/28
Independent and Dependent Gaussian PDFs
[Figure: contours of p_XY(x,y) for three cases:]
- Independent (zero mean)
- Independent (non-zero mean): different slices give the same normalized curves
- Dependent: different slices give different normalized curves
If X & Y are independent, then the contour ellipses are aligned with either the x or y axis.
15/28
RVs X & Y are independent if:

p_{XY}(x,y) = p_X(x)\,p_Y(y)

Here's why:

p_{Y|X}(y\,|\,X = x) = \frac{p_{XY}(x,y)}{p_X(x)} = \frac{p_X(x)\,p_Y(y)}{p_X(x)} = p_Y(y)

An Independent RV Result
16/28
Characterizing RVs
! The PDF tells everything about an RV
  ... but sometimes that is more than we need/know
! So we make do with a few characteristics:
  Mean of an RV (describes the centroid of the PDF)
  Variance of an RV (describes the spread of the PDF)
  Correlation of RVs (describes the tilt of the joint PDF)
Mean = Average = Expected Value
Symbolically: E{X}
17/28
Motivation First w/ Data Analysis View
Consider RV X = score on a test.   Data: x_1, x_2, ..., x_N
Possible values of RV X:  V_0 = 0, V_1 = 1, V_2 = 2, ..., V_100 = 100
N_i = # of scores of value V_i;   N = total # of scores = \sum_i N_i

Test average:

\bar{x} = \frac{\sum_n x_n}{N} = \frac{N_0 V_0 + N_1 V_1 + \cdots + N_{100} V_{100}}{N} = \sum_{i=0}^{100} V_i\,\frac{N_i}{N}

The ratio N_i/N is the statistics-side counterpart of the probability P(X = V_i).
This is called the "Data Analysis View"... but it motivates the "Data Modeling View".
Motivating Idea of Mean of RV
18/28
Theoretical View of Mean
" For Discrete Random Variables, the Data Analysis View leads to Probability Theory:

E\{X\} = \sum_i x_i\,P_X(x_i)        (P_X = probability function)

" This motivates the form for a Continuous RV:

E\{X\} = \int_{-\infty}^{\infty} x\,p_X(x)\,dx        (p_X = probability density function)

Data-modeling shorthand notation:  E\{X\} = \bar{X}
19/28
Aside: Probability vs. Statistics
Probability Theory: Given a PDF model, describe how the data will likely behave.
Statistics: Given a set of data, determine how the data did behave.

Probability (there is no DATA here!!! The PDF models how the data will likely behave):

E\{X\} = \int_{-\infty}^{\infty} x\,p_X(x)\,dx        (x is a dummy variable; p_X(x) is the PDF)

Statistics (there is no PDF here!!! The statistic measures how the data did behave):

Avg = \frac{1}{N}\sum_{i=1}^{N} x_i        (data)

The two views are linked by the Law of Large Numbers.
20/28
Variance: characterizes how much you expect the RV to deviate around the mean.
There are similar Data vs. Theory views here... but let's go right to the theory!!

Variance:

\sigma_x^2 = E\{(X - m_x)^2\} = \int_{-\infty}^{\infty} (x - m_x)^2\,p_X(x)\,dx

Note: if zero mean,

\sigma^2 = E\{X^2\} = \int_{-\infty}^{\infty} x^2\,p_X(x)\,dx

Variance of RV
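A small sketch tying the "data analysis" and "data modeling" views of mean and variance together (the distribution N(3, 2²) and the sample size are arbitrary assumptions): sample statistics computed from data approach the theoretical E{X} and σ²:

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma, N = 3.0, 2.0, 100_000              # assumed example values

x = rng.normal(m, sigma, size=N)              # "data analysis view": we only have samples

sample_mean = x.mean()                        # estimates E{X} = m
sample_var = np.mean((x - sample_mean)**2)    # estimates sigma^2 = E{(X - m)^2}

print(sample_mean, sample_var)                # ~3.0 and ~4.0 (law of large numbers)
```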
21/28
Motivating Idea of Correlation
Consider a random experiment that observes the outcomes of two RVs.
Example: 2 RVs X and Y representing height and weight, respectively.
[Figure: scatter plot of (x, y) pairs; the points cluster along an upward-sloping line → positively correlated.]
Motivate first w/ the Data Analysis View (next slide).
22/28
Illustrating 3 Main Types of Correlation
[Figure: three scatter plots of (x, y) data:]
Positive Correlation ("Best Friends"), e.g., GPA & Starting Salary
Negative Correlation ("Worst Enemies"), e.g., Student Loans & Parents' Salary
Zero Correlation, i.e. uncorrelated ("Complete Strangers"), e.g., Height & $ in Pocket

Data Analysis View:

C_{xy} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
23/28
To capture this, define Covariance:

\sigma_{XY} = E\{(X - \bar{X})(Y - \bar{Y})\} = \iint (x - \bar{X})(y - \bar{Y})\,p_{XY}(x,y)\,dx\,dy

If the RVs are both zero-mean:  \sigma_{XY} = E\{XY\}
If X = Y:  \sigma_{XY} = \sigma_X^2 = \sigma_Y^2
If X & Y are independent, then:  \sigma_{XY} = 0

Prob. Theory View of Correlation
24/28
If \sigma_{XY} = E\{(X - \bar{X})(Y - \bar{Y})\} = 0, then we say that X and Y are uncorrelated.
If \sigma_{XY} = 0, then E\{XY\} = \bar{X}\,\bar{Y}.   (E{XY} is called the Correlation of X & Y.)
So RVs X and Y are said to be uncorrelated if \sigma_{XY} = 0, or equivalently if E\{XY\} = E\{X\}E\{Y\}.
25/28
X & Y Independent  ⇒  X & Y Uncorrelated
Uncorrelated  does NOT imply  Independence
INDEPENDENCE IS A STRONGER CONDITION !!!!

Independence:  p_{XY}(x,y) = p_X(x)\,p_Y(y)      (PDFs separate)
Uncorrelated:  E\{XY\} = E\{X\}\,E\{Y\}          (means separate)

Independence vs. Uncorrelated
26/28
Covariance:   \sigma_{XY} = E\{(X - \bar{X})(Y - \bar{Y})\}
Correlation:  E\{XY\}
(These are the same if the RVs are zero mean.)

Correlation Coefficient:   \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X\,\sigma_Y},\qquad -1 \le \rho_{XY} \le 1

Confusing Covariance and Correlation Terminology
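A minimal numerical sketch of covariance and the correlation coefficient (the construction Y = 0.8X + noise is an arbitrary assumption chosen so that ρ ≈ 0.8):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50_000

# Build two correlated zero-mean RVs: Y = 0.8*X + independent noise
x = rng.normal(0, 1, N)
y = 0.8 * x + 0.6 * rng.normal(0, 1, N)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # sample sigma_XY
rho = cov_xy / (x.std() * y.std())                  # sample correlation coefficient
print(cov_xy, rho)                                  # both ~0.8 here
print(np.corrcoef(x, y)[0, 1])                      # NumPy's built-in estimate agrees
```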
27/28
Correlation Matrix: for the random vector x = [X_1 \cdots X_N]^T,

R_x = E\{\mathbf{x}\mathbf{x}^T\} = \begin{bmatrix} E\{X_1 X_1\} & E\{X_1 X_2\} & \cdots & E\{X_1 X_N\} \\ E\{X_2 X_1\} & E\{X_2 X_2\} & \cdots & E\{X_2 X_N\} \\ \vdots & \vdots & \ddots & \vdots \\ E\{X_N X_1\} & E\{X_N X_2\} & \cdots & E\{X_N X_N\} \end{bmatrix}

Covariance Matrix:

C_x = E\{(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^T\}

Covariance and Correlation For Random Vectors
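A short sketch (the 3×3 covariance matrix below is an arbitrary positive-definite assumption) showing the sample versions of R_x and C_x for a random vector:

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 100_000

# Assumed example covariance for a zero-mean 3-dim random vector
C_true = np.array([[2.0, 0.5, 0.0],
                   [0.5, 1.0, 0.3],
                   [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=C_true, size=trials)

R_hat = (X.T @ X) / trials            # sample R_x = E{x x^T}
C_hat = np.cov(X, rowvar=False)       # sample C_x (subtracts the sample mean)
print(np.round(C_hat, 2))             # close to C_true; R_hat ~ C_hat since zero mean
```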
28/28
E\{X + Y\} = E\{X\} + E\{Y\}

var\{X+Y\} = E\{(X + Y - \bar{X} - \bar{Y})^2\}        (let z = X + Y, so \bar{z} = \bar{X} + \bar{Y})
           = E\{((X-\bar{X}) + (Y-\bar{Y}))^2\}
           = E\{(X-\bar{X})^2\} + E\{(Y-\bar{Y})^2\} + 2E\{(X-\bar{X})(Y-\bar{Y})\}
           = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}

var\{X+Y\} = \sigma_X^2 + \sigma_Y^2,   if X & Y are uncorrelated

E\{f(X)\} = \int_{-\infty}^{\infty} f(x)\,p_X(x)\,dx

E\{aX\} = a\,E\{X\}
var\{aX\} = a^2\,\sigma_X^2

A Few Properties of Expected Value
1/45
Review of
Matrices and Vectors
2/45
Definition of Vector: A collection of complex or real numbers, generally put in a column:

\mathbf{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_N \end{bmatrix} = [v_1 \cdots v_N]^T      (T denotes transpose)

Definition of Vector Addition: add element-by-element:

\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 \\ \vdots \\ a_N \end{bmatrix} + \begin{bmatrix} b_1 \\ \vdots \\ b_N \end{bmatrix} = \begin{bmatrix} a_1 + b_1 \\ \vdots \\ a_N + b_N \end{bmatrix}
Vectors & Vector Spaces
3/45
Definition of Scalar: A real or complex number.
If the vectors of interest are complex valued then the set of
scalars is taken to be complex numbers; if the vectors of
interest are real valued then the set of scalars is taken to be
real numbers.
Multiplying a Vector by a Scalar:

\alpha\,\mathbf{a} = \alpha\begin{bmatrix} a_1 \\ \vdots \\ a_N \end{bmatrix} = \begin{bmatrix} \alpha a_1 \\ \vdots \\ \alpha a_N \end{bmatrix}

... changes the vector's length if |α| ≠ 1
... reverses its direction if α < 0
4/45
Arithmetic Properties of Vectors: vector addition and scalar multiplication exhibit the following properties, pretty much like the real numbers do.
Let x, y, and z be vectors of the same dimension and let α and β be scalars; then the following properties hold:
1. Commutativity:   x + y = y + x;    αx = xα
2. Associativity:   (x + y) + z = x + (y + z);    (αβ)x = α(βx)
3. Distributivity:  α(x + y) = αx + αy;    (α + β)x = αx + βx
4. Scalar Unity & Scalar Zero:   1·x = x;    0·x = 0, where 0 is the zero vector of all zeros
5/45
Definition of a Vector Space: A set V of N-dimensional vectors
(with a corresponding set of scalars) such that the set of vectors
is:
(i) closed under vector addition
(ii) closed under scalar multiplication
In other words:
addition of vectors gives another vector in the set
multiplying a vector by a scalar gives another vector in the set
Note: this means that ANY linear combination of vectors in the
space results in a vector in the space
If v_1, v_2, and v_3 are all vectors in a given vector space V, then

\mathbf{v} = \alpha_1\mathbf{v}_1 + \alpha_2\mathbf{v}_2 + \alpha_3\mathbf{v}_3 = \sum_{i=1}^{3}\alpha_i\mathbf{v}_i

is also in the vector space V.
6/45
Axioms of Vector Space: If V is a set of vectors satisfying the above
definition of a vector space then it satisfies the following axioms:
1. Commutativity (see above)
2. Associativity (see above)
3. Distributivity (see above)
4. Unity and Zero Scalar (see above)
5. Existence of an Additive Identity any vector space V must
have a zero vector
6. Existence of Negative Vector: For every vector v in V its
negative must also be in V
Soa vector space is nothing more than a set of vectors with
an arithmetic structure
7/45
Def. of Subspace: Given a vector space V, a subset of vectors in V that itself is closed under vector addition and scalar multiplication (using the same set of scalars) is called a subspace of V.
Examples:
1. The space R^2 is a subspace of R^3.
2. Any plane in R^3 that passes through the origin is a subspace.
3. Any line passing through the origin in R^2 is a subspace of R^2.
4. The set R^2 is NOT a subspace of C^2 because R^2 isn't closed under complex scalars (a subspace must retain the original space's set of scalars).
8/45
Length of a Vector (Vector Norm): For any vector v in C^N we define its length (or norm) to be

\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{N}|v_i|^2}
\qquad\text{so that}\qquad
\|\mathbf{v}\|_2^2 = \sum_{i=1}^{N}|v_i|^2

Properties of Vector Norm:
\|\alpha\mathbf{v}\|_2 = |\alpha|\,\|\mathbf{v}\|_2
\|\mathbf{v}_1 + \mathbf{v}_2\|_2 \le \|\mathbf{v}_1\|_2 + \|\mathbf{v}_2\|_2
\|\mathbf{v}\|_2 < \infty  for all  \mathbf{v} \in C^N
\|\mathbf{v}\|_2 = 0  iff  \mathbf{v} = \mathbf{0}

Geometric Structure of Vector Space
9/45
Distance Between Vectors: the distance between two vectors in a vector space with the two-norm is defined by:

d(\mathbf{v}_1, \mathbf{v}_2) = \|\mathbf{v}_1 - \mathbf{v}_2\|_2

Note that:  d(\mathbf{v}_1, \mathbf{v}_2) = 0  iff  \mathbf{v}_1 = \mathbf{v}_2
[Figure: two vectors v_1 and v_2 with the difference vector v_1 - v_2 between their tips.]
10/45
Angle Between Vectors & Inner Product:
Motivate the idea in R^2: let u = [1  0]^T lie along the x-axis and let v = [A\cos\theta\ \ A\sin\theta]^T make angle θ with u, where A = \|\mathbf{v}\|_2.
Note that:

\sum_i u_i v_i = 1\cdot A\cos\theta + 0\cdot A\sin\theta = A\cos\theta = \|\mathbf{v}\|_2\cos\theta

Clearly we see that this sum gives a measure of the angle between the vectors.
Now we generalize this idea!
11/45
Inner Product Between Vectors:
Define the inner product between two complex vectors in C^N by:

\langle\mathbf{u},\mathbf{v}\rangle = \sum_{i=1}^{N} u_i\,v_i^*

Properties of Inner Products:
1. Impact of Scalar Multiplication:   \langle\alpha\mathbf{u},\mathbf{v}\rangle = \alpha\langle\mathbf{u},\mathbf{v}\rangle;    \langle\mathbf{u},\beta\mathbf{v}\rangle = \beta^*\langle\mathbf{u},\mathbf{v}\rangle
2. Impact of Vector Addition:   \langle\mathbf{u}+\mathbf{z},\mathbf{v}\rangle = \langle\mathbf{u},\mathbf{v}\rangle + \langle\mathbf{z},\mathbf{v}\rangle;    \langle\mathbf{u},\mathbf{v}+\mathbf{w}\rangle = \langle\mathbf{u},\mathbf{v}\rangle + \langle\mathbf{u},\mathbf{w}\rangle
3. Linking Inner Product to Norm:   \|\mathbf{v}\|_2^2 = \langle\mathbf{v},\mathbf{v}\rangle
4. Schwarz Inequality:   |\langle\mathbf{u},\mathbf{v}\rangle| \le \|\mathbf{u}\|_2\,\|\mathbf{v}\|_2
5. Inner Product and Angle:   \cos(\theta) = \dfrac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|_2\,\|\mathbf{v}\|_2}      (look back at the previous page!)
12/45
Inner Product, Angle, and Orthogonality:

\cos(\theta) = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|_2\,\|\mathbf{v}\|_2}

(i) This lies between -1 and +1;
(ii) It measures the directional alikeness of u and v:
  = +1 when u and v point in the same direction
  = 0 when u and v are at a right angle
  = -1 when u and v point in opposite directions
Two vectors u and v are said to be orthogonal when <u,v> = 0.
If, in addition, they each have unit length they are orthonormal.
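A tiny sketch of the angle/orthogonality idea in R^2 (the example vectors are arbitrary assumptions):

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([3.0, 3.0])                 # arbitrary example vectors in R^2

inner = np.dot(u, v)                     # <u, v> (real case)
cos_theta = inner / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_theta)                         # ~0.707 -> angle of 45 degrees

w = np.array([0.0, 2.0])
print(np.dot(u, w))                      # 0 -> u and w are orthogonal
```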
13/45
Can we find a set of "prototype" vectors {v_1, v_2, ..., v_M} from which we can build all other vectors in some given vector space V by using linear combinations of the v_i?

\mathbf{v} = \sum_{k=1}^{M}\alpha_k\mathbf{v}_k
\qquad
\mathbf{u} = \sum_{k=1}^{M}\beta_k\mathbf{v}_k
\qquad\text{(same ingredients... just different amounts of them!!!)}

What we want to be able to do is get any vector just by changing the "amounts"... To do this requires that the set of prototype vectors {v_1, v_2, ..., v_M} satisfy certain conditions.
We'd also like to have the smallest number of members in the set of prototype vectors.
Building Vectors From Other Vectors
14/45
Span of a Set of Vectors: A set of vectors {v_1, v_2, ..., v_M} is said to span the vector space V if it is possible to write each vector v in V as a linear combination of vectors from the set:

\mathbf{v} = \sum_{k=1}^{M}\alpha_k\mathbf{v}_k

This property establishes whether there are enough vectors in the proposed prototype set to build all possible vectors in V.
It is clear that:
1. We need at least N vectors to span C^N or R^N... but not just any N vectors.
2. Any set of N mutually orthogonal vectors spans C^N or R^N (a set of vectors is mutually orthogonal if all pairs are orthogonal).
[Examples in R^2: two collinear vectors do not span R^2; two non-collinear vectors span R^2.]
15/45
Linear Independence: A set of vectors {v_1, v_2, ..., v_M} is said to be linearly independent if none of the vectors in it can be written as a linear combination of the others.
If a set of vectors is linearly dependent then there is "redundancy" in the set... it has more vectors than needed to be a prototype set!
For example, say that we have a set of four vectors {v_1, v_2, v_3, v_4} and let's say that we know that we can build v_2 from v_1 and v_3... then every vector we can build from {v_1, v_2, v_3, v_4} can also be built from only {v_1, v_3, v_4}.
It is clear that:
1. In C^N or R^N we can have no more than N linearly independent vectors.
2. Any set of mutually orthogonal vectors is linearly independent (a set of vectors is mutually orthogonal if all pairs are orthogonal).
[Examples in R^2: two non-collinear vectors are linearly independent; two collinear vectors are not.]
16/45
Basis of a Vector Space: A basis of a vector space is a set of linearly independent vectors that span the space.
"Span" says there are enough vectors to build everything.
"Linear Indep." says that there are not more than needed.
Orthonormal (ON) Basis: a basis whose vectors are orthonormal to each other (all pairs of basis vectors are orthogonal and each basis vector has unit norm).
Fact: Any set of N linearly independent vectors in C^N (R^N) is a basis of C^N (R^N).
Dimension of a Vector Space: The number of vectors in any basis for a vector space is said to be the dimension of the space.
Thus, C^N and R^N each have dimension of N.
17/45
Fact: For a given basis {v_1, v_2, ..., v_N}, the expansion of a vector v in V is unique. That is, for each v there is only one, unique set of coefficients {\alpha_1, \alpha_2, ..., \alpha_N} such that

\mathbf{v} = \sum_{k=1}^{N}\alpha_k\mathbf{v}_k

In other words, this expansion or decomposition is unique.
Thus, for a given basis we can make a 1-to-1 correspondence between the vector v and the coefficients {\alpha_1, ..., \alpha_N}. We can write the coefficients as a vector, too:

\boldsymbol{\alpha} = [\alpha_1 \cdots \alpha_N]^T
\qquad
\mathbf{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_N \end{bmatrix}
\ \overset{\text{1-to-1}}{\longleftrightarrow}\
\boldsymbol{\alpha} = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}

Expansion can be viewed as a mapping (or transformation) from vector v to vector α.
We can view this transform as taking us from the original vector space into a new vector space made from the coefficient vectors of all the original vectors.
Expansion and Transformation
18/45
Fact: For any given vector space there are an infinite number of possible basis sets.
The coefficients with respect to any of them provide complete information about a vector, but some of them provide more insight into the vector and are therefore more useful for certain signal processing tasks than others.
Often the key to solving a signal processing problem lies in finding the correct basis to use for the expansion... this is equivalent to finding the right transform. See the discussion coming next linking the DFT to these ideas!!!!
19/45
DFT from Basis Viewpoint:
If we have a discrete-time signal x[n] for n = 0, 1, ..., N-1, define the vector:

\mathbf{x} = [x[0]\ x[1]\ \cdots\ x[N-1]]^T

Define an orthogonal basis from the complex exponentials used in the IDFT:

\mathbf{d}_k = \begin{bmatrix} 1 \\ e^{j2\pi k\cdot 1/N} \\ \vdots \\ e^{j2\pi k(N-1)/N} \end{bmatrix},\qquad k = 0, 1, \ldots, N-1
\qquad\text{(so } \mathbf{d}_0 \text{ is the all-ones vector)}

Then the IDFT equation can be viewed as an expansion of the signal vector x in terms of this complex-sinusoid basis:

\mathbf{x} = \sum_{k=0}^{N-1}\underbrace{\frac{1}{N}X[k]}_{k\text{th coefficient}}\mathbf{d}_k
\qquad\text{with coefficient vector }\ \left[\tfrac{X[0]}{N}\ \tfrac{X[1]}{N}\ \cdots\ \tfrac{X[N-1]}{N}\right]^T
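A small sketch of this "IDFT = basis expansion" view, assuming NumPy's FFT convention (which matches the DFT/IDFT definitions above); the test signal and N = 8 are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 8
x = rng.normal(size=N)                     # arbitrary real test signal

X = np.fft.fft(x)                          # DFT coefficients X[k]

# Rebuild x as a weighted sum of the complex-sinusoid basis vectors d_k
n = np.arange(N)
x_rebuilt = sum((X[k] / N) * np.exp(1j * 2 * np.pi * k * n / N) for k in range(N))

print(np.allclose(x, x_rebuilt.real))      # True: the IDFT is exactly this expansion
```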
20/45
What's So Good About an ON Basis?: Given any basis {v_1, v_2, ..., v_N} we can write any v in V as

\mathbf{v} = \sum_{k=1}^{N}\alpha_k\mathbf{v}_k

Given the vector v, how do we find the α's?
In general... hard! But for an ON basis... easy!!
If {v_1, v_2, ..., v_N} is an ON basis then

\langle\mathbf{v},\mathbf{v}_i\rangle = \Big\langle\sum_{j=1}^{N}\alpha_j\mathbf{v}_j,\ \mathbf{v}_i\Big\rangle = \sum_{j=1}^{N}\alpha_j\underbrace{\langle\mathbf{v}_j,\mathbf{v}_i\rangle}_{=\,\delta[i-j]} = \alpha_i

\alpha_i = \langle\mathbf{v},\mathbf{v}_i\rangle      (ith coefficient = inner product with the ith ON basis vector)

Usefulness of an ON Basis
21/45
Another Good Thing About an ON Basis: they preserve inner products and norms (they are "isometric").
If {v_1, v_2, ..., v_N} is an ON basis and u and v are vectors expanded as

\mathbf{v} = \sum_{k=1}^{N}\alpha_k\mathbf{v}_k
\qquad
\mathbf{u} = \sum_{k=1}^{N}\beta_k\mathbf{v}_k

Then...
1. \langle\mathbf{v},\mathbf{u}\rangle = \langle\boldsymbol{\alpha},\boldsymbol{\beta}\rangle      (preserves inner products)
2. \|\mathbf{v}\|_2 = \|\boldsymbol{\alpha}\|_2  and  \|\mathbf{u}\|_2 = \|\boldsymbol{\beta}\|_2      (preserves norms)

So using an ON basis provides:
- Easy computation via inner products
- Preservation of geometry (closeness, size, orientation, etc.)
Example: DFT Coefficients as Inner Products:
Recall: N-pt. IDFT is an expansion of the signal vector in terms of
N Orthogonal vectors. Thus

=
=
=
=
1
0
/ 2
1
0
*
] [
] [ ] [
] [
N
n
N kn j
N
n
k
k
e n x
n d n x
k X

d x,
See reading notes for some details about normalization issues in this case
23/45
Matrix: an array of (real or complex) numbers organized in rows and columns.
Here is a 3x4 example:

\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{bmatrix}

We'll sometimes view a matrix as being built from its columns; the 3x4 example above could be written as:

\mathbf{A} = [\mathbf{a}_1\,|\,\mathbf{a}_2\,|\,\mathbf{a}_3\,|\,\mathbf{a}_4],\qquad \mathbf{a}_k = [a_{1k}\ a_{2k}\ a_{3k}]^T

We'll take two views of a matrix:
1. Storage for a bunch of related numbers (e.g., Cov. Matrix)
2. A transform (or mapping, or operator) acting on a vector (e.g., DFT, observation matrix, etc., as we'll see)
Matrices
24/45
Matrix as Transform: Our main view of matrices will be as
operators that transform one vector into another vector.
Consider the 3x4 example matrix above. We could use that matrix
to transform the 4-dimensional vector v into a 3-dimensional
vector u:
\mathbf{u} = \mathbf{A}\mathbf{v} = [\mathbf{a}_1\,|\,\mathbf{a}_2\,|\,\mathbf{a}_3\,|\,\mathbf{a}_4]\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} = v_1\mathbf{a}_1 + v_2\mathbf{a}_2 + v_3\mathbf{a}_3 + v_4\mathbf{a}_4
Clearly u is built from the columns
of matrix A; therefore, it must lie
in the span of the set of vectors that
make up the columns of A.
Note that the columns of A are
3-dimensional vectorsso is u.
25/45
Transforming a Vector Space: If we apply A to all the vectors in
a vector space V we get a collection of vectors that are in a
new space called U.
In the 3x4 example matrix above we transformed a 4-dimensional
vector space V into a 3-dimensional vector space U
A 2x3 real matrix A would transform R^3 into R^2.
Facts: If the mapping matrix A is square and its columns are linearly independent then:
(i) the space that vectors in V get mapped to (i.e., U) has the same dimension as V (due to the "square" part)
(ii) this mapping is reversible (i.e., invertible); there is an inverse matrix A^{-1} such that v = A^{-1}u (due to the "square & LI" part)
26/45
Transform = Matrix × Vector: a VERY useful viewpoint for all sorts of signal processing scenarios. In general we can view many linear transforms (e.g., DFT, etc.) in terms of some invertible matrix A operating on a signal vector x to give another vector y:

\mathbf{y}_i = \mathbf{A}\mathbf{x}_i
\qquad\qquad
\mathbf{x}_i = \mathbf{A}^{-1}\mathbf{y}_i

[Figure: A maps x_1 → y_1 and x_2 → y_2; A^{-1} maps back.]
We can think of A and A^{-1} as mapping back and forth between two vector spaces.
27/45
Basis Matrix & Coefficient Vector:
Suppose we have a basis {v_1, v_2, ..., v_N} for a vector space V. Then a vector v in space V can be written as:

\mathbf{v} = \sum_{k=1}^{N}\alpha_k\mathbf{v}_k

Another view of this:

\mathbf{v} = \underbrace{[\mathbf{v}_1\,|\,\mathbf{v}_2\,|\,\cdots\,|\,\mathbf{v}_N]}_{N\times N\ \text{matrix}\ \mathbf{V}}\begin{bmatrix}\alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N\end{bmatrix} = \mathbf{V}\boldsymbol{\alpha}

The "Basis Matrix" V transforms the coefficient vector α into the original vector v.
Matrix View & Basis View
28/45
Three Views of Basis Matrix & Coefficient Vector:
View #1:  \mathbf{v} = \sum_{k=1}^{N}\alpha_k\mathbf{v}_k      Vector v is a linear combination of the columns of basis matrix V.
View #2:  \mathbf{v} = \mathbf{V}\boldsymbol{\alpha}      Matrix V maps vector α into vector v.
View #3:  \boldsymbol{\alpha} = \mathbf{V}^{-1}\mathbf{v}      There is a matrix, V^{-1}, that maps vector v into vector α.

Aside: If a matrix A is square and has linearly independent columns, then A is invertible and A^{-1} exists such that A A^{-1} = A^{-1} A = I, where I is the identity matrix having 1s on the diagonal and zeros elsewhere.

We now have a way to go back-and-forth between the vector v and its coefficient vector α.
29/45
Basis Matrix for ON Basis: we get a special structure!!!
Result: For an ON basis matrix V:   V^{-1} = V^H
(the superscript H denotes "Hermitian transpose", which consists of transposing the matrix and conjugating the elements)
To see this:

\mathbf{V}^H\mathbf{V} = \begin{bmatrix} \langle\mathbf{v}_1,\mathbf{v}_1\rangle & \langle\mathbf{v}_1,\mathbf{v}_2\rangle & \cdots & \langle\mathbf{v}_1,\mathbf{v}_N\rangle \\ \langle\mathbf{v}_2,\mathbf{v}_1\rangle & \langle\mathbf{v}_2,\mathbf{v}_2\rangle & \cdots & \langle\mathbf{v}_2,\mathbf{v}_N\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle\mathbf{v}_N,\mathbf{v}_1\rangle & \langle\mathbf{v}_N,\mathbf{v}_2\rangle & \cdots & \langle\mathbf{v}_N,\mathbf{v}_N\rangle \end{bmatrix}
= \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \mathbf{I}

The inner products are 0 or 1 because this is an ON basis.
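A short sketch of the ON-basis-matrix result (the basis is generated by orthonormalizing a random complex matrix with QR, which is an assumption made just to get some ON basis to test):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 4

# Build an ON basis for C^N by orthonormalizing a random complex matrix (QR)
M = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
V, _ = np.linalg.qr(M)                          # columns of V are orthonormal

print(np.allclose(V.conj().T @ V, np.eye(N)))   # V^H V = I, so V^{-1} = V^H

# Coefficients of an arbitrary vector v in this basis: alpha = V^H v, and v = V alpha
v = rng.normal(size=N) + 1j * rng.normal(size=N)
alpha = V.conj().T @ v
print(np.allclose(V @ alpha, v))                # True
```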
30/45
A unitary matrix is a complex matrix A whose inverse is A^{-1} = A^H.
For the real-valued matrix case we get a special case of unitary: the idea of "unitary matrix" becomes "orthogonal matrix", for which A^{-1} = A^T.
Two Properties of Unitary Matrices: Let U be a unitary matrix and let y_1 = Ux_1 and y_2 = Ux_2.
1. They preserve norms: \|\mathbf{y}_i\| = \|\mathbf{x}_i\|.
2. They preserve inner products: \langle\mathbf{y}_1,\mathbf{y}_2\rangle = \langle\mathbf{x}_1,\mathbf{x}_2\rangle.
That is, the geometry of the old space is preserved by the unitary matrix as it transforms into the new space.
(These are the same as the preservation properties of an ON basis.)
Unitary and Orthogonal Matrices
31/45
DFT from Unitary Matrix Viewpoint:
Consider a discrete-time signal x[n] for n = 0, 1, ..., N-1.
We've already seen the DFT in a basis viewpoint:

\mathbf{x} = \sum_{k=0}^{N-1}\underbrace{\frac{1}{N}X[k]}_{k\text{th coefficient}}\mathbf{d}_k

Now we can view the DFT as a transform from the unitary-matrix viewpoint. Collect the basis vectors into the matrix

\mathbf{D} = [\mathbf{d}_0\,|\,\mathbf{d}_1\,|\,\cdots\,|\,\mathbf{d}_{N-1}] = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & e^{j2\pi\cdot 1\cdot 1/N} & \cdots & e^{j2\pi(N-1)\cdot 1/N} \\ \vdots & \vdots & & \vdots \\ 1 & e^{j2\pi\cdot 1\cdot(N-1)/N} & \cdots & e^{j2\pi(N-1)(N-1)/N} \end{bmatrix}

Then:   DFT:  \tilde{\mathbf{x}} = \mathbf{D}^H\mathbf{x}        IDFT:  \mathbf{x} = \frac{1}{N}\mathbf{D}\tilde{\mathbf{x}}

(Actually D is not unitary, but N^{-1/2} D is unitary... see the reading notes.)
32/45
Geometry Preservation of Unitary Matrix Mappings
Recall... unitary matrices map in such a way that the sizes of vectors and the orientation between vectors is not changed.
[Figure: A maps x_1 → y_1 and x_2 → y_2 (and A^{-1} maps back); lengths and angles are preserved.]
Unitary mappings just rigidly rotate the space.
33/45
[Figure: A maps x_1 → y_1 and x_2 → y_2 (and A^{-1} maps back); lengths and angles are generally changed.]
Effect of Non-Unitary Matrix Mappings
34/45
More on Matrices as Transforms

\mathbf{y} = \mathbf{A}\mathbf{x}      (A is m×n, x is n×1, y is m×1)

We'll limit ourselves here to real-valued vectors and matrices.
A maps any vector x in R^n into some vector y in R^m.
[Figure: x in R^n mapped by A to y in R^m.]
Range(A): Range Space of A = the set of all vectors in R^m that can be reached by the mapping.
y = weighted sum of the columns of A ... so we may only be able to reach certain y's.
Mostly interested in two cases:
1. "Tall" Matrix: m > n
2. "Square" Matrix: m = n
35/45
Range of a Tall Matrix (m > n)
[Figure: x in R^n mapped by A into a subset of R^m.]
The range(A) is a proper subset of R^m.
Proof: Since y is built from the n columns of A, there are not enough columns to form a basis for R^m (they don't span R^m).

Range of a Square Matrix (m = n)
If the columns of A are linearly independent... the range(A) = R^m, because the columns form a basis for R^m.
Otherwise... the range(A) is a proper subset of R^m, because the columns don't span R^m.
36/45
Rank of a Matrix: rank(A) = largest # of linearly independent columns (or rows) of matrix A.
For an m×n matrix we have that rank(A) ≤ min(m,n).
An m×n matrix A has "full rank" when rank(A) = min(m,n).
Example: This matrix has rank of 3 because the 4th column can be written as a combination of the first 3 columns:

\mathbf{A} = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}
37/45
Characterizing Tall Matrix Mappings
We are interested in answering: Given a vector y, what vector x mapped into it via matrix A?    y = Ax

Tall Matrix (m > n) Case:
- If y does not lie in range(A), then there is No Solution.
- If y lies in range(A), then there is a solution (but not necessarily just one unique solution):
  y ∈ range(A), A full rank → One Solution
  y ∈ range(A), A not full rank → Many Solutions
  y ∉ range(A) → No Solution
38/45
Full-Rank Tall Matrix (m > n) Case:    y = Ax
[Figure: x in R^n maps to y in range(A), a subset of R^m.]
For a given y ∈ range(A) there is only one x that maps to it.
This is because the columns of A are linearly independent, and we know from our studies of vector spaces that the coefficient vector of y is unique... x is that coefficient vector.
⇒ By looking at y we can determine which x gave rise to it.
39/45
Non-Full-Rank Tall Matrix (m > n) Case:    y = Ax
[Figure: two different vectors x_1 and x_2 in R^n map to the same y in range(A).]
For a given y ∈ range(A) there is more than one x that maps to it.
This is because the columns of A are linearly dependent, and that redundancy provides several ways to combine them to create y.
⇒ By looking at y we cannot determine which x gave rise to it.
40/45
Characterizing Square Matrix Mappings:    y = Ax
Q: Given any y ∈ R^n, can we find an x ∈ R^n that maps to it?
A: Not always!!!
  A full rank → One Solution
  A not full rank, y ∈ range(A) → Many Solutions
  A not full rank, y ∉ range(A) → No Solution
Careful!!! This is quite a different flow diagram than for the tall-matrix case!!!
When a square A is full rank, then its range covers the complete new space... then y must be in range(A), and because the columns of A are a basis there is a way to build y.
41/45
A Full-Rank Square Matrix is Invertible
A square matrix that has full rank is said to be... nonsingular, invertible.
Then we can find the x that mapped to y using x = A^{-1}y.
Several ways to check if an n×n A is invertible:
1. A is invertible if and only if (iff) its columns (or rows) are linearly independent (i.e., if it is full rank)
2. A is invertible iff det(A) ≠ 0
3. A is invertible if (but not only if) it is positive definite (see later)
4. A is invertible if (but not only if) all its eigenvalues are nonzero
[Diagram: the set of Pos. Def. Matrices sits inside the set of Invertible Matrices.]
42/45
Eigenvalues and Eigenvectors of Square Matrices
If matrix A is n×n, then A maps R^n → R^n.
Q: For a given n×n matrix A, which vectors get mapped into being "almost themselves"???
More precisely... which vectors get mapped to a scalar multiple of themselves???
Even more precisely... which vectors v satisfy the following:

\mathbf{A}\mathbf{v} = \lambda\mathbf{v}

These vectors are special and are called the eigenvectors of A. The scalar λ is that e-vector's corresponding eigenvalue.
[Diagram: input v → output Av.]
43/45
If an n×n real matrix A is symmetric, then:
- e-vectors corresponding to distinct e-values are orthonormal
- e-values are real valued
- A can be decomposed as

\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T,\qquad \mathbf{V} = [\mathbf{v}_1\,|\,\cdots\,|\,\mathbf{v}_n],\ \ \mathbf{V}\mathbf{V}^T = \mathbf{I},\ \ \boldsymbol{\Lambda} = \mathrm{diag}\{\lambda_1,\lambda_2,\ldots,\lambda_n\}

If, further, A is pos. def. (semi-def.), then:
- e-values are positive (non-negative)
- rank(A) = # of non-zero e-values
- Pos. Def. → Full Rank (and therefore invertible)
- Pos. Semi-Def. → Not Full Rank (and therefore not invertible)
When A is P.D., then we can write

\mathbf{A}^{-1} = \mathbf{V}\boldsymbol{\Lambda}^{-1}\mathbf{V}^T,\qquad \boldsymbol{\Lambda}^{-1} = \mathrm{diag}\{1/\lambda_1,1/\lambda_2,\ldots,1/\lambda_n\}

For P.D. A, A^{-1} has the same e-vectors and has reciprocal e-values.
Eigen-Facts for Symmetric Matrices
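A minimal sketch of these eigen-facts on a small symmetric positive-definite matrix (the matrix values are an arbitrary assumption):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                # arbitrary symmetric pos. def. example

lam, V = np.linalg.eigh(A)                # eigh: for symmetric/Hermitian matrices
print(lam)                                # real, positive e-values
print(np.allclose(V @ np.diag(lam) @ V.T, A))                       # A = V Lambda V^T
print(np.allclose(V @ np.diag(1/lam) @ V.T, np.linalg.inv(A)))      # A^{-1}: same V, reciprocal e-values
```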
44/45
Quadratic Forms and Positive-(Semi)Definite Matrices
We'll limit our discussion to real-valued matrices and vectors.
Quadratic Form = matrix form for a 2nd-order multivariate polynomial.
The quadratic form of a matrix A is:

Q_{\mathbf{A}}(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}      (a scalar; A is fixed, x is the variable)

Example (2×2 case):

Q_{\mathbf{A}}(\mathbf{x}) = [x_1\ x_2]\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \sum_{i=1}^{2}\sum_{j=1}^{2} a_{ij}x_i x_j = a_{11}x_1^2 + a_{22}x_2^2 + (a_{12}+a_{21})x_1 x_2
Other Matrix Issues
45/45
The values of the elements of matrix A determine the characteristics of the quadratic form Q_A(x):
- If Q_A(x) ≥ 0 ∀ x ≠ 0, then we say that Q_A(x) is positive semi-definite.
- If Q_A(x) > 0 ∀ x ≠ 0, then we say that Q_A(x) is positive definite.
- Otherwise we say that Q_A(x) is non-definite.
These terms carry over to the matrix that defines the quadratic form:
- If Q_A(x) ≥ 0 ∀ x ≠ 0, then we say that A is positive semi-definite.
- If Q_A(x) > 0 ∀ x ≠ 0, then we say that A is positive definite.
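A quick sketch of the quadratic form and a positive-definiteness check (the example matrix is an arbitrary assumption); the eigenvalue test at the end reflects the eigen-facts slide above (all e-values > 0 for a symmetric PD matrix):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                # arbitrary symmetric example

def quad_form(A, x):
    """Q_A(x) = x^T A x (a scalar)."""
    return float(x @ A @ x)

rng = np.random.default_rng(6)
# Empirical check: Q_A(x) > 0 for many random x != 0 suggests positive definiteness...
print(all(quad_form(A, x) > 0 for x in rng.normal(size=(1000, 2))))
# ...and the eigenvalue test confirms it for this symmetric A
print(np.all(np.linalg.eigvalsh(A) > 0))
```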
1/15
Ch. 1 Introduction to Estimation
2/15
An Example Estimation Problem: DSB Rx
Transmitted DSB signal (message m(t) with spectrum M(f), carrier at f_o):

s(t;\ f_o,\phi_o) = m(t)\cos(2\pi f_o t + \phi_o)

Received signal (the electronics adds noise w(t), usually white):

x(t) = s(t;\ f_o,\phi_o) + w(t)

[Block diagram: x(t) → BPF & Amp → mixer (multiply by 2\cos(2\pi\hat{f}_o t + \hat{\phi}_o) from a local oscillator) → Audio Amp → recovered \hat{M}(f). An "Est. Algo." block processes x(t) to produce \hat{f}_o & \hat{\phi}_o, which drive the oscillator.]
Goal: Given x(t), find estimates \hat{f}_o and \hat{\phi}_o (that are optimal in some sense).
Describe with a Probability Model: PDF & Correlation.
3/15
Discrete-Time Estimation Problem
These days, we almost always work with samples of the observed signal (signal plus noise):

x[n] = s[n;\ f_o,\phi_o] + w[n]

Our Thought Model: Each time you observe x[n] it contains the same s[n] but a different realization of the noise w[n], so the estimate is different each time ⇒ \hat{f}_o & \hat{\phi}_o are RVs.
Our Job: Given the finite data set x[0], x[1], ..., x[N-1], find estimator functions that map the data into estimates:

\hat{f}_o = g_1(x[0], x[1], \ldots, x[N-1]) = g_1(\mathbf{x})
\qquad
\hat{\phi}_o = g_2(x[0], x[1], \ldots, x[N-1]) = g_2(\mathbf{x})

These are RVs ⇒ need to describe them w/ a probability model.
4/15
PDF of Estimate
Because estimates are RVs we describe them with a PDF, p(\hat{f}_o). It will depend on:
1. the structure of s[n]
2. the probability model of w[n]
3. the form of the est. function g(x)
[Figure: PDF of \hat{f}_o near the true f_o; the mean measures the centroid, the std. dev. & variance measure the spread.]
Desire:

E\{\hat{f}_o\} = f_o
\qquad
\sigma_{\hat{f}_o}^2 = E\{(\hat{f}_o - E\{\hat{f}_o\})^2\}\ \ \text{small}
5/15
1.2 Mathematical Estimation Problem
General Mathematical Statement of the Estimation Problem:
For... Measured Data:  x = [x[0]  x[1]  ...  x[N-1]]^T
      Unknown Parameter:  θ = [θ_1  θ_2  ...  θ_p]^T
θ is Not Random; x is an N-dimensional random data vector.
Q: What captures all the statistical information needed for an estimation problem?
A: We need the N-dimensional PDF of the data, parameterized by θ:   p(x; θ)
In practice, we are not given the PDF!!! We choose a suitable model that:
- captures the essence of reality
- leads to a tractable answer
We'll use p(x; θ) to find  \hat{\theta} = g(\mathbf{x}).
Ex. Estimating a DC Level in Zero Mean AWGN
] 0 [ ] 0 [ w x + =
Consider a single data point is observed
Gaussian
zero mean
variance
2
~ N(,
2
)
So the needed parameterized PDF is:
p(x[0]; ) which is Gaussian with mean of
Soin this case the parameterization changes the data PDF mean:

1
p(x[0];
1
)
x[0]

2
p(x[0];
2
)
x[0]

3
p(x[0];
3
)
x[0]
7/15
Ex. Modeling Data with Linear Trend
See Fig. 1.6 in the text.
Looking at the figure we see what looks like a linear trend perturbed by some noise.
So the engineer proposes signal and noise models:

x[n] = \underbrace{A + Bn}_{s[n;\,A,B]} + w[n]

Signal Model: Linear Trend.   Noise Model: AWGN w/ zero mean.
AWGN = Additive White Gaussian Noise
White = x[n] and x[m] are uncorrelated for n ≠ m:   E\{(\mathbf{w}-\bar{\mathbf{w}})(\mathbf{w}-\bar{\mathbf{w}})^T\} = \sigma^2\mathbf{I}
8/15
Typical Assumptions for Noise Model
"White and Gaussian" is always the easiest to analyze; it is usually assumed unless you have reason to believe otherwise.
- Whiteness is usually the first assumption removed.
- Gaussian is less often removed, due to the validity of the Central Limit Thm.
Zero Mean is a nearly universal assumption; most practical cases have zero mean. But if not:

w[n] = \mu + w_{zm}[n]      (non-zero mean = mean μ + zero-mean part; now group μ into the signal model)

The variance of the noise doesn't always have to be known to make an estimate...
BUT it must be known to assess the expected goodness of the estimate.
Usually we perform the goodness analysis as a function of the noise variance (or SNR = Signal-to-Noise Ratio); the noise variance sets the SNR level of the problem.
9/15
Classical vs. Bayesian Estimation Approaches
If we view θ (the parameter to estimate) as Non-Random → Classical Estimation
  Provides no way to include a priori information about θ
If we view θ (the parameter to estimate) as Random → Bayesian Estimation
  Allows use of some a priori PDF on θ
The first part of the course: Classical Methods
  Minimum Variance, Maximum Likelihood, Least Squares
Last part of the course: Bayesian Methods
  MMSE, MAP, Wiener filter, Kalman Filter
10/15
1.3 Assessing Estimator Performance
Can only do this when the value of θ is known: theoretical analysis, simulations, field tests, etc.
Recall that the estimate \hat{\theta} = g(\mathbf{x}) is a random variable.
Thus it has a PDF of its own, p(\hat{\theta}), and that PDF completely displays the quality of the estimate.
[Illustrate with the 1-D parameter case: PDF of \hat{\theta} centered near θ.]
Often we just capture the quality through the mean and variance of \hat{\theta} = g(\mathbf{x}).
Desire:

E\{\hat{\theta}\} = \theta      (if this is true, we say the estimate is unbiased)
\sigma_{\hat{\theta}}^2 = E\{(\hat{\theta} - E\{\hat{\theta}\})^2\}\ \ \text{small}
11/15
Equivalent View of Assessing Performance
Define the estimation error:   e = \hat{\theta} - \theta      (\hat{\theta} is an RV, θ is not, so e is an RV)
Completely describe estimator quality with the error PDF: p(e).
Desire:

m_e = E\{e\} = 0      (if this is true, we say the estimate is unbiased)
\sigma_e^2 = E\{(e - E\{e\})^2\} = E\{e^2\}\ \ \text{small}
12/15
Example: DC Level in AWGN
Model:   x[n] = A + w[n],   n = 0, 1, ..., N-1
w[n]: Gaussian, zero mean, variance σ², and white (uncorrelated sample-to-sample).
PDF of an individual data sample:

p(x[i]) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x[i]-A)^2}{2\sigma^2}\right)

Uncorrelated Gaussian RVs are independent, so the joint PDF is the product of the individual PDFs:

p(\mathbf{x}) = \prod_{n=0}^{N-1}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x[n]-A)^2}{2\sigma^2}\right)
= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right)

(property: a product of exponentials gives a sum inside the exponential)
13/15
Each data sample has the same mean (A), which is the thing we are trying to estimate... so we can imagine trying to estimate A by finding the sample mean of the data (a statistic motivated by probability theory):

\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]

Let's analyze the quality of this estimator.
Is it unbiased?

E\{\hat{A}\} = E\!\left\{\frac{1}{N}\sum_{n=0}^{N-1}x[n]\right\} = \frac{1}{N}\sum_{n=0}^{N-1}\underbrace{E\{x[n]\}}_{=A} = A        Yes! Unbiased!

Can we get a small variance?

\mathrm{var}(\hat{A}) = \mathrm{var}\!\left(\frac{1}{N}\sum_{n=0}^{N-1}x[n]\right) = \frac{1}{N^2}\sum_{n=0}^{N-1}\mathrm{var}(x[n]) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}

(the variance of the sum equals the sum of the variances due to independence: white & Gaussian ⇒ independent)
Can make the variance small by increasing N!!!
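A Monte Carlo sketch of the sample-mean analysis above (the values A = 5, σ = 2, N = 50 and the trial count are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
A_true, sigma, N, trials = 5.0, 2.0, 50, 20_000   # assumed example values

# Each row is one realization of x[n] = A + w[n], n = 0..N-1
x = A_true + sigma * rng.normal(size=(trials, N))
A_hat = x.mean(axis=1)                 # sample-mean estimate from each data record

print(A_hat.mean())                    # ~A_true      -> unbiased
print(A_hat.var(), sigma**2 / N)       # ~sigma^2/N   -> matches the theory
```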
14/15
Theoretical Analysis vs. Simulations
Ideally we'd like to always be able to theoretically analyze the problem to find the bias and variance of the estimator.
  Theoretical results show how performance depends on the problem specifications.
But sometimes we make use of simulations:
  to verify that our theoretical analysis is correct
  sometimes we can't find theoretical results
15/15
Course Goal = Find Optimal Estimators
There are several different definitions or criteria for optimality!
Most logical: Minimum MSE (Mean-Square-Error); see Sect. 2.4.

mse(\hat{\theta}) = E\{(\hat{\theta}-\theta)^2\} = \mathrm{var}\{\hat{\theta}\} + b^2(\theta),
\qquad b(\theta) = E\{\hat{\theta}\} - \theta\ \ (\text{bias})

To see this result:

mse(\hat{\theta}) = E\{[(\hat{\theta}-E\{\hat{\theta}\}) + (E\{\hat{\theta}\}-\theta)]^2\}
= \mathrm{var}\{\hat{\theta}\} + 2\,b(\theta)\underbrace{E\{\hat{\theta}-E\{\hat{\theta}\}\}}_{=0} + b^2(\theta)
= \mathrm{var}\{\hat{\theta}\} + b^2(\theta)

Although MSE makes sense, minimum-MSE estimates usually depend on the unknown θ itself (through the bias term), so they are generally not realizable.
Chapter 2
Minimum Variance
Unbiased Estimators
MVU
Ch. 2: Minimum Variance Unbiased Est.
Basic Idea of MVU: Out of all unbiased estimates,
find the one with the lowest variance
(This avoids the realizability problem of MSE)
2.3 Unbiased Estimators
An estimator is unbiased if

E\{\hat{\theta}\} = \theta\quad\text{for all }\theta
Example: Estimate DC in White Uniform Noise

x[n] = A + w[n],\qquad n = 0, 1, \ldots, N-1

Unbiased Estimator (sample mean, same as before):

\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]
\qquad\Rightarrow\qquad E\{\hat{A}\} = A\ \text{regardless of the value of }A

Biased Estimator:

\check{A} = \frac{1}{N}\sum_{n=0}^{N-1}\tilde{x}[n]      (built from nonlinearly modified samples \tilde{x}[n])

Note: if A ≥ 1 then \tilde{x}[n] = x[n] and E\{\check{A}\} = A; but if 0 < A < 1 then E\{\check{A}\} ≠ A.
So the bias = 0 if A ≥ 1 and ≠ 0 if 0 < A < 1  ⇒  a Biased Est. (the bias depends on the unknown A).
MVUE = Minimum Variance Unbiased Estimator
Recall:   mse(\hat{\theta}) = \mathrm{var}(\hat{\theta}) + b^2(\theta)      (Note: b(θ) = 0 for MVU)
So, MVU could also be called "Minimum MSE Unbiased Est." (recall the realizability problem with the MMSE criterion).
2.4 Minimum Variance Criterion
Constrain the bias to be zero → find the estimator that minimizes the variance.
2.5 Existence of MVU Estimator
Sometimes there is no MVUE can happen 2 ways:
1. There may be no unbiased estimators
2. None of the above unbiased estimators has a
uniformly minimum variance
Ex. of #2:
Assume there are only 3 unbiased estimators for a problem, \hat{\theta}_i = g_i(\mathbf{x}), i = 1, 2, 3. Two possible cases:
[Figure: plots of var\{\hat{\theta}_i\} vs. θ. In the first case one estimator has the lowest variance for every θ → it is the MVU. In the second case the estimator with the lowest variance changes with θ → no uniformly-minimum-variance estimator exists.]
Even if an MVU exists: we may not be able to find it!!
2.6 Finding the MVU Estimator
There is no known "turn the crank" method.
Three Approaches to Finding the MVUE:
1. Determine the Cramer-Rao Lower Bound (CRLB) and see if some estimator satisfies it (Ch. 3 & 4)
   (Note: the MVU can exist but not achieve the CRLB)
2. Apply the Rao-Blackwell-Lehmann-Scheffe Theorem
   Rare in practice... we'll skip Ch. 5
3. Restrict to Linear Unbiased & find the MVLU (Ch. 6)
   Only gives the true MVU if the problem is linear
2.7 Vector Parameter
When we wish to estimate multiple parameters we group them into a vector:

\boldsymbol{\theta} = [\theta_1\ \theta_2\ \cdots\ \theta_p]^T
\qquad
\hat{\boldsymbol{\theta}} = [\hat{\theta}_1\ \hat{\theta}_2\ \cdots\ \hat{\theta}_p]^T\ \ (\text{the estimator})

The unbiased requirement becomes:   E\{\hat{\boldsymbol{\theta}}\} = \boldsymbol{\theta}
The minimum-variance requirement becomes: for each i,   \mathrm{var}\{\hat{\theta}_i\} = minimum over all unbiased estimates
Chapter 3
Cramer-Rao Lower Bound
Abbreviated: CRLB or sometimes just CRB
The CRLB is a lower bound on the variance of any unbiased estimator:

If \hat{\theta} is an unbiased estimator of θ, then   \sigma_{\hat{\theta}}^2 = \mathrm{var}(\hat{\theta}) \ge CRLB(\theta)

The CRLB tells us the best we can ever expect to be able to do (w/ an unbiased estimator).
What is the Cramer-Rao Lower Bound
1. Feasibility studies (e.g., sensor usefulness, etc.)
   Can we meet our specifications?
2. Judgment of proposed estimators
   Estimators that don't achieve the CRLB are looked down upon in the technical literature
3. Can sometimes provide the form of the MVU est.
4. Demonstrates the importance of physical and/or signal parameters to the estimation problem
   e.g., we'll see that a signal's BW determines delay est. accuracy → radars should use wide-BW signals
Some Uses of the CRLB
3.3 Est. Accuracy Consideration
Q: What determines how well you can estimate θ?
Recall: the data vector x comprises samples from a random process that depends on θ... the PDF describes that dependence: p(x; θ).
Clearly, if p(x; θ) depends strongly/weakly on θ we should be able to estimate θ well/poorly.
See surface plots vs. x & θ for 2 cases:
1. Strong dependence on θ
2. Weak dependence on θ
We should look at p(x; θ) as a function of θ for a fixed value of the observed data x.
Surface Plot Examples of p(x; θ)
Ex. 3.1: PDF Dependence for DC Level in Noise:   x[0] = A + w[0],   w[0] ~ N(0, σ²)
Then the parameter-dependent PDF of the data point x[0] is:

p(x[0]; A) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x[0]-A)^2}{2\sigma^2}\right)

Say we observe x[0] = 3... so slice the surface at x[0] = 3 to get p(x[0]=3; A) as a function of A.

Define: Likelihood Function (LF)
The LF = the PDF p(x; θ)... but viewed as a function of the parameter θ w/ the data vector x fixed.
We will also often need the Log-Likelihood Function (LLF):   LLF = ln{LF} = ln p(x; θ)
LF Characteristics that Affect Accuracy
Intuitively: the "sharpness" of the LF sets accuracy... but how???
Sharpness is measured using curvature:

-\left.\frac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2}\right|_{\substack{\text{given data}\\ \theta=\text{true value}}}

Curvature ↑  ⇒  PDF concentration ↑  ⇒  Accuracy ↑
But this is for a particular set of data... we want it in general.
So... average over the random data vector to give the average curvature ("expected sharpness" of the LF):

-E\left\{\left.\frac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2}\right|_{\theta=\text{true value}}\right\}        (E{·} is w.r.t. p(x; θ))
Theorem 3.1 CRLB for Scalar Parameter
Assume the "regularity" condition is met:

E\left\{\frac{\partial\ln p(\mathbf{x};\theta)}{\partial\theta}\right\} = 0\ \text{for all }\theta        (E{·} is w.r.t. p(x; θ))

Then

\mathrm{var}(\hat{\theta}) \ge \left.\frac{1}{-E\left\{\dfrac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2}\right\}}\right|_{\theta=\text{true value}}

3.4 Cramer-Rao Lower Bound
The right-hand side is the CRLB, where

E\left\{\frac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2}\right\} = \int\frac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2}\,p(\mathbf{x};\theta)\,d\mathbf{x}

Steps to Find the CRLB:
1. Write the log-likelihood function as a function of θ:  ln p(x; θ)
2. Fix x and take the 2nd partial of the LLF:  ∂² ln p(x; θ)/∂θ²
3. If the result still depends on x: fix θ and take the expected value w.r.t. x. Otherwise skip this step.
4. The result may still depend on θ: evaluate at each specific value of θ desired.
5. Negate and form the reciprocal.
Example 3.3 CRLB for DC in AWGN
x[n] = A + w[n],   n = 0, 1, ..., N-1,   w[n] ~ N(0, σ²) & white
Need the likelihood function (product form due to whiteness; then use the property of exponentials):

p(\mathbf{x};A) = \prod_{n=0}^{N-1}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x[n]-A)^2}{2\sigma^2}\right)
= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right)

Now take ln to get the LLF:

\ln p(\mathbf{x};A) = \underbrace{-\tfrac{N}{2}\ln(2\pi\sigma^2)}_{\text{doesn't depend on }A} - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2

Now take the first partial w.r.t. A:

\frac{\partial\ln p(\mathbf{x};A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n]-A) = \frac{N}{\sigma^2}(\bar{x}-A)      (!)   (\bar{x} = sample mean)

Now take the partial again:

\frac{\partial^2\ln p(\mathbf{x};A)}{\partial A^2} = -\frac{N}{\sigma^2}      (doesn't depend on x, so we don't need to take E{·})

Since the result doesn't depend on x or A, all we do is negate and form the reciprocal to get the CRLB:

\mathrm{var}(\hat{A}) \ge CRLB = \frac{1}{-E\left\{\dfrac{\partial^2\ln p(\mathbf{x};A)}{\partial A^2}\right\}} = \frac{\sigma^2}{N}

- Doesn't depend on A
- Increases linearly with σ² (for fixed N)
- Decreases inversely with N (for fixed σ²): doubling the data halves the CRLB!
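A numerical sketch of the "steps to find the CRLB" for this example (the values A = 5, σ = 2, N = 50 are arbitrary assumptions): the curvature of the LLF is computed by finite differences and matches −N/σ², so the CRLB comes out as σ²/N:

```python
import numpy as np

rng = np.random.default_rng(8)
A_true, sigma, N = 5.0, 2.0, 50          # assumed example values
x = A_true + sigma * rng.normal(size=N)

def llf(A):
    """Log-likelihood ln p(x; A) for the DC-in-AWGN model."""
    return -N/2*np.log(2*np.pi*sigma**2) - np.sum((x - A)**2) / (2*sigma**2)

# Numerical 2nd derivative of the LLF at the true A (finite differences)
h = 1e-3
curv = (llf(A_true + h) - 2*llf(A_true) + llf(A_true - h)) / h**2

print(-1/curv, sigma**2 / N)             # both ~ sigma^2/N: the CRLB
```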
Continuation of Theorem 3.1 on CRLB
There exists an unbiased estimator that attains the CRLB iff:

\frac{\partial\ln p(\mathbf{x};\theta)}{\partial\theta} = I(\theta)\,[g(\mathbf{x})-\theta]      (!)

for some functions I(θ) and g(x).
Furthermore, the estimator that achieves the CRLB is then given by \hat{\theta} = g(\mathbf{x}), with

\mathrm{var}\{\hat{\theta}\} = \frac{1}{I(\theta)} = CRLB

Since no unbiased estimator can do better... this is the MVU estimate!!
This gives a possible way to find the MVU:
- Compute ∂ ln p(x; θ)/∂θ (we need to anyway)
- Check to see if it can be put in the form (!)
- If so... then g(x) is the MVU estimator
Revisit Example 3.3 to Find MVU Estimate


For the DC Level in AWGN we found in (!) that:

\frac{\partial\ln p(\mathbf{x};A)}{\partial A} = \frac{N}{\sigma^2}(\bar{x}-A)

This has the form I(A)[g(x) − A] with

g(\mathbf{x}) = \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1}x[n],
\qquad I(A) = \frac{N}{\sigma^2}
\quad\Rightarrow\quad \mathrm{var}\{\hat{A}\} = \frac{\sigma^2}{N} = CRLB

So... for the DC Level in AWGN: the sample mean is the MVUE!!
Definition: Efficient Estimator
An estimator that is (i) unbiased and (ii) attains the CRLB is said to be an Efficient Estimator.
Notes:
- Not all estimators are efficient (see next example: Phase Est.)
- Not even all MVU estimators are efficient
So... there are times when our "1st partial test" won't work!!!!
Example 3.4: CRLB for Phase Estimation
This is related to the DSB carrier estimation problem we used for motivation in the notes for Ch. 1.
Except here... we have a pure sinusoid and we wish to estimate only its phase.
Signal Model:

x[n] = \underbrace{A\cos(2\pi f_o n + \phi)}_{s[n;\phi]} + w[n]      (w[n]: AWGN w/ zero mean & variance σ²)

Assumptions:
1. 0 < f_o < 1/2   (f_o is in cycles/sample)
2. A and f_o are known (we'll remove this assumption later)
Signal-to-Noise Ratio:  Signal Power = A²/2,  Noise Power = σ²   ⇒   SNR = \dfrac{A^2}{2\sigma^2}
Problem: Find the CRLB for estimating the phase φ.
We need the PDF (exploit whiteness and the exponential form):

p(\mathbf{x};\phi) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\bigl(x[n]-A\cos(2\pi f_o n + \phi)\bigr)^2\right)

Now taking the log gets rid of the exponential; then taking the partial derivative gives (see book for details):

\frac{\partial\ln p(\mathbf{x};\phi)}{\partial\phi} = -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\sin(2\pi f_o n + \phi) - \frac{A}{2}\sin(4\pi f_o n + 2\phi)\right]

Taking the partial derivative again:

\frac{\partial^2\ln p(\mathbf{x};\phi)}{\partial\phi^2} = -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\cos(2\pi f_o n + \phi) - A\cos(4\pi f_o n + 2\phi)\right]

This still depends on the random vector x... so we need E{·}.
Taking the expected value:

-E\left\{\frac{\partial^2\ln p(\mathbf{x};\phi)}{\partial\phi^2}\right\} = \frac{A}{\sigma^2}\sum_{n=0}^{N-1}\Bigl[E\{x[n]\}\cos(2\pi f_o n + \phi) - A\cos(4\pi f_o n + 2\phi)\Bigr]

Use E{x[n]} = A cos(2π f_o n + φ)... plug that in, get a cos² term, use a trig identity, and get:

-E\left\{\frac{\partial^2\ln p(\mathbf{x};\phi)}{\partial\phi^2}\right\} = \frac{A^2}{2\sigma^2}\sum_{n=0}^{N-1}\bigl[1-\cos(4\pi f_o n + 2\phi)\bigr] \approx \frac{NA^2}{2\sigma^2} = N\cdot SNR

(the sum of the cosine terms is << N if f_o is not near 0 or 1/2)
Now... invert to get the CRLB:

\mathrm{var}\{\hat{\phi}\} \ge \frac{1}{N\cdot SNR}

- For fixed SNR (non-dB): doubling the data N halves the CRLB!
- For fixed N: doubling the SNR (non-dB) halves the CRLB! (halve the CRLB for every 3 dB in SNR)
Does an efficient estimator exist for this problem? The CRLB theorem says there is one only if

\frac{\partial\ln p(\mathbf{x};\phi)}{\partial\phi} = I(\phi)\,[g(\mathbf{x})-\phi]

Our earlier result was:

\frac{\partial\ln p(\mathbf{x};\phi)}{\partial\phi} = -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\sin(2\pi f_o n + \phi) - \frac{A}{2}\sin(4\pi f_o n + 2\phi)\right]

This cannot be put into the required form ⇒ an Efficient Estimator does NOT exist!!!
We'll see later, though, an estimator for which \mathrm{var}\{\hat{\phi}\} \to CRLB as N → ∞ or as SNR → ∞.
Such an estimator is called an asymptotically efficient estimator.
(We'll see such a phase estimator in Ch. 7 on MLE.)
[Figure: var\{\hat{\phi}\} vs. N approaching the CRB curve as N grows.]
Alternate Form for CRLB

Alternate form:

\mathrm{var}(\hat{\theta}) \ge \frac{1}{E\left\{\left(\dfrac{\partial\ln p(\mathbf{x};\theta)}{\partial\theta}\right)^2\right\}}        (see Appendix 3A for the derivation)

Sometimes it is easier to find the CRLB this way.
This also gives a new viewpoint of the CRLB (from Gardner's paper, IEEE Trans. on Info Theory, July 1979; posted on BB).
Consider the normalized version of this form of the CRLB:

\frac{\mathrm{var}(\hat{\theta})}{\theta^2} \ge \frac{1}{E\left\{\left(\theta\,\dfrac{\partial\ln p(\mathbf{x};\theta)}{\partial\theta}\right)^2\right\}}

We'll derive this in a way that re-interprets the CRLB.
Consider the "incremental sensitivity" of p(x; θ) to changes in θ:
If θ → θ + Δθ, then it causes p(x; θ) → p(x; θ + Δθ). How sensitive is p(x; θ) to that change?

\tilde{S}_p(\theta) = \frac{\%\ \text{change in }p(\mathbf{x};\theta)}{\%\ \text{change in }\theta} = \frac{\Delta p(\mathbf{x};\theta)/p(\mathbf{x};\theta)}{\Delta\theta/\theta}

Now let Δθ → 0:

S_p(\theta) = \lim_{\Delta\theta\to 0}\tilde{S}_p(\theta) = \frac{\theta}{p(\mathbf{x};\theta)}\frac{\partial p(\mathbf{x};\theta)}{\partial\theta} = \theta\,\frac{\partial\ln p(\mathbf{x};\theta)}{\partial\theta}

(Recall from calculus:  \dfrac{\partial\ln f(x)}{\partial x} = \dfrac{1}{f(x)}\dfrac{\partial f(x)}{\partial x})

\Rightarrow\quad \frac{\mathrm{var}(\hat{\theta})}{\theta^2} \ge \frac{1}{E\{S_p^2(\theta)\}}

Interpretation:  Normalized CRLB = Inverse Mean-Square Sensitivity
3
Definition of Fisher Information
The (negated, expected-curvature) denominator in the CRLB is called the Fisher Information I(θ):

I(\theta) = -E\left\{\frac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2}\right\}

It is a measure of the expected "goodness" of the data for the purpose of making an estimate.
It has the needed properties for an "information" measure (as does Shannon Info):
1. I(θ) ≥ 0   (easy to see using the alternate form of the CRLB)
2. I(θ) is additive for independent observations; this follows from:

\ln p(\mathbf{x};\theta) = \ln\prod_n p(x[n];\theta) = \sum_n\ln p(x[n];\theta)

If each per-sample information I_n(θ) is the same:  I(θ) = N·I_n(θ)
4
3.5 CRLB for Signals in AWGN
When our data is "signal + AWGN" we get a simple form for the CRLB.
Signal Model:   x[n] = s[n; θ] + w[n],   n = 0, 1, ..., N-1    (w[n] white, Gaussian, zero mean)
Q: What is the CRLB?
First write the likelihood function:

p(\mathbf{x};\theta) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\bigl(x[n]-s[n;\theta]\bigr)^2\right)

Differentiate the log LF twice to get:

\frac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left[\bigl(x[n]-s[n;\theta]\bigr)\frac{\partial^2 s[n;\theta]}{\partial\theta^2} - \left(\frac{\partial s[n;\theta]}{\partial\theta}\right)^2\right]

This depends on the random x[n], so take E{·}; using E{x[n] − s[n; θ]} = 0:

E\left\{\frac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2}\right\} = -\frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left(\frac{\partial s[n;\theta]}{\partial\theta}\right)^2

Then using this we get the CRLB for a signal in AWGN:

\mathrm{var}(\hat{\theta}) \ge \frac{\sigma^2}{\displaystyle\sum_{n=0}^{N-1}\left(\frac{\partial s[n;\theta]}{\partial\theta}\right)^2}

Note: ∂s[n; θ]/∂θ tells how sensitive the signal is to the parameter.
If the signal is very sensitive to a parameter change then the CRLB is small → we can get a very accurate estimate!
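A small sketch that evaluates this "signal in AWGN" CRLB numerically; the phase-of-a-sinusoid case from the previous example is used as the test signal, and the values N = 100, A = 1, f_o = 0.12, φ = 0.3, σ² = 0.5 are arbitrary assumptions:

```python
import numpy as np

def crlb_signal_in_awgn(dsdtheta, sigma2):
    """CRLB = sigma^2 / sum_n (ds[n;theta]/dtheta)^2 for a signal in AWGN."""
    return sigma2 / np.sum(dsdtheta**2)

# Example: phase of A*cos(2*pi*fo*n + phi); ds/dphi = -A*sin(2*pi*fo*n + phi)
N, A, fo, phi, sigma2 = 100, 1.0, 0.12, 0.3, 0.5    # assumed example values
n = np.arange(N)
dsdphi = -A * np.sin(2*np.pi*fo*n + phi)

print(crlb_signal_in_awgn(dsdphi, sigma2))
print(1 / (N * (A**2 / (2*sigma2))))                # ~ 1/(N*SNR), the earlier approximation
```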
6
Ex. 3.5: CRLB of Frequency of Sinusoid
Signal Model:

x[n] = A\cos(2\pi f_o n + \phi) + w[n],\qquad n = 0, 1, \ldots, N-1,\qquad 0 < f_o < 1/2

Applying the "signal in AWGN" CRLB result:

\mathrm{var}(\hat{f}_o) \ge \frac{\sigma^2}{A^2\displaystyle\sum_{n=0}^{N-1}\bigl[2\pi n\,\sin(2\pi f_o n + \phi)\bigr]^2}

[Figure: CRLB (bound on variance, in (cycles/sample)²) and its square root (bound on std. dev., in cycles/sample) plotted vs. f_o over (0, 0.5); note there is an error in the book's version of this plot.]
The signal is less sensitive to f_o when f_o is near 0 or 1/2 → the CRLB blows up there.
7
3.6 Transformation of Parameters
Say there is a parameter θ with known CRLB_θ.
But imagine that we instead are interested in estimating some other parameter that is a function of θ:  α = g(θ).
Q: What is CRLB_α?

CRLB_\alpha = \left(\frac{\partial g(\theta)}{\partial\theta}\right)^2 CRLB_\theta
\qquad\Rightarrow\qquad
\mathrm{var}(\hat{\alpha}) \ge CRLB_\alpha        (proved in Appendix 3B)

The derivative ∂g/∂θ captures the sensitivity of α to θ:
a large ∂g/∂θ means a small error in θ gives a larger error in α → it increases the CRLB (i.e., worsens accuracy).
8
Example: Speed of Vehicle From Elapsed Time
Setup: a vehicle passes two laser sensors ("start" and "stop") separated by a known distance D; we measure the elapsed time T.
The possible accuracy is set by CRLB_T.
But we really want to measure the speed V = D/T. Find CRLB_V using the transformation result with V = g(T) = D/T:

CRLB_V = \left(\frac{\partial(D/T)}{\partial T}\right)^2 CRLB_T = \left(\frac{-D}{T^2}\right)^2 CRLB_T = \frac{D^2}{T^4}\,CRLB_T = \frac{V^4}{D^2}\,CRLB_T

\mathrm{var}(\hat{V}) \ge \frac{V^4}{D^2}\,CRLB_T\ \ (\text{m/s})^2        (Accuracy Bound)

- Less accurate at high speeds (the std.-dev. bound grows quadratically with V)
- More accurate over large distances
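A tiny numeric sketch of this transformation-of-parameters result (all numbers are made-up assumptions), showing that the two algebraic forms of the bound agree:

```python
# Transformation-of-parameters sketch for the speed example (values are assumptions)
D = 100.0            # known sensor separation (m)
T = 4.0              # elapsed time (s)  -> speed V = D/T = 25 m/s
crlb_T = 1e-4        # assumed CRLB on the time estimate (s^2)

V = D / T
dg_dT = -D / T**2                      # derivative of g(T) = D/T
crlb_V = dg_dT**2 * crlb_T             # CRLB_V = (dg/dT)^2 * CRLB_T

print(crlb_V, (V**4 / D**2) * crlb_T)  # identical: both forms of the bound
```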
9
Effect of Transformation on Efficiency
Suppose you have an efficient estimator \hat{\theta} of θ, but you are really interested in estimating α = g(θ), and you plan to use \hat{\alpha} = g(\hat{\theta}).
Q: Is this an efficient estimator of α???
A: Theorem: If g(θ) has the form g(θ) = aθ + b (an affine transform), then \hat{\alpha} = g(\hat{\theta}) = a\hat{\theta} + b is efficient.
Proof:
First:

\mathrm{var}(\hat{\alpha}) = \mathrm{var}(a\hat{\theta}+b) = a^2\,\mathrm{var}(\hat{\theta}) = a^2\,CRLB_\theta        (last step because \hat{\theta} is efficient)

Now, what is CRLB_α? Using the transformation result:

CRLB_\alpha = \left(\frac{\partial(a\theta+b)}{\partial\theta}\right)^2 CRLB_\theta = a^2\,CRLB_\theta

⇒  \mathrm{var}(\hat{\alpha}) = CRLB_\alpha   ⇒  Efficient!
10
Asymptotic Efficiency Under Transformation
If the mapping α = g(θ) is not affine... this result does NOT hold.
But if the number of data samples used is large, then the estimator \hat{\alpha} = g(\hat{\theta}) is approximately efficient ("asymptotically efficient").
[Figure: the PDF of \hat{\theta} pushed through the nonlinear mapping g.
Small-N case: the PDF is widely spread over the nonlinear mapping.
Large-N case: the PDF is concentrated onto a linearized section of the mapping.]
1
3.7 CRLB for Vector Parameter Case
Vector Parameter:  \boldsymbol{\theta} = [\theta_1\ \theta_2\ \cdots\ \theta_p]^T.   Its Estimate:  \hat{\boldsymbol{\theta}} = [\hat{\theta}_1\ \hat{\theta}_2\ \cdots\ \hat{\theta}_p]^T
Assume that the estimate is unbiased:  E\{\hat{\boldsymbol{\theta}}\} = \boldsymbol{\theta}
For a scalar parameter we looked at its variance... but for a vector parameter we look at its covariance matrix:

\mathbf{C}_{\hat{\boldsymbol{\theta}}} = \mathrm{var}\{\hat{\boldsymbol{\theta}}\} = E\{(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})^T\}

For example, for θ = [x y z]^T:

\mathbf{C}_{\hat{\boldsymbol{\theta}}} = \begin{bmatrix} \mathrm{var}(\hat{x}) & \mathrm{cov}(\hat{x},\hat{y}) & \mathrm{cov}(\hat{x},\hat{z}) \\ \mathrm{cov}(\hat{y},\hat{x}) & \mathrm{var}(\hat{y}) & \mathrm{cov}(\hat{y},\hat{z}) \\ \mathrm{cov}(\hat{z},\hat{x}) & \mathrm{cov}(\hat{z},\hat{y}) & \mathrm{var}(\hat{z}) \end{bmatrix}
2
Fisher Information Matrix
For the vector parameter case, the Fisher Info becomes the Fisher Info Matrix (FIM) I(θ), whose mn-th element is given by:

[\mathbf{I}(\boldsymbol{\theta})]_{mn} = -E\left\{\frac{\partial^2\ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\theta_m\,\partial\theta_n}\right\},\qquad m, n = 1, 2, \ldots, p      (evaluated at the true value of θ)
3
The CRLB Matrix
Then, under the same kind of regularity conditions, the CRLB matrix is the inverse of the FIM:

CRLB = \mathbf{I}^{-1}(\boldsymbol{\theta})

So what this means is:

\mathrm{var}(\hat{\theta}_n) = [\mathbf{C}_{\hat{\boldsymbol{\theta}}}]_{nn} \ge [\mathbf{I}^{-1}(\boldsymbol{\theta})]_{nn}      (!)

Diagonal elements of the inverse FIM bound the parameter variances, which are the diagonal elements of the parameter covariance matrix:

\begin{bmatrix} \mathrm{var}(\hat{x}) & \mathrm{cov}(\hat{x},\hat{y}) & \mathrm{cov}(\hat{x},\hat{z}) \\ \mathrm{cov}(\hat{y},\hat{x}) & \mathrm{var}(\hat{y}) & \mathrm{cov}(\hat{y},\hat{z}) \\ \mathrm{cov}(\hat{z},\hat{x}) & \mathrm{cov}(\hat{z},\hat{y}) & \mathrm{var}(\hat{z}) \end{bmatrix}
\quad\text{vs.}\quad
\mathbf{I}^{-1}(\boldsymbol{\theta}) = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}
4
More General Form of The CRLB Matrix

\mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{I}^{-1}(\boldsymbol{\theta})\ \ \text{is positive semi-definite}

The mathematical notation for this is:   \mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{I}^{-1}(\boldsymbol{\theta}) \ge \mathbf{0}      (!!)

Note: property #5 about p.d. matrices on p. 573 states that (!!) ⇒ (!).
5
CRLB Off-Diagonal Elements Insight
(Not in book)
Let θ = [x_e  y_e]^T represent the 2-D x-y location of a transmitter (emitter) to be estimated.
Consider two cases of scatter plots for the estimated location:
[Figure: two scatter plots of (\hat{x}_e, \hat{y}_e). In one the cloud is aligned with the axes; in the other it is tilted, even though the variance along each axis is the same.]
Each case has the same variances... but the location-accuracy characteristics are very different. This is the effect of the off-diagonal elements of the covariance.
⇒ We should consider the effect of the off-diagonal CRLB elements!!!
6
CRLB Matrix and Error Ellipsoids (Not in book)
Assume \hat{\boldsymbol{\theta}} = [\hat{x}_e\ \hat{y}_e]^T is 2-D Gaussian w/ zero mean (only for convenience) and cov matrix \mathbf{C}_{\hat{\boldsymbol{\theta}}}.
Then its PDF is given by:

p(\hat{\boldsymbol{\theta}}) = \frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{C}_{\hat{\boldsymbol{\theta}}}}}\exp\!\left(-\tfrac{1}{2}\,\hat{\boldsymbol{\theta}}^T\mathbf{C}_{\hat{\boldsymbol{\theta}}}^{-1}\hat{\boldsymbol{\theta}}\right)      (the exponent is a quadratic form... recall: it's scalar valued)

So the equi-height contours of this PDF are given by the values of \hat{\boldsymbol{\theta}} such that:

\hat{\boldsymbol{\theta}}^T\mathbf{A}\,\hat{\boldsymbol{\theta}} = k      (some constant),   where for ease we let  \mathbf{A} \triangleq \mathbf{C}_{\hat{\boldsymbol{\theta}}}^{-1}

Note: A is symmetric, so a_{12} = a_{21}, because any cov. matrix is symmetric and the inverse of a symmetric matrix is symmetric.
7
What does this look like?

a_{11}\hat{x}_e^2 + 2a_{12}\hat{x}_e\hat{y}_e + a_{22}\hat{y}_e^2 = k

An Ellipse!!! (Look it up in your calculus book!!!)
Recall: if a_{12} = 0, then the ellipse is aligned w/ the axes & a_{11} and a_{22} control the size of the ellipse along the axes.
Note: a_{12} = 0 happens when the covariance matrix is diagonal:

\mathbf{C}_{\hat{\boldsymbol{\theta}}} = \begin{bmatrix}\sigma_{\hat{x}_e}^2 & 0 \\ 0 & \sigma_{\hat{y}_e}^2\end{bmatrix}
\ \Rightarrow\
\mathbf{A} = \mathbf{C}_{\hat{\boldsymbol{\theta}}}^{-1} = \begin{bmatrix}1/\sigma_{\hat{x}_e}^2 & 0 \\ 0 & 1/\sigma_{\hat{y}_e}^2\end{bmatrix}
\qquad\Rightarrow\ \hat{x}_e\ \&\ \hat{y}_e\ \text{are uncorrelated}

Note: a_{12} ≠ 0  ⇒  \hat{x}_e & \hat{y}_e are correlated, with

\mathbf{C}_{\hat{\boldsymbol{\theta}}} = \begin{bmatrix}\sigma_{\hat{x}_e}^2 & \sigma_{\hat{x}_e\hat{y}_e} \\ \sigma_{\hat{x}_e\hat{y}_e} & \sigma_{\hat{y}_e}^2\end{bmatrix}
8
Error Ellipsoids and Correlation (Not in book)
[Figure: error ellipses in the (\hat{x}_e, \hat{y}_e) plane. If \hat{x}_e & \hat{y}_e are uncorrelated, the ellipse is aligned with the axes with half-widths ~\sigma_{\hat{x}_e} and ~\sigma_{\hat{y}_e}; if they are correlated, the ellipse is tilted.]
Choosing the k Value: for the 2-D case

k = -2\ln(1 - P_e)

where P_e is the probability that the estimate will lie inside the ellipse. (See the posted paper by Torrieri.)
9
Ellipsoids and Eigen-Structure (Not in book)
Consider a symmetric matrix A & its quadratic form x^T A x.
Ellipsoid:   \mathbf{x}^T\mathbf{A}\mathbf{x} = k
[Figure: ellipse in the (x_1, x_2) plane.]
The principle axes of the ellipse are orthogonal to each other and are orthogonal to the tangent line on the ellipse.
Theorem: The principle axes of the ellipsoid x^T A x = k are eigenvectors of matrix A.
10
Proof: From multi-dimensional calculus, the gradient of a scalar-valued function Φ(x_1, ..., x_n) is orthogonal to its level surface:

\nabla_{\mathbf{x}}\Phi(\mathbf{x}) = \mathrm{grad}\,\Phi(\mathbf{x}) = \left[\frac{\partial\Phi}{\partial x_1}\ \cdots\ \frac{\partial\Phi}{\partial x_n}\right]^T

(See the handout posted on Blackboard on gradients and derivatives.)
For our quadratic-form function Φ(x) = x^T A x = Σ_i Σ_j a_{ij} x_i x_j:

\frac{\partial(\mathbf{x}^T\mathbf{A}\mathbf{x})}{\partial x_k} = \sum_i\sum_j a_{ij}\frac{\partial(x_i x_j)}{\partial x_k}
\qquad\text{with (product rule)}\qquad
\frac{\partial(x_i x_j)}{\partial x_k} = x_i\delta_{jk} + x_j\delta_{ik}

Using this, and then the symmetry a_{ik} = a_{ki}:

\frac{\partial(\mathbf{x}^T\mathbf{A}\mathbf{x})}{\partial x_k} = \sum_j a_{kj}x_j + \sum_i a_{ik}x_i = 2\sum_j a_{kj}x_j
\qquad\Rightarrow\qquad
\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}

Since the gradient is ⊥ to the ellipse, this says Ax ⊥ ellipse at x.
When x is a principle axis, then x and Ax are aligned:   Ax = λx   ⇒   eigenvectors are the principle axes!!!
< End of Proof >
13
Theorem: The length of the principle (semi-)axis associated with eigenvalue λ_i is \sqrt{k/\lambda_i}.
Proof: If x is a principle axis, then Ax = λx. Take the inner product of both sides with x:

\langle\mathbf{A}\mathbf{x},\mathbf{x}\rangle = \lambda\langle\mathbf{x},\mathbf{x}\rangle
\quad\Rightarrow\quad
\underbrace{\mathbf{x}^T\mathbf{A}\mathbf{x}}_{=\,k} = \lambda\|\mathbf{x}\|^2
\quad\Rightarrow\quad
\|\mathbf{x}\| = \sqrt{k/\lambda}

< End of Proof >
Note: This says that if A has a zero eigenvalue, then the error ellipse will have an infinite-length principle axis... NOT GOOD!!
So we'll require that all λ_i > 0  ⇒  the matrix must be positive definite.

14
Application of Eigen-Results to Error Ellipsoids
The error ellipsoid corresponding to the estimator covariance matrix must satisfy:

\hat{\boldsymbol{\theta}}^T\mathbf{C}_{\hat{\boldsymbol{\theta}}}^{-1}\hat{\boldsymbol{\theta}} = k      (note that the error ellipse is formed using the inverse cov)

Thus finding the eigenvectors/values of \mathbf{C}_{\hat{\boldsymbol{\theta}}}^{-1} shows the structure of the error ellipse.
Recall: a positive definite matrix A and its inverse A^{-1} have the same eigenvectors and reciprocal eigenvalues.
Thus, we could instead find the eigenvalues of \mathbf{C}_{\hat{\boldsymbol{\theta}}} = \mathbf{I}^{-1}(\boldsymbol{\theta}) (the inverse FIM!!), and then the principle-axis lengths are set by its eigenvalues, not inverted.
15
Illustrate with the 2-D case:

\hat{\boldsymbol{\theta}}^T\mathbf{C}_{\hat{\boldsymbol{\theta}}}^{-1}\hat{\boldsymbol{\theta}} = k

[Figure: error ellipse with principle-axis directions v_1 & v_2 and semi-axis lengths \sqrt{\lambda_1 k} and \sqrt{\lambda_2 k}, where the v_i, λ_i are the eigenvectors/values of \mathbf{C}_{\hat{\boldsymbol{\theta}}} (not the inverse!).]
The CRLB/FIM Ellipse
We can re-state this in terms of the FIM
Once we find the FIM we can:
Find the inverse FIM
Find its eigenvectors gives the Principle Axes
Find its eigenvalues Prin. Axis lengths are then
Can make an ellipse from the CRLB Matrix
instead of the Cov. Matrix
This ellipse will be the smallest error ellipse that an unbiased estimator
can achieve!
i
k
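A short sketch of extracting the error-ellipse axes from a 2-D estimator covariance (or inverse-FIM) matrix; the covariance values and P_e = 0.9 are arbitrary assumptions:

```python
import numpy as np

# Covariance of a 2-D location estimate (arbitrary but positive definite values)
C = np.array([[4.0, 1.5],
              [1.5, 1.0]])
Pe = 0.9                               # desired probability of falling inside the ellipse
k = -2 * np.log(1 - Pe)                # from the "choosing k" slide

lam, V = np.linalg.eigh(C)             # eigen-structure of C (not its inverse)
axis_lengths = np.sqrt(k * lam)        # principle semi-axis lengths sqrt(k*lambda_i)
print(axis_lengths)                    # long axis = direction of largest error
print(V)                               # columns = principle-axis directions
```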
1
3.8 Vector Transformations
Just like for the scalar case...  α = g(θ).
If you know CRLB_θ you can find CRLB_α:

\mathbf{C}_{\hat{\boldsymbol{\alpha}}} \ \ge\ \underbrace{\frac{\partial\mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}}_{\text{Jacobian matrix (see p. 46)}}\ \underbrace{\mathbf{I}^{-1}(\boldsymbol{\theta})}_{\text{CRLB on }\boldsymbol{\theta}}\ \left[\frac{\partial\mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right]^T
\qquad(\text{the CRLB on }\boldsymbol{\alpha})

Example of Vector Transform: we can usually estimate Range (R) and Bearing (θ) directly, but might really want the emitter location (x_e, y_e):

\begin{bmatrix} x_e \\ y_e \end{bmatrix} = \mathbf{g}(R,\theta) = \begin{bmatrix} R\cos\theta \\ R\sin\theta \end{bmatrix}
\qquad\Rightarrow\qquad
\frac{\partial\mathbf{g}}{\partial(R,\theta)} = \begin{bmatrix} \cos\theta & -R\sin\theta \\ \sin\theta & R\cos\theta \end{bmatrix}

CRLB_{(x_e,y_e)} = \frac{\partial\mathbf{g}}{\partial(R,\theta)}\ CRLB_{(R,\theta)}\ \left[\frac{\partial\mathbf{g}}{\partial(R,\theta)}\right]^T
3.9 CRLB for General Gaussian Case
In Sect. 3.5 we saw the CRLB for signal + AWGN
For that case we saw:
The PDFs parameter-dependence
showed up only in the mean of the PDF
Deterministic Signal w/
Scalar Deterministic Parameter
Now generalize to the case where:
Data is still Gaussian, but
Parameter-Dependence not restricted to Mean
Noise not restricted to White Cov not necessarily diagonal
( ) ) ( ), ( ~ C x N
One way to get this case: signal + AGN
Random Gaussian Signal w/
Vector Deterministic Parameter
Non-White Noise
4
For this case the FIM is given by: (See Appendix 3c)
! ! ! ! ! ! ! " ! ! ! ! ! ! ! # $ ! ! ! ! ! " ! ! ! ! ! # $

=

j i j
T
i
ij
tr

) (
) (
) (
) (
2
1 ) (
) (
) (
)] ( [
1 1 1
C
C
C
C

C

I
Variability of Mean
w.r.t. parameters
Variability of Cov
w.r.t. parameters
This shows the impact of signal model assumptions
deterministic signal + AGN
random Gaussian signal + AGN
Est. Cov. uses average
over only noise
Est. Cov. uses average
over signal & noise
5
Gen. Gauss. Ex.: Time-Difference-of-Arrival
Given:   x_1(t) = s(t - τ) + w_1(t)  at Rx 1,    x_2(t) = s(t) + w_2(t)  at Rx 2.
Sampled data vector:  \mathbf{x} = [\mathbf{x}_1^T\ \mathbf{x}_2^T]^T.     Goal: estimate τ.
How to model the signal?
Case #1: s(t) is a zero-mean, WSS, Gaussian process (passive sonar):
\boldsymbol{\mu}(\tau) = \mathbf{0}  →  no Term #1; all the τ-dependence is in the partitioned covariance

\mathbf{C}(\tau) = \begin{bmatrix}\mathbf{C}_{11} & \mathbf{C}_{12}(\tau) \\ \mathbf{C}_{21}(\tau) & \mathbf{C}_{22}\end{bmatrix},
\qquad \mathbf{C}_{ii} = \mathbf{C}_{ss} + \mathbf{C}_{w_i w_i},
\qquad \mathbf{C}_{ij}(\tau) = \mathbf{C}_{ss}(\tau)\ \ (i \ne j)

Case #2: s(t) is a deterministic signal (radar/comm location):
\mathbf{C}(\boldsymbol{\theta}) = \mathbf{C}  →  no Term #2; the τ-dependence is in the mean

\boldsymbol{\mu}(\tau) = [\,s[0;\tau]\ s[1;\tau]\ \cdots\ s[N-1;\tau]\ \ s[0]\ \cdots\ s[N-1]\,]^T
6
Comments on General Gaussian CRLB
It is interesting to note that for any given problem you may find
each case used in the literature!!!
For example for the TDOA/FDOA estimation problem:
Case #1 used by M. Wax in IEEE Trans. Info Theory, Sept. 1982
Case #2 used by S. Stein in IEEE Trans. Signal Proc., Aug. 1993
See also differences in the books examples
Well skip Section 3.10 and leave it as a reading assignment
1/19
3.11 CRLB Examples
1. Range Estimation
sonar, radar, robotics, emitter location
2. Sinusoidal Parameter Estimation (Amp., Frequency, Phase)
sonar, radar, communication receivers (recall DSB Example), etc.
3. Bearing Estimation
sonar, radar, emitter location
4. Autoregressive Parameter Estimation
speech processing, econometrics
We'll now apply the CRLB theory to several examples of practical signal processing problems.
We'll revisit these examples in Ch. 7, where we'll derive ML estimators that get close to achieving the CRLB.
2/19
C-T Signal Model:
x(t) = s(t − τ_o) + w(t),   0 ≤ t ≤ T,   0 ≤ τ_o ≤ τ_max
Transmit Pulse: s(t) nonzero over t ∈ [0, T_s]
Receive Reflection: s(t − τ_o)
Measure Time Delay: τ_o
w(t): bandlimited white Gaussian noise out of the BPF & amp, with PSD N_o/2 over |f| ≤ B.
[Figure: transmitted pulse s(t) on [0, T_s]; received pulse s(t − τ_o); receiver BPF & amp producing x(t); PSD of w(t).]
Ex. 1 Range Estimation Problem
3/19
Sample every Δ = 1/(2B) sec and let w[n] = w(nΔ): DT white Gaussian noise with variance σ² = B N_o.
x[n] = s(nΔ − τ_o) + w[n],   n = 0, 1, …, N−1
[Figure: PSD of w(t) (level N_o/2 over |f| ≤ B) and its ACF, which has zeros at multiples of 1/(2B), so the samples are uncorrelated.]
Written out over the three sub-intervals:
x[n] = w[n],                   0 ≤ n ≤ n_o − 1
x[n] = s[n − n_o] + w[n],      n_o ≤ n ≤ n_o + M − 1
x[n] = w[n],                   n_o + M ≤ n ≤ N − 1
where s[n; τ_o] has M non-zero samples starting at n_o.
Range Estimation D-T Signal Model
4/19
Now apply the standard CRLB result for signal + WGN:
var(τ̂_o) ≥ σ² / Σ_{n=0}^{N−1} ( ∂s[n; τ_o]/∂τ_o )²
Plug in and keep only the M non-zero terms:
var(τ̂_o) ≥ σ² / Σ_{n=n_o}^{n_o+M−1} ( ∂s(t − τ_o)/∂τ_o |_{t=nΔ} )² = σ² / Σ_{n=n_o}^{n_o+M−1} ( ds(t)/dt |_{t=nΔ−τ_o} )²
Exploit calculus: use the approximation τ_o ≈ n_oΔ, then do a change of variables in the sum.
Range Estimation CRLB
5/19
Assume the sample spacing is small: approximate the sum by an integral (Δ = 1/2B, σ² = B N_o):
var(τ̂_o) ≥ σ² / [ (1/Δ) ∫_0^{T_s} ( ds(t)/dt )² dt ] = 1 / [ (2/N_o) ∫_0^{T_s} ( ds(t)/dt )² dt ]
Define the signal energy E_s = ∫_0^{T_s} s²(t) dt. Using the FT differentiation theorem & Parseval:
∫ ( ds(t)/dt )² dt = ∫ (2πf)² |S(f)|² df,      ∫ s²(t) dt = ∫ |S(f)|² df
so that
var(τ̂_o) ≥ 1 / [ ( E_s/(N_o/2) ) · ( ∫ (2πf)² |S(f)|² df / ∫ |S(f)|² df ) ]
Define a BW measure:
B_rms² = ∫ (2πf)² |S(f)|² df / ∫ |S(f)|² df      (B_rms is the RMS bandwidth)
and note that E_s/(N_o/2) is a type of SNR.
Range Estimation CRLB (cont.)
6/19
Using these ideas we arrive at the CRLB on the delay:
var(τ̂_o) ≥ 1 / ( SNR · B_rms² )    (sec²),    with SNR = E_s/(N_o/2)
To get the CRLB on the range, use the transformation-of-parameters result with R = c τ_o / 2:
var(R̂) ≥ (c²/4) / ( SNR · B_rms² )    (m²)
The CRLB is inversely proportional to an SNR measure and an RMS BW measure. So the CRLB tells us:
Choose a signal with large B_rms
Ensure that the SNR is large (better on nearby/large targets)
Which is better: doubling the transmitted energy, or doubling the RMS bandwidth?
Range Estimation CRLB (cont.)
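To make the bound concrete, here is a small MATLAB sketch that evaluates B_rms and the delay/range CRLB; the Gaussian-shaped pulse spectrum and the SNR value are illustrative assumptions, not from the notes:

% Sketch (MATLAB): delay/range CRLB for an assumed pulse spectrum and SNR.
f    = linspace(-5e6, 5e6, 2001);          % frequency grid (Hz)
S    = exp(-(f/1e6).^2);                   % assumed pulse spectrum magnitude
Brms2 = trapz(f, (2*pi*f).^2 .* abs(S).^2) / trapz(f, abs(S).^2);
SNR  = 100;                                % assumed E_s/(N_o/2), linear
c    = 3e8;
var_tau = 1/(SNR*Brms2);                   % sec^2
var_R   = (c^2/4)*var_tau;                 % m^2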
7/19
Given DT signal samples of a sinusoid in noise, estimate its amplitude, frequency, and phase:
x[n] = A cos(ω_o n + φ) + w[n],   n = 0, 1, …, N−1
w[n]: DT white Gaussian noise, zero mean, variance σ²
ω_o is the DT frequency in rad/sample, with 0 < ω_o < π
Multiple parameters, so the parameter vector is θ = [A  ω_o  φ]^T
Recall the SNR of a sinusoid in noise: SNR = P_s/P_n = (A²/2)/σ² = A²/(2σ²)
Ex. 2 Sinusoid Estimation CRLB Problem
8/19
Approach:
Find Fisher Info Matrix
Invert to get CRLB matrix
Look at diagonal elements to get bounds on parm variances
Recall: the FIM for the general Gaussian case, specialized to a signal in AWGN, is
[I(θ)]_ij = (1/σ²) Σ_{n=0}^{N−1} ( ∂s[n;θ]/∂θ_i )( ∂s[n;θ]/∂θ_j )
Sinusoid Estimation CRLB Approach
9/19
Taking the partial derivatives and using the approximations given in the book (valid when ω_o is not near 0 or π), with θ = [A ω_o φ]^T:
[I(θ)]_11 = (1/σ²) Σ_{n=0}^{N−1} cos²(ω_o n + φ) ≈ N/(2σ²)
[I(θ)]_12 = [I(θ)]_21 ≈ 0      (sums of n·sin·cos terms average out)
[I(θ)]_13 = [I(θ)]_31 ≈ 0      (sums of sin·cos terms average out)
[I(θ)]_22 = (A²/σ²) Σ_{n=0}^{N−1} n² sin²(ω_o n + φ) ≈ (A²/2σ²) Σ n²
[I(θ)]_23 = [I(θ)]_32 = (A²/σ²) Σ_{n=0}^{N−1} n sin²(ω_o n + φ) ≈ (A²/2σ²) Σ n
[I(θ)]_33 = (A²/σ²) Σ_{n=0}^{N−1} sin²(ω_o n + φ) ≈ N A²/(2σ²)
Sinusoid Estimation Fisher Info Elements
10/19
The Fisher Info Matrix then is:
I(θ) = (1/σ²) [  N/2        0               0       ;
                  0    (A²/2) Σ n²     (A²/2) Σ n    ;
                  0    (A²/2) Σ n        N A²/2      ]
Recall SNR = A²/(2σ²) and the closed-form results for these sums.
Sinusoid Estimation Fisher Info Matrix
11/19
Inverting the FIM by hand (using the cofactor & determinant approach, helped by the 0's) gives the CRLB matrix; extracting the diagonal elements gives the three bounds:
var(Â) ≥ 2σ²/N                            (volts²)
var(ω̂_o) ≥ 12 / ( SNR · N(N²−1) )         ((rad/sample)²)
var(φ̂) ≥ 2(2N−1) / ( SNR · N(N+1) )       (rad²)
Amp. accuracy: decreases as 1/N; depends on the noise variance (not SNR).
Freq. accuracy: decreases as 1/N³; decreases as 1/SNR. (To convert to Hz², multiply by (F_s/2π)².)
Phase accuracy: decreases as 1/N; decreases as 1/SNR.
Sinusoid Estimation CRLBs
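A small MATLAB sketch that evaluates these three bounds; the values of N, SNR and σ² are illustrative assumptions:

% Sketch (MATLAB): numerical values of the three sinusoid CRLBs above.
N = 100; SNRdB = 10; SNR = 10^(SNRdB/10); sigma2 = 1;   % assumed values
var_A   = 2*sigma2/N;                  % volts^2
var_w   = 12/(SNR*N*(N^2-1));          % (rad/sample)^2
var_phi = 2*(2*N-1)/(SNR*N*(N+1));     % rad^2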
12/19
The CRLB for frequency estimation referred back to the CT frequency is
var(f̂_o) ≥ 12 F_s² / ( (2π)² SNR · N(N²−1) )    (Hz²)
Does that mean we do worse if we sample faster than Nyquist? NO!!!!! For a fixed duration T of signal, N = T F_s. Also keep in mind that F_s has an effect on the noise structure (not in the book):
[Figure: PSD of w(t) with level N_o/2 over |f| ≤ B, and its ACF with zeros at multiples of 1/(2B), so σ² = B N_o.]
Frequency Estimation CRLBs and Fs
13/19
Uniformly spaced linear array with M sensors:
Sensor spacing of d meters; bearing angle to target β radians. (Figure 3.8 from textbook: simple model; the target emits or reflects the signal s(t).)
s(t) = A cos(2π f_o t + φ)
Propagation time to the n-th sensor:
t_n = t_0 − n (d/c) cos β,   n = 0, 1, …, M−1
Signal at the n-th sensor:
s_n(t) = s(t − t_n) = A cos( 2π f_o ( t − t_0 + n (d/c) cos β ) + φ )
Ex. 3 Bearing Estimation CRLB Problem
14/19
Now instead of sampling each sensor at lots of time instants, we just grab one snapshot of all M sensors at a single instant t_s:
s_n(t_s) = A cos( 2π f_o (t_s − t_0) + 2π (f_o/c) d cos β · n + φ ) = A cos( ω_s n + φ̃ )
a spatial sinusoid with spatial frequency ω_s.
Spatial frequencies: 2π (f_o/c) cos β is in rad/meter; ω_s = 2π (f_o/c) d cos β is in rad/sensor.
For a sinusoidal transmitted signal, bearing estimation reduces to frequency estimation, and we already know its FIM & CRLB!!!
Bearing Estimation Snapshot of Sensor Signals
15/19
Each sample in the snapshot is corrupted by a noise sample, and these M samples make the data vector x = [x[0] x[1] … x[M−1]]^T:
x[n] = s_n(t_s) + w[n] = A cos( ω_s n + φ̃ ) + w[n]
Each w[n] is a noise sample that comes from a different sensor, so model them as uncorrelated Gaussian RVs (same as white temporal noise); assume each sensor has the same noise variance σ².
So the parameters to consider are θ = [A ω_s φ̃]^T, which get transformed to
α = g(θ) = [ A,  arccos( c ω_s / (2π f_o d) ),  φ̃ ]^T
where the bearing β (the middle element) is the parameter of interest!
Bearing Estimation Data and Parameters
16/19
Using the FIM for the sinusoidal parameter problem together with the transformation-of-parameters result (see book p. 59 for details):
var(β̂) ≥ [ 12 / ( (2π)² SNR · M(M²−1) ) ] · [ (M−1) λ / ( L sin β ) ]²      (rad²)
where L = array physical length in meters, M = number of array elements, λ = c/f_o = wavelength in meters (per cycle), and L_r = L/λ = array length in wavelengths.
Bearing accuracy:
Decreases as 1/SNR; decreases roughly as 1/M; decreases as 1/L_r²
Depends on the actual bearing: best at β = π/2 (broadside), impossible at β = 0 (endfire)!
Low-frequency (i.e., long wavelength) signals need very large physical lengths to achieve good accuracy.
Bearing Estimation CRLB Result
17/19
In speech processing (and other areas) we often model the
signal as an AR random process and need to estimate the AR
parameters. An AR process has a PSD given by
P_xx(f; θ) = σ_u² / | 1 + Σ_{m=1}^{p} a[m] e^{−j2πfm} |²
AR Estimation Problem: Given data x[0], x[1], …, x[N−1], estimate the AR parameter vector
θ = [ a[1]  a[2]  …  a[p]  σ_u² ]^T
This is a hard CRLB to find exactly but it has been published.
The difficulty comes from the fact that there is no easy direct
relationship between the parameters and the data.
It is not a signal plus noise problem
Ex. 4 AR Estimation CRLB Problem
18/19
Approach: The asymptotic CRLB result we discussed is perfect here:
An AR process is WSS, as required for the asymptotic result
Gaussian is often a reasonable assumption, as needed for the asymptotic result
The asymptotic result is in terms of partial derivatives of the PSD, and that is exactly the form in which the parameters are clearly displayed!
[I(θ)]_ij ≈ (N/2) ∫_{−1/2}^{1/2} [ ∂ ln P_xx(f;θ)/∂θ_i ][ ∂ ln P_xx(f;θ)/∂θ_j ] df
Recall:
ln P_xx(f;θ) = ln σ_u² − ln | 1 + Σ_{m=1}^{p} a[m] e^{−j2πfm} |²
AR Estimation CRLB Asymptotic Approach
19/19
After taking these derivatives you get results that can be
simplified using properties of FT and convolution.
The final result is:
| |
N
p k
N
k a
u
u
kk
xx
u
4
2
1
2
2
)

var(
, , 2 , 1 ]) [

var(

=

R
Both Decrease
as 1/N
To get a little insight look at 1
st
order AR case (p = 1):
]) 1 [ 1 (
1
]) 1 [

var(
2
a
N
a
Complicated
dependence on
AC Matrix!!
Improves as pole
gets closer to unit
circle PSDs
with sharp
peaks are easier
to estimate
a[1]
Re(z)
Im(z)
AR Estimation CRLB Asymptotic Result
1
CRLB Example:
Single-Rx Emitter Location via Doppler
Each platform intercepts s(t; f_i) over an interval [t_i, t_i + T].
The received signal parameters depend on the emitter location.
Estimate the Rx signal frequencies: f_1, f_2, f_3, …, f_N
Then use the measured frequencies to estimate the location (X, Y, Z, f_o)
2
Problem Background
Radar to be Located: at Unknown Location (X,Y,Z)
Transmits Radar Signal at Unknown Carrier Frequency f
o
Signal is intercepted by airborne receiver
Known (Navigation Data):
Antenna Positions: ( X_p(t), Y_p(t), Z_p(t) )
Antenna Velocities: ( V_x(t), V_y(t), V_z(t) )
Goal: Estimate the parameter vector x = [X Y Z f_o]^T
3
Physics of Problem
[Figure: receiver with velocity v(t); unit vector u(t) from the receiver toward the emitter.]
Relative motion between the emitter and the receiver causes a Doppler shift of the carrier frequency:
f(t; x) = f_o + (f_o/c) v(t)·u(t)
        = f_o + (f_o/c) [ V_x(t)(X − X_p(t)) + V_y(t)(Y − Y_p(t)) + V_z(t)(Z − Z_p(t)) ] / √( (X − X_p(t))² + (Y − Y_p(t))² + (Z − Z_p(t))² )
Because we estimate the frequency, there is an error added:
f̃(t_i, x) = f(t_i, x) + v(t_i)
4
Estimation Problem Statement
Given:
Data Vector:  f̃(x) = [ f̃(t_1, x)  f̃(t_2, x)  …  f̃(t_N, x) ]^T
Navigation Info:  X_p(t_i), Y_p(t_i), Z_p(t_i) and V_x(t_i), V_y(t_i), V_z(t_i) for i = 1, 2, …, N
Estimate:
Parameter Vector:  x = [X Y Z f_o]^T
Right now we only want to consider the CRLB. (Note: f(x) is a vector-valued function of a vector.)
5
The CRLB
Note that this is a signal plus noise scenario:
The signal is the noise-free frequency values
The noise is the error made in measuring frequency
Assume zero-mean Gaussian noise with covariance matrix C:
Can use the General Gaussian Case of the CRLB
Of course validity of this depends on how closely the errors
of the frequency estimator really do follow this
Our data vector is distributed according to:
f̃(x) ~ N( f(x), C )
Only the mean shows dependence on the parameter x! So we only need the first term in the CRLB equation:
[J]_ij = [ ∂f(x)/∂x_i ]^T C^{-1} [ ∂f(x)/∂x_j ]
I use J for the FIM instead of I to avoid confusion with the identity matrix.
6
Convenient Form for FIM
To put this into an easier form to look at, define a matrix H, called the Jacobian of f(x):
H = ∂f(x)/∂x |_{x = true value} = [ h_1 | h_2 | h_3 | h_4 ],   where   h_j = ∂f/∂x_j = [ ∂f(t_1,x)/∂x_j  ∂f(t_2,x)/∂x_j  …  ∂f(t_N,x)/∂x_j ]^T
Then it is easy to verify that the FIM becomes:
J = H^T C^{-1} H
7
CRLB Matrix
The Cramer-Rao bound covariance matrix then is:
C_CRB(x) = J^{-1} = [ H^T C^{-1} H ]^{-1}
A closed-form expression for the partial derivatives needed for H can be computed in terms of an arbitrary set of navigation data (see the Reading Version of these notes).
Thus, for a given emitter-platform geometry it is possible to compute the matrix H and then use it to compute the CRLB covariance matrix in (5), from which an eigen-analysis can be done to determine the 4-D error ellipsoid.
Can't really plot a 4-D ellipsoid!!! But it is possible to project this 4-D ellipsoid down into 3-D (or even 2-D) so that you can see the effect of geometry.
8
Projection of Error Ellipsoids
Let θ be a zero-mean Gaussian vector made of two sub-vectors x & y: θ = [x^T  y^T]^T. Then the PDF is:
p(θ) = [ 1 / ( (2π)^{p/2} det^{1/2}(C_θ) ) ] exp{ −(1/2) θ^T C_θ^{-1} θ },      C_θ = [ C_x  C_xy ; C_yx  C_y ]
The quadratic form in the exponential defines an ellipse:
θ^T C_θ^{-1} θ = k
We can choose k to make the size of the ellipsoid such that θ falls inside the ellipsoid with a desired probability.
Q: If we are given the covariance C_θ, how is x alone distributed?
A: Extract the sub-matrix C_x out of C_θ.  (See also "Slices of Error Ellipsoids".)
9
Projections of 3-D Ellipsoids onto 2-D Space
2-D Projections Show
Expected Variation
in 2-D Space
[Figure: a 3-D error ellipsoid in (x, y, z) and its shadow on the x-y plane.]
Projection Example
Full vector: θ = [x y z]^T.  Sub-vector: x = [x y]^T.
We want to project the 3-D ellipsoid for θ down into a 2-D ellipse for x; the 2-D ellipse still shows the full range of variations of x and y.
10
Finding Projections
To find the projection of the CRLB ellipse:
1. Invert the FIM to get C_CRB
2. Select the submatrix C_CRB,sub from C_CRB
3. Invert C_CRB,sub to get J_proj
4. Compute the ellipse for the quadratic form of J_proj
Mathematically:
C_CRB,sub = P C_CRB P^T = P J^{-1} P^T,      J_proj = ( P J^{-1} P^T )^{-1}
P is a matrix formed from the identity matrix: keep only the rows of the variables being projected onto.
For this example, frequency-based emitter location with x = [X Y Z f_o]^T, to project the 4-D error ellipsoid onto the X-Y plane:
P = [ 1 0 0 0 ; 0 1 0 0 ]
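A minimal MATLAB sketch of these four steps; the 4x4 FIM below is a made-up example, not a real geometry:

% Sketch (MATLAB): project a 4-D CRLB ellipsoid onto the X-Y plane.
J      = diag([4 9 16 25]) + 0.5;        % assumed FIM for [X Y Z fo]
C_CRB  = inv(J);                         % step 1: CRLB covariance matrix
P      = [1 0 0 0; 0 1 0 0];             % keep the rows for X and Y
C_sub  = P * C_CRB * P.';                % step 2: projected 2x2 covariance
J_proj = inv(C_sub);                     % step 3: quadratic form for the 2-D ellipse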
11
Projections Applied to Emitter Location
Shows 2-D ellipses
that result from
projecting 4-D
ellipsoids
12
Slices of Error Ellipsoids
Q: What happens if one parameter were perfectly known?
Capture this by setting that parameter's error to zero: slice through the error ellipsoid.
[Figure: slices show the impact of knowledge of a parameter.]
Impact:
slice = projection when the ellipsoid is not tilted
slice < projection when the ellipsoid is tilted.
Recall: correlation causes tilt.
1
Chapter 4
Linear Models
2
General Linear Model
Recall the signal + WGN case: x[n] = s[n;θ] + w[n], or x = s(θ) + w; here the dependence on θ is general.
Now we consider a special case, linear observations: s(θ) = Hθ + b.
The General Linear Model:
x = Hθ + b + w
x: data vector (N×1)
H: known observation matrix (N×p), known & full rank
θ: to be estimated (p×1)
b: known offset (N×1)
w ~ N(0, C): zero-mean, Gaussian, C is pos. def.
Note: Gaussian is part of the Linear Model.
3
Need For Full-Rank H Matrix
Note: We must assume H is full rank
Q: Why?
A: If not, the estimation problem is ill-posed: a given vector s corresponds to multiple θ vectors.
If H is not full rank, then for any s there exist θ_1 ≠ θ_2 such that s = Hθ_1 = Hθ_2.
4
Importance of The Linear Model
There are several reasons:
1. Some applications admit this model
2. Nonlinear models can sometimes be linearized
( ) ( ) b x C H H C H =

1
1
1

T T
MVU
as well see!!!
3. Finding Optimal Estimator is Easy
5
MVUE for Linear Model
Theorem: The MVUE for the General Linear Model and its covariance (i.e. its accuracy performance) are given by:
θ̂_MVU = ( H^T C^{-1} H )^{-1} H^T C^{-1} ( x − b )
C_θ̂ = ( H^T C^{-1} H )^{-1}
and the MVUE achieves the CRLB.
Proof: We'll do this for the b = 0 case, but it can easily be done for the more general case.
First we have that x ~ N(Hθ, C) because:
E{x} = E{Hθ + w} = Hθ + E{w} = Hθ
cov{x} = E{ (x − Hθ)(x − Hθ)^T } = E{ w w^T } = C
6
Recalling the CRLB Theorem, look at the partial derivative of the LLF:
∂ ln p(x;θ)/∂θ = ∂/∂θ [ −(1/2) (x − Hθ)^T C^{-1} (x − Hθ) ]
               = ∂/∂θ [ −(1/2) ( x^T C^{-1} x − 2 x^T C^{-1} Hθ + θ^T H^T C^{-1} H θ ) ]
(constant w.r.t. θ; linear w.r.t. θ; quadratic w.r.t. θ. Note: H^T C^{-1} H is symmetric, and (Hθ)^T C^{-1} x = [ (Hθ)^T C^{-1} x ]^T = x^T C^{-1} Hθ.)
Now use the results in "Gradients and Derivatives" posted on BB:
∂ ln p(x;θ)/∂θ = H^T C^{-1} x − H^T C^{-1} H θ
               = ( H^T C^{-1} H ) [ ( H^T C^{-1} H )^{-1} H^T C^{-1} x − θ ]      (pull out H^T C^{-1} H)
               = I(θ) [ g(x) − θ ]
The CRLB Theorem says that if we have this form we have found the MVU estimator θ̂ = g(x), and it achieves the CRLB of I^{-1}(θ)!!
7
For simplicity assume b = 0.
Whitening Filter Viewpoint
Assume C is positive definite (necessary for C^{-1} to exist). Thus, from (A1.2): for pos. def. C there exists an N×N invertible matrix D such that
C^{-1} = D^T D,   i.e.,   C = D^{-1} (D^T)^{-1}
Transform the data x using the matrix D:
x̃ = Dx = DHθ + Dw = H̃θ + w̃
Claim: w̃ is white!!
E{ w̃ w̃^T } = E{ (Dw)(Dw)^T } = D E{ w w^T } D^T = D C D^T = D D^{-1} (D^T)^{-1} D^T = I
[Block diagram: x → whitening filter D → x̃ → MVUE for the Linear Model w/ white noise → θ̂.]
8
Ex. 4.1: Curve Fitting
Caution: The "Linear" in Linear Model does not come from fitting straight lines to data; it is more general than that!!
[Figure: data x[n] vs. n following a curved trend.]
x[n] = θ_1 + θ_2 n + θ_3 n² + w[n]      (linear in the θ's)
The model is quadratic in the index n, but the model is linear in the parameters:
x = Hθ + w,   θ = [θ_1 θ_2 θ_3]^T,   H = [ 1  1  1² ; 1  2  2² ; ⋮ ; 1  N  N² ]      (one row [1 n n²] per sample)
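A minimal MATLAB sketch of this curve fit for the white-noise case (C = σ²I); the true parameters, noise level, and simulated data are illustrative assumptions:

% Sketch (MATLAB): MVU estimate for the quadratic curve-fit linear model.
N = 50; n = (1:N).';                     % sample indices, matching the H above
H = [ones(N,1), n, n.^2];                % observation matrix, rows [1 n n^2]
theta_true = [2; -0.3; 0.01];            % made-up true parameters
x = H*theta_true + 0.5*randn(N,1);       % simulated noisy data
theta_hat = (H.'*H) \ (H.'*x);           % MVUE for white noise: (H'H)^{-1} H'x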
9
Ex. 4.2: Fourier Analysis (not most general)
Data Model:
x[n] = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w[n]      (AWGN)
Parameters to estimate (the Fourier coefficients):
θ = [ a_1 … a_M  b_1 … b_M ]^T
Observation matrix H: its columns are cos(2πkn/N) and sin(2πkn/N) for k = 1, 2, …, M, with n = 0, 1, …, N−1 running down each column.
10
Now apply the MVUE Theorem for the Linear Model:
θ̂_MVU = ( H^T H )^{-1} H^T x
Using the standard orthogonality of sinusoids (see book), H^T H = (N/2) I, so
θ̂_MVU = (2/N) H^T x
Each Fourier coefficient estimate is found by the inner product of a column of H with the data vector x.
Interesting!!! Fourier coefficients for signal + AWGN are MVU estimates of the Fourier coefficients of the noise-free signal.
COMMENT: Modeling and Estimation are intertwined. Sometimes the parameters have some physical significance (e.g. the delay of a radar signal), but sometimes the parameters are part of a non-physical assumed model (e.g. Fourier).
11
Ex. 4.3: System Identification
[Block diagram: known input u[n] → unknown system H(z) → adder with noise w[n] → observed noisy output x[n].]
Goal: Determine a model for the system
Some Application Areas:
Wireless Communications (identify & equalize multipath)
Geophysical Sensing (oil exploration)
Speakerphone (echo cancellation)
In many applications we assume that the system is FIR of length p (p unknown in general, but here we'll assume it is known):
x[n] = Σ_{k=0}^{p−1} h[k] u[n−k] + w[n]      (AWGN; assume u[n] = 0 for n < 0)
The h[k] are the estimation parameters; u[n] is the known input; x[n] is the measured data.
12
Write the FIR convolution in matrix form: x = Hθ + w, with
H = [ u[0]      0        0      …     0
      u[1]     u[0]      0      …     0
      u[2]     u[1]     u[0]    …     0
       ⋮                               ⋮
      u[N−1]  u[N−2]     …      u[N−p] ]      (N×p known input signal matrix)
θ = [ h[0]  h[1]  …  h[p−1] ]^T  (estimate this),   x = measured data.
The Theorem for the Linear Model says:
θ̂_MVU = ( H^T H )^{-1} H^T x,      C_θ̂ = σ² ( H^T H )^{-1}
and the estimate achieves the CRLB.
Q: What signal u[n] is best to use?
A: The u[n] that gives the smallest estimated variances!!
The book shows: choosing u[n] s.t. H^T H is diagonal will minimize the variance ⟹ choose u[n] to be pseudo-random noise (PRN), so that u[n] is (nearly) orthogonal to all its shifts u[n−m]. The proof uses C_θ̂ = σ²(H^T H)^{-1} and the Cauchy-Schwarz inequality (same as the Schwarz inequality).
Note: PRN has an approximately flat spectrum, so from a frequency-domain view a PRN signal equally probes at all frequencies.
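A minimal MATLAB sketch of FIR system identification with a PRN-like probe input; the true taps, noise level, and data lengths are illustrative assumptions:

% Sketch (MATLAB): FIR system ID with a pseudo-random +/-1 input.
N = 500; p = 4;
u = sign(randn(N,1));                     % PRN-like probe input
h_true = [1; 0.5; -0.3; 0.1];             % made-up FIR system
H = toeplitz(u, [u(1); zeros(p-1,1)]);    % N x p convolution matrix of u[n]
x = H*h_true + 0.1*randn(N,1);            % observed noisy output
h_hat = (H.'*H) \ (H.'*x);                % MVU estimate (white-noise case)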
1
Chapter 6
Best Linear Unbiased Estimate
(BLUE)
2
Motivation for BLUE
Except for Linear Model case, the optimal MVU estimator might:
1. not even exist
2. be difficult or impossible to find
Resort to a sub-optimal estimate
BLUE is one such sub-optimal estimate
Idea for BLUE:
1. Restrict estimate to be linear in data x
2. Restrict estimate to be unbiased
3. Find the best one (i.e. with minimum variance)
Advantage of BLUE: Needs only the 1st and 2nd moments of the PDF (mean & covariance).
Disadvantages of BLUE:
1. Sub-optimal (in general)
2. Sometimes totally inappropriate (see bottom of p. 134)
3
6.3 Definition of BLUE (scalar case)
Observed Data: x = [ x[0] x[1] … x[N−1] ]^T
PDF: p(x; θ) depends on the unknown θ
BLUE constrained to be linear in the data:
θ̂_BLU = Σ_{n=0}^{N−1} a_n x[n] = a^T x
Choose the a_n's to give: 1. an unbiased estimator, 2. then minimize the variance.
[Figure (note: this is not Fig. 6.1): variance of unbiased estimators; among linear unbiased estimators the BLUE has the smallest variance, but the (possibly nonlinear) MVUE can be lower still.]
4
6.4 Finding The BLUE (Scalar Case)
1. Constrain it to be linear:  θ̂ = Σ_{n=0}^{N−1} a_n x[n]
2. Constrain it to be unbiased:  E{θ̂} = Σ_{n=0}^{N−1} a_n E{x[n]} = θ      (using the linear constraint)
Q: When can we meet both of these constraints?
A: Only for certain observation models (e.g., linear observations)
5
Finding BLUE for Scalar Linear Observations
Consider a scalar-parameter linear observation: x[n] = θ s[n] + w[n], so E{x[n]} = θ s[n].
Then for the unbiased condition we need:
E{θ̂} = Σ_{n=0}^{N−1} a_n θ s[n] = θ   ⟹   need a^T s = 1
This tells how to choose the weights to use in the BLUE estimator form θ̂ = Σ a_n x[n].
Now, given that these constraints are met, we need to minimize the variance!! Given that C is the covariance matrix of x we have:
var{θ̂_BLU} = var{ a^T x } = a^T C a      (like var{aX} = a² var{X})
6
Goal: minimize a^T C a subject to a^T s = 1 (constrained optimization).
Appendix 6A: use Lagrange multipliers. Minimize J = a^T C a + λ(a^T s − 1):
Set ∂J/∂a = 2Ca + λs = 0  ⟹  a = −(λ/2) C^{-1} s;  imposing a^T s = 1 gives −λ/2 = 1/(s^T C^{-1} s), so
a = C^{-1} s / ( s^T C^{-1} s )
θ̂_BLUE = a^T x = s^T C^{-1} x / ( s^T C^{-1} s ),      var(θ̂_BLUE) = 1 / ( s^T C^{-1} s )
Appendix 6A shows that this achieves a global minimum.
7
Applicability of BLUE
We just derived the BLUE under the following:
1. Linear observations but with no constraint on the noise PDF
2. No knowledge of the noise PDF other than its mean and cov!!
What does this tell us???
BLUE is applicable to linear observations,
but the noise need not be Gaussian!!! (as was assumed in the Ch. 4 Linear Model)
And all we need are the 1st and 2nd moments of the PDF!!!
But, as we'll see in the example, we can often linearize a nonlinear model!!!
8
6.5 Vector Parameter Case: Gauss-Markov Thm
Gauss-Markov Theorem:
If the data can be modeled as having linear observations in noise:
x = Hθ + w
with H a known matrix and w having known mean & covariance C (the PDF is otherwise arbitrary & unknown), then the BLUE is
θ̂_BLUE = ( H^T C^{-1} H )^{-1} H^T C^{-1} x
and its covariance is
C_θ̂ = ( H^T C^{-1} H )^{-1}
Note: If the noise is Gaussian then the BLUE is the MVUE.
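A minimal MATLAB sketch of the Gauss-Markov BLUE; the model matrix H, the assumed noise covariance C, and the simulated data are all illustrative assumptions:

% Sketch (MATLAB): BLUE for x = H*theta + w with known noise covariance C.
N = 40; H = [ones(N,1), (0:N-1).'];          % simple 2-parameter linear model
C = toeplitz(0.8.^(0:N-1));                  % assumed correlated-noise covariance
theta_true = [1; 0.05];
w = chol(C, 'lower')*randn(N,1);             % noise with covariance C (PDF need not be Gaussian)
x = H*theta_true + w;
theta_blue = (H.'*(C\H)) \ (H.'*(C\x));      % BLUE = (H'C^{-1}H)^{-1} H'C^{-1} x
cov_blue   = inv(H.'*(C\H));                 % BLUE covariance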
9
Ex. 4.3: TDOA-Based Emitter Location
[Figure: emitter Tx at (x_s, y_s) transmitting s(t); receivers Rx_1 at (x_1, y_1), Rx_2 at (x_2, y_2), Rx_3 at (x_3, y_3) receive s(t − t_1), s(t − t_2), s(t − t_3). Each TDOA, e.g. Δt_12 = t_2 − t_1 = constant and Δt_23 = t_3 − t_2 = constant, defines a hyperbola; the emitter lies at their intersection.]
TDOA = Time-Difference-of-Arrival
Assume that the i
th
Rx can measure its TOA: t
i
Then from the set of TOAs compute TDOAs
Then from the set of TDOAs estimate location (x
s
,y
s
)
We won't worry about how they do that. Also, there are TDOA systems that never actually estimate TOAs!
10
TOA Measurement Model
Assume measurements of the TOAs at N receivers (only 3 shown above): t_0, t_1, …, t_{N−1}. There are measurement errors. TOA measurement model:
t_i = T_o + R_i/c + ε_i,    i = 0, 1, …, N−1
T_o = time the signal was emitted
R_i = range from Tx to Rx_i
c = speed of propagation (for EM: c = 3×10⁸ m/s)
Measurement noise ε_i: zero-mean, variance σ², independent (but PDF unknown); the variance is determined from the estimator used to estimate the t_i's.
Now use R_i = [ (x_s − x_i)² + (y_s − y_i)² ]^{1/2}:
t_i = f_i(x_s, y_s) = T_o + (1/c) √( (x_s − x_i)² + (y_s − y_i)² ) + ε_i      ⟹ Nonlinear Model
11
Linearization of TOA Model
θ = [x_s y_s]^T, so we linearize the model so we can apply BLUE.
Assume some rough estimate (x_n, y_n) is available:
x_s = x_n + δx_s,   y_s = y_n + δy_s      (known estimate + correction to be estimated)
Now use a truncated Taylor series to linearize R_i about (x_n, y_n):
R_i ≈ R_i^n + (∂R_i/∂x_s)|_n (x_s − x_n) + (∂R_i/∂y_s)|_n (y_s − y_n) = R_i^n + A_i δx_s + B_i δy_s      (R_i^n, A_i, B_i known)
Apply this to the TOA and move the known part to the left side:
t̃_i ≜ t_i − R_i^n/c = T_o + (A_i/c) δx_s + (B_i/c) δy_s + ε_i
Three unknown parameters to estimate: T_o, δx_s, δy_s
12
TOA Model vs. TDOA Model
Two options now:
1. Use the TOAs to estimate 3 parameters: T_o, δx_s, δy_s
2. Use the TDOAs to estimate 2 parameters: δx_s, δy_s
Generally the fewer parameters the better, everything else being the same. But here everything else is not the same: Options 1 & 2 have different noise models (Option 1 has independent noise; Option 2 has correlated noise). In practice we'd explore both options and see which is best.
13
Conversion to TDOA Model
TDOAs (N−1 TDOAs rather than N TOAs):
τ_i = t̃_i − t̃_{i−1},   i = 1, 2, …, N−1
    = ( (A_i − A_{i−1})/c ) δx_s + ( (B_i − B_{i−1})/c ) δy_s + (ε_i − ε_{i−1})      (known coefficients; correlated noise)
In matrix form: x = Hθ + w, with
x = [ τ_1  τ_2  …  τ_{N−1} ]^T,   θ = [ δx_s  δy_s ]^T
H = (1/c) [ A_1−A_0   B_1−B_0 ;  A_2−A_1   B_2−B_1 ;  ⋮ ;  A_{N−1}−A_{N−2}   B_{N−1}−B_{N−2} ]
w = Aε   (see book for the structure of the matrix A),      C_w = cov{w} = σ² A A^T
14
Apply BLUE to TDOA Linearized Model
θ̂_BLUE = ( H^T C_w^{-1} H )^{-1} H^T C_w^{-1} x = ( H^T (AA^T)^{-1} H )^{-1} H^T (AA^T)^{-1} x      (the dependence on σ² cancels out!!!)
C_θ̂ = ( H^T C_w^{-1} H )^{-1} = σ² ( H^T (AA^T)^{-1} H )^{-1}      (describes how large the location error is)
Things we can now do:
1. Explore the estimation error covariance for different Tx/Rx geometries (plot error ellipses)
2. Analytically explore simple geometries to find trends (see the next chart; more details in the book)
15
Apply TDOA Result to Simple Geometry
[Figure: receivers Rx_1, Rx_2, Rx_3 on a line with spacing d; target Tx at range R and angle β.]
Then one can show:
C_θ̂ = c²σ² [ 1/(2cos²β)        0          ;
                  0        3/(2(1 − sin β)²) ]
⟹ Diagonal error cov ⟹ aligned error ellipse, and the y-error is always bigger than the x-error.
16
[Plot: σ_x/(cσ) and σ_y/(cσ) versus β (degrees, 0 to 90) on a log scale, for the Rx_1-Rx_2-Rx_3 geometry above. Std. dev. is used to show the units of X & Y; the curves are normalized by c, so get actual values by multiplying by your specific c value.]
For Fixed Range R: Increasing Rx Spacing d Improves Accuracy
For Fixed Spacing d: Decreasing Range R Improves Accuracy
1
Chapter 7
Maximum Likelihood Estimate
(MLE)
2
Motivation for MLE
Problems: 1. The MVUE often does not exist or can't be found. <See Ex. 7.1 in the textbook for such a case>
          2. The BLUE may not be applicable (x ≠ Hθ + w).
Solution: If the PDF is known, then the MLE can always be used!!! This makes the MLE one of the most popular practical methods.
Advantages: 1. It is a "turn-the-crank" method. 2. It is optimal for large enough data size.
Disadvantages: 1. Not optimal for small data size. 2. Can be computationally complex (may require numerical methods).
3
Rationale for MLE
Choose the parameter value that:
makes the data you did observe
the most likely data to have been observed!!!
Consider 2 possible parameter values:
1
&
2
Ask the following: If
i
were really the true value, what is the
probability that I would get the data set I really got ?
Let this probability be P
i
So if P
i
is smallit says you actually got a data set that was
unlikely to occur! Not a good guess for
i
!!!
But p
1
= p(x;
1
) dx
p
2
= p(x;
2
) dx
pick so that is largest
ML

; (
ML
p x
4
Definition of the MLE
θ̂_ML is the value of θ that maximizes the likelihood function p(x;θ) for the specific measured data x.
[Figure: p(x;θ) versus θ, with the peak at θ̂_ML.]
Note: Because ln(z) is a monotonically increasing function, θ̂_ML also maximizes the log-likelihood function ln p(x;θ).
General Analytical Procedure to Find the MLE:
1. Find the log-likelihood function: ln p(x;θ)
2. Differentiate w.r.t. θ and set to 0:  ∂ln p(x;θ)/∂θ = 0
3. Solve for the θ value that satisfies the equation
5
Ex. 7.3: Ex. of MLE When MVUE Non-Existent
x[n] = A + w[n],  w[n] WGN ~ N(0, A)   ⟹   x[n] ~ N(A, A),   with A > 0
Likelihood Function:
p(x; A) = [ 1/(2πA)^{N/2} ] exp[ −(1/(2A)) Σ_{n=0}^{N−1} (x[n] − A)² ]
To take the ln of this use the log properties. Take ∂/∂A, set = 0, and change A to Â:
−N/(2Â) + (1/Â) Σ_{n=0}^{N−1} (x[n] − Â) + (1/(2Â²)) Σ_{n=0}^{N−1} (x[n] − Â)² = 0
Expand this and cancel terms:
6
Manipulate to get:
Â² + Â − (1/N) Σ_{n=0}^{N−1} x²[n] = 0
Solve the quadratic equation to get the MLE:
Â_ML = −1/2 + √( (1/N) Σ_{n=0}^{N−1} x²[n] + 1/4 )
Can show this estimator is biased (see bottom of p. 160), but it is asymptotically unbiased. Use the Law of Large Numbers (sample mean → true mean):
(1/N) Σ_{n=0}^{N−1} x²[n] → E{x²[n]}   as N → ∞
E{Â_ML} → −1/2 + √( E{x²[n]} + 1/4 ) = −1/2 + √( A² + A + 1/4 ) = A
So one can use this to show:
var(Â) → A² / ( N(A + 1/2) ) = CRLB
⟹ asymptotically unbiased & efficient.
7
7.5 Properties of the MLE (or Why We Love the MLE)
The MLE is asymptotically:
1. unbiased
2. efficient (i.e. achieves the CRLB)
3. Gaussian PDF
Also, if a truly efficient estimator exists, then the ML procedure finds it!
The asymptotic properties are captured in Theorem 7.1: If p(x; θ) satisfies some regularity conditions, then the MLE is asymptotically distributed according to
θ̂_ML ~ᵃ N( θ, I^{-1}(θ) )
where I(θ) = Fisher Information Matrix.
8
Size of N to Achieve Asymptotic
This Theorem only states what happens asymptotically
when N is small there is no guarantee how the MLE behaves
Q: How large must N be to achieve the asymptotic properties?
A: In practice: use Monte Carlo Simulations to answer this
9
Monte Carlo Simulations: see Appendix 7A
Not just for the MLE!!! This is a methodology for doing computer simulations to evaluate the performance of any estimation method. Illustrate for a deterministic signal s[n;θ] in AWGN.
Monte Carlo Simulation, Data Collection:
1. Select a particular true parameter value, θ_true (you are often interested in doing this for a variety of values of θ, so you would run one MC simulation for each value of interest)
2. Generate the signal having the true θ: s[n; θ_true]  (call it s in matlab)
3. Generate WGN having unit variance:  w = randn ( size(s) );
4. Form the measured data:  x = s + sigma*w;
   (choose sigma to get the desired SNR; usually you want to run at many SNR values, so do one MC simulation for each SNR value)
10
Data Collection (Continued):
5. Compute estimate from data x
6. Repeat steps 3-5 M times
- (call M # of MC runs or just # of runs)
7. Store all M estimates in a vector EST (assumes scalar )
Statistical Evaluation:
1. Compute bias
2. Compute error RMS
3. Compute the error Variance
4. Plot Histogram or Scatter Plot (if desired)
b̂ = (1/M) Σ_{i=1}^{M} ( θ̂_i − θ_true )
RMS = √( (1/M) Σ_{i=1}^{M} ( θ̂_i − θ_true )² )
VAR = (1/M) Σ_{i=1}^{M} ( θ̂_i − (1/M) Σ_{j=1}^{M} θ̂_j )²
Now explore (via plots) how the Bias, RMS, and VAR vary with: the θ value, the SNR value, the N value, etc.
Is b̂ ≈ 0?  Is RMS ≈ √CRLB?
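A minimal MATLAB sketch of this Monte Carlo recipe; the "DC level in WGN" setup, the sample-mean estimator, and the run parameters are illustrative choices, not from the notes:

% Sketch (MATLAB): Monte Carlo evaluation of an estimator.
M = 1000; N = 50; theta_true = 2; sigma = 1;     % assumed run parameters
EST = zeros(M,1);
for i = 1:M
    x = theta_true + sigma*randn(N,1);           % steps 3-4: generate noisy data
    EST(i) = mean(x);                            % step 5: compute the estimate
end
b   = mean(EST - theta_true);                    % bias
RMS = sqrt(mean((EST - theta_true).^2));         % error RMS
VAR = mean((EST - mean(EST)).^2);                % error variance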
11
Ex. 7.6: Phase Estimation for a Sinusoid
Some Applications:
1. Demodulation of phase-coherent modulations (e.g., DSB, SSB, PSK, QAM, etc.)
2. Phase-Based Bearing Estimation
Signal Model: x[n] = A cos(2π f_o n + φ) + w[n],  n = 0, 1, …, N−1, with A and f_o known, φ unknown, and w[n] white ~ N(0, σ²).
Recall the CRLB:
var(φ̂) ≥ 2σ² / (N A²) = 1 / (N · SNR)
For this problem all methods for finding the MVUE will fail!! So… try the MLE!!
12
So first we write the likelihood function:
p(x; φ) = [ 1/(2πσ²)^{N/2} ] exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A cos(2π f_o n + φ) )² ]
GOAL: Find the φ that maximizes this, which is equivalent to minimizing
J(φ) = Σ_{n=0}^{N−1} ( x[n] − A cos(2π f_o n + φ) )²
(you end up in the same place if you maximize the LLF). Setting ∂J/∂φ = 0 gives
Σ_{n=0}^{N−1} x[n] sin(2π f_o n + φ) = A Σ_{n=0}^{N−1} sin(2π f_o n + φ) cos(2π f_o n + φ) ≈ 0
since sin·cos terms are ≈ 0 when summed over full cycles. So the MLE phase estimate satisfies:
Σ_{n=0}^{N−1} x[n] sin(2π f_o n + φ̂) = 0      (interpret via an inner product or correlation)
13
Now, using a trig identity and then re-arranging gives:
cos(φ̂) Σ_n x[n] sin(2π f_o n) = −sin(φ̂) Σ_n x[n] cos(2π f_o n)
or
φ̂_ML = −tan^{-1}[ Σ_n x[n] sin(2π f_o n) / Σ_n x[n] cos(2π f_o n) ]
Recall: this is the approximate MLE. We don't need to know A or σ², but we do need to know f_o.
[Figure: I-Q signal generation: x(t) is multiplied by cos(2πf_o t) and −sin(2πf_o t) and each branch is lowpass filtered to give y_i(t) and y_q(t).]
The sums in the above equation play the role of the LPFs in the figure (why?). Thus the ML phase estimator can be viewed as the atan of the Q/I ratio.
14
Monte Carlo Results for ML Phase Estimation
See figures 7.3 & 7.4 in text book
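A minimal MATLAB sketch of the approximate ML phase estimator above; the simulated data parameters are illustrative assumptions:

% Sketch (MATLAB): approximate ML phase estimate for a sinusoid of known A, fo.
N = 200; fo = 0.1; A = 1; phi_true = 0.7; sigma = 0.5;
n = (0:N-1).';
x = A*cos(2*pi*fo*n + phi_true) + sigma*randn(N,1);
Q = sum(x .* sin(2*pi*fo*n));            % "Q" correlation sum
I = sum(x .* cos(2*pi*fo*n));            % "I" correlation sum
phi_hat = -atan2(Q, I);                  % approximate MLE: -atan of the Q/I ratio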
1
7.6 MLE for Transformed Parameters
Given the PDF p(x; θ) but wanting an estimate of α = g(θ): what is the MLE for α??
Two cases:
1. α = g(θ) is a one-to-one function:
α̂_ML maximizes p(x; g^{-1}(α))
2. α = g(θ) is not a one-to-one function: need to define the modified likelihood function
p̄_T(x; α) = max_{θ: g(θ) = α} p(x; θ)
(For each α, find all θ's that map to it, then extract the largest value of p(x; θ) over this set of θ's.)
α̂_ML maximizes p̄_T(x; α)
2
Invariance Property of MLE: Another Big Advantage of the MLE!
Theorem 7.2 (Invariance Property of MLE): If the parameter θ is mapped according to α = g(θ), then the MLE of α is given by
α̂_ML = g(θ̂_ML)
where θ̂_ML is the MLE for θ found by maximizing p(x; θ).
Note: when g(θ) is not one-to-one, the MLE for α maximizes the modified likelihood function.
Proof: Easy to see when g(θ) is one-to-one; otherwise one can argue that the maximization over θ inside the definition of the modified LF ensures the result.
3
Ex. 7.9: Estimate Power of DC Level in AWGN
x[n] = A + w[n], noise is N(0, σ²) & white. Want to estimate the power: α = A².
α = g(A) = A² is not one-to-one: for each α value there are 2 PDFs to consider (A = ±√α), so
p̄_T(x; α) = max{ p(x; √α), p(x; −√α) }
Then:
α̂_ML = arg max_{0<α<∞} p̄_T(x; α) = [ arg max_A p(x; A) ]² = Â_ML²
a demonstration that the invariance result holds for this example.
4
Ex. 7.10: Estimate Power of WGN in dB
x[n] = w[n], WGN with variance σ² unknown. Recall P_noise = σ².
Can show that the MLE for the variance is:
P̂_noise = σ̂² = (1/N) Σ_{n=0}^{N−1} x²[n]
To get the dB version of the power estimate, use the invariance property!
P̂_dB = 10 log_10[ (1/N) Σ_{n=0}^{N−1} x²[n] ]
Note: You may recall a result for estimating variance that divides by N−1 rather than by N; that estimator is unbiased, while this estimate is biased (but asymptotically unbiased).
5
7.7: Numerical Determination of MLE
Note: In all previous examples we ended up with a closed-form expression for the MLE: θ̂_ML = f(x).
Ex. 7.11: x[n] = r^n + w[n], noise is N(0, σ²) & white. Estimate r.
If −1 < r < 0 then this signal is a decaying oscillation that might be used to model a ship's hull "ping", a vibrating string, etc.
To find the MLE:
∂ ln p(x; r)/∂r = 0   ⟹   Σ_{n=0}^{N−1} ( x[n] − r^n ) n r^{n−1} = 0
No closed-form solution for the MLE!
6
So… we can't always find a closed-form MLE! But a main advantage of the MLE is: we can always find it numerically!!! (Not always computationally efficiently, though.)
Brute Force Method: compute p(x; θ) on a fine grid of θ values.
Advantage: sure to find the maximum (if the grid is fine enough).
Disadvantage: lots of computation (especially with a fine grid).
[Figure: p(x; θ) versus θ evaluated on a grid.]
7
Iterative Methods for Numerical MLE
Step #1: Pick some initial estimate θ̂_0
Step #2: Iteratively improve it using θ̂_{i+1} = f(θ̂_i, x) such that lim_{i→∞} p(x; θ̂_i) = max_θ p(x; θ)
"Hill Climbing in the Fog"
Note: A so-called "greedy" maximization algorithm will always move up, even though taking an occasional step downward may be the better global strategy!
Convergence Issues:
1. May not converge
2. May converge, but to a local maximum
   - a good initial guess is needed!!
   - can use a rough grid search to initialize
   - can use multiple initializations
8
Iterative Method: Newton-Raphson MLE
The MLE is the maximum of the LF, so set the derivative to 0:
g(θ) ≜ ∂ ln p(x; θ)/∂θ = 0      (so the MLE is a zero of g(θ))
Newton-Raphson is a numerical method for finding the zero of a function, so it can be applied here. Linearize g(θ) with a truncated Taylor series about the current iterate θ_k:
g(θ) ≈ g(θ_k) + (dg/dθ)|_{θ_k} (θ − θ_k),   set = 0 and solve for θ:
θ_{k+1} = θ_k − g(θ_k) / (dg/dθ)|_{θ_k}
9

Now, using our definition of convenience g(θ) = ∂ ln p(x; θ)/∂θ, the Newton-Raphson MLE iteration is:
θ_{k+1} = θ_k − [ ∂² ln p(x; θ)/∂θ² ]^{-1} [ ∂ ln p(x; θ)/∂θ ] |_{θ = θ_k}
Iterate until a convergence criterion is met:  |θ_{k+1} − θ_k| < ε      (you get to choose ε!)
Look familiar??? It looks like I(θ), except that I(θ) is evaluated at the true θ and has an expected value.
Generally: for a given PDF model, compute the derivatives analytically, or compute them numerically, e.g.
∂ ln p(x; θ)/∂θ |_{θ_k} ≈ [ ln p(x; θ_k + δθ) − ln p(x; θ_k) ] / δθ
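A minimal MATLAB sketch of Newton-Raphson applied to the Ex. 7.11 model x[n] = r^n + w[n]; the data, noise level and starting point are illustrative assumptions:

% Sketch (MATLAB): Newton-Raphson MLE iteration for r in x[n] = r^n + w[n].
N = 50; r_true = -0.8; sigma = 0.1;
n = (0:N-1).';
x = r_true.^n + sigma*randn(N,1);
r = -0.5;                                       % initial guess (e.g., from a coarse grid)
for k = 1:25
    s   = r.^n;                                 % signal model and its r-derivatives
    ds  = n .* r.^(max(n-1,0));
    d2s = n .* (n-1) .* r.^(max(n-2,0));
    g  = sum((x - s).*ds)/sigma^2;              % d lnp / dr
    dg = sum(-ds.^2 + (x - s).*d2s)/sigma^2;    % d^2 lnp / dr^2
    r_new = r - g/dg;                           % Newton-Raphson update
    if abs(r_new - r) < 1e-8, r = r_new; break; end
    r = r_new;
end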
10
Convergence Issues of Newton-Raphson:
1. May not converge
2. May converge, but to local maximum
- good initial guess is needed !!
- can use rough grid search to initialize
- can use multiple initializations
Some Other Iterative MLE Methods


1. Scoring Method
Replaces second-partial term by I( )
2. Expectation-Maximization (EM) Method
Guarantees convergence to at least a local maximum
Good for complicated multi-parameter cases
11
7.8 MLE for Vector Parameter
Another nice property of MLE is how easily it carries over to the
vector parameter case.
The vector parameter is:
θ = [ θ_1  θ_2  …  θ_p ]^T
θ̂_ML is the vector that satisfies:
∂ ln p(x; θ)/∂θ = 0
where the derivative w.r.t. a vector is the gradient:
∂f(θ)/∂θ = [ ∂f(θ)/∂θ_1,  ∂f(θ)/∂θ_2,  …,  ∂f(θ)/∂θ_p ]^T
12
Ex. 7.12: Estimate DC Level and Variance
x[n] = A + w[n], noise is N(0, σ²) and white. Estimate the DC level A and the noise variance σ²: θ = [A σ²]^T.
The LF is:
p(x; A, σ²) = [ 1/(2πσ²)^{N/2} ] exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ]
Solve ∂ ln p(x; θ)/∂θ = 0:
∂ ln p/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = 0   ⟹   Â_ML = x̄      (the sample mean)
∂ ln p/∂σ² = −N/(2σ²) + (1/(2σ⁴)) Σ_{n=0}^{N−1} (x[n] − A)² = 0   ⟹   σ̂²_ML = (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)²
Interesting: for this problem, first estimate A just like the scalar case, then subtract it off and estimate the variance like the scalar case.
13
Properties of Vector ML
The asymptotic properties are captured in Theorem 7.3: If p(x;θ) satisfies some regularity conditions, then the MLE is asymptotically distributed according to
θ̂_ML ~ᵃ N( θ, I^{-1}(θ) )
where I(θ) = Fisher Information Matrix. So the vector ML is asymptotically unbiased and efficient.
The Invariance Property also holds for the vector case: if α = g(θ), then α̂_ML = g(θ̂_ML).
14
Ex. 7.12 Revisited
It can be shown that:
E{ [Â  σ̂²]^T } = [ A,  (N−1)σ²/N ]^T,      cov{ [Â  σ̂²]^T } = [ σ²/N   0 ;  0   2(N−1)σ⁴/N² ]
For large N then:
E{ [Â  σ̂²]^T } → [ A,  σ² ]^T,      cov → [ σ²/N   0 ;  0   2σ⁴/N ] = I^{-1}(θ)
which we see satisfies the asymptotic property.
The diagonal covariance matrix shows the estimates are uncorrelated: the error ellipse is aligned with the axes. This is why we could decouple the estimates.
15
MLE for the General Gaussian Case
Let the data be general Gaussian: x ~ N( μ(θ), C(θ) ). Thus ∂ ln p(x;θ)/∂θ will depend in general on μ(θ) and C(θ).
For each k = 1, 2, …, p set:
∂ ln p(x; θ)/∂θ_k = 0
This gives p simultaneous equations, the k-th one being:
−(1/2) tr[ C^{-1}(θ) ∂C(θ)/∂θ_k ]  +  [ ∂μ(θ)/∂θ_k ]^T C^{-1}(θ) ( x − μ(θ) )  +  (1/2) ( x − μ(θ) )^T C^{-1}(θ) [ ∂C(θ)/∂θ_k ] C^{-1}(θ) ( x − μ(θ) )  =  0
(Term #1, Term #2, Term #3.)
Note: for the deterministic signal + noise case, Terms #1 & #3 are zero.
This gives general conditions to find the MLE, but we can't always solve them!!!
16
MLE for Linear Model Case
The signal model is x = Hθ + w with the noise w ~ N(0, C). So Terms #1 & #3 are zero and Term #2 gives:
H^T C^{-1} ( x − Hθ ) = 0
For this case we can solve these equations:
θ̂_ML = ( H^T C^{-1} H )^{-1} H^T C^{-1} x
Hey! Same as Chapter 4's MVU for the linear model. (Recall: the Linear Model is specified to have Gaussian noise.)  ⟹  For the Linear Model: ML = MVU, and
θ̂_ML ~ N( θ, ( H^T C^{-1} H )^{-1} )      EXACT, not asymptotic!!
17
Numerical Solutions for Vector Case
Obvious generalizations; see p. 187. There is one issue to be aware of, though: the numerical implementation needs ∂ ln p(x;θ)/∂θ. For the general Gaussian case this requires ∂C^{-1}(θ)/∂θ, and it is often hard to get C^{-1}(θ) analytically and then differentiate! So we use (3C.2):
∂C^{-1}(θ)/∂θ_k = −C^{-1}(θ) [ ∂C(θ)/∂θ_k ] C^{-1}(θ)
Get ∂C(θ)/∂θ_k analytically; get the rest numerically.
18
7.9 Asymptotic MLE
Useful when data samples x[n] come from a WSS process
Reading Assignment Only
1
7.10 MLE Examples
We'll now apply the MLE theory to several examples of practical signal processing problems. These are the same examples for which we derived the CRLB in Ch. 3:
1. Range Estimation
sonar, radar, robotics, emitter location
2. Sinusoidal Parameter Estimation (Amp., Frequency, Phase)
sonar, radar, communication receivers (recall DSB Example), etc.
3. Bearing Estimation
sonar, radar, emitter location
4. Autoregressive Parameter Estimation
speech processing, econometrics
See Book
We
Will
Cover
2
Ex. 1 Range Estimation Problem
Transmit Pulse: s(t) nonzero over t ∈ [0, T_s]
Receive Reflection: s(t − τ_o)
Measure Time Delay: τ_o
C-T Signal Model:
x(t) = s(t − τ_o) + w(t),   0 ≤ t ≤ T,   0 ≤ τ_o ≤ τ_max
[Figure: transmitted pulse s(t); received pulse s(t − τ_o); receiver BPF & amp producing x(t); w(t) is bandlimited white Gaussian noise with PSD N_o/2 over |f| ≤ B.]
3
Range Estimation D-T Signal Model
Sample every Δ = 1/(2B) sec, with w[n] = w(nΔ): DT white Gaussian noise with variance σ² = B N_o.
[Figure: PSD of w(t) (level N_o/2 over |f| ≤ B) and its ACF with zeros at multiples of 1/(2B).]
x[n] = s[n − n_o] + w[n],   n = 0, 1, …, N−1,   with n_o ≈ τ_o/Δ
s[n; n_o] has M non-zero samples starting at n_o:
x[n] = w[n],                   0 ≤ n ≤ n_o − 1
x[n] = s[n − n_o] + w[n],      n_o ≤ n ≤ n_o + M − 1
x[n] = w[n],                   n_o + M ≤ n ≤ N − 1
4
Range Estimation Likelihood Function
White and Gaussian ⟹ independent ⟹ product of PDFs, with 3 different PDFs, one for each subinterval:
p(x; n_o) = C · Π_{n=0}^{n_o−1} exp( −x²[n]/(2σ²) ) · Π_{n=n_o}^{n_o+M−1} exp( −(x[n] − s[n−n_o])²/(2σ²) ) · Π_{n=n_o+M}^{N−1} exp( −x²[n]/(2σ²) )
Expand the middle product to get an x²[n] term and group it with the other x²[n] terms:
p(x; n_o) = C · exp( −(1/(2σ²)) Σ_{n=0}^{N−1} x²[n] ) · exp( −(1/(2σ²)) Σ_{n=n_o}^{n_o+M−1} ( −2 x[n] s[n−n_o] + s²[n−n_o] ) )
The first exponential does not depend on n_o; we must minimize the second sum (or maximize its negative) over values of n_o.
5
Range Estimation ML Condition
So maximize:
Σ_{n=n_o}^{n_o+M−1} x[n] s[n−n_o] − (1/2) Σ_{n=n_o}^{n_o+M−1} s²[n−n_o]
The second term doesn't depend on n_o (its summand moves with the limits as n_o changes), so maximize the first term; and because s[n−n_o] = 0 outside the summation range, we can extend the sum:
maximize   Σ_{n=0}^{N−1} x[n] s[n−n_o]
So the MLE implementation is based on cross-correlation: correlate the received signal x[n] with the transmitted signal s[n]:
n̂_o = arg max_{m ∈ {0,…,N−M}} C_xs[m],      C_xs[m] = Σ_{n=0}^{N−1} x[n] s[n−m]
6
Range Estimation MLE Viewpoint
[Figure: the cross-correlation C_xs[m] versus m peaks at m = n_o.]
C_xs[m] = Σ_{n=0}^{N−1} x[n] s[n−m]
Think of this as an inner product for each m: compare the data x[n] to all possible delays of the signal s[n] and pick n̂_o to make them most alike.
Warning: when the signals are complex (e.g., equivalent low-pass signals), find the peak of |C_xs[m]|.
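A minimal MATLAB sketch of this cross-correlation delay estimator; the pulse shape, true delay and noise level are illustrative assumptions:

% Sketch (MATLAB): ML delay estimate by correlating data with the known pulse.
N = 512; M = 64; n_o = 137; sigma = 0.5;
s = sin(pi*(0:M-1)/M).^2;                  % assumed transmitted pulse, M samples
x = sigma*randn(1,N);
x(n_o+1:n_o+M) = x(n_o+1:n_o+M) + s;       % received data with delayed pulse
C = zeros(1, N-M+1);
for m = 0:N-M
    C(m+1) = x(m+1:m+M) * s.';             % correlation at lag m
end
[~, idx] = max(C);
n_o_hat = idx - 1;                         % ML estimate of the delay (in samples)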
7
Ex. 2 Sinusoid Parameter Estimation Problem
Given DT signal samples of a sinusoid in noise, estimate its amplitude, frequency, and phase:
x[n] = A cos(ω_o n + φ) + w[n],   n = 0, 1, …, N−1
ω_o is the DT frequency in rad/sample, 0 < ω_o < π; w[n] is DT white Gaussian noise, zero mean, variance σ².
Multiple parameters, so the parameter vector is θ = [A ω_o φ]^T.
The likelihood function is:
p(x; θ) = C exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A cos(ω_o n + φ) )² ],   where the sum ≜ J(A, ω_o, φ)
For the MLE: minimize J.
8
Sinusoid Parameter Estimation ML Condition
To make things easier, define an equivalent parameter set [α_1 α_2 ω_o]^T with
α_1 = A cos(φ),   α_2 = −A sin(φ)
Then J'(α_1, α_2, ω_o) = J(A, ω_o, φ). Define α = [α_1 α_2]^T and
c(ω_o) = [ 1  cos(ω_o)  cos(2ω_o)  …  cos((N−1)ω_o) ]^T
s(ω_o) = [ 0  sin(ω_o)  sin(2ω_o)  …  sin((N−1)ω_o) ]^T
and H(ω_o) = [ c(ω_o)  s(ω_o) ], an N×2 matrix.
9
Then:  J'(α_1, α_2, ω_o) = [ x − H(ω_o)α ]^T [ x − H(ω_o)α ]
This looks like the linear model case, except for the ω_o dependence of H(ω_o). Thus, for any fixed ω_o value, the optimal estimate is
α̂ = [ H^T(ω_o) H(ω_o) ]^{-1} H^T(ω_o) x
Then plug that into J'(α_1, α_2, ω_o):
J'(α̂_1, α̂_2, ω_o) = [ x − H(H^T H)^{-1} H^T x ]^T [ x − H(H^T H)^{-1} H^T x ]
                   = x^T [ I − H(H^T H)^{-1} H^T ] x
                   = x^T x − x^T H(H^T H)^{-1} H^T x
so minimizing over ω_o is the same as maximizing x^T H(ω_o) [ H^T(ω_o) H(ω_o) ]^{-1} H^T(ω_o) x.
10
Sinusoid Parms. Exact MLE Procedure
Step 1: Maximize over ω_o to find ω̂_o (done numerically):
ω̂_o = arg max_{0<ω_o<π}  x^T H(ω_o) [ H^T(ω_o) H(ω_o) ]^{-1} H^T(ω_o) x
Step 2: Use the result of Step 1 to get
α̂ = [ H^T(ω̂_o) H(ω̂_o) ]^{-1} H^T(ω̂_o) x
Step 3: Convert the Step 2 result by solving
α̂_1 = Â cos(φ̂),   α̂_2 = −Â sin(φ̂)
for Â and φ̂.
11
Sinusoid Parms. Approx. MLE Procedure
First we look at the specific structure:
x^T H (H^T H)^{-1} H^T x = [ c^T x   s^T x ] [ c^T c   c^T s ;  s^T c   s^T s ]^{-1} [ c^T x ;  s^T x ]
Then if ω_o is not near 0 or π, approximately
H^T H ≈ (N/2) I
and Step 1 becomes
ω̂_o = arg max_{0<ω_o<π} (2/N) | Σ_{n=0}^{N−1} x[n] e^{−jω_o n} |² = arg max (2/N) |X(ω_o)|²      (the DTFT of the data x[n]!)
and Steps 2 & 3 become
Â = (2/N) |X(ω̂_o)|,      φ̂ = ∠X(ω̂_o)
12
The processing is implemented as follows. Given the data x[n], n = 0, 1, …, N−1:
1. Compute the DFT X[m], m = 0, 1, …, M−1, of the data
   (zero-pad to length M = 4N to ensure a dense grid of frequency points; use the FFT algorithm for computational efficiency)
2. Find the location of the peak → ω̂_o  (use quadratic interpolation of |X[m]|)
3. Find the height at the peak → Â  (use quadratic interpolation of |X[m]|)
4. Find the angle at the peak → φ̂  (use linear interpolation of ∠X[m])
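A minimal MATLAB sketch of this FFT-peak approximate MLE (without the interpolation refinements); the data parameters are illustrative assumptions:

% Sketch (MATLAB): approximate MLE of (A, omega_o, phi) via the zero-padded FFT peak.
N = 256; A = 1; wo = 2*pi*0.123; phi = 0.4; sigma = 1;
n = (0:N-1).';
x = A*cos(wo*n + phi) + sigma*randn(N,1);
M = 4*N;                                  % zero-pad for a dense frequency grid
X = fft(x, M);
k = 1:floor(M/2);                         % search 0 < omega < pi only (skip DC)
[~, idx] = max(abs(X(k+1)));
wo_hat  = 2*pi*k(idx)/M;
A_hat   = (2/N)*abs(X(k(idx)+1));
phi_hat = angle(X(k(idx)+1));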
13
Ex. 3 Bearing Estimation MLE
[Figure 3.8 from textbook: simple model; the target emits or reflects the signal s(t) = A cos(2π f_o t + φ).]
Grab one snapshot of all M sensors at a single instant t_s:
x[n] = s_n(t_s) + w[n] = A cos( ω_s n + φ̃ ) + w[n]
Same as sinusoidal estimation!! So… compute the DFT and find the location of the peak!!
If the emitted signal is not a sinusoid then you get a different MLE!!
1
MLE for
TDOA/FDOA Location
Overview
Estimating TDOA/FDOA
Estimating Geo-Location
2
MULTIPLE-PLATFORM LOCATION
[Figure: an emitter to be located transmits s(t); each of several moving platforms intercepts a delayed, Doppler-shifted copy s(t − t_i) e^{jω_i t} and shares its data over data links.]
3
TDOA/FDOA LOCATION
[Figure: the same multi-platform geometry. TDOA (Time-Difference-Of-Arrival) measurements, e.g. Δt_21 = t_2 − t_1 = constant and Δt_23 = t_2 − t_3 = constant, define hyperbolic location curves; FDOA (Frequency-Difference-Of-Arrival) measurements, e.g. Δω_21 = ω_2 − ω_1 = constant and Δω_23 = ω_2 − ω_3 = constant, define iso-Doppler curves. The emitter lies at their intersection.]
4
Estimating TDOA/FDOA
5
SIGNAL MODEL
!Will Process Equivalent Lowpass signal, BW = B Hz
Representing RF signal with RF BW = B Hz
!Sampled at Fs > B complex samples/sec
!Collection Time T sec
!At each receiver:
BPF
ADC
Make
LPE
Signal
Equalize
cos(
1
t)
f
X
RF
(f)
X(f)
f
f
X
f
LPE
(f)
B/2
-B/2
6
[Figure: Tx → Rx with range R(t); received signal s_r(t) = s(t − τ(t)).]
Propagation time: τ(t) = R(t)/c, with R(t) = R_o + vt + (a/2)t² + …
Use the linear approximation (assumes a small change in velocity over the observation interval):
s_r(t) = s( t − [R_o + vt]/c ) = s( [1 − v/c] t − R_o/c )      (time scaling plus a time delay τ_d)
For real bandpass signals.
DOPPLER & DELAY MODEL
7
Analytic Signals Model
Analytic signal of the Tx:
s̃(t) = E(t) e^{ j[ω_c t + φ(t)] }
Analytic signal of the Rx:
s̃_r(t) = s̃( [1 − v/c] t − τ_d ) = E( [1 − v/c] t − τ_d ) e^{ j{ ω_c([1 − v/c] t − τ_d) + φ([1 − v/c] t − τ_d) } }
Now what? Notice that v << c ⟹ (1 − v/c) ≈ 1. Say v = 300 m/s (670 mph); then v/c = 300/3×10⁸ = 10⁻⁶, so (1 − v/c) = 0.999999.
Now assume E(t) & φ(t) vary slowly enough that, for the range of v of interest,
E( [1 − v/c] t ) ≈ E(t)   and   φ( [1 − v/c] t ) ≈ φ(t)
This is called the narrowband approximation.
8
s̃_r(t) ≈ E(t − τ_d) e^{ jφ(t − τ_d) } e^{ jω_c t } e^{ −jω_d t } e^{ jθ }
carrier term; Doppler shift term with ω_d = ω_c v/c; constant phase term θ = −ω_c τ_d; and the transmitted LPE signal time-shifted by τ_d.
Narrowband lowpass-equivalent signal model:
s̃_r(t) = e^{jθ} e^{−jω_d t} s̃(t − τ_d)
This is the signal that actually gets processed digitally.
DOPPLER & DELAY MODEL (continued)
9
CRLB for TDOA
We already showed the CRLB for the active sensor case. But here we need to estimate the delay between two noisy signals rather than between a noisy one and a clean one. The only difference in the result is: replace SNR by an effective SNR given by
1/SNR_eff = 1/SNR_1 + 1/SNR_2 + 1/(SNR_1 · SNR_2),      so SNR_eff ≤ min{SNR_1, SNR_2}
The bound on the TDOA is then
C(TDOA) = 1 / ( 8π² · N · SNR_eff · B_rms² )
where B_rms is an effective bandwidth of the signal computed from the DFT values S[k]:
B_rms² = (1/N²) Σ_{k=−N/2}^{N/2−1} k² |S[k]|²  /  Σ_{k=−N/2}^{N/2−1} |S[k]|²
10
CRLB for TDOA (cont.)
A more familiar form for this is in terms of the C-T version of the problem:
σ_TDOA ≈ 1 / ( B_rms √( BT · SNR_eff ) )    seconds,      B_rms² = ∫ (2πf)² |S(f)|² df / ∫ |S(f)|² df
BT = time-bandwidth product (≈ N, the number of samples in DT)
B = noise bandwidth of the receiver (Hz);  T = collection time (sec)
BT is called "coherent processing gain" (the same effect as the DFT processing gain on a sinusoid).
For a signal with a rectangular spectrum of RF width B_s, the bound becomes:
σ_TDOA ≈ 0.55 / ( B_s √( BT · SNR_eff ) )
S. Stein, "Algorithms for Ambiguity Function Processing," IEEE Trans. on ASSP, June 1981.
11
CRLB for FDOA
Here we take advantage of the time-frequency duality of the FT:
C(FDOA) = 1 / ( 8π² · N · SNR_eff · T_rms² )
where T_rms is an effective duration of the signal computed from the signal samples s[n]:
T_rms² = (1/N²) Σ_{n=−N/2}^{N/2−1} n² |s[n]|²  /  Σ_{n=−N/2}^{N/2−1} |s[n]|²
Again we use the same effective SNR:
1/SNR_eff = 1/SNR_1 + 1/SNR_2 + 1/(SNR_1 · SNR_2),      SNR_eff ≤ min{SNR_1, SNR_2}
12
CRLB for FDOA (cont.)
A more familiar form for this is in terms of the C-T version of the problem:
σ_FDOA ≈ 1 / ( T_rms √( BT · SNR_eff ) )    Hz,      T_rms² = ∫ (2πt)² |s(t)|² dt / ∫ |s(t)|² dt
For a signal with constant envelope of duration T_s, the bound becomes:
σ_FDOA ≈ 0.55 / ( T_s √( BT · SNR_eff ) )
S. Stein, "Algorithms for Ambiguity Function Processing," IEEE Trans. on ASSP, June 1981.
13
Interpreting CRLBs for TDOA/FDOA
σ_TDOA ≈ 1 / ( B_rms √( BT · SNR_eff ) )          σ_FDOA ≈ 1 / ( T_rms √( BT · SNR_eff ) )
BT "pulls the signal up out of the noise"; a large B_rms improves TDOA accuracy; a large T_rms improves FDOA accuracy.
Two examples of accuracy bounds:
SNR_1   SNR_2   T = T_s   B = B_s   σ_TDOA    σ_FDOA
3 dB    30 dB   1 ms      1 MHz     17.4 ns   17.4 Hz
3 dB    30 dB   100 ms    10 kHz    1.7 µs    0.17 Hz
14
MLE for TDOA/FDOA
We already showed that the ML estimate of delay for the active sensor case is the cross-correlation of the time signals: find the peak of |C(τ)|, with
C(τ) = ∫_0^T s_1(t) s_2*(t + τ) dt
By the time-frequency duality, the ML estimate for the Doppler shift should be the cross-correlation of the FTs, which is mathematically equivalent to finding the peak of |C(ω)|, with
C(ω) = ∫_0^T s_1(t) s_2*(t) e^{−jωt} dt
The ML estimate of the joint TDOA/FDOA has been shown to be: find the peak of |A(τ, ω)|, where
A(τ, ω) = ∫_0^T s_1(t) s_2*(t + τ) e^{−jωt} dt
S. Stein, "Differential Delay/Doppler ML Estimation with Unknown Signals," IEEE Trans. on SP, August 1993.
15
Ambiguity Function
[Figure: the LPE received signals s_1(t) and s_2(t) from the two receivers, with s_2(t) ≈ e^{jθ} e^{jω_d t} s_1(t − τ_d), are compared for all delays τ and Dopplers ω; the resulting surface |A(τ, ω)| peaks at (τ_d, ω_d).]
ML Estimator for TDOA/FDOA (cont.): Find the peak of |A(τ, ω)|.
Called the Ambiguity Function, the Complex Ambiguity Function (CAF), or the Cross-Correlation Surface.
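A minimal MATLAB sketch of a brute-force CAF evaluated on a grid of delays and Dopplers (not the efficient decimation-based method mentioned later); the signals, grids, and true delay/Doppler are illustrative assumptions:

% Sketch (MATLAB): brute-force complex ambiguity function over a (tau, f) grid.
Fs = 1e4; N = 1000; t = (0:N-1).'/Fs;
s1 = exp(1j*2*pi*50*t) .* (1 + 0.3*cos(2*pi*7*t));      % made-up LPE signal at Rx1
tau_d = 20; f_d = 12;                                   % true delay (samples) and Doppler (Hz)
s2 = [zeros(tau_d,1); s1(1:N-tau_d)] .* exp(1j*2*pi*f_d*t) ...
     + 0.05*(randn(N,1) + 1j*randn(N,1));               % Rx2: delayed, shifted, noisy copy
tauGrid = 0:40; fGrid = -25:0.5:25;
A = zeros(numel(tauGrid), numel(fGrid));
for it = 1:numel(tauGrid)
    s1shift = [zeros(tauGrid(it),1); s1(1:N-tauGrid(it))];
    for jf = 1:numel(fGrid)
        A(it,jf) = abs( sum( conj(s1shift) .* s2 .* exp(-1j*2*pi*fGrid(jf)*t) ) );
    end
end
[~, k] = max(A(:)); [it, jf] = ind2sub(size(A), k);
tau_hat = tauGrid(it); f_hat = fGrid(jf);               % peak of |A| gives the estimates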
16
ML Estimator for TDOA/FDOA (cont.)
How well do we expect the cross-correlation processing to perform? Well, it is the ML estimator, so it is not necessarily optimum. But we know that an ML estimate is asymptotically:
Unbiased & Efficient (that means it achieves the CRLB)
Gaussian:  θ̂_ML ~ᵃ N( θ, I^{-1}(θ) )
Those are some VERY nice properties that we can make use of in our location accuracy analysis!!!
17
Properties of the CAF
Consider the cut at τ = τ_d:
A(τ_d, ω) = ∫_0^T |s(t)|² e^{j(ω_d − ω)t} dt      (like a windowed FT of a sinusoid, where the window is |s(t)|²)
⟹ |A(τ_d, ω)| has width ≈ 1/T in frequency.
Consider the cut at ω = ω_d: A(τ, ω_d) is a correlation of the signal with a delayed version of itself
⟹ |A(τ, ω_d)| has width ≈ 1/BW in delay.
18
TDOA ACCURACY REVISITED
TDOA accuracy depends on:
Effective SNR: SNR_eff
RMS width: B_rms = RMS bandwidth,  B_rms² = ∫ (2πf)² |S(f)|² df / ∫ |S(f)|² df
The cross-correlation function has width ~1/B_rms:
Narrow-B_rms case: wide correlation peak ⟹ poor accuracy
Wide-B_rms case: narrow correlation peak ⟹ good accuracy, and less susceptible to the spurious peaks that a low effective SNR causes on the cross-correlation function.
19
FDOA ACCURACY REVISITED
FDOA accuracy depends on:
Effective SNR: SNR_eff
RMS width: D_rms = RMS duration,  D_rms² = ∫ (2πt)² |s(t)|² dt / ∫ |s(t)|² dt
The frequency-domain cross-correlation function has width ~1/D_rms:
Narrow-D_rms case: wide peak ⟹ poor accuracy
Wide-D_rms case: narrow peak ⟹ good accuracy, and less susceptible to the spurious peaks that a low effective SNR causes.
20
COMPUTING THE AMBIGUITY FUNCTION
Direct computation based on the equation for the ambiguity
function leads to computationally inefficient methods.
In EECE 521 notes we showed how to use decimation to
efficiently compute the ambiguity function
21
Estimating Geo-Location
22
TDOA/FDOA LOCATION
[Figure: centralized network of P platforms connected by data links.]
Centralized Network of P platforms: P-choose-2 pairs ⟹ P-choose-2 TDOA measurements and P-choose-2 FDOA measurements.
Warning: Watch out for the correlation effect due to signal data in common.
TDOA/FDOALOCATION
Pair-Wise Network of P
P/2 Pairs
# P/2 TDOA Measurements
# P/2 FDOA Measurements
Many ways to select P/2 pairs
Warning: Not all pairings are equally
good!!! The Dashed Pairs are Better
24
TDOA/FDOA Measurement Model
Given N TDOA/FDOA measurement pairs with corresponding 2×2 covariance matrices:
(τ̂_1, ω̂_1), (τ̂_2, ω̂_2), …, (τ̂_N, ω̂_N)      with      C_1, C_2, …, C_N
(Assume a pair-wise network, so the TDOA/FDOA pairs are uncorrelated.)
For notational purposes define the 2N measurements r(n), n = 1, 2, …, 2N:
r(2n−1) = τ̂_n,   r(2n) = ω̂_n,   n = 1, 2, …, N      ⟹ data vector  r = [ r_1 r_2 … r_{2N} ]^T
Now, those are the TDOA/FDOA estimates, so the true values are notated as (τ_1, ω_1), …, (τ_N, ω_N):
s(2n−1) = τ_n,   s(2n) = ω_n,   n = 1, 2, …, N      ⟹ signal vector  s = [ s_1 s_2 … s_{2N} ]^T
TDOA/FDOAMeasurement Model (cont.)
Each of these measurements r(n) has an error (n) associated with it, so
(
(
(
(

= =
N
N
C 0 0
0 0
0 0 C
C C C C #
1
2 1
} , , , diag{
s r + =
Because these measurements were estimated using an ML estimator (with
sufficiently large number of signal samples) we know that error vector is a
zero-mean Gaussian vector with cov. matrix C given by:
The true TDOA/FDOA values depend on:
Emitter Parms: (x
e
, y
e
, z
e
) and transmit frequency f
e
x
e
= [ x
e
y
e
z
e
f
e
]
T
Receivers Nav Data (positions & velocities): The totality of it called x
r
x x s r + = ) ; (
r e
Deterministic Signal + Gaussian Noise
Signal is nonlinearly related to parms
Assumes that
TDOA/FDOA
pairs are
uncorrelated!!!
To complete the model we need to know how s(x
e
;x
r
) depends on x
e
and x
r.
Thus we need to find TDOA & FDOA as functions of x
e
and x
r
26
TDOA/FDOA Measurement Model (cont.)
Here we'll simplify to the x-y plane; the extension is straight-forward.
Two receivers with (x_1, y_1, Vx_1, Vy_1) and (x_2, y_2, Vx_2, Vy_2); emitter at (x_e, y_e). (Let R_i be the range between receiver i and the emitter; c is the speed of light.) The TDOA and FDOA are given by:
s_1(x_e, y_e) = τ_12 = (R_1 − R_2)/c
             = (1/c) [ √( (x_1 − x_e)² + (y_1 − y_e)² ) − √( (x_2 − x_e)² + (y_2 − y_e)² ) ]
s_2(x_e, y_e, f_e) = ω_12 = (f_e/c) d(R_1 − R_2)/dt
             = (f_e/c) [ ( Vx_1(x_1 − x_e) + Vy_1(y_1 − y_e) ) / √( (x_1 − x_e)² + (y_1 − y_e)² )  −  ( Vx_2(x_2 − x_e) + Vy_2(y_2 − y_e) ) / √( (x_2 − x_e)² + (y_2 − y_e)² ) ]
27
CRLB for Geo-Location via TDOA/FDOA
Recall: for the general Gaussian data case the CRLB depends on a FIM with structure
[J(x)]_nm = [ ∂μ(x)/∂x_n ]^T C^{-1}(x) [ ∂μ(x)/∂x_m ]  +  (1/2) tr[ C^{-1}(x) (∂C(x)/∂x_n) C^{-1}(x) (∂C(x)/∂x_m) ]
(variability of the mean w.r.t. the parms + variability of the cov w.r.t. the parms).
Here we have a deterministic signal plus Gaussian noise, so we only have the 1st term. Using the notation introduced here gives
C_CRLB(x_e) = [ ( ∂s(x_e)/∂x_e )^T C^{-1} ( ∂s(x_e)/∂x_e ) ]^{-1} = [ H^T C^{-1} H ]^{-1}      ($)
where ∂s(x_e)/∂x_e is called the Jacobian; for the 3-D location with TDOA/FDOA it will be a 2N×4 matrix whose columns are the derivatives of s w.r.t. each of the 4 parameters.
28
TDOA/FDOA Jacobian:
$$\mathbf{H} = \frac{\partial \mathbf{s}(\mathbf{x}_e)}{\partial \mathbf{x}_e} = \begin{bmatrix} \dfrac{\partial s_1(\mathbf{x}_e)}{\partial x_e} & \dfrac{\partial s_1(\mathbf{x}_e)}{\partial y_e} & \dfrac{\partial s_1(\mathbf{x}_e)}{\partial z_e} & \dfrac{\partial s_1(\mathbf{x}_e)}{\partial f_e}\\[2mm] \dfrac{\partial s_2(\mathbf{x}_e)}{\partial x_e} & \dfrac{\partial s_2(\mathbf{x}_e)}{\partial y_e} & \dfrac{\partial s_2(\mathbf{x}_e)}{\partial z_e} & \dfrac{\partial s_2(\mathbf{x}_e)}{\partial f_e}\\ \vdots & \vdots & \vdots & \vdots\\ \dfrac{\partial s_{2N}(\mathbf{x}_e)}{\partial x_e} & \dfrac{\partial s_{2N}(\mathbf{x}_e)}{\partial y_e} & \dfrac{\partial s_{2N}(\mathbf{x}_e)}{\partial z_e} & \dfrac{\partial s_{2N}(\mathbf{x}_e)}{\partial f_e}\end{bmatrix} \qquad (2N\times 4)$$
CRLB for Geo-Loc. via TDOA/FDOA (cont.)


Jacobian can be computed for any desired Rx-Emitter Scenario
Then plug it into ($) to compute the CRLB for that scenario:
$$\mathbf{C}_{CRLB}(\mathbf{x}_e) = \left[\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right]^{-1}$$
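To make ($) concrete, the following sketch (Python, with a hypothetical x-y plane scenario and a finite-difference Jacobian in place of the analytic derivatives) stacks TDOA/FDOA measurements for two receiver pairs, forms H numerically, and evaluates C_CRLB = (H^T C^-1 H)^-1. The 20 ns TDOA and 1 Hz FDOA accuracies are assumed values for illustration only.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def meas(p, rx_pairs):
    """Stacked [TDOA, FDOA] measurements for parameters p = [x_e, y_e, f_e]."""
    xe, ye, fe = p
    out = []
    for rx1, rx2 in rx_pairs:
        d1 = np.array([rx1[0] - xe, rx1[1] - ye]); R1 = np.linalg.norm(d1)
        d2 = np.array([rx2[0] - xe, rx2[1] - ye]); R2 = np.linalg.norm(d2)
        out.append((R1 - R2) / C)                          # TDOA
        Rdot1 = (rx1[2] * d1[0] + rx1[3] * d1[1]) / R1     # range rates
        Rdot2 = (rx2[2] * d2[0] + rx2[3] * d2[1]) / R2
        out.append((fe / C) * (Rdot1 - Rdot2))             # FDOA
    return np.array(out)

def jacobian(p, rx_pairs):
    """Central finite-difference Jacobian of meas() w.r.t. p (2*N_pairs x 3)."""
    steps = 1e-4 * np.maximum(np.abs(p), 1.0)
    cols = []
    for k, dk in enumerate(steps):
        dp = np.zeros(len(p)); dp[k] = dk
        cols.append((meas(p + dp, rx_pairs) - meas(p - dp, rx_pairs)) / (2 * dk))
    return np.column_stack(cols)

# Hypothetical scenario: two receiver pairs, emitter at (20 km, 40 km), 1 GHz
rx_pairs = [((0.0, 0.0, 200.0, 0.0),    (50e3, 0.0, 0.0, 200.0)),
            ((0.0, 60e3, 150.0, 150.0), (50e3, 60e3, -200.0, 0.0))]
p_true = np.array([20e3, 40e3, 1e9])

H = jacobian(p_true, rx_pairs)
Cmeas = np.diag([(20e-9) ** 2, 1.0 ** 2] * len(rx_pairs))  # assumed accuracies
C_crlb = np.linalg.inv(H.T @ np.linalg.inv(Cmeas) @ H)
print(np.sqrt(np.diag(C_crlb)))   # lower bounds on std devs of x_e, y_e, f_e
```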
29
The Location CRLB can be used to study various aspects of the emitter location
problem. It can be used to study the effect of
Rx-Emitter Geometry and/or Platform Velocity
TDOA accuracy vs. FDOA accuracy
Number of Platforms
Platform Pairings
Etc., Etc., Etc
Once you have computed the CRLB Covariance C
CRLB
you can use it to
compute and plot error ellipsoids.
CRLB Studies
Faster than doing Monte
Carlo Simulation Runs!!!
(Figure: two moving sensors, each with velocity V, and an emitter; the iso-TDOA contour through the emitter is labeled k_1 and the iso-FDOA contour is labeled k_2.)
Assumes Geo-Location Error
is Gaussian
Usually reasonably valid
30
Projections of 3-D Ellipsoids onto 2-D Space
2-D Projections Show
Expected Variation
in 2-D Space
(Figure: a 3-D error ellipsoid in x-y-z space with its 2-D projections.)
Error Ellipsoids: If our C_CRLB is 4×4, how do we get 2-D ellipses to plot???
Projections!
CRLB Studies (cont.)
31
CRLB Studies (cont.)
Another useful thing that can be computed from the CRLB is the CEP for the location problem. For the 2-D case:
Circular Error Probable = radius of a circle that, when centered at the estimate's mean, contains 50% of the estimates
$$\mathrm{CEP} \approx 0.75\sqrt{\sigma_1^2+\sigma_2^2} = 0.75\sqrt{\lambda_1+\lambda_2} \qquad(\text{within }10\%)$$
where σ_1², σ_2² are the diagonal elements of the 2-D Cov. Matrix and λ_1, λ_2 are its eigenvalues.
(Figure: CEP contours of 10, 20, 40, 80 plotted over cross-range and down-range.)
CEP Contour Plots are
Good Ways to Assess
Location Performance
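A quick numerical check of the CEP approximation above (the covariance values are hypothetical): compare 0.75*sqrt(lambda1 + lambda2) against a Monte Carlo estimate of the 50% containment radius.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D location-error covariance (meters^2)
C2 = np.array([[900.0, 300.0],
               [300.0, 400.0]])

lam = np.linalg.eigvalsh(C2)             # eigenvalues of the 2-D covariance
cep_approx = 0.75 * np.sqrt(lam.sum())   # CEP ~= 0.75*sqrt(lambda1 + lambda2)

# Monte Carlo check: radius containing 50% of zero-mean Gaussian errors
err = rng.multivariate_normal([0.0, 0.0], C2, size=200_000)
cep_mc = np.median(np.linalg.norm(err, axis=1))

print(f"CEP approx = {cep_approx:.1f} m,  Monte Carlo = {cep_mc:.1f} m")
```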
32
(Figure: error ellipses for three sensor-target geometries, showing regions where TDOA-only accuracy dominates, where FDOA-only accuracy dominates, and where both TDOA and FDOA are important.)
CRLB Studies (cont.)
Geometry and TDOA vs. FDOA Trade-Offs
33
Estimator for Geo-Location via TDOA/FDOA
Because we have used the ML estimator to get the TDOA/FDOA
estimates the MLs asymptotic properties tell us that we have
Gaussian TDOA/FDOA measurements
Because the TDOA/FDOA measurement model is nonlinear it is
unlikely that we can find a truly optimal estimate so we again
resort to the ML. For the ML of a nonlinear signal in Gaussian noise we
generally have to proceed numerically.
One way to do Numerical MLE is ML Newton-Raphson (need vector
version):
$$\hat{\mathbf{x}}_{k+1} = \hat{\mathbf{x}}_k - \left[\frac{\partial^2 \ln p(\mathbf{r};\mathbf{x})}{\partial \mathbf{x}\,\partial \mathbf{x}^T}\right]^{-1}\left.\frac{\partial \ln p(\mathbf{r};\mathbf{x})}{\partial \mathbf{x}}\right|_{\mathbf{x}=\hat{\mathbf{x}}_k}$$
(Gradient: p×1 vector; Hessian: p×p matrix)
However, the Hessian requires a second derivative. This can add complexity in practice. Alternative:
Gauss-Newton Nonlinear Least Squares, based on linearizing the model.
1
Chapter 8
Least-Squares Estimation
2
8.3 The Least-Squares (LS) Approach
All the previous methods weve studied required a
probabilistic model for the data: Needed the PDF p(x;)
For a Signal + Noise problem we needed:
Signal Model & Noise Model
Least-Squares is not statistically based!!!
Do NOT need a PDF Model
Do NEED a Deterministic Signal Model
x[n] = s_true[n;θ] + w[n] = s[n;θ] + e[n]
where the model & measurement error e[n] = Δ[n] + w[n] combines the model error Δ[n] = s_true[n;θ] − s[n;θ] with the measurement noise w[n]. (Similar to Fig. 8.1(a).)
3
Least-Squares Criterion
(Diagram: the signal model s[n;θ̂] is subtracted from the data x[n] to form the residual ε[n]; choose the estimate to make this residual small.)
Minimize the LS Cost:
$$J(\theta) = \sum_{n=0}^{N-1}\varepsilon^2[n] = \sum_{n=0}^{N-1}\big(x[n]-s[n;\theta]\big)^2$$
Ex. 8.1: Estimate DC Level
x[n] = A + e[n] = s[n;θ] + e[n]
To minimize
$$J(A) = \sum_{n=0}^{N-1}\big(x[n]-A\big)^2$$
set ∂J(A)/∂A = 0, which gives
$$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1}x[n] = \bar{x}$$
Same thing we've gotten before!
Note: If e[n] is WGN, then LS = MVU
4
Weighted LS Criterion
Sometimes not all data samples are equally good:
x[0], x[1], , x[N-1]
Say you know x[10] was poor in quality compared to other data
You'd want to de-emphasize its importance in the sum of squares:
$$J(\theta) = \sum_{n=0}^{N-1} w_n\big(x[n]-s[n;\theta]\big)^2$$
(set w_n small to de-emphasize a sample)
5
8.4 Linear Least-Squares
A linear least-squares problem is one where the parameter
observation model is linear: s = Hθ, so x = Hθ + e
where θ is p×1, x is N×1, H is a known N×p matrix, and p is the order of the model.
We must assume that H is full rank, otherwise there are multiple parameter vectors that will map to the same s!!!
Note: Linear LS does NOT mean fitting a line to data, although that is a special case:
$$s[n] = A + Bn,\quad n = 0,\ldots,N-1 \;\Longrightarrow\; \mathbf{s} = \underbrace{\begin{bmatrix}1 & 0\\ 1 & 1\\ \vdots & \vdots\\ 1 & N-1\end{bmatrix}}_{\mathbf{H}}\underbrace{\begin{bmatrix}A\\ B\end{bmatrix}}_{\boldsymbol{\theta}}$$
6
Finding the LSE for the Linear Model
For the linear model the LS cost is:
$$J(\boldsymbol{\theta}) = \sum_{n=0}^{N-1}\big(x[n]-s[n;\boldsymbol{\theta}]\big)^2 = (\mathbf{x}-\mathbf{H}\boldsymbol{\theta})^T(\mathbf{x}-\mathbf{H}\boldsymbol{\theta})$$
Now, to minimize, first expand:
$$J(\boldsymbol{\theta}) = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}\boldsymbol{\theta} - \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{x} + \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{H}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{H}\boldsymbol{\theta}$$
(θ^T H^T x is a scalar, and a scalar equals its transpose, so θ^T H^T x = (θ^T H^T x)^T = x^T Hθ)
Now setting ∂J(θ)/∂θ = 0 gives
$$-2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} = \mathbf{0} \;\Longrightarrow\; \mathbf{H}^T\mathbf{H}\boldsymbol{\theta} = \mathbf{H}^T\mathbf{x}$$
Called the LS Normal Equations.
Because H is full rank we know that H^T H is invertible:
$$\hat{\boldsymbol{\theta}}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x} \qquad\qquad \hat{\mathbf{s}}_{LS} = \mathbf{H}\hat{\boldsymbol{\theta}}_{LS} = \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$$
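A minimal sketch of this solution for the line-fit special case shown earlier (synthetic data, illustrative parameter values). numpy's lstsq solves the same normal equations via a QR/SVD route that is numerically preferable to forming (H^T H)^-1 explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: line A + B*n in noise (A=2, B=0.5 are illustrative values)
N = 50
n = np.arange(N)
x = 2.0 + 0.5 * n + rng.normal(scale=1.0, size=N)

# Observation matrix H for the line-fit special case of linear LS
H = np.column_stack([np.ones(N), n])

# Normal equations:  H^T H theta = H^T x
theta_ne = np.linalg.solve(H.T @ H, H.T @ x)

# Numerically preferable equivalent (QR/SVD based)
theta_lstsq, *_ = np.linalg.lstsq(H, x, rcond=None)

print(theta_ne, theta_lstsq)   # both ~ [2, 0.5]
```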
7
Comparing the Linear LSE to Other Estimates
Model: x = Hθ + e (No Probability Model Needed)   Estimate: θ̂_LS   = (H^T H)^{-1} H^T x
Model: x = Hθ + w (PDF Unknown, White)            Estimate: θ̂_BLUE = (H^T H)^{-1} H^T x
Model: x = Hθ + w (PDF Gaussian, White)           Estimate: θ̂_ML   = (H^T H)^{-1} H^T x
Model: x = Hθ + w (PDF Gaussian, White)           Estimate: θ̂_MVU  = (H^T H)^{-1} H^T x
If you assume Gaussian & apply these BUT you are WRONG... you at least get the LSE!
8
The LS Cost for Linear LS
For the linear LS problem, what is the resulting LS cost for using θ̂_LS = (H^T H)^{-1} H^T x?
$$\begin{aligned} J_{\min} &= \big(\mathbf{x}-\mathbf{H}\hat{\boldsymbol{\theta}}_{LS}\big)^T\big(\mathbf{x}-\mathbf{H}\hat{\boldsymbol{\theta}}_{LS}\big) = \Big[\big(\mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\big)\mathbf{x}\Big]^T\Big[\big(\mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\big)\mathbf{x}\Big] \quad\text{(factor out the x's; properties of transpose)}\\ &= \mathbf{x}^T\big(\mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\big)\mathbf{x} \quad\text{(easily verified: } \mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\text{ is symmetric and idempotent)}\\ &= \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}\end{aligned}$$
Note: if AA = A then A is called idempotent.
$$0 \le J_{\min} \le \|\mathbf{x}\|^2$$
9
Weighted LS for Linear LS
Recall: de-emphasize bad samples' importance in the sum of squares:
$$J(\boldsymbol{\theta}) = \sum_{n=0}^{N-1} w_n\big(x[n]-s[n;\boldsymbol{\theta}]\big)^2$$
For the linear LS case we get (with W a diagonal matrix):
$$J(\boldsymbol{\theta}) = (\mathbf{x}-\mathbf{H}\boldsymbol{\theta})^T\mathbf{W}(\mathbf{x}-\mathbf{H}\boldsymbol{\theta})$$
Minimizing the weighted LS cost gives:
$$\hat{\boldsymbol{\theta}}_{WLS} = (\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{W}\mathbf{x} \qquad J_{\min} = \mathbf{x}^T\big(\mathbf{W}-\mathbf{W}\mathbf{H}(\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{W}\big)\mathbf{x}$$
Note: Even though there is no true LS-based reason, many people use an inverse cov matrix as the weight: W = C_x^{-1}.
This makes WLS look like BLUE!!!!
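Continuing the same line-fit sketch, here is weighted LS with a diagonal W that de-emphasizes a few deliberately corrupted samples (the weights and the corruption are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

N = 50
n = np.arange(N)
x = 2.0 + 0.5 * n + rng.normal(scale=1.0, size=N)
x[[10, 25, 40]] += 30.0                 # a few samples of poor quality

H = np.column_stack([np.ones(N), n])

w = np.ones(N)
w[[10, 25, 40]] = 1e-3                  # de-emphasize the bad samples
W = np.diag(w)

theta_ls  = np.linalg.solve(H.T @ H, H.T @ x)
theta_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ x)

print("LS :", theta_ls)                 # pulled off by the bad samples
print("WLS:", theta_wls)                # close to [2, 0.5]
```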
10
8.5 Geometry of Linear LS
Provides different derivation
Enables new versions of LS
Recall the LS Cost to be minimized:
$$J(\boldsymbol{\theta}) = (\mathbf{x}-\mathbf{H}\boldsymbol{\theta})^T(\mathbf{x}-\mathbf{H}\boldsymbol{\theta}) = \big\|\mathbf{x}-\underbrace{\mathbf{H}\boldsymbol{\theta}}_{\hat{\mathbf{s}}}\big\|^2$$
(the new versions this geometry enables: Order-Recursive LS and Sequential LS)
Thus, LS minimizes the length of the error vector between the data and the signal estimate: ε = x − ŝ.
But for Linear LS we have
$$\hat{\mathbf{s}} = \mathbf{H}\boldsymbol{\theta} = \sum_{i=1}^{p}\theta_i\,\mathbf{h}_i, \qquad \mathbf{H} = [\mathbf{h}_1\ \mathbf{h}_2\ \cdots\ \mathbf{h}_p]\quad (N\times p,\ N>p)$$
so ŝ lies in the subspace Range(H) ⊂ R^N, while x can lie anywhere in R^N.
11
LS Geometry Example: N = 3, p = 2 (notation a bit different from the book)
x = s + e: the noise takes s out of Range(H) and into R^N.
(Figure: the columns h_1 and h_2 of H lie in a plane, the subspace S_2 spanned by the columns of H (S_p in general); the signal estimate ŝ = θ̂_1 h_1 + θ̂_2 h_2 lies in that plane, and the LS error ε = x − ŝ points from ŝ up to x.)
12
LS Orthogonality Principle
The LS error vector must be ⊥ to all columns of H:
$$\mathbf{H}^T\boldsymbol{\varepsilon} = \mathbf{0} \quad\text{or}\quad \mathbf{H}^T\big(\mathbf{x}-\mathbf{H}\hat{\boldsymbol{\theta}}\big) = \mathbf{0}$$
Can use this property to derive the LS estimate:
$$\mathbf{H}^T\mathbf{x} = \mathbf{H}^T\mathbf{H}\hat{\boldsymbol{\theta}} \;\Longrightarrow\; \hat{\boldsymbol{\theta}}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$$
Same answer as before, but no derivatives to worry about!
(H^T H)^{-1} H^T acts like an inverse from R^N back to R^p; it is called the pseudo-inverse of H.
(Figure: x in R^N, its shadow ŝ in Range(H), and the mappings H: R^p → R^N and (H^T H)^{-1}H^T: R^N → R^p.)
13
LS Projection Viewpoint
From the R^3 example earlier we see that ŝ must lie "right below" x:
ŝ = Projection of x onto Range(H)
(Recall: Range(H) = subspace spanned by columns of H)
From our earlier results we have:
$$\hat{\mathbf{s}} = \mathbf{H}\hat{\boldsymbol{\theta}}_{LS} = \underbrace{\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T}_{\mathbf{P}_H}\mathbf{x} \;\Longrightarrow\; \hat{\mathbf{s}} = \mathbf{P}_H\,\mathbf{x}, \qquad \boldsymbol{\varepsilon} = \mathbf{x}-\hat{\mathbf{s}}$$
P_H is the Projection Matrix onto Range(H).
14
Aside on Projections
If something is on the floor its projection onto the floor = itself!
If z ∈ Range(H), then P_H z = z.
Now, for a given x in the full space, P_H x is already in Range(H), so P_H(P_H x) = P_H x.
Thus for any projection matrix P_H we have P_H P_H = P_H, i.e. P_H² = P_H: projection matrices are idempotent.
Note also that the projection onto Range(H) is symmetric: P_H^T = P_H = H(H^T H)^{-1}H^T (easily verified).
15
What Happens w/ Orthonormal Columns of H
Recall the general Linear LS solution: θ̂_LS = (H^T H)^{-1} H^T x, where
$$\mathbf{H}^T\mathbf{H} = \begin{bmatrix}\langle\mathbf{h}_1,\mathbf{h}_1\rangle & \langle\mathbf{h}_1,\mathbf{h}_2\rangle & \cdots & \langle\mathbf{h}_1,\mathbf{h}_p\rangle\\ \langle\mathbf{h}_2,\mathbf{h}_1\rangle & \langle\mathbf{h}_2,\mathbf{h}_2\rangle & \cdots & \langle\mathbf{h}_2,\mathbf{h}_p\rangle\\ \vdots & \vdots & \ddots & \vdots\\ \langle\mathbf{h}_p,\mathbf{h}_1\rangle & \langle\mathbf{h}_p,\mathbf{h}_2\rangle & \cdots & \langle\mathbf{h}_p,\mathbf{h}_p\rangle\end{bmatrix}$$
If the columns of H are orthonormal then ⟨h_i, h_j⟩ = δ_ij, so H^T H = I and
$$\hat{\boldsymbol{\theta}}_{LS} = \mathbf{H}^T\mathbf{x}$$
Easy!! No Inversion Needed!! Recall the Vector Space Ideas with an ON Basis!!
Geometry with Orthonormal Columns of H
Re-write this LS solution as θ̂_i = h_i^T x, the inner product between the i-th column and the data vector x.
Then we have:
$$\hat{\mathbf{s}} = \sum_{i=1}^{p}\hat{\theta}_i\,\mathbf{h}_i = \sum_{i=1}^{p}\big(\mathbf{h}_i^T\mathbf{x}\big)\,\mathbf{h}_i$$
i.e. the sum of the projections of x onto each h_i axis.
(Figure: ŝ = (h_1^T x)h_1 + (h_2^T x)h_2 built from the two 1-D projections of x.)
When the columns of H are ⊥ we can first find the projection onto each 1-D subspace independently, then add these independently derived results. Nice!
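A small sketch verifying the orthonormal-column shortcut: with H^T H = I the LS estimate is just H^T x, and the signal estimate is the sum of the independent 1-D projections.

```python
import numpy as np

rng = np.random.default_rng(3)

N, p = 8, 3
# Build a matrix with orthonormal columns via QR (columns span a random subspace)
H, _ = np.linalg.qr(rng.normal(size=(N, p)))

x = rng.normal(size=N)

theta_easy = H.T @ x                                    # no inversion needed
theta_full = np.linalg.solve(H.T @ H, H.T @ x)          # general formula

s_hat = sum((H[:, i] @ x) * H[:, i] for i in range(p))  # sum of 1-D projections
print(np.allclose(theta_easy, theta_full), np.allclose(s_hat, H @ theta_easy))
```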
1
8.6 Order-Recursive LS
s[n]
n
Want to fit a polynomial to data..,
but which one is the right model?
! Constant ! Quadratic
! Linear ! Cubic, Etc.
Motivate this idea with Curve Fitting
Given data: n = 0, 1, 2, . . ., N-1
s[0], s[1], . . ., s[N-1]
Try each model, look at J
min
which one works best
J
min
(p)
Constant
Line
Quadratic
Cubic
p
3
(# of parameters in model)
1 2 4
2
Choosing the Best Model Order
Q: Should you pick the order p that gives the smallest J
min
??
A: NO!!!!
Fact: J
min
(p) is monotonically non-increasing as order p increases
If you have any N data points
you can perfectly fit a p = N model to them!!!!
(Figure: N data points s[n] vs. n, perfectly interpolated by a high-order polynomial.)
2 points define a line
3 points define a quadratic
4 points define a cubic
…
N points define a polynomial a_{N-1}x^{N-1} + a_{N-2}x^{N-2} + ⋯ + a_1 x + a_0
Warning: Don't Fit the Noise!!
3
Choosing the Order in Practice
Practice: use simplest model that adequately describes the data
Scheme: Only increase order if cost reduction is significant
" Increase to order p+1 only if J
min
(p) J
min
(p=1) >
" Also, in practice you may have some idea of the expected level of error
thus have some idea of expected J
min
use order p such that J
min
(p) Expected J
min
user-set
threshold
Wasteful to independently compute the LS solution for each order
Drives Need for:
Efficient way to compute LS for many models
Q: If we have computed p-order model, can we use it to
recursively compute (p+1)-order model?
A: YES!! Order-Recursive LS
4
Define General Order-Increasing Models
Define: H
p+1
= [ H
p
h
p+1
] h
1
, h
2
, h
3
, . . .
H
1
H
2
Etc.
H
3
Order-Recursive LS with Orthonormal Columns
If all h_i are ⊥ this is EASY!!
$$\hat{\mathbf{s}}_1 = \big(\mathbf{h}_1^T\mathbf{x}\big)\mathbf{h}_1, \qquad \hat{\mathbf{s}}_2 = \hat{\mathbf{s}}_1 + \big(\mathbf{h}_2^T\mathbf{x}\big)\mathbf{h}_2, \qquad \hat{\mathbf{s}}_3 = \hat{\mathbf{s}}_2 + \big(\mathbf{h}_3^T\mathbf{x}\big)\mathbf{h}_3,\ \ldots$$
i.e. each higher-order estimate is the previous estimate plus the projection of x onto the newly added orthonormal column.
5
Order-Recursive Solution for General H
If h
i
are Not Harder, but Possible!
Basic Idea: Given current-order estimate:
map new column of H into an ON version
use it to find new estimate,
then transform to correct for orthogonalization
Quotes here because
this estimate is for the
orthogonalized model
h
1
h
3
h
2
3
~
h
Orthogonalized
version of h
3
S
2
= 2-D space spanned
by h
1
& h
2
= Range(H
2
)
Note: x is not shown here it is
in a higher dimensional space!!
6
Geometrical Development of Order-Recursive LS
The Geometry of Vector Space is indispensable for DSP!
Current-Order = k
H
k
= [h
1
h
2
. . . h
k
] (not necessarily )
See App. 8A
for Algebraic
Development
Yuk!
Geometry is
Easier! " " " " # " " " " $ %
T
k k
T
k k k
H H H H P
1
) (

=
Projector onto S
k
= Range(H
k
)
Recall:
Given next column: h
k+1
Find , which is to S
k
1
~
+ k
h
( )
1 1 1 1
~
+ + + +

= =
k k k k k k
k
h P I h P h h
P
"# "$ %
1 + k k
h P
1 1
~
+

+
=
k k k
h P h
k k
k
k
S s h h

~ ~
1 1

+ +
k
s

h
3
S
k
7
So our approach is now: project x onto
and then add to
1
~
+ k
h
k
s

The projection of x onto is given by


1
~
+ k
h
1
!
2
1
1
1
2
1
1
1 1
1
1
1
1
1
~
~
~
~
~
~
~
~
,

+
+
+
+

+
+
+
+
+
+
(
(
(

= =
= =
k k
scalar
k k
k k
T
k
k
k
T
k k k
k
k
k
k
k
use
h P
h P
h P x
h
h
h x
h P h
h
h
h
h
x s
" " # " " $ %
Divide by
norm to
normalize
1
1 1


+
+ +
+ =
+ =
k k k
k k k
s H
s s s
Now add this to current signal estimate:
8
" "# " "$ %
1 1
1 1
1
2
1
1
1
) (

+ +
+

+ =
(
(
(

+ =
k k
T
k
k
T
k k k
k k
k k
k k
k k
T
k k k
h P h
x P h h P I
H
h P
h P
h P x
H s
Scalar
can move here
and transpose
Write out P
k

scalar define as b for convenience


Now we have:
Write out ||.||
2
and use
that P
k

is idempotent
| |
(
(
(


=
+ =
+

=
+
+

+
+ b
b
b b
k
T
k k
T
k k
k k
k
T
k k
T
k k k k k k
k
1
1
1
1
1
1
) (

) (

1
h H H H
h H
h H H H H h H s
H
"# "$ %
Finally:
Clearly this is
1

+ k

9
(
(
(
(
(

|
|
.
|

\
|

=
+

+
+

+
+

+
1 1
1
1 1
1
1
1
1
) (

k k
T
k
k
T
k
k k
T
k
k
T
k
k
T
k k
T
k k
k
h P h
x P h
h P h
x P h
h H H H

Order-Recursive LS Solution
Drawback: Needs Inversion Each Recursion
See Eq. (8.29) and (8.30) for a way to avoid inversion
Comments:
1. If h
k+1
H
k
simplifies problem as weve seen
(This equation simplifies to our earlier result)
2. Note: P

k
x above is residual of k-order model
= part of x not modeled by k-order model
Update recursion works solely with this
Makes Sense!!!
10
8.7 Sequential LS
In Last Section: In This Section:
Data Stays Fixed Data Length Increases
Model Order Increases Model Order Stays Fixed
You have
received
new data
sample!
Say we have θ̂[N−1] based on {x[0], . . ., x[N−1]}
If we get x[N], can we compute θ̂[N] based on θ̂[N−1] and x[N]?
(w/o solving using the full data set!)
We want θ̂[N] = f(θ̂[N−1], x[N])
Approach Here:
1. Derive for DC-Level case
2. Interpret Results
3. Write Down General Result w/o Proof
11
Sequential LS for DC-Level Case
We know this:
$$\hat{A}[N-1] = \frac{1}{N}\sum_{n=0}^{N-1}x[n] \qquad\text{and this:}\qquad \hat{A}[N] = \frac{1}{N+1}\sum_{n=0}^{N}x[n]$$
Re-write:
$$\hat{A}[N] = \frac{1}{N+1}\left(\sum_{n=0}^{N-1}x[n] + x[N]\right) = \frac{N}{N+1}\hat{A}[N-1] + \frac{1}{N+1}x[N]$$
$$\hat{A}[N] = \underbrace{\hat{A}[N-1]}_{\text{old estimate}} + \frac{1}{N+1}\big(\underbrace{x[N]}_{\text{new data}} - \underbrace{\hat{A}[N-1]}_{\text{prediction of the new data}}\big)$$
(the term in parentheses is the prediction error)
12
Weighted Sequential LS for DC-Level Case
This is an even better illustration. Assumed model:
$$x[n] = A + w[n], \qquad \mathrm{var}\{w[n]\} = \sigma_n^2$$
where w[n] has an unknown PDF but a known time-dependent variance.
Standard WLS gives:
$$\hat{A}[N-1] = \frac{\sum_{n=0}^{N-1}x[n]/\sigma_n^2}{\sum_{n=0}^{N-1}1/\sigma_n^2}$$
With manipulations similar to the above case we get:
$$\hat{A}[N] = \underbrace{\hat{A}[N-1]}_{\text{old estimate}} + K_N\,\underbrace{\big(x[N]-\hat{A}[N-1]\big)}_{\text{prediction error}}, \qquad K_N = \frac{1/\sigma_N^2}{\sum_{n=0}^{N}1/\sigma_n^2}$$
K_N is a Gain term that reflects the goodness of the new data.
13
Exploring The Gain Term
We know that
$$\mathrm{var}\big(\hat{A}[N-1]\big) = \left(\sum_{n=0}^{N-1}\frac{1}{\sigma_n^2}\right)^{-1}$$
and using it in K_N we get that
$$K_N = \frac{\overbrace{\mathrm{var}\big(\hat{A}[N-1]\big)}^{\text{poorness of current estimate}}}{\mathrm{var}\big(\hat{A}[N-1]\big) + \underbrace{\sigma_N^2}_{\text{poorness (variance) of the new data}}}$$
Note: 0 ≤ K_N ≤ 1. The Gain depends on the Relative Goodness Between:
o Current Estimate
o New Data Point
14
Extreme Cases for The Gain Term
$$\hat{A}[N] = \underbrace{\hat{A}[N-1]}_{\text{old estimate}} + K[N]\,\underbrace{\big(x[N]-\hat{A}[N-1]\big)}_{\text{prediction error}}$$
If var(Â[N−1]) << σ_N² (Good Estimate, Bad Data): K[N] ≈ 0, the new data has little use, so make little "correction" based on the new data.
If var(Â[N−1]) >> σ_N² (Bad Estimate, Good Data): K[N] ≈ 1, the new data is very useful, so make a large "correction" based on the new data.
15
General Sequential LS Result
See App. 8C for derivation. At time index n−1 we have:
$$\mathbf{x}_{n-1} = \big[x[0]\ x[1]\ \cdots\ x[n-1]\big]^T = \mathbf{H}_{n-1}\boldsymbol{\theta} + \mathbf{w}_{n-1}, \qquad \mathbf{C}_{n-1} = \mathrm{diag}\{\sigma_0^2,\sigma_1^2,\ldots,\sigma_{n-1}^2\}$$
(diagonal covariance; sequential LS requires this)
θ̂[n−1] = LS Estimate using x_{n−1};  Σ[n−1] = cov(θ̂[n−1]), a quality measure of the estimate.
At time index n we get x[n]:
$$\mathbf{x}_n = \mathbf{H}_n\boldsymbol{\theta} + \mathbf{w}_n, \qquad \mathbf{H}_n = \begin{bmatrix}\mathbf{H}_{n-1}\\ \mathbf{h}_n^T\end{bmatrix}$$
(tack the row h_n^T on the bottom to show how θ maps to x[n])
16
Iterate these Equations. Given θ̂[n−1], Σ[n−1], σ_n², h_n, and x[n]:
$$\text{Update the Estimate:}\quad \hat{\boldsymbol{\theta}}[n] = \hat{\boldsymbol{\theta}}[n-1] + \mathbf{k}_n\big(x[n]-\mathbf{h}_n^T\hat{\boldsymbol{\theta}}[n-1]\big)$$
(h_n^T θ̂[n−1] is the prediction of x[n] using the current parameter estimate)
$$\text{Compute the Gain:}\quad \mathbf{k}_n = \frac{\boldsymbol{\Sigma}[n-1]\,\mathbf{h}_n}{\sigma_n^2 + \mathbf{h}_n^T\boldsymbol{\Sigma}[n-1]\,\mathbf{h}_n}$$
$$\text{Update the Est. Cov.:}\quad \boldsymbol{\Sigma}[n] = \big(\mathbf{I}-\mathbf{k}_n\mathbf{h}_n^T\big)\boldsymbol{\Sigma}[n-1]$$
Initialization (assume p parameters): collect the first p data samples x[0], . . ., x[p−1], use Batch LS to compute θ̂[p−1] and Σ[p−1], then start sequential processing.
The Gain has the same kind of dependence on the Relative Goodness between:
o Current Estimate
o New Data Point
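A sketch of this recursion for the DC-level model (p = 1, h_n = 1), with hypothetical time-varying noise variances; initialized from the first sample, it reproduces the batch weighted-LS answer.

```python
import numpy as np

rng = np.random.default_rng(4)

# DC level A in noise with time-varying, known variances (illustrative values)
A_true, N = 5.0, 200
sigma2 = 0.5 + 2.0 * rng.random(N)                 # known noise variances sigma_n^2
x = A_true + rng.normal(scale=np.sqrt(sigma2))

# Sequential LS (p = 1, h_n = 1): initialize from the first sample
theta = x[0]
Sigma = sigma2[0]                                  # estimate covariance (scalar here)
for n in range(1, N):
    h = 1.0
    k = Sigma * h / (sigma2[n] + h * Sigma * h)    # gain
    theta = theta + k * (x[n] - h * theta)         # update estimate
    Sigma = (1.0 - k * h) * Sigma                  # update estimate covariance

# Batch weighted LS for comparison
theta_batch = np.sum(x / sigma2) / np.sum(1.0 / sigma2)
print(theta, theta_batch)     # agree to floating-point precision
```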
17
Sequential LS Block Diagram
(Block diagram: the new observation x[n] is compared with the predicted observation h_n^T θ̂[n−1]; the prediction error is scaled by the gain k_n, computed from σ_n², h_n, and Σ[n−1], and added to the previous estimate θ̂[n−1], held in a one-sample delay z^{-1}, to produce the updated estimate θ̂[n].)
1
8.8 Constrained LS
Why Constrain? Because sometimes we know (or believe!) certain values are not allowed for θ.
For example: In emitter location you may know that the emitter's range can't exceed the radio horizon.
You may also know that the emitter is on the left side of the aircraft (because you got a strong signal from the left-side antennas and a weak one from the right-side antennas).
Thus, when finding θ̂_LS you want to constrain it to satisfy these conditions.
2
Constrained LS Problem Statement
Say that S_c is the set of allowable θ values (due to constraints). Then we seek θ̂_CLS ∈ S_c such that
$$\hat{\boldsymbol{\theta}}_{CLS} = \arg\min_{\boldsymbol{\theta}\in S_c}\|\mathbf{x}-\mathbf{H}\boldsymbol{\theta}\|^2$$
Types of Constraints (in order of increasing difficulty):
1. Linear Equality: Aθ = b (constrained to a line, plane or hyperplane)
2. Nonlinear Equality: f(θ) = b
3. Linear Inequality: Aθ ≤ b or Aθ ≥ b (constrained to lie above/below a hyperplane)
4. Nonlinear Inequality: f(θ) ≤ b or f(θ) ≥ b
We'll Cover #1. See Books on Optimization for the Other Cases.
3
LS Cost with a Linear Equality Constraint
Using Lagrange Multipliers we need to minimize, w.r.t. θ and λ,
$$J_c(\boldsymbol{\theta}) = (\mathbf{x}-\mathbf{H}\boldsymbol{\theta})^T(\mathbf{x}-\mathbf{H}\boldsymbol{\theta}) + \boldsymbol{\lambda}^T(\mathbf{A}\boldsymbol{\theta}-\mathbf{b})$$
where Aθ = b is the linear equality constraint.
(Figure: contours of (x − Hθ)^T(x − Hθ) in the (θ_1, θ_2) plane; the unconstrained minimum sits at the center of the contours, the 2-D linear equality constraint is a line, and the constrained minimum is the point where that line touches the lowest contour.)
4
Constrained Optimization: Lagrange Multiplier
Constraint: g(x_1,x_2) = C, written as h(x_1,x_2) = g(x_1,x_2) − C = 0.
The constrained max occurs when the gradients are aligned:
$$\nabla f(x_1,x_2) = -\lambda\,\nabla h(x_1,x_2) \;\Longleftrightarrow\; \nabla f(x_1,x_2) + \lambda\,\nabla h(x_1,x_2) = \mathbf{0} \;\Longleftrightarrow\; \nabla\big[f(x_1,x_2) + \lambda\big(g(x_1,x_2)-C\big)\big] = \mathbf{0}$$
Ex. A Linear Constraint: a x_1 + b x_2 − c = 0, i.e. x_2 = −(a/b)x_1 + c/b. Then
$$\nabla h(x_1,x_2) = \begin{bmatrix}\partial h/\partial x_1\\ \partial h/\partial x_2\end{bmatrix} = \begin{bmatrix}a\\ b\end{bmatrix}$$
The grad vector has slope b/a, so it is orthogonal to the constraint line.
(Figure: contours of f(x_1,x_2) with the constraint line; at the constrained maximum the contour is tangent to the line.)
5
LS Solution with a Linear Equality Constraint
Follow the usual steps for the Lagrange Multiplier Solution:
1. Set ∂J_c(θ)/∂θ = 0 and solve for θ̂_CLS as a function of λ:
$$-2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} + \mathbf{A}^T\boldsymbol{\lambda} = \mathbf{0} \;\Longrightarrow\; \hat{\boldsymbol{\theta}}_{CLS}(\boldsymbol{\lambda}) = \underbrace{(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}}_{\hat{\boldsymbol{\theta}}_{uc}\ \text{(unconstrained estimate)}} - \tfrac{1}{2}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\boldsymbol{\lambda}$$
2. Solve for λ to make θ̂_CLS satisfy the constraint Aθ̂_CLS(λ) = b:
$$\frac{\boldsymbol{\lambda}}{2} = \left[\mathbf{A}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\right]^{-1}\big(\mathbf{A}\hat{\boldsymbol{\theta}}_{uc}-\mathbf{b}\big)$$
3. Plug in to get the constrained solution:
$$\hat{\boldsymbol{\theta}}_{CLS} = \hat{\boldsymbol{\theta}}_{uc} - \underbrace{(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\left[\mathbf{A}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\right]^{-1}\underbrace{\big(\mathbf{A}\hat{\boldsymbol{\theta}}_{uc}-\mathbf{b}\big)}_{\text{amount of constraint deviation}}}_{\text{"Correction Term"}}$$
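A sketch of the three-step solution above for an illustrative line-fit with a single hypothetical linear equality constraint (the A and b values are chosen arbitrarily for the example):

```python
import numpy as np

rng = np.random.default_rng(5)

# Linear model x = H theta + e  (line fit, illustrative)
N = 40
n = np.arange(N)
H = np.column_stack([np.ones(N), n])
theta_true = np.array([2.0, 0.5])
x = H @ theta_true + rng.normal(scale=1.0, size=N)

# Linear equality constraint A theta = b  (hypothetical side information)
A = np.array([[1.0, 10.0]])
b = np.array([10.0])

HtH_inv = np.linalg.inv(H.T @ H)
theta_uc = HtH_inv @ H.T @ x                       # unconstrained LS

# Correction term driven by the constraint deviation A theta_uc - b
G = HtH_inv @ A.T @ np.linalg.inv(A @ HtH_inv @ A.T)
theta_cls = theta_uc - G @ (A @ theta_uc - b)

print(theta_uc, theta_cls, A @ theta_cls)          # constraint now met exactly
```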
6
Geometry of Constrained Linear LS
The above result can be interpreted geometrically:
(Figure: the data x, the unconstrained signal estimate ŝ_uc, the constraint line, and the constrained estimate ŝ_c obtained by projecting ŝ_uc onto that line.)
Constrained Estimate of the Signal is the
Projection of the Unconstrained Estimate
onto the Linear Constraint Subspace
7
8.9 Nonlinear LS
Everything weve done up to now has assumed a linear
observation model but weve already seen that many
applications have nonlinear observation models: s(θ) ≠ Hθ
Recall: For linear case closed-form solution
< Not so for nonlinear case!! >
Must use numerical, iterative methods to minimize the LS cost
given by:
$$J(\boldsymbol{\theta}) = [\mathbf{x}-\mathbf{s}(\boldsymbol{\theta})]^T[\mathbf{x}-\mathbf{s}(\boldsymbol{\theta})]$$
But first Two Tricks!!!
8
Two Tricks for Nonlinear LS
Sometimes it is possible to:
1. Transform into a Linear Problem
2. Separate out any Linear Parameters
Trick #1: Seek an invertible function α = g(θ), θ = g^{-1}(α), such that s(θ) = s(g^{-1}(α)) = Hα. This is easily solved for α̂_LS, and then θ̂_LS = g^{-1}(α̂_LS).
Trick #2: See if some of the parameters are linear. Try to decompose θ = [α^T β^T]^T to get s(θ) = H(β)α, which is linear in α and nonlinear in β.
(Sometimes it is possible to do both tricks together.)
9
Example of Linearization Trick
Consider estimation of a sinusoid's amplitude and phase (with a known frequency):
$$s[n] = A\cos(2\pi f_o n + \phi), \qquad \boldsymbol{\theta} = [A\ \ \phi]^T$$
But we can re-write this model as:
$$s[n] = A\cos\phi\,\cos(2\pi f_o n) - A\sin\phi\,\sin(2\pi f_o n) = \alpha_1\cos(2\pi f_o n) + \alpha_2\sin(2\pi f_o n)$$
with α_1 = A cos φ and α_2 = −A sin φ, which is linear in α = [α_1 α_2]^T, so
$$\hat{\boldsymbol{\alpha}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$$
Then map this estimate back using θ̂ = g^{-1}(α̂):
$$\hat{A} = \sqrt{\hat{\alpha}_1^2+\hat{\alpha}_2^2}, \qquad \hat{\phi} = \arctan\!\left(\frac{-\hat{\alpha}_2}{\hat{\alpha}_1}\right)$$
Note that for this example this is merely exploiting polar-to-rectangular ideas!!!
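A sketch of this trick for illustrative amplitude, phase, and frequency values: solve the linear problem in alpha and map back through the polar-to-rectangular relations.

```python
import numpy as np

rng = np.random.default_rng(6)

# Sinusoid with known frequency; estimate amplitude A and phase phi
N, f0 = 100, 0.1                 # f0 in cycles/sample
A_true, phi_true = 1.5, 0.7
n = np.arange(N)
x = A_true * np.cos(2 * np.pi * f0 * n + phi_true) + rng.normal(scale=0.3, size=N)

# Linear-in-alpha model: s[n] = a1*cos(2*pi*f0*n) + a2*sin(2*pi*f0*n)
H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
a1, a2 = np.linalg.lstsq(H, x, rcond=None)[0]

# Map back (polar from rectangular): A = sqrt(a1^2 + a2^2), phi = arctan(-a2/a1)
A_hat = np.hypot(a1, a2)
phi_hat = np.arctan2(-a2, a1)
print(A_hat, phi_hat)            # ~ 1.5, 0.7
```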
10
Example of Separation Trick
Consider a signal model of three exponentials:
$$s[n] = A_1 r^{\,n} + A_2 r^{\,2n} + A_3 r^{\,3n}, \quad 0 < r < 1, \qquad \boldsymbol{\theta} = [\underbrace{A_1\ A_2\ A_3}_{\boldsymbol{\alpha}^T}\ \ r]^T$$
$$\mathbf{H}(r) = \begin{bmatrix}1 & 1 & 1\\ r & r^2 & r^3\\ \vdots & \vdots & \vdots\\ r^{N-1} & r^{2(N-1)} & r^{3(N-1)}\end{bmatrix}$$
Then we can write s(θ) = H(r)α, so for a fixed r:
$$\hat{\boldsymbol{\alpha}}(r) = \left[\mathbf{H}^T(r)\mathbf{H}(r)\right]^{-1}\mathbf{H}^T(r)\,\mathbf{x}$$
Then we need to minimize over r alone:
$$J(r) = \left[\mathbf{x}-\mathbf{H}(r)\hat{\boldsymbol{\alpha}}(r)\right]^T\left[\mathbf{x}-\mathbf{H}(r)\hat{\boldsymbol{\alpha}}(r)\right] = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}(r)\left[\mathbf{H}^T(r)\mathbf{H}(r)\right]^{-1}\mathbf{H}^T(r)\,\mathbf{x}$$
This depends on only one variable, so we might conceivably just compute it on a grid and find the minimum.
11
Iterative Methods for Solving Nonlinear LS
Goal: Find the value θ that minimizes J(θ) = [x − s(θ)]^T [x − s(θ)] without computing it over a p-dimensional grid.
Two most common approaches:
1. Newton-Raphson
a. Analytically find J()/
b. Apply Newton-Raphson to find a zero of J()/
(i.e. linearize J()/ about the current estimate)
c. Iteratively Repeat
2. Gauss-Newton
a. Linearize signal model s() about the current estimate
b. Solve resulting linear problem
c. Iteratively Repeat
Both involve:
Linearization (but they each linearize something different!)
Solve linear problem
Iteratively improve result
12
Newton-Raphson Solution to Nonlinear LS
%
0
) (
=

g
J
To find minimum of J(): set
(
(
(
(
(
(

p
J
J
J

) (
) (
) (
1

&

=
=
1
0
2
]) [ ] [ ( ) (
N
i
i s i x J

for Need to find


%

=
=
=

1
0
?
] [
]) [ ] [ ( 2
) (
N
i
h
j
r
Why
ignore
can j
ij
i
i s
i s i x
J
" # $
! !" ! !# $

Taking these partials gives:


13
p j for h r
Vector
Matrix
N
i
ij i
, , 1 0
1
0

!" !# $
= =

0 r H

= =
T
g ) (
Depend nonlinearly on
Now set to zero:
(
(
(
(
(
(
(

=
(
(
(
(
(
(
(
(
(
(
(
(

=
]) 1 [ ] 1 [ (
]) 0 [ ] 0 [ (
] 1 [ ] 1 [ ] 1 [
] 1 [ ] 1 [ ] 1 [
] 0 [ ] 0 [ ] 0 [
2 1
2 1
2 1
N s N x
s x
N s N s N s
s s s
s s s
p
p
p

r H &
'
& ' & &
'
'



(

=
P
T
i
i s i s i s

] [ ] [ ] [
) (
2 1

'
Define the i
th
row of H

:
h
0 h r H

= = =

=
) ( ] [ ) (
1
0
i
N
n
T
n r g
Then the equation to solve is:
14
For Newton-Raphson we linearize g() around our current
estimate and iterate:
Need this
k
k
T
T
k k k
g
g




r H

r H

1
1

) (
) (

=

+
(
(
(

(
(

=
(
(

=
! ! " ! ! # $
! !" ! !# $

H H

h
h

r H
T
n
N
n
n
N
n
n
N
n
n
N
n
n
T
n r
n r
n r
n r

=
=

1
0
1
0
) (
1
0
1
0
] [
) ( ] [
) ( ) ( ] [
] [ ) (
(
(
(
(
(
(
(
(
(
(
(
(

n s

n s

n s
n s n x n r
] [
] [
] [
]) [ ] [ ( ] [
2
1



&
| | p j i
n s
j i
ij
n
, , 2 , 1 ,
] [
) (
2
=

=


G
( )



H H G

r H
T
N
n
n
T
n s n x =

=
1
0
] [ ] [ ) (
Derivative of Product Rule
15
So the Newton-Raphson method becomes:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \left[\underbrace{\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\mathbf{H}(\hat{\boldsymbol{\theta}}_k)}_{\text{1st partials of the signal w.r.t. parameters}} - \sum_{n=0}^{N-1}\big(x[n]-s[n;\hat{\boldsymbol{\theta}}_k]\big)\underbrace{\mathbf{G}_n(\hat{\boldsymbol{\theta}}_k)}_{\text{2nd partials of }s[n]\text{ w.r.t. parameters}}\right]^{-1}\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\big(\mathbf{x}-\mathbf{s}(\hat{\boldsymbol{\theta}}_k)\big)$$
where [G_n(θ)]_ij = ∂²s[n]/∂θ_i∂θ_j, i, j = 1, 2, …, p.
Note: if the signal is linear in the parameters this collapses to the non-iterative result we found for the linear case!!!
Newton-Raphson LS Iteration Steps:
1. Start with an initial estimate
2. Iterate the above equation until change is small
16
Gauss-Newton Solution to Nonlinear LS
First we linearize the model around our current estimate by using a Taylor series and keeping only the linear terms:
$$\mathbf{s}(\boldsymbol{\theta}) \approx \mathbf{s}(\hat{\boldsymbol{\theta}}_k) + \mathbf{H}(\hat{\boldsymbol{\theta}}_k)\big(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}}_k\big), \qquad \mathbf{H}(\hat{\boldsymbol{\theta}}_k) = \left.\frac{\partial\mathbf{s}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}_k}$$
Then we use this linearized model in the LS cost:
$$J(\boldsymbol{\theta}) = \big[\mathbf{x}-\mathbf{s}(\boldsymbol{\theta})\big]^T\big[\mathbf{x}-\mathbf{s}(\boldsymbol{\theta})\big] \approx \big[\mathbf{x}-\mathbf{s}(\hat{\boldsymbol{\theta}}_k)-\mathbf{H}(\hat{\boldsymbol{\theta}}_k)(\boldsymbol{\theta}-\hat{\boldsymbol{\theta}}_k)\big]^T\big[\,\cdot\,\big] = \big[\mathbf{y}_k-\mathbf{H}(\hat{\boldsymbol{\theta}}_k)\boldsymbol{\theta}\big]^T\big[\mathbf{y}_k-\mathbf{H}(\hat{\boldsymbol{\theta}}_k)\boldsymbol{\theta}\big]$$
where y_k ≜ x − s(θ̂_k) + H(θ̂_k)θ̂_k consists of all known things.
17
$$J(\boldsymbol{\theta}) \approx \big[\mathbf{y}_k-\mathbf{H}(\hat{\boldsymbol{\theta}}_k)\boldsymbol{\theta}\big]^T\big[\mathbf{y}_k-\mathbf{H}(\hat{\boldsymbol{\theta}}_k)\boldsymbol{\theta}\big]$$
This gives a form for the LS cost that looks like a linear problem!! We know the LS solution to that problem is
$$\hat{\boldsymbol{\theta}}_{k+1} = \left[\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\mathbf{H}(\hat{\boldsymbol{\theta}}_k)\right]^{-1}\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\,\mathbf{y}_k = \hat{\boldsymbol{\theta}}_k + \left[\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\mathbf{H}(\hat{\boldsymbol{\theta}}_k)\right]^{-1}\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\big(\mathbf{x}-\mathbf{s}(\hat{\boldsymbol{\theta}}_k)\big)$$
Gauss-Newton LS Iteration:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \left[\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\mathbf{H}(\hat{\boldsymbol{\theta}}_k)\right]^{-1}\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\big(\mathbf{x}-\mathbf{s}(\hat{\boldsymbol{\theta}}_k)\big)$$
Gauss-Newton LS Iteration Steps:
1. Start with an initial estimate
2. Iterate the above equation until change is small
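A generic, undamped Gauss-Newton sketch applied to an illustrative one-exponential model s[n; θ] = θ1·θ2^n (not an example from the notes); the Jacobian is formed by finite differences so the same loop works for any smooth s(θ). Practical implementations often add step damping or a line search to guard against divergence.

```python
import numpy as np

rng = np.random.default_rng(7)

def s_model(theta, n):
    """Illustrative nonlinear signal model: amplitude * r**n."""
    amp, r = theta
    return amp * r ** n

def gauss_newton(x, n, theta0, n_iter=20):
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        s = s_model(theta, n)
        # Central finite-difference Jacobian H(theta_k), shape (N, p)
        H = np.column_stack([
            (s_model(theta + dp, n) - s_model(theta - dp, n)) / (2 * dp[i])
            for i, dp in enumerate(np.eye(len(theta)) * 1e-6)
        ])
        delta = np.linalg.lstsq(H, x - s, rcond=None)[0]   # (H^T H)^-1 H^T (x - s)
        theta = theta + delta
        if np.linalg.norm(delta) < 1e-10:
            break
    return theta

N = 60
n = np.arange(N)
theta_true = [2.0, 0.95]
x = s_model(theta_true, n) + rng.normal(scale=0.05, size=N)
print(gauss_newton(x, n, theta0=[1.5, 0.9]))   # should approach ~[2.0, 0.95]
```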
18
Newton-Raphson vs. Gauss-Newton
How do these two methods compare?
$$\text{G-N:}\quad \hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \left[\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\mathbf{H}(\hat{\boldsymbol{\theta}}_k)\right]^{-1}\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\big(\mathbf{x}-\mathbf{s}(\hat{\boldsymbol{\theta}}_k)\big)$$
$$\text{N-R:}\quad \hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \left[\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\mathbf{H}(\hat{\boldsymbol{\theta}}_k) - \sum_{n=0}^{N-1}\big(x[n]-s[n;\hat{\boldsymbol{\theta}}_k]\big)\mathbf{G}_n(\hat{\boldsymbol{\theta}}_k)\right]^{-1}\mathbf{H}^T(\hat{\boldsymbol{\theta}}_k)\big(\mathbf{x}-\mathbf{s}(\hat{\boldsymbol{\theta}}_k)\big)$$
The term of 2nd partials is missing in the Gauss-Newton equation.
Which is better? Typically I prefer Gauss-Newton:
The G_n matrices are often small enough to be negligible, or the error term is small enough to make the sum term negligible.
Inclusion of the sum term can sometimes destabilize the iteration.
See p. 683 of the Numerical Recipes book.
1
8.10 Signal Processing Examples of LS
Well briefly look at two examples from the book
Book Examples
1. Digital Filter Design
2. AR Parameter Estimation for the ARMA Model
3. Adaptive Noise Cancellation
4. Phase-Locked Loop (used in phase-coherent
demodulation)
The two examples we will cover highlight the flexibility of the LS
viewpoint!!!
Then (in separate note files) well look in detail at two emitter
location examples not in the book
2
Ex. 8.11 Filter Design by Pronys LS Method
The problem:
You have some desired impulse response h
d
[n]
Find a rational TF with impulse response h[n] h
d
[n]
View: h
d
[n] as the observed data!!!
Rational TF models coefficients as the parameters
General LS Problem
signal
model

[n]
+

x[n]
]

; [ n s

Choose the
Estimate
to make this
residual small
LS Filter Design Problem

[n]
+

h
d
[n]
]

; [ b a n h
b a

) (
) (
) (
z A
z B
z H =
[n]
p1
(q+1)1
3
Pronys Modification to Get Linear Model
The previous formulation results in a model that is nonlinear in the
TF coefficient vectors a, b
Pronys idea was to change the model slightly

[n]
+

h
d
[n]
]

; [ b a n h
b a

) (
) (
) (
z A
z B
z H =
[n]
) (z A
) (z A
p1
(q+1)1
This model is only
approximately
equivalent to the
original!!
Solution (see book for details):
$$\hat{\mathbf{a}} = -\left(\mathbf{H}_q^T\mathbf{H}_q\right)^{-1}\mathbf{H}_q^T\,\mathbf{h}_{q,N-1} \qquad\qquad \hat{\mathbf{b}} = \mathbf{h}_{0,q} + \mathbf{H}_0\,\hat{\mathbf{a}}$$
H_q, H_0, h_{q,N-1}, and h_{0,q} all contain elements from h_d[n]; the subscripts indicate the range of these elements.
4
Key Ideas in Prony LS Example
1. Shows power and flexibility of LS approach
- There is no noise here!!! MVU, ML, etc. are not applicable
- But, LS works nicely!
2. Shows a slick trick to convert nonlinear problem to linear one
- Be aware that finding such tricks is an art!!!
3. Results for LS Prony method have links to modeling methods
for Random Processes (i.e. AR, MA, ARMA)
Is this a practical filter design method?
Its not the best: Remez-Based Method is Used Most
5
Ex. 8.13 Adaptive Noise Cancellation
Done a bit
different from
the book

x[n] = d[n] + i[n]   (Desired + Interference)
The reference input ĩ[n] is statistically correlated with the interference i[n] but mostly uncorrelated with the desired d[n].
An adaptive FIR filter forms an estimate of the interference, adapted to best cancel it:
$$\hat{i}[n] = \sum_{l=0}^{p} h_n[l]\,\tilde{i}[n-l]$$
(a time-varying filter: the coefficients h_n[l] change at each sample index)
The output d̂[n] = x[n] − î[n] is the estimate of the desired signal with cancelled interference.
6
Noise Cancellation Typical Applications
1. Fetal Heartbeat Monitoring

Adaptive
FIR Filter
] [ ] [ ] [ n i n d n x + =
] [
~
n i
] [

n i
+

On Mothers
Chest
] [

n d
On Mothers
Stomach
Fetal
Heartbeat
Mothers
Heartbeat
via Stomach
Mothers
Heartbeat
via Chest
Adaptive filter has to mimic the TF of the
chest-to-stomach propagation
7
2. Noise Canceling Headphones
Adaptive
FIR Filter
] [
~
n i
] [n i

Ear
Noise
] [

n i
] [

] [ n i n m
Music
Signal
] [n m
!" !# $
cancel
n i n i n m ] [

] [ ] [ +
8
3. Bistatic Radar System

Adaptive
FIR Filter
] [ ] [ ] [ n d n t n x
t
+ =
] [

n t
] [

n d
t
+

t[n]
d
t
[n]
d[n]
d[n]
Desired Interference
Tx
d[n]
Delay/Doppler
Radar
Processing
9
LS and Adaptive Noise Cancellation
Goal: Adjust the filter coefficients to cancel the interference
There are many signal processing approaches to this problem
We'll look at this from a LS point of view: adjust the filter coefficients to minimize
$$J = \sum_{n}\hat{d}^{\,2}[n], \qquad \hat{d}[n] = x[n]-\hat{i}[n] = d[n] + \big(i[n]-\hat{i}[n]\big)$$
Because i[n] is uncorrelated with d[n], minimizing J is essentially the same as making the term (i[n] − î[n]) zero.
Because the interference likely changes its character with time, we want to adapt!
Use Sequential LS with Fading Memory.
10
Sequential LS with Forgetting Factor
We want to weight recent measurements more heavily than past
measurements that is we want to forget past values.
So we can use weighted LS and if we choose our weighting
factor as an exponential function then it is easy to implement!
$$J[n] = \sum_{k=0}^{n}\lambda^{\,n-k}\big(x[k]-\hat{i}[k]\big)^2 = \sum_{k=0}^{n}\lambda^{\,n-k}\left(x[k]-\sum_{l=0}^{p}h_n[l]\,\tilde{i}[k-l]\right)^2$$
λ is the forgetting factor, with 0 < λ < 1; a small λ quickly down-weights the past errors.
See the book for solution details and Fig. 8.17 for simulation results.
Single Platform Emitter Location
AOA (DF)   FOA   Interferometry (SBI / LBI)   TOA
Emitter Location is Two Estimation Problems in One:
1) Estimate Signal Parameter(s) that Depend on the Emitter's Location:
a) Time-of-Arrival (TOA) of Pulses
b) Phase Interferometry: phase is measured between two different signals received at nearby antennas
SBI: Short Baseline Interferometry (antennas are close enough together that phase is measured without ambiguity)
LBI: Long Baseline Interferometry (antennas are far enough apart that phase is measured with ambiguity; the ambiguity is resolved either using processing or is so-called self-resolved)
c) Frequency-of-Arrival (FOA) or Doppler
d) Angle-of-Arrival (AOA)
2) Use Signal Parameters Measured at Several Instants to Estimate
Location
Frequency-Based Location (i.e. Doppler Location)
The Problem
Emitter assumed non-moving and at position (X,Y,Z)
Transmitting a radar signal whose unknown carrier frequency is f_o
Signal is intercepted by a receiver on a single aircraft
A/C dynamics are considered to be perfectly known as a function of time
Nav Data: Position X
p
(t), Y
p
(t), Z
p
(t) and Velocity V
x
(t), V
y
(t), V
z
(t)
Relative motion between the Tx and Rx causes Doppler shift
Received carrier frequency differs from transmitted carrier frequency
Thus, the carrier frequency of the received signal will change with time
For a given set of nav data, how the frequency changes depends on the transmitter's carrier frequency f_o and the emitter's position (X,Y,Z)
Parameter Vector: x = [X  Y  Z  f_o]^T   (f_o is a nuisance parameter)
Received frequency is a function of time as well as parameter vector x
$$f(t,\mathbf{x}) = f_o\left[1 + \frac{V_x(t)\big(X-X_p(t)\big) + V_y(t)\big(Y-Y_p(t)\big) + V_z(t)\big(Z-Z_p(t)\big)}{c\,\sqrt{\big(X-X_p(t)\big)^2+\big(Y-Y_p(t)\big)^2+\big(Z-Z_p(t)\big)^2}}\right] \qquad (1)$$
Make noisy frequency measurements at t
1
, , t
N
:
Problem: Given noisy frequency measurements and the nav data,
estimate x
What PDF model do we use for our data????
$$\tilde{f}(t_i,\mathbf{x}) = f(t_i,\mathbf{x}) + v(t_i), \qquad i = 1,\ldots,N$$
In the TDOA/FDOA casewe had an ML estimator for TDOA/FDOA so we
could claim that the measurements were asymptotically Gaussian. Because
we then had a well-specified PDF for the TDOA/FDOA we could hope to
use ML for the location processing.
However, here we have no ML estimator for the instantaneous frequency so
claiming that the inst. freq. estimates are Gaussian is a bit of a stretch.
So we could:
1. Outright ASSUME Gaussian and then use ML approach
2. Resort to LSwhich does not even require a PDF viewpoint!
Both paths get us to the exact same place:
Find the estimate that minimizes
$$J(\hat{\mathbf{x}}_e) = \sum_{i=1}^{N}\big[\tilde{f}(t_i,\mathbf{x}_e) - f(t_i,\hat{\mathbf{x}}_e)\big]^2$$
If we Assume Gaussianwe could choose:
Newton-Raphson MLE approach: leads to double derivatives
of the measurement model f (t
i
,x
e
).
If we Resort to LSwe could choose either:
Newton-Raphson approach, which in this case is identical to
N-R under the Gaussian assumption
Gauss-Newton approach, which needs only first derivatives
of the measurement model f (t
i
,x
e
).
Well resort to LS and use Gauss-Newton
(Figure: measured frequency vs. time, compared with the frequency computed using the measured nav data and a poor assumed location vs. a good assumed location.)
LS Approach: Find the estimate x̂ such that the corresponding computed frequency measurements f(t_i, x̂) are close to the actual measurements, i.e. minimize
$$J(\hat{\mathbf{x}}) = \sum_{i=1}^{N}\big[\tilde{f}(t_i,\mathbf{x}) - f(t_i,\hat{\mathbf{x}})\big]^2$$
The Solution
The measurement model in (1) is nonlinear in x, so there is no closed-form solution.
Newton-Raphson: Linearize the derivative of the cost function
Gauss-Newton: Linearize the measurement model
Thus: f̃(x) ≈ f(x̂_n) + H[x − x̂_n] + v   (a linear model)
where
$$\mathbf{H} = \left.\frac{\partial\mathbf{f}(t,\mathbf{x})}{\partial\mathbf{x}}\right|_{\mathbf{x}=\hat{\mathbf{x}}_n} = \big[\mathbf{h}_1\ |\ \mathbf{h}_2\ |\ \mathbf{h}_3\ |\ \mathbf{h}_4\big]$$
Get the LS solution for the update and then update the current estimate:
$$\delta\hat{\mathbf{x}} = \left(\mathbf{H}^T\mathbf{R}^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{R}^{-1}\big(\tilde{\mathbf{f}} - \mathbf{f}(\hat{\mathbf{x}}_n)\big), \qquad \hat{\mathbf{x}}_{n+1} = \hat{\mathbf{x}}_n + \delta\hat{\mathbf{x}}$$
Under the condition that the frequency measurement errors are Gaussian, the CRLB for the problem can be shown to be
$$\mathrm{var}\{\hat{\mathbf{x}}\} \ge \left(\mathbf{H}^T\mathbf{R}^{-1}\mathbf{H}\right)^{-1}$$
Can use this to investigate performance under geometries of interest, even when the measurement errors aren't truly Gaussian.
The Algorithm
Initialization:
Use the average of the measured frequencies as an initial transmitter
frequency estimate.
To get an initial estimate of the emitters X,Y,Z components there are
several possibilities:
Perform a grid search
Use some information from another sensor (e.g., if other on-board
sensors can give a rough angle use that together with a typical
range)
Pick several typical initial locations (e.g., one in each quadrant
with some typical range)
Let the initial estimate be x̂_0 = [X̂_0  Ŷ_0  Ẑ_0  f̂_{o,0}]^T
Iteration:
For n = 0, 1, 2,
1. Compute the vector of predicted frequencies at times {t_1, t_2, …, t_N} using the current n-th estimate and the nav info:
$$f(t_j,\hat{\mathbf{x}}_n) = \hat{f}_{o,n}\left[1 + \frac{V_x(t_j)\big(\hat{X}_n-X_p(t_j)\big) + V_y(t_j)\big(\hat{Y}_n-Y_p(t_j)\big) + V_z(t_j)\big(\hat{Z}_n-Z_p(t_j)\big)}{c\,\sqrt{\big(\hat{X}_n-X_p(t_j)\big)^2+\big(\hat{Y}_n-Y_p(t_j)\big)^2+\big(\hat{Z}_n-Z_p(t_j)\big)^2}}\right]$$
$$\mathbf{f}(\hat{\mathbf{x}}_n) = \big[f(t_1,\hat{\mathbf{x}}_n)\ \ f(t_2,\hat{\mathbf{x}}_n)\ \cdots\ f(t_N,\hat{\mathbf{x}}_n)\big]^T$$
2. Compute the residual vector by subtracting the predicted frequency vector
from the measured frequency vector:
$$\delta\mathbf{f}(\hat{\mathbf{x}}_n) = \tilde{\mathbf{f}}(\mathbf{x}) - \mathbf{f}(\hat{\mathbf{x}}_n)$$
3. Compute Jacobian matrix H using the nav info and the current estimate:
$$\mathbf{H} = \left.\frac{\partial\mathbf{f}(t,\mathbf{x})}{\partial\mathbf{x}}\right|_{\mathbf{x}=\hat{\mathbf{x}}_n} = \big[\mathbf{h}_1\ |\ \mathbf{h}_2\ |\ \mathbf{h}_3\ |\ \mathbf{h}_4\big]$$
Define, for each measurement time t_j:
$$\hat{X}_n(t_j) = \hat{X}_n - X_p(t_j),\quad \hat{Y}_n(t_j) = \hat{Y}_n - Y_p(t_j),\quad \hat{Z}_n(t_j) = \hat{Z}_n - Z_p(t_j),\quad \hat{R}_n(t_j) = \sqrt{\hat{X}_n^2(t_j)+\hat{Y}_n^2(t_j)+\hat{Z}_n^2(t_j)}$$
Then the columns of H have elements (j = 1, …, N):
$$[\mathbf{h}_1]_j = \left.\frac{\partial f(t_j,\mathbf{x})}{\partial X}\right|_{\hat{\mathbf{x}}_n} = \frac{\hat{f}_{o,n}}{c}\left[\frac{V_x(t_j)}{\hat{R}_n(t_j)} - \frac{\hat{X}_n(t_j)\big(V_x(t_j)\hat{X}_n(t_j)+V_y(t_j)\hat{Y}_n(t_j)+V_z(t_j)\hat{Z}_n(t_j)\big)}{\hat{R}_n^3(t_j)}\right]$$
with [h_2]_j and [h_3]_j given by the same expression with (V_x, X̂_n(t_j)) replaced by (V_y, Ŷ_n(t_j)) and (V_z, Ẑ_n(t_j)), respectively, and
$$[\mathbf{h}_4]_j = \left.\frac{\partial f(t_j,\mathbf{x})}{\partial f_o}\right|_{\hat{\mathbf{x}}_n} = \frac{f(t_j,\hat{\mathbf{x}}_n)}{\hat{f}_{o,n}}$$
4. Compute the estimate update:
$$\delta\hat{\mathbf{x}}_n = \left(\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{C}^{-1}\,\delta\mathbf{f}(\hat{\mathbf{x}}_n)$$
C is the covariance of the frequency measurements; it is usually assumed to be diagonal with the measurement variances on the diagonal.
In practice you would implement this inverse using the Singular Value Decomposition (SVD) due to numerical issues of H being nearly singular (MATLAB will give you a warning when this is a problem). See pp. 676-677 of the book Numerical Recipes.
5. Update the estimate using
$$\hat{\mathbf{x}}_{n+1} = \hat{\mathbf{x}}_n + \delta\hat{\mathbf{x}}_n$$
6. Check for convergence of solution: look to see if update is small in some
specified sense.
If Not Converged go to Step 1.
If Converged, or if the Maximum number of iterations is reached, quit the loop & set x̂ = x̂_{n+1}.
7. Compute the Least-Squares Cost of the converged solution:
$$J(\hat{\mathbf{x}}) = \sum_{n=1}^{N}\frac{\big[\tilde{f}(t_n,\mathbf{x}) - f(t_n,\hat{\mathbf{x}})\big]^2}{\sigma_n^2}$$
This last step is often done to
allow assessment of how much
confidence you have in the
solution. There are other ways to
assess confidence see
discussion in Ch. 15 of
Numerical Recipes
Note: There is no guarantee that this algorithm will
convergeit might not converge at allit might:
(i) simply wander around aimlessly,
(ii) oscillate back and forth along some path, or
(iii) wander off in complete divergence.
In practical algorithms it is a good idea to put tests
into the code to check for such occurrences
Simulation Results with 95% CRLB Error Ellipses
Platform Trajectory
1/11
Doppler Tracking
Passive Tracking of an Airborne Radar:
An Example of Least Squares
State Estimation
2/11
Problem Statement
Airborne radar to be located follows a trajectory X(t), Y(t), Z(t) with
velocities V
x
(t), V
y
(t), V
z
(t)
It is transmitting a radar signal whose carrier frequency is f
o
.
Signal is intercepted by a non-moving receiver at known location
X
p
, Y
p
, Z
p
.
Problem: Estimate the trajectory X(t), Y(t), Z(t)
Solution Here:
Measure received frequency at instants t
1
, t
2
, , t
N
Assume a simple model for the aircraft's motion
Estimate model parameters to give estimates of trajectory
unknown
3/11
An Admission
This problem is somewhat of a rigged application
Unlikely it would be done in practice just like this
Because it will lead to poorly observable parameters
The H matrix is likely to be less than full rank
In real practice we would likely need either:
Multiple Doppler sensors or
A single sensor that can measure other things in addition to
Doppler (e.g., bearing).
We present it this way to maximize the similarity to the example of
locating a non-moving radar from a moving platform
Can focus on the main characteristic that arises when the
parameter to be estimated is a varying function (i.e. state
estimation).
4/11
Doppler Shift Model
Relative motion between the emitter and receiverDoppler Shift
Frequency observed at time t is related to the unknown transmitted
frequency of f
o
by:
( ) ( ) ( )
( ) ( ) ( )

+ +
+ +
=
2 2 2
) ( ) ( ) (
) ( ) ( ) ( ) ( ) (
) (
t Z Z t Y Y t X X
t Z Z V t Y Y t V t X X t V
c
f
f t f
p p p
p z p y p x
o
o
We measure this at time instants t
1
, t
2
, , t
N
) ( ) ( ) (
~
i i i
t v t f t f + =
And group them into a measurement vector:
Frequency
Measurement
Noise
[ ]
T
N
t f t f t f ) (
~
) (
~
) (
~ ~
2 1
! = f
But what are we trying to estimate from this data vector???
5/11
Trajectory Model
We cant estimate arbitrary trajectory functions like X(t), Y(t), etc.
Need a trajectory model
to reduce the problem to estimating a few parameters
Here we will choose the simplest Constant-Velocity Model
N N z
N N y
N N x
Z t t V t Z
Y t t V t Y
X t t V t X
+ =
+ =
+ =
) ( ) (
) ( ) (
) ( ) (
Final Positions in
Observation Block
Velocity Values
Now, given measurements of frequencies f(t
1
), f(t
2
), , f(t
N
)
we wish to estimate the 7-parameter vector:
T
o z y x N N N
f V V V Z Y X ] [ = x
6/11
Measurement Model and Estimation Problem
Substituting the Trajectory Model into the Doppler Model
gives our measurement model:
[ ] ( ) [ ] ( ) [ ] ( )
[ ] ( ) [ ] ( ) [ ] ( )

+ + + + +
+ + + + +
=
2 2 2
) ( ) ( ) (
) ( ) ( ) ( ) (
) , (
N N z p N N y p N N x p
N N z p z N N y p y N N x p x
o
o
Z t t V Z Y t t V Y X t t V X
Z t t V Z V Y t t V Y t V X t t V X V
c
f
f t f x
Dependence on
parameter vector
) ( ) , ( ) , (
~
t v t f t f + = x x
[ ]
v x f
x x x x f
+ =
=
) (
) , (
~
) , (
~
) , (
~
) (
~
2 1
T
N
t f t f t f !
Noisy
Frequency
Measurement
Noisy
Measurement
Vector
Noise-Free
Frequency Vector
Noise Vector
7/11
Estimation Problem
Given:
Noisy Data Vector:
Sensor Position:
Estimate:
Parameter Vector:
[ ]
T
N
t f t f t f ) , (
~
) , (
~
) , (
~
) (
~
2 1
x x x x f ! =
p p p
Z Y X , ,
T
o z y x N N N
f V V V Z Y X ] [ = x
This is a nonlinear problem
Although we could use ML to attack this we choose
LS here partly because we arent given an explicit
noise model and partly because LS is easily applied
here!!!
Nuisance
parameter
8/11
Linearize the Nonlinear Model
We have a non-linear measurement model here
so we choose to linearize our model (as before):
[ ] v x x H x f x f + +
n n

( ) (
~
where
n
x

is the current estimate of the parameter vector


)

(
n
x f is the predicted frequency measurements
computed using the Doppler & Trajectory
models with back-propagation (see next)
[ ]
7 6 5 4 3 2 1

| | | | | | ) , ( h h h h h h h x t
x
H
x x
=

=
=
n
f
is the N7 Jacobian matrix evaluated at the
current estimate
9/11
Back-Propagate to Get Predicted Frequencies
Given the current parameter estimate:
T
o z y x N N N n
n f n V n V n V n Z n Y n X )] (

) (

) (

) (

) (

) (

) (

= x
Back-Propagate to get the current trajectory estimate:
) (

) ( ) (

) (

) (

) ( ) (

) (

) (

) ( ) (

) (

n Z t t n V t Z
n Y t t n V t Y
n X t t n V t X
N N z n
N N y n
N N x n
+ =
+ =
+ =
Use Back-Propagated trajectory to get predicted frequencies:
( ) ( ) ( )
( ) ( ) ( )

+ +
+ +
=
2 2 2
) (

) (

) (

) (

) (

) (

) (

) (

) (

) (

) (

, (
t Z Z t Y Y t X X
t Z Z n V t Y Y n V t X X n V
c
n f
n f t f
n p n p n p
n p z n p y n p x
o
o n i
x
10/11
Converting to Linear LS Problem Form
From the linearized model and the back-propagated trajectory
estimate we get:
"
v x H x f
x x
x f x f
+

n
n
n n

( ) (
~
)

(
#$ #% &
Update
Vector
Residual
Vector
This is in standard form of Linear LS so the solution is:
( ) )

1
1
1
n
T T
n
x f R H H R H x =

This LS estimated update is then used to get an


updated parameter estimate:
n n n
x x x

1
+ =
+
R is the
covariance
matrix of the
measurements
11/11
Iterating to the Solution
n=0: Start with some initial estimate
Loop until stopping criterion satisfied
n n+1
Compute Back-Propagated Trajectory
Compute Residual
Compute Jacobian
Compute Update
Check Update for smallness of norm
If Update small enough stop
Otherwise, update estimate and loop
1
Pre-Chapter 10
Results for Two Random Variables
See Reading Notes
posted on BB
2
Let X and Y be two RVs, each with their own PDF: p_X(x) and p_Y(y)
Their complete probabilistic description is captured in
Joint PDF of X and Y: p
XY
(x,y)
Describes probabilities of joint events concerning X and Y.
{ }

= < < < <
b
a
d
c
XY
dxdy y x p d Y c b X a ) , ( ) ( and ) ( Pr
Marginal PDFs of X and Y: The individual PDFs p
X
(x) and p
Y
(y)
Imagine adding up the joint PDF along one direction of a piece of
paper to give values along one of the margins.

= = dx y x p y p dy y x p x p
XY Y XY X
) , ( ) ( ) , ( ) (
3
Expected Value of Functions of X and Y: You sometimes create a
new RV that is a function of the two of them: Z = g(X,Y).
{ } { }

= = dxdy y x p y x g Y X g E Z E
XY XY
) , ( ) , ( ) , (
Example: Z = X + Y
{ } { } ( )
{ } { } Y E X E
dy y yp dx x xp
dy dx y x p y dx dy y x p x
dxdy y x yp dxdy y x xp
dxdy y x p y x Y X E Z E
Y X
Y X
XY XY
XY XY
XY XY
+ =
+ =
(
(

+
(
(

=
+ =
+ = + =




) ( ) (
) , ( ) , (
) , ( ) , (
) , (
4
Conditional PDFs : If you know the value of one RV how is the
remaining RV now distributed?


=
otherwise , 0
0 ) ( ,
) (
) , (
) | (
|
x p
x p
y x p
x y p
X
X
XY
X Y


=
otherwise , 0
0 ) ( ,
) (
) , (
) | (
|
y p
y p
y x p
y x p
Y
Y
XY
Y X
Sometimes we think of a specific numerical value upon which we
are conditioning p
Y|X
(y|X = 5)
Other times it is an arbitrary value
p
Y|X
(y|X = x) or p
Y|X
(y|x) or p
Y|X
(y|X)
Various Notations
5
Independence: RVs X and Y are said to be independent if
knowledge of the value of one does not change the PDF model for
the other.
) ( ) | (
) ( ) | (
|
|
x p y x p
y p x y p
X Y X
Y X Y
=
=
) ( ) ( ) , ( y p x p y x p
Y X XY
=
This implies (and is implied by)
) (
) (
) ( ) (
) | (
) (
) (
) ( ) (
) | (
|
|
x p
y p
y p x p
y x p
y p
x p
y p x p
x y p
X
Y
Y X
Y X
Y
X
Y X
X Y
= =
= =
6
Decomposing the Joint PDF: Sometimes it is useful to be able to
write the joint PDF in terms of conditional and marginal PDFs.
From our results for conditioning above we get
) ( ) | ( ) , (
|
x p x y p y x p
X X Y XY
=
) ( ) | ( ) , (
|
y p y x p y x p
Y Y X XY
=
From this we can get results for the marginals:

=
=
dx x p x y p y p
dy y p y x p x p
X X Y Y
Y Y X X
) ( ) | ( ) (
) ( ) | ( ) (
|
|
7
Bayes Rule: Sometimes it is useful to be able to write one
conditional PDF in terms of the other conditional PDF.
) (
) ( ) | (
) | (
) (
) ( ) | (
) | (
|
|
|
|
y p
x p x y p
y x p
x p
y p y x p
x y p
Y
X X Y
Y X
X
Y Y X
X Y
=
=
Some alternative versions of Bayes rule can be obtained by
writing the marginal PDFs using some of the above results:


= =
= =
dx x p x y p
x p x y p
dx y x p
x p x y p
y x p
dy y p y x p
y p y x p
dy y x p
y p y x p
x y p
X X Y
X X Y
XY
X X Y
Y X
Y Y X
Y Y X
XY
Y Y X
X Y
) ( ) | (
) ( ) | (
) , (
) ( ) | (
) | (
) ( ) | (
) ( ) | (
) , (
) ( ) | (
) | (
|
| |
|
|
| |
|
8
Conditional Expectations: Once you have a conditional PDF it
works EXACTLY like a PDFthat is because it IS a PDF!
Remember that any expectation involves a function of a random
variable(s) times a PDF and then integrating that product.
So the trick to working with expected values is to make sure you
know three things:
1. What function of which RVs
2. What PDF
3. What variable to integrate over
9
For conditional expectations one idea but several notations!
{ }

= dx y x p y x g Y X g E
Y X Y X
) | ( ) , ( ) , (
| |
{ }

=
=
dx y x p y x g Y X g E
o Y X o y Y X
o
) | ( ) , ( ) , (
| |
{ }

= dx y x p y x g Y Y X g E
Y X
) | ( ) , ( | ) , (
|
{ }

= = dx y x p y x g y Y Y X g E
o Y X o o
) | ( ) , ( | ) , (
|
Uses subscript on E to indicate that you use the cond. PDF.
Does not explicitly state the value at which Y should be fixed so use an arbitrary y
Uses subscript on E to indicate that you use the cond. PDF.
Explicitly states that the value at which Y should be fixed is y
o
Uses conditional bar inside brackets of E to indicate use of the cond. PDF.
Does not explicitly state the value at which Y should be fixed so use an arbitrary y
Uses conditional bar inside brackets of E to indicate use of the cond. PDF.
Explicitly states that the value at which Y should be fixed is y
o
10
Decomposing Joint Expectations: When averaging over the joint
PDF it is sometimes useful to be able to decompose it into nested
averaging in terms of conditional and marginal PDFs.
This uses the results for decomposing joint PDFs.
{ } { }
{ } { } ) , (
) , ( ) , (
|
Y X g E E
Y X g E Y X g E
X Y X
XY
=
=
{ }
dx x p x y p y x g
dxdy y x p y x g Y X g E
X
x
Y X g E
X Y
y
x p x y p
XY
X Y
X X Y
) ( ) | ( ) , (
) , ( ) , ( ) , (
)} , ( {
|
) ( ) | (
|
|


(

=
=
! ! ! ! " ! ! ! ! # $
!" !# $
This is an RV that
inherits the PDF of X!!!
11
Ex. Decomposing Joint Expectations:
Let X = # on Red Die Y = # on Blue Die g(X,Y) = X + Y
{ } { } { } ) , ( ) , (
|
Y X g E E Y X g E
X Y X
=

=
= +
6
1
5 . 9
6
1
) 6 (
y
y
(6+6) (6+5) (6+4) (6+3) (6+2) (6+1)
6
(5+6) (5+5) (5+4) (5+3) (5+2) (5+1)
5
(4+6) (4+5) (4+4) (4+3) (4+2) (4+1)
4
(3+6) (3+5) (3+4) (3+3) (3+2) (3+1)
3
(2+6) (2+5) (2+4) (2+3) (2+2) (2+1)
2
(1+6) (1+5) (1+4) (1+3) (1+2) (1+1)
1
E{Y|X} 6 5 4 3 2 1 X/Y

=
= +
6
1
5 . 4
6
1
) 1 (
y
y

=
= +
6
1
5 . 5
6
1
) 2 (
y
y

=
= +
6
1
5 . 8
6
1
) 5 (
y
y

=
= +
6
1
5 . 7
6
1
) 4 (
y
y

=
= +
6
1
5 . 6
6
1
) 3 (
y
y
These
constitute
an RV with
uniform
probability
of 1/6

=
= = = +
6
1
7
6
1
} | { }} | { { } {
x
x Y E X Y E E Y X E
1
Chapter 10
Bayesian Philosophy
2
10.1 Introduction
Up to now Classical Approach: assumes is deterministic
This has a few ramifications:
Variance of the estimate could depend on
In Monte Carlo simulations:
Mruns done at the same ,
must do M runs at each of interest
averaging done over data
no averaging over values
E{} is
w.r.t. p(x;)
Bayesian Approach: assumes is random with pdf p()
This has a few ramifications:
Variance of the estimate CANT depend on
In Monte Carlo simulations:
each run done at a randomly chosen ,
averaging done over data AND over values
E{} is
w.r.t. p(x,)
joint pdf
3
Why Choose Bayesian?
1. Sometimes we have prior knowledge: some θ values are more likely than others
2. Useful when the classical MVU estimator does not exist because of nonuniformity of the minimal variance σ²_θ̂(θ)
3. To combat the "signal estimation" problem: estimate the signal s in x = s + w.
If s is deterministic and is the parameter to estimate, then H = I, and the Classical Solution is
ŝ = (I^T I)^{-1} I^T x = x
The Signal Estimate is the data itself!!!
The Wiener filter is a Bayesian method to combat this!!
4
10.3 Prior Knowledge and Estimation
Bayesian Data Model:
Parameter is chosen randomly w/ known prior PDF
Then data set is collected
Estimate value chosen for parameter
Every time you collect data, the parameter has a different value,
but some values may be more likely to occur than others
This is how you think about it mathematically and how you run
simulations to test it.
This is what you know ahead
of time about the parameter.
5
Ex. of Bayesian Viewpoint: Emitter Location
Emitters are where they are and dont randomly jump around each
time you collect data. So why the Bayesian model?
(At least) Three Reasons
1. You may know from maps, intelligence data, other sensors,
etc. that certain locations are more likely to have emitters
Emitters likely at airfields, unlikely in the middle of a lake
2. Recall Classical Method: Parm Est. Variance often depends
on parameter
It is often desirable (e.g. marketing) to have a single
number that measures accuracy.
3. Classical Methods try to give an estimator that gives low
variance at each value. However, this could give large
variance where emitters are likely and low variance where
they are unlikely.
6
Bayesian Criteria Depend on Joint PDF
There are several different optimization criteria within the
Bayesian framework. The most widely used is
Minimize the Bayesian MSE:
$$\mathrm{Bmse}(\hat{\theta}) = E\left\{\big(\theta-\hat{\theta}\big)^2\right\} = \iint\big(\theta-\hat{\theta}(\mathbf{x})\big)^2\,p(\mathbf{x},\theta)\,d\mathbf{x}\,d\theta$$
Take E{·} w.r.t. the joint pdf of x and θ, so the Bmse can NOT depend on θ.
To see the difference compare to the Classical MSE:
$$\mathrm{mse}(\hat{\theta}) = E\left\{\big(\theta-\hat{\theta}\big)^2\right\} = \int\big(\theta-\hat{\theta}(\mathbf{x})\big)^2\,p(\mathbf{x};\theta)\,d\mathbf{x}$$
Here the pdf of x is parameterized by θ, so the mse CAN depend on θ.
7
Ex. Bayesian for DC Level
Zero-Mean White Gaussian
Same as before x[n] = A + w[n]
But here we use the following model: A is random with a uniform prior pdf, p(A) = 1/(2A_o) for −A_o ≤ A ≤ A_o (0 otherwise), and the RVs A and w[n] are independent of each other.
Now we want to find the estimator function that maps the data x into the estimate of A that minimizes the Bayesian MSE:
$$\mathrm{Bmse}(\hat{A}) = \iint\big(A-\hat{A}\big)^2\,p(\mathbf{x},A)\,dA\,d\mathbf{x} = \int\left[\int\big(A-\hat{A}\big)^2\,p(A|\mathbf{x})\,dA\right]p(\mathbf{x})\,d\mathbf{x}$$
(using p(x,A) = p(A|x)p(x))
Because p(x) ≥ 0 we can minimize this for each x value: so fix x, take the partial derivative w.r.t. Â, and set it to 0.
8
Finding the partial derivative gives:
$$\frac{\partial}{\partial\hat{A}}\int\big(A-\hat{A}\big)^2 p(A|\mathbf{x})\,dA = -2\int A\,p(A|\mathbf{x})\,dA + 2\hat{A}\underbrace{\int p(A|\mathbf{x})\,dA}_{=1}$$
Setting this equal to zero and solving gives:
$$\hat{A} = \int A\,p(A|\mathbf{x})\,dA = E\{A\,|\,\mathbf{x}\}$$
the conditional mean of A given the data x.
Bayesian Minimum MSE Estimate = The Mean of the posterior pdf.
So we need to explore how to compute this from our data, given knowledge of the Bayesian model for a problem.
9
Compare this Bayesian Result to the Classical Result:
for a given observed data vector x look at
MVUE = x
A
A
o
A
o
p(A|x)
p(x;A)
MMSE = E{A|x}
Before taking any data what is the best estimate of A?
Classical: No best guess exists!
Bayesian: Mean of the Prior PDF
observed data updates this a priori estimate into
an a posteriori estimate that balances prior vs. data
10
So for this example we've seen that we need E{A|x}. How do we compute that!!!?? Well...
$$\hat{A} = E\{A|\mathbf{x}\} = \int A\,p(A|\mathbf{x})\,dA$$
So we need the posterior pdf of A given the data, which can be found using Bayes' Rule (it allows us to write one cond. PDF in terms of the other one):
$$p(A|\mathbf{x}) = \frac{p(\mathbf{x}|A)\,p(A)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|A)\,p(A)}{\int p(\mathbf{x}|A)\,p(A)\,dA}$$
p(A) is assumed known; p(x|A) is more easily found than p(A|x): it has very much the same structure as the parameterized PDF used in Classical Methods.
11
So now we need p(x|A). For x[n] = A + w[n] we know that, for A known, x[n] is the known A plus the random w[n], so (because w[n] and A are assumed independent)
$$p(x[n]\,|\,A) = p_w\big(x[n]-A\big) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}\big(x[n]-A\big)^2\right]$$
Because w[n] is White Gaussian the samples are independent; thus, the data conditioned on A is independent:
$$p(\mathbf{x}\,|\,A) = \frac{1}{\left(2\pi\sigma^2\right)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\big(x[n]-A\big)^2\right]$$
Same structure as the parameterized PDF used in Classical Methods, but here A is an RV upon which we have conditioned the PDF!!!
12
Now we can use all this (Bayes' rule, the parameter-conditioned PDF, and the prior PDF) to find the MMSE estimate for this problem:
$$\hat{A} = E\{A|\mathbf{x}\} = \int A\,p(A|\mathbf{x})\,dA = \frac{\displaystyle\int_{-A_o}^{A_o} A\,\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\big(x[n]-A\big)^2\right]dA}{\displaystyle\int_{-A_o}^{A_o}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\big(x[n]-A\big)^2\right]dA}$$
MMSE Estimator: a function that maps the observed data into the estimate. The idea is easy, but the estimator is hard to build: there is no closed form for this case!!!
13
How the Bayesian approach balances a priori and a posteriori info:
(Figure: with no data, the best estimate is E{A}, the mean of the prior p(A); with a short data record the posterior p(A|x) balances the prior and the sample mean x̄, so E{A|x} lies between them; with a long data record p(A|x) concentrates near x̄ and E{A|x} ≈ x̄.)
14
General Insights From Example
1. After collecting data: our knowledge is captured by the
posterior PDF p( |x)
2. Estimator that minimizes the Bmse is E{ |x}the mean of
the posterior PDF
3. Choice of prior is crucial:
Bad Assumption of Prior Bad Bayesian Estimate!
(Especially for short data records)
4. Bayesian MMSE estimator always exists!
But not necessarily in closed form
(Then must use numerical integration)
15
10.4 Choosing a Prior PDF
Choice is crucial:
1. Must be able to justify it physically
2. Anything other than a Gaussian prior will likely result in
no closed-form estimates
We just saw that a uniform prior led to a non-closed form
Well see here an example where a Gaussian prior gives a
closed form
So there seems to be a trade-off between:
Choosing the prior PDF as accurately as possible
Choosing the prior PDF to give computable closed form
16
Ex. 10.1: DC in WGN with Gaussian Prior PDF
We assume our Bayesian model is now: x[n] = A + w[n]
with a prior PDF of
) , ( ~
2
A A
N A
AWGN
So for a given value of the RV A the conditional PDF is
( )
( )
(
(

=
1
0
2
2 2 /
2
] [
2
1
exp
2
1
) | (
N
n
N
A n x A p

x
Then to get the needed conditional PDF we use this and the a
priori PDF for A in Bayes Theorem:

=
dA A p A p
A p A p
A p
) ( ) | (
) ( ) | (
) | (
x
x
x
17
Then after much algebra and gnashing of teeth we get (see the book):
$$p(A|\mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2_{A|x}}}\exp\left[-\frac{1}{2\sigma^2_{A|x}}\big(A-\mu_{A|x}\big)^2\right]$$
which is a Gaussian PDF with
$$\sigma^2_{A|x} = \frac{1}{\dfrac{N}{\sigma^2}+\dfrac{1}{\sigma_A^2}} \qquad\text{(parallel combination of a priori and sample variances)}$$
$$\mu_{A|x} = \sigma^2_{A|x}\left(\frac{N}{\sigma^2}\,\bar{x} + \frac{\mu_A}{\sigma_A^2}\right) \qquad\text{(weighted combination of a priori and sample means)}$$
So the main point here so far is that by assuming:
Gaussian noise
Gaussian a priori PDF on the parameter
We get a Gaussian a posteriori PDF for Bayesian estimation!!
18
Now recall that the Bayesian MMSE estimate was the conditional a posteriori mean: Â = E{A|x}.
Because we now have a Gaussian a posteriori PDF it is easy to find an expression for this:
$$\hat{A} = E\{A|\mathbf{x}\} = \mu_{A|x} = \sigma^2_{A|x}\left(\frac{N}{\sigma^2}\,\bar{x} + \frac{\mu_A}{\sigma_A^2}\right)$$
After some algebra we get an easily computable estimator:
$$\hat{A} = \alpha\,\bar{x} + (1-\alpha)\,\mu_A, \qquad \alpha = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/N}, \qquad 0 < \alpha < 1$$
(x̄ is the sample mean computed from the data; σ² is known from the data model; μ_A and σ_A² are known from the prior model)
$$\mathrm{var}\{\hat{A}\} = \mathrm{var}\{A|\mathbf{x}\} = \sigma^2_{A|x} = \frac{1}{\dfrac{N}{\sigma^2}+\dfrac{1}{\sigma_A^2}}$$
Little or Poor Data (σ²/N >> σ_A²): α ≈ 0, so Â ≈ μ_A.
Much or Good Data (σ²/N << σ_A²): α ≈ 1, so Â ≈ x̄.
19
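A sketch of this estimator with hypothetical prior and noise values, showing how the weight α slides the estimate from the prior mean toward the sample mean as N grows:

```python
import numpy as np

rng = np.random.default_rng(8)

mu_A, var_A = 1.0, 0.25                   # prior:  A ~ N(mu_A, var_A)
sigma2 = 1.0                              # noise variance of w[n]
A = rng.normal(mu_A, np.sqrt(var_A))      # one Bayesian "draw" of the parameter

for N in (1, 10, 1000):
    x = A + rng.normal(scale=np.sqrt(sigma2), size=N)
    xbar = x.mean()
    alpha = var_A / (var_A + sigma2 / N)            # data weight, 0 < alpha < 1
    A_mmse = alpha * xbar + (1 - alpha) * mu_A      # E{A | x}
    post_var = 1.0 / (N / sigma2 + 1.0 / var_A)     # var{A | x}
    print(N, round(A_mmse, 3), round(post_var, 4))

print("true A =", round(A, 3))
```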
Comments on this Example for Gaussian Noise and Gaussian Prior
1. Closed-Form Solution for the Estimate!
2. The Estimate is a Weighted sum of the prior mean & the data mean
3. The Weights balance between prior info quality and data quality
4. As N increases:
a. The Estimate E{A|x} moves from μ_A toward x̄
b. The Accuracy var{A|x} moves from σ_A² toward σ²/N
(Figure: p(A|x) for N = 0 (the prior, mean μ_A, variance σ_A²) and for increasing N_1 < N_2: the posterior narrows toward variance σ²/N and its mean slides toward x̄.)
20
Bmse for this Example: Bmse(Â) = σ²_{A|x}. To see this:
$$\mathrm{Bmse}(\hat{A}) = E\left\{\big(A-\hat{A}\big)^2\right\} = \iint\big(A-E\{A|\mathbf{x}\}\big)^2\,p(\mathbf{x},A)\,dA\,d\mathbf{x} = \int\Big[\underbrace{\int\big(A-E\{A|\mathbf{x}\}\big)^2\,p(A|\mathbf{x})\,dA}_{\mathrm{var}\{A|\mathbf{x}\}\,=\,\sigma^2_{A|x}}\Big]\,p(\mathbf{x})\,d\mathbf{x}$$
General Result: Bmse = the posterior variance averaged over the PDF of x:
$$\mathrm{Bmse}(\hat{A}) = \int\sigma^2_{A|x}\,p(\mathbf{x})\,d\mathbf{x}$$
In this case σ²_{A|x} is not a function of x, so Bmse(Â) = σ²_{A|x}.
This will hold in general!
1
10.5 Properties of Gaussian PDF
To help us develop some general MMSE theory for the Gaussian
Data/Gaussian Prior case, we need to have some solid results for
joint and conditional Gaussian PDFs.
Well consider the bivariate case but the ideas carry over to the
general N-dimensional case.
2
Bivariate Gaussian Joint PDF for 2 RVs X and Y
|
|
|
|
|
.
|

\
|
(
(

(
(

=

! ! ! ! " ! ! ! ! # $
form quadratic
y
x
T
y
x
y
x
y
x
y x p

1
1/2
2
1
exp
| | 2
1
) , ( C
C
(
(

(
(

Y
X
Y
X
E

(
(

=
(
(

=
(
(

=
2
2
2
2
) var( ) , cov(
) , cov( ) var(
Y Y X
Y X X
Y YX
XY X
Y X Y
Y X X




C
-10 -5 0 5 10
-8
-6
-4
-2
0
2
4
6
8
x
y
-8
-6
-4
-2
0
2
4
6
8
-10
-5
0
5
10
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
x
y
p
(
x
,
y
)
x y
p
(
x
,
y
)
x
y
3
Marginal PDFs of Bivariate Gaussian
What are the marginal (or individual) PDFs?


= dy y x p x p ) , ( ) (


= dx y x p y p ) , ( ) (
We know that we can get them by integrating:
After performing these integrals you get that:
X ~ N(
X
, var{X}) Y ~ N(
Y
, var{Y})
-10 -5 0 5 10
-8
-6
-4
-2
0
2
4
6
8
x
y
x
y
p(x)
p(y)
4
Comment on Jointly Gaussian
We have used the term Jointly Gaussian
Q: EXACTLY what does that mean?
A: That the RVs have a joint PDF that is Gaussian
|
|
|
.
|

\
|
(
(

(
(

=

y
x
T
y
x
y
x
y
x
y x p

1
1/2
2
1
exp
| | 2
1
) , ( C
C
Weve shown that jointly Gaussian RVs also have Gaussian
marginal PDFs
Q: Does having Gaussian Marginals imply Jointly Gaussian?
In other words if X is Gaussian and Y is Gaussian is it
always true that X and Y are jointly Gaussian???
A: No!!!!!
Example for
2 RVs
See Reading Notes on
Counter Example
posted on BB
5
Well construct a counterexample: start with a zero-mean,
uncorrelated 2-D joint Gaussian PDF and modify it so it is no
longer 2-D Gaussian but still has Gaussian marginals.

|
|
.
|

\
|
+

=
2
2
2
2
2
1
exp
2
1
) , (
Y X
Y X
XY
y x
y x p

x
y
x
y
But if we modify it by:
Setting it to 0 in the shaded regions
Doubling its value elsewhere
We get a 2-D PDF that is not
a joint Gaussian but the
marginals are the same
as the original!!!!
6
Conditional PDFs of Bivariate Gaussian
What are the conditional PDFs?
If you know that X has taken value X = x
o
, how is Y distributed?
(
(


=
16 16 25 8 . 0
16 25 8 . 0 25
C
Slope of Line
cov{X,Y}/var{X} =
Y
/
X


= =
dy y x p
y x p
x p
x x p
x y p
) , (
) , (
) (
) | (
) | (
0
0
0
0
0
Slice @ x
o
Normalizer
-15 -10 -5 0 5 10 15
-15
-10
-5
0
5
10
15
x
y
p(y|X=5)
p(y)
Note: Conditioning on correlated RV
shifts mean
reduces variance
7
Theorem 10.1: Conditional PDF of Bivariate Gaussian
Let X and Y be random variables distributed jointly Gaussian
with mean vector [E{X} E{Y}]
T
and covariance matrix
(
(

=
(
(

=
2
2
) var( ) , cov(
) , cov( ) var(
Y YX
XY X
Y X Y
Y X X


C
Then p(y|x) is also Gaussian with mean and variance given by:
( )
( ) } { } {
} { } { } | {
2
X E x Y E
X E x Y E x X Y E
o
X
Y
o
X
XY
o
+ =
+ = =

( )
2 2 2 2 2
2
2
2
1
} | var{
Y Y Y
X
XY
Y o
x X Y

= =
= =
Slope of Line
Amount of Reduction
Reduction Factor
8
Impact on MMSE
We know the MMSE of RV Y after observing the RV X = x
o
:
{ }
o
x X Y E Y = = |

So using the ideas we have just seen:


if the data and the parameter are jointly Gaussian, then
( ) } { } { } | {

2
X E x Y E x X Y E Y
o
X
XY
o MMSE
+ = = =

It is the correlation between the RVs X and Y that allow us to


perform Bayesian estimation.
9
Theorem 10.2: Conditional PDF of Multivariate Gaussian
Let X (k1) and Y (l1) be random vectors distributed jointly
Gaussian with mean vector [E{X}
T
E{Y}
T
]
T
and covariance
matrix
(
(



=
(
(

=
) ( ) (
) ( ) (
l l k l
l k k k
YY YX
XY XX
C C
C C
C
Then p(y|x) is also Gaussian with mean vector and covariance
matrix given by:
( ) } { } { } | {
1
X x C C Y x X Y
XX YX
E E E
o o
+ = =

XY XX YX YY x X | Y
C C C C C
1
=
=
o
( ) } { } { } | {
2
X E x Y E x X Y E
o
X
XY
o
+ = =

2
2
2
} | var{
X
XY
Y o
x X Y

= =
Compare to
Bivariate Results
For the Gaussian case the
cond. covariance does not depend
on the conditioning x-value!!!
10
10.6 Bayesian Linear Model
Now we have all the machinery we need to find the MMSE for
the Bayesian Linear Model
w H x + =
N1
Np
known
p1
~N(

,C

)
N1
~N(0,C
w
)
Clearly, x is Gaussian and is Gaussian
But are they jointly Gaussian???
If yes then we can use Theorem 10.2 to get the MMSE for !!!
Answer = Yes!!
11
Bayesian Linear Model is Jointly Gaussian
and w are each Gaussian and are independent
Thus their joint PDF is a product of Gaussians
which has the form of a jointly Gaussian PDF
Can now use: a linear transform of jointly Gaussian is jointly Gaussian
(
(

(
(

=
(
(

0 I
I H

x
Jointly Gaussian
Thus, Thm. 10.2 applies! Posterior PDF is
! Joint Gaussian
! Completely described by its mean and variance
12
Conditional PDF for Bayesian Linear Model
To apply Theorem 10.2, notationally let X = x and Y = .
First we need E{X} = H E{} + E{w} = H

E{Y} = E{} =

And also
YY
C C =
( )( ) { }
( ) | | ( ) | | { }
( )( ) { } { }
T T T
T
T
E E
E
E E E
ww H H
w H w H
x x x x C

C


XX
+ =
+ + =
=
! ! ! " ! ! ! # $
} { } {
{ }
T T
E ww H HC C
XX
+ =
Cross Terms are Zero
because and w are
independent
13
( )( ) { }
( )( ) { }
( )( ) { }
T T
T
T
E
E
E
H
H w H
x C C


x x YX
=
+ =
= =
Similarly
T
H C C
x
=
Use E{w} = 0
E{

w} = 0
Then Theorem 10.2 gives the conditional PDFs mean and cov
(and we know the conditional mean is the MMSE estimate)
{ }
( ) ( )
w
H x C H HC H C
x
+ + =
=
1
|

T T
MMSE
E
Data Prediction
Error
Update Transformation
Maps unpredictable part
a priori
estimate
Cross Correlation C
x
Relative Quality
( )
w x
HC C H HC H C C C
1
|

+ =
T T
Posterior
Covariance:
a priori
covariance
Reduction Due to Data
Posterior
Mean:
Bayesian
MMSE
Estimator
14
Ex. 10.2: DC in AWGN w/ Gaussian Prior
Data Model: x[n] = A + w[n] A & w[n] are independent
) , ( ~
2
A A
N
) , 0 ( ~
2
N
Write in linear model form:
x = 1A + w with H = 1 = [ 1 1 1]
T
Now General Result gives the MMSE estimate as:
) - ( ) (
) - ( ) ( } | {

1
2
2
2
2
1 2 2 2
A
T
A
T
A
A
A
T
A
T
A A MMSE
A E A


1 x 11 I 1
1 x I 1 1 1 x

+ + =
+ + = =
Can simplify using
The Matrix Inversion Lemma
15
Aside: Matrix Inversion Lemma
( ) ( )
1
1
1 1 1 1 1


+ = + DA C B DA B A A BCD A
nn nm
mm mn
( )
u A u
A uu A
A uu A
1
1 1
1
1
1

+
= +
T
T
T
nn
n1
Special Case (m = 1):
16
Continuing the Example Apply the Matrix Inversion Lemma:
) (
/
1
) (
/
) (
/
) (

2 2 2
2
2 2 2
2
2 2 2
2
1
2
2
2
2
A
A
A
A
A
T
A
T
A
A
A
A
T
T
A
A
A
T
A
T
A
A MMSE
N x N
N
N
N
N
N
A

|
|
.
|

\
|
+
+ =

|
|
.
|

\
|
+
+ =

|
|
.
|

\
|
+
+ =

|
|
.
|

\
|
+ + =

1 x 1 1
1 x
11
I 1
1 x 11 I 1
) (
/

2 2
2
A
A
A
A MMSE
x
N
A



|
|
.
|

\
|
+
+ =
Use Matrix Inv Lemma
Pass through 1
T
& use 1
T
1 = N
Factor Out 1
T
& use 1
T
1 = N
Algebraic
Manipulation
Error Between
Data-Only Est.
& Prior-Only Est.
Gain
Factor
a priori
estimate
When data is bad (
2
/N >>
2
A
),
gain is small, data has little use
When data is good (
2
/N >>
2
A
),
gain is large, data has large use
A MMSE
A

x A
MMSE

17
Using similar manipulations gives:
N
N
N
A
A
A
A
/
1 1
1
) x | var(
2 2
2
2
2
2

+
=
+
|
|
.
|

\
|
=
Like || resistors small one wins!
var (A|x) is the smaller of:
data estimate variance
prior variance
N
A
A
/
1 1
) x | var(
1
2 2

+ =
Or looking at it another way:
additive information!
18
10.7 Nuisance Parameters
One difficulty in classical methods is that nuisance parameters
must explicitly dealt with.
In Bayesian methods they are simply Integrated Away!!!!
Recall Emitter Location: [x y z f
0
]
In Bayesian Approach
From p(x, y, z, f
0
| x) can get p(x, y, z | x):
Nuisance Parameter

=
0 0
) x | , , , ( ) | , , ( df f z y x p z y x p x
Then find conditional mean for the MMSE estimate!
1
Ch. 11
General Bayesian Estimators
2
Introduction
In Chapter 10 we:
introduced the idea of a a priori information on
use prior pdf: p()
defined a new optimality criterion
Bayesian MSE
showed the Bmse is minimized by E {|x}
called:
mean of posterior pdf
conditional mean
In Chapter 11 we will:
define a more general optimality criterion
leads to several different Bayesian approaches
includes Bmse as special case
Why? Provides flexibility in balancing:
model,
performance, and
computations
3
11.3 Risk Functions
Previously we used Bmse as the Bayesian measure to minimize
( )

=
)
`

) , ( . . .

2
x p t r w E Bmse
So, Bmse is Expected value of square of error
Lets write this in a way that will allow us to generalize it.
Define a quadratic Cost Function:
( )
2
2

) ( = = C
Then we have that
{ } ) ( C E Bmse =

C() =
2
Why limit the cost function to just quadratic?
4
General Bayesian Criteria
1. Define a cost function: C()
2. Define Bayes Risk: R = E{C()} w.r.t. p(x, )
{ } )

( )

( = C E R
Depends on choice of estimator
3. Minimize Bayes Risk w.r.t. estimate

The choice of the cost function can be tailored to:


Express importance of avoiding certain kinds of errors
Yield desirable forms for estimates
e.g., easily computed
Etc.
5
Three Common Cost Functions
1. Quadratic: C() =
2

C()
2. Absolute: C() = | |

C()
3. Hit-or-Miss:

<
=

, 1
, 0
) ( C
> 0 and small

C()

6
General Bayesian Estimators
Derive how to choose estimator to minimize the chosen risk:
| | dx x p d |x p C
x p |x p
d dx x, p C
C E
g
) ( ) ( )

(
) ( ) (
) ( )

(
)

( )

(
)

(
} {

=
=
=
=
=


R
must minimize this for each x value
So for a given desired cost function
you have to find the form of the optimal estimator
7
The Optimal Estimates for the Typical Costs
1. Quadratic:
( ) )

(
2
Bmse E =
)
`

= R
x
x
| ( of mean
} | {


p
E
=
=
As we saw in Ch. 10
2. Absolute:
{ }

( = E R ) | ( of median

x p =
3. Hit-or-Miss:
) | ( of mode

x p =
p(|x)

Mode
Median
Mean
If p(|x) is unimodal & symmetric
mean = median = mode
Maximum A Posteriori
or MAP
8
Derivation for Absolute Cost Function


| where region

| where region

) | ( )

( ) | ( )

(
) | ( |

| )

(
=

+ =
=
d p d p
d p g
x x
x
Writing out the function to be minimized gives:
Now set 0

(
=

g
and use Leibnitzs rule for

) (
) (
2
1
) , (
u
u
dv v u h
u
0 ) | ( ) | (

d p d p x x
which is satisfied if (area to the left) = (area to the right)
Median of conditional PDF
9
Derivation for Hit-or-Miss Cost Function


=
+ =
=






) x | ( 1
) x | ( 1 ) x | ( 1
) x | ( )

( )

(
d p
d p d p
d p C g
Writing out the function to be minimized gives:
Almost all the probability
= 1 left out
Maximize this integral
So center the integral around peak of integrand
Mode of conditional PDF
10
11.4 MMSE Estimators
Weve already seen the solution for the scalar parameter case
x
x
| ( of mean
} | {


p
E
=
=
Here well look at:
Extension to the vector parameter case
Analysis of Useful Properties
11
Vector MMSE Estimator
The criterion is minimize the MSE for each component
Vector Parameter: | |
T
p

2 1
=
Vector Estimate:
| |
T
p


2 1
=
is chosen to minimize each of the MSE elements:

=
i i i i i i
d d p E x x ) , ( )

( } )

{(
2 2
= p(x, ) integrated
over all other
j
s

=
=
=
x x
x x
x x
d d p
d d d p
d d p
i
p p i
i i i i
) , (
) , , , (
) , (

1 1




From the scalar case we know the solution is:
} | {

x
i i
E =
12
So putting all these into a vector gives:
| |
| |
| | { } x
x x x

|
} | { } | { } | {

2 1
2 1
2 1
T
p
T
p
T
p
E
E E E


=
=
=
{ } x |

E =
Vector MMSE Estimate
= Vector Conditional Mean
Similarly
| | p i d p Bmse
ii
i
, , 1 ) ( C )

(
|
= =

x x
x

where
| || | { }
T
E E E } | { } | {
| |
x x C
x x
=
13
Ex. 11.1 Bayesian Fourier Analysis
Signal model is: x[n] = acos(2f
o
n) + bsin(2f
o
n) + w[n]
AWGN
w/ zero mean and
2
) , ( ~
2
I 0

N
b
a
(
(

=
and w[n] are independent for each n
This is a common propagation model called Rayleigh Fading
Write in matrix form: x = H + w Bayesian Linear Model
(
(
(
(



= sine cosine H
14
1
2 2 2
1
2 2
1 1
} | {


(
(

+ =
(
(

+ = =


H H
I C
x H H H
I x
x |
T T T
E
Results from Ch. 10 show that
For f
o
chosen such that H has orthogonal columns then
x H x
T
E
(
(
(
(

+
= =
2 2
2
1 1
1
} | {

(
(

=
(
(

=
1
0
1
0
) 2 sin( ] [
2

) 2 cos( ] [
2

N
n
o
N
n
o
n f n x
N
b
n f n x
N
a


2
2
/ 2
1
1

N
+
=
Fourier Coefficients in the Brackets
Recall: Same form as classical result, except there = 1
Note: 1 if

2
>> 2
2
/N
if prior knowledge is poor, this degrades to classical
15
Impact of Poor Prior Knowledge
Conclusion: For poor prior knowledge in Bayesian Linear Model
MMSE Est. MVU Est.
Can see this holds in general: Recall that
| | | |
w w
H x C H H C H C x + + + = =

1
1
1 1
} | {

T T
E
For no prior information:
0 C


1
and 0

| | x C H H C H
w w
1
1
1

T T
MVUE for General Linear Model
16
Useful Properties of MMSE Est.
1. Commutes over affine mappings:
If we have = A + b then b A + =

2. Additive Property for independent data sets


Assume , x
1
, x
2
are jointly Gaussian w/ x
1
and x
2
independent
}] { [ }] { [ } {

2 2
1
1 1
1
2 2 1 1
x x C C x x C C
x x x x
E E E + + =

a priori Estimate
Update due to x
1
Update due to x
2
Proof: Let x = [x
1
T
x
2
T
]
T
. The jointly Gaussian assumption gives:
| |
(

(
(

+ =
+ =

} {
} {
0
0
} {
}] { [ } {

2 2
1 1
1
1
1
2
1
2 1
x x
x x
C
C
C C
x x C C
E
E
E
E E
x
x
x x
x x

Indep. Block Diagonal


Simplify to
get the result
Will be used for
Kalman Filter
3. Jointly Gaussian case leads to a linear estimator:
m Px + =

1
11.5 MAP Estimator
Recall that the hit-or-miss cost function gave the MAP
estimator it maximizes the a posteriori PDF
Q: Given that the MMSE estimator is the most natural one
why would we consider the MAP estimator?
A: If x and are not jointly Gaussian, the form for MMSE estimate
requires integration to find the conditional mean.
MAP avoids this Computational Problem!
Note: MAP doesnt require this integration
Trade natural criterion vs. computational ease
What else do you gain? More flexibility to choose the prior PDF
2
Notation and Form for MAP
) | ( max arg

p
MAP
=
MAP

Notation: maximizes the posterior PDF


arg max extracts the value of
that causes the maximum
Equivalent Form (via Bayes Rule):
)] ( ) | ( [ max arg

p p
MAP
x =
Proof: Use
) (
) ( ) | (
) | (
x
x
x
p
p p
p

=
)] ( ) | ( [ max arg
) (
) ( ) | (
max arg


p p
p
p p
MAP
x
x
x
=
(

=
Does not depend on
3
Vector MAP
< Not as straight-forward as vector extension for MMSE >
The obvious extension leads to problems:
i

Choose to minimize
}}

( { )

(
i i i
C E = R
Exp. over p(x,
i
)

) | ( max arg

x
i i
p
i

=
1-D marginal
conditioned on x
Need to integrate to get it!!
Problem: The whole point of MAP was to
avoid doing the integration needed in MMSE!!!
Is there a way around this?
Can we find an Integration-Free Vector MAP?
p
d d p p " "
2 1
) | ( ) | (

= x x
4
Circular Hit-or-Miss Cost Function
Not in Book
First look at the p-dimensional cost function for this troubling
version of a vector map:
It consists of p individual applications of 1-D Hit-or-Miss

-
-

=
square in not ) , ( , 1
square in ) , ( , 0
) , (
2 1
2 1
2 1


C
The corners of the square let too much in use a circle!

<
=

, 1
, 0
) ( C
This actually seems more natural than the square cost function!!!
5
MAP Estimate using Circular Hit-or-Miss
Back to
Book
So what vector Bayesian estimator comes from using this
circular hit-or-miss cost function?
Can show that it is the following Vector MAP
) | ( max arg

p
MAP
=
Does Not Require
Integration!!!
That is find the maximum of the joint conditional PDF
in all
i
conditioned on x
6
How Do These Vector MAP Versions Compare
In general: They are NOT the Same!!
Example: p = 2
p(
1,

2
| x)
1/6
1/3
1/6

2
1 2 3 4 5
1
2
The vector MAP using Circular Hit-or-Miss is: | |
T
5 . 0 5 . 2

=
To find the vector MAP using the element-wise maximization:

1
p(
1
|x)
1 2 3 4 5
1/6
1/3

2
p(
2
|x)
1 2
1/3
2/3
| |
T
5 . 1 5 . 2

=
7
Bayesian MLE
Recall As we keep getting good data, p(|x) becomes more
concentrated as a function of . Butsince:
)] ( ) | ( [ max arg ) | ( max arg

x x

p p p
MAP
= =
p(x|) should also become more concentrated as a function of .
p(x|)
p()

Note that the prior PDF is nearly


constant where p(x|) is non-zero
This becomes truer as N , and
p(x|) gets more concentrated
) | ( max arg )] ( ) | ( [ max arg x x

p p p
MAP Bayesian MLE
Uses conditional PDF rather
than the parameterized PDF
8
11.6 Performance Characterization
The performance of Bayesian estimators is characterized by looking
at the estimation error:

=
Random (due to
a priori PDF)
Random (due to x)
Performance characterized by errors PDF p()
Well focus on Mean and Variance
If is Gaussian then these tell the whole story
This will be the case for the Bayesian Linear Model
(see Thm. 10.3)
Well also concentrate on the MMSE Estimator
9
Performance of Scalar MMSE Estimator

=
=


d p
E
) | (
} | {

x
x
The estimator is:
Function of x
So the estimation error is: ) , ( } | { x x f E = =
Function of
two RVs
General Result for a function of two RVs: Z = f (X, Y)
dy dx y x p y x f Z E
XY
) , ( ) , ( } {

=
{ } dy dx y x p Z E y x f Z E Z E Z
XY
) , ( }) { ) , ( ( }) { ( } var{
2 2

= =
10
Evaluated as
seen below
So applying the mean result gives:
{ }
{ }
{ }
{ } 0 0
} | { ] [
}] | { [ ] [
}] | { [
} | { } {
|
} | {
| |
|
,
= =
=

=
=
=
x
x x
x
x x x
x x
x
x
x
x
x
E
E E E
E E E E
E E E
E E E
E




See Chart on
Decomposing Joint
Expectations in
Notes on 2 RVs
Pass E
|x
through
the terms
Two Notations for
the same thing
} | { ) | ( } | {
) | ( } | { }] | { [
|
|
on depend
not does
|
x x x
x x x
x
x x

E d p E
d p E E E
= =
=


0 } { = E
i.e., the Mean of the
Estimation Error (over data
& parm) is Zero!!!!
11
And applying the variance result gives:
{ } { } { }
)

(
) , ( )

(
)

( }) { ( } var{
,
2
2 2 2

Bmse
d d p
E E E E
=
=
= = =

x x
x
Use
E{} = 0
So the MMSE estimation error has:
mean = 0
var = Bmse
Sowhen we minimize Bmse
we are minimizing the variance
of the estimate
If is Gaussian then ( ) )

( , 0 ~ Bmse N
12
Ex. 11.6: DC Level in WGN w/ Gaussian Prior
We saw that
2 2
/ 1 /
1
)

(
A
N
A Bmse
+
=
A
A A
A
N
N
x
N
A




|
|
.
|

\
|
+
+
|
|
.
|

\
|
+
=
/
/
/

2 2
2
2 2
2
with
constant constant
SoA

is Gaussian
|
|
.
|

\
|
+
2 2
/ 1 /
1
, 0 ~
A
N
N

Note: As N gets large this PDF collapses around 0.


This estimate is consistent in the Bayesian sense
Bayesian Consistency: For large N
(regardless of the realization of A!)
A A

this is Gaussian because it is a


linear combo of the jointly
Gaussian data samples
If X is Gaussian then
Y = aX + b
is also Gaussian
13
Performance of Vector MMSE Estimator

= Vector estimation error: The mean result is obvious.


Must extend the variance result:

x,
M C

} { } cov{

= = =
T
E
Some New Notation
Bayesian Mean Square
Error Matrix
Look some more at this:
} {
} {
} {
|
|

} { ] ][ [
] ][ [
} | { } | {
} | { } | {
x x
x x
x,

x x
x x M
C E
E E E E
E E E
T
T
=
=
=
0 = } { E
} {
|

x x

M C C E = =
General Vector Results:
See Chart on
Decomposing
Joint
Expectations
= C
|x
In general this is
a function of x
14

The Diagonal Elements of are Bmses of the Estimates


| |
) (
) , ( }] | { [
) , ( }] | { [ } {
2
2
1
i
i i i i
i i
ii
T
Bmse
d d p E
d d p E E
i
p


=
=
=


x
x
x,
x x x
x x x "
To see this:
Why do we call the error covariance the Bayesian MSE Matrix?
Integrate over
all the other
parameters
marginalizing
the PDF
15
Perf. of MMSE Est. for Jointly Gaussian Case
Let the data vector x and the parameter vector be jointly Gaussian.
0 = } { E Nothing new to say about the mean result:
Now look at the Error Covariance (i.e., Bayesian MSq Matrix):
} {
|

x x

M C C E = =
Recall General Result:
Thm 10.2 says that for Jointly Gaussian Vectors we get that
C
|x
does NOT depend on x
x x x

M C
| |

} { C C E = = =
x x x
x

C C C C
M C
1
|

=
= = C
Thm 10.2 also gives the form as:
16
Perf. of MMSE Est. for Bayesian Linear Model
w H x + =
~N(

,C

)
~N(0,C
w
)
Recall the model:
0 = } { E Nothing new to say about the mean result:
Now for the error covariance this is nothing more than a
special case of the jointly Gaussian case we just saw:
x x x
x

C C C C
M C
1
|

=
= = C
Results for Jointly Gaussian Case
T
H C C
x
=
w x
C H HC C + =
T
Evaluations for Bayesian Linear
( )
( )
1
1 1
1
|

+ =
+ = = =
H C H C
HC C H HC H C C M C
w
w x

T
T T
C
Alternate Form
see (10.33)
17
Summary of MMSE Est. Error Results
1. For all cases: Est. Error is zero mean
0 = } { E
2. Error Covariance for three Nested Cases:
} {
|

x x

M C C E = =
| |
ii
i
Bmse

M

) ( =
General Case:
Jointly Gaussian:
x x x x

C C C C M C
1
|


= = = C
Bayesian Linear:
Jointly Gaussian
& Linear Observation
( )
( )
1
1 1
1
|

+ =
+ = = =
H C H C
HC C H HC H C C M C
w
w x

T
T T
C
18
Main Bayesian Approaches
MAP
Hit-or-Miss
Cost Function
MMSE
Squared Cost Function
(In General: Nonlinear Estimate)
{ }
{ }
x x

C M
x
|

: Cov. Err.

: Estimate
E
E
=
=
Jointly Gaussian x and
(Yields Linear Estimate)
{ } { } ( )
x xx x

xx x
C C C C M
x x C C
1

1
: Cov. Err.

: Estimate

=
+ = E E
Bayesian Linear Model
(Yields Linear Estimate)
{ } ( ) ( )
( )


HC C H HC H C C M
H x C H HC H C
1

1
: Cov. Err.

: Estimate

+ =
+ + =
w
T T
w
T T
E
) | ( max arg

: Estimate x

p =
Hard to Implement
numerical integration
Easy to Implement
Performance Analysis
is Challenging
Easier to Implement
Determining C
x
can be
hard to find
Easy to Implement
Only need accurate
model: C

, C
w
, H
19
11.7 Example: Bayesian Deconvolution
This example shows the power of Bayesian approaches over
classical methods in signal estimation problems (i.e. estimating the
signal rather than some parameters)
h(t)
s(t)
w(t)
x(t)
Model as a
zero-mean
WSS
Gaussian
Process w/
known
ACF R
s
()
Assumed
Known
Gaussian
Bandlimited
White Noise
w/ Known
Variance
Measured Data =
Samples of x(t)
Goal: Observe x(t) & Estimate s(t)
Note: At Output
s(t) is Smeared & Noisy
So model as D-T System
20
Sampled-Data Formulation
(
(
(
(
(
(

+
(
(
(
(
(
(

(
(
(
(
(
(


=
(
(
(
(
(
(

] 1 [
] 1 [
] 0 [
] 1 [
] 1 [
] 0 [
] [ ] 1 [ ] 1 [
0 0 ] 0 [ ] 1 [
0 0 0 ] 0 [
] 1 [
] 1 [
] 0 [
N w
w
w
n s
s
s
n N h N h N h
h h
h
N x
x
x
s s
# #
" "
# % % # #
"
"
#
Measured
Data Vector x
Known Observation
Matrix H
Signal Vector s
to Estimate
AWGN: w
C
w
=
2
I
We have modeled s(t) as zero-mean WSS process with known ACF
So s[n] is a D-T WSS process with known ACF R
s
[m]
So vector s has a known covariance matrix (Toeplitz & Symmetric) given by:
(
(
(
(
(
(
(
(

=
] 0 [ ] 1 [ ] 2 [ ] 1 [
] 1 [
] 2 [ ] 0 [ ] 1 [ ] 2 [
] 1 [ ] 0 [ ] 1 [
] 1 [ ] 2 [ ] 1 [ ] 0 [
s s s s s
s
s s s s
s s s
s s s s s
R R R n R
R
R R R R
R R R
n R R R R
"
% % % #
%
# %
"
s
C
Model for Prior PDF is
then s ~ N(0,C
s
)
s and w are independent
21
MMSE Solution for Deconvolution
We have the case of the Bayesian Linear Model so:
( ) x I H HC H C s
s s
1
2


+ =
T T
Note that this is a linear estimate
This matrix is called The Weiner Filter
( )
1
2 1

+ = = H H C M C
s s
T
The performance of the filter is characterized by:
22
Sub-Example: No Inverse Filtering, Noise Only
Direct observation of s with H = I x = s + w

s(t)
w(t)
x(t)
Goal: Observe x(t) & De-Noise s(t)
Note: At Output s(t) with Noise
( ) x I C C s
s s
1
2


+ =
( )
1
2 1

+ = = I C M C
s s
Note: Dimensionality Problem # of parms = # of observations
Classical Methods Fail x s =

Bayesian methods can solve it!!


For insight consider single sample case:
) (
] 0 [
] 0 [
1
] 0 [
] 0 [
] 0 [
] 0 [ s

2 2
SNR
R
x x
R
R
s
s
s

=
+
=
+
=
0 ] 0 [

] 0 [ ] 0 [

s
SNR Low
x s
SNR High
Data Driven Prior PDF Driven
23
Sub-Sub-Example: Specific Signal Model
Direct observation of s with H = I x = s + w
But here the signal follows a specific random signal model
] [ ] 1 [ ] [
1
n u n s a n s + =
u[n] is White Gaussian
Driving Process
This is a 1
st
-order auto-regressive model: AR(1)
Such a random signal has an ACF & PSD of
| |
1
2
1
2
) (
1
] [
k
u
s
a
a
k R
(
(

=

2
2
1
2
1
) (
f j
u
s
e a
f P

+
=
See Figures 11.9
& 11.10 in the
Textbook
1
Ch. 12
Linear Bayesian Estimators
2
Introduction
In chapter 11 we saw:
the MMSE estimator takes a simple form when x and are
jointly Gaussian it is linear and used only the 1
st
and 2
nd
order
moments (means and covariances).
Without the Gaussian assumption, the General MMSE estimator
requires integrations to implement undesirable!
So what to do if we cant assume Gaussian but want MMSE?
Keep the MMSE criteria
Butrestrict the form of the estimator to be LINEAR
LMMSE Estimator
Something
similar to
BLUE!
LMMSE Estimator = Wiener Filter
3
Bayesian Approaches
MAP
Hit-or-Miss
Cost Function
{ }
{ }
x x

C M
x
|
: Cov. Err.

: Estimate
E
E
=
=
MMSE
Squared Cost Function
(Nonlinear Estimate)
Other
Cost
Functions
LMMSE
Force Linear Estimate
Known: E{},E{x}, C
Jointly Gaussian x and
(Yields Linear Estimate)
{ } { } ( )
x xx x

xx x
C C C C M
x x C C
1

1
: Cov. Err.

: Estimate

=
+ = E E { } { } ( )
x xx x

xx x
C C C C M
x x C C
1

1
: Cov. Err.

=
+ = E E Estimate
Same!
Bayesian Linear Model
(Yields Linear Estimate)
{ } ( ) ( )
( )


HC C H HC H C C M
H x C H HC H C
1

1
: Cov. Err.

: Estimate

+ =
+ + =
w
T T
w
T T
E
4
12.3 Linear MMSE Estimator Solution
Scalar Parameter Case:
Estimate: , a random variable realization
Given: data vector x = [x[0] x[1] . . .x[N-1] ]
T
Assume:
Joint PDF p(x, ) is unknown
Butits 1
st
two moments are known
There is some statistical dependence between x and
E.g., Could estimate = salary using x = 10 past years taxes owed
E.g., Cant estimate = salary using x = 10 past years number of Christmas
cards sent
Goal: Make the best possible estimate while using an affine form
for the estimator

=
+ =
1
0
] [

N
n
N n
a n x a
Handles Non-
Zero Mean Case
} )

{( )

(
2


=
x
E Bmse Choose {a
n
} to minimize
5
Derivation of Optimal LMMSE Coefficients
Using the desired affine form of the estimator, the Bmse is

(
(

+ =

=
2
1
0
] [ )

(
N
n
N n
a n x a E Bmse
0
)

(
=

N
a
Bmse
0 } ] [ { 2
1
0
= +

=
N
n
N n
a n x a E
Step #1: Focus on a
N
Passing /a
N
through E{} gives

=
=
1
0
} ] [ { } {
N
n
n N
n x E a E a
Note: a
N
= 0 if E{ } = E{x[n]} = 0
6
Step #2: Plug-In Step #1 Result for a
N

(
(

(
(

=
2
2
1
0
}) { ( }) { (
}) { ( ]}) [ { ] [ ( )

(
!" !# $ ! !" ! !# $
scalar scalar
T
N
n
n
E E E
E n x E n x a E Bmse


x x a
where a = [a
0
a
1
. . . a
N-1
]
T
Only up to N-1
Note: a
T
(x E{x}) = (x E{x})
T
a since it is scalar
7
Thus, expanding out [a
T
(x E{x}) ( E{ })]
2
gives
{ }
{ }

T T
T
T T
T T
c
Etc
Etc E E E
Etc E E E Bmse
+ =
+ =
+ =
+ =
a c c a a C a
a C a
a x x x x a
a x x x x a
x x xx
xx
.
. }) { })( { (
. }) { })( { ( )

(
NN
N1
1N
11

T

E E
x x
x x
c c
x c x c
=
= = } { } {
cross-covariance
vectors

T T
c Bmse + =
x xx
c a a C a 2 )

(
8
Step #3: Minimize w.r.t. a
1
, a
2
, , a
N-1
Only up to N-1
0
)

(
=

a
Bmse
x xx
c C a
1
=
1
=
xx x
C c a

T
0 c a C
x xx
=

2 2
This is where the statistical dependence
between the data and the parameter is
usedvia a cross-covariance vector
Step #4: Combine Results
| | ( ) } { } { } { } {
] [

1
0
x x a x a x a E E E E
a n x a
T T T
N
n
N n
+ = + =
+ =

So t he Opt imal LMMSE Est imat e is:


x C c
xx x
1

( ) } { } {

1
x x C c
xx x
E E

+ =


If Means = 0
Not e: LMMSE Est imat e Only Needs 1
st
and 2
nd
Moment snot PDFs!!
9
Step #5: Find Minimum Bmse
Substitute into Bmse result and simplify:



T T
c
c
c Bmse
+ =
+ =
+ =


x xx x x xx x
x xx x x xx xx xx x
x xx
c C c c C c
c C c c C C C c
c a a C a
1 1
1 1 1
2
2
2 )

(

c Bmse
x xx x
c C c
1
)

(

=
Note: If and x are statistically independent then C
x
= 0
} {

E =
Totally based on
prior info the
data is useless

c Bmse = )

(
10
Ex. 12.1 DC Level in WGN with Uniform Prior
Recall: Uniform prior gave a non-closed form requiring integration
but changing to a Gaussian prior fixed this.
Here we keep the uniform prior and get a simple form:
by using the Linear MMSE
x C c
xx x
1


=
A
A For this problem the LMMSE estimate is:
( )( ) { }
I 11
w 1 w 1 C
xx
2 2
+ =
+ + =
T
A
T
A A E
( ) { }
T
A
T

A A E A E
1
w 1 x c
x
2
} {
=
+ = =
Need
A & w are
uncorrelated
A & w are
uncorrelated
x
N
A
A
A
(
(

+
=
/

2 2
2


11
12.4 Geometrical Interpretations
Abstract Vector Space
Mathematicians first tackled physical vector spaces like R
N
and C
N
, etc.
But then abstracted the bare essence of these structures into
the general idea of a vector space.
Weve seen that we can interpret Linear LS in terms of Physical
vector spaces.
Well now see that we can interpret Linear MMSE in terms of
Abstract vector space ideas.
12
Abstract Vector Space Rules
An abstract vector space consists of a set of mathematical objects
called vectors and another set called scalars that obey:
1. There is a well-defined operation of addition of vectors that
gives a vector in the set, and
Adding is commutative and associative
There is a vector in the set call it 0 for which adding it to any
vector in the set gives back that same vector
For every vector there is another vector s.t. when the 2 are added you get
the 0 vector
2. There is a well-defined operation of multiplying a vector by a
scalar and it gives a vector in the set, and
Multiplying is associative
Multiplying a vector by the scalar 1 gives back the same vector
3. The distributive property holds
Multiplication distributes over vector addition
Multiplication distributes over scalar addition
13
Examples of Abstract Vector Spaces
1. Scalars = Real Numbers
Vectors = N
th
Degree Polynomials w/ Real Coefficients
2. Scalars = Real Numbers
Vectors = MN Matrices of Real Numbers
3. Scalars = Real Numbers
Vectors = Functions from [0,1] to R
4. Scalars = Real Numbers
Vectors = Real-Valued Random Variables with Zero Mean
Colliding Terminology
a scalar RV is a vector!!!
14
There is a well-defined concept of inner product s.t. all the rules of
ordinary inner product still hold
<x,y> = <y, x>
*
<a
1
x
1
+ a
2
x
2
,y> = a
1
<x
1
,y > + a
2
<x
2
,y>
<x,x> 0; <x,x> = 0 iff x = 0
Note: an inner product induces a norm (or length measure):
||x||
2
= <x,x>
So an inner product space has:
1. Two sets of elements: Vectors and Scalars
2. Algebraic Structure (Vector Addition & Scalar Multiplication)
3. Geometric Structure
Direction (Inner Product)
Distance (Norm)
Not needed for Real IP Spaces
Inner Product Spaces
An extension of the idea of Vector Space must also have:
15
Inner Product Space of Random Variables
Vectors: Set of all real RVs w/ zero mean & finite variance (ZMFV)
Scalars: Set of all real numbers
Inner Product: <X,Y> = E{XY}
Claim This is an Inner Product Space
Inner Product is Correlation!
Uncorrelated = Orthogonal
First this is a vector space
Addition Properties: X+Y is another ZMFV RV
1. It is Associative and Commutative: X+(Y+Z) = (X+Y)+Z; X+Y = Y+X
2. The zero RV has variance of 0 (What is an RV with var = 0???)
3. The negative of RV X is X
Multiplication Properties: For any real # a, aX is another ZMFV RV
1. It is Associative: a(bX) = (ab)X
2. 1X = X
Distributive Properties:
1. a(X+Y) = aX + aY
2. (a+b)X = aX + bX
NextThis is an inner product space
<a
1
X
1
+ a
2
X
2
,Y> = E{(a
1
X
1
+ a
2
X
2
)Y}
= a
1
E{X
1
Y}+ a
2
E{X
2
Y}
||X||
2
= <X, X> = E{X
2
} = var{X} 0
16
Use IP Space Ideas for Section 12.3

=
=
1
0
] [

N
n
n
n x a
Apply to the Estimation of a zero-mean scalar RV:
Trying to estimate the realization of RV via a
linear combination of N other RVs x[0], x[1],
x[2], x[N-1]
Zero-Mean
dont need a
N
Nowusing our new vector space view of RVs, this is the same
structural mathematics that we saw for the Linear LS !
( ) ( )

2
2
Bmse E =
)
`

=
N = 2 Case
Minimize:
Each RV is
viewed as a vector

x[0]
x[1]

Connect s t o Geomet r y Connect s t o MSE


Recall Or t hogonalit y Pr inciple!!!
Est imat ion Er r or Dat a Space
0 ] [

} { ) ( = n x E
17
Now apply this Orthogonality Principle
T T
E 0 x = } { ) (


with
x a
T
=

a xx x xx a x 0 x x a } { } { } { } { } { ) (
T T T T T T T T
E E E E E = = =
x xx
c a C =
The Normal Equations
Assuming that C
xx
is invertible
x xx
c C a
1
=
x C c x a
xx x
1


= =

T
Same as bef or e!!!
18
12.5 Vector LMMSE Estimator
Meaning a Physical Vector
| |
T
p
%
2 1
=
Estimate: Realization of
Linear Estimator:
a Ax + =

Goal: Minimize Bmse for each element


View i
th
row in A and i
th
element in a as forming a scalar
LMMSE estimator for
i
Alr eady know t he individual element solut ions!
Wr it e t hem down
Combine int o mat r ix f or m
19
Solutions to Vector LMMSE
}] { [ } {

1
x x C C
xx x
E E + =

The Vector LMMSE estimate is:
Now pN Matrix
Cross-Covariance Matrix
Still NN Matrix
Covariance Matrix
x C C
xx x
1


=
If E{} = 0 & E{x} = 0
Can show similarly that Bmse Matrix is
} )

)(

{(

T
E M

=
x xx x

C C C C M
1


=
pp
prior Cov. Matrix
pN Np
NN
20
Two Properties of LMMSE Estimator
1. Commutes over affine transformations
If b A + = and

is LMMSE Estimate
Then b A + =

is LMMSE Estimate for


2. If =
1
+
2
then
2 1

+ =
21
Bayesian Gauss-Markov Theorem
Like G-M Theorem
for the BLUE
w H x + = Let the data be modeled as
known
p1 random
mean

Cov Mat C

(Not Gaussian)
N1 random
zero mean
Cov Mat C
w
(Not Gaussian)
( ) ] [

1
w
H x C H HC H C + + =

T T
Application of previous results, evaluated for this data model gives:
( )
w
HC C H HC H C C C
1
+ =
T T
MMSE Matrix:

C M =

Same forms as for Bayesian Linear Model (which include Gaussian assumption)
Except here the result is suboptimal unless the optimal estimate is linear
In practice generally dont know if linear estimate is optimal but we use
LMMSE for its simple form!
The challenge is to guess or estimate the needed means & cov matrices
1
12.6 Sequential LMMSE Estimation
Same kind if setting as for Sequential LS
Fixed number of parameters (but here they are modeled as random)
Increasing number of data samples
] [ ] [ ] [ n n n w H x + =
Data Model:
(n+1)1
x[n] = [x[0] x[n]]
T
p1
unknown PDF
known mean & cov
(n+1)1
w[n] = [w[0] w[n]]
T
unknown PDF
known mean & cov
C
w
must be diagonal
with elements
2
n
& w are uncorrelated
(
(


=
] [
] 1 [
] [
n
n
n
T
h
H
H
(n+1)p
known
Goal: Given an estimate
] 1 [

n
based on x[n 1], when new
data sample x[n] arrives, update the estimate to
] [

n
2
Development of Sequential LMMSE Estimate
Our Approach Here: Use vector space ideas to derive solution
for DC Level in White Noise then write down general
solution.
] [ ] [ n w A n x + =
For convenience Assume both A and w[n] have zero mean
Given x[0] we can find the LMMSE estimate
] 0 [ ] 0 [
} ]) [ {(
])} [ ( {
] 0 [
]} 0 [ {
]} 0 [ {

2 2
2
2 2
0
x x
n w A E
n w A A E
x
x E
Ax E
A
A
A
(
(

+
=
(
(

+
+
=
(
(

=


Now we seek to sequentially update this estimate
with the info from x[1]
3
From Vector Space View:
A
x[0]
x[1]
0

A
1

A
First project x[1] onto x[0] to get
] 0 | 1 [

x
Estimate new data
given old data
Prediction!
Notation: the estimate
at 1 based on 0
Use Orthogonality Principle
] 0 | 1 [

] 1 [ ] 1 [
~
x x x =

is to x[0]
This is the new, non-redundant info provided by data x[1]
It is called the innovation
x[0]
x[1]
0

A
] 1 [
~
x
4
Find Estimation Update by Projecting A onto Innovation
] 1 [
~
]} 1 [
~
{
]} 1 [
~
{
] 1 [
~
] 1 [
~
] 1 [
~
,
] 1 [
~
] 1 [
~
] 1 [
~
] 1 [
~
,

2 2
1
x
x E
x A E
x
x
x
A
x
x
x
x
A A
(
(

= = =
Gain: k
1
Recall Property: Two Estimates from data just add:
] 0 [ ] 1 [
~
x x
] 1 [
~


1 0
1 0 1
x k A
A A A
+ =
+ =
| | ] 0 | 1 [

] 1 [

1 0 1
x x k A A + =
x[0]
x[1]
0

A
] 1 [
~
x
1

A
1

A
Predicted
New Data
New
Data
Old
Estimate
Gain
I nnovat ion is Old Dat a
5
The Innovations Sequence
The Innovations Sequence is
Key to the derivation & implementation of Seq. LMMSE
A sequence of orthogonal (i.e., uncorrelated) RVs
Broadly significant in Signal Processing and Controls
] 0 [ x
{ } ], 2 [
~
], 1 [
~
], 0 [
~
x x x
] 0 | 1 [

] 1 [ x x
] 1 | 2 [

] 2 [ x x
Means: Based on
ALL data up to n = 1
(inclusive)
6
General Sequential LMMSE Estimation
Initialization No Data Yet! Use Prior Information
} {

1
E =

Est imat e

C M =
1
MMSE Mat r ix
Update Loop For n = 0, 1, 2,
n n
T
n n
n n
n
h M h
h M
k
1
2
1

+
=

Gain Vect or Calculat ion


| |
1 1

] [


+ =
n
T
n n n n
n x h k
Est imat e Updat e
| |
1
=
n
T
n n n
M h k I M
MMSE Mat r ix Updat e
7
Sequential LMMSE Block Diagram
] [ ] [ ] [ n n n w H x + =
(
(


=
] [
] 1 [
] [
n
n
n
T
h
H
H
Data Model
k
n

2
1
, ,
n n n
h M

Compute
Gain
] [
~
n x
z
-1
1

+
+
x[n]

T
n
h
1

] 1 | [

=
n
T
n
n n x h
+
Innovation
Observation
( )
1 1

] [


+
n
T
n n n
n x h k
n

Delay
Updated
Estimate
Previous
Estimate
n
h
Predicted
Observation
Exact Same St r uct ur e as f or Sequent ial Linear LS!!
8
Comments on Sequential LMMSE Estimation
1. Same structure as for sequential linear LS. BUT they solve
the estimation problem under very different assumptions.
2. No matrix inversion required So computationally Efficient
3. Gain vector k
n
weighs confidence in new data (
2
n
) against all
previous data (M
n-1
)
when previous data is better, gain is small dont use new data much
when new data is better, gain is large new data is heavily used
4. If you know noise statistics
2
n
and observation rows h
n
T
over
the desired range of n:
Can run MMSE Matrix Recursion without data measurements!!!
This provides a Predictive Performance Analysis
9
12.7 Examples Wiener Filtering
During WWII, Norbert Wiener developed the mathematical ideas
that led to the Wiener filter when he was working on ways to
improve anti-aircraft guns.
He posed the problem in C-T form and sought the best linear
filter that would reduce the effect of noise in the observed A/C
trajectory.
He modeled the aircraft motion as a wide-sense stationary
random process and used the MMSE as the criterion for
optimality. The solutions were not simple and there were many
different ways of interpreting and casting the results.
The results were difficult for engineers of the time to understand.
Others (Kolmogorov, Hopf, Levinson, etc.) developed these ideas
for the D-T case and various special cases.
10
Weiner Filter: Model and Problem Statement
Signal Model: x[n] = s[n] + w[n]
Observed: Noisy Signal
Model as WSS, Zero-Mean
C
xx
= R
xx
covariance matrix
correlation
matrix
} {
T
E xx R
xx
=
} }) { })( { {(
T
E E E x x x x C
xx
=
Desired Signal
Model as WSS, Zero-Mean
C
ss
= R
ss
Noise
Model as WSS, Zero-Mean
C
ww
= R
ww
Same if zero-mean
Problem Statement: Process x[n] using a linear filter to provide
a de-noised version of the signal that has minimum MSE
relative to the desired signal
LMMSE Problem!
11
Filtering, Smoothing, Prediction
Terminology for three different ways to cast the Wiener filter problem
Filtering Smoothing Prediction
Given: x[0], x[1], , x[n] Given: x[0], x[1], , x[N-1] Given: x[0], x[1], , x[N-1]
Find: ] 1 [

, ], 1 [

], 0 [

N s s s 0 , ] [

> + l l N x Find: ] [

n s Find:
x C C
xx x
1


=
x[n]
] 0 [

s
2
1
n
] 1 [

s
] 2 [

s
] 3 [

s
3
x[n]
2
1
x[n]
] 5 [

x
2
1
4 5
Note!!
n n
3 3
] 0 [

s ] 1 [

s ] 2 [

s ] 3 [

s
All t hr ee solved using Gener al LMMSE Est .
12
Filt er ing Smoot hing
x C C
xx x
1


=
Pr edict ion
(vector) s = (scalar) ] [n s =
{ }
{ }
| |
) (vector!
~
] 0 [ ] [
] [
] [
T
ss
ss ss
T
T
r n r
n s E
n s E
r
s
x C
x
=
=
=
=
"

{ }
{ }
ww ss
T T
T
xx
E
E
R R
ww ss
w s w s C
+ =
+ =
+ + = ) )( (
| || || | 1 ) 1 ( ) 1 ( ) 1 ( ) 1 ( 1
1
) (
~
] [

+ + + +

+ =
n n n n
ww ss
T
ss
n s x R R r
{ }
{ }
{ }
) (Matrix!
) (
ss
T T
T
T
E
E
E
R
sw ss
w s s
sx C
x
=
+ =
+ =
=
(scalar) ] 1 [ l N x + =
x not s!
{ }
| |
) (vector!
~
] [ ] 1 [
] 1 [
T
xx
xx xx
T
x
l r l N r
l N x E
r
x C
=
+ =
+ =
"

{ }
{ }
ww ss
T T
T
xx
E
E
R R
ww ss
w s w s C
+ =
+ =
+ + = ) )( (
{ }
xx
T
xx
E
R
xx C
=
=
| || || | 1
1
) (

+ =
N N N N N
ww ss ss
x R R R s
| || || | 1 1
1 ~
] 1 [

= +
N N N N
xx
T
xx
l N x x R r
13
Comments on Filtering: FIR Wiener
x a x R R r
a
T
ww ss
T
ss
T
n s = + =

# # # $ # # # % &
1
) (
~
] [

| |
| |
T
n n
T
n n n
a a a
n h h h
0 1
) ( ) ( ) (
...
] [ ... ] 1 [ ] 0 [

=
= h

=
=
n
k
n
k n x k h n s
0
) (
] [ ] [ ] [

Wiener Filter as
Time-Varying FIR
Filter
Causal!
Length Grows!
Wiener - Hopf Filt er ing Equat ions
| |
T
ss ss ss ss
ss ww ss
n r r r
xx
] [ ... ] 1 [ ] 0 [
) (
=
= +
r
r h R R
R
# #$ # #% &
(
(
(
(

=
(
(
(
(
(

(
(
(
(

] [
] 1 [
] 0 [
] [
] 1 [
] 0 [
] 0 [ ] 1 [ ] [
] 1 [ ] 0 [ ] 1 [
] [ ] 1 [ ] 0 [
) (
) (
) (
Toeplitz & Symmetric
n r
r
r
n h
h
h
r n r n r
n r r r
n r r r
ss
ss
ss
n
n
n
xx xx xx
xx xx xx
xx xx xx
'
'
# # # # # # $ # # # # # # % &
"
' ( ' '
"
"
In Principle: Solve WHF Eqs for filter h at each n
In Practice: Use Levinson Recursion to Recursively Solve
14
Comments on Filtering: IIR Wiener
Can Show: as n Wiener filter becomes Time-Invariant
Thus: h
(n)
[k] h[k]
Then the Wiener-Hopf Equations become:
, 1 , 0 ] [ ] [ ] [
0
= =

=
l l r k l r k h
ss
k
xx
and these are solved using so-called Spectral Factorization
Andthe Wiener Filter becomes IIR Time-Invariant:

=
=
0
] [ ] [ ] [

k
k n x k h n s
15
Revisit the FIR Wiener: Fixed Length L

=
=
1
0
] [ ] [ ] [

L
k
k n x k h n s
] 6 [

s
The way the Wiener filter was formulated above, the length of filter
grew so that the current estimate was based on all the past data
Reformulate so that current estimate is based on only L most recent
data: x[3] x[4] x[5] x[6] x[7] x[8] x[9]
] 7 [

s
] 8 [

s
Wiener - Hopf Filt er ing Equat ions f or WSS Pr ocess w/ Fixed FI R
| |
T
ss ss ss ss
ss ww ss
n r r r
xx
] [ ... ] 1 [ ] 0 [
) (
=
= +
r
r h R R
R
# #$ # #% &
(
(
(
(
(
(

=
(
(
(
(
(
(

(
(
(
(
(
(

] [
] 1 [
] 0 [
] [
] 1 [
] 0 [
] 0 [ ] 1 [ ] [
] 1 [ ] 0 [ ] 1 [
] [ ] 1 [ ] 0 [
Toeplitz & Symmetric
n r
r
r
n h
h
h
r n r n r
n r r r
n r r r
ss
ss
ss
xx xx xx
xx xx xx
xx xx xx
' '
# # # # # # # $ # # # # # # # % &
"
' ( ' '
"
"
Solve W- H Filt er ing Eqs ONCE f or f ilt er h
16
Comments on Smoothing: FIR Smoother
Wx x R R R s
W
= + =

# # # $ # # # % &
1
) (

ww ss ss
Each row of Wlike a FIR Filter
Time-Varying
Non-Causal!
Block-Based
To interpret this Consider N=1 Case:
] 0 [
1
] 0 [
] 0 [ ] 0 [
] 0 [
] 0 [

SNR Low , 0
SNR High , 1
x
SNR
SNR
x
r r
r
s
ww ss
ss
#$ #% &

+
=
(

+
=
17
Comments on Smoothing: IIR Smoother
Estimate s[n] based on {, x[1], x[0], x[1],}

=
=
k
k n x k h n s ] [ ] [ ] [

Time-Invariant &
Non-Causal IIR Filter
The Wiener-Hopf Equations become:
< < =

=
l l r k l r k h
ss
k
xx
] [ ] [ ] [ ] [ ] [ ] [ n r n r n h
ss xx
=
) ( ) (
) (
) (
) (
) (
f P f P
f P
f P
f P
f H
ww ss
ss
xx
ss
+
=
=
Differs From Filter Case
Sum over all k
Differs From Filter Case
Solve for all l
H( f ) 1 when P
ss
( f ) >> P
ww
( f )
H( f ) 0 when P
ss
( f ) << P
ww
( f )
18
Relationship of Prediction to AR Est. & Yule-Walker
Wiener-Hopf Prediction Equations
| |
T
xx xx xx xx
xx xx
N l r l r l r ] 1 [ ... ] 1 [ ] [ + + =
=
r
r h R
(
(
(
(

+
+
=
(
(
(
(

(
(
(
(

] 1 [
] 1 [
] [
] [
] 1 [
] 0 [
] 0 [ ] 2 [ ] 1 [
] 2 [ ] 0 [ ] 1 [
] 1 [ ] 1 [ ] 0 [
Toeplitz & Symmetric
N l r
l r
l r
n h
h
h
r N r N r
N r r r
N r r r
xx
xx
xx
xx xx xx
xx xx xx
xx xx xx
' '
# # # # # # # # $ # # # # # # # # % &
"
' ( ' '
"
"
For l=1 we get EXACTLY the Yule-Walker Eqs used in
Ex. 7.18 to solve for the ML estimates of the AR parameters!!
!FIR Prediction Coefficients are estimated AR parms
Recall: we first estimated the ACF lags r
xx
[k] using the data
Then used the estimates to find estimates of the AR parameters
xx xx
r h R

=
19
Relationship of Prediction to Inverse/Whitening Filter
) ( 1
1
z a
u[k] x[k]
) (z a

AR Model
] [

k x
+

Inverse Filter: 1a(z)


FIR Pred.
Signal Observed
u[k]
White Noise
White Noise
1-Step Prediction
Imagination
&
Modeling
Physical Reality
20
Results for 1-Step Prediction: For AR(3)
0 20 40 60 80 100
-6
-4
-2
0
2
4
Sample Index, k
S
i
g
n
a
l

V
a
l
u
e
Signal
Prediction
Error
At each k we predict x[k] using past 3 samples
Application to Data Compression
Smaller Dynamic Range of Error gives More Efficient Binary Coding
(e.g., DPCM Differential Pulse Code Modulation)
1
Ch. 13
Kalman Filters
2
Introduction
In 1960, Rudolf Kalman developed a way to solve some of the
practical difficulties that arise when trying to apply Weiner filters.
There are D-T and C-T versions of the Kalman Filter we will
only consider the D-T version.
The Kalman filter is widely used in:
Control Systems
Navigation Systems
Tracking Systems
It is less widely used in signal processing applications
KF initially arose in the
field of control systems
in order to make a
system do what you
want, you must know
what it is doing now
3
The Three Keys to Leading to the Kalman Filter
Wiener Filter: LMMSE of a Signal (i.e., a Varying Parameter)
Sequential LMMSE: Sequentially Estimate a Fixed Parameter
State-Space Models: Dynamical Models for Varying Parameters
Kalman Filt er : Sequent ial LMMSE Est imat ion f or a t ime-
var ying par amet er vect or but t he t ime var iat ion is
const r ained t o f ollow a st at e- space dynamical model.
Aside: There are many ways to mathematically model dynamical systems
Differential/Difference Equations
Convolution Integral/Summation
Transfer Function via Laplace/Z transforms
State-Space Model
4
13.3 State-Variable Dynamical Models
System State: the collection of variables needed to know how to determine how
the system will exist at some future time (in the absence of an input)
For an RLC circuit you need to know all of its current capacitor voltages and all
of its current inductor currents
Motivational Example: Constant Velocity Aircraft in 2-D

=
) (
) (
) (
) (
) (
t v
t v
t r
t r
t
y
x
y
x
s
A/C positions (m)
For t he const ant velocit y model we
would const r ain v
x
(t ) & v
y
(t ) t o be
const ant s V
x
& V
y
.
A/C velocities (m/s)
If we know s(t
o
) and there is no input we know how the A/C
behaves for all future times: r
x
(t
o
+ ) = V
x
+ r
x
(t
o
)
r
x
(t
o
+ ) = r
x
(t
o
) + V
x

r
y
(t
o
+ ) = r
y
(t
o
) + V
y

5
D-T State Model for Constant Velocity A/C
Because measurements are often taken at discrete times we often
need D-T models for what are otherwise C-T systems
(This is the same as using a difference equation to approximate a differential equation)
If every increment of n corresponds to a duration of sec and
there is no driving force then we can write a D-T State Model as:

=
1 0 0 0
0 1 0 0
0 1 0
0 0 1
A
] 1 [ ] [ = n n As s
State Transition Matrix
r
x
[n] = r
x
[n-1] + v
x
[n-1]
r
y
[n] = r
y
[n-1] + v
y
[n-1]
v
x
[n] = v
x
[n-1]
v
y
[n] = v
y
[n-1]
We can include the effect of a vector input:
] [ ] 1 [ ] [ n n n Bu As s + =
Input could be deterministic and/or random.
Matrix B combines inputs & distributes them to states.
6
Thm13.1 Vector Gauss-Markov Model
Dont confuse
with the G-M
Thm. of Ch. 6
This theorem characterizes the probability model for a
specific state-space model with Gaussian Inputs
] [ ] 1 [ ] [ n n n Bu As s + =
Linear St at e Model:
n 0
p1 pp
known
pr
known
r1
s[n]: state vector is a vector Gauss-Markov process
A: state transition matrix; assumed |
i
| < 1 for stability
B: input matrix
u[n]: driving noise is vector WGN w/ zero mean
s[-1]: initial state ~ N(
s
,C
s
) and independent of u[n]
eigenvalues
u[n] ~ N(0,Q)
E{u[n] u
T
[m]} = 0, n m
7
Theor em:
s[n] for n 0 is Gaussian with the following characteristics
Mean of state vector is
{ }
s
A s
1
] [
+
=
n
n E
diverges if e-values
have |
i
| 1
Covariance between state vectors at m and n is
{ }
( )

=
+ + +
+ =
=
m
n m k
T k n-m T k
T
n m
T
n E n m E m E n m n m
) (
]}] [ { ] [ ]}][ [ { ] [ [ ] , [ : for
1 1
A BQB A A C A
s s s s C
s
s
] , [ ] , [ : for m n n m n m
T
s s
C C = <
State Process is Not WSS!
Covariance Matrix: C[n] = C
s
[n,n] (this is just notation)
Propagation of Mean & Covariance:
{ } { }
T T
n n
n E n E
BQB A AC C
s A s
+ =
=
] 1 [ ] [
] 1 [ ] [
8
Pr oof : (only for the scalar case: p = 1)
For the scalar case the model is: s[n] = a s[n-1] + b u[n] n 0
differs a bit from (13.1) etc.
Now we can just iterate this model and surmise its general form:
] 0 [ ] 1 [ ] 0 [ bu as s + =
] 1 [ ] 0 [ ] 1 [
] 1 [ ] 0 [ ] 1 [
2
bu abu s a
bu as s
+ + =
+ =
] 2 [ ] 1 [ ] 0 [ ] 1 [
] 2 [ ] 1 [ ] 2 [
2 3
bu abu bu a s a
bu as s
+ + + =
+ =

=
+
+ =
n
k
k n
k n bu a s a n s
0
1
] [ ] 1 [ ] [
!
Now easy to find the mean:
{ } { } { }
s
n
n
k
k n
a
k n u E b a s E a n s E
s

1
0
0
1
] [ ] 1 [ ] [
+
=
= =
+
=
+ =

"# "$ % "# "$ %
as claimed!
z.i. response
exponential
z.s. response
convolution
9
Covariance between s[m] and s[n] is:
{ }

= =
+ =
+ +
=
+
=
+
+ +
+ =
+
+ =
=
m
k
m
l
l
k m n l
k
s
n m
n
l
l
s
n
m
k
k
s
m
T
s
n
s
m
s
ba l n u k m u E b a a a
l n bu a n s a
k m bu a m s a E
a n s a m s E n m C
u
0 0
)] ( [
2 1 1
0
1
0
1
1 1
2
]} [ ] [ {
] [ ] ] [ [
] [ ] ] [ [
] ] [ ][ ] [ [ ] , [
}
{
) (
) (
" " " # " " " $ %


Must use different
dummy variables!!
Cross-terms will
be zero why?

=
+ + +
+ =
m
n m k
k m n
u
k
s
n m
s
ba b a a a n m C
2 2 1 1
] , [
For m n:
For m < n:
] , [ ] , [ m n C n m C
s s
=
10
For mean & cov. propagation: from s[n] = a s[n 1] + b u[n]
"# "$ % " " " " # " " " " $ %
0 theorem in as propagates
]} [ { ]} 1 [ { ]} [ {
=
+ = n u E b n s aE n s E
{ }
{ }
{ } { }b n u E b a n s E n s E a
n s aE n bu n as E
n s E n s E n s
u
n s
"# "$ % " " " " " # " " " " " $ %
2
] [ ]}) 1 [ { ] 1 [ (
]}) 1 [ { ] [ ] 1 [ (
]}) [ { ] [ ( ]} [ var{
2
]} 1 [ var{
2
2
2
=
=
+ =
+ =
=
which propagates as in theorem
< End of Proof >
So we now have:
Random Dynamical Model (A St at e Model)
St at ist ical Char act er izat ion of it
11
Random Model for Constant Velocity A/C

=
] [
] [
0
0
] 1 [
] 1 [
] 1 [
] 1 [
1 0 0 0
0 1 0 0
0 1 0
0 0 1
] [
] [
] [
] [
] [
n u
n u
n v
n v
n r
n r
n v
n v
n r
n r
n
y
x
y
x
y
x
y
x
y
x
s
Deterministic Propagation
of Constant-Velocity
Random Perturbation
of Constant Velocities

=
2
2
0 0 0
0 0 0
0 0 0 0
0 0 0 0
]} [ cov{
u
u
n

u
12
0 2000 4000 6000 8000 10000 12000 14000 16000 18000
0
2000
4000
6000
8000
10000
12000
14000
X position (m)
Y

p
o
s
i
t
i
o
n

(
m
)
m/s 100 ] 1 [ ] 1 [
m 0 ] 1 [ ] 1 [
m/s 5
sec 1
= =
= =
=
=
y x
y x
u
v v
r r

Ex. Set of Constant-Velocity A/C Trajectories


Red Line is Non-Random
Constant Velocity Trajectory
Acceleration of
(5 m/s)/1s = 5m/s
2
13
Observation Model
So we have a random state-variable model for the dynamics of
the signal ( the signal is often some true A/C trajectory)
We need to have some observations (i.e., measurements) of the
signal
In Navigation Systems inertial sensors make noisy measurements
at intervals of time
In Tracking Systems sensing systems make noisy measurements
(e.g., range and angles) at intervals of time
] [ ] [ ] [ ] [ n n n n w s H x + = Linear Obser vat ion Model:
Measured observation
vector at each time
allows multiple
measurements at each time
Observation Matrix
can change w/ time
State Vector Process
being observed
Vector Noise
Process
14
The Estimation Problem
Observe a Sequence of Observation Vectors {x[0], x[1], x[n]}
Compute an Estimate of the State Vector s[n]
] | [

n n s
estimate state at n
using observation up to n
Notation: ] | [

m n s = Estimate of s[n] using {x[0], x[1], x[m]}


Want Recur sive Solut ion:
Given: ] | [

n n s
and a new observation vector x[n + 1]
Find: ] 1 | 1 [

+ + n n s
Thr ee Cases of I nt er est :
Scalar St at e Scalar Obser vat ion
Vect or St at e Scalar Obser vat ion
Vect or St at e Vect or Obser vat ion
1
13.4 Scalar Kalman Filter
Data Model
To derive the Kalman filter we need the data model:
> < + =
> < + =
Equation n Observatio ] [ ] [ ] [
Equation State ] [ ] 1 [ ] [
n w n s n x
n u n as n s
Assumptions
1. u[n] is zero mean Gaussian, White,
2 2
]} [ {
u
n u E =
2. w[n] is zero mean Gaussian, White,
2 2
]} [ {
n
n w E =
3. The initial state is ) , ( ~ ] 1 [
2
s s
N s
4. u[n], w[n], and s[1] are all independent of each other
Can vary
with time
To simplify the derivation: let
s
= 0 (well account for this later)
2
Goal and Two Properties
{ } ] [ , ], 1 [ ], 0 [ | ] [ ] | [

n x x x n s E n n s = Goal: Recur sively compute


| |
T
n x x x n ] [ , ], 1 [ ], 0 [ ] [ = X
Notation:
X[n] is set of all observations
x[n] is a single vector-observation
Two Properties We Need
1. For the jointly Gaussian case, the MMSE estimator of zero mean
based on two uncorrelated data vectors x
1
& x
2
is (see p. 350 of
text)
} | { } | { } , | {

2 1 2 1
x x x x E E E + = =
2. If =
1
+
2
then the MSEE estimator is
} | { } | { } | { } | {

2 1 2 1
x x x x E E E E + = + = =
(a result of the linearity of E{.} operator)
3
Derivation of Scalar Kalman Filter
Innovation: ] 1 | [

] [ ] [
~
= n n x n x n x
Recall from Section 12.6
MMSE estimate of
x[n] given X[n 1]
(prediction!!)
By MMSE Orthogonality Principle
{ } 0 X = ] 1 [ ] [
~
n n x E
data previous the with ed uncorrelat is that ] [ of part is ] [
~
n x n x
Now note: X[n] is equivalent to { } ] [
~
], 1 [ n x n X
Why? Because we can get get X[n] from it as follows:
] [
] [
] 1 [
] [
~
] 1 [
n
n x
n
n x
n
X
X X
=
(
(

(
(


"# "$ %
] 1 | [

1
0
] [ ] [
~
] [

+ =
n n x
n
k
k
k x a n x n x
4
What have we done so far?
Have shown that { } ] [
~
], 1 [ ] [ n x n n X X
Have split current data set into 2 parts:
1. Old data
2. Uncorrelated part of new data (just the new facts)
uncorrelated
{ } { } ] [
~
], 1 [ | ] [ ] [ | ] [ ] | [

n x n n s E n n s E n n s = = X X
Because of this

So what??!! Well can now exploit Property #1!!


{ } { }
" "# " "$ % " " # " " $ %
] [
~
| ] [
] 1 | [

] 1 [ | ] [ ] | [

n x n s E
n n s
n n s E n n s +
=
=

X
Update based on
innovation part
of new data

Now need t o
look mor e
closely at
each of t hese!
prediction of s[n]
based on past data
5
] 1 | [

n n s Look at Prediction Term:


Use the Dynamical Model it is the key to prediction because it tells us
how the state should progress from instant to instant
{ } { } ] 1 [ | ] [ ] 1 [ ] 1 [ | ] [ ] 1 | [

+ = = n n u n as E n n s E n n s X X
Now use Property #2:
{ } { }
" " " # " " " $ % " " " # " " " $ %
0 ]} [ { ] 1 | 1 [

] 1 [ | ] [ ] 1 [ | ] 1 [ ] 1 | [

= = =
+ =
n u E n n s
n n u E n n s E a n n s X X
By Definition
By independence of u[n]
& X[n-1] See bottom
of p. 433 in textbook.
] 1 | 1 [

] 1 | [

= n n s a n n s
The Dynamical Model provides the
update from estimate to prediction!!
6
]} [
~
| ] [ { n x n s E Look at Update Term:
Use the form for the Gaussian MMSE estimate:
] [
~
]} [
~
{
]} [
~
] [ {
]} [
~
| ] [ {
] [
2
n x
n x E
n x n s E
n x n s E
n k
" " # " " $ %

=
(
(

=
] 1 | [

] [ ] [
~
= n n x n x n x
( ) ] 1 | [

] [ ] [ ]} [
~
| ] [ { = n n x n x n k n x n s E
"# "$ % "# "$ %
0
] 1 | [

] 1 | [

=
+ = n n w n n s
by Prop. #2
Prediction Shows Up Again!!!
So
Because w[n] is indep.
of {x[0], , x[n-1]}
Put t hese Result s Toget her :
| | ] 1 | [

] [ ] [ ] 1 | [

] | [

] 1 | 1 [

+ + =
=
n n s n x n k n n s n n s
n n s a
"# "$ %
This is t he
Kalman Filt er
How to get the gain?
7
Look at the Gain Term:
Need two properties
])} 1 | [

] [ ])( 1 | [

] [ {( ])} 1 | [

] [ ]( [ { = n n s n x n n s n s E n n s n x n s E A.
The innovation
] [
~
] 1 | [

] [
n x
n n x n x
=
=
Aside
<x,y> = <x+z,y>
for any z y
Linear combo of past data
thus w/ innovation
0 ])} 1 | [

] [ ]( [ { = n n s n s n w E
B.
proof
w[n] is the measurement noise and by assumption is indep. of the
dynamical driving noise u[n] and s[-1] In other words: w[n] is indep.
of everything dynamical So E{w[n]s[n]} = 0
is based on past data, which include {w[0], , w[n-1]}, and
since the measurement noise has indep. samples we get
] 1 | [

n n s
] [ ] 1 | [

n w n n s
8
So we start with the gain as defined above:
| | { }
| | { }
| || | { }
| | { }
| || | { }
| | { }
| | { } | | { }
| | { } | | { } ] [ ] 1 | [

] [ 2 ] 1 | [

] [
] [ ] 1 | [

] [ ] 1 | [

] [
] [ ] 1 | [

] [
] [ ] 1 | [

] [ ] 1 | [

] [
] [ ] 1 | [

] [
] 1 | [

] [ ] 1 | [

] [
] 1 | [

] [
] 1 | [

] [ ] [
]} [
~
{
]} [
~
] [ {
] [
2
2
2
2
2
2
2
n w n n s n s E n n s n s E
n w n n s n s E n n s n s E
n w n n s n s E
n w n n s n s n n s n s E
n w n n s n s E
n n s n x n n s n s E
n n s n x E
n n s n x n s E
n x E
n x n s E
n k
n
+ +
+
=
+
+
=
+

=


= =

Use Prop. A in num.


Use x[n] = s[n]+ w[n]
in denominator
(!)
(!!)
Use
x[n] = s[n]+ w[n]
in numerator
Expand
= 0 by Prop. B
] 1 | [ =

n n M
Plug in for innovation
MSE when s[n] is estimated
by 1-step prediction
9
This gives a form for the gain:
] 1 | [
] 1 | [
] [
2
+

=
n n M
n n M
n k
n

This balances
the quality of the measured data
against the predicted state
I n t he Kalman f ilt er t he pr edict ion act s like t he
pr ior inf or mat ion about t he st at e at t ime n
bef or e we obser ve t he dat a at t ime n
10
Look at the Prediction MSE Term:
But now we need to know how to find M[n|n 1]!!!
| | { }
| | { }
( ) | | { }
2
2
2
] [ ] 1 | 1 [

] 1 [
] 1 | 1 [

] [ ] 1 [
1 | [

] [ ] 1 | [
n u n n s n s a E
n n s a n u n as E
n n s n s E n n M
+ =
+ =
=
2 2
] 1 | 1 [ ] 1 | [
u
n n M a n n M + =
Why are the cross-terms zero? Two parts:
1. s[n 1] depends on {u[0] u[n 1], s[-1]}, which are indep. of u[n]
2. depends on {s[0]+w[0] s[n 1]+w[n 1]}, which are
indep. of u[n]
] 1 | 1 [

n n s
Use dynamical
model & exploit
form for
prediction
Cross-terms = 0
Est. Error at previous time
Look at a Recursion for the MSE Term M[n|n]. By definition:

    M[n|n] = E{( s[n] − ŝ[n|n] )²} = E{( [ s[n] − ŝ[n|n−1] ] − k[n] [ x[n] − ŝ[n|n−1] ] )²}
                                           Term A                   Term B

Expanding the square gives three terms, E{A²} − E{2AB} + E{B²}:

    E{A²}  = M[n|n−1]

    E{2AB} = 2 k[n] E{( s[n] − ŝ[n|n−1] )( x[n] − ŝ[n|n−1] )} = 2 k[n] M[n|n−1]
             (by (!!), this expectation is the numerator of k[n], i.e. M[n|n−1])

    E{B²}  = k²[n] E{( x[n] − ŝ[n|n−1] )²} = k²[n] [den. of k[n]] = k[n] [num. of k[n]] = k[n] M[n|n−1]
             (by (!), that expectation is the denominator of k[n]; recall k[n] = M[n|n−1] / ( σ_n² + M[n|n−1] ))
So this gives

    M[n|n] = M[n|n−1] − 2 k[n] M[n|n−1] + k[n] M[n|n−1]
           = ( 1 − k[n] ) M[n|n−1]

Putting all of these results together gives some very simple equations to iterate, called the Kalman Filter.

We just derived the form for Scalar State & Scalar Observation. On the next three charts we give the Kalman Filter equations for:
Scalar State & Scalar Observation
Vector State & Scalar Observation
Vector State & Vector Observation
Kalman Filter: Scalar State & Scalar Observation

State Model:        s[n] = a s[n−1] + u[n],    u[n] WGN, WSS, ~ N(0, σ_u²)
Observation Model:  x[n] = s[n] + w[n],        w[n] WGN, ~ N(0, σ_n²)   (σ_n² may vary with n)

Must Know: μ_s, σ_s², a, σ_u², σ_n²

Initialization:  ŝ[−1|−1] = E{s[−1]} = μ_s
                 M[−1|−1] = E{( s[−1] − μ_s )²} = σ_s²

Prediction:      ŝ[n|n−1] = a ŝ[n−1|n−1]

Pred. MSE:       M[n|n−1] = a² M[n−1|n−1] + σ_u²

Kalman Gain:     K[n] = M[n|n−1] / ( σ_n² + M[n|n−1] )

Update:          ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − ŝ[n|n−1] )

Est. MSE:        M[n|n] = ( 1 − K[n] ) M[n|n−1]
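A minimal Python sketch of this scalar recursion is given below. All model parameters (a, σ_u², σ_n², μ_s, σ_s²) are made-up values chosen only so the loop runs; they are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- assumed model parameters (illustrative only) ---
a, var_u, var_n = 0.95, 0.1, 0.5      # state coefficient, driving-noise var, measurement-noise var
mu_s, var_s = 0.0, 1.0                # prior mean/variance of s[-1]

# simulate the scalar Gauss-Markov signal and its noisy observations
N = 200
s = np.zeros(N)
s_prev = rng.normal(mu_s, np.sqrt(var_s))
for n in range(N):
    s_prev = a * s_prev + rng.normal(0, np.sqrt(var_u))
    s[n] = s_prev
x = s + rng.normal(0, np.sqrt(var_n), N)

# --- scalar Kalman filter recursion from the slide ---
s_hat, M = mu_s, var_s                # initialization: s_hat[-1|-1], M[-1|-1]
s_filt = np.zeros(N)
for n in range(N):
    s_pred = a * s_hat                     # prediction
    M_pred = a**2 * M + var_u              # prediction MSE
    K = M_pred / (var_n + M_pred)          # Kalman gain
    s_hat = s_pred + K * (x[n] - s_pred)   # update
    M = (1 - K) * M_pred                   # estimation MSE
    s_filt[n] = s_hat

print("measurement MSE :", np.mean((x - s)**2))
print("KF estimate MSE :", np.mean((s_filt - s)**2))
```

Typically the printed filtered MSE comes out well below the raw measurement MSE, which is the whole point of the recursion.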
Kalman Filter: Vector State & Scalar Observation

State Model:        s[n] = A s[n−1] + B u[n];   s[n] is p×1, A is p×p, B is p×r;   u[n] ~ N(0, Q), r×1
Observation Model:  x[n] = hᵀ[n] s[n] + w[n];   h[n] is p×1;   w[n] WGN, ~ N(0, σ_n²)

Must Know: μ_s, C_s, A, B, h[n], Q, σ_n²

Initialization:  ŝ[−1|−1] = E{s[−1]} = μ_s
                 M[−1|−1] = E{( s[−1] − μ_s )( s[−1] − μ_s )ᵀ} = C_s

Prediction:           ŝ[n|n−1] = A ŝ[n−1|n−1]

Pred. MSE (p×p):      M[n|n−1] = A M[n−1|n−1] Aᵀ + B Q Bᵀ

Kalman Gain (p×1):    K[n] = M[n|n−1] h[n] / ( σ_n² + hᵀ[n] M[n|n−1] h[n] )

Update:               ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − hᵀ[n] ŝ[n|n−1] )
                      where x[n] − hᵀ[n] ŝ[n|n−1] = x[n] − x̂[n|n−1] = x̃[n] is the innovation

Est. MSE (p×p):       M[n|n] = ( I − K[n] hᵀ[n] ) M[n|n−1]
Kalman Filter: Vector State & Vector Observation

State Model:        s[n] = A s[n−1] + B u[n];   s[n] is p×1, A is p×p, B is p×r;   u[n] ~ N(0, Q), r×1
Observation Model:  x[n] = H[n] s[n] + w[n];    x[n] is M×1, H[n] is M×p;   w[n] ~ N(0, C[n]), M×1

Must Know: μ_s, C_s, A, B, H[n], Q, C[n]

Initialization:  ŝ[−1|−1] = E{s[−1]} = μ_s
                 M[−1|−1] = E{( s[−1] − μ_s )( s[−1] − μ_s )ᵀ} = C_s

Prediction:           ŝ[n|n−1] = A ŝ[n−1|n−1]

Pred. MSE (p×p):      M[n|n−1] = A M[n−1|n−1] Aᵀ + B Q Bᵀ

Kalman Gain (p×M):    K[n] = M[n|n−1] Hᵀ[n] ( C[n] + H[n] M[n|n−1] Hᵀ[n] )⁻¹
                      (the inverted matrix is M×M)

Update:               ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − H[n] ŝ[n|n−1] )
                      where x[n] − H[n] ŝ[n|n−1] = x[n] − x̂[n|n−1] = x̃[n] is the innovation

Est. MSE (p×p):       M[n|n] = ( I − K[n] H[n] ) M[n|n−1]
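For the general vector/vector case, one iteration follows directly from these equations. The helper below is a generic sketch (the function name and signature are mine, not from the slides); it assumes the model matrices A, B, Q, H, C are supplied by the application.

```python
import numpy as np

def kalman_step(s_hat, M, x, A, B, Q, H, C):
    """One vector-state / vector-observation Kalman iteration.

    s_hat, M : previous estimate s_hat[n-1|n-1] and its MSE matrix M[n-1|n-1]
    x        : current observation x[n]
    A, B, Q  : state model  s[n] = A s[n-1] + B u[n],  u ~ N(0, Q)
    H, C     : observation  x[n] = H s[n] + w[n],      w ~ N(0, C)
    """
    s_pred = A @ s_hat                              # prediction
    M_pred = A @ M @ A.T + B @ Q @ B.T              # prediction MSE
    S = C + H @ M_pred @ H.T                        # innovation covariance (M x M)
    K = M_pred @ H.T @ np.linalg.inv(S)             # Kalman gain (p x M)
    innov = x - H @ s_pred                          # innovation
    s_upd = s_pred + K @ innov                      # update
    M_upd = (np.eye(len(s_hat)) - K @ H) @ M_pred   # estimation MSE
    return s_upd, M_upd
```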
Kalman Filter Block Diagram

[Block diagram: the observation x[n] is compared with the predicted observation x̂[n|n−1], formed by passing the predicted state ŝ[n|n−1] through the embedded observation model H[n], to produce the innovation x̃[n]. The innovation is scaled by the gain K[n] to form the correction term (labeled B û[n], the estimated driving noise), which is added to the predicted state to give the estimated state ŝ[n|n]. The embedded dynamical model, the A z⁻¹ feedback block, turns the estimate into the next predicted state.]

Looks a lot like Sequential LS/MMSE except it has the Embedded Dynamical Model!!!
Overview of MMSE Estimation

[Chart summarizing the MMSE estimation family. General MMSE (squared-error cost): θ̂ = E{θ | x}. Assuming jointly Gaussian statistics, or forcing a linear estimator with any PDF but known 2nd moments (LMMSE): θ̂ = E{θ} + C_θx C_xx⁻¹ ( x − E{x} ). For the Bayesian Linear Model this becomes θ̂ = μ_θ + C_θ Hᵀ ( H C_θ Hᵀ + C_w )⁻¹ ( x − H μ_θ ). Applied sequentially with no dynamics this gives the optimal (or linear) sequential filter θ̂[n] = θ̂[n−1] + k[n] ( x[n] − hᵀ[n] θ̂[n−1] ); with a dynamical model it gives the optimal (or linear) Kalman filter ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − H[n] A ŝ[n−1|n−1] ).]
Important Properties of the KF

1. The Kalman filter is an extension of the sequential MMSE estimator.
   Sequential MMSE is for a fixed parameter; the Kalman filter is for a time-varying parameter, but it must have a known dynamical model.
   The block diagrams are nearly identical except for the A z⁻¹ feedback box in the Kalman filter (just a z⁻¹ box in sequential MMSE); the A is the dynamical model's state-transition matrix.

2. Inversion is only needed for the vector observation case.

3. The Kalman filter is a time-varying filter.
   This is due to two time-varying blocks: the gain K[n] and the observation matrix H[n].
   Note: K[n] changes constantly to adjust the balance between info from the data (the innovation) and info from the model (the prediction).

4. The Kalman filter computes (and uses!) its own performance measure M[n|n], which is the MMSE matrix.
   It is used to help balance between innovation and prediction.
5. There is a natural up-down progression in the error.
   The Prediction Stage increases the error; the Update Stage decreases it: M[n|n−1] > M[n|n].
   This is OK: prediction is just a natural, intermediate step in the optimal processing.
   [Plot: the MSE sequence M[5|4], M[5|5], M[6|5], M[6|6], M[7|6], M[7|7], M[8|7], … forms a sawtooth versus n, rising at each prediction and dropping at each update.]

6. Prediction is an integral part of the KF, and it is based entirely on the Dynamical Model!!!

7. After a long time (as n → ∞) the KF reaches steady-state operation and becomes a linear time-invariant filter.
   M[n|n] and M[n|n−1] both become constant (but we still have M[n|n−1] > M[n|n]).
   Thus the gain k[n] becomes constant, too (a quick numerical check of this is sketched below).
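To see this steady-state behavior numerically, one can simply iterate the scalar recursions with constant (made-up) parameters and watch the gain and both MSEs settle; the values used below are illustrative only.

```python
# Iterate the scalar MSE/gain recursions with constant, assumed parameters.
a, var_u, var_n = 0.9, 0.2, 1.0
M = 10.0                              # arbitrary, pessimistic M[-1|-1]
for n in range(30):
    M_pred = a**2 * M + var_u         # M[n|n-1]
    k = M_pred / (var_n + M_pred)     # k[n]
    M = (1 - k) * M_pred              # M[n|n]
    if n % 5 == 0:
        print(f"n={n:2d}  k[n]={k:.4f}  M[n|n-1]={M_pred:.4f}  M[n|n]={M:.4f}")
# After a handful of iterations k[n] and both MSEs stop changing: steady state.
```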
8. The KF creates an uncorrelated sequence: the innovations.
   We can view the innovations as an equivalent input sequence.
   Or, if we view the innovations as the output, then the steady-state KF is an LTI whitening filter (steady state is needed to get constant-power innovations).

9. The KF is optimal for the Gaussian case (it minimizes the MSE).
   If not Gaussian, the KF is still the optimal Linear MMSE estimator!!!

10. M[n|n−1], M[n|n], and K[n] can be computed ahead of time (off-line),
    as long as the measurement-noise variance σ_n² is known.
    This allows off-line, data-independent assessment of KF performance.
13.5 Kalman Filters vs. Wiener Filters

They are hard to compare directly because they assume different models:
   Wiener assumes WSS signal + noise.
   Kalman assumes a dynamical model with an observation model.
So to compare them we need to put them in the same context. If we let:
1. Consider only the behavior after much time has elapsed (as n → ∞):
   this gives the IIR Wiener case, and the steady-state Kalman filter whose dynamical model becomes AR.
2. For the Kalman filter, let σ_n² be constant:
   the observation noise becomes WSS.
Then Kalman = Wiener!!! See the book for more details.
13.7 Extended Kalman Filter

The dynamical and observation models we assumed when developing the Kalman filter were linear models:

    Dynamics:      s[n] = A s[n−1] + B u[n]      (a matrix is a linear operator)
    Observations:  x[n] = H[n] s[n] + w[n]

However, many (most?) applications have a nonlinear state equation and/or a nonlinear observation equation.

Solving for the optimal Kalman filter in the nonlinear model case is generally intractable!!!

The Extended Kalman Filter is a sub-optimal approach that linearizes the model(s) and then applies the standard KF.
EKF Motivation: A/C Tracking with Radar

Case #1: Dynamics are Linear but Observations are Nonlinear

Recall the constant-velocity model for an aircraft. Define the state in rectangular coordinates:

    s[n] = [ r_x[n]  r_y[n]  v_x[n]  v_y[n] ]ᵀ      (A/C positions in m, A/C velocities in m/s)

Dynamics Model:

    s[n] = A s[n−1] + B u[n]

    A = [ 1  0  Δ  0              B = [ 0  0
          0  1  0  Δ                    0  0
          0  0  1  0                    1  0
          0  0  0  1 ]                  0  1 ]

For rectangular coordinates the state equation is linear.
But the choice of rectangular coordinates makes the radar's observations nonlinearly related to the state.

A radar can observe range and bearing (i.e., the angle to the target), and also radial and angular velocities, which we will ignore here. So the observation equations relating the observation to the state are:

    x[n] = [ √( r_x²[n] + r_y²[n] ) ,  tan⁻¹( r_y[n] / r_x[n] ) ]ᵀ + [ w_R[n] ,  w_β[n] ]ᵀ

[Geometry sketch: target at position (r_x, r_y) relative to the radar, at range R and bearing β.]

Observation Model: for rectangular state coordinates the observation equation is nonlinear.
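In code this nonlinear observation function is only a few lines. The sketch below is illustrative (the function name is mine); it uses atan2 rather than a plain arctangent so the bearing is valid in all four quadrants, a small liberty relative to the formula above.

```python
import numpy as np

def range_bearing(s):
    """Noise-free radar observation h(s) for a state s = [rx, ry, vx, vy]."""
    rx, ry = s[0], s[1]
    R = np.hypot(rx, ry)          # range: sqrt(rx^2 + ry^2)
    beta = np.arctan2(ry, rx)     # bearing; atan2 handles all quadrants
    return np.array([R, beta])

print(range_bearing(np.array([10.0, 5.0, 0.2, 0.2])))   # example state, made-up numbers
```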
Case #2: Observations are Linear but Dynamics are Nonlinear

If we choose the state to be in polar form, then the observations will be linear functions of the state, so maybe then we won't have a problem??? WRONG!!!

    s[n] = [ R[n]  β[n]  S[n]  γ[n] ]ᵀ      (A/C range & bearing, A/C speed & heading)

[Geometry sketch: target at range R and bearing β from the radar, moving with speed S at heading γ.]

Observation Model: the observation is linear:

    x[n] = [ 1  0  0  0     s[n]  +  [ w_R[n]
             0  1  0  0 ]              w_β[n] ]
But the Dynamics Model is now nonlinear:

    R[n] = √( ( R[n−1] cos β[n−1] + Δ S[n−1] cos γ[n−1] )² + ( R[n−1] sin β[n−1] + Δ S[n−1] sin γ[n−1] )² )

    β[n] = tan⁻¹( ( R[n−1] sin β[n−1] + Δ S[n−1] sin γ[n−1] ) / ( R[n−1] cos β[n−1] + Δ S[n−1] cos γ[n−1] ) )

    S[n] = √( ( S[n−1] cos γ[n−1] + u_x[n] )² + ( S[n−1] sin γ[n−1] + u_y[n] )² )

    γ[n] = tan⁻¹( ( S[n−1] sin γ[n−1] + u_y[n] ) / ( S[n−1] cos γ[n−1] + u_x[n] ) )

In each of these cases we can't apply the standard KF, because it relies on the assumption of linear state and observation models!!!
Nonlinear Models

We state here the case where both the state and observation equations are nonlinear:

    s[n] = a( s[n−1] ) + B u[n]
    x[n] = h_n( s[n] ) + w[n]

where a(·) and h_n(·) are both nonlinear functions mapping a vector to a vector.
What To Do When Facing a Non-Linear Model?

1. Go back and re-derive the MMSE estimator for the nonlinear case to develop the "your-last-name-here" filter??
   Nonlinearities don't preserve Gaussianity, so it will be hard to derive.
   There has been some recent progress in this area: particle filters.
2. Give up and try to convince your company's executives and the FAA (Federal Aviation Administration) that tracking airplanes is not that important??
   Probably not a good career move!!!
3. Argue that you should use an extremely dense grid of radars networked together??
   Would be extremely expensive, although with today's efforts in sensor networks this may not be so far-fetched!!!
4. Linearize each nonlinear model using a 1st-order Taylor series?
   Yes!!! Of course it won't be optimal, but it might give the required performance!
Linearization of Models

State: linearize a(s[n−1]) around the most recent estimate ŝ[n−1|n−1]:

    a( s[n−1] ) ≈ a( ŝ[n−1|n−1] ) + A[n−1] ( s[n−1] − ŝ[n−1|n−1] )

    where A[n−1] = ∂a / ∂s[n−1], evaluated at s[n−1] = ŝ[n−1|n−1]

Observation: linearize h_n(s[n]) around the prediction ŝ[n|n−1]:

    h_n( s[n] ) ≈ h_n( ŝ[n|n−1] ) + H[n] ( s[n] − ŝ[n|n−1] )

    where H[n] = ∂h_n / ∂s[n], evaluated at s[n] = ŝ[n|n−1]
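When working the partial derivatives out by hand is error-prone, a finite-difference Jacobian is one practical way to implement (or sanity-check) these linearizations. The helper below is a generic sketch, not something prescribed by the slides; an analytic Jacobian is cheaper and more accurate when it is available.

```python
import numpy as np

def numerical_jacobian(f, s0, eps=1e-6):
    """Finite-difference approximation of the Jacobian d f / d s evaluated at s0."""
    s0 = np.asarray(s0, dtype=float)
    f0 = np.asarray(f(s0), dtype=float)
    J = np.zeros((f0.size, s0.size))
    for j in range(s0.size):
        ds = np.zeros_like(s0)
        ds[j] = eps                      # perturb one state component at a time
        J[:, j] = (np.asarray(f(s0 + ds), dtype=float) - f0) / eps
    return J
```

For example, with the range_bearing function sketched earlier, H[n] ≈ numerical_jacobian(range_bearing, s_pred) evaluated at the prediction.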
Using the Linearized Models

    s[n] ≈ A[n−1] s[n−1] + B u[n] + [ a( ŝ[n−1|n−1] ) − A[n−1] ŝ[n−1|n−1] ]

This is just like what we did in the linear case, except now we have a time-varying A matrix and a new additive term. But that term is known at each step, so in terms of the development we can imagine that we just subtract off this known part. Result: it has no real impact!

    x[n] ≈ H[n] s[n] + w[n] + [ h_n( ŝ[n|n−1] ) − H[n] ŝ[n|n−1] ]

1. The resulting EKF iteration is virtually the same, except there is a linearization step.
2. We can no longer do data-free, off-line performance iteration:
   H[n] and A[n−1] are computed on each iteration using the data-dependent estimate and prediction.
Extended Kalman Filter (Vector-Vector)

Initialization:   ŝ[−1|−1] = μ_s ,    M[−1|−1] = C_s

Prediction:       ŝ[n|n−1] = a( ŝ[n−1|n−1] )

Linearizations:   A[n−1] = ∂a / ∂s[n−1], evaluated at s[n−1] = ŝ[n−1|n−1]
                  H[n]   = ∂h_n / ∂s[n], evaluated at s[n] = ŝ[n|n−1]

Pred. MSE:        M[n|n−1] = A[n−1] M[n−1|n−1] Aᵀ[n−1] + B Q Bᵀ

Kalman Gain:      K[n] = M[n|n−1] Hᵀ[n] ( C[n] + H[n] M[n|n−1] Hᵀ[n] )⁻¹

Update:           ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − h_n( ŝ[n|n−1] ) )

Est. MSE:         M[n|n] = ( I − K[n] H[n] ) M[n|n−1]
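Putting the table together: an EKF iteration differs from the linear Kalman step only in using the nonlinear functions for the prediction and the predicted observation, and their Jacobians in the covariance recursions. The sketch below is a generic illustration under those assumptions; all names and the interface are mine, not from the slides.

```python
import numpy as np

def ekf_step(s_hat, M, x, a_fun, A_jac, B, Q, h_fun, H_jac, C):
    """One Extended Kalman Filter iteration (vector state, vector observation).

    a_fun, h_fun : nonlinear state and observation functions
    A_jac, H_jac : functions returning their Jacobians at a given point
    """
    A = A_jac(s_hat)                                # linearize state model at s_hat[n-1|n-1]
    s_pred = a_fun(s_hat)                           # nonlinear prediction
    M_pred = A @ M @ A.T + B @ Q @ B.T              # prediction MSE
    H = H_jac(s_pred)                               # linearize observation model at s_hat[n|n-1]
    K = M_pred @ H.T @ np.linalg.inv(C + H @ M_pred @ H.T)   # Kalman gain
    s_upd = s_pred + K @ (x - h_fun(s_pred))        # update with nonlinear predicted observation
    M_upd = (np.eye(len(s_hat)) - K @ H) @ M_pred   # estimation MSE
    return s_upd, M_upd
```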
13.8 Signal Processing Examples

Ex. 13.3 Time-Varying Channel Estimation

[Diagram: transmitter Tx sends v(t) to receiver Rx over a direct path and a multipath reflection; the received signal is y(t).]

    y(t) = ∫₀ᵀ h_t(τ) v(t − τ) dτ        (T is the maximum delay)

Model this using a time-varying D-T FIR system:

    y[n] = Σ_{k=0}^{p} h_n[k] v[n − k]

The coefficients change at each n to model the time-varying channel. The channel changes with time if:
   there is relative motion between Rx and Tx;
   reflectors move/change with time.
In communication systems, multipath channels degrade performance: inter-symbol interference (ISI), flat fading, frequency-selective fading, etc.

Need to: first, estimate the channel coefficients; second, build an inverse filter or equalizer.

Two broad scenarios:
1. The signal v(t) being sent is known (training data).
2. The signal v(t) being sent is not known (blind channel estimation).

One method for scenario #1 is to use a Kalman filter. The state to be estimated is h[n] = [ h_n[0] … h_n[p] ]ᵀ.
(Note: h is no longer used here to denote the observation model.)
Need a State Equation. Assume the FIR tap coefficients change slowly:

    h[n] = A h[n−1] + u[n]

A is assumed known; that is a weakness!! Also assume the FIR taps are uncorrelated with each other ("uncorrelated scattering"), so A, Q = cov{u[n]}, and C_h = cov{h[−1]} = M[−1|−1] are all diagonal.
Need an Observation Equation. We have the measurement model from the convolution view:

    x[n] = Σ_{k=0}^{p} h_n[k] v[n − k] + w[n]        (w[n]: zero-mean WGN, variance σ²; v[n]: known training signal)

    x[n] = vᵀ[n] h[n] + w[n]

The observation matrix (the row vᵀ[n]) is made up of samples of the known transmitted signal; the state vector h[n] holds the filter coefficients.
Simple Specific Example: p = 2 (1 Direct Path, 1 Multipath)

    h[n] = A h[n−1] + u[n],     A = [ 0.99  0         Q = cov{u[n]} = [ 0.0001  0
                                      0     0.999 ]                     0       0.0001 ]

[Plot: a typical realization of the channel coefficients. The book doesn't state how the initial coefficients were chosen for this realization. Note that h_n[0] decays faster and that the random perturbation is small.]
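As a sketch of how this example could be coded: A, Q, the measurement-noise variance σ² = 0.1, and the inflated initial covariance follow the numbers quoted for this example, while the ±1 training sequence and the initial true taps are my own assumptions, since the slides do not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.diag([0.99, 0.999])            # tap dynamics (from the slide)
Q = np.diag([1e-4, 1e-4])             # driving-noise covariance (from the slide)
var_w = 0.1                           # measurement-noise variance (from the slide)

N = 200
v = rng.choice([-1.0, 1.0], size=N)   # assumed +/-1 training sequence (not specified on the slide)

# simulate the slowly varying 2-tap channel and the received signal
h_true = np.zeros((N, 2))
h_prev = np.array([1.0, 0.5])         # arbitrary initial taps (assumed)
x = np.zeros(N)
for n in range(N):
    h_prev = A @ h_prev + rng.multivariate_normal(np.zeros(2), Q)
    h_true[n] = h_prev
    v_vec = np.array([v[n], v[n - 1] if n > 0 else 0.0])      # [v[n], v[n-1]]
    x[n] = v_vec @ h_prev + rng.normal(0, np.sqrt(var_w))

# Kalman filter: state = channel taps, observation row = training samples
h_hat = np.zeros(2)                   # h_hat[-1|-1] = [0 0]^T
M = 100.0 * np.eye(2)                 # M[-1|-1] = 100 I : little prior knowledge
for n in range(N):
    H = np.array([[v[n], v[n - 1] if n > 0 else 0.0]])        # 1x2 observation row v^T[n]
    h_pred = A @ h_hat
    M_pred = A @ M @ A.T + Q          # B = I here
    K = M_pred @ H.T / (var_w + H @ M_pred @ H.T)             # 2x1 gain
    h_hat = h_pred + (K * (x[n] - H @ h_pred)).ravel()
    M = (np.eye(2) - K @ H) @ M_pred

print("final tap estimates:", h_hat, " true taps:", h_true[-1])
```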
[Plots: the known transmitted signal, the noise-free received signal, and the noisy received signal. (It is a bit odd that the received signal is larger than the transmitted signal.) The variance of the noise in the measurement model is σ² = 0.1.]
Estimation Results Using the Standard Kalman Filter

Initialization:   ĥ[−1|−1] = [ 0  0 ]ᵀ ,    M[−1|−1] = 100 I

This is chosen to reflect that little prior knowledge is available. In theory we said to initialize to the a priori mean, but in practice it is common to just pick some arbitrary initial value and set the initial covariance quite high; this forces the filter to start out trusting the data a lot!

[Plots: estimates of h_n[0] and h_n[1] vs. n. There is a transient due to the wrong initial condition, but eventually the filter tracks well!!]
[Plots: the Kalman filter gains decay as the filter comes to rely more on the model (the gain is zero when the signal is noise only); the Kalman filter MMSE shows that filter performance improves with time.]
Example: Radar Target Tracking

State Model: Constant-Velocity A/C Model

    s[n] = [ r_x[n]  r_y[n]  v_x[n]  v_y[n] ]ᵀ

    s[n] = A s[n−1] + u[n],     u[n] = [ 0  0  u_x[n]  u_y[n] ]ᵀ

    A = [ 1  0  Δ  0              Q = cov{u[n]} = [ 0  0  0     0
          0  1  0  Δ                                0  0  0     0
          0  0  1  0                                0  0  σ_u²  0
          0  0  0  1 ]                              0  0  0     σ_u² ]

(The velocity perturbations u_x[n], u_y[n] model wind, slight speed corrections, etc.)

Observation Model: Noisy Range/Bearing Radar Measurements

    x[n] = [ √( r_x²[n] + r_y²[n] ) ,  tan⁻¹( r_y[n] / r_x[n] ) ]ᵀ + [ w_R[n] ,  w_β[n] ]ᵀ      (bearing in radians)

For this simple example, assume:

    C = cov{w[n]} = [ σ_R²  0
                      0     σ_β² ]
Extended Kalman Filter Issues

1. Linearization of the observation model (see the book for details).
   Calculate it by hand and program it into the EKF, to be evaluated at each iteration.

2. Covariance of the State Driving Noise.
   Assume wind gusts, etc. are as likely to occur in any direction with the same magnitude, so model them as indep. with a common variance. Need the following:

       Q = cov{u[n]} = [ 0  0  0     0
                         0  0  0     0
                         0  0  σ_u²  0
                         0  0  0     σ_u² ]

   σ_u = what??? Note that u_x[n]/Δ is the acceleration from n−1 to n, so choose σ_u in m/s so that σ_u/Δ gives a reasonable range of accelerations for the type of target expected to be tracked (a toy sizing calculation is sketched below).
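For instance (a made-up target class, not from the slides): with Δ = 1 s and expected maneuvers of at most a few hundredths of a m/s², one might size σ_u so that σ_u/Δ lands on that order.

```python
# Hypothetical sizing of the driving-noise std dev sigma_u from an expected acceleration level.
delta = 1.0              # sample interval, seconds
a_max = 0.03             # rough acceleration scale assumed for this target class, m/s^2
sigma_u = a_max * delta  # chosen so that sigma_u / delta ~ a_max
print(f"sigma_u = {sigma_u} m/s,  sigma_u^2 = {sigma_u**2:.1e} (m/s)^2")
```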
3. Covariance of the Measurement Noise.
   The DSP engineers working on the radar usually specify this, or build routines into the radar to provide time-updated assessments of range/bearing accuracy.
   It is usually assumed to be white and zero-mean.
   Can use CRLBs for Range & Bearing.
     Note: the CRLBs depend on SNR, so the range & bearing measurement accuracy should get worse when the target is farther away.
   Often assume the Range error to be uncorrelated with the Bearing error,
     so use C[n] = diag{ σ_R²[n], σ_β²[n] }.
   But it is best to derive the joint CRLB to see if they are correlated.
4. Initialization Issues.
   Typically, convert the first range/bearing measurement into initial r_x & r_y values.
   If the radar provides no velocity info (i.e., does not measure Doppler), we can assume zero velocities.
   Pick a large initial MSE to force the KF to be unbiased.
     If we follow the above two ideas, then we might pick the MSE for r_x & r_y based on a statistical analysis of how range/bearing accuracy converts into r_x & r_y accuracies.
   Sometimes one radar gets a hand-off from some other radar or sensor.
     The other radar/sensor would likely hand off its last track values, so use those as ICs for initializing the new radar.
     The other radar/sensor would likely hand off an MSE measure of the quality of its last track, so use that as M[−1|−1].
   (A minimal version of the first idea is sketched below.)
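A minimal version of that "convert the first range/bearing, zero the velocities, inflate the initial MSE" idea is sketched here; the numbers in the example call are illustrative only.

```python
import numpy as np

def init_from_first_measurement(R0, beta0, big=100.0):
    """Initial state [rx, ry, vx, vy] and a deliberately large initial MSE matrix."""
    s0 = np.array([R0 * np.cos(beta0), R0 * np.sin(beta0), 0.0, 0.0])  # zero velocity assumed
    M0 = big * np.eye(4)      # large M[-1|-1] makes the filter trust the data at first
    return s0, M0

s0, M0 = init_from_first_measurement(11.18, np.deg2rad(26.6))   # made-up first measurement
print(s0)
```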
State Model Example Trajectories: Constant-Velocity A/C Model

[Plot: Y position (m) vs. X position (m), with the radar at the origin; the red line is the non-random constant-velocity trajectory and the other curves are realizations of the perturbed model.]

    Δ = 1 sec
    σ_u = 0.0316 m/s   (σ_u² = 0.001 m²/s²)
    r_x[−1] = 10 m,   r_y[−1] = 5 m
    v_x[−1] = 0.2 m/s,   v_y[−1] = 0.2 m/s
Observation Model Example Measurements

[Plots: Range R (meters) vs. sample index n, and Bearing β (degrees) vs. sample index n; the red lines are the noise-free measurements.]

    σ_R = 0.3162 m   (σ_R² = 0.1 m²)
    σ_β = 0.1 rad = 5.7 deg   (σ_β² = 0.01 rad²)

In reality, these accuracies would get worse when the target is far away, due to a weaker returned signal.
Measurements Directly Give a Poor Track

If we tried to directly convert the noisy range and bearing measurements into a track, this is what we'd get: not a very accurate track!!!! We need a Kalman filter! But the observation model is nonlinear, so use the Extended KF!

[Plot: the track obtained by direct conversion, with the radar at the origin. Note how the track gets worse far from the radar: angle accuracy converts into position accuracy in a way that depends on range.]
Extended Kalman Filter Gives Better Track

Note: the EKF was run with the correct values for Q and C (i.e., the Q and C used to simulate the trajectory and measurements were also used to implement the Kalman filter).

Initialization:  ŝ[−1|−1] = [ 5  5  0  0 ]ᵀ   (picked arbitrarily),    M[−1|−1] = 100 I   (set large to assert that little is known a priori)

[Plot: the EKF track overlaid on the true trajectory, with the radar at the origin. After about 20 samples the EKF attains track, even with poor ICs and the linearization. The track gets worse near the end, where the measurements are worse; the MSE plots confirm both that track is obtained and that things degrade at the end.]
MSE Plots Show Performance

First there is a transient where things get worse; next the EKF seems to obtain track; finally the accuracy degrades due to the range magnification of the bearing errors.