Random Variable
Definition
Numerical characterization of outcome of a random
event
Examples
1) Number on rolled dice
2) Temperature at specified time of day
3) Stock Market at close
4) Height of wheel going over a rocky road
Random Variable
Non-examples (but we can map these into RVs):
1) ‘Heads’ or ‘Tails’ on coin
2) Red or Black ball from urn
Two Types of Random Variables
Random Variable
Discrete RV: die, stocks. Continuous RV: temperature, wheel height.
PDF for Continuous RV
Given Continuous RV X…
What is the probability that X = x0 ?
Oddity : P(X = x0) = 0
Otherwise the Prob. “Sums” to infinity
Need to think of Prob. Density Function (PDF)
P(x_0 < X < x_0 + \Delta) = \text{area shown under } p_X(x) = \int_{x_0}^{x_0+\Delta} p_X(x)\,dx
Most Commonly Used PDF: Gaussian
p_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-m)^2/2\sigma^2}
and for the zero-mean case:
p_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}
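A minimal MATLAB sketch (not from the slides; the mean and sigma values are assumed for illustration) that evaluates this PDF and checks that it integrates to 1 and that ±1σ holds 68.3% of the probability:

% Minimal sketch: evaluate the Gaussian PDF and verify its area properties
m = 2; sigma = 1.5;                      % assumed example values
x = linspace(m - 6*sigma, m + 6*sigma, 10001);
p = exp(-(x - m).^2 / (2*sigma^2)) / (sigma*sqrt(2*pi));
total = trapz(x, p)                      % should be ~1
in1sig = (x >= m - sigma) & (x <= m + sigma);
area1sig = trapz(x(in1sig), p(in1sig))   % should be ~0.683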
Effect of Variance on Gaussian PDF
[Figure: three sketches of p_X(x) centered at x = m. The area within ±1σ of the mean is 0.683 = 68.3%. Small σ ⇒ small variability (small uncertainty); large σ ⇒ large variability (large uncertainty).]
Why Is Gaussian Used?
Central Limit theorem (CLT)
The sum of N independent RVs has a pdf
that tends to be Gaussian as N → ∞
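A minimal MATLAB sketch of the CLT (not from the slides; the choice of uniform RVs and the sample counts are assumptions): summing N independent uniforms and comparing the histogram to a Gaussian with matching mean and variance.

% Minimal sketch: histogram of a sum of N uniforms vs. the Gaussian limit
N = 30;                          % number of RVs in each sum
M = 1e5;                         % number of Monte Carlo trials
sums = sum(rand(N, M), 1);       % each column is one sum of N uniforms
histogram(sums, 100, 'Normalization', 'pdf'); hold on;
mu = N*0.5; sig = sqrt(N/12);    % mean/std of a sum of N U(0,1) RVs
x = linspace(min(sums), max(sums), 400);
plot(x, exp(-(x-mu).^2/(2*sig^2))/(sig*sqrt(2*pi)), 'r', 'LineWidth', 2);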
Joint PDF of RVs X and Y: p_{XY}(x, y)
Describes probabilities of joint events concerning X and Y. For example, the probability that X lies in interval [a,b] and Y lies in interval [c,d] is given by:
\Pr\{(a < X < b) \text{ and } (c < Y < d)\} = \int_a^b \int_c^d p_{XY}(x, y)\, dy\, dx
Conditional PDF ("slice and normalize"): holding y fixed and slicing the joint PDF gives
p_{X|Y=y}(x \mid y) = \frac{p_{XY}(x, y)}{p_Y(y)}
which equals p_X(x) when X and Y are independent.
Independent and Dependent Gaussian PDFs
[Figure: contours of p_{XY}(x,y). Independent case (non-zero mean): different slices give the same normalized curves. Dependent case: different slices give different normalized curves.]
An "Independent RV" Result
p_{XY}(x, y) = p_X(x)\, p_Y(y)
Here's why:
p_{Y|X=x}(y \mid x) = \frac{p_{XY}(x, y)}{p_X(x)} = \frac{p_X(x)\, p_Y(y)}{p_X(x)} = p_Y(y)
Characterizing RVs
PDF tells everything about an RV
– but sometimes they are “more than we need/know”
So… we make do with a few characteristics:
– Mean of an RV (Describes the centroid of PDF)
– Variance of an RV (Describes the spread of PDF)
– Correlation of RVs (Describes “tilt” of joint PDF)
Symbolically: E{X}
Motivating Idea of Mean of RV
Motivation First w/ “Data Analysis View”
Consider RV X = Score on a test Data: x1, x2,… xN
Test average:
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{N_0 V_0 + N_1 V_1 + \cdots + N_{100} V_{100}}{N} = \sum_{i=0}^{100} V_i\,\frac{N_i}{N}
where N_i = # of scores of value V_i, N = \sum_i N_i (total # of scores), and N_i/N \approx P(X = V_i).
This is called the Data Analysis View (statistics). But it motivates the Data Modeling View (probability).
Theoretical View of Mean
Data Analysis View leads to Probability Theory:
Data Modeling
For Discrete Random Variables:
E\{X\} = \sum_{i=1}^{n} x_i\, P_X(x_i) \quad (P_X \text{ is the probability function})
This motivates the form for a Continuous RV:
E\{X\} = \int_{-\infty}^{\infty} x\, p_X(x)\, dx \quad (p_X \text{ is the probability density function; } x \text{ is a dummy variable})
By the "Law of Large Numbers", the theoretical (data-modeling) mean matches the data-analysis average:
E\{X\} = \int_{-\infty}^{\infty} x\, p_X(x)\, dx \;\approx\; \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
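A minimal MATLAB sketch of this Law-of-Large-Numbers link (not from the slides; the Gaussian model and sample sizes are assumptions):

% Minimal sketch: sample average of N draws approaches the theoretical mean
mu = 3; sigma = 2;                  % assumed Gaussian example
Ns = round(logspace(1, 5, 20));     % sample sizes from 10 to 100000
avgs = zeros(size(Ns));
for k = 1:numel(Ns)
    x = mu + sigma*randn(Ns(k), 1); % N draws of X ~ N(mu, sigma^2)
    avgs(k) = mean(x);              % data-analysis view of the mean
end
semilogx(Ns, avgs, 'o-', Ns, mu*ones(size(Ns)), 'r--');
xlabel('N'); ylabel('sample average');   % converges to E{X} = mu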
Variance of RV
There are similar Data vs. Theory Views here…
But let's go right to the theory!!
\sigma_X^2 = E\{(X - m_X)^2\} = \int (x - m_X)^2\, p_X(x)\,dx
which reduces to \int x^2\, p_X(x)\,dx when the mean m_X = 0.
Motivating Idea of Correlation
Motivate First w/ Data Analysis View
Consider a random experiment that observes the
outcomes of two RVs:
Example: 2 RVs X and Y representing height and weight, respectively
[Figure: scatter plot of (x, y) data; height and weight form a positively correlated cloud.]
Illustrating 3 Main Types of Correlation
Data Analysis View: C_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})
Data Modeling View (covariance):
\sigma_{XY} = E\{(X - \bar{X})(Y - \bar{Y})\} = \int\!\!\int (x - \bar{X})(y - \bar{Y})\, p_{XY}(x, y)\, dx\, dy
If X = Y: \sigma_{XY} = \sigma_X^2 = \sigma_Y^2
If \sigma_{XY} = E\{(X - \bar{X})(Y - \bar{Y})\} = 0, then E\{XY\} = \bar{X}\,\bar{Y} = E\{X\}E\{Y\}. (E\{XY\} is called the "correlation of X & Y".) Independence, p_{XY}(x,y) = p_X(x)p_Y(y), is sufficient for this.
Correlation Coefficient: \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \qquad -1 \le \rho_{XY} \le 1
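A minimal MATLAB sketch (not from the slides; the correlation value 0.7 is an assumption) comparing the data-analysis covariance and correlation coefficient to the model values:

% Minimal sketch: sample covariance/correlation for correlated Gaussians
N = 1e5; rho = 0.7;                       % assumed true correlation
x = randn(N,1);
y = rho*x + sqrt(1 - rho^2)*randn(N,1);   % y correlated with x, unit variance
Cxy = mean((x - mean(x)).*(y - mean(y)))  % sample covariance
rho_hat = Cxy / (std(x)*std(y))           % sample correlation, ~0.7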
Covariance and Correlation For
Random Vectors…
\mathbf{x} = [X_1\; X_2\; \cdots\; X_N]^T
Correlation Matrix:
\mathbf{R}_x = E\{\mathbf{x}\mathbf{x}^T\} = \begin{bmatrix} E\{X_1 X_1\} & E\{X_1 X_2\} & \cdots & E\{X_1 X_N\} \\ E\{X_2 X_1\} & E\{X_2 X_2\} & \cdots & E\{X_2 X_N\} \\ \vdots & & \ddots & \vdots \\ E\{X_N X_1\} & E\{X_N X_2\} & \cdots & E\{X_N X_N\} \end{bmatrix}
Covariance Matrix:
\mathbf{C}_x = E\{(\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^T\}
A Few Properties of Expected Value
E\{X + Y\} = E\{X\} + E\{Y\} \qquad E\{aX\} = aE\{X\} \qquad E\{f(X)\} = \int f(x)\, p_X(x)\,dx
\mathrm{var}\{aX\} = a^2\sigma_X^2 \qquad \mathrm{var}\{a + X\} = \sigma_X^2
\mathrm{var}\{X + Y\} = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}, \quad\text{which reduces to } \sigma_X^2 + \sigma_Y^2 \text{ if } X \text{ \& } Y \text{ are uncorrelated}
Derivation (using zero-mean versions X_z = X - \bar{X}, Y_z = Y - \bar{Y}):
\mathrm{var}\{X+Y\} = E\{(X + Y - \bar{X} - \bar{Y})^2\} = E\{(X_z + Y_z)^2\} = E\{X_z^2\} + E\{Y_z^2\} + 2E\{X_z Y_z\} = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}
Joint PDF for Gaussian
Let x = [X1 X2 … XN]T be a vector of random variables. These random variables
are said to be jointly Gaussian if they have the following PDF
p(\mathbf{x}) = \frac{1}{(2\pi)^{N/2}\sqrt{\det(\mathbf{C}_x)}}\exp\!\left[-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_x)^T \mathbf{C}_x^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)\right]
A linear transform of jointly Gaussian RVs, \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}, is also jointly Gaussian, with
\boldsymbol{\mu}_y = E\{\mathbf{y}\} = \mathbf{A}\boldsymbol{\mu}_x + \mathbf{b}
\mathbf{C}_y = E\{(\mathbf{y} - \boldsymbol{\mu}_y)(\mathbf{y} - \boldsymbol{\mu}_y)^T\} = \mathbf{A}\mathbf{C}_x\mathbf{A}^T
A special case of this is the sum of jointly Gaussian RVs… which can be handled using A = [1 1 1 … 1].
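A minimal MATLAB sketch (not from the slides; the mean, covariance, A, and b values are assumptions) that checks the transform rules by Monte Carlo:

% Minimal sketch: verify mu_y = A*mu_x + b and C_y = A*C_x*A'
N = 1e5;
mu_x = [1; 2];  C_x = [2 0.5; 0.5 1];   % assumed mean and covariance
A = [1 1; 2 -1];  b = [0; 3];
L = chol(C_x, 'lower');                 % factor so L*L' = C_x
x = repmat(mu_x, 1, N) + L*randn(2, N); % draws with mean mu_x, cov C_x
y = A*x + repmat(b, 1, N);
mu_y_hat = mean(y, 2)                   % ~ A*mu_x + b
C_y_hat = cov(y')                       % ~ A*C_x*A'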
Moments of Gaussian RVs
Let X be zero mean Gaussian with variance σ2
Let X1 X2 X3 X4 be any four jointly Gaussian random variables with zero mean
Then the fourth moment factors into pairs (the standard zero-mean Gaussian moment-factoring result):
E\{X_1 X_2 X_3 X_4\} = E\{X_1 X_2\}E\{X_3 X_4\} + E\{X_1 X_3\}E\{X_2 X_4\} + E\{X_1 X_4\}E\{X_2 X_3\}
In particular, E\{X^4\} = 3\sigma^4 for zero-mean Gaussian X.
Note that this can be applied to find E\{X^2 Y^2\} if X and Y are jointly Gaussian.
Chi-Squared Distribution
Let X_1, X_2, …, X_N be a set of zero-mean independent jointly Gaussian random variables, each with unit variance, and let Y = \sum_{n=1}^{N} X_n^2. Then Y is chi-squared with N degrees of freedom:
p(y) = \begin{cases} \dfrac{1}{2^{N/2}\Gamma(N/2)}\, y^{(N/2)-1} e^{-y/2}, & y \ge 0 \\ 0, & y < 0 \end{cases}
For this RV we have E\{Y\} = N and \mathrm{var}\{Y\} = 2N.
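A minimal MATLAB sketch (not from the slides; N and the trial count are assumptions) checking the chi-squared moments by simulation:

% Minimal sketch: sum of squares of N unit-variance Gaussians is chi-squared(N)
N = 5; M = 1e5;
Y = sum(randn(N, M).^2, 1);   % each column: one chi-squared(N) draw
mean(Y)                       % ~ N
var(Y)                        % ~ 2N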
Review of
Matrices and Vectors
Vectors & Vector Spaces
Definition of Vector: a collection of complex or real numbers, generally put in a column:
\mathbf{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_N \end{bmatrix} = [v_1\; \cdots\; v_N]^T \quad (T \text{ denotes transpose})
Vector addition is element-by-element:
\mathbf{a} + \mathbf{b} = [a_1 + b_1\; \cdots\; a_N + b_N]^T
Definition of Scalar: a real or complex number.
Scalar multiplication is element-by-element: \alpha\mathbf{a} = [\alpha a_1\; \cdots\; \alpha a_N]^T
2. Associativity: (\mathbf{x} + \mathbf{y}) + \mathbf{z} = \mathbf{x} + (\mathbf{y} + \mathbf{z}); \quad \alpha(\beta\mathbf{x}) = (\alpha\beta)\mathbf{x}
3. Distributivity: \alpha(\mathbf{x} + \mathbf{y}) = \alpha\mathbf{x} + \alpha\mathbf{y}; \quad (\alpha + \beta)\mathbf{x} = \alpha\mathbf{x} + \beta\mathbf{x}
4. Scalar Unity & Scalar Zero: 1\mathbf{x} = \mathbf{x}; \quad 0\mathbf{x} = \mathbf{0}, where \mathbf{0} is the zero vector of all zeros
Definition of a Vector Space: A set V of N-dimensional vectors
(with a corresponding set of scalars) such that the set of vectors
is:
(i) “closed” under vector addition
(ii) “closed” under scalar multiplication
In other words:
• addition of vectors – gives another vector in the set
• multiplying a vector by a scalar – gives another vector in the set
Examples:
1. The space R2 is a subspace of R3.
2. Any plane in R3 that passes through the origin is a subspace
3. Any line passing through the origin in R2 is a subspace of R2
4. The set R2 is NOT a subspace of C2 because R2 isn’t closed
under complex scalars (a subspace must retain the original
space’s set of scalars)
Geometric Structure of Vector Space
Length of a Vector (Vector Norm): For any vector v in CN
we define its length (or “norm”) to be
\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{N} |v_i|^2}
Properties of the norm:
\|\alpha\mathbf{v}_1 + \beta\mathbf{v}_2\|_2 \le |\alpha|\,\|\mathbf{v}_1\|_2 + |\beta|\,\|\mathbf{v}_2\|_2 \quad\text{(triangle inequality)}
\|\mathbf{v}\|_2 < \infty \quad \forall\, \mathbf{v} \in C^N
\|\mathbf{v}\|_2 = 0 \text{ iff } \mathbf{v} = \mathbf{0}
Distance Between Vectors: the distance between two
vectors in a vector space with the two norm is defined by:
d ( v1 , v 2 ) = v1 − v 2 2
[Figure: vectors v_1 and v_2 drawn from the origin; the distance is the length of the difference vector v_1 − v_2.]
Angle Between Vectors & Inner Product:
Motivate the idea in R^2: let \mathbf{v} = [A\cos\theta\;\; A\sin\theta]^T and \mathbf{u} = [1\;\; 0]^T.
Note that: \sum_{i=1}^{2} u_i v_i = 1\cdot A\cos\theta + 0\cdot A\sin\theta = A\cos\theta
so this sum extracts the projection of v onto u — and hence the angle θ between them.
Inner Product Between Vectors :
Define the inner product between two complex vectors in CN by:
\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^{N} u_i v_i^*
3. Linking Inner Product to Norm: \|\mathbf{v}\|_2^2 = \langle \mathbf{v}, \mathbf{v} \rangle
Building Vectors From Other Vectors
Can we find a set of “prototype” vectors {v1, v2, …, vM} from
which we can build all other vectors in some given vector space V
by using linear combinations of the vi?
\mathbf{v} = \sum_{k=1}^{M} \alpha_k \mathbf{v}_k \qquad \mathbf{u} = \sum_{k=1}^{M} \beta_k \mathbf{v}_k
Same "ingredients"… just different amounts of them!!!
Expansion and Transformation
Fact: For a given basis {v_1, v_2, …, v_N}, the expansion of a vector v in V is unique. That is, for each v there is only one, unique set of coefficients {\alpha_1, \alpha_2, \ldots, \alpha_N} such that
\mathbf{v} = \sum_{k=1}^{N} \alpha_k \mathbf{v}_k
DFT from Basis Viewpoint:
If we have a discrete-time signal x[n] for n = 0, 1, … N-1
x = [x[0] x[1] ! x[ N − 1]]
T
Define vector:
Define an orthogonal basis from the exponentials used in the IDFT:
\mathbf{d}_k = \left[1,\; e^{j2\pi k\cdot 1/N},\; e^{j2\pi k\cdot 2/N},\; \ldots,\; e^{j2\pi k(N-1)/N}\right]^T, \quad k = 0, 1, \ldots, N-1
The expansion coefficients are inner products (normalized by N): \alpha_k = \frac{1}{N}\langle \mathbf{x}, \mathbf{d}_k \rangle
Then…
\mathbf{x} = \sum_{k=0}^{N-1} \alpha_k \mathbf{d}_k
Example: DFT Coefficients as Inner Products:
Recall: N-pt. IDFT is an expansion of the signal vector in terms of
N Orthogonal vectors. Thus
X[k] = \langle \mathbf{x}, \mathbf{d}_k \rangle = \sum_{n=0}^{N-1} x[n]\, d_k^*[n] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi kn/N}
See "reading notes" for some details about normalization issues in this case.
Matrices
Matrix: Is an array of (real or complex) numbers organized in
rows and columns. Here is a 3×4 example:
\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{bmatrix}
To see this:
\mathbf{V}\mathbf{V}^H = \begin{bmatrix} \langle \mathbf{v}_1, \mathbf{v}_1 \rangle & \langle \mathbf{v}_1, \mathbf{v}_2 \rangle & \cdots & \langle \mathbf{v}_1, \mathbf{v}_N \rangle \\ \langle \mathbf{v}_2, \mathbf{v}_1 \rangle & \langle \mathbf{v}_2, \mathbf{v}_2 \rangle & \cdots & \langle \mathbf{v}_2, \mathbf{v}_N \rangle \\ \vdots & & \ddots & \vdots \\ \langle \mathbf{v}_N, \mathbf{v}_1 \rangle & \langle \mathbf{v}_N, \mathbf{v}_2 \rangle & \cdots & \langle \mathbf{v}_N, \mathbf{v}_N \rangle \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \mathbf{I}
Inner products are 0 or 1 because this is an ON basis.
Unitary and Orthogonal Matrices
A unitary matrix is a complex matrix A whose inverse is A-1 = AH
For the real-valued matrix case… we get a special case of “unitary”
the idea of “unitary matrix” becomes “orthogonal matrix”
for which A-1 = AT
Two Properties of Unitary Matrices: Let U be a unitary matrix
and let y1 = Ux1 and y2 = Ux2
1. They preserve norms: ||yi|| = ||xi||.
2. They preserve inner products: < y1, y2 > = < x1, x2 >
That is the “geometry” of the old space is preserved by the unitary
matrix as it transforms into the new space.
(These are the same as the preservation properties of ON basis.)
DFT from Unitary Matrix Viewpoint:
Consider a discrete-time signal x[n] for n = 0, 1, … N-1.
We've already seen the DFT in a basis viewpoint:
\mathbf{x} = \sum_{k=0}^{N-1} \underbrace{\tfrac{1}{N} X[k]}_{\alpha_k}\, \mathbf{d}_k
Now we can view the DFT as a transform from the unitary matrix viewpoint:
\mathbf{D} = [\mathbf{d}_0 \,|\, \mathbf{d}_1 \,|\, \cdots \,|\, \mathbf{d}_{N-1}] = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & e^{j2\pi\cdot 1\cdot 1/N} & \cdots & e^{j2\pi(N-1)\cdot 1/N} \\ \vdots & \vdots & & \vdots \\ 1 & e^{j2\pi\cdot 1(N-1)/N} & \cdots & e^{j2\pi(N-1)(N-1)/N} \end{bmatrix}
DFT: \tilde{\mathbf{x}} = \mathbf{D}^H\mathbf{x} \qquad IDFT: \mathbf{x} = \tfrac{1}{N}\mathbf{D}\tilde{\mathbf{x}}
(Actually D is not unitary but N^{-1/2}D is unitary… see reading notes)
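A minimal MATLAB sketch (not from the slides; N = 8 is an assumption) verifying the unitary-matrix view of the DFT against MATLAB's fft:

% Minimal sketch: build D from the IDFT exponentials and check the claims
N = 8; n = 0:N-1;
D = exp(1j*2*pi*(n.'*n)/N);      % D(k+1,n+1) = e^{j*2*pi*k*n/N}
U = D / sqrt(N);
norm(U'*U - eye(N))              % ~0: N^(-1/2) D is unitary
x = randn(N,1);
norm(D'*x - fft(x))              % ~0: DFT is x_tilde = D^H x
norm(x - D*(D'*x)/N)             % ~0: IDFT is x = (1/N) D x_tilde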
Geometry Preservation of Unitary Matrix Mappings
Recall… unitary matrices map in such a way that the sizes of
vectors and the orientation between vectors is not changed.
[Figure: a unitary mapping takes x_1, x_2 to y_1, y_2 with all lengths and angles preserved.]
Unitary mappings just "rigidly rotate" the space.
Effect of Non-Unitary Matrix Mappings
[Figure: a non-unitary mapping distorts lengths and angles as it takes x_1, x_2 to y_1, y_2.]
More on Matrices as Transforms
We’ll limit ourselves here to real-valued vectors and matrices
[Figure: matrix A maps x ∈ R^n to y = Ax ∈ R^m.]
Otherwise… range(A) ⊂ R^m …because the columns don't span R^m
Rank of a Matrix: rank(A) = largest # of linearly independent
columns (or rows) of matrix A
For an m×n matrix we have that rank(A) ≤ min(m,n)
An m×n matrix A has “full rank” when rank(A) = min(m,n)
Example: This matrix has rank 3 because the 4th column can be written as a combination of the first 3 columns:
\mathbf{A} = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}
Characterizing “Tall Matrix” Mappings
We are interested in answering: given a vector y, which vector(s) x mapped into it via matrix A?
“Tall Matrix” (m > n) Case
If y does not lie in range(A), then there is No Solution
If y lies in range(A), then there is a solution (but not
necessarily just one unique solution)
y = Ax
y∉range(A) y∈range(A)
39/45
Characterizing "Square Matrix" Mappings
If A is full rank, then range(A) = R^n and every y has exactly one solution, x = A^{-1}y; if A is rank-deficient, a y ∉ range(A) has no solution and a y ∈ range(A) has infinitely many.
Eigenvalues and Eigenvectors of Square Matrices
If matrix A is n×n, then A maps Rn → Rn
Q: For a given n×n matrix A, which vectors get mapped into
being almost themselves???
More precisely… Which vectors get mapped to a scalar multiple
of themselves???
Even more precisely… which vectors v satisfy the following:
Av = λv
Input Output
These vectors are “special” and are called the eigenvectors of A.
The scalar λ is that e-vector’s corresponding eigenvalue.
[Figure: an eigenvector v maps to Av, which lies along v itself.]
“Eigen-Facts for Symmetric Matrices”
• If n×n real matrix A is symmetric, then
– e-vectors corresponding to distinct e-values are orthonormal
– e-values are real valued
– can decompose A as A = VΛ V T
V = [v1 v2 ! vn ] VV T = I
Λ = diag{λ1 , λ2 ,…, λn }
• If, further, A is pos. def. (semi-def.), then
– e-values are positive (non-negative)
– rank(A) = # of non-zero e-values
• Pos. Def. ⇒ Full Rank (and therefore invertible)
• Pos. Semi-Def. ⇒ Not Full Rank (and therefore not invertible)
– When A is P.D., then we can write
\mathbf{A}^{-1} = \mathbf{V}\boldsymbol{\Lambda}^{-1}\mathbf{V}^T, \qquad \boldsymbol{\Lambda}^{-1} = \mathrm{diag}\{1/\lambda_1,\, 1/\lambda_2,\, \ldots,\, 1/\lambda_n\}
For P.D. A, A^{-1} has the same e-vectors and has reciprocal e-values.
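A minimal MATLAB sketch (not from the slides; the random 4×4 construction is an assumption) checking these eigen-facts:

% Minimal sketch: eigen-decomposition of a symmetric PD matrix and its inverse
B = randn(4); A = B*B' + 4*eye(4);        % symmetric positive definite
[V, Lam] = eig(A);                        % A = V*Lam*V', with V'*V = I
norm(A - V*Lam*V')                        % ~0: decomposition holds
norm(inv(A) - V*diag(1./diag(Lam))*V')    % ~0: same e-vectors, reciprocal e-values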
Other Matrix Issues
We’ll limit our discussion to real-valued matrices and vectors
Quadratic Forms and Positive-(Semi)Definite Matrices
Quadratic Form = Matrix form for a 2nd-order multivariate
polynomial
Example: \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} (variable), \quad \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} (fixed)
The quadratic form of matrix A is:
Q_A(x_1, x_2) = \mathbf{x}^T\mathbf{A}\mathbf{x} \quad (1\times 2)\cdot(2\times 2)\cdot(2\times 1) = (1\times 1)\ \text{scalar}
= \sum_{i=1}^{2}\sum_{j=1}^{2} a_{ij}x_i x_j = a_{11}x_1^2 + a_{22}x_2^2 + (a_{12} + a_{21})x_1 x_2
• Values of the elements of matrix A determine the characteristics
of the quadratic form QA(x)
– If QA(x) ≥ 0 ∀x ≠ 0… then say that QA(x) is “positive semi-definite”
– If QA(x) > 0 ∀x ≠ 0… then say that QA(x) is “positive definite”
– Otherwise say that QA(x) is “non-definite”
• These terms carry over to the matrix that defines the Quad Form
– If QA(x) ≥ 0 ∀x ≠ 0… then say that A is “positive semi-definite”
– If QA(x) > 0 ∀x ≠ 0… then say that A is “positive definite”
Ch. 1 Introduction to Estimation
An Example Estimation Problem: DSB Rx
s(t; f_o, \phi_o) = m(t)\cos(2\pi f_o t + \phi_o)
[Figure: DSB receiver block diagram with spectra S(f), M(f), \hat{M}(f). The received x(t) = s(t) + w(t) — electronics adds noise w(t), usually "white" — passes through a BPF & amp, is mixed with \cos(2\pi\hat{f}_o t + \hat{\phi}_o) from an oscillator, then audio amplified; an estimation algorithm supplies \hat{f}_o & \hat{\phi}_o.]
PDF of Estimate
Because estimates are RVs we describe them with a PDF…
Will depend on:
1. structure of s[n]
2. probability model of w[n]
3. form of est. function g(x)
p(\hat{f}_o): the mean of this PDF measures its centroid.
Desire: E\{\hat{f}_o\} = f_o \quad\text{and}\quad \sigma^2_{\hat{f}_o} = E\left\{\left(\hat{f}_o - E\{\hat{f}_o\}\right)^2\right\} = \text{small}
1.2 Mathematical Estimation Problem
General Mathematical Statement of Estimation Problem:
For… Measured Data x = [ x[0] x[1] … x[N-1] ]
Unknown Parameter θ = [θ1 θ2 … θp ]
θ is Not Random
x is an N-dimensional random data vector
Q: What captures all the statistical information needed for an
estimation problem ?
A: Need the N-dimensional PDF of the data, parameterized by θ
Ex. Modeling Data with Linear Trend
See Fig. 1.6 in Text
x[n] = \underbrace{A + Bn}_{s[n;A,B]} + w[n]
1.3 Assessing Estimator Performance
Can only do this when the value of θ is known:
• Theoretical Analysis, Simulations, Field Tests, etc.
Desire an "unbiased" estimator with small variance:
\sigma^2_{\hat\theta} = E\left\{\left(\hat\theta - E\{\hat\theta\}\right)^2\right\} = \text{small}
Equivalent View of Assessing Performance
Define estimation error: e = \hat\theta - \theta (so \hat\theta = \theta + e); \hat\theta and e are RVs, \theta is not.
Desire: E\{e\} = 0 ("unbiased") and \sigma_e^2 = E\{(e - E\{e\})^2\} = \text{small}
Example: DC Level in AWGN
Model: x[n ] = A + w[n ], n = 0, 1, … , N − 1
Gaussian, zero mean, variance σ2
White (uncorrelated sample-to-sample)
PDF of an individual data sample:
p(x[i]) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left[-\frac{(x[i]-A)^2}{2\sigma^2}\right]
Sample-mean estimator: \hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n] \;\Rightarrow\; E\{\hat{A}\} = A. Yes! Unbiased!
• Can we get a small variance? Due to independence (white & Gaussian ⇒ independent):
\mathrm{var}(\hat{A}) = \mathrm{var}\!\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) = \frac{1}{N^2}\sum_{n=0}^{N-1}\mathrm{var}(x[n]) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}
\Rightarrow \mathrm{var}(\hat{A}) = \sigma^2/N: can make the variance small by increasing N!!!
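A minimal MATLAB sketch (not from the slides; the A, σ, N values are assumptions) verifying the unbiasedness and the σ²/N variance by Monte Carlo:

% Minimal sketch: sample-mean estimator of a DC level in WGN
A = 1.7; sigma = 2; N = 50; M = 2e4;
x = A + sigma*randn(N, M);    % M independent data records
A_hat = mean(x, 1);           % M estimates
bias = mean(A_hat) - A        % ~0
var_hat = var(A_hat)          % ~ sigma^2/N = 0.08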
Theoretical Analysis vs. Simulations
• Ideally we'd like to always be able to theoretically analyze the problem to find the bias and variance of the estimator
– Theoretical results show how performance depends on the problem
specifications
• But sometimes we make use of simulations
– to verify that our theoretical analysis is correct
– sometimes can’t find theoretical results
Course Goal = Find “Optimal” Estimators
• There are several different definitions or criteria for optimality!
• Most Logical: Minimum MSE (Mean-Square-Error)
– Minimum MSE: mse(\hat\theta) = E\{(\hat\theta - \theta)^2\} — see Sect. 2.4
To see this result, decompose the MSE (define the bias b(\theta) = E\{\hat\theta\} - \theta):
mse(\hat\theta) = E\{(\hat\theta - \theta)^2\} = E\left\{\left[(\hat\theta - E\{\hat\theta\}) + (E\{\hat\theta\} - \theta)\right]^2\right\}
= E\{(\hat\theta - E\{\hat\theta\})^2\} + 2\,b(\theta)\underbrace{E\{\hat\theta - E\{\hat\theta\}\}}_{=0} + b^2(\theta)
= \mathrm{var}\{\hat\theta\} + b^2(\theta)
Minimum Variance
Unbiased Estimators
Ch. 2: Minimum Variance Unbiased Est.
MVU
Basic Idea of MVU: Out of all unbiased estimates,
find the one with the lowest variance
(This avoids the realizability problem of MSE)
E\{\hat\theta\} = \theta \quad\text{for all } \theta
Example: Estimate DC in White Uniform Noise
x [n ] = A + w [n ] n = 0 ,1, ..., N − 1
Unbiased Estimator:
\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n] \quad\text{(same as before)}: \; E\{\hat{A}\} = A \text{ regardless of the value of } A
Biased Estimator:
\check{A}: a modified version of \hat{A} whose mean depends on the value of A. Its bias b(A) = E\{\check{A}\} - A turns out to be 0 for some values of A (A ≥ 1) but ≠ 0 for others (A < 1) ⇒ a biased estimator, since unbiasedness must hold for every value of A.
2.4 Minimum Variance Criterion
(Recall problem with MMSE criteria)
(The bias term is 0 for an MVU estimator.)
[Figure: two plots of var\{\hat\theta_i\} vs. \theta for candidates \hat\theta_1, \hat\theta_2, \hat\theta_3. In one, a single estimator has the lowest variance for all \theta — an MVU estimator exists; in the other, no estimator is lowest for every \theta — no MVU estimator exists.]
2.6 Finding the MVU Estimator
Even if MVU exists: may not be able to find it!!
Then an estimator is notated as: \hat{\boldsymbol\theta} = [\hat\theta_1\;\hat\theta_2\;\cdots\;\hat\theta_p]^T
Example PDF (DC level, single sample x[0]):
p(x[0]; A) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left[-\frac{(x[0]-A)^2}{2\sigma^2}\right]
Define: Likelihood Function (LF)
The LF = the PDF p(x;θ )
-E\left\{\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2}\right\} = \text{“expected sharpness of the LF”}
evaluated at \theta = true value; E\{\cdot\} is w.r.t. p(\mathbf{x};\theta)
3.4 Cramer-Rao Lower Bound
Theorem 3.1 CRLB for Scalar Parameter
Assume the "regularity" condition is met: E\left\{\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}\right\} = 0 \;\;\forall\theta
Then
\sigma^2_{\hat\theta} \ge \frac{1}{-E\left\{\dfrac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2}\right\}}\Bigg|_{\theta = \text{true value}}
The right-hand side is the CRLB. The expectation is
E\left\{\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2}\right\} = \int \frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2}\, p(\mathbf{x};\theta)\, d\mathbf{x}
Steps to Find the CRLB
1. Write the log-likelihood function as a function of θ: ln p(x;θ)
For the DC-level example:
p(\mathbf{x};A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right]
Now take ln to get the LLF:
\ln p(\mathbf{x};A) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2
(the first term has zero A-derivative; the second is the part that matters)
Now take the first partial w.r.t. A (the sample mean \bar{x} appears):
\frac{\partial}{\partial A}\ln p(\mathbf{x};A) = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n]-A) = \frac{N}{\sigma^2}(\bar{x} - A) \quad (\star)
Differentiating again, negating, and taking the expectation gives the CRLB = \sigma^2/N:
• Doesn't depend on A
• For fixed N & \sigma^2: increases linearly with \sigma^2, decreases inversely with N
Continuation of Theorem 3.1 on CRLB
There exists an unbiased estimator that attains the CRLB iff:
\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta} = I(\theta)\,[g(\mathbf{x}) - \theta] \quad (\star)
for some functions I(\theta) and g(\mathbf{x}). Furthermore, the estimator that achieves the CRLB is then given by \hat\theta = g(\mathbf{x}); since no unbiased estimator can do better… this is the MVU estimate!!
For the DC-level example, (\star) gives
I(A) = \frac{N}{\sigma^2} \;\Rightarrow\; \mathrm{var}\{\hat{A}\} = \frac{\sigma^2}{N} = \text{CRLB}, \qquad \hat\theta = g(\mathbf{x}) = \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]
Notes:
• Not all estimators are efficient (see next example: Phase Est.)
• Not even all MVU estimators are efficient
Signal Model: x[n] = \underbrace{A\cos(2\pi f_o n + \phi_o)}_{s[n;\phi_o]} + w[n], \quad w[n]\ \text{AWGN w/ zero mean \& variance } \sigma^2
Signal-to-Noise Ratio: \mathrm{SNR} = \frac{\text{Signal Power}}{\text{Noise Power}} = \frac{A^2/2}{\sigma^2} = \frac{A^2}{2\sigma^2}
Assumptions:
1. 0 < fo < ½ ( fo is in cycles/sample)
2. A and fo are known (we’ll remove this assumption later)
Problem: Find the CRLB for estimating the phase.
We need the PDF (exploit whiteness and the exponential form):
p(\mathbf{x};\phi) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\big(x[n]-A\cos(2\pi f_o n + \phi)\big)^2\right]
Now taking the log gets rid of the exponential; then taking the partial derivative gives (see book for details):
\frac{\partial \ln p(\mathbf{x};\phi)}{\partial\phi} = -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\sin(2\pi f_o n + \phi) - \frac{A}{2}\sin(4\pi f_o n + 2\phi)\right]
Differentiating again, negating, and taking the expectation (using E\{x[n]\} = A\cos(2\pi f_o n + \phi)) produces a cos² term; a trig identity then gives
-E\left\{\frac{\partial^2 \ln p(\mathbf{x};\phi)}{\partial\phi^2}\right\} = \frac{A^2}{2\sigma^2}\left[\sum_{n=0}^{N-1} 1 - \sum_{n=0}^{N-1}\cos(4\pi f_o n + 2\phi)\right] \approx \frac{NA^2}{2\sigma^2} = N\times\mathrm{SNR}
(the first sum is N; the second is << N if f_o is not near 0 or ½)
Now… invert to get the CRLB (SNR in non-dB form):
\mathrm{var}\{\hat\phi\} \ge \frac{1}{N\times\mathrm{SNR}}
Consider the "incremental sensitivity" of p(\mathbf{x};\theta) to changes in \theta. Letting \Delta\theta \to 0:
\tilde{S}_\theta^p(\mathbf{x}) = \lim_{\Delta\theta\to 0} S_\theta^p(\mathbf{x}) = \frac{\partial p(\mathbf{x};\theta)}{\partial\theta}\cdot\frac{\theta}{p(\mathbf{x};\theta)} = \theta\,\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}
Has the needed properties for “info” (as does “Shannon Info”):
1. I(θ ) ≥ 0 (easy to see using the alternate form of CRLB)
2. I(θ ) is additive for independent observations
follows from: ln p(x;θ ) = ln ∏ p( x[n];θ ) = ∑ ln[ p( x[n];θ )]
n n
For a deterministic signal s[n;\theta] in white, Gaussian, zero-mean noise:
Q: What is the CRLB?
\mathrm{var}(\hat\theta) \ge \frac{\sigma^2}{\displaystyle\sum_{n=0}^{N-1}\left(\frac{\partial s[n;\theta]}{\partial\theta}\right)^2}
Note: \partial s[n;\theta]/\partial\theta tells how sensitive the signal is to the parameter.
[Figure: CRLB vs. f_o (cycles/sample). Top panel: bound on variance; bottom panel: bound on std. dev., CRLB^{1/2} (cycles/sample). The bound blows up for f_o near 0 or ½.]
3.6 Transformation of Parameters
Say there is a parameter θ with known CRLBθ
But imagine that we instead are interested in estimating
some other parameter α that is a function of θ :
α = g(θ )
Q: What is CRLBα ?
\mathrm{var}(\hat\alpha) \ge \mathrm{CRLB}_\alpha = \left(\frac{\partial g(\theta)}{\partial\theta}\right)^2 \mathrm{CRLB}_\theta \quad\text{(proved in Appendix 3B)}
The derivative captures the sensitivity of \alpha to \theta.
Example: a start sensor and a stop sensor measure elapsed time T over a known distance D; the possible accuracy is set by CRLB_T. But… we really want to measure speed V = D/T. Find CRLB_V:
\mathrm{CRLB}_V = \left(\frac{\partial}{\partial T}\frac{D}{T}\right)^2\mathrm{CRLB}_T = \left(-\frac{D}{T^2}\right)^2\mathrm{CRLB}_T = \frac{V^4}{D^2}\,\mathrm{CRLB}_T
Accuracy bound: \sigma_V \ge \frac{V^2}{D}\sqrt{\mathrm{CRLB}_T}\;(m/s)
• Less accurate at high speeds (quadratic) • More accurate over large distances
Effect of Transformation on Efficiency
Suppose you have an efficient estimator of θ : θˆ
But… you are really interested in estimating α = g(θ )
[Figure: PDF of \hat\theta mapped through g(\cdot). Small N case: the PDF is widely spread over the nonlinear mapping. Large N case: the PDF is concentrated onto a linearized section, so efficiency is (asymptotically) preserved.]
3.7 CRLB for Vector Parameter Case
Vector Parameter: \boldsymbol\theta = [\theta_1\;\theta_2\;\cdots\;\theta_p]^T, its estimate \hat{\boldsymbol\theta} = [\hat\theta_1\;\hat\theta_2\;\cdots\;\hat\theta_p]^T, and estimate covariance
\mathbf{C}_{\hat\theta} = E\{[\hat{\boldsymbol\theta} - \boldsymbol\theta][\hat{\boldsymbol\theta} - \boldsymbol\theta]^T\}
For example, for \boldsymbol\theta = [x\;y\;z]^T:
\mathbf{C}_{\hat\theta} = \begin{bmatrix} \mathrm{var}(\hat{x}) & \mathrm{cov}(\hat{x},\hat{y}) & \mathrm{cov}(\hat{x},\hat{z}) \\ \mathrm{cov}(\hat{y},\hat{x}) & \mathrm{var}(\hat{y}) & \mathrm{cov}(\hat{y},\hat{z}) \\ \mathrm{cov}(\hat{z},\hat{x}) & \mathrm{cov}(\hat{z},\hat{y}) & \mathrm{var}(\hat{z}) \end{bmatrix}
Fisher Information Matrix
For the vector parameter case…
The CRLB Matrix
Then, under the same kind of regularity conditions,
the CRLB matrix is the inverse of the FIM:
CRLB = I −1 (θ)
CRLB Off-Diagonal Elements Insight Not In Book
[Figure: two scatter plots of location-error estimates (\hat{x}_e, \hat{y}_e), each with the same per-coordinate std. devs. \sigma_{\hat{x}_e}, \sigma_{\hat{y}_e}, but one uncorrelated (axis-aligned cloud) and one correlated (tilted cloud).]
Each case has the same variances… but the location accuracy characteristics are very different. ⇒ This is the effect of the off-diagonal elements of the covariance.
Should consider the effect of off-diagonal CRLB elements!!!
CRLB Matrix and Error Ellipsoids Not In Book
For a zero-mean Gaussian estimation error:
p(\hat{\boldsymbol\theta}) = \frac{1}{\sqrt{(2\pi)^N |\mathbf{C}_{\hat\theta}|}}\exp\!\left[-\tfrac{1}{2}\hat{\boldsymbol\theta}^T\mathbf{C}_{\hat\theta}^{-1}\hat{\boldsymbol\theta}\right]
Contours of constant PDF are ellipses \hat{\boldsymbol\theta}^T\mathbf{C}_{\hat\theta}^{-1}\hat{\boldsymbol\theta} = k with k = -2\ln(1 - P_e), where P_e is the probability that the estimate will lie inside the ellipse.
[Figure: error ellipses with extents ~2\sigma_{\hat{x}_e} and ~2\sigma_{\hat{y}_e}, tilted or axis-aligned depending on correlation.]
Ellipsoids and Eigen-Structure (Not In Book)
Different notations for the gradient:
\mathrm{grad}\,\phi(x_1,\ldots,x_n) = \nabla_x\phi(\mathbf{x}) = \frac{\partial\phi(\mathbf{x})}{\partial\mathbf{x}} = \left[\frac{\partial\phi}{\partial x_1}\;\cdots\;\frac{\partial\phi}{\partial x_n}\right]^T
For our quadratic form function we have:
\phi(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x} = \sum_i\sum_j a_{ij}x_i x_j \;\Rightarrow\; \frac{\partial\phi}{\partial x_k} = \sum_i\sum_j a_{ij}\frac{\partial(x_i x_j)}{\partial x_k} \quad (\clubsuit)
Product rule: \frac{\partial(x_i x_j)}{\partial x_k} = \frac{\partial x_i}{\partial x_k}x_j + x_i\frac{\partial x_j}{\partial x_k} = \delta_{ik}x_j + x_i\delta_{jk} \quad (\clubsuit\clubsuit)
By symmetry (a_{ik} = a_{ki}): \frac{\partial\phi}{\partial x_k} = 2\sum_j a_{kj}x_j
And from this we get: \nabla_x(\mathbf{x}^T\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}
Since the gradient is ⊥ to the ellipse \langle\mathbf{A}\mathbf{x}, \mathbf{x}\rangle = k, this says \mathbf{A}\mathbf{x} is ⊥ to the ellipse. At a principal axis, \mathbf{A}\mathbf{x} points along \mathbf{x} itself: \mathbf{A}\mathbf{x} = \lambda\mathbf{x} — the eigenvectors are the principal axes!!!
Note: This says that if A has a zero eigenvalue, then the error ellipse will have an infinite-length principal axis ⇒ NOT GOOD!!
Application of Eigen-Results to Error Ellipsoids
The Error Ellipsoid corresponding to the estimator covariance
matrix \mathbf{C}_{\hat\theta} must satisfy: \hat{\boldsymbol\theta}^T\mathbf{C}_{\hat\theta}^{-1}\hat{\boldsymbol\theta} = k
Note that the error ellipse is formed using the inverse covariance. Thus finding the eigenvectors/values of \mathbf{C}_{\hat\theta}^{-1} shows the structure of the error ellipse.
Recall: a positive definite matrix A and its inverse A^{-1} have the same eigenvectors and reciprocal eigenvalues.
Illustrate with the 2-D case: \hat{\boldsymbol\theta}^T\mathbf{C}_{\hat\theta}^{-1}\hat{\boldsymbol\theta} = k
[Figure: ellipse in the (\hat\theta_1, \hat\theta_2) plane with principal axes along the eigenvectors v_1, v_2 of \mathbf{C}_{\hat\theta} (not the inverse!) and semi-axis lengths \sqrt{k\lambda_1}, \sqrt{k\lambda_2}.]
The CRLB/FIM Ellipse
Can make an ellipse from the CRLB Matrix…
instead of the Cov. Matrix
This ellipse will be the smallest error ellipse that an unbiased estimator
can achieve!
3.8 Vector Transformations
Just like for the scalar case…. α = g(θ)
If you know CRLBθ you can find CRLBα
\mathrm{CRLB}_\alpha = \frac{\partial\mathbf{g}(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\underbrace{\mathbf{I}^{-1}(\boldsymbol\theta)}_{\text{CRLB on }\theta}\,\frac{\partial\mathbf{g}(\boldsymbol\theta)^T}{\partial\boldsymbol\theta}
where \partial\mathbf{g}(\boldsymbol\theta)/\partial\boldsymbol\theta is the Jacobian matrix (see p. 46).
Example: Usually can estimate Range (R) and Bearing (ϕ) directly
But might really want emitter (x, y)
1
Example of Vector Transform: can estimate Range (R) and Bearing (\phi) directly, but might really want the emitter location (x_e, y_e):
Direct parameters \boldsymbol\theta = \begin{bmatrix} R \\ \phi \end{bmatrix} \;\mapsto\; mapped parameters \boldsymbol\alpha = \mathbf{g}(\boldsymbol\theta) = \begin{bmatrix} x_e \\ y_e \end{bmatrix} = \begin{bmatrix} R\cos\phi \\ R\sin\phi \end{bmatrix}
Jacobian matrix:
\frac{\partial\mathbf{g}(\boldsymbol\theta)}{\partial\boldsymbol\theta} = \begin{bmatrix} \dfrac{\partial R\cos\phi}{\partial R} & \dfrac{\partial R\cos\phi}{\partial\phi} \\ \dfrac{\partial R\sin\phi}{\partial R} & \dfrac{\partial R\sin\phi}{\partial\phi} \end{bmatrix} = \begin{bmatrix} \cos\phi & -R\sin\phi \\ \sin\phi & R\cos\phi \end{bmatrix}
\mathrm{CRLB}_\alpha = \frac{\partial\mathbf{g}(\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\mathrm{CRLB}_\theta\,\frac{\partial\mathbf{g}(\boldsymbol\theta)^T}{\partial\boldsymbol\theta}
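A minimal MATLAB sketch (not from the slides; the range, bearing, and bound values are assumptions) applying this Jacobian transformation:

% Minimal sketch: map a range/bearing CRLB into an (x_e, y_e) CRLB
R = 1000; phi = pi/4;                    % assumed true parameters
CRLB_theta = diag([25, (0.01)^2]);       % assumed bounds: R (m^2), phi (rad^2)
J = [cos(phi), -R*sin(phi); ...
     sin(phi),  R*cos(phi)];             % Jacobian dg/dtheta
CRLB_alpha = J * CRLB_theta * J'         % bound on (x_e, y_e) covariance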
3.9 CRLB for General Gaussian Case
In Sect. 3.5 we saw the CRLB for “signal + AWGN”
For that case we saw: Deterministic Signal w/
The PDF’s parameter-dependence Scalar Deterministic Parameter
showed up only in the mean of the PDF
4
Gen. Gauss. Ex.: Time-Difference-of-Arrival
Given receivers Rx1 and Rx2 observing a transmitted signal, e.g. x_1(t) = s(t - \Delta\tau) + w_1(t), stack the data: \mathbf{x} = [\mathbf{x}_1^T\;\mathbf{x}_2^T]^T.
Case #1 (deterministic signal): the \Delta\tau-dependence is in the mean, \boldsymbol\mu(\Delta\tau) = [s_1[0;\Delta\tau]\;\cdots\;s_1[N-1;\Delta\tau]\;s_2[0;\Delta\tau]\;\cdots\;s_2[N-1;\Delta\tau]]^T, with \mathbf{C}(\Delta\tau) = \mathbf{C} (so one FIM term drops).
Case #2 (random signal): the \Delta\tau-dependence is in the covariance, \boldsymbol\mu(\Delta\tau) = \mathbf{0} and
\mathbf{C}(\Delta\tau) = \begin{bmatrix} \mathbf{C}_{11} & \mathbf{C}_{12}(\Delta\tau) \\ \mathbf{C}_{21}(\Delta\tau) & \mathbf{C}_{22} \end{bmatrix}, \quad \mathbf{C}_{ii} = \mathbf{C}_{s_i s_i} + \mathbf{C}_{w_i w_i}, \quad \mathbf{C}_{ij}(\Delta\tau) = \mathbf{C}_{s_i s_j}(\Delta\tau)
Comments on General Gaussian CRLB
It is interesting to note that for any given problem you may find
each case used in the literature!!!
6
3.11 CRLB Examples
We’ll now apply the CRLB theory to several examples of
practical signal processing problems.
We’ll revisit these examples in Ch. 7… we’ll derive ML
estimators that will get close to achieving the CRLB
1. Range Estimation
– sonar, radar, robotics, emitter location
2. Sinusoidal Parameter Estimation (Amp., Frequency, Phase)
– sonar, radar, communication receivers (recall DSB Example), etc.
3. Bearing Estimation
– sonar, radar, emitter location
4. Autoregressive Parameter Estimation
– speech processing, econometrics
1
Ex. 1 Range Estimation Problem
Transmit Pulse: s(t) nonzero over t∈[0,Ts]
Receive Reflection: s(t – τo)
Measure Time Delay: τo
[Figure: bandlimited pulse s(t) nonzero over [0, T_s]; the receive chain (BPF & amp) outputs x(t) containing s(t − τ_o); the noise w(t) is white Gaussian with PSD N_o/2 over the band [−B, B].]
Range Estimation D-T Signal Model
After sampling, the noise has \sigma^2 = BN_o and
x[n] = \begin{cases} w[n], & 0 \le n \le n_o - 1 \\ s(n\Delta - \tau_o) + w[n], & n_o \le n \le n_o + M - 1 \\ w[n], & n_o + M \le n \le N - 1 \end{cases}
Range Estimation CRLB
Now apply the standard CRLB result for signal + WGN (plug in… and keep the non-zero terms):
\mathrm{var}(\hat\tau_o) \ge \frac{\sigma^2}{\displaystyle\sum_{n=0}^{N-1}\left(\frac{\partial s[n;\tau_o]}{\partial\tau_o}\right)^2} = \frac{\sigma^2}{\displaystyle\sum_{n=n_o}^{n_o+M-1}\left(\frac{\partial s(n\Delta-\tau_o)}{\partial\tau_o}\right)^2} = \frac{\sigma^2}{\displaystyle\sum_{n=0}^{M-1}\left(\frac{\partial s(t)}{\partial t}\Big|_{t=n\Delta}\right)^2}
Approximating the sum by an integral (\sum \approx \tfrac{1}{\Delta}\int) and using \sigma^2\Delta = N_o/2:
\mathrm{var}(\hat\tau_o) \ge \frac{\sigma^2}{\frac{1}{\Delta}\int_0^{T_s}\left(\frac{\partial s(t)}{\partial t}\right)^2 dt} = \frac{N_o/2}{\int_0^{T_s}\left(\frac{\partial s(t)}{\partial t}\right)^2 dt}
Using the FT differentiation theorem & Parseval, with signal energy E_s = \int_0^{T_s} s^2(t)\,dt and the RMS bandwidth measure
B_{rms}^2 = \frac{\int_{-\infty}^{\infty}(2\pi f)^2|S(f)|^2\,df}{\int_{-\infty}^{\infty}|S(f)|^2\,df} \quad (B_{rms}\text{ is the “RMS BW”})
gives
\mathrm{var}(\hat\tau_o) \ge \frac{1}{\dfrac{E_s}{N_o/2}\,B_{rms}^2}
where E_s/(N_o/2) is a type of "SNR".
Range Estimation CRLB (cont.)
Using these ideas we arrive at the CRLB on the delay:
\mathrm{var}(\hat\tau_o) \ge \frac{1}{\mathrm{SNR}_E\times B_{rms}^2}\;(\sec^2)
where, with signal power P_s = E_s/T_s and noise power P_n = \frac{N_o}{2}\times(2B),
\mathrm{SNR}_E = \frac{E_s}{N_o/2} = 2BT_s\,\mathrm{SNR}, \qquad \mathrm{SNR} = \frac{P_s}{P_n}
Thus…
\mathrm{var}(\hat\tau_o) \ge \frac{1}{2BT_s\,\mathrm{SNR}\times B_{rms}^2}\;(\sec^2)
Range Estimation CRLB (cont.)
Converting to range with R = c\tau_o/2 (transformation of parameters):
\mathrm{var}(\hat{R}) \ge \left(\frac{\partial R}{\partial\tau_o}\right)^2\mathrm{CRLB}_{\hat\tau_o} = \frac{c^2/4}{2BT_s\,\mathrm{SNR}\times B_{rms}^2}\;(m^2)
Ex. 2 Sinusoid Estimation CRLB Problem
Given DT signal samples of a sinusoid in noise….
Estimate its amplitude, frequency, and phase
x[n] = A\cos(\Omega_o n + \phi) + w[n], \quad n = 0, 1, \ldots, N-1
\mathrm{SNR} = \frac{P_s}{P_n} = \frac{A^2/2}{\sigma^2} = \frac{A^2}{2\sigma^2}
Sinusoid Estimation CRLB Approach
Approach:
• Find Fisher Info Matrix
• Invert to get CRLB matrix
• Look at diagonal elements to get bounds on parm variances
Sinusoid Estimation Fisher Info Elements
Taking the partial derivatives and using approximations given in
book (valid when Ωo is not near 0 or π) : θ = [ A Ω o φ ]T
[\mathbf{I}(\boldsymbol\theta)]_{11} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}\cos^2(\Omega_o n + \phi) = \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\big(1 + \cos(2\Omega_o n + 2\phi)\big) \approx \frac{N}{2\sigma^2}
[\mathbf{I}(\boldsymbol\theta)]_{12} = [\mathbf{I}(\boldsymbol\theta)]_{21} = -\frac{1}{\sigma^2}\sum_{n=0}^{N-1} An\cos(\Omega_o n+\phi)\sin(\Omega_o n+\phi) = -\frac{A}{2\sigma^2}\sum_{n=0}^{N-1} n\sin(2\Omega_o n + 2\phi) \approx 0
[\mathbf{I}(\boldsymbol\theta)]_{13} = [\mathbf{I}(\boldsymbol\theta)]_{31} = -\frac{1}{\sigma^2}\sum_{n=0}^{N-1} A\cos(\Omega_o n+\phi)\sin(\Omega_o n+\phi) = -\frac{A}{2\sigma^2}\sum_{n=0}^{N-1}\sin(2\Omega_o n + 2\phi) \approx 0
[\mathbf{I}(\boldsymbol\theta)]_{22} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(An)^2\sin^2(\Omega_o n+\phi) = \frac{A^2}{2\sigma^2}\sum_{n=0}^{N-1} n^2\big(1 - \cos(2\Omega_o n + 2\phi)\big) \approx \frac{A^2}{2\sigma^2}\sum_{n=0}^{N-1} n^2
[\mathbf{I}(\boldsymbol\theta)]_{23} = [\mathbf{I}(\boldsymbol\theta)]_{32} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1} A^2 n\sin^2(\Omega_o n+\phi) \approx \frac{A^2}{2\sigma^2}\sum_{n=0}^{N-1} n
[\mathbf{I}(\boldsymbol\theta)]_{33} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1} A^2\sin^2(\Omega_o n+\phi) \approx \frac{NA^2}{2\sigma^2}
Sinusoid Estimation Fisher Info Matrix
\mathbf{I}(\boldsymbol\theta) \approx \begin{bmatrix} \dfrac{N}{2\sigma^2} & 0 & 0 \\ 0 & \dfrac{A^2}{2\sigma^2}\sum_{n=0}^{N-1} n^2 & \dfrac{A^2}{2\sigma^2}\sum_{n=0}^{N-1} n \\ 0 & \dfrac{A^2}{2\sigma^2}\sum_{n=0}^{N-1} n & \dfrac{NA^2}{2\sigma^2} \end{bmatrix}, \qquad \boldsymbol\theta = [A\;\Omega_o\;\phi]^T
Recall… \mathrm{SNR} = \frac{A^2}{2\sigma^2} and use closed-form results for these sums.
Sinusoid Estimation CRLBs (using the co-factor & det approach… helped by the 0's)
Inverting the FIM by hand gives the CRLB matrix… and then extracting the diagonal elements gives the three bounds:
\mathrm{var}(\hat{A}) \ge \frac{2\sigma^2}{N}\;(\text{volts}^2)
\mathrm{var}(\hat\Omega_o) \ge \frac{12}{\mathrm{SNR}\times N(N^2-1)}\;((\text{rad/sample})^2) \quad\text{(to convert to Hz}^2\text{, multiply by }(F_s/2\pi)^2)
\mathrm{var}(\hat\phi) \ge \frac{2(2N-1)}{\mathrm{SNR}\times N(N+1)} \approx \frac{4}{\mathrm{SNR}\times N}\;(\text{rad}^2)
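A minimal MATLAB sketch (not from the slides; N and the SNR value are assumptions) evaluating the three bounds numerically:

% Minimal sketch: numeric values of the sinusoid CRLBs
N = 256; SNR = 10^(10/10); sigma2 = 1;   % assumed: 10 dB SNR
crlb_A   = 2*sigma2/N                    % volts^2
crlb_Om  = 12/(SNR*N*(N^2 - 1))          % (rad/sample)^2
crlb_phi = 2*(2*N - 1)/(SNR*N*(N + 1))   % rad^2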
Ex. 3 Bearing Estimation CRLB Problem
Figure 3.8 Emits or reflects
from textbook: signal s(t)
s(t) = A\cos(2\pi f_o t + \phi) \quad\text{(simple model)}
Propagation time to the nth sensor (spacing d, speed c, bearing \beta):
t_n = t_0 - n\,\frac{d}{c}\cos\beta, \quad n = 0, 1, \ldots, M-1
Signal at the nth sensor:
s_n(t) = \alpha s(t - t_n) = A\cos\!\left(2\pi f_o\left[t - t_0 + n\frac{d}{c}\cos\beta\right] + \phi\right)
Bearing Estimation Snapshot of Sensor Signals
Now instead of sampling each sensor at lots of time instants…
we just grab one “snapshot” of all M sensors at a single instant ts
s_n(t_s) = A\cos\!\left(2\pi f_o\left[t_s - t_0 + n\frac{d}{c}\cos\beta\right] + \phi\right) = A\cos(\Omega_s n + \tilde\phi)
a spatial sinusoid with spatial frequency \Omega_s = \frac{2\pi f_o d}{c}\cos\beta.
x[n] = s_n(t_s) + w[n] = A\cos(\Omega_s n + \tilde\phi) + w[n]
Each w[n] is a noise sample that comes from a different sensor, so model them as uncorrelated Gaussian RVs (same as white temporal noise); assume each sensor has the same noise variance \sigma^2.
So… the parameters to consider are: \boldsymbol\theta = [A\;\Omega_s\;\tilde\phi]^T
which get transformed to:
\boldsymbol\alpha = \mathbf{g}(\boldsymbol\theta) = \begin{bmatrix} A \\ \beta \\ \tilde\phi \end{bmatrix} = \begin{bmatrix} A \\ \arccos\!\left(\dfrac{c\,\Omega_s}{2\pi f_o d}\right) \\ \tilde\phi \end{bmatrix} \quad (\beta\text{ is the parameter of interest!})
Bearing Estimation CRLB Result
Using the FIM for the sinusoidal parameter problem… together
with the transform. of parms result (see book p. 59 for details):
\mathrm{var}(\hat\beta) \ge \frac{12}{(2\pi)^2\,\mathrm{SNR}\times M\,\dfrac{M+1}{M-1}\left(\dfrac{L}{\lambda}\right)^2\sin^2(\beta)}\;(\text{rad}^2)
Define: L_r = L/\lambda, the array length "in wavelengths", where L = array physical length in meters, M = number of array elements, \lambda = c/f_o = wavelength in meters (per cycle).
• Bearing Accuracy:
– Decreases as 1/SNR – Depends on the actual bearing \beta
– Decreases as 1/M Best at \beta = \pi/2 ("broadside")
– Decreases as 1/L_r^2 Impossible at \beta = 0! ("endfire")
σ u2 p
ln Pxx ( f ; θ) = ln
p
2
= ln σ u2 − ln 1 + ∑ a[m]e − j 2πfm
m =1
1+ ∑ a[m]e − j 2πfm
m =1
19
AR Estimation CRLB Asymptotic Result
After taking these derivatives… you get results that can be
simplified using properties of FT and convolution. Complicated
dependence on
The final result is:
\mathrm{var}(\hat{a}[k]) \ge \frac{\sigma_u^2}{N}\left[\mathbf{R}_{xx}^{-1}\right]_{kk}, \quad k = 1, 2, \ldots, p \quad\text{(complicated dependence on the AC matrix!!)}
\mathrm{var}(\hat\sigma_u^2) \ge \frac{2\sigma_u^4}{N}
Both decrease as 1/N. The bound on \hat{a}[1] improves as the pole gets closer to the unit circle… PSDs with sharp peaks are easier to estimate.
CRLB Example:
Single-Rx Emitter Location via Doppler
[Figure: a moving receiver observes s(t; f_1) over [t_1, t_1+T], s(t; f_2) over [t_2, t_2+T], and s(t; f_3) over [t_3, t_3+T] from an emitter at unknown (X, Y, Z, f_o).]
Problem Background
Radar to be Located: at Unknown Location (X,Y,Z)
Transmits Radar Signal at Unknown Carrier Frequency fo
Physics of Problem
Relative motion between emitter and receiver causes a Doppler shift of the carrier frequency (u(t) is the unit vector along the line-of-sight, v(t) the receiver velocity):
f(t, \mathbf{x}) = f_o - \frac{f_o}{c}\,\mathbf{v}(t)\cdot\mathbf{u}(t) = f_o - \frac{f_o}{c}\cdot\frac{V_x(t)\big(X_p(t)-X\big) + V_y(t)\big(Y_p(t)-Y\big) + V_z(t)\big(Z_p(t)-Z\big)}{\sqrt{\big(X_p(t)-X\big)^2 + \big(Y_p(t)-Y\big)^2 + \big(Z_p(t)-Z\big)^2}}
The measured frequencies are noisy: \tilde{f}(t_i, \mathbf{x}) = f(t_i, \mathbf{x}) + v(t_i)
Estimation Problem Statement
Given the data vector (a vector-valued function of the vector x):
\tilde{\mathbf{f}}(\mathbf{x}) = [\tilde{f}(t_1, \mathbf{x})\;\tilde{f}(t_2, \mathbf{x})\;\cdots\;\tilde{f}(t_N, \mathbf{x})]^T
(I use J for the FIM instead of I to avoid confusion with the identity matrix.)
Convenient Form for FIM — define "The Jacobian" of f(x):
\mathbf{H} = \frac{\partial}{\partial\mathbf{x}}\mathbf{f}(\mathbf{x})\Big|_{\mathbf{x}=\text{true value}} = [\mathbf{h}_1\,|\,\mathbf{h}_2\,|\,\mathbf{h}_3\,|\,\mathbf{h}_4], \qquad \mathbf{h}_j = \begin{bmatrix} \partial f(t_1,\mathbf{x})/\partial x_j \\ \partial f(t_2,\mathbf{x})/\partial x_j \\ \vdots \\ \partial f(t_N,\mathbf{x})/\partial x_j \end{bmatrix}_{\mathbf{x}=\text{true value}}
\mathbf{J} = \mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}
CRLB Matrix
The Cramer-Rao bound covariance matrix then is:
\mathbf{C}_{CRB}(\mathbf{x}) = \mathbf{J}^{-1} = \left[\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right]^{-1}
For a zero-mean Gaussian parameter-error vector partitioned into x and y blocks:
p(\boldsymbol\theta) = \frac{1}{(2\pi)^{N/2}\sqrt{\det(\mathbf{C}_\theta)}}\exp\!\left\{-\tfrac{1}{2}\boldsymbol\theta^T\mathbf{C}_\theta^{-1}\boldsymbol\theta\right\}, \qquad \mathbf{C}_\theta = \begin{bmatrix} \mathbf{C}_x & \mathbf{C}_{xy} \\ \mathbf{C}_{yx} & \mathbf{C}_y \end{bmatrix}
Finding Projections
To find the projection of the CRLB ellipse:
1. Invert the FIM to get CCRB
2. Select the submatrix CCRB,sub from CCRB
3. Invert CCRB,sub to get Jproj
4. Compute the ellipse for the quadratic form of Jproj
Mathematically: \mathbf{C}_{CRB,sub} = \mathbf{P}\mathbf{C}_{CRB}\mathbf{P}^T, so
\mathbf{J}_{proj} = \left(\mathbf{P}\mathbf{J}^{-1}\mathbf{P}^T\right)^{-1}
P is a matrix formed from the identity matrix: keep only the rows of the variables being projected onto.
Slices of Error Ellipsoids
Q: What happens if one parameter were perfectly known?
Capture this by setting that parameter's error to zero ⇒ a slice through the error ellipsoid.
Impact:
• slice = projection when ellipsoid not tilted
• slice < projection when ellipsoid is tilted.
Recall: Correlation causes tilt
12
Chapter 4
Linear Models
1
General Linear Model
Recall signal + WGN case: x[n] = s[n;θ] + w[n]
x = s(θ) + w Here, dependence on θ is general
Now we consider a special case: Linear “Observations”:
s(θ) = Hθ + b
p×1 known “offset”(p×1)
N×1 known “observation
matrix” (N×p)
Q: Why?
3
Importance of The Linear Model
There are several reasons:
\hat{\boldsymbol\theta}_{MVU} = \left(\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{C}^{-1}(\mathbf{x}-\mathbf{b}) \quad\text{… as we'll see!!!}
MVUE for Linear Model
Theorem: The MVUE for the General Linear Model and its
covariance (i.e. its accuracy performance) are given by:
\hat{\boldsymbol\theta}_{MVU} = \left(\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{C}^{-1}(\mathbf{x}-\mathbf{b})
\mathbf{C}_{\hat\theta} = \left(\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right)^{-1}
and achieves the CRLB.
Proof: We’ll do this for the b = 0 case but it can easily be done
for the more general case.
\frac{\partial\ln p}{\partial\boldsymbol\theta} = -\frac{1}{2}\frac{\partial}{\partial\boldsymbol\theta}\Big[\underbrace{\mathbf{x}^T\mathbf{C}^{-1}\mathbf{x}}_{\text{constant w.r.t. }\theta} - \underbrace{2\,\mathbf{x}^T\mathbf{C}^{-1}\mathbf{H}\boldsymbol\theta}_{\text{linear w.r.t. }\theta} + \underbrace{\boldsymbol\theta^T\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\boldsymbol\theta}_{\text{quadratic w.r.t. }\theta}\Big]
(Note: H^T C^{-1} H is symmetric.)
Thus, from (A1.2): for pos. def. C there exists an N×N invertible matrix D such that \mathbf{C}^{-1} = \mathbf{D}^T\mathbf{D}. Claim: \tilde{\mathbf{w}} = \mathbf{D}\mathbf{w} is white:
E\{\tilde{\mathbf{w}}\tilde{\mathbf{w}}^T\} = E\{(\mathbf{D}\mathbf{w})(\mathbf{D}\mathbf{w})^T\} = \mathbf{D}E\{\mathbf{w}\mathbf{w}^T\}\mathbf{D}^T = \mathbf{D}\mathbf{C}\mathbf{D}^T = \mathbf{D}\mathbf{D}^{-1}(\mathbf{D}^T)^{-1}\mathbf{D}^T = \mathbf{I}
So the MVUE can be implemented as a whitening filter D applied to x, followed by the MVUE for the linear model with white noise.
Ex. 4.1: Curve Fitting
Caution: The “Linear” in “Linear Model”
does not come from fitting straight lines to data
It is more general than that !!
x[n] Data
8
Ex. 4.2: Fourier Analysis (not most general)
Data Model: x[n] = \sum_{k=1}^{M} a_k\cos\!\left(\frac{2\pi kn}{N}\right) + \sum_{k=1}^{M} b_k\sin\!\left(\frac{2\pi kn}{N}\right) + w[n] \quad\text{(AWGN)}
Parameters to estimate: the a_k and b_k.
Observation matrix H: down each column, \cos(2\pi kn/N) or \sin(2\pi kn/N) for n = 0, 1, \ldots, N-1, with one column per k = 1, 2, \ldots, M.
Now apply the MVUE Theorem for the Linear Model (C = \sigma^2\mathbf{I}):
\hat{\boldsymbol\theta}_{MVU} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x} = \frac{2}{N}\mathbf{H}^T\mathbf{x}, \quad\text{since } \mathbf{H}^T\mathbf{H} = \frac{N}{2}\mathbf{I}
using the standard orthogonality of sinusoids (see book). Each Fourier coefficient estimate is found by the inner product of a column of H with the data vector x.
\mathbf{C}_{\hat\theta} = \sigma^2\left(\mathbf{H}^T\mathbf{H}\right)^{-1} = \frac{2\sigma^2}{N}\mathbf{I}
and achieves the CRLB.
Q: What signal u[n] is best to use ?
1
Motivation for BLUE
Except for Linear Model case, the optimal MVU estimator might:
1. not even exist
2. be difficult or impossible to find
⇒ Resort to a sub-optimal estimate
BLUE is one such sub-optimal estimate
Idea for BLUE:
1. Restrict estimate to be linear in data x
2. Restrict estimate to be unbiased
3. Find the best one (i.e. with minimum variance)
Linear
Unbiased
Variance
3
6.4 Finding The BLUE (Scalar Case)
1. Constrain to be linear: \hat\theta = \sum_{n=0}^{N-1} a_n x[n]
2. Constrain to be unbiased: E\{\hat\theta\} = \sum_{n=0}^{N-1} a_n E\{x[n]\} = \theta
Finding BLUE for Scalar Linear Observations
Consider scalar-parameter linear observation:
x[n] = θs[n] + w[n] ⇒ E{x[n]} = θs[n]
Then for the unbiased condition we need:
E\{\hat\theta\} = \theta \;\Rightarrow\; \sum_{n=0}^{N-1} a_n s[n] = 1 \;\Rightarrow\; \mathbf{a}^T\mathbf{s} = 1
This tells how to choose the weights to use in the BLUE estimator form \hat\theta = \sum_{n=0}^{N-1} a_n x[n].
\mathrm{var}\{\hat\theta_{BLU}\} = \mathrm{var}\{\mathbf{a}^T\mathbf{x}\} = \mathbf{a}^T\mathbf{C}\mathbf{a} \quad\text{(like } \mathrm{var}\{aX\} = a^2\,\mathrm{var}\{X\}\text{)}
Goal: minimize aTCa subject to aTs = 1
⇒ Constrained optimization
Appendix 6A: use Lagrange multipliers. Minimize J = \mathbf{a}^T\mathbf{C}\mathbf{a} + \lambda(\mathbf{a}^T\mathbf{s} - 1):
Set \frac{\partial J}{\partial\mathbf{a}} = \mathbf{0} \;\Rightarrow\; \mathbf{a} = -\frac{\lambda}{2}\mathbf{C}^{-1}\mathbf{s}
Impose \mathbf{a}^T\mathbf{s} = 1: \; -\frac{\lambda}{2}\,\mathbf{s}^T\mathbf{C}^{-1}\mathbf{s} = 1 \;\Rightarrow\; -\frac{\lambda}{2} = \frac{1}{\mathbf{s}^T\mathbf{C}^{-1}\mathbf{s}} \;\Rightarrow\; \mathbf{a} = \frac{\mathbf{C}^{-1}\mathbf{s}}{\mathbf{s}^T\mathbf{C}^{-1}\mathbf{s}}
\hat\theta_{BLUE} = \mathbf{a}^T\mathbf{x} = \frac{\mathbf{s}^T\mathbf{C}^{-1}\mathbf{x}}{\mathbf{s}^T\mathbf{C}^{-1}\mathbf{s}}, \qquad \mathrm{var}(\hat\theta) = \frac{1}{\mathbf{s}^T\mathbf{C}^{-1}\mathbf{s}}
6.5 Vector Parameter Case: Gauss-Markov Thm
Gauss-Markov Theorem:
If data can be modeled as having linear observations in noise:
x = Hθ + w
Known Matrix Known Mean & Cov
(PDF is otherwise
arbitrary & unknown)
Then the BLUE is: \hat{\boldsymbol\theta}_{BLUE} = \left(\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}
and its covariance is: \mathbf{C}_{\hat\theta} = \left(\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right)^{-1}
[Figure: three receivers Rx1 (x_1,y_1), Rx2 (x_2,y_2), Rx3 (x_3,y_3) receive delayed copies s(t − t_1), s(t − t_2), s(t − t_3); each constant TDOA, \tau_{12} = t_2 − t_1 and \tau_{23} = t_3 − t_2, defines a hyperbola. TDOA = Time-Difference-of-Arrival.]
TOA model: t_i = T_o + R_i/c + \varepsilon_i, \quad i = 0, 1, \ldots, N-1
Apply to TOA (removing the known nominal-range term and linearizing about a nominal emitter location, with known coefficients A_i, B_i):
\tilde{t}_i = t_i - \frac{R_n}{c} = T_o + \frac{A_i}{c}\delta x_s + \frac{B_i}{c}\delta y_s + \varepsilon_i
Conversion to TDOA Model (N–1 TDOAs rather than N TOAs)
TDOAs: \tau_i = \tilde{t}_i - \tilde{t}_{i-1}, \quad i = 1, 2, \ldots, N-1
\tau_i = \frac{A_i - A_{i-1}}{c}\delta x_s + \frac{B_i - B_{i-1}}{c}\delta y_s + \underbrace{\varepsilon_i - \varepsilon_{i-1}}_{\text{correlated noise}}
In matrix form: \mathbf{x} = \mathbf{H}\boldsymbol\theta + \mathbf{w}, with \mathbf{x} = [\tau_1\;\tau_2\;\cdots\;\tau_{N-1}]^T, \boldsymbol\theta = [\delta x_s\;\delta y_s]^T,
\mathbf{H} = \frac{1}{c}\begin{bmatrix} A_1 - A_0 & B_1 - B_0 \\ A_2 - A_1 & B_2 - B_1 \\ \vdots & \vdots \\ A_{N-1} - A_{N-2} & B_{N-1} - B_{N-2} \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} \varepsilon_1 - \varepsilon_0 \\ \varepsilon_2 - \varepsilon_1 \\ \vdots \\ \varepsilon_{N-1} - \varepsilon_{N-2} \end{bmatrix} = \mathbf{A}\boldsymbol\varepsilon
Apply TDOA Result to Simple Geometry
[Figure: transmitter at range R broadside to a 3-element receiver line array Rx1, Rx2, Rx3 with spacing d; α is the angle from the center receiver to the outer receivers.]
Then can show:
\mathbf{C}_{\hat\theta} = \sigma^2 c^2\begin{bmatrix} \dfrac{1}{2\cos^2\alpha} & 0 \\ 0 & \dfrac{3/2}{(1-\sin\alpha)^2} \end{bmatrix}
A diagonal error covariance ⇒ an aligned error ellipse.
[Figure: normalized accuracies \sigma_x/c\sigma and \sigma_y/c\sigma vs. \alpha (degrees), log scale. Std. dev. is used to show the units of X & Y; values are normalized by c\sigma — get actual values by multiplying by your specific c\sigma value.]
Motivation for MLE
Problems: 1. MVUE often does not exist or can’t be found
<See Ex. 7.1 in the textbook for such a case>
2. BLUE may not be applicable (x ≠ Hθ + w)
This makes the MLE one of the most popular practical methods
Rationale for MLE
Choose the parameter value that:
makes the data you did observe…
the most likely data to have been observed!!!
Consider 2 possible parameter values: θ1 & θ2
Ask the following: If θi were really the true value, what is the
probability that I would get the data set I really got ?
Let this probability be Pi
Definition of the MLE
θˆML is the value of θ that maximizes the “Likelihood
Function” p(x;θ) for the specific measured data x
[Figure: likelihood function p(x;θ) vs. θ; \hat\theta_{ML} is located at its peak.]
Note: Because ln(z) is a monotonically increasing function…
θˆML maximizes the log likelihood function ln{p(x; θ)}
Expand this (here the model is x[n] ∼ N(A, A) — the noise variance also equals A — see book): setting the LLF derivative to zero, expanding, and canceling terms, then manipulating, gives
\hat{A}^2 + \hat{A} - \frac{1}{N}\sum_{n=0}^{N-1} x^2[n] = 0
Asymptotically… unbiased & efficient:
\mathrm{var}(\hat{A}) \to \frac{A^2}{N\left(A + \tfrac{1}{2}\right)} = \mathrm{CRLB}
7.5 Properties of the MLE (or… “Why We Love MLE”)
Monte Carlo Simulations: see Appendix 7A
A methodology for doing computer simulations to evaluate
performance of any estimation method Not just for the MLE!!!
Illustrate for deterministic signal s[n; θ ] in AWGN
Monte Carlo Simulation:
Data Collection:
1. Select a particular true parameter value, θtrue
- you are often interested in doing this for a variety of values of θ
so you would run one MC simulation for each θ value of interest
2. Generate signal having true θ: s[n;θt] (call it s in matlab)
3. Generate WGN having unit variance
w = randn ( size(s) );
4. Form measured data: x = s + sigma*w;
- choose σ to get the desired SNR
- usually want to run at many SNR values
→ do one MC simulation for each SNR value
Data Collection (Continued):
5. Compute estimate from data x
6. Repeat steps 3-5 M times
- (call M “# of MC runs” or just “# of runs”)
7. Store all M estimates in a vector EST (assumes scalar θ)
Statistical Evaluation:
1. Compute bias: b = \frac{1}{M}\sum_{i=1}^{M}\hat\theta_i - \theta_{true}
2. Compute error RMS: RMS = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(\hat\theta_i - \theta_{true}\right)^2}
Now explore (via plots) how bias, RMS, and VAR vary with: θ value, SNR value, N value, etc.
Is b ≈ 0? Is RMS ≈ (CRLB)^{1/2}?
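A minimal MATLAB sketch (not from the slides; the parameter values are assumptions) pulling the above steps into one script, using the sample-mean estimator of a DC level in WGN — any estimator could be dropped in at the "estimate" line:

% Minimal sketch: Monte Carlo evaluation of an estimator
theta_true = 2; N = 100; sigma = 1; M = 5000;
s = theta_true * ones(N, 1);          % signal having the true theta
EST = zeros(M, 1);
for i = 1:M
    w = randn(size(s));               % unit-variance WGN
    x = s + sigma*w;                  % measured data at this SNR
    EST(i) = mean(x);                 % estimate from data x
end
b   = mean(EST) - theta_true             % bias, ~0
RMS = sqrt(mean((EST - theta_true).^2))  % error RMS
sqrt(sigma^2/N)                          % (CRLB)^(1/2) for comparison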
Ex. 7.6: Phase Estimation for a Sinusoid
Some Applications:
1. Demodulation of phase coherent modulations
(e.g., DSB, SSB, PSK, QAM, etc.)
2. Phase-Based Bearing Estimation
Recall the CRLB: \mathrm{var}(\hat\phi) \ge \frac{2\sigma^2}{NA^2} = \frac{1}{N\cdot\mathrm{SNR}}
For this problem… all methods for finding the MVUE will fail!!
⇒ So… try MLE!!
So first we write the likelihood function:
p(\mathbf{x};\phi) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\big(x[n]-A\cos(2\pi f_o n + \phi)\big)^2\right]
GOAL: finding the \phi that maximizes this is equivalent to minimizing
J(\phi) = \sum_{n=0}^{N-1}\big(x[n]-A\cos(2\pi f_o n + \phi)\big)^2
(we end up in the same place if we maximize the LLF). Setting \partial J(\phi)/\partial\phi = 0 and dropping a double-frequency term that is ≈ 0 (sin and cos are ⊥ when summed over full cycles) gives the MLE phase estimate as the solution of
\sum_{n=0}^{N-1} x[n]\sin\!\big(2\pi f_o n + \hat\phi\big) = 0
(Interpret via inner product or correlation.)
12
Now… using a trig identity and then re-arranging gives:
\cos(\hat\phi)\sum_n x[n]\sin(2\pi f_o n) = -\sin(\hat\phi)\sum_n x[n]\cos(2\pi f_o n)
[Figure: quadrature receiver — x(t) mixed with cos(2πf_o t) and −sin(2πf_o t), followed by LPFs producing in-phase and quadrature components.] The "sums" in the above equation play the role of the LPFs in the figure (why?). Thus, the ML phase estimator can be viewed as: atan of the ratio of Q/I.
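A minimal MATLAB sketch (not from the slides; the signal parameters are assumptions) of this Q/I form of the ML phase estimator:

% Minimal sketch: ML phase estimate via atan of Q over I
N = 512; fo = 0.12; A = 1; phi = 0.7; sigma = 1;
n = (0:N-1).';
x = A*cos(2*pi*fo*n + phi) + sigma*randn(N,1);
I = sum(x .* cos(2*pi*fo*n));       % in-phase correlation
Q = sum(x .* sin(2*pi*fo*n));       % quadrature correlation
phi_hat = atan2(-Q, I)              % satisfies the ML condition; ~0.7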
Monte Carlo Results for ML Phase Estimation
14
7.6 MLE for Transformed Parameters
Given PDF p(x;θ ) but want an estimate of α = g (θ )
What is the MLE for α ??
\hat\alpha_{ML} maximizes p(\mathbf{x}; g^{-1}(\alpha)) when \alpha = g(\theta) is one-to-one.
2. If \alpha = g(\theta) is not a one-to-one function, we need to define a modified likelihood function:
\bar{p}_T(\mathbf{x};\alpha) = \max_{\{\theta:\;\alpha = g(\theta)\}} p(\mathbf{x};\theta)
• For each \alpha, find all \theta's that map to it. • Extract the largest value of p(\mathbf{x};\theta) over this set of \theta's. Then \hat\alpha_{ML} maximizes \bar{p}_T(\mathbf{x};\alpha).
Invariance Property of MLE Another Big
Advantage of MLE!
Theorem 7.2: Invariance Property of MLE
If parameter θ is mapped according to α = g(θ ) then the
MLE of α is given by
αˆ = g (θˆ)
where θˆ is the MLE for θ found by maximizing p(x;θ )
Note: when g(θ ) is not one-to-one the MLE for α maximizes
the modified likelihood function
“Proof”:
Easy to see when g(θ ) is one-to-one
2
Ex. 7.9: Estimate Power of DC Level in AWGN
x[n] = A + w[n] noise is N(0,σ2) & White
α = A2
Want to estimate the power: \alpha = A^2. The map is not one-to-one (A = \pm\sqrt\alpha), so for each \alpha value there are 2 PDFs to consider:
\bar{p}_{T_1}(\mathbf{x};\alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left[-\frac{1}{2\sigma^2}\sum_n\big(x[n]-\sqrt\alpha\big)^2\right]
\bar{p}_{T_2}(\mathbf{x};\alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left[-\frac{1}{2\sigma^2}\sum_n\big(x[n]+\sqrt\alpha\big)^2\right]
Then:
\hat\alpha_{ML} = \arg\max_{\alpha\ge 0}\left\{p(\mathbf{x};\sqrt\alpha),\; p(\mathbf{x};-\sqrt\alpha)\right\} = \left[\arg\max_{-\infty<A<\infty} p(\mathbf{x};A)\right]^2 = \left[\hat{A}_{ML}\right]^2
a demonstration that the invariance result holds for this example.
Ex. 7.10: Estimate Power of WGN in dB
x[n] = w[n] WGN w/ var = σ2 unknown
Recall: Pnoise = σ2
Can show that the MLE for the variance is: \hat{P}_{noise} = \frac{1}{N}\sum_{n=0}^{N-1} x^2[n]
By the invariance property, the MLE of the noise power in dB is 10\log_{10}\hat{P}_{noise}.
7.7: Numerical Determination of MLE
Note: In all previous examples we ended up with a closed-form expression for the MLE: \hat\theta_{ML} = f(\mathbf{x})
So…we can’t always find a closed-form MLE!
But a main advantage of MLE is:
We can always find it numerically!!!
(Not always computationally efficiently, though)
Iterative Methods for Numerical MLE
Step #1: Pick some “initial estimate” θˆ0
Step #2: Iteratively improve it using
θˆi +1 = f (θˆi , x ) such that lim p( x ;θ i ) = max p( x;θ )
i →∞ θ
“Hill Climbing in the Fog”
p(x;θ ) Note: A so-called “Greedy”
maximization algorithm will
always move up even
though taking an occasional
θ step downward may be the
θˆ0 θˆ1 θˆ2
better global strategy!
Convergence Issues:
1. May not converge
2. May converge, but to local maximum
- good initial guess is needed !!
- can use rough grid search to initialize
- can use multiple initializations 7
Iterative Method: Newton-Raphson MLE
The MLE is the maximum of the LF… so set derivative to 0:
∂ ln p ( x;θ ) So… MLE is a
=0
$!∂# θ!" zero of g(θ )
∆
= g (θ )
Newton-Raphson is a numerical method for finding the zero of a function… so it can be applied here. Linearize g(\theta) with a truncated Taylor series:
g(\theta) \approx g(\hat\theta_k) + \frac{dg(\theta)}{d\theta}\Big|_{\theta=\hat\theta_k}(\theta - \hat\theta_k)
Set this to 0 and solve for \hat\theta_{k+1}:
\hat\theta_{k+1} = \hat\theta_k - \frac{g(\hat\theta_k)}{\dfrac{dg(\theta)}{d\theta}\Big|_{\theta=\hat\theta_k}}
Now… using our "definition of convenience" g(\theta) = \frac{\partial\ln p(\mathbf{x};\theta)}{\partial\theta}, the Newton-Raphson MLE iteration is:
\hat\theta_{k+1} = \hat\theta_k - \left[\frac{\partial^2\ln p(\mathbf{x};\theta)}{\partial\theta^2}\right]^{-1}\frac{\partial\ln p(\mathbf{x};\theta)}{\partial\theta}\Bigg|_{\theta=\hat\theta_k}
Iterate until the convergence criterion is met: |\hat\theta_{k+1} - \hat\theta_k| < \varepsilon (you get to choose ε).
Look familiar??? Looks like I(\theta), except: I(\theta) is evaluated at the true \theta, and has an expected value.
Generally: for a given PDF model, compute the derivatives analytically… or compute the derivatives numerically:
\frac{\partial\ln p(\mathbf{x};\theta)}{\partial\theta}\Big|_{\hat\theta_k} \approx \frac{\ln p(\mathbf{x};\hat\theta_k + \Delta\theta) - \ln p(\mathbf{x};\hat\theta_k)}{\Delta\theta}
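A minimal MATLAB sketch of this iteration (not from the slides; the sinusoid-phase example and finite-difference derivatives are assumptions):

% Minimal sketch: Newton-Raphson on the LLF derivative, numeric derivatives
N = 512; fo = 0.12; A = 1; phi_true = 0.7; sigma = 1;
n = (0:N-1).';
x = A*cos(2*pi*fo*n + phi_true) + sigma*randn(N,1);
negJ = @(phi) -sum((x - A*cos(2*pi*fo*n + phi)).^2); % LLF up to constants
d = 1e-5; phi_hat = 0.5;                             % initial estimate
for k = 1:20
    g  = (negJ(phi_hat + d) - negJ(phi_hat - d)) / (2*d);            % 1st deriv
    gp = (negJ(phi_hat + d) - 2*negJ(phi_hat) + negJ(phi_hat - d)) / d^2; % 2nd
    step = g/gp;
    phi_hat = phi_hat - step;                        % Newton-Raphson update
    if abs(step) < 1e-8, break; end                  % convergence criterion
end
phi_hat                                              % ~ phi_true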
Convergence Issues of Newton-Raphson:
1. May not converge
2. May converge, but to local maximum
- good initial guess is needed !!
- can use rough grid search to initialize
- can use multiple initializations
[Figure: plot of \partial\ln p(\mathbf{x};\theta)/\partial\theta vs. \theta showing Newton-Raphson iterates \hat\theta_0, \hat\theta_1, \hat\theta_2, \hat\theta_3 that can jump between zeros.]
Derivative w.r.t. a vector:
\frac{\partial f(\boldsymbol\theta)}{\partial\boldsymbol\theta} = \begin{bmatrix} \partial f(\boldsymbol\theta)/\partial\theta_1 \\ \partial f(\boldsymbol\theta)/\partial\theta_2 \\ \vdots \\ \partial f(\boldsymbol\theta)/\partial\theta_p \end{bmatrix}
Ex. 7.12: Estimate DC Level and Variance
x[n] = A + w[n] noise is N(0,σ2) and white
A
Estimate: DC level A and noise variance \sigma^2 \;\Rightarrow\; \boldsymbol\theta = [A\;\sigma^2]^T
LF is: p(\mathbf{x};A,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right]
Solve \frac{\partial\ln p(\mathbf{x};\boldsymbol\theta)}{\partial\boldsymbol\theta} = \mathbf{0}:
\frac{\partial\ln p}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n]-A) = \frac{N}{\sigma^2}(\bar{x} - A) = 0
\frac{\partial\ln p}{\partial\sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{n=0}^{N-1}(x[n]-A)^2 = 0
\Rightarrow\; \hat{\boldsymbol\theta}_{ML} = \begin{bmatrix} \bar{x} \\ \frac{1}{N}\sum_n(x[n]-\bar{x})^2 \end{bmatrix}
Asymptotically: \hat{\boldsymbol\theta}_{ML} \overset{a}{\sim} N\big(\boldsymbol\theta,\, \mathbf{I}^{-1}(\boldsymbol\theta)\big)
Ex. 7.12 Revisited
It can be shown that:
E\{\hat{\boldsymbol\theta}\} = \begin{bmatrix} A \\ \dfrac{N-1}{N}\sigma^2 \end{bmatrix}, \qquad \mathrm{cov}\{\hat{\boldsymbol\theta}\} = \begin{bmatrix} \dfrac{\sigma^2}{N} & 0 \\ 0 & \dfrac{2(N-1)}{N^2}\sigma^4 \end{bmatrix}
For large N then:
E\{\hat{\boldsymbol\theta}\} \approx \begin{bmatrix} A \\ \sigma^2 \end{bmatrix} = \boldsymbol\theta, \qquad \mathrm{cov}\{\hat{\boldsymbol\theta}\} \approx \begin{bmatrix} \dfrac{\sigma^2}{N} & 0 \\ 0 & \dfrac{2\sigma^4}{N} \end{bmatrix} = \mathbf{I}^{-1}(\boldsymbol\theta)
which we see satisfies the asymptotic property. The diagonal covariance is why we could "decouple" the estimates.
MLE for the General Gaussian Case
Let the data be general Gaussian: x ~ N (µ(θ), C(θ))
Thus \partial\ln p(\mathbf{x};\boldsymbol\theta)/\partial\boldsymbol\theta will depend in general on \frac{\partial\boldsymbol\mu(\boldsymbol\theta)}{\partial\boldsymbol\theta} and \frac{\partial\mathbf{C}(\boldsymbol\theta)}{\partial\boldsymbol\theta}.
For each k = 1, 2, …, p set \partial\ln p(\mathbf{x};\boldsymbol\theta)/\partial\theta_k = 0. This gives p simultaneous equations, the kth one being:
-\tfrac{1}{2}\,\mathrm{tr}\!\left[\mathbf{C}^{-1}(\boldsymbol\theta)\frac{\partial\mathbf{C}(\boldsymbol\theta)}{\partial\theta_k}\right] + \frac{\partial\boldsymbol\mu(\boldsymbol\theta)^T}{\partial\theta_k}\mathbf{C}^{-1}(\boldsymbol\theta)[\mathbf{x}-\boldsymbol\mu(\boldsymbol\theta)] + \tfrac{1}{2}[\mathbf{x}-\boldsymbol\mu(\boldsymbol\theta)]^T\mathbf{C}^{-1}(\boldsymbol\theta)\frac{\partial\mathbf{C}(\boldsymbol\theta)}{\partial\theta_k}\mathbf{C}^{-1}(\boldsymbol\theta)[\mathbf{x}-\boldsymbol\mu(\boldsymbol\theta)] = 0
Note: for the "deterministic signal + noise" case the covariance-derivative terms are zero; for the linear model \boldsymbol\mu(\boldsymbol\theta) = \mathbf{H}\boldsymbol\theta, solving this gives:
\hat{\boldsymbol\theta}_{ML} = \left(\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}
Hey! Same as chapter 4's MVU for the linear model.
\hat{\boldsymbol\theta}_{ML} \sim N\big(\boldsymbol\theta,\, (\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}\big) — EXACT… not asymptotic!!
Numerical Solutions for Vector Case
Obvious generalizations… see p. 187
Get
Numerically
17
7.9 Asymptotic MLE
Useful when data samples x[n] come from a WSS process
18
7.10 MLE Examples
We’ll now apply the MLE theory to several examples of
practical signal processing problems.
These are the same examples for which we derived the CRLB
in Ch. 3
1. Range Estimation
– sonar, radar, robotics, emitter location
2. Sinusoidal Parameter Estimation (Amp., Frequency, Phase)
– sonar, radar, communication receivers (recall DSB Example), etc.
3. Bearing Estimation
We – sonar, radar, emitter location
Will
Cover 4. Autoregressive Parameter Estimation
– speech processing, econometrics
See Book
1
Ex. 1 Range Estimation Problem
Transmit Pulse: s(t) nonzero over t∈[0,Ts]
Receive Reflection: s(t – τo)
Measure Time Delay: τo
[Figure: bandlimited pulse s(t) nonzero over [0, T_s]; the receive chain (BPF & amp) outputs x(t) containing s(t − τ_o); the noise w(t) is white Gaussian with PSD N_o/2 over the band [−B, B].]
Range Estimation D-T Signal Model
After sampling, the noise has \sigma^2 = BN_o and
x[n] = \begin{cases} w[n], & 0 \le n \le n_o - 1 \\ s[n-n_o] + w[n], & n_o \le n \le n_o + M - 1 \\ w[n], & n_o + M \le n \le N - 1 \end{cases}
Range Estimation Likelihood Function
White and Gaussian ⇒ Independent ⇒ Product of PDFs
3 different PDFs – one for each subinterval
The three subintervals give (with C_N a constant independent of n_o):
p(\mathbf{x};n_o) = C_N\exp\!\left[-\frac{\sum_{n=0}^{N-1} x^2[n]}{2\sigma^2}\right]\cdot\exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=n_o}^{n_o+M-1}\big(-2x[n]s[n-n_o] + s^2[n-n_o]\big)\right]
The first factor does not depend on n_o; we must minimize the bracketed sum (or maximize its negative) over values of n_o.
Range Estimation ML Condition
So maximize this: 2\sum_{n=n_o}^{n_o+M-1} x[n]s[n-n_o] - \sum_{n=n_o}^{n_o+M-1} s^2[n-n_o]
The second sum is the signal energy, which does not depend on n_o. So maximize this:
\sum_{n=0}^{N-1} x[n]s[n-n_o]
i.e., find the peak of the cross-correlation C_{xs}[m] = \sum_{n=0}^{N-1} x[n]s[n-m].
Warning: when signals are complex (e.g., ELPS) find the peak of |C_{xs}[m]|.
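A minimal MATLAB sketch (not from the slides; the pulse shape and values are assumptions) of this correlate-and-peak delay estimator:

% Minimal sketch: ML delay estimate via cross-correlation peak
N = 400; M = 50; no_true = 123; sigma = 0.5;
s = sin(pi*(1:M)'/M).^2;              % assumed example pulse shape
x = sigma*randn(N,1);
x(no_true+1 : no_true+M) = x(no_true+1 : no_true+M) + s;
C = zeros(N-M+1, 1);
for m = 0:N-M                         % C_xs[m] = sum_n x[n] s[n-m]
    C(m+1) = x(m+1 : m+M).' * s;
end
[~, idx] = max(C);
no_hat = idx - 1                      % ~ no_true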
= J ( A, Ω o ,φ )
For MLE: Minimize This
7
Sinusoid Parameter Estimation ML Condition
To make things easier…
Define:
c(Ωo) = [1 cos(Ωo) cos(Ωo2) … cos(Ωo(N-1))]T
s(Ωo) = [0 sin(Ωo) sin(Ωo2) … sin(Ωo(N-1))]T
and…
H(Ωo) = [c(Ωo) s(Ωo)] an Nx2 matrix
8
Then: J'(\alpha_1, \alpha_2, \Omega_o) = [\mathbf{x} - \mathbf{H}(\Omega_o)\boldsymbol\alpha]^T[\mathbf{x} - \mathbf{H}(\Omega_o)\boldsymbol\alpha]
Looks like the linear model case… except for the \Omega_o dependence of \mathbf{H}(\Omega_o). For fixed \Omega_o:
\hat{\boldsymbol\alpha} = \left[\mathbf{H}^T(\Omega_o)\mathbf{H}(\Omega_o)\right]^{-1}\mathbf{H}^T(\Omega_o)\mathbf{x}
Then plug that into J'(\alpha_1, \alpha_2, \Omega_o):
J'(\hat\alpha_1, \hat\alpha_2, \Omega_o) = [\mathbf{x} - \mathbf{H}(\Omega_o)\hat{\boldsymbol\alpha}]^T[\mathbf{x} - \mathbf{H}(\Omega_o)\hat{\boldsymbol\alpha}] = \mathbf{x}^T\left[\mathbf{I} - \mathbf{H}(\Omega_o)\left[\mathbf{H}^T(\Omega_o)\mathbf{H}(\Omega_o)\right]^{-1}\mathbf{H}^T(\Omega_o)\right]\mathbf{x}
= \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}(\Omega_o)\left[\mathbf{H}^T(\Omega_o)\mathbf{H}(\Omega_o)\right]^{-1}\mathbf{H}^T(\Omega_o)\mathbf{x}
Minimizing over \Omega_o ⇔ maximizing the second term.
Sinusoid Parms. Exact MLE Procedure
Step 1: Maximize "this term" over \Omega_o to find \hat\Omega_o (could do numerically):
\hat\Omega_o = \arg\max_{0<\Omega_o<\pi} \mathbf{x}^T\mathbf{H}(\Omega_o)\left[\mathbf{H}^T(\Omega_o)\mathbf{H}(\Omega_o)\right]^{-1}\mathbf{H}^T(\Omega_o)\mathbf{x}
Step 2: Use the result of Step 1 to get \hat{\boldsymbol\alpha} = \left[\mathbf{H}^T(\hat\Omega_o)\mathbf{H}(\hat\Omega_o)\right]^{-1}\mathbf{H}^T(\hat\Omega_o)\mathbf{x}
Step 3: Convert the Step 2 result by solving \hat\alpha_1 = \hat{A}\cos(\hat\phi), \; \hat\alpha_2 = -\hat{A}\sin(\hat\phi) for \hat{A} and \hat\phi.
Sinusoid Parms. Approx. MLE Procedure
First we look at a specific structure:
\mathbf{x}^T\mathbf{H}\left[\mathbf{H}^T\mathbf{H}\right]^{-1}\mathbf{H}^T\mathbf{x} = \begin{bmatrix} \mathbf{c}^T(\Omega_o)\mathbf{x} \\ \mathbf{s}^T(\Omega_o)\mathbf{x} \end{bmatrix}^T \begin{bmatrix} \mathbf{c}^T(\Omega_o)\mathbf{c}(\Omega_o) & \mathbf{c}^T(\Omega_o)\mathbf{s}(\Omega_o) \\ \mathbf{s}^T(\Omega_o)\mathbf{c}(\Omega_o) & \mathbf{s}^T(\Omega_o)\mathbf{s}(\Omega_o) \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{c}^T(\Omega_o)\mathbf{x} \\ \mathbf{s}^T(\Omega_o)\mathbf{x} \end{bmatrix}
Then… if \Omega_o is not near 0 or \pi, approximately \mathbf{H}^T\mathbf{H} \approx \begin{bmatrix} N/2 & 0 \\ 0 & N/2 \end{bmatrix}, and Step 1 becomes
\hat\Omega_o = \arg\max_{0<\Omega<\pi} \frac{2}{N}\left|\sum_{n=0}^{N-1} x[n]e^{-j\Omega n}\right|^2 = \arg\max_{0<\Omega<\pi} |X(\Omega)|^2
\hat\phi = \angle X(\hat\Omega_o)
The processing is implemented as follows. Given the data x[n], n = 0, 1, 2, …, N-1:
1. Compute the DFT X[m], m = 0, 1, 2, …, M-1 of the data.
• Zero-pad to length M = 4N to ensure a dense grid of frequency points.
• Use the FFT algorithm for computational efficiency.
2. Find the peak location \hat\Omega_o of |X[m]|^2 over 0 < \Omega < \pi, and read off the phase there.
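A minimal MATLAB sketch of this zero-padded FFT search (not from the slides; the signal parameters are assumptions):

% Minimal sketch: approximate sinusoid MLE via zero-padded FFT peak
N = 256; fo = 0.11; A = 1; phi = 0.7; sigma = 1;
n = (0:N-1).';
x = A*cos(2*pi*fo*n + phi) + sigma*randn(N,1);
M = 4*N;                               % zero-pad for a dense grid
X = fft(x, M);
[~, idx] = max(abs(X(2:M/2)).^2);      % search 0 < Omega < pi
Om_hat  = 2*pi*idx/M                   % ~ 2*pi*fo
phi_hat = angle(X(idx+1))              % ~ phi
A_hat   = 2*abs(X(idx+1))/N            % ~ A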
Ex. 3 Bearing Estimation MLE
Figure 3.8 Emits or reflects
from textbook: signal s(t)
s(t) = A\cos(2\pi f_o t + \phi) \quad\text{(simple model)}
x[n] = s_n(t_s) + w[n] = A\cos(\Omega_s n + \tilde\phi) + w[n]
1/42
TDOA/FDOA LOCATION
[Figure: TDOA/FDOA geometry. Three receivers collect s(t - t_i)e^{j\omega_i t} from the emitted signal s(t) and share data over data links. FDOA (Frequency-Difference-Of-Arrival): \nu_{21} = \omega_2 - \omega_1 = const, \nu_{23} = \omega_2 - \omega_3 = const. TDOA (Time-Difference-Of-Arrival): \tau_{21} = t_2 - t_1 = const, \tau_{23} = t_2 - t_3 = const.]
Next 2/42
Classical TDOA/FDOA Emitter Location:
[Figure: receiver pairs cross-correlate their signals r_1(t) via CAF_{12}(\tau,\omega) and CAF_{13}(\tau,\omega) to produce (TDOA_{12}, FDOA_{12}) and (TDOA_{13}, FDOA_{13}).]
Stage 1: Estimating TDOA/FDOA
Next 4/42
SIGNAL MODEL
Will process the equivalent lowpass (LPE) signal, BW = B Hz, representing an RF signal with RF BW = B Hz, sampled at F_s > B complex samples/sec; collection time T sec.
At each receiver: [Figure: X_{RF}(f) → BPF → ADC → Make LPE Signal → Equalize → X_{LPE}(f).]
DOPPLER & DELAY MODEL
Propagation time: \tau(t) = R(t)/c with R(t) = R_o + vt + (a/2)t^2 + \cdots
Use the linear approximation — assumes a small change in velocity over the observation interval.
Analytic signal of Tx: \tilde{s}(t) = E(t)e^{j[\omega_c t + \phi(t)]}
Analytic signal of Rx: \tilde{s}_r(t) = \tilde{s}([1-v/c]t - \tau_d) = E([1-v/c]t - \tau_d)\,e^{j\{\omega_c([1-v/c]t - \tau_d) + \phi([1-v/c]t - \tau_d)\}}
Now what? Notice that v << c, so (1 − v/c) ≈ 1. Say v = –300 m/s (–670 mph); then v/c = –300/3×10^8 = –10^{-6} and (1 – v/c) = 1.000001. Now assume E(t) & \phi(t) vary slowly enough that E([1-v/c]t) \approx E(t) and \phi([1-v/c]t) \approx \phi(t) for the range of v of interest. This is called the Narrowband Approximation.
DOPPLER & DELAY MODEL (continued)
Narrowband analytic signal model:
\tilde{s}_r(t) = E(t-\tau_d)\,e^{j\{\omega_c t - \omega_c(v/c)t - \omega_c\tau_d + \phi(t-\tau_d)\}} = \underbrace{e^{-j\omega_c\tau_d}}_{\text{constant phase}}\;\underbrace{e^{-j\omega_c(v/c)t}}_{\text{Doppler}}\;\underbrace{e^{j\omega_c t}}_{\text{carrier}}\;\underbrace{E(t-\tau_d)e^{j\phi(t-\tau_d)}}_{\text{LPE signal, time-shifted by }\tau_d}
with phase \alpha = -\omega_c\tau_d and Doppler \omega_d = \omega_c v/c.
“Hz” 2 2
1 t
σ FDOA ≥ 2
Trms =
2π 2 T rms BT × SNR eff
∫
2
s (t ) dt
Problem with Stein’s CRLBs M. Fowler X. Hu, “Signal Models for TDOA/FDOA
Estimation,” IEEE T. AES, Oct. 2008.
Stein’s paper does not derive these CRLB results… rather they are just
stated.
There is no mention of what signal model is assumed….
And, it turns out that matters very much!!!
Next 9/42
TDOA/FDOA CRLB History Lesson
Next 10/42
M. Fowler X. Hu, “Signal Models
Signals: Sonar vs. RF for TDOA/FDOA Estimation,”
IEEE T. AES, Oct. 2008.
r_1[n] = e^{j\phi}s(nT - \tau_1)e^{j\nu_1 nT} + w_1[n]
r_2[n] = s(nT) + w_2[n]
\mathbf{r} = \begin{bmatrix} \mathbf{r}_1 \\ \mathbf{r}_2 \end{bmatrix}
Noise model: zero-mean WSS processes, Gaussian, independent of each other.
This much is the same for each case…
At least when the narrowband approximation can be used…
which we assume here so we can focus on the impact of
differences in the statistical model.
Next 11/42
M. Fowler X. Hu, “Signal Models
Signal Models: Sonar vs. RF for TDOA/FDOA Estimation,”
IEEE T. AES, Oct. 2008.
Next 12/42
M. Fowler X. Hu, “Signal Models
PDFs: Sonar vs. RF for TDOA/FDOA Estimation,”
IEEE T. AES, Oct. 2008.
Next 15/42
A. Yeredor & E. Angel
Correct CRLB for RF Signals
r_1[n] = s(nT) + w_1[n]
r_2[n] = ae^{j\phi}\underbrace{s(nT-\tau)}_{s_\tau[n]}e^{j\nu nT} + w_2[n], \qquad -\frac{N}{2} \le n \le \frac{N}{2}-1
Signal model: deterministic; complex baseband; s[n] itself is UN-known — must estimate! Noise model: zero-mean WSS processes, white (can generalize to colored noise), Gaussian, independent of each other, complex baseband.
Define:
\mathbf{s} \triangleq \left[s\!\left(-\tfrac{N}{2}\right)\; s\!\left(-\tfrac{N}{2}+1\right)\;\cdots\; s\!\left(\tfrac{N}{2}-1\right)\right]^T, \qquad \mathbf{s}_\tau \triangleq \left[s_\tau\!\left(-\tfrac{N}{2}\right)\; s_\tau\!\left(-\tfrac{N}{2}+1\right)\;\cdots\; s_\tau\!\left(\tfrac{N}{2}-1\right)\right]^T
A. Yeredor & E. Angel
Correct CRLB for RF Signals
Now using a property of the DFT: \mathbf{s}_\tau = \mathbf{F}^H\mathbf{D}_\tau\mathbf{F}\mathbf{s} (pad zeros to account for the DFT's circular nature), where
F is the (unitary) DFT matrix: \mathbf{F} = \frac{1}{\sqrt{N}}\exp\!\left(-j\frac{2\pi}{N}\mathbf{n}\mathbf{n}^T\right), \qquad \mathbf{n} = \left[-\tfrac{N}{2},\ldots,\tfrac{N}{2}-1\right]^T
\mathbf{D}_\tau is the "delay" matrix: \mathbf{D}_\tau = \mathrm{diag}\!\left\{\exp\!\left(-j\frac{2\pi}{N}n\tau\right)\right\} (\tau in "samples"; models delay)
\mathbf{D}_\nu is the "Doppler" matrix: \mathbf{D}_\nu = \mathrm{diag}\{\exp(-j\,n\nu)\} (\nu in "rad/sample", (-\pi,\pi); models Doppler)
Then get:
\mathbf{r}_1 = \mathbf{s} + \mathbf{v}_1, \qquad \mathbf{r}_2 = ae^{j\phi}\mathbf{D}_\nu\mathbf{F}^H\mathbf{D}_\tau\mathbf{F}\mathbf{s} + \mathbf{v}_2
A. Yeredor & E. Angel
Correct CRLB for RF Signals
Recall: must treat s as unknown! Write \mathbf{Q}_{\tau,\nu} = \mathbf{D}_\nu\mathbf{F}^H\mathbf{D}_\tau\mathbf{F}. Then
\boldsymbol\mu_\theta = E\{\mathbf{r}\} = \begin{bmatrix} \mathbf{s} \\ ae^{j\phi}\mathbf{Q}_{\tau,\nu}\mathbf{s} \end{bmatrix}, \qquad \mathbf{C}_\theta = \mathrm{cov}\{\mathbf{r}\} = \boldsymbol\Lambda = \begin{bmatrix} \sigma_1^2\mathbf{I} & \mathbf{0} \\ \mathbf{0} & \sigma_2^2\mathbf{I} \end{bmatrix} \quad\text{(no }\boldsymbol\theta\text{ dependence! Easy inversion!)}
General Gaussian FIM elements:
[\mathbf{J}_\theta]_{ij} = 2\,\mathrm{Re}\!\left[\frac{\partial\boldsymbol\mu_\theta^H}{\partial\theta_i}\mathbf{C}_\theta^{-1}\frac{\partial\boldsymbol\mu_\theta}{\partial\theta_j}\right] + \underbrace{\mathrm{tr}\!\left[\mathbf{C}_\theta^{-1}\frac{\partial\mathbf{C}_\theta}{\partial\theta_i}\mathbf{C}_\theta^{-1}\frac{\partial\mathbf{C}_\theta}{\partial\theta_j}\right]}_{\text{this term is zero!}}
A. Yeredor & E. Angel
Correct CRLB for RF Signals
With \boldsymbol\theta = [\mathrm{Re}\{\mathbf{s}\}\;\mathrm{Im}\{\mathbf{s}\}\;a\;\phi\;\tau\;\nu] and \boldsymbol\gamma = [a\;\phi\;\tau\;\nu]:
\mathbf{J}_\theta = 2\,\mathrm{Re}\!\left[\frac{\partial\boldsymbol\mu_\theta^H}{\partial\boldsymbol\theta}\boldsymbol\Lambda^{-1}\frac{\partial\boldsymbol\mu_\theta}{\partial\boldsymbol\theta}\right] = \frac{2}{\sigma_1^2}\begin{bmatrix} (1+\eta a^2)\mathbf{I} & \mathbf{0} & \eta a\,\mathrm{Re}\{\mathbf{B}\} \\ \mathbf{0} & (1+\eta a^2)\mathbf{I} & \eta a\,\mathrm{Im}\{\mathbf{B}\} \\ \eta a\,\mathrm{Re}\{\mathbf{B}^H\} & -\eta a\,\mathrm{Im}\{\mathbf{B}^H\} & \eta\,\mathrm{Re}\{\mathbf{G}^H\mathbf{G}\} \end{bmatrix}
where \eta \triangleq \frac{\sigma_1^2}{\sigma_2^2}, \qquad \mathbf{G} \triangleq \frac{\partial\,ae^{j\phi}\mathbf{Q}_{\tau,\nu}\mathbf{s}}{\partial\boldsymbol\gamma}, \qquad \mathbf{B} \triangleq e^{-j\phi}\mathbf{Q}_{\tau,\nu}^H\mathbf{G}
Now we could get the CRLB matrix for the full parameter vector, \mathrm{CRLB}_\theta = \mathbf{J}_\theta^{-1}, but we really only want it w.r.t. \boldsymbol\gamma.
A. Yeredor & E. Angel
Correct CRLB for RF Signals
Partition: \mathrm{CRLB}_\theta = \begin{bmatrix} \mathbf{J}^{-1}_{\mathrm{Re}\{s\},\mathrm{Im}\{s\}} & ?? \\ ?? & \mathbf{J}_\gamma^{-1} \end{bmatrix}; \quad\text{define } \mathrm{CRLB}_\gamma = \mathbf{J}_\gamma^{-1}
A. Yeredor & E. Angel
Correct CRLB for RF Signals
The final result for the FIM of interest (after eliminating the nuisance signal parameters) is:
\mathbf{J}_{\phi,\tau,\nu} = \begin{bmatrix} \mathbf{s}^H\mathbf{s} & -\mathbf{s}^H\mathbf{s}' & \mathbf{s}^H\mathbf{N}\mathbf{s} \\ -\mathbf{s}'^H\mathbf{s} & \mathbf{s}'^H\mathbf{s}' & -\mathrm{Re}\{\mathbf{s}'^H\mathbf{Q}_{\tau,\nu}^H\mathbf{N}\mathbf{s}\} \\ \mathbf{s}^H\mathbf{N}\mathbf{s} & -\mathrm{Re}\{\mathbf{s}'^H\mathbf{Q}_{\tau,\nu}^H\mathbf{N}\mathbf{s}\} & \mathbf{s}^H\mathbf{N}^2\mathbf{s} \end{bmatrix}
where
\mathbf{s}' = \frac{2\pi}{N}\mathbf{F}^H\mathbf{N}\mathbf{F}\mathbf{s}, \qquad \mathbf{N} = \mathrm{diag}\{\mathbf{n}\}, \qquad \mathbf{n} = \left[-\tfrac{N}{2},\ldots,\tfrac{N}{2}-1\right]^T, \qquad \mathbf{Q}_{\tau,\nu} = \mathbf{D}_\nu\mathbf{F}^H\mathbf{D}_\tau\mathbf{F}
So… use all these boxes to compute J, then invert it to get the CRLB!
A. Yeredor & E. Angel
Correct CRLB for RF Signals
We can interpret some of these FIM terms:
\mathbf{s}^H\mathbf{s} = \sum_n |s[n]|^2 \quad\text{(signal energy)}
\mathbf{s}'^H\mathbf{s}' = \left(\frac{2\pi}{N}\right)^2(\mathbf{F}\mathbf{s})^H\mathbf{N}^2(\mathbf{F}\mathbf{s}) \quad\text{(an RMS-bandwidth-type term)}
\mathbf{s}^H\mathbf{N}^2\mathbf{s} \quad\text{(an RMS-duration-type term)}
i.e., the discrete counterparts of
B_{rms}^2 = \frac{\int f^2|S(f)|^2\,df}{\int |S(f)|^2\,df}, \qquad T_{rms}^2 = \frac{\int t^2|s(t)|^2\,dt}{\int |s(t)|^2\,dt}
A. Yeredor & E. Angel
Correct CRLB for RF Signals
[Figure: CRLB comparison curves — the CRB for the "specific" (deterministic) model, CRB_FH (Fowler-Hu), and the Wax CRB for the WSS Gaussian model.]
M. Fowler X. Hu, “Signal Models
MLE: Sonar vs. Radar/Comm for TDOA/FDOA Estimation,”
IEEE T. AES, Oct. 2008.
Setting the general Gaussian LLF derivative to zero:
-\mathrm{tr}\!\left[\mathbf{C}_\theta^{-1}\frac{\partial\mathbf{C}_\theta}{\partial\theta_i}\right] + [\mathbf{r}-\boldsymbol\mu_\theta]^H\mathbf{C}_\theta^{-1}\frac{\partial\mathbf{C}_\theta}{\partial\theta_i}\mathbf{C}_\theta^{-1}[\mathbf{r}-\boldsymbol\mu_\theta] + 2\,\mathrm{Re}\!\left[[\mathbf{r}-\boldsymbol\mu_\theta]^H\mathbf{C}_\theta^{-1}\frac{\partial\boldsymbol\mu_\theta}{\partial\theta_i}\right] = 0
Sonar (random-signal) case ⇒ \hat{\boldsymbol\theta}_{ML,ac} = \arg\max_\theta\{-\mathbf{r}^H\mathbf{C}_\theta^{-1}\mathbf{r}\}; RF (deterministic-signal) case ⇒ \hat{\boldsymbol\theta}_{ML,em} = \arg\max_\theta\{2\,\mathrm{Re}\{\mathbf{r}^H\mathbf{C}^{-1}\mathbf{s}_\theta\} - \mathbf{s}_\theta^H\mathbf{C}^{-1}\mathbf{s}_\theta\}
[Figure: receiver filters H(f), cross-correlate, find peak.]
S. Stein, “Differential Delay/Doppler
ML Estimator for TDOA/FDOA ML Estimation with Unknown
Signals,” IEEE Trans. on SP, 1993.
Next 26/42
S. Stein, "Differential Delay/Doppler ML Estimation with Unknown Signals," IEEE Trans. on SP, 1993 — DFT view:
[Equations lost in extraction.] The likelihood to minimize splits into three terms, with the data X appearing only in one of them. So…
• the first term of L1 is not needed
• the second term of L1 led to the signal estimate
• the third term of L1 … look at now!
S. Stein, “Differential Delay/Doppler
ML Estimator for TDOA/FDOA ML Estimation with Unknown
Signals,” IEEE Trans. on SP, 1993.
"Compare" the LPE signals s_1(t), s_2(t) at the two receivers for all delays \tau and Dopplers \omega, and find the peak of |A(\omega,\tau)|, the Ambiguity Function:
A(\omega,\tau) = \int_0^T s_1(t)\,s_2(t+\tau)\,e^{-j\omega t}\,dt
The peak occurs at (\omega_d, \tau_d).
COMPUTING THE AMBIGUITY FUNCTION

Direct computation based on the equation for the ambiguity function leads to computationally inefficient methods; efficient implementations exploit the FFT.
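One common efficient route is to fix each candidate delay and let a single FFT sweep all Doppler bins at once. A hedged sketch on sampled data (circular shifts stand in for true delays, which is adequate for illustration):

```python
import numpy as np

def cross_ambiguity(s1, s2, max_lag):
    """FFT-based cross-ambiguity on a sampled grid: one FFT per candidate delay."""
    Nt = len(s1)
    lags = np.arange(-max_lag, max_lag + 1)
    A = np.zeros((Nt, len(lags)), dtype=complex)
    for k, tau in enumerate(lags):
        prod = s1 * np.conj(np.roll(s2, -tau))   # s1(t) * conj(s2(t + tau))
        A[:, k] = np.fft.fft(prod)               # A(omega, tau) over all FFT bins
    return A, lags

# Peak of |A| gives the (Doppler bin, delay) estimate:
# A, lags = cross_ambiguity(s1, s2, max_lag=20)
# w_bin, t_idx = np.unravel_index(np.argmax(np.abs(A)), A.shape)
```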
Stage 2: Estimating Geo-Location
TDOA/FDOA LOCATION

Centralized network of P receivers:
- "P-choose-2" pairs, giving "P-choose-2" TDOA measurements and "P-choose-2" FDOA measurements
- Warning: watch out for the correlation effect due to signal data in common across pairs.

[Figure: P platforms connected to a central node by data links.]
TDOA/FDOA LOCATION

Pair-wise network of P receivers:
- P/2 pairs, giving P/2 TDOA measurements and P/2 FDOA measurements
- Many ways to select the P/2 pairs
- Warning: not all pairings are equally good! (In the figure, the dashed pairs are better.)
TDOA/FDOA Measurement Model

Given N TDOA/FDOA measurement pairs with corresponding 2×2 covariance matrices:

$$(\hat\tau_1,\hat\nu_1),\,(\hat\tau_2,\hat\nu_2),\,\ldots,\,(\hat\tau_N,\hat\nu_N) \qquad \mathbf{C}_1,\,\mathbf{C}_2,\,\ldots,\,\mathbf{C}_N$$

(Assume a pair-wise network, so the TDOA/FDOA pairs are uncorrelated.)

Those are the TDOA/FDOA estimates; the true values are denoted $(\tau_1,\nu_1),\,(\tau_2,\nu_2),\,\ldots,\,(\tau_N,\nu_N)$ and stacked into a "signal" vector

$$\mathbf{s} = [s_1\ s_2\ \cdots\ s_{2N}]^T \quad\text{with}\quad s_{2n-1} = \tau_n,\ \ s_{2n} = \nu_n,\quad n = 1,2,\ldots,N$$
TDOA/FDOA Measurement Model (cont.)

Each of these measurements has an error associated with it, so $\mathbf{r} = \mathbf{s} + \boldsymbol\varepsilon$. Because the measurements were estimated using an ML estimator (with a sufficiently large number of signal samples), we know the error vector ε is a zero-mean Gaussian vector with covariance matrix

$$\mathbf{C} = \operatorname{diag}\{\mathbf{C}_1,\mathbf{C}_2,\ldots,\mathbf{C}_N\} = \begin{bmatrix}\mathbf{C}_1 & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathbf{C}_2 & & \vdots\\ \vdots & & \ddots & \mathbf{0}\\ \mathbf{0} & \cdots & \mathbf{0} & \mathbf{C}_N\end{bmatrix}$$

(block diagonal; again assumes the TDOA/FDOA pairs are uncorrelated!)

The true TDOA/FDOA values depend on:
- Emitter parameters (position and transmit frequency): $\mathbf{x}_e = [x_e\ y_e\ z_e\ f_e]^T$
- The receivers' nav data (positions & velocities), the totality of which is called $\mathbf{x}_r$

To complete the model we need to know how $\mathbf{s}(\mathbf{x}_e;\mathbf{x}_r)$ depends on $\mathbf{x}_e$ and $\mathbf{x}_r$. Thus we need to find TDOA & FDOA as functions of $\mathbf{x}_e$ and $\mathbf{x}_r$.
TDOA/FDOA Measurement Model (cont.)

Here we'll simplify to the x-y plane; the extension is straightforward. Two receivers with $(x_1, y_1, V_{x1}, V_{y1})$ and $(x_2, y_2, V_{x2}, V_{y2})$; emitter at $(x_e, y_e)$. (Let $R_i$ be the range between receiver i and the emitter; c is the speed of light.)

TDOA:

$$s_1(x_e,y_e) = \tau_{12} = \frac{R_1 - R_2}{c} = \frac{1}{c}\left[\sqrt{(x_1-x_e)^2+(y_1-y_e)^2} - \sqrt{(x_2-x_e)^2+(y_2-y_e)^2}\right]$$

FDOA:

$$s_2(x_e,y_e,f_e) = \nu_{12} = \frac{f_e}{c}\frac{d}{dt}(R_1-R_2) = \frac{f_e}{c}\left[\frac{(x_1-x_e)V_{x1}+(y_1-y_e)V_{y1}}{\sqrt{(x_1-x_e)^2+(y_1-y_e)^2}} - \frac{(x_2-x_e)V_{x2}+(y_2-y_e)V_{y2}}{\sqrt{(x_2-x_e)^2+(y_2-y_e)^2}}\right]$$
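A direct transcription of these two formulas in Python (the receiver-tuple convention (x, y, Vx, Vy) is mine, chosen for this sketch):

```python
import numpy as np

C_LIGHT = 3.0e8  # speed of light, m/s

def tdoa_fdoa(xe, ye, fe, rx1, rx2):
    """TDOA (s) and FDOA (Hz) for an emitter at (xe, ye) with carrier fe,
    given receiver tuples rx = (x, y, Vx, Vy). Follows the x-y plane model above."""
    x1, y1, vx1, vy1 = rx1
    x2, y2, vx2, vy2 = rx2
    R1 = np.hypot(x1 - xe, y1 - ye)              # range: receiver 1 to emitter
    R2 = np.hypot(x2 - xe, y2 - ye)              # range: receiver 2 to emitter
    tau = (R1 - R2) / C_LIGHT                    # TDOA
    nu = (fe / C_LIGHT) * (
        ((x1 - xe) * vx1 + (y1 - ye) * vy1) / R1      # LOS-projected velocity, Rx 1
        - ((x2 - xe) * vx2 + (y2 - ye) * vy2) / R2)   # LOS-projected velocity, Rx 2
    return tau, nu
```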
CRLB for Geo-Location via TDOA/FDOA

Recall: for the general Gaussian data case the CRLB depends on a FIM with structure like this:

$$[\mathbf{J}(\boldsymbol\theta)]_{nm} = \left[\frac{\partial\boldsymbol\mu_x(\boldsymbol\theta)}{\partial\theta_n}\right]^T\mathbf{C}_x^{-1}(\boldsymbol\theta)\,\frac{\partial\boldsymbol\mu_x(\boldsymbol\theta)}{\partial\theta_m} + \frac{1}{2}\operatorname{tr}\left[\mathbf{C}_x^{-1}(\boldsymbol\theta)\frac{\partial\mathbf{C}_x(\boldsymbol\theta)}{\partial\theta_n}\,\mathbf{C}_x^{-1}(\boldsymbol\theta)\frac{\partial\mathbf{C}_x(\boldsymbol\theta)}{\partial\theta_m}\right]$$

(first term: variability of the mean w.r.t. the parameters; second term: variability of the covariance w.r.t. the parameters)

Here we have a deterministic "signal" plus Gaussian noise, so we only have the first term. Using the notation introduced here gives

$$\mathbf{C}_{CRLB}(\mathbf{x}_e) = \left[\frac{\partial\mathbf{s}^T(\mathbf{x}_e)}{\partial\mathbf{x}_e}\,\mathbf{C}^{-1}\,\frac{\partial\mathbf{s}(\mathbf{x}_e)}{\partial\mathbf{x}_e}\right]^{-1}$$

Defining the Jacobian $\mathbf{H} = \left[\dfrac{\partial\mathbf{s}}{\partial x_e}\ \ \dfrac{\partial\mathbf{s}}{\partial y_e}\ \ \dfrac{\partial\mathbf{s}}{\partial z_e}\ \ \dfrac{\partial\mathbf{s}}{\partial f_e}\right]$ gives

$$\mathbf{C}_{CRLB}(\mathbf{x}_e) = \left[\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}\right]^{-1}$$
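A sketch that builds H by finite differences and evaluates $[\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H}]^{-1}$, reusing the tdoa_fdoa function above. The parameterization [xe, ye, fe] is an assumption matching the simplified x-y-plane model (no $z_e$ there):

```python
import numpy as np

def crlb_geo(xe_vec, rx_pairs, C):
    """CRLB = (H^T C^{-1} H)^{-1} with a finite-difference Jacobian.
    xe_vec = [xe, ye, fe]; rx_pairs is a list of (rx1, rx2) receiver tuples;
    C is the block-diagonal covariance of all stacked TDOA/FDOA measurements."""
    def s_of(p):
        out = []
        for rx1, rx2 in rx_pairs:                 # one (TDOA, FDOA) per pair
            out.extend(tdoa_fdoa(p[0], p[1], p[2], rx1, rx2))
        return np.array(out)
    eps = 1e-4
    s0 = s_of(xe_vec)
    H = np.column_stack([(s_of(xe_vec + eps * np.eye(3)[i]) - s0) / eps
                         for i in range(3)])      # Jacobian: one column per parameter
    Cinv = np.linalg.inv(C)
    return np.linalg.inv(H.T @ Cinv @ H)          # CRLB covariance matrix
```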
CRLB for Geo-Loc. via TDOA/FDOA (cont.)

[Figure: geometry and TDOA vs. FDOA trade-offs.]
Estimator for Geo-Location via TDOA/FDOA

Because we have used the ML estimator to get the TDOA/FDOA estimates, the ML's asymptotic properties tell us that we have Gaussian TDOA/FDOA measurements.

Because the TDOA/FDOA measurement model is nonlinear, it is unlikely that we can find a truly optimal estimate… so we again resort to the ML. For the ML of a nonlinear signal in Gaussian noise we generally have to proceed numerically. One way to do the numerical MLE is Newton-Raphson (vector version needed):

$$\hat{\boldsymbol\theta}_{k+1} = \hat{\boldsymbol\theta}_k - \left[\frac{\partial^2\ln p(\mathbf{x};\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta^T}\right]^{-1}\left.\frac{\partial\ln p(\mathbf{x};\boldsymbol\theta)}{\partial\boldsymbol\theta}\right|_{\boldsymbol\theta=\hat{\boldsymbol\theta}_k}$$
8.3 The Least-Squares (LS) Approach

All the previous methods we've studied required a probabilistic model for the data: we needed the PDF p(x;θ). The LS approach does not.

[Figure, similar to Fig. 8.1(a): a signal model s[n;θ] plus a model-error input gives the true signal s_true[n;θ]; adding measurement noise w[n] gives the data x[n] = s_true[n;θ] + w[n] = s[n;θ] + e[n], where e[n] lumps together model error and measurement error.]
Least-Squares Criterion

[Figure: the data x[n] minus the signal-model output s[n;θ̂] gives the error ε[n].]

Minimize the LS cost:

$$J(\boldsymbol\theta) = \sum_{n=0}^{N-1}\varepsilon^2[n] = \sum_{n=0}^{N-1}\left(x[n]-s[n;\boldsymbol\theta]\right)^2$$

Say you know x[10] was poor in quality compared to the other data. You'd want to de-emphasize its importance in the sum of squares, which leads to the weighted LS cost:

$$J(\boldsymbol\theta) = \sum_{n=0}^{N-1}w_n\left(x[n]-s[n;\boldsymbol\theta]\right)^2$$
8.4 Linear Least-Squares

A linear least-squares problem is one where the parameter observation model is linear: $\mathbf{s} = \mathbf{H}\boldsymbol\theta$, so $\mathbf{x} = \mathbf{H}\boldsymbol\theta + \mathbf{e}$, with x an N×1 data vector, θ a p×1 parameter vector (p = order of the model), and H a known N×p matrix. The LS cost becomes

$$J(\boldsymbol\theta) = \sum_{n=0}^{N-1}\left(x[n]-s[n;\boldsymbol\theta]\right)^2 = (\mathbf{x}-\mathbf{H}\boldsymbol\theta)^T(\mathbf{x}-\mathbf{H}\boldsymbol\theta)$$

Now, to minimize, first expand:

$$J(\boldsymbol\theta) = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}\boldsymbol\theta - \boldsymbol\theta^T\mathbf{H}^T\mathbf{x} + \boldsymbol\theta^T\mathbf{H}^T\mathbf{H}\boldsymbol\theta = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{H}\boldsymbol\theta + \boldsymbol\theta^T\mathbf{H}^T\mathbf{H}\boldsymbol\theta$$

(using that a scalar equals its transpose, so $\boldsymbol\theta^T\mathbf{H}^T\mathbf{x} = (\boldsymbol\theta^T\mathbf{H}^T\mathbf{x})^T = \mathbf{x}^T\mathbf{H}\boldsymbol\theta$)

Now setting $\partial J(\boldsymbol\theta)/\partial\boldsymbol\theta = \mathbf{0}$ gives $-2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\hat{\boldsymbol\theta} = \mathbf{0}$, called the "LS normal equations":

$$\mathbf{H}^T\mathbf{H}\hat{\boldsymbol\theta} = \mathbf{H}^T\mathbf{x} \qquad\Rightarrow\qquad \hat{\boldsymbol\theta}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}, \qquad \hat{\mathbf{s}}_{LS} = \mathbf{H}\hat{\boldsymbol\theta}_{LS} = \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$$
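In numpy this is a few lines; the line-fit model and numbers below are illustrative:

```python
import numpy as np

# Illustrative line fit s[n] = theta0 + theta1*n posed as x = H theta + e.
rng = np.random.default_rng(1)
N = 50
n = np.arange(N)
H = np.column_stack([np.ones(N), n])              # known N x p matrix, p = 2
x = 1.0 + 0.5 * n + 0.3 * rng.standard_normal(N)  # synthetic data (true theta = [1, 0.5])

theta_ls = np.linalg.solve(H.T @ H, H.T @ x)      # solve the normal equations
theta_ls2, *_ = np.linalg.lstsq(H, x, rcond=None) # numerically preferred route
```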
Comparing the Linear LSE to Other Estimates

Model                                          Estimate
x = Hθ + e  (no probability model needed)      θ̂_LS  = (HᵀH)⁻¹Hᵀx
x = Hθ + w  (PDF unknown, white)               θ̂_BLUE = (HᵀH)⁻¹Hᵀx
x = Hθ + w  (PDF Gaussian, white)              θ̂_ML  = (HᵀH)⁻¹Hᵀx
x = Hθ + w  (PDF Gaussian, white)              θ̂_MVU = (HᵀH)⁻¹Hᵀx

If you assume Gaussian and apply these estimators, BUT you are WRONG… you at least get the LSE!
The LS Cost for Linear LS

For the linear LS problem, what is the resulting LS cost for using $\hat{\boldsymbol\theta}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$?

$$J_{min} = (\mathbf{x}-\mathbf{H}\hat{\boldsymbol\theta}_{LS})^T(\mathbf{x}-\mathbf{H}\hat{\boldsymbol\theta}_{LS}) = \left(\mathbf{x}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}\right)^T\left(\mathbf{x}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}\right)$$

Using properties of the transpose and factoring out the x's:

$$J_{min} = \mathbf{x}^T\left[\mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\right]\left[\mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\right]\mathbf{x}$$

Easily verified: $\mathbf{A} = \mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$ satisfies $\mathbf{A}\mathbf{A} = \mathbf{A}$. (Note: if AA = A, then A is called idempotent.) So

$$J_{min} = \mathbf{x}^T\left[\mathbf{I}-\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\right]\mathbf{x} = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$$

with $0 \le J_{min} \le \|\mathbf{x}\|^2$.
Weighted LS for Linear LS

Recall: de-emphasize bad samples' importance in the sum of squares:

$$J(\boldsymbol\theta) = \sum_{n=0}^{N-1}w_n\left(x[n]-s[n;\boldsymbol\theta]\right)^2$$

For the linear model this gives (with W the diagonal matrix of weights):

$$\hat{\boldsymbol\theta}_{WLS} = (\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{W}\mathbf{x} \qquad J_{min} = \mathbf{x}^T\left[\mathbf{W}-\mathbf{W}\mathbf{H}(\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{W}\right]\mathbf{x}$$

[Figure: θ lives in $\mathbb{R}^p$, the data in $\mathbb{R}^N$ with N > p; Range(H) ⊂ $\mathbb{R}^N$. The signal s lies in a subspace of $\mathbb{R}^N$, while x can lie anywhere in $\mathbb{R}^N$.]
LS Geometry Example: N = 3, p = 2

(Notation a bit different from the book.) Here x = s + e: the "noise" takes s out of Range(H) and into $\mathbb{R}^3$. The LS error is ε = x − ŝ, and ε ⊥ h_i.

[Figure: the columns of H lie in a plane, the "subspace" $S^2$ spanned by the columns of H ($S^p$ in general); $\hat{\mathbf{s}} = \theta_1\mathbf{h}_1 + \theta_2\mathbf{h}_2$ lies in that plane while e lifts x out of it.]
LS Orthogonality Principle

The LS error vector must be ⊥ to all columns of H:

$$\boldsymbol\varepsilon^T\mathbf{H} = \mathbf{0}^T \quad\text{or}\quad \mathbf{H}^T\boldsymbol\varepsilon = \mathbf{0}$$

Can use this property to derive the LS estimate:

$$\mathbf{H}^T\boldsymbol\varepsilon = \mathbf{0} \;\Rightarrow\; \mathbf{H}^T(\mathbf{x}-\mathbf{H}\boldsymbol\theta) = \mathbf{0} \;\Rightarrow\; \mathbf{H}^T\mathbf{H}\boldsymbol\theta = \mathbf{H}^T\mathbf{x} \;\Rightarrow\; \hat{\boldsymbol\theta}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$$

Same answer as before, but no derivatives to worry about!

[Figure: $(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$ acts like an inverse from $\mathbb{R}^N$ back to $\mathbb{R}^p$; it is called the pseudo-inverse of H.]
LS Projection Viewpoint

From the $\mathbb{R}^3$ example earlier, we see that ŝ must lie "right below" x. From our earlier results we have:

$$\hat{\mathbf{s}} = \mathbf{H}\hat{\boldsymbol\theta}_{LS} = \underbrace{\mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T}_{\triangleq\,\mathbf{P}_H}\,\mathbf{x} = \mathbf{P}_H\,\mathbf{x}$$
Aside on Projections

If something is "on the floor", its projection onto the floor is itself:

$$\text{if } \mathbf{z}\in\operatorname{Range}(\mathbf{H}), \text{ then } \mathbf{P}_H\mathbf{z} = \mathbf{z}, \quad\text{where } \mathbf{P}_H = \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T \text{ (easily verified)}$$
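Both properties are easy to check numerically, e.g.:

```python
import numpy as np

# Numerical check of the projection properties with a random tall H.
rng = np.random.default_rng(2)
H = rng.standard_normal((6, 2))                  # N = 6, p = 2
P = H @ np.linalg.inv(H.T @ H) @ H.T             # P_H = H (H^T H)^{-1} H^T
z = H @ np.array([1.5, -2.0])                    # z lies in Range(H)
assert np.allclose(P @ P, P)                     # idempotent: P_H P_H = P_H
assert np.allclose(P @ z, z)                     # P_H z = z on Range(H)
```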
What Happens w/ Orthonormal Columns of H

Recall the general linear LS solution $\hat{\boldsymbol\theta}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$, where

$$\mathbf{H}^T\mathbf{H} = \begin{bmatrix}\langle\mathbf{h}_1,\mathbf{h}_1\rangle & \langle\mathbf{h}_1,\mathbf{h}_2\rangle & \cdots & \langle\mathbf{h}_1,\mathbf{h}_p\rangle\\ \langle\mathbf{h}_2,\mathbf{h}_1\rangle & \langle\mathbf{h}_2,\mathbf{h}_2\rangle & \cdots & \langle\mathbf{h}_2,\mathbf{h}_p\rangle\\ \vdots & \vdots & \ddots & \vdots\\ \langle\mathbf{h}_p,\mathbf{h}_1\rangle & \langle\mathbf{h}_p,\mathbf{h}_2\rangle & \cdots & \langle\mathbf{h}_p,\mathbf{h}_p\rangle\end{bmatrix}$$

With orthonormal columns this matrix of inner products is just I, so $\hat{\boldsymbol\theta}_{LS} = \mathbf{H}^T\mathbf{x}$.
Geometry with Orthonormal Columns of H

Re-write this LS solution component-wise as $\hat\theta_i = \mathbf{h}_i^T\mathbf{x}$, the inner product between the i-th column and the data vector. Then we have:

$$\hat{\mathbf{s}} = \mathbf{H}\hat{\boldsymbol\theta} = \sum_{i=1}^{p}\hat\theta_i\,\mathbf{h}_i = \sum_{i=1}^{p}\underbrace{(\mathbf{h}_i^T\mathbf{x})\,\mathbf{h}_i}_{\text{projection of }\mathbf{x}\text{ onto the }\mathbf{h}_i\text{ axis}}$$

When the columns of H are ⊥ we can first find the projection onto each 1-D subspace independently, then add these independently derived results:

$$\hat{\mathbf{s}} = (\mathbf{h}_1^T\mathbf{x})\,\mathbf{h}_1 + (\mathbf{h}_2^T\mathbf{x})\,\mathbf{h}_2 \qquad \text{Nice!}$$
8.6 Order-Recursive LS

Motivate this idea with curve fitting. Given data s[0], s[1], …, s[N−1] at n = 0, 1, 2, …, N−1, fit models of increasing order p (p = # of parameters in the model).

[Figure: the same data fit with models of order p = 1, 2, 3, 4.]
Choosing the Best Model Order

Q: Should you pick the order p that gives the smallest J_min?
A: NO!!!! Fact: J_min(p) is monotonically non-increasing as the order p increases.

[Figure: $\tilde{\mathbf{h}}_3$, the orthogonalized version of $\mathbf{h}_3$, shown relative to $S^2$, the 2-D space spanned by $\mathbf{h}_1$ & $\mathbf{h}_2$ ( = Range($\mathbf{H}_2$)).]
The update from order k to order k+1 adds the component of x along the orthogonalized new column $\tilde{\mathbf{h}}_{k+1} = \mathbf{P}_k^\perp\mathbf{h}_{k+1}$:

$$\Delta\hat{\mathbf{s}}_{k+1} = \left\langle\mathbf{x},\frac{\tilde{\mathbf{h}}_{k+1}}{\|\tilde{\mathbf{h}}_{k+1}\|}\right\rangle\frac{\tilde{\mathbf{h}}_{k+1}}{\|\tilde{\mathbf{h}}_{k+1}\|} = \frac{\mathbf{x}^T\tilde{\mathbf{h}}_{k+1}}{\|\tilde{\mathbf{h}}_{k+1}\|^2}\,\tilde{\mathbf{h}}_{k+1} = \underbrace{\frac{\mathbf{x}^T\mathbf{P}_k^\perp\mathbf{h}_{k+1}}{\|\mathbf{P}_k^\perp\mathbf{h}_{k+1}\|^2}}_{\text{scalar!}}\,\mathbf{P}_k^\perp\mathbf{h}_{k+1}$$

so that $\hat{\mathbf{s}}_{k+1} = \mathbf{H}_k\hat{\boldsymbol\theta}_k + \Delta\hat{\mathbf{s}}_{k+1}$. Now we have:

$$\hat{\mathbf{s}}_{k+1} = \mathbf{H}_k\hat{\boldsymbol\theta}_k + \frac{\mathbf{x}^T\mathbf{P}_k^\perp\mathbf{h}_{k+1}}{\|\mathbf{P}_k^\perp\mathbf{h}_{k+1}\|^2}\,\mathbf{P}_k^\perp\mathbf{h}_{k+1}$$

The fraction is a scalar, so it can be moved and transposed; writing out $\mathbf{P}_k^\perp = \mathbf{I}-\mathbf{P}_k$ and writing out $\|\cdot\|^2$ using the fact that $\mathbf{P}_k^\perp$ is idempotent:

$$\hat{\mathbf{s}}_{k+1} = \mathbf{H}_k\hat{\boldsymbol\theta}_k + (\mathbf{I}-\mathbf{P}_k)\,\mathbf{h}_{k+1}\,\underbrace{\frac{\mathbf{h}_{k+1}^T\mathbf{P}_k^\perp\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^\perp\mathbf{h}_{k+1}}}_{\text{scalar… define as }b}$$

Clearly this identifies $\hat{\boldsymbol\theta}_{k+1}$.
Order-Recursive LS Solution

$$\hat{\boldsymbol\theta}_{k+1} = \begin{bmatrix}\hat{\boldsymbol\theta}_k - (\mathbf{H}_k^T\mathbf{H}_k)^{-1}\mathbf{H}_k^T\mathbf{h}_{k+1}\,\dfrac{\mathbf{h}_{k+1}^T\mathbf{P}_k^\perp\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^\perp\mathbf{h}_{k+1}}\\[2ex] \dfrac{\mathbf{h}_{k+1}^T\mathbf{P}_k^\perp\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^\perp\mathbf{h}_{k+1}}\end{bmatrix}$$
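A direct transcription of this update as a function (the names are illustrative, not the book's):

```python
import numpy as np

def order_update(theta_k, Hk, h_new, x):
    """One order-recursive LS update, following the stacked formula above."""
    HtH_inv = np.linalg.inv(Hk.T @ Hk)
    Pk_perp = np.eye(len(x)) - Hk @ HtH_inv @ Hk.T          # I - P_k
    b = (h_new @ Pk_perp @ x) / (h_new @ Pk_perp @ h_new)   # new coefficient
    top = theta_k - HtH_inv @ (Hk.T @ h_new) * b            # corrected old block
    return np.append(top, b)                                # theta_{k+1}
```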
8.7 Sequential LS

Consider estimating a DC level sequentially: when a new sample x[N] arrives, the sample mean updates as

$$\hat A_N = \frac{1}{N+1}\sum_{n=0}^{N}x[n] = \frac{1}{N+1}\left[N\cdot\underbrace{\frac{1}{N}\sum_{n=0}^{N-1}x[n]}_{\hat A_{N-1}} + x[N]\right] = \underbrace{\frac{N}{N+1}}_{=\,1-\frac{1}{N+1}}\hat A_{N-1} + \frac{1}{N+1}\,x[N]$$

$$\hat A_N = \underbrace{\hat A_{N-1}}_{\text{old estimate}} + \frac{1}{N+1}\underbrace{\big(x[N] - \underbrace{\hat A_{N-1}}_{\text{prediction of the new data}}\big)}_{\text{prediction error}}$$
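In code, the recursion is one line per new sample:

```python
import numpy as np

def sequential_mean(x):
    """Recursive sample mean: A_N = A_{N-1} + (x[N] - A_{N-1})/(N+1)."""
    A = 0.0
    for N, xN in enumerate(x):
        A = A + (xN - A) / (N + 1)   # old estimate + gain * prediction error
    return A

assert np.isclose(sequential_mean([1.0, 2.0, 3.0]), 2.0)
```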
Weighted Sequential LS for DC-Level Case

This is an even better illustration. Assumed model: x[n] = A + w[n], where w[n] has unknown PDF but known time-dependent variance $\operatorname{var}\{w[n]\} = \sigma_n^2$. Standard WLS gives:

$$\hat A_{N-1} = \frac{\displaystyle\sum_{n=0}^{N-1}\frac{x[n]}{\sigma_n^2}}{\displaystyle\sum_{n=0}^{N-1}\frac{1}{\sigma_n^2}}$$
Exploring the Gain Term

We know that $\operatorname{var}(\hat A_{N-1}) = \dfrac{1}{\sum_{n=0}^{N-1}1/\sigma_n^2}$, and using it in the gain $k_N$ we get

$$k_N = \frac{\overbrace{\operatorname{var}(\hat A_{N-1})}^{\text{"poorness" of current estimate}}}{\operatorname{var}(\hat A_{N-1}) + \underbrace{\sigma_N^2}_{\text{"poorness" (variance) of the new data}}}$$

Note: $0 \le k_N \le 1$. If $\operatorname{var}(\hat A_{N-1}) \ll \sigma_N^2$ (good estimate, bad data), then $k_N \approx 0$; in the opposite case $k_N \approx 1$.

For the general sequential LS setup: $\hat{\boldsymbol\theta}_{n-1}$ is the LS estimate using the data $\mathbf{x}_{n-1}$, and $\boldsymbol\Sigma_{n-1} \triangleq \operatorname{cov}\{\hat{\boldsymbol\theta}_{n-1}\}$ is the quality measure of that estimate.
Sequential LS Block Diagram

[Block diagram: from $\boldsymbol\Sigma_{n-1}$, $\mathbf{h}_n$, $\sigma_n^2$ compute the gain; form the predicted observation $\mathbf{h}_n^T\hat{\boldsymbol\theta}_{n-1}$ from the previous estimate (held in a one-sample delay $z^{-1}$); combine the new observation with the gain-weighted prediction error to get the updated estimate.]
8.8 Constrained LS
Why Constrain? Because sometimes we know (or believe!)
certain values are not allowed for θ
For example: In emitter location you may know that the emitter’s
range can’t exceed the “radio horizon”
You may also know that the emitter is on the left side of the
aircraft (because you got a strong signal from the left-side
antennas and a weak one from the right-side antennas)
Constrained LS Problem Statement

Say that $S_c$ is the set of allowable θ values (due to constraints). Then we seek $\hat{\boldsymbol\theta}_{CLS}\in S_c$ such that

$$\left\|\mathbf{x}-\mathbf{H}\hat{\boldsymbol\theta}_{CLS}\right\|^2 = \min_{\boldsymbol\theta\in S_c}\left\|\mathbf{x}-\mathbf{H}\boldsymbol\theta\right\|^2$$

[Figure: contours of $(\mathbf{x}-\mathbf{H}\boldsymbol\theta)^T(\mathbf{x}-\mathbf{H}\boldsymbol\theta)$ with a 2-D linear equality constraint line; the constrained minimum lies on the line, away from the unconstrained minimum.]
Constrained Optimization: Lagrange Multiplier

[Figure: contours of $f(x_1,x_2)$ together with the constraint $g(x_1,x_2) = C$, written as $h(x_1,x_2) = g(x_1,x_2) - C = 0$; for a linear constraint the gradient is

$$\nabla h(x_1,x_2) = \begin{bmatrix}\partial h/\partial x_1\\ \partial h/\partial x_2\end{bmatrix} = \begin{bmatrix}a\\ b\end{bmatrix}\ .]$$
For a linear equality constraint $\mathbf{A}\boldsymbol\theta = \mathbf{b}$:

1. Set the gradient of the Lagrangian to zero:

$$-2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol\theta + \mathbf{A}^T\boldsymbol\lambda = \mathbf{0} \;\Rightarrow\; \hat{\boldsymbol\theta}_c(\boldsymbol\lambda) = \underbrace{(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}}_{\hat{\boldsymbol\theta}_{uc}\ \text{(unconstrained estimate)}} - \tfrac{1}{2}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\boldsymbol\lambda$$

2. Enforce the constraint $\mathbf{A}\hat{\boldsymbol\theta}_c(\boldsymbol\lambda) = \mathbf{b}$ to find λ:

$$\boldsymbol\lambda_c = 2\left[\mathbf{A}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\right]^{-1}\left(\mathbf{A}\hat{\boldsymbol\theta}_{uc}-\mathbf{b}\right)$$

3. Plug in to get the constrained solution $\hat{\boldsymbol\theta}_c = \hat{\boldsymbol\theta}_c(\boldsymbol\lambda_c)$:

$$\hat{\boldsymbol\theta}_c = \hat{\boldsymbol\theta}_{uc} - \underbrace{(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\left[\mathbf{A}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\right]^{-1}}_{\text{"correction term"}}\underbrace{\left(\mathbf{A}\hat{\boldsymbol\theta}_{uc}-\mathbf{b}\right)}_{\text{amount of constraint deviation}}$$
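A sketch of this three-step recipe as one function (names are mine):

```python
import numpy as np

def constrained_ls(H, x, A, b):
    """Equality-constrained LS: theta_c = theta_uc - correction * deviation."""
    HtH_inv = np.linalg.inv(H.T @ H)
    theta_uc = HtH_inv @ H.T @ x                           # 1) unconstrained LS
    K = HtH_inv @ A.T @ np.linalg.inv(A @ HtH_inv @ A.T)   # 2) correction term
    return theta_uc - K @ (A @ theta_uc - b)               # 3) enforce A theta = b
```

One can check that the output satisfies the constraint exactly: A @ constrained_ls(H, x, A, b) equals b up to round-off.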
Geometry of Constrained Linear LS

The above result can be interpreted geometrically:

[Figure: the unconstrained projection $\hat{\mathbf{s}}_{uc}$ of s, and the constrained solution $\hat{\mathbf{s}}_c$, which lies on the constraint line.]
8.9 Nonlinear LS

One useful trick is separability. Partition $\boldsymbol\theta = [\underbrace{A_1\ A_2\ A_3}_{\boldsymbol\beta^T}\ \underbrace{r}_{\alpha}]^T$; then we can write $\mathbf{s}(\boldsymbol\theta) = \mathbf{H}(r)\,\boldsymbol\beta$ with

$$\mathbf{H}(r) = \begin{bmatrix}1 & 1 & 1\\ r & r^2 & r^3\\ \vdots & \vdots & \vdots\\ r^{N-1} & r^{2(N-1)} & r^{3(N-1)}\end{bmatrix}$$

For each fixed r the linear part has the closed-form solution $\hat{\boldsymbol\beta}(r) = [\mathbf{H}^T(r)\mathbf{H}(r)]^{-1}\mathbf{H}^T(r)\,\mathbf{x}$. Then we need to minimize

$$J(r) = \left[\mathbf{x}-\mathbf{H}(r)\hat{\boldsymbol\beta}(r)\right]^T\left[\mathbf{x}-\mathbf{H}(r)\hat{\boldsymbol\beta}(r)\right]$$

which depends on only one variable, so one might conceivably just compute it on a grid and find the minimum.
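A sketch of that 1-D grid search (assuming, say, 0 < r < 1 so the powers are well behaved):

```python
import numpy as np

def separable_ls(x, r_grid):
    """For each candidate r, solve the linear part in closed form,
    then pick the r that minimizes the residual J(r)."""
    N = len(x)
    n = np.arange(N)[:, None]
    best = (np.inf, None, None)
    for r in r_grid:
        Hr = r ** (n * np.array([1, 2, 3]))            # columns r^n, r^{2n}, r^{3n}
        beta, *_ = np.linalg.lstsq(Hr, x, rcond=None)  # beta_hat(r), closed form
        J = np.sum((x - Hr @ beta) ** 2)               # J(r)
        if J < best[0]:
            best = (J, r, beta)
    return best                                        # (J_min, r_hat, beta_hat)
```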
In general, the nonlinear LS cost is $J(\boldsymbol\theta) = \sum_{i=0}^{N-1}(x[i]-s_\theta[i])^2$. Taking the partials gives:

$$\frac{\partial J(\boldsymbol\theta)}{\partial\theta_j} = -2\sum_{i=0}^{N-1}\underbrace{\left(x[i]-s_\theta[i]\right)}_{\triangleq\, r_i}\,\underbrace{\frac{\partial s_\theta[i]}{\partial\theta_j}}_{\triangleq\, h_{ij}}$$

(the factor of −2 can be ignored when setting this to zero… why?)

Now set to zero: $\sum_{i=0}^{N-1}r_i\,h_{ij} = 0$ for $j = 1,\ldots,p$, i.e.

$$\mathbf{g}(\boldsymbol\theta) = \mathbf{H}_\theta^T\,\mathbf{r}_\theta = \mathbf{0}$$

a matrix-times-vector equation that depends nonlinearly on θ, with

$$\mathbf{H}_\theta = \begin{bmatrix}\frac{\partial s_\theta[0]}{\partial\theta_1} & \frac{\partial s_\theta[0]}{\partial\theta_2} & \cdots & \frac{\partial s_\theta[0]}{\partial\theta_p}\\ \frac{\partial s_\theta[1]}{\partial\theta_1} & \frac{\partial s_\theta[1]}{\partial\theta_2} & \cdots & \frac{\partial s_\theta[1]}{\partial\theta_p}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial s_\theta[N-1]}{\partial\theta_1} & \frac{\partial s_\theta[N-1]}{\partial\theta_2} & \cdots & \frac{\partial s_\theta[N-1]}{\partial\theta_p}\end{bmatrix} \qquad \mathbf{r}_\theta = \begin{bmatrix}x[0]-s_\theta[0]\\ x[1]-s_\theta[1]\\ \vdots\\ x[N-1]-s_\theta[N-1]\end{bmatrix}$$

Then the equation to solve is:

$$\mathbf{g}(\boldsymbol\theta) = \mathbf{H}_\theta^T\mathbf{r}_\theta = \sum_{n=0}^{N-1}r_\theta[n]\,\mathbf{h}_n(\boldsymbol\theta) = \mathbf{0}$$

where $\mathbf{h}_n(\boldsymbol\theta)$ is the n-th row of $\mathbf{H}_\theta$ written as a column.
For Newton-Raphson we linearize g(θ) around our current estimate and iterate:

$$\hat{\boldsymbol\theta}_{k+1} = \hat{\boldsymbol\theta}_k - \left.\left[\frac{\partial\mathbf{g}(\boldsymbol\theta)}{\partial\boldsymbol\theta}\right]^{-1}\mathbf{g}(\boldsymbol\theta)\right|_{\boldsymbol\theta=\hat{\boldsymbol\theta}_k} = \hat{\boldsymbol\theta}_k - \left.\left[\frac{\partial\,\mathbf{H}_\theta^T\mathbf{r}_\theta}{\partial\boldsymbol\theta}\right]^{-1}\mathbf{H}_\theta^T\mathbf{r}_\theta\right|_{\boldsymbol\theta=\hat{\boldsymbol\theta}_k}$$

We need the derivative in brackets. By the product rule:

$$\frac{\partial\,\mathbf{H}_\theta^T\mathbf{r}_\theta}{\partial\boldsymbol\theta} = \frac{\partial}{\partial\boldsymbol\theta}\sum_{n=0}^{N-1}\mathbf{h}_n(\boldsymbol\theta)\,r_\theta[n] = \sum_{n=0}^{N-1}\underbrace{\frac{\partial\mathbf{h}_n(\boldsymbol\theta)}{\partial\boldsymbol\theta}}_{\triangleq\,\mathbf{G}_n(\boldsymbol\theta)}r_\theta[n] + \underbrace{\sum_{n=0}^{N-1}\mathbf{h}_n(\boldsymbol\theta)\frac{\partial r_\theta[n]}{\partial\boldsymbol\theta}}_{=\,-\mathbf{H}_\theta^T\mathbf{H}_\theta}$$

where

$$[\mathbf{G}_n(\boldsymbol\theta)]_{ij} = \frac{\partial^2 s_\theta[n]}{\partial\theta_i\,\partial\theta_j},\quad i,j = 1,2,\ldots,p \qquad\text{and}\qquad \frac{\partial r_\theta[n]}{\partial\boldsymbol\theta} = \frac{\partial\left(x[n]-s_\theta[n]\right)}{\partial\boldsymbol\theta} = -\frac{\partial s_\theta[n]}{\partial\boldsymbol\theta}$$

So:

$$\frac{\partial\,\mathbf{H}_\theta^T\mathbf{r}_\theta}{\partial\boldsymbol\theta} = \sum_{n=0}^{N-1}\mathbf{G}_n(\boldsymbol\theta)\left(x[n]-s_\theta[n]\right) - \mathbf{H}_\theta^T\mathbf{H}_\theta$$
So the Newton-Raphson method becomes:

$$\hat{\boldsymbol\theta}_{k+1} = \hat{\boldsymbol\theta}_k - \left.\left[\frac{\partial\,\mathbf{H}_\theta^T\mathbf{r}_\theta}{\partial\boldsymbol\theta}\right]^{-1}\mathbf{H}_\theta^T\mathbf{r}_\theta\right|_{\boldsymbol\theta=\hat{\boldsymbol\theta}_k} = \hat{\boldsymbol\theta}_k + \left[\mathbf{H}_{\hat\theta_k}^T\mathbf{H}_{\hat\theta_k} - \sum_{n=0}^{N-1}\mathbf{G}_n(\hat{\boldsymbol\theta}_k)\left(x[n]-s_{\hat\theta_k}[n]\right)\right]^{-1}\mathbf{H}_{\hat\theta_k}^T\left(\mathbf{x}-\mathbf{s}_{\hat\theta_k}\right)$$

The Gauss-Newton method instead linearizes the signal model around the current estimate, $\mathbf{s}_\theta \approx \mathbf{s}_{\hat\theta_k} + \mathbf{H}_{\hat\theta_k}(\boldsymbol\theta-\hat{\boldsymbol\theta}_k)$, so the cost becomes

$$J(\boldsymbol\theta) \approx \left[\mathbf{x}-\left\{\mathbf{s}_{\hat\theta_k}+\mathbf{H}_{\hat\theta_k}(\boldsymbol\theta-\hat{\boldsymbol\theta}_k)\right\}\right]^T\left[\mathbf{x}-\left\{\mathbf{s}_{\hat\theta_k}+\mathbf{H}_{\hat\theta_k}(\boldsymbol\theta-\hat{\boldsymbol\theta}_k)\right\}\right] = \left[\underbrace{\mathbf{x}-\mathbf{s}_{\hat\theta_k}+\mathbf{H}_{\hat\theta_k}\hat{\boldsymbol\theta}_k}_{\triangleq\,\mathbf{y}\ \text{(all known things)}}-\mathbf{H}_{\hat\theta_k}\boldsymbol\theta\right]^T\left[\mathbf{y}-\mathbf{H}_{\hat\theta_k}\boldsymbol\theta\right]$$
This gives a form for the LS cost that looks like a linear problem!

$$J(\boldsymbol\theta) = \left[\mathbf{y}-\mathbf{H}_{\hat\theta_k}\boldsymbol\theta\right]^T\left[\mathbf{y}-\mathbf{H}_{\hat\theta_k}\boldsymbol\theta\right]$$

so the linear LS solution applies:

$$\hat{\boldsymbol\theta}_{k+1} = \left[\mathbf{H}_{\hat\theta_k}^T\mathbf{H}_{\hat\theta_k}\right]^{-1}\mathbf{H}_{\hat\theta_k}^T\left(\mathbf{x}-\mathbf{s}_{\hat\theta_k}+\mathbf{H}_{\hat\theta_k}\hat{\boldsymbol\theta}_k\right) = \underbrace{\left[\mathbf{H}_{\hat\theta_k}^T\mathbf{H}_{\hat\theta_k}\right]^{-1}\mathbf{H}_{\hat\theta_k}^T\mathbf{H}_{\hat\theta_k}}_{=\,\mathbf{I}}\hat{\boldsymbol\theta}_k + \left[\mathbf{H}_{\hat\theta_k}^T\mathbf{H}_{\hat\theta_k}\right]^{-1}\mathbf{H}_{\hat\theta_k}^T\left(\mathbf{x}-\mathbf{s}_{\hat\theta_k}\right)$$

Gauss-Newton LS iteration:

$$\hat{\boldsymbol\theta}_{k+1} = \hat{\boldsymbol\theta}_k + \left[\mathbf{H}_{\hat\theta_k}^T\mathbf{H}_{\hat\theta_k}\right]^{-1}\mathbf{H}_{\hat\theta_k}^T\left(\mathbf{x}-\mathbf{s}_{\hat\theta_k}\right)$$

Gauss-Newton LS iteration steps:
1. Start with an initial estimate.
2. Iterate the above equation until the change is "small".
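A sketch of the iteration for a hypothetical one-parameter model $s[n;\theta] = e^{-\theta n}$ (this example model is mine, chosen only to make the steps concrete):

```python
import numpy as np

def gauss_newton(x, theta0, n_iter=20):
    """Gauss-Newton for the illustrative model s[n; theta] = exp(-theta*n)."""
    n = np.arange(len(x))
    theta = theta0
    for _ in range(n_iter):
        s = np.exp(-theta * n)                            # s_{theta_k}
        H = (-n * s)[:, None]                             # Jacobian ds/dtheta, N x 1
        step, *_ = np.linalg.lstsq(H, x - s, rcond=None)  # (H^T H)^{-1} H^T (x - s)
        theta = theta + step[0]
    return theta
```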
Newton-Raphson vs. Gauss-Newton

How do these two methods compare?

G-N:
$$\hat{\boldsymbol\theta}_{k+1} = \hat{\boldsymbol\theta}_k + \left[\mathbf{H}_{\hat\theta_k}^T\mathbf{H}_{\hat\theta_k}\right]^{-1}\mathbf{H}_{\hat\theta_k}^T\left(\mathbf{x}-\mathbf{s}_{\hat\theta_k}\right)$$

N-R:
$$\hat{\boldsymbol\theta}_{k+1} = \hat{\boldsymbol\theta}_k + \left[\mathbf{H}_{\hat\theta_k}^T\mathbf{H}_{\hat\theta_k} - \sum_{n=0}^{N-1}\mathbf{G}_n(\hat{\boldsymbol\theta}_k)\left(x[n]-s_{\hat\theta_k}[n]\right)\right]^{-1}\mathbf{H}_{\hat\theta_k}^T\left(\mathbf{x}-\mathbf{s}_{\hat\theta_k}\right)$$

The only difference is the second-derivative term $\sum_n\mathbf{G}_n(\cdot)$ inside the inverse; Gauss-Newton drops it.
8.10 Signal Processing Examples of LS

We'll briefly look at two examples from the book. Book examples:
1. Digital filter design
2. AR parameter estimation for the ARMA model
3. Adaptive noise cancellation
4. Phase-locked loop (used in phase-coherent demodulation)

The two examples we will cover highlight the flexibility of the LS viewpoint! Then (in separate note files) we'll look in detail at two emitter location examples not in the book.
Ex. 8.11 Filter Design by Prony's LS Method

The problem:
- You have some desired impulse response h_d[n]
- Find a rational transfer function with impulse response h[n] ≈ h_d[n]
Ex. 8.13 Adaptive Noise Cancellation (done a bit differently from the book)

[Block diagram: the primary input x[n] = d[n] + i[n] (desired signal + interference); a reference $\tilde i[n]$ drives an adaptive FIR filter whose output $\hat i[n]$ is subtracted, giving $\hat d[n]$, the estimate of the desired signal with "cancelled" interference.]

1. Fetal monitoring: the mother's heartbeat sensed via the chest provides the reference; the adaptive filter has to mimic the TF of the chest-to-stomach propagation.
2. Noise Canceling Headphones

[Block diagram: the ear receives the music signal m[n] plus ambient noise i[n]; a reference $\tilde i[n]$ drives an adaptive FIR filter producing $\hat i[n]$, so the ear hears $m[n] + i[n] - \hat i[n]$ and the noise terms cancel.]
3. Bistatic Radar System

[Block diagram: the receiver collects x[n] = t[n] + d_t[n], the target return t[n] (desired) plus direct-path transmit signal d_t[n] (interference); the directly received d[n] drives an adaptive FIR filter producing $\hat d_t[n]$; subtracting gives $\hat t[n]$, which feeds delay/Doppler radar processing.]
LS and Adaptive Noise Cancellation

Goal: adjust the filter coefficients to cancel the interference. There are many signal processing approaches to this problem; we'll look at it from a LS point of view: adjust the filter coefficients to minimize $J = \sum_n\hat d^2[n]$. Because i[n] is uncorrelated with d[n], minimizing J is essentially the same as driving the interference term to zero.

To let the filter adapt over time, use an exponentially weighted running cost:

$$J[n] = \sum_{k=0}^{n}\lambda^{n-k}\left[x[k]-\hat i[k]\right]^2 = \sum_{k=0}^{n}\lambda^{n-k}\left[x[k]-\sum_{l=0}^{p-1}h_n[l]\,\tilde i[k-l]\right]^2$$

where λ is the forgetting factor (0 < λ < 1); a small λ quickly "down-weights" the past errors.
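A deliberately direct (and inefficient) sketch that re-solves the weighted LS at every step, purely to mirror the cost J[n] above; an RLS-style recursion would be the efficient implementation:

```python
import numpy as np

def ewls_cancel(x, i_ref, p=8, lam=0.99):
    """Exponentially weighted LS noise canceller: at each n, fit the p FIR taps
    h_n to the reference i_ref by minimizing J[n], then output d_hat[n]."""
    n_samp = len(x)
    h = np.zeros(p)
    d_hat = np.zeros(n_samp)
    for n in range(p, n_samp):
        ks = np.arange(p, n + 1)
        T = np.stack([i_ref[k - np.arange(p)] for k in ks])  # rows [i[k],...,i[k-p+1]]
        w = np.sqrt(lam ** (n - ks))                         # sqrt of exp. weights
        h, *_ = np.linalg.lstsq(w[:, None] * T, w * x[p:n + 1], rcond=None)
        d_hat[n] = x[n] - T[-1] @ h                          # cancelled output
    return d_hat, h
```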