Vous êtes sur la page 1sur 126

1/142

Future Random Matrix Tools


for Large Dimensional Signal Processing
EUSIPCO 2014, Lisbon, Portugal.

Abla KAMMOUN1 and Romain COUILLET2

1 Kings Abdullah University of Technology and Science, Saudi Arabia


2 SUPELEC, France

September 1st, 2014


Introduction/ 2/142

High-dimensional data

I Consider n observations x1 , , xn of size N,


 independent and identically distributed with
zero-mean and covariance CN , i.e, E x1 xH 1 = CN ,
I Let XN = [x1 , , xn ]. The sample covariance estimate SN of CN is given by:
X n
SN = n1 XN XH
N = n
1 xi xi ,
i =1
I From the law of large numbers, as n +,
a.s.
SN CN .

Convergence in the operator norm


I In practice, it might be difficult to afford n +,
I if n  N, SN can be sufficiently accurate,
I if N /n = O(1), we model this scenario by the following assumption: N + and n + with
N
n c,
I Under this assumption, we have pointwise convergence to each element of CN , i.e,
 
a.s.
SN (CN )i,j
i,j

but kSN CN k does not converge to zero.


The convergence in the operator norm does not hold.
Introduction/ 3/142

Illustration
Consider CN = IN , the spectrum of SN is different from that of CN
Spectrum of eigenvalues
Marchenko-Pastur Law
1

0.8

0.6
Histogram

0.4

0.2

0 0.5 1 1.5 2

Eigenvalues of SbN

Figure: Spectrum of eigenvalues when N = 400 and n = 2000

The asymptotic spectrum can be characterized by the Marchenko-Pastur Law.


Introduction/ 4/142

Reasons of interest for signal processing


I Scale similarity in array processing applications: large antenna arrays vs limited number of
observations,
I Need for detection and estimation based on large dimensional random inputs: subspace
methods in array processing.
I The assumption number of obervations  dimension of observation is no longer valid:
large arrays, systems with fast dynamics.

Example
MUSIC with few samples (or in large arrays) Call A() = [a(1 ), . . . , a(K )] CN K , N large,
K small, the steering vectors to identify and X = [x1 , . . . , xn ] CN n the n samples, taken from

X
K

xt = a(k ) p k sk,t + wt .
k =1

W U
The MUSIC localization function reads () = a()H U H a() in the signal vs. noise
W
spectral decomposition XXH = US H + U
SU W WUH .
S W
Writing equivalently A()PA()H + 2 IN = US S UHS + UW UW , as n, N , n/N c,
2 H

from our previous remarks


W U
U H 6 UW UH
W W

Music is NOT consistent in the large N, n regime! We need improved RMT-based solutions.
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 7/142

Stieltjes Transform

Definition
Let F be a real probability distribution function. The Stieltjes transform mF of F is the function
defined, for z C+ , as Z
1
mF (z ) = dF ()
z
For a < b continuity points of F , denoting z = x + iy , we have the inverse formula
Zb
1
F (b ) F (a) = lim =[mF (x + iy )]dx
y 0 a

If F has a density f at x, then


1
f (x ) = lim =[mF (x + iy )]
y 0

The Stieltjes transform is to the Cauchy transform as the characteristic functin is to the Fourier
transform.
Equivalence F mF
Similar to the Fourier transform, knowing mF is the same as knowing F .
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 8/142

Stieltjes transform of a Hermitian matrix

I Let X be a N N randomP matrix. Denote by dF X the empirical measure of its eigenvalues


1 , , N , i.e, dF X = N1 Ni =1 i . The Stieltjes transform of X denoted by mX = mF is
the stieltjes transform of its empirical measure:
Z
1 X 1
N
1 1
mX (z ) = dF () = = tr (X zIN )1 .
z N i z N
i =1

I The Stieltjes transform of a random matrix is the trace of the resolvent matrix
Q(z ) = (X zIN )1 . The resolvent matrix plays a key role in the derivation of many of the
results of random matrix theory.
I For compactly supported F , mF (z ) is linked to the moments Mk = E N1 tr Xk ,

X
+
mF (z ) = Mk z k 1
k =0

I mF is defined in general on C+ but exists everywhere outside the support of F .


Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 9/142

Side remark: the Shannon-transform

A. M. Tulino, S. Verd`
u, Random matrix theory and wireless communications, Now Publishers
Inc., 2004.

Definition
Let F be a probability distribution, mF its Stieltjes transform, then the Shannon-transform VF of
F is defined as Z Z  
1
VF (x ) , log(1 + x )dF () = mF (t ) dt
0 x t

I This quantity is fundamental to wireless communication purposes!


I Note that mF itself is of interest, not F !
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 10/142

Proof of the Marcenko-Pastur law

V. A. Mar
cenko, L. A. Pastur, Distributions of eigenvalues for some sets of random matrices,
Math USSR-Sbornik, vol. 1, no. 4, pp. 457-483, 1967.
The theorem to be proven is the following
Theorem
Let XN CN n have i.i.d. zero mean variance 1/n entries with finite eighth order moments. As
n, N with Nn c (0, ), the e.s.d. of XN XHN converges almost surely to a nonrandom
distribution function Fc with density fc given by
1 p
fc (x ) = (1 c 1 )+ (x ) + (x a ) + ( b x ) +
2cx

where a = (1 c )2 , and b = (1 + c )2 .
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 11/142

The Marcenko-Pastur density


1.2
c = 0.1
c = 0.2
1 c = 0.5

0.8
Density fc (x )

0.6

0.4

0.2

0
0 0.5 1 1.5 2 2.5 3

Figure: Mar
cenko-Pastur law for different limit ratios c = limN N /n.
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 12/142

Diagonal entries of the resolvent

Since we want an expression of mF , we start by identifying the diagonal entries of the resolvent
( XN XH
N zIN )
1 of X XH . Denote
N N
yH
 
XN =
Y
Now, for z C+ , we have
1
yH y z yH Y H
 1 
XN XH
N zIN =
Yy YYH zIN 1

Consider the first diagonal element of (RN zIN )1 . From the matrix inversion lemma,
1
(A BD1 C)1 A1 B(D CA1 B)1
  
A B
=
C D (A BD1 C)1 CA1 (D CA1 B)1

which here gives  1  1


XN XH
N zIN =
11 z zyH (YH Y zIn )1 y
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 13/142

Trace Lemma

Z. Bai, J. Silverstein, Spectral Analysis of Large Dimensional Random Matrices, Springer Series
in Statistics, 2009.
To go further, we need the following result,
Theorem
Let {AN } CN N with bounded spectral norm. Let {xN } CN , be a random vector of i.i.d.
entries with zero mean, variance 1/N and finite 8th order moment, independent of AN . Then

1 a.s.
xH
N AN xN tr AN 0.
N

For large N, we therefore have approximately


 1  1
XN XHN zIN '
11 z z N1 tr (YH Y zIn )1
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 14/142

Rank-1 perturbation lemma

J. W. Silverstein, Z. D. Bai, On the empirical distribution of eigenvalues of a class of large


dimensional random matrices, Journal of Multivariate Analysis, vol. 54, no. 2, pp. 175-192,
1995.
It is somewhat intuitive that adding a single column to Y wont affect the trace in the limit.
Theorem
Let A and B be N N with B Hermitian positive definite, and v CN . For z C \ R ,

tr (B zIN )1 (B + vvH zIN )1 A 6 1
1  kAk
N N dist(z, R+ )

with kAk the spectral norm of A, and dist(z, A) = inf y A ky z k.


Therefore, for large N, we have approximately,
 1  1
XN XH
N zIN '
11 z z N1 tr (YH Y zIn )1
1
'
z z N1 tr (XH
N XN zIn )
1

1
=
z z Nn mF (z )

in which we recognize the Stieltjes transform mF of the l.s.d. of XH


N XN .
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 15/142

End of the proof

We have again the relation


n N n 1
m (z ) = mF (z ) +
N F N z
hence  1  1
XN XH
N zIN ' n
11 N 1 z zmF (z )
Note that the choice (1, 1) is irrelevant here, so the expression is valid for all pair (i, i ). Summing
over the N terms and averaging, we finally have
1  1 1
mF (z ) = tr XN XH
N zIN '
N c 1 z zmF (z )

which solve a polynomial of second order. Finally


p
c 1 1 (c 1 z )2 4z
mF (z ) = + .
2z 2 2z
From the inverse Stieltjes transform formula, we then verify that mF is the Stieltjes transform of
the Mar
cenko-Pastur law.
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 16/142

Related bibliography

I V. A. Mar
cenko, L. A. Pastur, Distributions of eigenvalues for some sets of random matrices, Math
USSR-Sbornik, vol. 1, no. 4, pp. 457-483, 1967.
I J. W. Silverstein, Z. D. Bai, On the empirical distribution of eigenvalues of a class of large dimensional
random matrices, Journal of Multivariate Analysis, vol. 54, no. 2, pp. 175-192, 1995.
I Z. D. Bai and J. W. Silverstein, Spectral analysis of large dimensional random matrices, 2nd Edition
Springer Series in Statistics, 2009.
I R. B. Dozier, J. W. Silverstein, On the empirical distribution of eigenvalues of large dimensional
information-plus-noise-type matrices, Journal of Multivariate Analysis, vol. 98, no. 4, pp. 678-694, 2007.
I V. L. Girko, Theory of Random Determinants, Kluwer, Dordrecht, 1990.
I A. M. Tulino, S. Verd`
u, Random matrix theory and wireless communications, Now Publishers Inc., 2004.
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 17/142

Asymptotic results involving Stieltjes transform

J. W. Silverstein, Z. D. Bai, On the empirical distribution of eigenvalues of a class of large


dimensional random matrices, Journal of Multivariate Analysis, vol. 54, no. 2, pp. 175-192,
1995.

Theorem 1
Let YN = 1 XN CN2 , where XN CnN has i.i.d entries of mean 0 and variance 1. Consider the
n
regime n, N + with Nn c. Let m N be the Stieltjes transform associated to XN XN . Then,
N mN 0 almost surely for all z C\R+ , where mN (z ) is the unique solution in the set
m
{z C+ , mN (z ) C+ } to:
1
ctdF CN
Z
m N (z ) = z
1 + tmN (z )

I in general, no explicit expression for F N , the distribution whose Stietljes transform is mN (z ).


I The theorem above characterizes also the Stieltjes transform of BN = XH
N XN denoted by mN ,

1
mN = cmN + (c 1)
z
This gives access to the spectrum of the sample covariance matrix model of x, when
1
yi = CN2 xi , xi i.i.d., CN = E [yyH ].
Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 18/142

0
Getting F from mF

I Remember that, for a < b real,


1
F 0 (x ) = lim =[mF (x + iy )]
y 0

where mF is (up to now) only defined on C+ .


I to plot the density F 0 ,
I first approach: span z = x + iy on the line {x R, y = } parallel but close to the real axis, solve
mF (z ) for each z, and plot =[mF (z )].
I refined approach: spectral analysis, to come next.

Example (Sample covariance matrix)


1 1
For N multiple of 3, let F C (x ) = 31 1x 61 + 13 1x 63 + 13 1x 6K and let BN = n1 CN2 ZH 2
N ZN CN with
F BN F , then
1
mF = cmF + (c 1)
z
 Z 1
t
mF (z ) = c dF C (t ) z
1 + tmF (z )

We take c = 1/10 and alternatively K = 7 and K = 4.


Part 1: Fundamentals of Random Matrix Theory/1.1. The Stieltjes Transform Method 19/142

Spectrum of the sample covariance matrix

Empirical eigenvalue distribution Empirical eigenvalue distribution


Limit law Limit law
0.6 0.6

0.4 0.4
Density

Density
0.2 0.2

0 0
1 3 7 1 3 4

Eigenvalues Eigenvalues

1 1
Figure: Histogram of the eigenvalues of BN = n1 CN2 ZH 2
N ZN CN , N = 3000, n = 300, with CN diagonal composed
of three evenly weighted masses in (i) 1, 3 and 7 on top, (ii) 1, 3 and 4 at bottom.
Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 21/142

Support of a distribution
The support of a density f is the closure of the set {x, f (x ) 6=
0}.
cenko-Pastur law is (1 c )2 , (1 + c )2 .

For instance the support of the mar
1.2

0.8
Density fc (x )

0.6

0.4

0.2

" #
Support of the Marchenko-Pastur law
0

0.2
0 0.5 1 1.5 2 2.5 3

Figure: Mar
cenko-Pastur law for different limit ratios c = 0.5.
Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 22/142

Extreme eigenvalues

I Limiting spectral results are insufficient to infer about the location of extreme eigenvalues.
P
I Example: Consider dFN (x ) = N1 N 0 N 1 1
k =1 ak . Then, dFN = N dFN + N AN (x ) and dFN with
AN > aN satisfy:
dFN dFN0 0.
I However, the supports of FN and FN0 differ by the mass AN .
Question: How is the behaviour of the extreme eigenvalues of random covariance matrices?
Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 23/142

No eigenvalue outside the support of sample covariance matrices

Z. D. Bai, J. W. Silverstein, No eigenvalues outside the support of the limiting spectral


distribution of large-dimensional sample covariance matrices, The Annals of Probability, vol. 26,
no.1 pp. 316-345, 1998.

Theorem
Let XN CN n with i.i.d. entries with zero mean, unit variance and infinite fourth order. Let
CN CN N be nonrandom and bounded in norm. Let mN be the unique solution in C+ of
 Z 1
N N N n 1
mN = z dF CN () , m N (z ) = m (z ) + , z C+ ,
n 1 + m N n N n z

Let FN be the distribution associated to the Stieltjes transform mN (z ). Consider


1 1
BN = n1 CN2 XN XH 2
N CN . We know that F
BN F converge weakly to zero. Choose N N and
N 0
[a, b ], a > 0, outside the support of FN for all N > N0 . Denote LN the set of eigenvalues of BN .
Then,
P (LN [a, b ] 6= i.o.) = 0.
Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 24/142

No eigenvalue outside the support: which models?

J. W. Silverstein, P. Debashis, No eigenvalues outside the support of the limiting empirical


spectral distribution of a separable covariance matrix, J. of Multivariate Analysis vol. 100, no. 1,
pp. 37-57, 2009.
I It has already been shown that (for all large N) there is no eigenvalues outside the support of
I cenko-Pastur law: XXH , X i.i.d. with zero mean, variance 1/N, finite 4th order moment.
Mar
1 1
I Sample covariance matrix: C 2 XXH C 2 and XH CX, X i.i.d. with zero mean, variance 1/N, finite 4th
order moment.
1 1
I Doubly-correlated matrix: R 2 XCXH R 2 , X with i.i.d. zero mean, variance 1/N, finite 4th order
moment.

J. W. Silverstein, Z.D. Bai, Y.Q. Yin, A note on the largest eigenvalue of a large dimensional
sample covariance matrix, Journal of Multivariate Analysis, vol. 26, no. 2, pp. 166-168, 1988.
I If 4th order moment is infinite,
H
lim sup XX
max =
N

J. Silverstein, Z. Bai, No eigenvalues outside the support of the limiting spectral distribution of
information-plus-noise type matrices to appear in Random Matrices: Theory and Applications.
I Only recently, information plus noise models, X with i.i.d. zero mean, variance 1/N, finite

4th order moment


(X + A)(X + A)H ,
and the generally correlation model where each column of X has correlation Ri .
Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 25/142

Extreme eigenvalues: Deeper into the spectrum

I In order to derive statistical detection tests, we need more information on the extreme
eigenvalues.
I We will study the fluctuations of the extreme eigenvalues (second order statistics)
I However, the Stieltjes transform method is not adapted here!
Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 26/142

Distribution of the largest eigenvalues of XXH

C. A. Tracy, H. Widom, On orthogonal and symplectic matrix ensembles, Communications in


Mathematical Physics, vol. 177, no. 3, pp. 727-754, 1996.
K. Johansson, Shape Fluctuations and Random Matrices, Comm. Math. Phys. vol. 209, pp.
437-476, 2000.

Theorem
Let X CN n have i.i.d. Gaussian entries of zero mean and variance 1/n. Denoting +
N the
largest eigenvalue of XXH , then
+ 2
2 (1 + c)
N3 N 4 1 X F
+ +
(1 + c ) 3 c 2

with c = limN N /n and F + the Tracy-Widom distribution given by


 Z 
F + (t ) = exp (x t )2 q 2 (x )dx
t

with q the Painlev


e II function that solves the differential equation
q 00 (x ) = xq (x ) + 2q 3 (x )
q (x ) x Ai(x )

in which Ai(x ) is the Airy function.


Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 27/142

The law of Tracy-Widom


0.5
Empirical Eigenvalues
Tracy-Widom law F +

0.4

0.3
Density

0.2

0.1

0
4 2 0 2

Centered-scaled largest eigenvalue of XXH

2 1 4
Figure: Distribution of N 3 c 2 (1 + c ) 3 + c )2 against the distribution of X + (distributed as
 
N (1 +
Tracy-Widom law) for N = 500, n = 1500, c = 1/3, for the covariance matrix model XXH . Empirical
distribution taken over 10, 000 Monte-Carlo simulations.
Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 28/142

Techniques of proof
Method of proof requires very different tools:
I orthogonal (Laguerre) polynomials: to write joint unordered eigenvalue distribution as a
kernel determinant.
p
N (1 , . . . , p ) = det KN (i , j )
i,j =1

with K (x, y ) the kernel Laguerre polynomial.


I Fredholm determinants: we can write hole probability as a Fredholm determinant.
  X (1)k Z Z k Y
P N 2/3 i (1 + c )2 A, i = 1, . . . , N = 1 +

det KN (xi , xj ) dxi
k! Ac Ac i,j =1
k >1

, det(IN KN ).

I kernel theory: show that KN converges to a Airy kernel.

Ai(x )Ai 0 (y ) Ai 0 (x )Ai(y )


KN (x, y ) KAiry (x, y ) = .
x y

I differential equation tricks: hole probability in [t, ) gives right-most eigenvalue distribution,
which is simplified as solution of a Painelve differential equation: the Tracy-Widom
distribution.
R 2
F + (t ) = e t (x t )q (x ) dx , q 00 = tq + 2q 3 , q (x ) x Ai(x ).
Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 29/142

Comments on the Tracy-Widom law

I deeper result than limit eigenvalue result


I gives a hint on convergence speed
I fairly biased on the left: even fewer eigenvalues outside the support.
I can be shown to hold for other distributions than Gaussian under mild assumptions
Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 31/142

Spiked models

I We consider n independent observations x1 , , xn of size N,


I The correlation structure is in general white + low rank,

E x1 xH1 = I + P
 

where P is of low rank,


I Objective: to infer the eigenvalues and/or the eigenvectors of P
Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 32/142

The first result

J. Baik, J. W. Silverstein, Eigenvalues of large sample covariance matrices of spiked population


models, Journal of Multivariate Analysis, vol. 97, no. 6, pp. 1382-1408, 2006.

Theorem 1 1
Let BN = n1 (I + P) 2 XN XH
N (I + P) , where XN C
2 N n has i.i.d., zero mean and unit variance

entries, and PN RN N with eigenvalues given by:

eig(P) = diag(1 , . . . , K , 0, . . . , . . . , 0)
| {z }
N K

with 1 > . . . > K > 1, c = limN N /n. Let 1 , , N be the eigenvalues of BN . We then
have
a.s. 1+
I if >
j c, j 1 + j + c j (i.e. beyond the Mar cenkoPastur bulk!)
j
a.s. 2
I if (0, c ], j (1 + c ) (i.e. right-edge of the Mar cenkoPastur bulk!)
j
a.s. 2
I if [ c, 0 ) , ( 1 c ) (i.e. left-edge of the Mar
cenkoPastur bulk!)
j j
I for the other eigenvalues, we discriminate over c:
a.s. 1+j
I if j < c, c < 1, j 1 + j + c j (i.e. beyond the Mar
cenkoPastur bulk!)
a.s.
I if j < c, c > 1, j (1 c )2 (i.e. left-edge of the Mar
cenkoPastur bulk!)
Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 33/142

Illustration of spiked models


Mar
cenko-Pastur law, c = 1/3
0.8 Empirical Eigenvalues

0.6
Density

0.4

0.2

0
1+1 1+2
1 + 1 + c 1 , 1 + 2 + c 2

Eigenvalues

1 1
Figure: Eigenvalues of BN = 1
n (P + I)
2 XN XN H (P + I) 2 , where 1 = 2 = 1 and 3 = 4 = 2 Dimensions:
N = 500, n = 1500.
Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 34/142

Interpretation of the result

I if c is large, or alternatively, if some population spikes are small, part to all of the
population spikes are attracted by the support!
I if so, no way to decide on the existence of the spikes from looking at the largest eigenvalues
I in signal processing words, signals might be missed using largest eigenvalues methods.
I as a consequence,
I the more the sensors (N),
I the larger c = lim N /n,
I the more probable we miss a spike
Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 35/142

Sketch of the proof


I We start with a study of the limiting extreme eigenvalues.
I Let x > 0, then

det(BN xIN ) = det(IN + P) det(XXH xIN + x [IN (IN + P)1 ])


= det(IN + P) det(XXH xIN )1 det(IN + xP(IN + P)1 (XXH xIN )1 ).


I if x eigenvalue of BN but not of XXH , then for n large, x > (1 + c )2 (edge of MP law
support) and

det(IN + xP(IN + P)1 (XXH xIN )1 ) = det(Ir + x (IN + )1 UH (XXH xIN )1 U) = 0

with P = UUH , U CN r .
I due to unitary invariance of X,
Z
a.s.
UH (XXH xIN )1 U (t x )1 dF MP (t )Ir , m(x )Ir

with F MP the MP law, and m(x ) the Stieltjes transform of the MP law (often known for
r = 1 as trace lemma).
1
I finally, we have that the limiting solutions xk satisfy xk m(xk ) + (1 + k )
k = 0.
I replacing m(x ), this is finally:
a.s. 1
k xk , 1 + k + c (1 + k ) k , if k > c
Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 36/142

Comments on the result

I there exists a phase



transition when the largest population eigenvalues move from inside to
outside (0, 1 + c ).

I more importantly, for t1 < 1 + c, we still have the same Tracy-Widom,
I no way to see the spike even when zooming in
I in fact, simulation suggests that convergence rate to the Tracy-Widom is slower with spikes.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 38/142

Stieltjes transform inversion for covariance matrix models

J. W. Silverstein, S. Choi, Analysis of the limiting spectral distribution of large dimensional


random matrices, Journal of Multivariate Analysis, vol. 54, no. 2, pp. 295-309, 1995.
1
I We know for the model CN2 XN , XN CN n that, if F CN F C , the Stieltjes transform of the
a.s.
e.s.d. of BN = n1 XH
N CN XN satisfies mBN (z ) mF (z ), with
 Z 1
t
mF (z ) = z c dF C (t )
1 + tmF (z )

which is unique on the set {z C+ , mF (z ) C+ }.


I This can be inverted into
Z
1 t
zF ( m ) = c dF C (t )
m 1 + tm

for m C+ .
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 39/142

Stieltjes transform inversion and spectrum characterization

Remember that we can evaluate the spectrum density by taking a complex line close to R and
evaluating =[mF (z )] along this line. Now we can do better.

It is shown that
lim mF (z ) = m0 (x ) exists.
z x R
z C+

We also have,
I for x0 inside the support, the density f (x ) of F in x0 is 1 =[m0 ] with m0 the unique solution
m C+ of Z
1 t
[zF (m) =] x0 = c dF C (t )
m 1 + tm
I let m0 R and xF the equivalent to zF on the real line. Then x0 outside the support of F
is equivalent to xF0 (mF (x0 )) > 0, mF (x0 ) 6= 0, 1/mF (x0 ) outside the support of F C .

This provides another way to determine the support!. For m (, 0), evaluate xF (m).
Whenever xF decreases, the image is outside the support. The rest is inside.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 40/142

Another way to determine the spectrum: spectrum to analyze


Empirical eigenvalue distribution
Limit law
0.6

0.4
Density

0.2

0
1 3 7

Eigenvalues

1 1
Figure: Histogram of the eigenvalues of BN = n1 CN2 XN XH 2
N CN , N = 300, n = 3000, with CN diagonal composed
of three evenly weighted masses in 1, 3 and 7.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 41/142

Another way to determine the spectrum: inverse function method


xF (m), m B
Support of F

7
xF ( m )

1 13 17 0

1 1
Figure: Stieltjes transform of BN = n1 CN2 XN XH 2
N CN , N = 300, n = 3000, with CN diagonal composed of three
evenly weighted masses in 1, 3 and 7. The support of F is read on the vertical axis, whenever mF is decreasing.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 42/142

Cluster boundaries in sample covariance matrix models

Xavier Mestre, Improved estimation of eigenvalues of covariance matrices and their associated
subspaces using their sample estimates, IEEE Transactions on Information Theory, vol. 54, no.
11, Nov. 2008.

Theorem
Let XN CN n have i.i.d. entries of zero mean, unit variance, and CN be diagonal such that
F CN F C , as n, N , N /n c, where F C has K masses in t1 , . . . , tK with multiplicity
1 1
N CN has support S given by
n1 , . . . , nK respectively. Then the l.s.d. of BN = n1 CN2 XN XH 2

S = [x1 , x1+ ] [x2 , x2+ ] . . . [xQ



, xQ+ ]

with xq = xF (mq ), xq+ = xF (mq+ ), and

1X
K
1 tk
xF (m) = c nk
m n 1 + tk m
k =1

with 2Q the number of real-valued solutions counting multiplicities of xF0 (m) = 0 denoted in
order m1 < m1+ 6 m2 < m2+ 6 . . . 6 mQ

< mQ+
.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 43/142

Comments on spectrum characterization

Previous results allows to determine


I the spectrum boundaries
I the number Q of clusters
I as a consequence, the total separation (Q = K ) or not (Q < K ) of the spectrum in K
clusters.

Mestre goes further: to determine local separability of the spectrum,


I identify the K inflexion points, i.e. the K solutions m1 , . . . , mK to

xF00 (m) = 0

I check whether xF0 (mi ) > 0 and xF0 (mi +1 ) > 0


I if so, the cluster in between corresponds to a single population eigenvalue.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 44/142

Exact eigenvalue separation

Z. D. Bai, J. W. Silverstein, Exact Separation of Eigenvalues of Large Dimensional Sample


Covariance Matrices, The Annals of Probability, vol. 27, no. 3, pp. 1536-1555, 1999.
I Recall that the result on no eigenvalue outside the support
I says where eigenvalues are not to be found
I does not say, as we feel, that (if cluster separation) in cluster k, there are exactly nk eigenvalues.
I This is in fact the case,

Empirical eigenvalue distribution


Limit law
0.6

0.4
Density

0.2

0
1 3 7
n1 n2 n3
Eigenvalues
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 45/142

Eigeninference: Introduction of the problem

I Reminder: for a sequence x1 , . . . , xn CN of independent random variables,

X
n
N = 1
C xk xH
n k
k =1

is an n-consistent estimator of CN = E [x1 xH


1 ].
I If n, N have comparable sizes, this no longer holds.
I Typically, n, N-consistent estimators of the full CN matrix perform very badly.
I If only the eigenvalues of CN are of interest, things can be done. The process of retrieving
information about eigenvalues, eigenspace projections, or functional of these is called
eigen-inference.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 46/142

Girko and the G -estimators

V. Girko, Ten years of general statistical analysis,


http://www.general-statistical-analysis.girko.freewebspace.com/chapter14.pdf
I Girko has come up with more than 50 N, n-consistent estimators, called G -estimators
(Generalized estimators). Among those, we find
I G1 -estimator of generalized variance. For
" #
N ) = 1 n(n 1)N
G1 ( C n log det(CN ) + log Q
(n N ) Nk =1 ( n k )

with n any sequence such that 2


n log(n/(n N )) 0, we have

N )
G1 (C 1
n log det(CN ) 0

in probability.
I However, Girkos proofs are rarely readable, if existent.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 47/142

A long standing problem

X. Mestre, Improved estimation of eigenvalues and eigenvectors of covariance matrices using


their sample estimates, IEEE trans. on Information Theory, vol. 54, no. 11, pp. 5113-5129,
2008.
1 1
I Consider the model BN = n1 CN2 XN XH 2
N CN , where F
CN is formed of a finite number of masses

t1 , . . . , tK .
I It has long been thought the inverse problem of estimating t1 , . . . , tK from the Stieltjes
transform method was not possible.
I Only trials were iterative convex optimization methods.
I The problem was partially solved by Mestre in 2008!
I His technique uses elegant complex analysis tools. The description of this technique is the
subject of this course.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 48/142

Reminders

1 1
I Consider the sample covariance matrix model BN = n1 CN2 XN XH 2
N CN .
I Up to now, we saw:
I that there is no eigenvalue outside the support with probability 1 for all large N.
I that for all large N, when the spectrum is divided into clusters, the number of empirical eigenvalues
in each cluster is exactly as we expect.
I these results are of crucial importance for the following.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 49/142

Eigen-inference for the sample covariance matrix model

X. Mestre, Improved estimation of eigenvalues and eigenvectors of covariance matrices using


their sample estimates, IEEE trans. on Information Theory, vol. 54, no. 11, pp. 5113-5129,
2008.

Theorem 1 1
Consider the model BN = n1 CN2 XN XH 2
N CN , with XN C
N n , i.i.d. with entries of zero mean, unit

variance, and CN RN N is diagonal with K distinct entries t1 , . . . , tK of multiplicity N1 , . . . , NK


of same order as n. Let k {1, . . . , K }. Then, if the cluster associated to tk is separated from the
clusters associated to k 1 and k + 1, as N, n , N /n c,
n X
tk = (m m )
Nk
mNk

P PK
is an N, n-consistent estimator of tk , where Nk = {N K i =k Ni + 1, . . . , N i =k +1 Ni },
1 , . . . , N are the eigenvalues of BN and 1 , . . . , N are the N solutions of

m XH C () = 0
N N XN

1
T
or equivalently, 1 , . . . , N are the eigenvalues of diag() N .
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 50/142

Remarks on Mestres result

Assuming cluster separation, the result consists in


I taking the empirical ordered i s inside the cluster (note that exact separation ensures there
are Nk of these!)
I getting the ordered eigenvalues 1 , . . . , N of
1 T
diag()
N
with = (1 , . . . , N )T . Keep only those of index inside Nk .
I take the difference and scale.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 51/142

How to obtain this result?

I Major trick requires tools from complex analysis


I Silversteins Stieltjes transform identity: for the conjugate model BN = n1 XH
N CN XN ,
 Z 1
t
mN (z ) = z c dF CN (t )
1 + tmN (z )

with mN the deterministic equivalent of mBN . This is the only random matrix result we need.
I Before going further, we need some reminders from complex analysis.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 52/142

Limiting spectrum of the sample covariance matrix

J. W. Silverstein, Z. D. Bai, On the empirical distribution of eigenvalues of a class of large


dimensional random matrices, J. of Multivariate Analysis, vol. 54, no. 2, pp. 175-192, 1995.
Reminder:
a.s.
I If F CN F C , then mBN (z ) mF (z ) such that
 Z 1
t
mF (z ) = c dF C (t ) z
1 + tmF (z )
or equivalently 
mF C 1/mF (z ) = zmF (z )mF (z )
with mF (z ) = cmF (z ) + (c 1) z1 and N /n c.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 53/142

Reminders of complex analysis

I Cauchy integration formula


Theorem
Let U C be an open set and f : U C be holomorphic on U. Let U be a continuous
contour (i.e. closed path). Then, for a inside the surface formed by , we have
I
1 f (z )
dz = f (a)
2i z a

while for a outside the surface formed by ,


I
1 f (z )
dz = 0.
2i z a
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 54/142

Complex integration
I From Cauchy integral formula, denoting Ck a contour enclosing only tk ,
I I I
1 X
K
1 1 N
tk = d = Nj d = mF C ()d .
2i Ck tk 2i Ck Nk tj 2iNk Ck
j =1

I After the variable change = 1/mF (z ),


I mF0 (z )
N 1
tk = zmF (z ) dz,
Nk 2i CF ,k mF2 (z )

I When the system dimensions are large,

1 X
N
1
mF (z ) ' mBN (z ) , , with (1 , . . . , N ) = eig(BN ) = eig(YYH ).
N k z
k =1

I Dominated convergence arguments then show


I mB0 (z )
a.s. N 1
tk tk 0 with tk = zmBN (z ) N
2 (z )
dz
Nk 2i CF ,k mB
N
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 55/142

Understanding the contour change


xF (m), m B
Support of F

7
xF ( m )

m2

m1
1
1/x2 1/x1
1 13 17 0

I IF CF ,k encloses cluster k with real points m1 < m2


I THEN 1/m1 = x1 < tk < x2 = 1/m2 and Ck encloses tk .
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 56/142

Poles and residues

I we find two sets of poles (outside zeros):


I 1 , . . . , N , the eigenvalues of BN .
I the solutions 1 , . . . , N to m N (z ) = 0.
I remember that
n nN 1
mBN (w ) = m (w ) +
N BN N w

  0 (w )
mB
residue calculus, denote f (w ) = n nN N
N wmBN (w ) + N ,
I
mB (w )2
N
I the k s are poles of order 1 and
n
lim (z k )f (z ) =
z k N k

I the k s are also poles of order 1 and by LHospitals rule


0
n (z k )zmBN (z ) n
lim (z k )f (z ) = lim =
z k z k N mBN (z ) N k

I So, finally
n X
tk = (m m )
Nk mcontour
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 57/142

Which poles in the contour?

I we now need to determine which poles are in the contour of interest.


I Since the i are rank-1 perturbations of the i , they have the interleaving property

1 < 2 < 2 < . . . < N < N

I what about 1 ? the trick is to use the fact that


I
1 1
dz = 0
2i Ck z

which leads to I
1 mF0 (w )
dw = 0
2i k mF (w )2
the empirical version of which is

#{i : i k } #{i : i k }

Since their difference tends to 0, there are as many k s as k s in the contour, hence 1 is
asymptotically in the integration contour.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 58/142

Related bibliography

I C. A. Tracy and H. Widom, On orthogonal and symplectic matrix ensembles, Communications in Mathematical Physics, vol. 177, no. 3, pp.
727-754, 1996.
I G. W. Anderson, A. Guionnet, O. Zeitouni, An introduction to random matrices, Cambridge studies in advanced mathematics, vol. 118, 2010.
I F. Bornemann, On the numerical evaluation of distributions in random matrix theory: A review, Markov Process. Relat. Fields, vol. 16, pp.
803-866, 2010.
I Y. Q. Yin, Z. D. Bai, P. R. Krishnaiah, On the limit of the largest eigenvalue of the large dimensional sample covariance matrix, Probability
Theory and Related Fields, vol. 78, no. 4, pp. 509-521, 1988.
I J. W. Silverstein, Z.D. Bai and Y.Q. Yin, A note on the largest eigenvalue of a large dimensional sample covariance matrix, Journal of Multivariate
Analysis, vol. 26, no. 2, pp. 166-168. 1988.
I C. A. Tracy, H. Widom, On orthogonal and symplectic matrix ensembles, Communications in Mathematical Physics, vol. 177, no. 3, pp. 727-754,
1996.
I Z. D. Bai, J. W. Silverstein, No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance
matrices, The Annals of Probability, vol. 26, no.1 pp. 316-345, 1998.
I Z. D. Bai, J. W. Silverstein, Exact Separation of Eigenvalues of Large Dimensional Sample Covariance Matrices, The Annals of Probability, vol.
27, no. 3, pp. 1536-1555, 1999.
I J. W. Silverstein, P. Debashis, No eigenvalues outside the support of the limiting empirical spectral distribution of a separable covariance matrix,
J. of Multivariate Analysis vol. 100, no. 1, pp. 37-57, 2009.
I J. W. Silverstein, J. Baik, Eigenvalues of large sample covariance matrices of spiked population models Journal of Multivariate Analysis, vol. 97,
no. 6, pp. 1382-1408, 2006.
I I. M. Johnstone, On the distribution of the largest eigenvalue in principal components analysis, Annals of Statistics, vol. 99, no. 2, pp. 295-327,
2001.
I K. Johansson, Shape Fluctuations and Random Matrices, Comm. Math. Phys. vol. 209, pp. 437-476, 2000.
I J. Baik, G. Ben Arous, S. P
ech
e, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, The Annals of
Probability, vol. 33, no. 5, pp. 1643-1697, 2005.
Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 59/142

Related bibliography (2)

I J. W. Silverstein, S. Choi, Analysis of the limiting spectral distribution of large dimensional random matrices, Journal of Multivariate Analysis, vol.
54, no. 2, pp. 295-309, 1995.
I W. Hachem, P. Loubaton, X. Mestre, J. Najim, P. Vallet, A Subspace Estimator for Fixed Rank Perturbations of Large Random Matrices, arxiv
preprint 1106.1497, 2011.
I R. Couillet, W. Hachem, Local failure detection and diagnosis in large sensor networks, (submitted to) IEEE Transactions on Information Theory,
arXiv preprint 1107.1409.
I F. Benaych-Georges, R. Rao, The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices, Advances in
Mathematics, vol. 227, no. 1, pp. 494-521, 2011.
I X. Mestre, On the asymptotic behavior of the sample estimates of eigenvalues and eigenvectors of covariance matrices, IEEE Transactions on
Signal Processing, vol. 56, no.11, 2008.
I X. Mestre, Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates, IEEE trans. on Information
Theory, vol. 54, no. 11, pp. 5113-5129, 2008.
I R. Couillet, J. W. Silverstein, Z. Bai, M. Debbah, Eigen-Inference for Energy Estimation of Multiple Sources, IEEE Transactions on Information
Theory, vol. 57, no. 4, pp. 2420-2439, 2011.
I P. Vallet, P. Loubaton and X. Mestre, Improved subspace estimation for multivariate observations of high dimension: the deterministic signals
case, arxiv preprint 1002.3234, 2010.
Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 62/142

Problem formulation

I We want to test the hypothesis H0 against H1 ,



hxT + W , information plus noise, hypothesis H1
CN n 3 Y =
W , pure noise, hpothesis H0

with h CN , x CN , W CN n .
I We assume no knowledge whatsoever but that W has i.i.d. (non-necessarily Gaussian)
entries.
Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 63/142

Exploiting the conditioning number

L. S. Cardoso, M. Debbah, P. Bianchi, J. Najim, Cooperative spectrum sensing using random


matrix theory, International Symposium on Wireless Pervasive Computing, pp. 334-338 , 2008.
I under either hypothesis,
I if H0 , for N large, we expect FYYH close to the Mar
cenko-Pastur law, of support
2 2
[2 1 c , 2 1 + c ].
q
I if H1 , if population spike more than 1 + N
n , largest eigenvalue is further away.
I the conditioning number of YYH is therefore asymptotically, as N, n , N /n c,
I if H0 ,
2
max 1 c
cond(Y) , 2
min 1+ c
I if H1 ,
2
ct1 1 c
cond(Y) t1 + > 2
t1 1 1+ c
PN
with t1 = k =1 |hk |2 + 2
I the conditioning number is independent of . We then have the decision criterion, whether
or not is known,


 q 2

1 N
n
H0 : if cond(YYH ) 6  q 2 +
decide

1 + N
n


H : otherwise. 1

for some security margin .


Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 64/142

Comments on the method

I Advantages:
I much simpler than finite size analysis
I ratio independent of , so needs not be known
I Drawbacks:
I only stands for very large N (dimension N for which asymptotic results arise function of !)
I ad-hoc method, does not rely on performance criterion.
Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 65/142

Generalized likelihood ratio test

P. Bianchi, M. Debbah, M. Maida, J. Najim, Performance of Statistical Tests for Source


Detection using Random Matrix Theory, IEEE Trans. on Information Theory, vol. 57, no. 4, pp.
2400-2419, 2011.

I Alternative generalized likelihood ratio test (GLRT) decision criterion, i.e.

sup2 ,h PY|h,2 (Y, h, 2 )


C (Y ) = .
sup2 PY|2 (Y|2 )

I Denote
max (YYH )
TN = 1 H
N tr YY
To guarantee a maximum false alarm ratio of ,
 (1N )n n  (1N )n
TN
decide H1 : if 1 N1 TN 1 N > N
H0 : otherwise.

for some threshold N that can be explicitly given as a function of .


I Optimal test with respect to GLR.
I Performs better than conditioning number test.
Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 66/142

Performance comparison for unknown 2 , P


0.7
Neyman-Pearson, Jeffreys prior
Neyman-Pearson, uniform prior
0.6 Conditioning number test
GLRT
0.5
Correct detection rate

0.4

0.3

0.2

0.1

0
0.1 0.5 1 2

False alarm rate 102

Figure: ROC curve for a priori unknown 2 of the Neyman-Pearson test, conditioning number method and
GLRT, K = 1, N = 4, M = 8, SNR = 0 dB. For the Neyman-Pearson test, both uniform and Jeffreys prior,
with exponent = 1, are provided.
Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 67/142

Related biography

I R. Couillet, M. Debbah, A Bayesian Framework for Collaborative Multi-Source Signal Sensing, IEEE Transactions on Signal Processing, vol. 58,
no. 10, pp. 5186-5195, 2010.
I T. Ratnarajah, R. Vaillancourt, M. Alvo, Eigenvalues and condition numbers of complex random matrices, SIAM Journal on Matrix Analysis and
Applications, vol. 26, no. 2, pp. 441-456, 2005.
I M. Matthaiou, M. R. McKay, P. J. Smith, J. A. Mossek, On the condition number distribution of complex Wishart matrices, IEEE Transactions on
Communications, vol. 58, no. 6, pp. 1705-1717, 2010.
I C. Zhong, M. R. McKay, T. Ratnarajah, K. Wong, Distribution of the Demmel condition number of Wishart matrices, IEEE Trans. on
Communications, vol. 59, no. 5, pp. 1309-1320, 2011.
I L. S. Cardoso, M. Debbah, P. Bianchi, J. Najim, Cooperative spectrum sensing using random matrix theory, International Symposium on Wireless
Pervasive Computing, pp. 334-338 , 2008.
I P. Bianchi, M. Debbah, M. Maida, J. Najim, Performance of Statistical Tests for Source Detection using Random Matrix Theory, IEEE Trans. on
Information Theory, vol. 57, no. 4, pp. 2400-2419, 2011.
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 69/142

Source localization

A uniform array of M antennas receives signal from K radio sources during n signal snapshots.
Objective: Estimate the arrival angles 1 , , K .

1
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 70/142

Source Localization using Music Algorithm


We consider the scenario of K sources and N antenna-array capturing n observations:

X
K
xt = a(k )sk,t + wt , t = 1, , n
k =1


1
e sin
I AN = [aN (1 ), , aN (K )] with aN () =

e (N 1) sin
I 2 is the noise variance and is set 1 for simplicity,
I Objective: infer 1 , , K from the n observations
I Let XN = [x1 , , xn ], then,
 
S
X = AS + W = [A IN ]
W

I If K is finite while n, N +, the model correponds to the spiked covariance model.


I MUSIC Algorithm: Let be the orthogonal projection matrix on the span of AA and
= IN (orthogonal projector on the noise subspace). Angles 1 , , K are the
unique ones verifying
() , aN () aN () = 0
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 71/142

Traditional MUSIC algorithm

I Traditional MUSIC algorithm: Angles are estimated as local minima of:

aN ()
aN ()

where is the orthogonal projection matrix on the eigenspace associated to the K largest
eigenvalues of n1 XN XN
I It is well-known that this estimator is consistent when n + with K , N fixed,
I We consider the case of K finite spiked covariance model
I What happens when n, N + ?
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 72/142

Asymptotic behaviour of the traditional MUSIC (1)

We first need to understand the spectrum of 1 H


n XX
I We know that the weak spectrum is the MP law
I Up to K eigenvalues can leave the support: we identify here these eigenvalues

Denote P = AAH = US UH T T T
S , = diag(1 , . . . , K ), and Z = [S W ] to recover (up to
one row) the generic spiked model
1
X = (IN + P) 2 Z.

1 H
I Reminder: If x eigenvalue of n XX with x > (1 + c )2 (edge of MP law), for all large n,
a.s. 1
x , k k , 1 + k + c (1 + k )
k , if k > c

for some k.
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 73/142

Asymptotic behaviour of the traditional MUSIC (2)


Recall the MUSIC approach: we want to estimate

() = a()H UW UH
W a() (UW CN (N K ) such that UH
W US = 0)

Instead of this quantity, we start with the study of

a()H u H
i ui a(), k = 1, . . . , K

1 , . . . , u
with u N the eigenvectors belonging to 1 > . . . > N .
To fall back on known RMT quantities, we use the Cauchy-integral:
I
1 1
a()H u H
i ui a() = a()H ( XXH zIN )1 a()dz
2 Ci n

with Ci a contour enclosing i only.


Woodburys identity (A + UCV )1 = A1 A1 U (C 1 + VA1 U )1 VA1 gives:
I I
1 1 ZZH 1 1
aH u H
i u i a= aH (IN + P) 2 ( zIN )1 (IN + P) 2 adz + aH
1H
b 1
a2 dz
2 Ci n 2 Ci

where P = US UH
S , and
1

H
b = IK + z (IK + )1 UH H
S ( n ZZ zIN )
1
US
1
aH
= za()H (IN + P) 2 ( n1 ZZH zIN )1 US


1
1

a2 = (IK + ) 1
UH 1
S ( n ZZ
H zIN )1 (IN + P) 2 a().
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 74/142

Asymptotic behaviour of the traditional MUSIC (3)


I For large n, the first term has no pole, while the second converges to

I
H = IK + zm(z )(IK + )1
1 H 1 1
Ti , a H a2 dz, with a1 = zm(z )a (IN + P) 2 US
H
2 Ci 1
21
a2 = m(z )(IK + ) UH 1
S (IN + P) a

which after development is

X
K I
1 1 zm2 (z )
Ti = 1+`
dz.
1 + ` 2
`=1 Ci ` + zm(z )

I Using residue calculus, the sole pole is in i and we find


2
a.s. 1 c
a()H u H
i ui a()
i
1
a()H ui uH
i a().
1 + c i

Therefore,

a.s. X
K
1 c 2
a() a()a()H
() = a()H i
a()H ui uH
i a()
1
i =1
1 + c
i
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 75/142

Improved G-MUSIC

Recall that:
1
1 + c a.s.
a()H uk uH
k a()
k
2
a()H u H
k uk a() 0
1 c k

The k are however unknown. But they can be estimated from


a.s. 1
k k = 1 + k + c (1 + k )
k

This gives finally

X
K

1 + c 1
G () ' a()H a()
k
2
a()H u H
k u k a()
k =1

1 c k

with

k (c + 1)
q
k =
+ (c + 1
k )2 4c )
2

We then obtain another (N, n)-consistent MUSIC estimator, only valid for K finite!
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 76/142

Simulation results
0

10
Cost function [dB]

15

20

25
35 37
MUSIC
G-MUSIC
30
angle [deg] -10 35 37

angle [deg]

Figure: MUSIC against G-MUSIC for DoA detection of K = 3 signal sources, N = 20 sensors, M = 150
samples, SNR of 10 dB. Angles of arrival of 10 , 35 , and 37 .
Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 77/142

Outline of the tutorial

I Part 1: Basics of Random Matrix Theory for Sample Covariance Matrices


I 1.1. Introduction to the Stieltjes transform method, Mar
cenkoPastur law, advanced models
I 1.2. Extreme eigenvalues: no eigenvalue outside the support, exact separation, TracyWidom law
I 1.3. Extreme eigenvalues: the spiked models
I 1.4. Spectrum analysis and G-estimation
I Part 2: Application to Signal Sensing and Array Processing
I 2.1. Eigenvalue-based detection
I 2.2. The (spiked) G-MUSIC algorithm
I Part 3: Advanced Random Matrix Models for Robust Estimation
I 3.1. Robust estimation of scatter
I 3.2. Robust G-MUSIC
I 3.3. Robust shrinkage in finance
I 3.4. Second order robust statistics: GLRT detectors
I Part 4: Future Directions
I 4.1. Kernel random matrices and kernel methods
I 4.2. Neural network applications
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 80/142

Covariance estimation and sample covariance matrices

P.J. Huber, Robust Statistics, 1981.


Many statistical inference techniques rely on the sample covariance matrix (SCM) taken
from i.i.d. observations x1 , . . . , xn of a r.v. x CN .
I The main reasons are:
I Assuming E [x ] = 0, E [xx ] = CN , with X = [x1 , . . . , xn ], by the LLN
1 a.s.
SN , XX CN as n .
n
Hence, if = f (CN ), we often use the n-consistent estimate = f (SN ).
I The SCM SN is the ML estimate of CN for Gaussian x
One therefore expects to closely approximate for all finite n.
I This approach however has two limitations:
I if N, n are of the same order of magnitude,
kSN CN k 6 0 as N, n , N /n c > 0, so that in general |
| 6 0

This motivated the introduction of G-estimators.


I if x is not Gaussian, but has heavier tails, SN is a poor estimator for CN .
This motivated the introduction of robust estimators.
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 81/142

Reminders on robust estimation

J. T. Kent, D. E. Tyler, Redescending M-estimates of multivariate location and scatter, 1991.


R. A. Maronna, Robust M-estimators of multivariate location and scatter, 1976.
Y. Chitour, F. Pascal, Exact maximum likelihood estimates for SIRV covariance matrix:
Existence and algorithm analysis, 2008.
The objectives of robust estimators:
I Replace the SCM SN by another estimate CN of CN which:
I rejects (or downscales) observations deterministically
I or rejects observations inconsistent with the full set of observations
Example: Huber estimator, CN defined as solution of
 
1X
n
k2
CN = i xi xi with i = min 1, 1 for some > 1, k 2 function of CN .
n 1
C
i =1 N xi N xi

I Provide scale-free estimators of CN :


Example: Tylers estimator: if one observes xi = i zi for unknown scalars i ,

1X
n
1
CN = 1 1
xi xi
n
i =1 N xi CN xi

I existence and uniqueness of CN defined up to a constant.


I few constraints on x1 , . . . , xn (N + 1 of them must be linearly independent)
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 82/142

Reminders on robust estimation

The objectives of robust estimators:


I replace the SCM SN by the ML estimate for CN .
Example: Maronnas estimator for elliptical x

1X
n  
1 1
CN = u xi CN xi xi xi
n N
i =1

with u (s ) such that


(i) u (s ) is continuous and non-increasing on [0, )
(ii) (s ) = su (s ) is non-decreasing, bounded by > 1. Moreover, (s ) increases where (s ) < .
(note that Hubers estimator is compliant with Maronnas estimators)
I existence is not too demanding
I uniqueness imposes strictly increasing u (s ) (inconsistent with Hubers estimate)
I consistency result: CN CN if u (s ) meets the ML estimator for CN .
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 83/142

Robust Estimation and RMT

So far, RMT has mostly focused on the SCM SN .


I x = AN w , w having i.i.d. zero-mean unit variance entries,
I x satisfies concentration inequalities, e.g. elliptically distributed x.

Robust RMT estimation


Can we study the performance of estimators based on the CN ?
N ?
I what are the spectral properties of C

I can we generate RMT-based estimators relying on CN ?


Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 84/142

Setting and assumptions


I Assumptions:
1
I Take x1 , . . . , xn CN elliptical-like random vectors, i.e. xi = i CN2 wi where
I , . . . , n R+ random or deterministic with 1
Pn a.s.
1 n i =1 i 1
I w , . . . , wn CN random independent with w / N uniformly distributed over the unit-sphere
1 i
I C CN N deterministic, with C  0 and lim sup kC k <
N N N N
I We denote cN , N /n and consider the growth regime cN c (0, 1).
I Maronnas estimator of scatter: (almost sure) unique solution to

1X
n  
1 1
CN = u xi CN xi xi xi
n N
i =1

where u satisfies
(i) u : [0, ) (0, ) nonnegative continuous and non-increasing
(ii) : x 7 xu (x ) increasing and bounded with limx (x ) , > 1
1
(iii) < c+ .
1 Pn
I Additional technical assumption: Let n , n i =1 i . For each a > b > 0, a.s.

lim supn n ((t, ))


lim sup = 0.
t (at ) (bt )

Controls relative speed of the tail of n versus the flattening speed of (x ) as x .


Examples:
I i < M for each i. In this case, n ((t, )) = 0 a.s. for t > M.
I For u (t ) = (1 + )/( + t ), > 0, and i i.i.d., it is sufficient to have E [11+ ] < .
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 85/142

Heuristic approach
I Major issues with CN :
I Defined implicitly q
I Sum of non-independent rank-one matrices from vectors u ( N1 xi CN1 xi )xi (CN depends on all xj s).
I But there is some hope:
I First remark: we can work with CN = IN without generality restriction!
I Denote
1X
n  
1 1
C(j ) = u x C x x x
n N i N i i i
i 6=j

Then intuitively, C(j ) and xj are only weakly dependent.


I We expect in particular (highly non-rigorous but intuitive!!):
1 1 1 1
x C x ' i tr C(i )1 ' i tr CN1 .
N i (i ) i N N
I Our heuristic approach:
I Rewrite N1 xi CN1 xi as f ( N1 xi C(i )1 xi ) for some function f (later called g 1 )
I Deduce that
1X
n  
1 1
CN = (u f ) xi C(i ) xi xi xi
n N
i =1
I Use 1
N xi C(i )1 xi ' i 1
N tr CN1 to get

1X
n  
1
CN ' (u f ) i tr CN1 xi xi
n N
i =1
I Use random matrix results to find a limiting value for 1
N tr CN1 , and conclude
1X
n
CN ' (u f )(i )xi xi .
n
i =1
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 86/142

Heuristic approach in detail: f and

I Determination of f : Recall the identity (A + tvv )1 v = A1 /(1 + tv A1 v ). Then


1 1
1 1 N xi C(i ) xi
xi CN xi =
N 1 + cN u ( N xi CN1 xi ) N1 xi C(i )1 xi
1

so that
1 1
1 1 N xi CN xi
xi C(i ) xi = .
N 1 cN ( N1 xi CN1 xi )
Now the function g : x 7 x /(1 cN (x )) is monotonous increasing (we use the assumption
< c 1 !), hence, with f = g 1 ,
 
1 1 1 1
xi CN xi = g 1 xi C(i ) xi .
N N
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 87/142

Heuristic approach in detail: f and


I Determination of : From previous calculus, we expect
1X 1X
n   n
1
CN ' (u g 1 ) i tr CN1 xi xi ' (u g 1 ) (i ) xi xi .
n N n
i =1 i =1
Hence !1
1X
n
1 1
' tr CN1 ' tr (u g 1 ) (i ) i wi wi .
N N n
i =1
Since i are independent of wi and deterministic, this is a Bai-Silverstein model
1
WDW , W = [w1 , . . . , wn ], D = diag(Dii ) = u g 1 (i ).
n
And we have:
1 Z !1
t (u g 1 )(t )

1 1
' tr WDW = m 1 WDW (0) ' 0+ N ( dt )
N n n 1 + c (u g 1 )(t )m 1 WDW (0)
n
!1
1X
n
i (u g 1 )(i )
= .
n 1 + c i (u g 1 )(i )m 1 WDW (0)
i =1 n

Since ' m 1 WDW (0), this defines as a solution of a fixed-point equation:


n
!1
1X
n
i (u g 1 )(i )
= .
n 1 + c i (u g 1 )(i )
i =1
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 88/142

Main result

R. Couillet, F. Pascal, J. W. Silverstein, The Random Matrix Regime of Maronnas M-estimator


with elliptically distributed samples, (submitted to) Elsevier Journal of Multivariate Analysis.
Theorem (Asymptotic Equivalence)
Under the assumptions defined earlier, we have

1X
n
a.s.
CN SN 0, where SN , v (i )xi xi
n
i =1

v (x ) = (u g 1 )(x ), (x ) = xv (x ), g (x ) = x /(1 c (x )) and > 0 unique solution of

1 X (i )
n
1= .
n 1 + c (i )
i =1

I Remarks:
I Th. says: first order substitution of CN by SN allowed for large N, n.
I It turns out that v u and in general behavior.
I Corollaries:
a.s.
max i (SN ) i (CN ) 0

16i 6n
1 1 a.s.
tr (CN zIN )1 tr (SN zIN )1 0
N N
Important feature for detection and estimation.
I Proof: So far in the tutorial, we do not have a rigorous proof!
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 89/142

Proof

Fundamental idea: Showing that all 1 1 1


i N xi C(i ) xi converge to the same limit .
I

I Technical trick: Denote  


v 1 1
N xi C(i ) xi
ei ,
v (i )
and relabel terms such that
e1 6 . . . 6 en
We shall prove that, for each ` > 0,

e1 > 1 ` i.o. and en < 1 + ` i.o.

Some basic inequalities: Denoting di , 1 1 1 1 1


i N xi C(i ) xi = N wi C ( i ) wi , we have
I

  P 1    P 1 
v j N1 wj n1 i 6=j i v (i di )wi wi wj v j N1 wj n1 i 6=j i v (i )ei wi wi wj
ej = =
v (j ) v (j )
  P 1  

 P 1 
v j N1 wj n1 i 6=j i v (i )en wi wi wj v enj 1
N wj
1
n i 6=j i v (i )wi wi wj
6 =
v (j ) v (j )
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 90/142

Proof
I Specialization to en :
  P 1 
n 1 1
v e n N wn n i 6=n i v (i )wi wi wn
en 6
v (n )
or equivalently, recalling (x ) = xv (x ),
 P 1
  P 1 
n 1 1
1 1 e n N wn i v (i )wi wi wn
N wn n i 6=n i v (i )wi wi wn n i 6=n
6 .
(n )

I Random Matrix results:


I By trace lemma, we should have
1 1
1 1 X 1 1 X
w i v (i )wi wi wn ' tr i v (i )wi wi '
N n n N n
i 6=n i 6=n

(by definition of as in previous slides). . .


I DANGER: by relabeling, wn no longer independent of w1 , . . . , wn1 !
Broken trace lemma!
I Solution: uniform convergence result.
By (higher order) moment bounds, Markov inequality, and Borel Cantelli, for all large n a.s.
1
X

1
1


max wj i v (i )wi wi wj < .

16j 6n N n
i 6=j
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 91/142

Proof
I Back to the original problem: For all large n a.s., we then have (using the growth of ψ)
$$\frac{\gamma-\varepsilon}{\gamma}\le\frac{\psi\left(\frac{\tau_n(\gamma+\varepsilon)}{e_n}\right)}{\psi(\tau_n\gamma)}.$$
I Proof by contradiction: Assume e_n > 1 + ℓ i.o.; then on a subsequence e_n > 1 + ℓ always and
$$\frac{\gamma-\varepsilon}{\gamma}\le\frac{\psi\left(\frac{\tau_n(\gamma+\varepsilon)}{1+\ell}\right)}{\psi(\tau_n\gamma)}.$$
I Bounded support for the τ_i: If 0 < τ_- < τ_i < τ_+ < ∞ for all i, n, then on a subsequence where τ_n → τ_0,
$$\underbrace{\frac{\gamma-\varepsilon}{\gamma}}_{\to\,1\ \text{as}\ \varepsilon\to0}\ \le\ \underbrace{\frac{\psi\left(\frac{\tau_0(\gamma+\varepsilon)}{1+\ell}\right)}{\psi(\tau_0\gamma)}}_{\to\,\frac{\psi(\tau_0\gamma/(1+\ell))}{\psi(\tau_0\gamma)}<1\ \text{as}\ \varepsilon\to0}\qquad\text{CONTRADICTION!}$$
I Unbounded support for the τ_i: importance of the relative growth of τ_n versus the convergence of e_n to 1.
The proof consists in dividing {τ_i} into two groups: a few large ones versus all the others.
Sufficient condition:
$$\limsup_{t\to\infty}\ \frac{\limsup_n \nu_n\big((t,\infty)\big)}{\psi(at)-\psi(bt)}=0.$$
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 92/142

Simulations
Figure: Histogram of the eigenvalues of $\frac1n\sum_{i=1}^n x_ix_i^*$ for n = 2500, N = 500, C_N = diag(I_125, 3 I_125, 10 I_250), τ_1, . . . , τ_n with Γ(.5, 2)-distribution.
Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 93/142

Simulations

Figure: Histogram of the eigenvalues of ĈN (left) and ŜN (right) for n = 2500, N = 500, C_N = diag(I_125, 3 I_125, 10 I_250), τ_1, . . . , τ_n with Γ(.5, 2)-distribution.

I Remark/Corollary: the spectrum of ĈN is a.s. bounded, uniformly in n.


Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 94/142

Hint on potential applications

I Spectrum boundedness: for impulsive noise scenarios,


I SCM spectrum grows unbounded
I robust scatter estimator spectrum remains bounded
Robust estimators improve spectrum separability (important for many statistical inference
techniques seen previously)
I Spiked model generalization: we may expect a generalization to spiked models
I spikes are swallowed by the bulk in SCM context
I we expect spikes to re-emerge in robust scatter context
We shall see that we get even better than this. . .
I Application scenarios:
I Radar detection in impulsive noise (non-Gaussian noise, possibly clutter)
I Financial data analytics
I Any application where Gaussianity is too strong an assumption. . .
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 96/142

System Setting
I Signal model:
$$y_i=\sum_{l=1}^L\sqrt{p_l}\,a_l\,s_{li}+\sqrt{\tau_i}\,w_i=A_i\tilde w_i$$
$$A_i\triangleq\big[\sqrt{p_1}a_1,\ \ldots,\ \sqrt{p_L}a_L,\ \sqrt{\tau_i}\,I_N\big],\qquad \tilde w_i\triangleq[s_{1i},\ldots,s_{Li},w_i^{\mathsf T}]^{\mathsf T},$$
with y_1, . . . , y_n ∈ C^N satisfying:
1. τ_1, . . . , τ_n > 0 random, such that $\nu_n\triangleq\frac1n\sum_{i=1}^n\delta_{\tau_i}\to\nu$ weakly and $\int t\,\nu(dt)=1$;
2. w_1, . . . , w_n ∈ C^N random, independent, unitarily invariant, with $\|w_i\|^2=N$;
3. L ≪ N, p_1 ≥ . . . ≥ p_L > 0 deterministic;
4. a_1, . . . , a_L ∈ C^N deterministic or random with $A^*A\xrightarrow{a.s.}\operatorname{diag}(p_1,\ldots,p_L)$ as N → ∞, where $A\triangleq[\sqrt{p_1}a_1,\ldots,\sqrt{p_L}a_L]\in\mathbb C^{N\times L}$;
5. s_{11}, . . . , s_{Ln} ∈ C independent with zero mean and unit variance.
I Relation to the previous model: If L = 0, y_i = √τ_i w_i.
Elliptical model whose covariance is a low-rank (L) perturbation of I_N.
We expect a spiked version of the previous results.
I Application contexts:
I wireless communications: signals s_{li} from L transmitters, N-antenna receiver; a_l random i.i.d. channels ($a_l^*a_{l'}\to\delta_{ll'}$, e.g. $a_l\sim\mathcal{CN}(0,I_N/N)$);
I array processing: L sources emit signals s_{li} at steering angles θ_l, a_l = a(θ_l). For a ULA, $[a(\theta)]_j=N^{-1/2}\exp(2\pi\imath\, d\, j\sin(\theta))$.
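For concreteness, the following hedged Python sketch draws one dataset from this spiked elliptical model with L = 2 unit-power ULA sources; the angles, powers and the heavy-tailed texture law are illustrative choices loosely inspired by the simulation figures that follow.

```python
import numpy as np

# Hedged data-generation sketch for y_i = sum_l sqrt(p_l) a_l s_li + sqrt(tau_i) w_i.
N, n, L, d = 200, 1000, 2, 0.5
p = np.array([1.0, 1.0])                                  # source powers p_1 = p_2 = 1
theta = np.deg2rad([10.0, 12.0])                          # source angles (illustrative)
rng = np.random.default_rng(0)

idx = np.arange(N).reshape(-1, 1)                         # ULA antenna index j = 0..N-1
A = np.exp(2j * np.pi * d * idx * np.sin(theta)) / np.sqrt(N)   # columns a(theta_l)

s = (rng.standard_normal((L, n)) + 1j * rng.standard_normal((L, n))) / np.sqrt(2)  # unit-variance symbols
tau = rng.standard_t(df=3, size=n) ** 2                   # heavy-tailed textures (illustrative law)
tau /= tau.mean()                                         # enforce the normalization E[tau] = 1
G = rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))
w = np.sqrt(N) * G / np.linalg.norm(G, axis=0)            # unitarily invariant, ||w_i||^2 = N

Y = A @ (np.sqrt(p)[:, None] * s) + np.sqrt(tau) * w      # one sample per column
print(Y.shape)                                            # (200, 1000)
```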
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 97/142

Some intuition

I Signal detection/estimation in impulsive environments: Two scenarios


I heavy-tailed noise (elliptical, Gaussian mixtures, etc.)
I Gaussian noise with spurious impulses
I Problems expected with SCM: Respectively,
I unbounded limiting spectrum, no source separation!
Invalidates G-MUSIC
I isolated eigenvalues due to spikes in time direction
False alarms induced by noise impulses!

I Our results: In a spiked model with noise impulses,
I whatever the noise impulse type, the spectrum of ĈN remains bounded
I isolated largest eigenvalues may appear, two classes:
I isolated eigenvalues due to noise impulses CANNOT exceed a threshold!
I all isolated eigenvalues beyond this threshold are due to signal
Detection criterion: everything above threshold is signal.
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 98/142

Theoretical results
Theorem (Extension to spiked robust model)
Under the same assumptions as in the previous section,
$$\left\|\hat C_N-\hat S_N\right\|\xrightarrow{a.s.}0$$
where
$$\hat S_N\triangleq\frac1n\sum_{i=1}^n v(\tau_i\gamma)\,A_i\tilde w_i\tilde w_i^*A_i^*$$
with γ the unique solution to
$$1=\int\frac{\psi(t\gamma)}{1+c\,\psi(t\gamma)}\,\nu(dt)$$
and we recall
$$A_i\triangleq\big[\sqrt{p_1}a_1,\ \ldots,\ \sqrt{p_L}a_L,\ \sqrt{\tau_i}\,I_N\big],\qquad \tilde w_i=[s_{1i},\ldots,s_{Li},w_i^{\mathsf T}]^{\mathsf T}.$$
I Remark: For L = 0, A_i reduces to √τ_i I_N.
We recover the previous result: A_i w̃_i becomes √τ_i w_i.
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 99/142

Localization of eigenvalues

Theorem (Eigenvalue localization)
Denote
I u_k the eigenvector of the k-th largest eigenvalue of $AA^*=\sum_{i=1}^L p_i a_ia_i^*$
I û_k the eigenvector of the k-th largest eigenvalue of ĈN.
Also define δ(x), the unique positive solution to
$$\delta(x)=c\left(-x+\int\frac{t\,v_c(t)}{1+\delta(x)\,t\,v_c(t)}\,\nu(dt)\right)^{-1}.$$
Further denote
$$p_-\triangleq\lim_{x\downarrow S_+}\,c\,\delta(x)\left(\int\frac{v_c(t)}{1+\delta(x)\,t\,v_c(t)}\,\nu(dt)\right)^{-1},\qquad S_+\triangleq\frac{(1+\sqrt c)^2}{1-c}.$$
Then, if p_j > p_-, $\hat\lambda_j\xrightarrow{a.s.}\Lambda_j>S_+$; otherwise $\limsup_n\hat\lambda_j\le S_+$ a.s., with Λ_j the unique positive solution to
$$c\,\delta(\Lambda_j)\left(\int\frac{v_c(\tau)}{1+\delta(\Lambda_j)\,\tau\,v_c(\tau)}\,\nu(d\tau)\right)^{-1}=p_j.$$
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 100/142

Simulation
Figure: Histogram of the eigenvalues of $\frac1n\sum_i y_iy_i^*$ against the limiting spectral measure, L = 2, p_1 = p_2 = 1, N = 200, n = 1000, Student-t impulses.
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 101/142

Simulation
[Plot: eigenvalues of ĈN and the limiting spectral density; the right edge of the support and the threshold S_+ are marked to the right of the bulk.]

Figure: Histogram of the eigenvalues of ĈN against the limiting spectral measure, for u(x) = (1+α)/(α+x) with α = 0.2, L = 2, p_1 = p_2 = 1, N = 200, n = 1000, Student-t impulses.
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 102/142

Comments

I SCM vs robust: Spikes invisible in SCM in impulsive noise, reborn in robust estimate of
scatter.
I Largest eigenvalues:
I λ_i(ĈN) > S_+ ⇒ presence of a source!
I λ_i(ĈN) ∈ (sup(Support), S_+) ⇒ may be due to a source or to a noise impulse.
I λ_i(ĈN) < sup(Support) ⇒ as usual, nothing can be said.
This induces a natural source detection algorithm.
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 103/142

Eigenvalue and eigenvector projection estimates


I Two scenarios:
I known $\nu=\lim_n\frac1n\sum_{i=1}^n\delta_{\tau_i}$
I unknown ν

Theorem (Estimation under known ν)
1. Power estimation. For each p_j > p_-,
$$c\,\delta(\hat\lambda_j)\left(\int\frac{v_c(\tau)}{1+\delta(\hat\lambda_j)\,\tau\,v_c(\tau)}\,\nu(d\tau)\right)^{-1}\xrightarrow{a.s.}p_j.$$
2. Bilinear form estimation. For each a, b ∈ C^N with ‖a‖ = ‖b‖ = 1, and p_j > p_-,
$$\sum_{k,\,p_k=p_j}a^*\hat u_k\hat u_k^*b-\sum_{k,\,p_k=p_j}w_k\,a^*u_ku_k^*b\xrightarrow{a.s.}0$$
where
$$w_k=\frac{\displaystyle\int\frac{v_c(t)}{\big(1+\delta(\hat\lambda_k)\,t\,v_c(t)\big)^2}\,\nu(dt)}{\displaystyle\int\frac{v_c(t)}{1+\delta(\hat\lambda_k)\,t\,v_c(t)}\,\nu(dt)\left(1-\frac1c\int\frac{\delta(\hat\lambda_k)^2\,t^2\,v_c(t)^2}{\big(1+\delta(\hat\lambda_k)\,t\,v_c(t)\big)^2}\,\nu(dt)\right)}.$$
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 104/142

Eigenvalue and eigenvector projection estimates


Theorem (Estimation under unknown ν)
1. Purely empirical power estimation. For each p_j > p_-,
$$\hat\delta(\hat\lambda_j)\left(\frac1N\sum_{i=1}^n\frac{v(\hat\tau_i\hat\gamma_n)}{1+\hat\delta(\hat\lambda_j)\,\hat\tau_i\,v(\hat\tau_i\hat\gamma_n)}\right)^{-1}\xrightarrow{a.s.}p_j.$$
2. Purely empirical bilinear form estimation. For each a, b ∈ C^N with ‖a‖ = ‖b‖ = 1, and each p_j > p_-,
$$\sum_{k,\,p_k=p_j}a^*\hat u_k\hat u_k^*b-\sum_{k,\,p_k=p_j}\hat w_k\,a^*u_ku_k^*b\xrightarrow{a.s.}0$$
where
$$\hat w_k=\frac{\displaystyle\frac1n\sum_{i=1}^n\frac{v(\hat\tau_i\hat\gamma_n)}{\big(1+\hat\delta(\hat\lambda_k)\,\hat\tau_i\,v(\hat\tau_i\hat\gamma_n)\big)^2}}{\displaystyle\frac1n\sum_{i=1}^n\frac{v(\hat\tau_i\hat\gamma_n)}{1+\hat\delta(\hat\lambda_k)\,\hat\tau_i\,v(\hat\tau_i\hat\gamma_n)}\left(1-\frac1N\sum_{i=1}^n\frac{\hat\delta(\hat\lambda_k)^2\,\hat\tau_i^2\,v(\hat\tau_i\hat\gamma_n)^2}{\big(1+\hat\delta(\hat\lambda_k)\,\hat\tau_i\,v(\hat\tau_i\hat\gamma_n)\big)^2}\right)},$$
in which γ̂_n and the τ̂_i are the natural empirical counterparts of γ and the τ_i, built from the quadratic forms $\frac1N y_i^*\hat C_{(i)}^{-1}y_i$, and δ̂(x) is defined as δ(x) but with (τ_i, ν) replaced by (τ̂_i, ν̂_n).
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 105/142

Application to G-MUSIC
I Assume the model a_l = a(θ_l) with
$$a(\theta)=N^{-\frac12}\big[\exp(2\pi\imath\,d\,j\sin(\theta))\big]_{j=0}^{N-1}.$$

Corollary (Robust G-MUSIC)
Define $\eta_{RG}(\theta)$ and $\hat\eta^{\rm emp}_{RG}(\theta)$ as
$$\eta_{RG}(\theta)=1-\sum_{k=1}^{|\{j,\,p_j>p_-\}|}w_k\,a(\theta)^*\hat u_k\hat u_k^*a(\theta)$$
$$\hat\eta^{\rm emp}_{RG}(\theta)=1-\sum_{k=1}^{|\{j,\,p_j>p_-\}|}\hat w_k\,a(\theta)^*\hat u_k\hat u_k^*a(\theta).$$
Then, for each p_j > p_-,
$$\hat\theta_j\xrightarrow{a.s.}\theta_j,\qquad\hat\theta_j^{\rm emp}\xrightarrow{a.s.}\theta_j$$
where
$$\hat\theta_j\triangleq\operatorname*{argmin}_{\theta\in\mathcal R_j}\{\eta_{RG}(\theta)\},\qquad\hat\theta_j^{\rm emp}\triangleq\operatorname*{argmin}_{\theta\in\mathcal R_j}\{\hat\eta^{\rm emp}_{RG}(\theta)\},$$
with $\mathcal R_j$ a small neighborhood of θ_j.
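In practice, once the isolated eigenvectors û_k of the robust scatter estimate and the weights ŵ_k of the previous theorem are available, the empirical robust G-MUSIC localization function is straightforward to evaluate. The sketch below (an illustration under stated assumptions, not the authors' code) scans a ULA steering grid and returns the deepest local minima; û_k and ŵ_k are assumed given.

```python
import numpy as np

def ula_steering(theta, N, d=0.5):
    """Columns a(theta) of a ULA, [a(theta)]_j = N^{-1/2} exp(2*pi*1j*d*j*sin(theta))."""
    idx = np.arange(N).reshape(-1, 1)
    return np.exp(2j * np.pi * d * idx * np.sin(theta)) / np.sqrt(N)

def robust_gmusic_angles(u_hat, w_hat, n_sources, d=0.5):
    """eta(theta) = 1 - sum_k w_hat[k] |a(theta)^* u_hat[:,k]|^2, minimized over a grid.

    u_hat: (N, K) isolated eigenvectors of the robust scatter estimate (assumed given);
    w_hat: (K,)  weights computed as in the previous theorem (assumed given)."""
    N = u_hat.shape[0]
    grid = np.deg2rad(np.linspace(-90.0, 90.0, 7201))
    A = ula_steering(grid, N, d)                      # (N, G) steering vectors on the grid
    proj = np.abs(u_hat.conj().T @ A) ** 2            # (K, G) values |u_k^* a(theta)|^2
    eta = 1.0 - w_hat @ proj                          # localization function on the grid
    inner = np.arange(1, grid.size - 1)
    is_min = (eta[inner] < eta[inner - 1]) & (eta[inner] < eta[inner + 1])
    cand = inner[is_min]                              # local minima of eta
    cand = cand[np.argsort(eta[cand])][:n_sources]    # keep the n_sources deepest ones
    return np.sort(np.rad2deg(grid[cand]))
```

Setting all weights to 1 in the same routine essentially gives the plain MUSIC cost built on the robust estimate's signal eigenvectors, which is how the localization functions of the next figures can be compared.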
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 106/142

Simulations: Single-shot in elliptical noise



Figure: Random realization of the localization functions for the various MUSIC estimators (robust G-MUSIC, empirical robust G-MUSIC, G-MUSIC, empirical G-MUSIC, robust MUSIC, MUSIC), with N = 20, n = 100, two sources at 10° and 12°, Student-t impulses with parameter 100, u(x) = (1+α)/(α+x) with α = 0.2. Powers p_1 = p_2 = 10^{0.5} = 5 dB.
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 107/142

Simulations: Elliptical noise



Figure: Mean square error performance of the estimation of θ_1 = 10°, with N = 20, n = 100, two sources at 10° and 12°, Student-t impulses with parameter 10, u(x) = (1+α)/(α+x) with α = 0.2, p_1 = p_2.
Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 108/142

Simulations: Spurious impulses



Figure: Mean square error performance of the estimation of θ_1 = 10°, with N = 20, n = 100, two sources at 10° and 12°, sample outlier scenario (τ_i = 1 for i < n, τ_n = 100), u(x) = (1+α)/(α+x) with α = 0.2, p_1 = p_2.
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 110/142

Context

Ledoit and Wolf, 2004. A well-conditioned estimator for large-dimensional covariance matrices.
Pascal, Chitour, Quek, 2013. Generalized robust shrinkage estimator Application to STAP data.
Chen, Wiesel, Hero, 2011. Robust shrinkage estimation of high-dimensional covariance matrices.

I Shrinkage covariance estimation: For N > n or N ≃ n, shrinkage estimator
$$(1-\rho)\frac1n\sum_{i=1}^nx_ix_i^*+\rho I_N,\qquad\text{for some }\rho\in[0,1].$$

I ρ > 0 allows for invertibility and better conditioning
I ρ may be chosen to minimize an expected error metric
I Limitation of Maronna's estimator:
I Maronna and Tyler estimators are limited to N < n, otherwise they do not exist
I introducing shrinkage in the robust estimator cannot do much harm anyhow...
I Introducing the robust-shrinkage estimator: The literature proposes two such estimators (a fixed-point computation sketch follows below):
$$\check C_N(\rho)=(1-\rho)\frac1n\sum_{i=1}^n\frac{x_ix_i^*}{\frac1N x_i^*\check C_N^{-1}(\rho)x_i}+\rho I_N,\qquad\rho\in\big(\max\{0,1-\tfrac nN\},1\big]\qquad\text{(Pascal)}$$
$$\hat C_N(\rho)=\frac{\hat B_N(\rho)}{\frac1N\operatorname{tr}\hat B_N(\rho)},\qquad\hat B_N(\rho)=(1-\rho)\frac1n\sum_{i=1}^n\frac{x_ix_i^*}{\frac1N x_i^*\hat C_N^{-1}(\rho)x_i}+\rho I_N,\qquad\rho\in(0,1]\qquad\text{(Chen)}$$
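Both estimators are defined only implicitly; in practice they are obtained by iterating their defining equation until convergence. The sketch below (illustrative, for Pascal's estimator; all data choices are assumptions) does exactly that.

```python
import numpy as np

def pascal_shrinkage(X, rho, n_iter=100, tol=1e-9):
    """Fixed-point iteration for the (Pascal) robust-shrinkage estimator:
       C <- (1-rho)*(1/n) sum_i x_i x_i^* / (x_i^* C^{-1} x_i / N) + rho*I_N."""
    N, n = X.shape
    C = np.eye(N, dtype=complex)
    for _ in range(n_iter):
        q = np.real(np.einsum('ij,ij->j', X.conj(), np.linalg.solve(C, X))) / N
        C_new = (1.0 - rho) * (X / q) @ X.conj().T / n + rho * np.eye(N)
        if np.linalg.norm(C_new - C) <= tol * np.linalg.norm(C):
            return C_new
        C = C_new
    return C

# illustrative usage in the regime N close to n, where rho > 0 matters
rng = np.random.default_rng(0)
N, n = 100, 120
tau = rng.gamma(0.5, 2.0, size=n)
X = np.sqrt(tau) * (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2)
C_hat = pascal_shrinkage(X, rho=0.5)
```

Chen's estimator is computed analogously, simply renormalizing by the trace at each iteration.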
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 111/142

Main theoretical result

I Which estimator is better?


Having asked to authors of both papers, their estimator was much better than the
other, but the arguments we received were quite vague...

I Our result: In the random matrix regime, both estimators tend to be one and the same!
I Assumptions: As before, elliptical-like model
$$x_i=\sqrt{\tau_i}\,C_N^{1/2}w_i.$$
This time, C_N cannot be taken equal to I_N (because of the +ρI_N term)!
Maronna-based shrinkage is possible but more involved...
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 112/142

Pascal's estimator

Theorem (Pascal's estimator)
For ε ∈ (0, min{1, c^{-1}}), define $R_\varepsilon=[\varepsilon+\max\{0,1-c^{-1}\},1]$. Then, as N, n → ∞ with N/n → c ∈ (0, ∞),
$$\sup_{\rho\in R_\varepsilon}\left\|\check C_N(\rho)-\check S_N(\rho)\right\|\xrightarrow{a.s.}0$$
where
$$\check C_N(\rho)=(1-\rho)\frac1n\sum_{i=1}^n\frac{x_ix_i^*}{\frac1N x_i^*\check C_N^{-1}(\rho)x_i}+\rho I_N$$
$$\check S_N(\rho)=\frac{1}{\gamma(\rho)}\,\frac{1-\rho}{1-(1-\rho)c}\,\frac1n\sum_{i=1}^nC_N^{1/2}w_iw_i^*C_N^{1/2}+\rho I_N$$
and γ(ρ) is the unique positive solution of the equation in γ
$$1=\frac1N\sum_{i=1}^N\frac{\lambda_i(C_N)}{\gamma\rho+(1-\rho)\lambda_i(C_N)}.$$
Moreover, ρ ↦ γ(ρ) is continuous on (0, 1].
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 113/142

Chen's estimator

Theorem (Chen's estimator)
For ε ∈ (0, 1), define $R_\varepsilon=[\varepsilon,1]$. Then, as N, n → ∞ with N/n → c ∈ (0, ∞),
$$\sup_{\rho\in R_\varepsilon}\left\|\hat C_N(\rho)-\hat S_N(\rho)\right\|\xrightarrow{a.s.}0$$
where
$$\hat C_N(\rho)=\frac{\hat B_N(\rho)}{\frac1N\operatorname{tr}\hat B_N(\rho)},\qquad\hat B_N(\rho)=(1-\rho)\frac1n\sum_{i=1}^n\frac{x_ix_i^*}{\frac1N x_i^*\hat C_N^{-1}(\rho)x_i}+\rho I_N$$
$$\hat S_N(\rho)=\frac{1-\rho}{1-\rho+T_\rho}\,\frac1n\sum_{i=1}^nC_N^{1/2}w_iw_i^*C_N^{1/2}+\frac{T_\rho}{1-\rho+T_\rho}\,I_N$$
in which $T_\rho=\rho\,\gamma(\rho)\,F(\gamma(\rho);\rho)$ with, for all x > 0,
$$F(x;\rho)=\frac12\big(\rho-c(1-\rho)\big)+\sqrt{\frac14\big(\rho-c(1-\rho)\big)^2+(1-\rho)\frac1x}$$
and γ(ρ) is the unique positive solution of the equation in γ
$$1=\frac1N\sum_{i=1}^N\frac{\lambda_i(C_N)}{\gamma\rho+\frac{1-\rho}{(1-\rho)c+F(\gamma;\rho)}\lambda_i(C_N)}.$$
Moreover, ρ ↦ γ(ρ) is continuous on (0, 1].
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 114/142

Asymptotic Model Equivalence

Theorem (Model Equivalence)
For each $\bar\rho\in(0,1]$, there exist unique $\check\rho\in(\max\{0,1-c^{-1}\},1]$ and $\hat\rho\in(0,1]$ such that
$$\frac{\check S_N(\check\rho)}{\check\rho+\frac{1-\check\rho}{\gamma(\check\rho)\big(1-(1-\check\rho)c\big)}}=\hat S_N(\hat\rho)=(1-\bar\rho)\frac1n\sum_{i=1}^nC_N^{1/2}w_iw_i^*C_N^{1/2}+\bar\rho I_N.$$
Besides, the maps $(0,1]\to(\max\{0,1-c^{-1}\},1]$, $\bar\rho\mapsto\check\rho$ and $(0,1]\to(0,1]$, $\bar\rho\mapsto\hat\rho$ are increasing and onto.

I Up to normalization, both estimators behave asymptotically the same!
I Both estimators behave the same as an impulse-free Ledoit-Wolf estimator
I About uniformity: uniformity over ρ in the theorems is essential to find optimal values of ρ.
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 115/142

Optimal Shrinkage parameter


I Chen sought a Frobenius-norm-minimizing ρ but got stuck on the implicit nature of ĈN(ρ)
I Our results allow for a simplification of the problem for large N, n!
I Model equivalence says only one problem needs to be solved.

Theorem (Optimal Shrinkage)
For each ρ ∈ (0, 1], define
$$D_N(\rho)=\frac1N\operatorname{tr}\left[\left(\frac{\check C_N(\rho)}{\frac1N\operatorname{tr}\check C_N(\rho)}-C_N\right)^2\right],\qquad\hat D_N(\rho)=\frac1N\operatorname{tr}\left[\big(\hat C_N(\rho)-C_N\big)^2\right].$$
Denote $D^\star=c\,\frac{M_2-1}{c+M_2-1}$, $\ \bar\rho^\star=\frac{c}{c+M_2-1}$, $\ M_2=\lim_N\frac1N\sum_{i=1}^N\lambda_i(C_N)^2$, and ρ̌^⋆, ρ̂^⋆ the unique solutions to
$$\frac{\check\rho^\star}{\check\rho^\star+\frac{1-\check\rho^\star}{\gamma(\check\rho^\star)\big(1-(1-\check\rho^\star)c\big)}}=\frac{T_{\hat\rho^\star}}{1-\hat\rho^\star+T_{\hat\rho^\star}}=\bar\rho^\star.$$
Then, letting ε be small enough,
$$\inf_{\rho\in R_\varepsilon}D_N(\rho)\xrightarrow{a.s.}D^\star,\qquad\inf_{\rho\in R_\varepsilon}\hat D_N(\rho)\xrightarrow{a.s.}D^\star$$
$$D_N(\check\rho^\star)\xrightarrow{a.s.}D^\star,\qquad\hat D_N(\hat\rho^\star)\xrightarrow{a.s.}D^\star.$$
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 116/142

Estimating ρ̌^⋆ and ρ̂^⋆
I The theorem is only useful if ρ̌^⋆ and ρ̂^⋆ can be estimated!
I Careful control of the proofs provides many ways to estimate them.
I The proposition below provides one example.

Optimal Shrinkage Estimate
Let $\check\rho_N\in(\max\{0,1-c_N^{-1}\},1]$ and $\hat\rho_N\in(0,1]$ be solutions (not necessarily unique) to the empirical equations that equate $\frac{\check\rho_N}{\frac1N\operatorname{tr}\check C_N(\check\rho_N)}$ and its Chen counterpart, respectively, to the purely data-driven quantity
$$\frac{c_N}{\frac1N\operatorname{tr}\left[\left(\frac1n\sum_{i=1}^n\frac{x_ix_i^*}{\frac1N\|x_i\|^2}\right)^2\right]-1},$$
both being defined arbitrarily when no such solutions exist. Then
$$\check\rho_N\xrightarrow{a.s.}\check\rho^\star,\qquad\hat\rho_N\xrightarrow{a.s.}\hat\rho^\star,$$
$$D_N(\check\rho_N)\xrightarrow{a.s.}D^\star,\qquad\hat D_N(\hat\rho_N)\xrightarrow{a.s.}D^\star.$$
(A numerical sketch of the data-driven quantity above is given below.)
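The right-hand side above is directly computable from the data. The sketch below evaluates it on synthetic samples (all model choices are illustrative assumptions) and compares it with ρ̄^⋆ = c/(c + M_2 − 1).

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, r = 32, 64, 0.7
C = r ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))   # [C_N]_ij = r^{|i-j|}
tau = rng.gamma(0.5, 2.0, size=n)                                # impulsive textures
X = np.linalg.cholesky(C) @ rng.standard_normal((N, n)) * np.sqrt(tau)

# sample covariance of the norm-normalized samples x_i / sqrt(||x_i||^2 / N)
Xn = X / np.sqrt(np.sum(X ** 2, axis=0) / N)
S = Xn @ Xn.T / n
rho_bar_hat = (N / n) / (np.trace(S @ S) / N - 1.0)              # data-driven estimate

M2 = np.trace(C @ C) / N              # here (1/N) tr C_N = 1, the usual normalization
print(rho_bar_hat, (N / n) / (N / n + M2 - 1.0))                 # estimate vs rho_bar*
```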
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 117/142

Simulations

Figure: Performance of the optimal shrinkage, averaged over 10 000 Monte Carlo simulations, for N = 32 and various values of n, [C_N]_{ij} = r^{|i-j|} with r = 0.7; ρ̂_N as above; ρ̂_O the clairvoyant estimator proposed in (Chen'11). The curves compare inf_{ρ∈(0,1]} D̂_N(ρ), D̂_N(ρ̂_N), D^⋆ and D̂_N(ρ̂_O) (normalized Frobenius norm).
Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 118/142

Simulations

Figure: Shrinkage parameters averaged over 10 000 Monte Carlo simulations, for N = 32 and various values of n, [C_N]_{ij} = r^{|i-j|} with r = 0.7; ρ̌_N and ρ̂_N as above; ρ̂_O the clairvoyant estimator proposed in (Chen'11); also shown are the oracle minimizers argmin_{ρ∈(max{0,1-c^{-1}},1]} D_N(ρ) and argmin_{ρ∈(0,1]} D̂_N(ρ).
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 120/142

Context
I Hypothesis testing problem: Two sets of data
I Initial pure-noise data: x_1, . . . , x_n, with $x_i=\sqrt{\tau_i}\,C_N^{1/2}w_i$ as before.
I New incoming data y given by:
$$y=\begin{cases}x,&H_0\\ \alpha p+x,&H_1\end{cases}$$
with $x=\sqrt{\tau}\,C_N^{1/2}w$, p ∈ C^N deterministic and known, α unknown.
I GLRT detection test:
$$T_N(\rho)\underset{H_0}{\overset{H_1}{\gtrless}}\Gamma$$
for some detection threshold Γ, where
$$T_N(\rho)\triangleq\frac{\left|y^*\check C_N^{-1}(\rho)p\right|}{\sqrt{y^*\check C_N^{-1}(\rho)y}\ \sqrt{p^*\check C_N^{-1}(\rho)p}}$$
and ČN(ρ) is the regularized robust estimator defined in the previous section (a short numerical sketch follows below).
In fact, the GLRT was originally derived with ČN(0), but
I this is only valid for N < n
I introducing ρ may bring improved performance for arbitrary N/n ratios.
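As a quick illustration, the statistic itself is a one-liner once the regularized scatter estimate is available. The sketch below assumes C has already been computed (for instance by the fixed-point iteration sketched in the previous section).

```python
import numpy as np

def glrt_statistic(y, p, C):
    """T_N(rho) = |y^* C^{-1} p| / sqrt( (y^* C^{-1} y) (p^* C^{-1} p) )."""
    Ci_y = np.linalg.solve(C, y)
    Ci_p = np.linalg.solve(C, p)
    num = np.abs(np.vdot(y, Ci_p))                                     # |y^* C^{-1} p|
    den = np.sqrt(np.real(np.vdot(y, Ci_y)) * np.real(np.vdot(p, Ci_p)))
    return num / den

# decide H1 whenever sqrt(N) * glrt_statistic(y, p, C) exceeds a threshold
# calibrated from the false-alarm analysis of the next slides
```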
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 121/142

Objectives and main results

I Initial observations:
I As N, n → ∞ with N/n → c > 0, under H_0,
$$T_N(\rho)\xrightarrow{a.s.}0.$$
A trivial result of little interest!
I Natural question: for finite N, n and a given threshold Γ, find ρ such that
$$P\big(T_N(\rho)>\Gamma\big)=\min$$
I It turns out the correct non-trivial object is, for Γ > 0 fixed,
$$P\big(\sqrt N\,T_N(\rho)>\Gamma\big)=\min$$
I Objectives:
I for each ρ, develop a central limit theorem to evaluate
$$\lim_{\substack{N,n\to\infty\\ N/n\to c}}P\big(\sqrt N\,T_N(\rho)>\Gamma\big)$$
I determine the ρ minimizing this limit
I empirically estimate this minimizing ρ
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 122/142

What do we need?

CLT for ČN statistics
I We know that $\|\check C_N(\rho)-\check S_N(\rho)\|\xrightarrow{a.s.}0$
Key result so far!
I What about $\|\sqrt N\,(\check C_N(\rho)-\check S_N(\rho))\|$?
It does not converge to zero!!!
I But there is hope. . . :
$$\sqrt N\,\big(a^*\check C_N^{-1}(\rho)b-a^*\check S_N^{-1}(\rho)b\big)\xrightarrow{a.s.}0$$
This is our main result!
I This requires a much more delicate treatment, not discussed in this tutorial.
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 123/142

Main results

Theorem (Fluctuation of bilinear forms)
Let a, b ∈ C^N with ‖a‖ = ‖b‖ = 1. Then, as N, n → ∞ with N/n → c > 0, for any ε > 0 and every k ∈ Z,
$$\sup_{\rho\in R_\varepsilon}N^{1-\varepsilon}\left|a^*\check C_N^k(\rho)b-a^*\check S_N^k(\rho)b\right|\xrightarrow{a.s.}0$$
where $R_\varepsilon=[\varepsilon+\max\{0,1-1/c\},1]$.


Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 124/142

False alarm performance

Theorem (Asymptotic detector performance)
As N, n → ∞ with N/n → c ∈ (0, ∞),
$$\sup_{\rho\in R_\varepsilon}\left|P\big(\sqrt N\,T_N(\rho)>\Gamma\big)-\exp\left(-\frac{\Gamma^2}{2\sigma_N^2(\bar\rho)}\right)\right|\to0$$
where ρ ↦ ρ̄ is the aforementioned mapping and
$$\sigma_N^2(\bar\rho)\triangleq\frac12\,\frac{p^*C_NQ_N^2(\bar\rho)p}{p^*Q_N(\bar\rho)p\cdot\frac1N\operatorname{tr}\big(C_NQ_N(\bar\rho)\big)}\cdot\frac{1}{1-c(1-\bar\rho)^2m(-\bar\rho)^2\,\frac1N\operatorname{tr}\big(C_N^2Q_N^2(\bar\rho)\big)}$$
with $Q_N(\bar\rho)\triangleq\big(\bar\rho\,I_N+(1-\bar\rho)\,m(-\bar\rho)\,C_N\big)^{-1}$.
I Limiting Rayleigh distribution:
weak convergence of $\sqrt N\,T_N(\rho)$ to a Rayleigh variable $R_N(\bar\rho)$.
I Remark: σ_N(ρ̄) depends on ρ but not on Γ.
There exists a uniformly optimal ρ!
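A practical consequence of the Rayleigh-type limit: once σ²_N(ρ̄) (or its empirical estimate introduced below) is available, the threshold meeting a prescribed false-alarm rate follows in closed form. A minimal sketch, with an illustrative σ² value:

```python
import numpy as np

def threshold_for_false_alarm(sigma2, eta):
    """Gamma such that exp(-Gamma^2 / (2*sigma2)) = eta."""
    return np.sqrt(-2.0 * sigma2 * np.log(eta))

def asymptotic_false_alarm(sigma2, gamma):
    return np.exp(-gamma ** 2 / (2.0 * sigma2))

gamma = threshold_for_false_alarm(sigma2=0.5, eta=1e-3)     # sigma2 = 0.5 is illustrative
print(gamma, asymptotic_false_alarm(0.5, gamma))            # recovers eta = 1e-3
```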
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 125/142

Simulation


Figure: Histogram and distribution function of $\sqrt N\,T_N(\rho)$ versus $R_N(\bar\rho)$, N = 20, $p=N^{-1/2}[1,\ldots,1]^{\mathsf T}$, C_N Toeplitz from an AR(1) model with coefficient 0.7, c_N = 1/2, ρ = 0.2.
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 126/142

Simulation


Figure: Histogram and distribution function of $\sqrt N\,T_N(\rho)$ versus $R_N(\bar\rho)$, N = 100, $p=N^{-1/2}[1,\ldots,1]^{\mathsf T}$, C_N Toeplitz from an AR(1) model with coefficient 0.7, c_N = 1/2, ρ = 0.2.
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 127/142

Empirical estimation of the optimal ρ

I The optimal ρ can be found by line search. . . but C_N is unknown!
I We shall successively:
I empirically estimate σ_N(ρ̄)
I minimize the estimate
I prove, by uniformity, the asymptotic optimality of the estimate

Theorem (Empirical performance estimation)
For ρ ∈ (max{0, 1 - c_N^{-1}}, 1), let $\hat\sigma_N^2(\rho)$ be a purely empirical estimate of $\sigma_N^2(\bar\rho)$, an explicit function of observable quantities such as $p^*\check C_N^{-1}(\rho)p$, $\frac1N\operatorname{tr}\check C_N(\rho)$ and $\frac1N\operatorname{tr}\check C_N^{-1}(\rho)$ (no unknown C_N is involved). Also let $\hat\sigma_N^2(1)\triangleq\lim_{\rho\uparrow1}\hat\sigma_N^2(\rho)$. Then
$$\sup_{\rho\in R_\varepsilon}\left|\hat\sigma_N^2(\rho)-\sigma_N^2(\bar\rho)\right|\xrightarrow{a.s.}0.$$
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 128/142

Final result

Theorem (Optimality of empirical estimator)
Define
$$\hat\rho_N^\star=\operatorname*{argmin}_{\rho\in R_\varepsilon}\left\{\hat\sigma_N^2(\rho)\right\}.$$
Then, for every Γ > 0,
$$P\big(\sqrt N\,T_N(\hat\rho_N^\star)>\Gamma\big)-\inf_{\rho\in R_\varepsilon}P\big(\sqrt N\,T_N(\rho)>\Gamma\big)\to0.$$
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 129/142

Simulations

Figure: False alarm rate $P(\sqrt N\,T_N(\rho)>\Gamma)$ as a function of ρ, for Γ = 2 and Γ = 3, N = 20, $p=N^{-1/2}[1,\ldots,1]^{\mathsf T}$, C_N Toeplitz from an AR(1) model with coefficient 0.7, c_N = 1/2.
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 130/142

Simulations

Figure: False alarm rate $P(\sqrt N\,T_N(\rho)>\Gamma)$ as a function of ρ, for Γ = 2 and Γ = 3, N = 100, $p=N^{-1/2}[1,\ldots,1]^{\mathsf T}$, C_N Toeplitz from an AR(1) model with coefficient 0.7, c_N = 1/2.
Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 131/142

Simulations

Figure: False alarm rate $P(\sqrt N\,T_N(\rho)>\Gamma)$ as a function of ρ for N = 20 and N = 100, $p=N^{-1/2}[1,\ldots,1]^{\mathsf T}$, $[C_N]_{ij}=0.7^{|i-j|}$, c_N = 1/2.
Future Directions/4.1 Kernel matrices and kernel methods 134/142

Motivation: Spectral Clustering

N. El Karoui, The spectrum of kernel random matrices, The Annals of Statistics, 38(1):1-50, 2010.

I Objective: Clustering data x_1, . . . , x_n ∈ C^N into k similarity classes
I a classical machine learning problem, brought here to big data!
I assumes a similarity function, e.g. the Gaussian kernel
$$f(x_i,x_j)=\exp\left(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\right)$$
I which naturally brings the kernel matrix:
$$W=[W_{ij}]_{1\le i,j\le n}=[f(x_i,x_j)]_{1\le i,j\le n}.$$
I Letting x_1, . . . , x_n be random leads naturally to studying kernel random matrices.
I Little is known on such random matrices, but for x_i i.i.d. with zero mean and covariance I_N:
$$\left\|W-\alpha\,11^{\mathsf T}-\frac{\beta}{n}\,X^*X\right\|\xrightarrow{a.s.}0,\qquad X=[x_1,\ldots,x_n],$$
for some α, β depending on f and its derivatives.
Basically, to first order W is equivalent to a rank-one matrix.
Future Directions/4.1 Kernel matrices and kernel methods 135/142

Motivation: Spectral Clustering

I Clustering x_1, . . . , x_n into k classes is often written as:
$$\text{(RatioCut)}\qquad\min_{\substack{S_1\cup\ldots\cup S_k=S\\ i\neq j,\ S_i\cap S_j=\emptyset}}\ \sum_{i=1}^k\frac{1}{|S_i|}\sum_{j\in S_i,\ \bar j\in S_i^c}f(x_j,x_{\bar j}).$$
But this is difficult to solve, NP hard!
I It can be equivalently rewritten
$$\text{(RatioCut)}\qquad\min_{M\in\mathcal M,\ M^{\mathsf T}M=I_k}\operatorname{tr}\big(M^{\mathsf T}LM\big)$$
where $\mathcal M=\{M=[m_{ij}]_{1\le i\le n,\,1\le j\le k},\ m_{ij}=|S_j|^{-\frac12}\,\delta_{x_i\in S_j}\}$ and
$$L=[L_{ij}]_{1\le i,j\le n}=[-W+\operatorname{diag}(W1)]_{1\le i,j\le n}=\left[-f(x_i,x_j)+\delta_{i,j}\sum_{l=1}^nf(x_i,x_l)\right]_{1\le i,j\le n}.$$
I Relaxing the constraint M ∈ M to orthogonality only leads to a simple eigenvalue/eigenvector problem:
Spectral clustering (a short numerical sketch follows below).
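The relaxed problem is solved by the eigenvectors of L associated with its k smallest eigenvalues, whose rows are then clustered (typically by k-means). A self-contained sketch, with an illustrative two-cluster toy dataset:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, k, sigma):
    """RatioCut relaxation: Gaussian kernel W, Laplacian L = diag(W 1) - W,
    k eigenvectors of L with smallest eigenvalues, then k-means on their rows."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T         # pairwise squared distances
    W = np.exp(-d2 / (2.0 * sigma ** 2))                    # kernel matrix f(x_i, x_j)
    L = np.diag(W.sum(axis=1)) - W                          # graph Laplacian
    _, U = np.linalg.eigh(L)                                # eigenvalues in ascending order
    M = U[:, :k]                                            # relaxed RatioCut solution
    _, labels = kmeans2(M, k, minit='points')
    return labels

# toy example (illustrative): two Gaussian clouds in dimension N = 50
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((100, 50)),
               3.0 + rng.standard_normal((100, 50))])
print(np.bincount(spectral_clustering(X, k=2, sigma=5.0)))  # expect two groups of 100
```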
Future Directions/4.1 Kernel matrices and kernel methods 136/142

Objectives

I Generalization to k distributions for x_1, . . . , x_n should lead to asymptotically rank-k W matrices.
I If this is established, specific choices of kernels known to work well would be better understood.
I Eventually, find optimal choices of kernels.
Future Directions/4.2 Neural networks 138/142

Echo-state neural networks


I Neural network:
I input neuron signal s_t ∈ R (could be multivariate)
I output neuron signal y_t ∈ R (could be multivariate)
I N neurons with
I state x_t ∈ R^N at time t
I connectivity matrix W ∈ R^{N×N}
I connectivity vector to the input w_I ∈ R^N
I connectivity vector to the output w_O ∈ R^N
I State evolution: x_0 = 0 (say) and
$$x_{t+1}=S(Wx_t+w_Is_t)$$
with S an entry-wise sigmoid function.
I Output observation:
$$y_t=w_O^{\mathsf T}x_t.$$

I Classical neural networks:
I Learning phase: input-output data (s_t, y_t) are used to learn W, w_O, w_I (via e.g. LS)
I Interpolation phase: W, w_O, w_I fixed, we observe the output y_t from new data s_t.
This poses overfitting problems, is difficult to set up, and demands lots of learning data.

I Echo-state neural networks: To solve these problems,
I W and w_I are set to be random, no longer learned
I only w_O is learned
This reduces the amount of data needed for learning and shows striking performance in some scenarios.
Future Directions/4.2 Neural networks 139/142

ESN and random matrices

I W, w_I being random, the performance study involves random matrices.
Stability, chaos regime, etc., involve the extreme eigenvalues of W
I the main difficulty is the non-linearity caused by S
I Performance measures:
I MSE on the training data
I MSE on the interpolated data
Optimization to be performed over the regression method, e.g.
$$w_O=\big(X_{\rm train}X_{\rm train}^{\mathsf T}+\lambda I_N\big)^{-1}X_{\rm train}\,y_{\rm train}$$
with $X_{\rm train}=[x_1,\ldots,x_T]$, $y_{\rm train}=[y_1,\ldots,y_T]^{\mathsf T}$, T the training period (a short numerical sketch follows below).
I In a first approximation: S = Id.
The MSE performance with stationary inputs leads to studying
$$\sum_{j=1}^\infty W^jw_Iw_I^{\mathsf T}(W^{\mathsf T})^j.$$
A new random matrix model, which can nonetheless be analyzed with the usual tools.
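To make the setting concrete, the sketch below simulates a small echo-state network with S = tanh and learns the readout w_O by the ridge regression displayed above; the reservoir size, spectral radius, task and ridge parameter are illustrative choices.

```python
import numpy as np

N, T, ridge = 400, 1000, 1e-2
rng = np.random.default_rng(0)

W = rng.standard_normal((N, N)) / np.sqrt(N)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))      # keep spectral radius < 1 (stability)
w_in = rng.standard_normal(N)

s = np.sin(0.1 * np.arange(T + 1))                   # toy input signal s_t
y_target = s[1:]                                     # task: predict s_{t+1} (illustrative)

# state recursion x_{t+1} = S(W x_t + w_in s_t), with S = tanh
X = np.zeros((N, T))
x = np.zeros(N)
for t in range(T):
    x = np.tanh(W @ x + w_in * s[t])
    X[:, t] = x

# ridge-regression readout  w_O = (X X^T + ridge * I_N)^{-1} X y
w_out = np.linalg.solve(X @ X.T + ridge * np.eye(N), X @ y_target)
print("train MSE:", np.mean((w_out @ X - y_target) ** 2))
```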
Future Directions/4.2 Neural networks 140/142

Related bibliography

I J. T. Kent, D. E. Tyler, Redescending M-estimates of multivariate location and scatter, 1991.


I R. A. Maronna, Robust M-estimators of multivariate location and scatter, 1976.
I Y. Chitour, F. Pascal, Exact maximum likelihood estimates for SIRV covariance matrix: Existence and algorithm analysis, 2008.
I N. El Karoui, Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond,
2009.
I R. Couillet, F. Pascal, J. W. Silverstein, Robust M-Estimation for Array Processing: A Random Matrix Approach, 2012.
I J. Vinogradova, R. Couillet, W. Hachem, Statistical Inference in Large Antenna Arrays under Unknown Noise Pattern, (submitted to) IEEE
Transactions on Signal Processing, 2012.
I F. Chapon, R. Couillet, W. Hachem, X. Mestre, On the isolated eigenvalues of large Gram random matrices with a fixed rank deformation,
(submitted to) Electronic Journal of Probability, 2012, arXiv Preprint 1207.0471.
I R. Couillet, M. Debbah, Signal Processing in Large Systems: a New Paradigm, IEEE Signal Processing Magazine, vol. 30, no. 1, pp. 24-39, 2013.
I P. Loubaton, P. Vallet, Almost sure localization of the eigenvalues in a Gaussian information plus noise model. Application to the spiked models,
Electronic Journal of Probability, 2011.
I P. Vallet, W. Hachem, P. Loubaton, X. Mestre, J. Najim, On the consistency of the G-MUSIC DOA estimator. IEEE Statistical Signal Processing
Workshop (SSP), 2011.
Future Directions/4.2 Neural networks 141/142

To know more about all this

Our webpages:
I http://couillet.romain.perso.sfr.fr

I http://sri-uq.kaust.edu.sa/Pages/KammounAbla.aspx
Future Directions/4.2 Neural networks 142/142

Spraed it!

To download this presentation (PDF format):


I Log in to your Spraed account (www.spraed.net)
I Scan this QR code.
