
A SHORT COURSE ON

ROBUST STATISTICS
David E. Tyler
Rutgers
The State University of New Jersey
Web-Site
www.rci.rutgers.edu/~dtyler/ShortCourse.pdf
References
Huber, P.J. (1981). Robust Statistics. Wiley, New York.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986).
Robust Statistics: The Approach Based on Influence Functions.
Wiley, New York.
Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006).
Robust Statistics: Theory and Methods. Wiley, New York.
PART 1
CONCEPTS AND BASIC METHODS
MOTIVATION
Data Set: $X_1, X_2, \ldots, X_n$
Parametric Model: $F(x_1, \ldots, x_n \mid \theta)$
$\theta$: Unknown parameter
$F$: Known function
e.g. $X_1, X_2, \ldots, X_n$ i.i.d. Normal$(\mu, \sigma^2)$
Q: Is it realistic to believe we don't know $(\mu, \sigma^2)$, but we know, e.g., the shape of the
tails of the distribution?
A: The model is assumed to be approximately true, e.g. symmetric and unimodal
(past experience).
Q: Are statistical methods which are good under the model reasonably good if the
model is only approximately true?
ROBUST STATISTICS
Formally addresses this issue.
CLASSIC EXAMPLE: MEAN vs. MEDIAN
Symmetric distributions: $\theta$ = population mean = population median
Sample mean: $\bar X \approx \text{Normal}(\theta, \sigma^2/n)$
Sample median: $\text{Median} \approx \text{Normal}\!\left(\theta, \frac{1}{n}\,\frac{1}{4 f(\theta)^2}\right)$
At the normal: $\text{Median} \approx \text{Normal}\!\left(\theta, \frac{\sigma^2}{n}\,\frac{\pi}{2}\right)$
Asymptotic Relative Efficiency of Median to Mean:
$\text{ARE}(\text{Median}, \bar X) = \frac{\text{avar}(\bar X)}{\text{avar}(\text{Median})} = \frac{2}{\pi} = 0.6366$
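The $2/\pi$ figure is easy to check by simulation. Below is a minimal sketch (not part of the original course; the sample size, replication count, and seed are arbitrary choices) comparing Monte Carlo variances of the mean and the median at the normal model.

```python
# Monte Carlo check of ARE(Median, Mean) = 2/pi at the normal model.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1001, 5000                    # odd n: median is one order statistic
x = rng.standard_normal((reps, n))
are = x.mean(axis=1).var() / np.median(x, axis=1).var()
print(are)                              # close to 2/pi = 0.6366
```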
CAUCHY DISTRIBUTION
$X \sim \text{Cauchy}(\mu, \sigma^2)$
$f(x; \mu, \sigma) = \frac{1}{\pi\sigma}\left[1 + \left(\frac{x-\mu}{\sigma}\right)^2\right]^{-1}$
Mean: $\bar X \sim \text{Cauchy}(\mu, \sigma^2)$.   Median $\approx \text{Normal}\!\left(\mu, \frac{\pi^2 \sigma^2}{4n}\right)$
$\text{ARE}(\text{Median}, \bar X) = \infty$, or $\text{ARE}(\bar X, \text{Median}) = 0$
For the $t$ distribution on $\nu$ degrees of freedom:
$\text{ARE}(\text{Median}, \bar X) = \frac{4\,\Gamma((\nu+1)/2)^2}{\pi\,(\nu-2)\,\Gamma(\nu/2)^2}$

$\nu$:                       2        3        4        5
ARE(Median, $\bar X$):   $\infty$    1.621    1.125    0.960
ARE($\bar X$, Median):       0       0.617    0.888    1.041
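A short sketch (assuming the formula above, which is valid for $\nu > 2$) reproducing the table entries with the gamma function:

```python
# ARE(Median, Mean) for the t distribution on nu > 2 degrees of freedom.
from math import gamma, pi

def are_median_mean(nu):
    return 4 * gamma((nu + 1) / 2) ** 2 / (pi * (nu - 2) * gamma(nu / 2) ** 2)

for nu in (3, 4, 5):
    a = are_median_mean(nu)
    print(nu, round(a, 3), round(1 / a, 3))  # 1.621/0.617, 1.125/0.888, 0.960/1.041
```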
MIXTURE OF NORMALS
Theory of errors: the Central Limit Theorem gives plausibility to normality.
$X \sim \begin{cases}\text{Normal}(\mu, \sigma^2) & \text{with probability } 1-\epsilon \\ \text{Normal}(\mu, (3\sigma)^2) & \text{with probability } \epsilon\end{cases}$
i.e. not all measurements are equally precise.
$X \sim (1-\epsilon)\,\text{Normal}(\mu, \sigma^2) + \epsilon\,\text{Normal}(\mu, (3\sigma)^2)$, with $\epsilon = 0.10$
Classic paper: Tukey (1960), "A survey of sampling from contaminated distributions."
For $\epsilon > 0.10$: ARE(Median, $\bar X$) > 1.
The mean absolute deviation is more efficient than the sample standard deviation
for $\epsilon > 0.01$.
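A small simulation sketch of this contamination model (seed, $n$ and replication count are arbitrary choices): at $\epsilon = 0.10$ the variances of the mean and the median come out nearly equal, consistent with the crossover claimed above.

```python
# Mean vs. median under (1 - eps) N(0,1) + eps N(0, 3^2), eps = 0.10.
import numpy as np

rng = np.random.default_rng(1)
eps, n, reps = 0.10, 101, 20000
sd = np.where(rng.random((reps, n)) < eps, 3.0, 1.0)  # each point good or bad
x = rng.standard_normal((reps, n)) * sd
print(x.mean(axis=1).var(), np.median(x, axis=1).var())
```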
PRINCETON ROBUSTNESS STUDIES
Andrews et al. (1972)
Other estimates of location.
$\alpha$-trimmed mean: Trim a proportion $\alpha$ from both ends of the data set and then
take the mean. (Throwing away data?)
$\alpha$-Winsorized mean: Replace a proportion $\alpha$ from both ends of the data set
by the next closest observation and then take the mean.
Example: 2, 4, 5, 10, 200
Mean = 44.2    Median = 5
20% trimmed mean = (4 + 5 + 10) / 3 = 6.33
20% Winsorized mean = (4 + 4 + 5 + 10 + 10) / 5 = 6.6
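A minimal sketch of both estimators; with $n = 5$ and $\alpha = 0.2$, exactly one point is trimmed (or replaced) at each end, reproducing the values above.

```python
# alpha-trimmed and alpha-Winsorized means.
import numpy as np

def trimmed_mean(x, alpha):
    x = np.sort(np.asarray(x, dtype=float))
    k = int(alpha * len(x))                  # points removed from each end
    return x[k:len(x) - k].mean()

def winsorized_mean(x, alpha):
    x = np.sort(np.asarray(x, dtype=float))
    k = int(alpha * len(x))
    x[:k] = x[k]                             # pull the low end inward
    x[len(x) - k:] = x[len(x) - k - 1]       # pull the high end inward
    return x.mean()

data = [2, 4, 5, 10, 200]
print(trimmed_mean(data, 0.2), winsorized_mean(data, 0.2))   # 6.333..., 6.6
```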
Measuring the robustness of a statistic
Relative efficiency over a range of distributional models.
(There exist estimates of location which are asymptotically most efficient for
the center of any symmetric distribution: adaptive estimation, semi-parametrics.
Robust?)
Influence function over a range of distributional models.
Maximum bias function and the breakdown point.
Measuring the effect of an outlier
(not modeled)
Good Data Set: $x_1, \ldots, x_{n-1}$
Statistic: $T_{n-1} = T(x_1, \ldots, x_{n-1})$
Contaminated Data Set: $x_1, \ldots, x_{n-1}, x$
Contaminated Value: $T_n = T(x_1, \ldots, x_{n-1}, x)$
THE SENSITIVITY CURVE (Tukey, 1970)
$SC_n(x) = n\,(T_n - T_{n-1})$, i.e. $T_n = T_{n-1} + \frac{1}{n}\,SC_n(x)$
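A sketch of the sensitivity curve as a function, applied to the mean and the median (the "good" sample here is an arbitrary standard normal draw):

```python
# Sensitivity curve SC_n(x) = n (T_n - T_{n-1}) as the added point x varies.
import numpy as np

def sensitivity_curve(stat, good, xs):
    t_prev = stat(good)
    n = len(good) + 1
    return np.array([n * (stat(np.append(good, x)) - t_prev) for x in xs])

good = np.random.default_rng(2).standard_normal(20)
xs = np.linspace(-10.0, 10.0, 201)
sc_mean = sensitivity_curve(np.mean, good, xs)      # unbounded: linear in x
sc_median = sensitivity_curve(np.median, good, xs)  # bounded step function
```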
THE INFLUENCE FUNCTION (Hampel, 1969, 1974)
Population version of the sensitivity curve.
THE INFLUENCE FUNCTION
Statistic $T_n = T(F_n)$ estimates $T(F)$.
Consider $F$ and the $\epsilon$-contaminated distribution,
$F_\epsilon = (1 - \epsilon)\,F + \epsilon\,\delta_x$
where $\delta_x$ is the point-mass distribution at $x$.
Compare functional values: $T(F)$ vs. $T(F_\epsilon)$
Given qualitative robustness (continuity): $T(F_\epsilon) \to T(F)$ as $\epsilon \to 0$
(e.g. the mode is not qualitatively robust)
Influence Function (infinitesimal perturbation: Gâteaux derivative):
$IF(x; T, F) = \lim_{\epsilon \to 0} \frac{T(F_\epsilon) - T(F)}{\epsilon} = \left.\frac{\partial T(F_\epsilon)}{\partial \epsilon}\right|_{\epsilon = 0}$
EXAMPLES
Mean: $T(F) = E_F[X]$.
$T(F_\epsilon) = E_{F_\epsilon}[X] = (1-\epsilon)\,E_F[X] + \epsilon\,E[\delta_x] = (1-\epsilon)\,T(F) + \epsilon x$
$IF(x; T, F) = \lim_{\epsilon \to 0} \frac{(1-\epsilon)\,T(F) + \epsilon x - T(F)}{\epsilon} = x - T(F)$
Median: $T(F) = F^{-1}(1/2)$
$IF(x; T, F) = \frac{1}{2 f(T(F))}\,\text{sign}(x - T(F))$
Plots of influence functions
Gives insight into the behavior of a statistic:
$T_n \approx T_{n-1} + \frac{1}{n}\,IF(x; T, F)$
[Plots: Mean; Median; $\alpha$-trimmed mean; $\alpha$-Winsorized mean (somewhat unexpected?)]
Desirable robustness properties for the influence function
SMALL:
Gross error sensitivity: $GES(T; F) = \sup_x |IF(x; T, F)|$
$GES < \infty$: B-robust (bias-robust)
Asymptotic variance (note $E_F[IF(X; T, F)] = 0$):
$AV(T; F) = E_F[IF(X; T, F)^2]$
Under general conditions, e.g. Fréchet differentiability,
$\sqrt{n}\,(T(X_1, \ldots, X_n) - T(F)) \to \text{Normal}(0, AV(T; F))$
Trade-off at the normal model: smaller AV, larger GES.
SMOOTH (local shift sensitivity): protects e.g. against rounding error.
REDESCENDING to 0.
REDESCENDING INFLUENCE FUNCTION
Example: data set of male heights in cm
180, 175, 192, . . ., 185, 2020, 190, . . .
Redescender: an automatic outlier detector.
CLASSES OF ESTIMATES
L-statistics: linear combinations of order statistics
Let $X_{(1)} \le \cdots \le X_{(n)}$ represent the order statistics.
$T(X_1, \ldots, X_n) = \sum_{i=1}^n a_{i,n}\,X_{(i)}$
where the $a_{i,n}$ are constants.
Examples:
Mean: $a_{i,n} = 1/n$
Median:
$a_{i,n} = \begin{cases} 1 & i = \frac{n+1}{2} \\ 0 & i \ne \frac{n+1}{2} \end{cases}$ for $n$ odd;
$a_{i,n} = \begin{cases} \frac{1}{2} & i = \frac{n}{2}, \frac{n}{2}+1 \\ 0 & \text{otherwise} \end{cases}$ for $n$ even
$\alpha$-trimmed mean. $\alpha$-Winsorized mean.
A general form for the influence function exists.
Can obtain any desirable monotonic shape, but not redescending.
Do not readily generalize to other settings.
M-ESTIMATES
Huber (1964, 1967)
Maximum likelihood type estimates
under non-standard conditions
One-parameter case: $X_1, \ldots, X_n$ i.i.d. $f(x; \theta)$.
Maximum likelihood estimates:
Likelihood function: $L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i; \theta)$
Minimize the negative log-likelihood:
$\min_\theta \sum_{i=1}^n \rho(x_i; \theta)$ where $\rho(x_i; \theta) = -\log f(x_i; \theta)$.
Solve the likelihood equations:
$\sum_{i=1}^n \psi(x_i; \theta) = 0$ where $\psi(x_i; \theta) = \frac{\partial \rho(x_i; \theta)}{\partial \theta}$
DEFINITIONS OF M-ESTIMATES
Objective function approach: $\hat\theta = \arg\min_\theta \sum_{i=1}^n \rho(x_i; \theta)$
M-estimating equation approach: $\sum_{i=1}^n \psi(x_i; \hat\theta) = 0$.
Note: unique solution when $\psi(x; \theta)$ is strictly monotone in $\theta$.
Basic examples.
Mean. MLE for Normal: $f(x) = (2\pi)^{-1/2} e^{-\frac{1}{2}(x-\theta)^2}$ for $x \in \mathbb{R}$.
$\rho(x; \theta) = (x - \theta)^2$ or $\psi(x; \theta) = x - \theta$
Median. MLE for Double Exponential: $f(x) = \frac{1}{2} e^{-|x-\theta|}$ for $x \in \mathbb{R}$.
$\rho(x; \theta) = |x - \theta|$ or $\psi(x; \theta) = \text{sign}(x - \theta)$
$\rho$ and $\psi$ need not be related to any density, nor to each other.
Estimates can be evaluated under various distributions.
M-ESTIMATES OF LOCATION
A symmetric and translation equivariant M-estimate.
Translation equivariance: $X_i \to X_i + a \Rightarrow T_n \to T_n + a$
gives $\rho(x; t) = \rho(x - t)$ and $\psi(x; t) = \psi(x - t)$
Symmetry: $X_i \to -X_i \Rightarrow T_n \to -T_n$
gives $\rho(-r) = \rho(r)$ or $\psi(-r) = -\psi(r)$.
Alternative derivation:
generalization of the MLE for the center of symmetry of a given family of symmetric
distributions,
$f(x; \theta) = g(|x - \theta|)$
INFLUENCE FUNCTION OF M-ESTIMATES
M-FUNCTIONAL: $T(F)$ is the solution to $E_F[\psi(X; T(F))] = 0$.
$IF(x; T, F) = c(T, F)\,\psi(x; T(F))$
where $c(T, F) = -1 / E_F\!\left[\frac{\partial \psi(X; \theta)}{\partial \theta}\right]$ evaluated at $\theta = T(F)$.
Note: $E_F[IF(X; T, F)] = 0$.
One can decide what shape is desired for the influence function and then construct
an appropriate M-estimate.
[Plots: Mean; Median]
EXAMPLE
Choose
$\psi(r) = \begin{cases} -c & r \le -c \\ r & |r| < c \\ c & r \ge c \end{cases}$
where $c$ is a tuning constant.
Huber's M-estimate: an adaptively trimmed mean,
i.e. the proportion trimmed depends upon the data.
DERIVATION OF THE INFLUENCE FUNCTION FOR M-ESTIMATES
Sketch
Let $T_\epsilon = T(F_\epsilon)$, and so
$0 = E_{F_\epsilon}[\psi(X; T_\epsilon)] = (1-\epsilon)\,E_F[\psi(X; T_\epsilon)] + \epsilon\,\psi(x; T_\epsilon)$.
Taking the derivative with respect to $\epsilon$:
$0 = -E_F[\psi(X; T_\epsilon)] + (1-\epsilon)\frac{\partial}{\partial \epsilon} E_F[\psi(X; T_\epsilon)] + \psi(x; T_\epsilon) + \epsilon\,\frac{\partial}{\partial \epsilon}\psi(x; T_\epsilon)$.
Let $\psi'(x, \theta) = \partial \psi(x, \theta)/\partial \theta$. Using the chain rule,
$0 = -E_F[\psi(X; T_\epsilon)] + (1-\epsilon)\,E_F[\psi'(X; T_\epsilon)]\,\frac{\partial T_\epsilon}{\partial \epsilon} + \psi(x; T_\epsilon) + \epsilon\,\psi'(x; T_\epsilon)\,\frac{\partial T_\epsilon}{\partial \epsilon}$.
Letting $\epsilon \to 0$ and using qualitative robustness, i.e. $T(F_\epsilon) \to T(F)$, then gives
(since $E_F[\psi(X; T(F))] = 0$)
$0 = E_F[\psi'(X; T(F))]\,IF(x; T, F) + \psi(x; T(F))$,
and the stated results follow.
ASYMPTOTIC NORMALITY OF M-ESTIMATES
Sketch
Let $\theta = T(F)$. A Taylor series expansion of $\psi(x; \hat\theta)$ about $\theta$ gives
$0 = \sum_{i=1}^n \psi(x_i; \hat\theta) = \sum_{i=1}^n \psi(x_i; \theta) + (\hat\theta - \theta)\sum_{i=1}^n \psi'(x_i; \theta) + \cdots,$
or
$0 = \sqrt{n}\,\frac{1}{n}\sum_{i=1}^n \psi(x_i; \theta) + \sqrt{n}\,(\hat\theta - \theta)\,\frac{1}{n}\sum_{i=1}^n \psi'(x_i; \theta) + O_p(1/\sqrt{n}).$
By the CLT, $\sqrt{n}\,\frac{1}{n}\sum_{i=1}^n \psi(x_i; \theta) \to_d Z \sim \text{Normal}(0, E_F[\psi(X; \theta)^2])$.
By the WLLN, $\frac{1}{n}\sum_{i=1}^n \psi'(x_i; \theta) \to_p E_F[\psi'(X; \theta)]$.
Thus, by Slutsky's theorem,
$\sqrt{n}\,(\hat\theta - \theta) \to_d -Z / E_F[\psi'(X; \theta)] \sim \text{Normal}(0, \sigma^2),$
where
$\sigma^2 = E_F[\psi(X; \theta)^2] / E_F[\psi'(X; \theta)]^2 = E_F[IF(X; T, F)^2]$
NOTE: Proving Fréchet differentiability is not necessary.
M-estimates of location
Adaptively weighted means
Recall translation equivariance and symmetry imply
$\psi(x; t) = \psi(x - t)$ and $\psi(-r) = -\psi(r)$.
Express $\psi(r) = r\,u(r)$ and let $w_i = u(x_i - \hat\theta)$; then
$0 = \sum_{i=1}^n \psi(x_i - \hat\theta) = \sum_{i=1}^n (x_i - \hat\theta)\,u(x_i - \hat\theta)
\;\Rightarrow\; \hat\theta = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$
The weights are determined by the data cloud.
[Plot: observations far from the bulk of the data are heavily downweighted]
SOME COMMON M-ESTIMATES OF LOCATION
Huber's M-estimate:
$\psi(r) = \begin{cases} -c & r \le -c \\ r & |r| < c \\ c & r \ge c \end{cases}$
Given a bound on the GES, it has maximum efficiency at the normal model.
MLE for the least favorable distribution, i.e. the symmetric unimodal model with
smallest Fisher information within a neighborhood of the normal.
LFD = normal in the middle and double exponential in the tails.
Tukey's bi-weight M-estimate (or bi-square):
$u(r) = \left[\left(1 - \frac{r^2}{c^2}\right)_+\right]^2$, where $a_+ = \max\{0, a\}$
Linear near zero.
Smooth (continuous second derivatives).
Strongly redescending to 0.
NOT an MLE.
CAUCHY MLE
$\psi(r) = \frac{r/c}{1 + r^2/c^2}$
NOT A STRONG REDESCENDER.
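A sketch of the three $\psi$ functions just listed, using the conventional 95%-efficiency tuning constants for Huber and the bi-weight (standard values, not taken from these slides):

```python
# Huber, Tukey bi-weight, and Cauchy-MLE psi functions on a grid.
import numpy as np

def psi_huber(r, c=1.345):
    return np.clip(r, -c, c)                              # monotone, bounded

def psi_biweight(r, c=4.685):
    return r * np.maximum(0.0, 1.0 - (r / c) ** 2) ** 2   # exactly 0 for |r| >= c

def psi_cauchy(r, c=1.0):
    return (r / c) / (1.0 + (r / c) ** 2)                 # redescends only slowly

r = np.linspace(-10, 10, 401)
curves = {f.__name__: f(r) for f in (psi_huber, psi_biweight, psi_cauchy)}
```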
COMPUTATIONS
IRLS: Iteratively Re-weighted Least Squares algorithm.
$\hat\theta_{k+1} = \frac{\sum_{i=1}^n w_{i,k}\,x_i}{\sum_{i=1}^n w_{i,k}}$
where $w_{i,k} = u(x_i - \hat\theta_k)$ and $\hat\theta_0$ is any initial value.
Re-weighted mean = one-step M-estimate:
$\hat\theta_1 = \frac{\sum_{i=1}^n w_{i,0}\,x_i}{\sum_{i=1}^n w_{i,0}},$
where $\hat\theta_0$ is a preliminary robust estimate of location, such as the median.
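A minimal IRLS sketch for a location M-estimate. The Huber weights $u(r) = \min(1, c/|r|)$ (so that $\psi(r) = r\,u(r)$ is Huber's $\psi$) and the median start are illustrative choices; no auxiliary scale is used here (see Part 2).

```python
# IRLS for a location M-estimate with Huber weights.
import numpy as np

def m_location(x, c=1.345, tol=1e-9, max_iter=100):
    x = np.asarray(x, dtype=float)
    theta = np.median(x)                         # preliminary robust estimate
    for _ in range(max_iter):
        a = np.abs(x - theta)
        w = c / np.maximum(a, c)                 # 1 if |r| <= c, else c/|r|
        theta_new = np.sum(w * x) / np.sum(w)    # re-weighted mean
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

print(m_location([2, 4, 5, 10, 200]))
```

Stopping after the first pass from the median start gives exactly the one-step M-estimate described above.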
PROOF OF CONVERGENCE
Sketch
$\hat\theta_{k+1} - \hat\theta_k = \frac{\sum_{i=1}^n w_{i,k}\,x_i}{\sum_{i=1}^n w_{i,k}} - \hat\theta_k
= \frac{\sum_{i=1}^n w_{i,k}\,(x_i - \hat\theta_k)}{\sum_{i=1}^n w_{i,k}}
= \frac{\sum_{i=1}^n \psi(x_i - \hat\theta_k)}{\sum_{i=1}^n u(x_i - \hat\theta_k)}$
Note: if $\hat\theta_k > \hat\theta$, then $\hat\theta_k > \hat\theta_{k+1}$, and if $\hat\theta_k < \hat\theta$, then $\hat\theta_k < \hat\theta_{k+1}$.
Decreasing objective function
Let $\rho(r) = \rho_0(r^2)$ and suppose $\rho_0'(s) \ge 0$ and $\rho_0''(s) \le 0$. Then
$\sum_{i=1}^n \rho(x_i - \hat\theta_{k+1}) < \sum_{i=1}^n \rho(x_i - \hat\theta_k).$
Examples:
Mean: $\rho_0(s) = s$
Median: $\rho_0(s) = \sqrt{s}$
Cauchy MLE: $\rho_0(s) = \log(1 + s)$.
Includes redescending M-estimates of location.
Generalization of the EM algorithm for mixtures of normals.
PROOF OF MONOTONE CONVERGENCE
Sketch
Let $r_{i,k} = x_i - \hat\theta_k$. By a Taylor series with remainder term,
$\rho_0(r_{i,k+1}^2) = \rho_0(r_{i,k}^2) + (r_{i,k+1}^2 - r_{i,k}^2)\,\rho_0'(r_{i,k}^2) + \frac{1}{2}(r_{i,k+1}^2 - r_{i,k}^2)^2\,\rho_0''(r_{i,*}^2)$
So,
$\sum_{i=1}^n \rho_0(r_{i,k+1}^2) \le \sum_{i=1}^n \rho_0(r_{i,k}^2) + \sum_{i=1}^n (r_{i,k+1}^2 - r_{i,k}^2)\,\rho_0'(r_{i,k}^2)$
Now,
$r\,u(r) = \psi(r) = \rho'(r) = 2r\,\rho_0'(r^2)$, and so $\rho_0'(r^2) = u(r)/2$.
Also,
$r_{i,k+1}^2 - r_{i,k}^2 = (\hat\theta_{k+1} - \hat\theta_k)^2 - 2(\hat\theta_{k+1} - \hat\theta_k)(x_i - \hat\theta_k)$.
Thus,
$\sum_{i=1}^n \rho_0(r_{i,k+1}^2) \le \sum_{i=1}^n \rho_0(r_{i,k}^2) - \frac{1}{2}(\hat\theta_{k+1} - \hat\theta_k)^2 \sum_{i=1}^n u(r_{i,k}) < \sum_{i=1}^n \rho_0(r_{i,k}^2)$
Slow but sure. Switch to Newton-Raphson after a few iterations.
PART 2
MORE ADVANCED CONCEPTS AND METHODS
SCALE EQUIVARIANCE
M-estimates of location alone are not scale equivariant in general, i.e.
$X_i \to X_i + a \Rightarrow \hat\theta \to \hat\theta + a$ (location equivariant), but
$X_i \to b\,X_i \nRightarrow \hat\theta \to b\,\hat\theta$ (not scale equivariant).
(Exceptions: the mean and median.)
Thus, the adaptive weights do not depend on the spread of the data.
[Plot: without a scale correction, the same points are equally downweighted however spread out the data are]
SCALE STATISTICS: $s_n$
$X_i \to b\,X_i + a \Rightarrow s_n \to |b|\,s_n$
Sample standard deviation:
$s_n = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2}$
MAD (or, more appropriately, MADAM):
Median Absolute Deviation About the Median
$s_n^* = \text{Median}\,|x_i - \text{median}|$
$s_n = 1.4826\,s_n^* \to_p \sigma$ at Normal$(\mu, \sigma^2)$
Example: 2, 4, 5, 10, 12, 14, 200
Median = 10
Absolute deviations: 8, 6, 5, 0, 2, 4, 190
MADAM = 5, so $s_n = 7.413$
(Standard deviation = 72.7661)
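A short sketch reproducing this example:

```python
# MADAM and its rescaled version for the example data.
import numpy as np

x = np.array([2, 4, 5, 10, 12, 14, 200], dtype=float)
med = np.median(x)                            # 10
madam = np.median(np.abs(x - med))            # 5
print(madam, 1.4826 * madam, x.std(ddof=1))   # 5, 7.413, 72.7661
```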
M-ESTIMATES OF LOCATION WITH AUXILIARY SCALE
$\hat\theta = \arg\min_\theta \sum_{i=1}^n \rho\!\left(\frac{x_i - \theta}{c\,s_n}\right)$
or $\hat\theta$ solves
$\sum_{i=1}^n \psi\!\left(\frac{x_i - \hat\theta}{c\,s_n}\right) = 0$
$c$: tuning constant
$s_n$: consistent for $\sigma$ at the normal
TUNING AN M-ESTIMATE
Given a $\psi$-function, define a class of M-estimates via
$\psi_c(r) = \psi(r/c)$
Tukey's bi-weight: $\psi(r) = r\,(1 - r^2)_+^2$, so
$\psi_c(r) = \frac{r}{c}\left[\left(1 - \frac{r^2}{c^2}\right)_+\right]^2$
$c \to \infty$: Mean
$c \to 0$: locally very unstable
[Plots: tuning at the normal distribution; auxiliary scale]
MORE THAN ONE OUTLIER
GES: a measure of local robustness.
Measures of global robustness?
$\mathcal{X}_n = \{X_1, \ldots, X_n\}$: $n$ good data points
$\mathcal{Y}_m = \{Y_1, \ldots, Y_m\}$: $m$ bad data points
$\mathcal{Z}_{n+m} = \mathcal{X}_n \cup \mathcal{Y}_m$: $\epsilon_m$-contaminated sample, with $\epsilon_m = \frac{m}{n+m}$
Bias of a statistic: $|T(\mathcal{X}_n \cup \mathcal{Y}_m) - T(\mathcal{X}_n)|$
MAX-BIAS and THE BREAKDOWN POINT
Max-bias under $\epsilon_m$-contamination:
$B(\epsilon_m; T, \mathcal{X}_n) = \sup_{\mathcal{Y}_m} |T(\mathcal{X}_n \cup \mathcal{Y}_m) - T(\mathcal{X}_n)|$
Finite sample contamination breakdown point: Donoho and Huber (1983)
$\epsilon_c^*(T; \mathcal{X}_n) = \inf\{\epsilon_m \mid B(\epsilon_m; T, \mathcal{X}_n) = \infty\}$
Other concepts of breakdown exist (e.g. replacement).
The definition of BIAS can be modified so that BIAS $\to \infty$ as $T(\mathcal{X}_n \cup \mathcal{Y}_m)$ goes
to the boundary of the parameter space.
Example: if $T$ represents scale, then choose
BIAS $= |\log T(\mathcal{X}_n \cup \mathcal{Y}_m) - \log T(\mathcal{X}_n)|$.
EXAMPLES
Mean: $\epsilon_c^* = \frac{1}{n+1}$
Median: $\epsilon_c^* = 1/2$
$\alpha$-trimmed mean: $\epsilon_c^* = \alpha$
$\alpha$-Winsorized mean: $\epsilon_c^* = \alpha$
M-estimate of location with monotone and bounded $\psi$ function: $\epsilon_c^* = 1/2$
PROOF
(sketch of lower bound)
Let $K = \sup_r |\psi(r)|$.
$0 = \sum_{i=1}^{n+m} \psi(z_i - T_{n+m}) = \sum_{i=1}^{n} \psi(x_i - T_{n+m}) + \sum_{i=1}^{m} \psi(y_i - T_{n+m})$
$\left|\sum_{i=1}^{n} \psi(x_i - T_{n+m})\right| = \left|\sum_{i=1}^{m} \psi(y_i - T_{n+m})\right| \le mK$
Breakdown occurs $\Rightarrow |T_{n+m}| \to \infty$, say $T_{n+m} \to -\infty$
$\Rightarrow \left|\sum_{i=1}^{n} \psi(x_i - T_{n+m})\right| \to \left|\sum_{i=1}^{n} \psi(\infty)\right| = nK$
Therefore $m \ge n$, and so $\epsilon_c^* \ge 1/2$. $\square$
POPULATION VERSION OF THE BREAKDOWN POINT
under contamination neighborhoods
Model distribution: $F$.   Contaminating distribution: $H$.
$\epsilon$-contaminated distribution: $F_\epsilon = (1-\epsilon)\,F + \epsilon\,H$
Max-bias under $\epsilon$-contamination:
$B(\epsilon; T, F) = \sup_H |T(F_\epsilon) - T(F)|$
Breakdown point: Hampel (1968)
$\epsilon^*(T; F) = \inf\{\epsilon \mid B(\epsilon; T, F) = \infty\}$
Examples:
Mean: $T(F) = E_F(X)$, $\epsilon^*(T; F) = 0$
Median: $T(F) = F^{-1}(1/2)$, $\epsilon^*(T; F) = 1/2$
ILLUSTRATION
[Plot of the max-bias curve $B(\epsilon; T, F)$: GES $= \partial B(\epsilon; T, F)/\partial\epsilon\,|_{\epsilon=0}$; $\epsilon^*(T; F)$ = its asymptote]
Heuristic interpretation of the breakdown point (subject to debate):
the proportion of bad data a statistic can tolerate before becoming arbitrary or
meaningless.
If 1/2 the data is bad, then one cannot distinguish between the good data and
the bad data, so $\epsilon^* \le 1/2$?
Discussion: Davies and Gather (2005), Annals of Statistics.
Example: redescending M-estimates of location with fixed scale
$T(x_1, \ldots, x_n) = \arg\min_t \sum_{i=1}^n \rho\!\left(\frac{|x_i - t|}{c}\right)$
Breakdown point. Huber (1984). For bounded increasing $\rho$ with $\sup_r \rho(r) = 1$,
$\epsilon_c^*(T; \mathcal{X}_n) = \frac{1 - A(\mathcal{X}_n; c)/n}{2 - A(\mathcal{X}_n; c)/n}$
where $A(\mathcal{X}_n; c) = \min_t \sum_{i=1}^n \rho\!\left(\frac{x_i - t}{c}\right)$.
The breakdown point depends on $\mathcal{X}_n$ and $c$:
$\epsilon^*: 0 \to 1/2$ as $c: 0 \to \infty$
For large $c$, $T(x_1, \ldots, x_n) \approx$ Mean!!
Explanation: relationship between redescending M-estimates of location and
kernel density estimates
Chu, Glad, Godtliebsen and Marron (1998)
Objective function: $\sum_{i=1}^n \rho\!\left(\frac{x_i - t}{c}\right)$
Kernel density estimate: $\hat f(t) \propto \sum_{i=1}^n K\!\left(\frac{x_i - t}{h}\right)$
Relationship: $K \propto 1 - \rho$, and $c$ = window width $h$
Example: $K(r) = \frac{1}{\sqrt{2\pi}}\exp(-r^2/2)$ (normal kernel)
corresponds to $\rho(r) = 1 - \exp(-r^2/2)$ (Welsch's M-estimate)
Example: Epanechnikov kernel corresponds to the skipped mean.
Outliers less compact than the good data $\Rightarrow$ breakdown will not occur.
Not true for monotonic M-estimates.
PART 3
ROBUST REGRESSION
AND MULTIVARIATE STATISTICS
REGRESSION SETTING
Data: $(Y_i, X_i)$, $i = 1, \ldots, n$
$Y_i \in \mathbb{R}$: response
$X_i \in \mathbb{R}^p$: predictors
Predict $Y$ by $X'\beta$.
Residual for a given $\beta$: $r_i(\beta) = y_i - x_i'\beta$
M-estimates of regression
Generalization of the MLE for a symmetric error term:
$Y_i = X_i'\beta + \epsilon_i$, where the $\epsilon_i$ are i.i.d. symmetric.
M-ESTIMATES OF REGRESSION
Objective function approach:
$\min_\beta \sum_{i=1}^n \rho(r_i(\beta))$
where $\rho(-r) = \rho(r)$ and $\rho$ is nondecreasing for $r \ge 0$.
M-estimating equation approach:
$\sum_{i=1}^n \psi(r_i(\beta))\,x_i = 0$
e.g. $\psi(r) = \rho'(r)$
Interpretation: adaptively weighted least squares.
Express $\psi(r) = r\,u(r)$ and $w_i = u(r_i(\hat\beta))$:
$\hat\beta = \left[\sum_{i=1}^n w_i\,x_i x_i'\right]^{-1}\left[\sum_{i=1}^n w_i\,y_i x_i\right]$
Computations via the IRLS algorithm.
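A minimal IRLS sketch for M-regression. The Huber weights and the MAD rescaling of the residuals at each pass are illustrative choices; this is the adaptively weighted least squares form above, not a high breakdown point fit.

```python
# IRLS for M-regression; X should include a column of ones for an intercept.
import numpy as np

def m_regression(X, y, c=1.345, tol=1e-9, max_iter=100):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # least squares start
    for _ in range(max_iter):
        r = y - X @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12  # MAD scale
        a = np.abs(r) / s
        w = c / np.maximum(a, c)                       # Huber weights
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```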
INFLUENCE FUNCTIONS FOR M-ESTIMATES OF REGRESSION
$IF(y, x; T, F) \propto \psi(r)\,x$, with $r = y - x'T(F)$
Due to the residual: $\psi(r)$.   Due to the design: $x$.
$GES = \infty$, i.e. unbounded influence.
Outlier 1: highly influential for any $\psi$ function.
Outlier 2: highly influential for monotonic $\psi$, but not for redescending $\psi$.
BOUNDED INFLUENCE REGRESSION
GM-estimates (generalized M-estimates).
Mallows-type (Mallows, 1975):
$\sum_{i=1}^n w(x_i)\,\psi(r_i(\beta))\,x_i = 0$
Downweights outlying design points (leverage points), even if they are good
leverage points.
General form (e.g. Maronna and Yohai, 1981):
$\sum_{i=1}^n \eta(x_i, r_i(\beta))\,x_i = 0$
Breakdown points:
M-estimates: $\epsilon^* = 0$
GM-estimates: $\epsilon^* \le 1/(p+1)$
LATE 70s - EARLY 80s
Open problem: Is high breakdown point regression possible?
Yes. Repeated Median. Siegel (1982).
Not regression equivariant.
Regression equivariance: for $a \in \mathbb{R}$ and $A$ nonsingular,
$(Y_i, X_i) \to (a\,Y_i, A'X_i) \Rightarrow \hat\beta \to a\,A^{-1}\hat\beta,$
i.e. $\hat Y_i \to a\,\hat Y_i$
Open problem: Is high breakdown point equivariant regression possible?
Yes. Least Median of Squares. Hampel (1984), Rousseeuw (1984).
LEAST MEDIAN OF SQUARES (LMS)
Hampel (1984), Rousseeuw (1984)
$\min_\beta\,\text{Median}\{r_i(\beta)^2 \mid i = 1, \ldots, n\}$
Alternatively: $\min_\beta\,\text{MAD}\{r_i(\beta)\}$
Breakdown point: $\epsilon^* = 1/2$
Location version: SHORTH (Princeton robustness study),
the midpoint of the SHORTest Half.
LMS: the mid-line of the shortest strip containing 1/2 of the data.
Problem: not $\sqrt{n}$-consistent, but only $\sqrt[3]{n}$-consistent:
$\sqrt{n}\,\|\hat\beta - \beta\| \to_p \infty$, while $\sqrt[3]{n}\,\|\hat\beta - \beta\| = O_p(1)$
Not locally stable, e.g. [plot] the fit shifts although the example is pure noise.
S-ESTIMATES OF REGRESSION
Rousseeuw and Yohai (1984)
For $S(\beta)$, an estimate of scale of the residuals (about zero): $\min_\beta S(r_i(\beta))$
S = MAD $\Rightarrow$ LMS
S = sample standard deviation about 0 $\Rightarrow$ least squares
Bounded monotonic M-estimates of scale (about zero):
$\sum_{i=1}^n \chi(|r_i|/s) = 0$
for $\chi$ nondecreasing and bounded above and below. Alternatively,
$\frac{1}{n}\sum_{i=1}^n \rho(|r_i|/s) = \delta$
for $\rho$ nondecreasing and $0 \le \rho \le 1$.
For LMS: $\rho$ is a 0-1 jump function and $\delta = 1/2$.
Breakdown point: $\epsilon^* = \min(\delta, 1 - \delta)$
S-estimates of regression
are $\sqrt{n}$-consistent and asymptotically normal.
Trade-off: a higher breakdown point means lower efficiency and higher residual
gross error sensitivity at normal errors.
One resolution: a one-step M-estimate via IRLS. (Note: R, S-plus, Matlab.)
M-ESTIMATES OF REGRESSION WITH GENERAL SCALE
Martin, Yohai, and Zamar (1989)
$\min_\beta \sum_{i=1}^n \rho\!\left(\frac{|y_i - x_i'\beta|}{c\,s_n}\right)$
Parameter of interest: $\beta$
Scale statistic: $s_n$ (consistent for $\sigma$ at normal errors)
Tuning constant: $c$
$\rho$ monotonic and bounded $\Rightarrow$ a redescending M-estimate.
High breakdown point examples:
LMS and S-estimates
MM-estimates, Yohai (1987).
CM-estimates, Mendes and Tyler (1994).
[Plot: $s_n$ vs. $\sum_{i=1}^n \rho(|y_i - x_i'\beta|/(c\,s_n))$; each $\beta$ gives one curve;
S-estimates minimize horizontally, M-estimates minimize vertically]
MM-ESTIMATES OF REGRESSION
Default robust regression estimate in S-plus (also SAS?)
Given a preliminary estimate of regression with breakdown point 1/2,
usually an S-estimate of regression.
Compute a monotone M-estimate of scale about zero for the residuals.
$s_n$: usually from the S-estimate.
Compute an M-estimate of regression using the scale statistic $s_n$ and any
desired tuning constant $c$.
Tune to obtain reasonable ARE and residual GES.
Breakdown point: $\epsilon^* = 1/2$
TUNING
High breakdown point S-estimates are badly tuned M-estimates.
Tuning the S-estimate affects its breakdown point.
MM-estimates can be tuned without affecting the breakdown point.
COMPUTATIONAL ISSUES
All known high breakdown point regression estimates are computationally
intensive: a non-convex optimization problem.
Approximate or stochastic algorithms.
Random elemental subset selection. Rousseeuw (1984).
Consider exact fits of lines to $p$ of the data points: there are $\binom{n}{p}$ such lines.
Optimize the criterion over such lines.
For large $n$ and $p$, randomly sample from such lines so that there is a good
chance, say e.g. a 95% chance, that at least one of the elemental subsets contains
only good data even if half the data is contaminated; see the sketch below.
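A sketch of the standard sample-size calculation behind this, assuming subsets are drawn independently (the 95% figure and the contamination fraction are as stated above):

```python
# Number N of random elemental subsets of size p needed so that, with
# probability P, at least one is outlier-free when a fraction eps is bad:
# solve 1 - (1 - (1 - eps)^p)^N >= P for N.
from math import ceil, log

def n_subsets(p, eps=0.5, P=0.95):
    return ceil(log(1 - P) / log(1 - (1 - eps) ** p))

for p in (2, 3, 5, 8):
    print(p, n_subsets(p))    # 11, 23, 95, 766 subsets
```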
MULTIVARIATE DATA
Robust multivariate location and covariance estimates
Parallel developments:
Multivariate M-estimates. Maronna (1976). Huber (1977).
Adaptively weighted means and covariances. IRLS algorithms.
Breakdown point: $\epsilon^* \le 1/(d+1)$, where $d$ is the dimension of the data.
Minimum Volume Ellipsoid estimates. Rousseeuw (1985).
The MVE is a multivariate version of LMS.
Multivariate S-estimates. Davies (1987), Lopuhaä (1989).
Multivariate CM-estimates. Kent and Tyler (1996).
Multivariate MM-estimates. Tatsuoka and Tyler (2000). Tyler (2002).
MULTIVARIATE DATA
$d$-dimensional data set: $X_1, \ldots, X_n$
Classical summary statistics:
Sample mean vector: $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$
Sample variance-covariance matrix:
$S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^T = \{s_{ij}\}$
where $s_{ii}$ = sample variance, $s_{ij}$ = sample covariance.
Maximum likelihood estimates under normality.
[Plot: sample mean and covariance ellipse, Hertzsprung-Russell galaxy data (Rousseeuw)]
Visualization: ellipse containing half the data,
$(X - \bar X)^T S_n^{-1} (X - \bar X) \le c$
Mahalanobis distances: $d_i^2 = (X_i - \bar X)^T S_n^{-1} (X_i - \bar X)$
[Plot: Mahalanobis distances $d_i$ vs. index]
Mahalanobis angles:
$\theta_{i,j} = \cos^{-1}\!\left[(X_i - \bar X)^T S_n^{-1} (X_j - \bar X)/(d_i\,d_j)\right]$
[Plot: first two principal components, PC1 vs. PC2]
Principal components do not always reveal outliers.
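A sketch of the classical Mahalanobis distances; a robust version would replace $\bar X$ and $S_n$ below by, e.g., the MVE or an M-estimate.

```python
# Squared Mahalanobis distances relative to the sample mean and covariance.
import numpy as np

def mahalanobis_d2(X):
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum('ij,jk,ik->i', Z, S_inv, Z)   # d_i^2 for each row
```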
ENGINEERING APPLICATIONS
Noisy signal arrays, radar clutter.
Test for signal, i.e. a detector, based on the Mahalanobis angle between
the signal and the observed data.
Hyperspectral data:
the adaptive cosine estimator = Mahalanobis angle.
BIG DATA
Cannot clean the data by inspecting individual observations. Need automatic methods.
ROBUST MULTIVARIATE ESTIMATES
Location and scatter (pseudo-covariance)
Trimmed versions: complicated.
Weighted means and covariance matrices:
$\hat\mu = \frac{\sum_{i=1}^n w_i X_i}{\sum_{i=1}^n w_i}$
$\hat\Sigma = \frac{1}{n}\sum_{i=1}^n w_i\,(X_i - \hat\mu)(X_i - \hat\mu)^T$
where $w_i = u(d_{i,0})$ and $d_{i,0}^2 = (X_i - \hat\mu_0)^T \hat\Sigma_0^{-1} (X_i - \hat\mu_0)$,
and with $\hat\mu_0$ and $\hat\Sigma_0$ being initial estimates, e.g.
the sample mean vector and covariance matrix, or
high breakdown point estimates.
MULTIVARIATE M-ESTIMATES
MLE-type estimates for elliptically symmetric distributions.
Adaptively weighted means and covariances:
$\hat\mu = \frac{\sum_{i=1}^n w_i X_i}{\sum_{i=1}^n w_i}$
$\hat\Sigma = \frac{1}{n}\sum_{i=1}^n w_i\,(X_i - \hat\mu)(X_i - \hat\mu)^T$
$w_i = u(d_i)$ and $d_i^2 = (X_i - \hat\mu)^T \hat\Sigma^{-1} (X_i - \hat\mu)$
Implicit equations, typically solved by fixed-point iteration (sketch below).
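A minimal fixed-point (multivariate IRLS) sketch. The Huber-type weights $u(d) = \min(1, c/d)$ and the classical start are illustrative choices; actual M-estimates impose conditions on $u$ for existence, uniqueness and consistency (Maronna, 1976).

```python
# Fixed-point iteration for a multivariate M-estimate of location/scatter.
import numpy as np

def m_location_scatter(X, c=3.0, n_iter=100):
    X = np.asarray(X, dtype=float)
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)   # classical start
    for _ in range(n_iter):
        Z = X - mu
        d2 = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(Sigma), Z)
        d = np.sqrt(np.maximum(d2, 1e-12))
        w = np.minimum(1.0, c / d)                        # u(d_i)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()       # weighted mean
        Z = X - mu
        Sigma = (w[:, None] * Z).T @ Z / len(X)           # weighted scatter
    return mu, Sigma
```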
PROPERTIES OF M-ESTIMATES
Root-n consistent and asymptotically normal.
Can be tuned to have desirable properties:
high efficiency over a broad class of distributions;
smooth and bounded influence function.
Computationally simple (IRLS).
Breakdown point: $\epsilon^* \le 1/(d+1)$
HIGH BREAKDOWN POINT ESTIMATES
MVE: Minimum Volume Ellipsoid, $\epsilon^* = 0.5$
Center and scatter matrix of the smallest ellipsoid covering at least
half of the data.
MCD, S-estimates, MM-estimates, CM-estimates: $\epsilon^* = 0.5$
Reweighted versions based on a high breakdown start.
Computationally intensive:
approximate/probabilistic algorithms;
elemental subset approach.
[Plots: Hertzsprung-Russell galaxy data (Rousseeuw) with the sample mean and covariance ellipse; with the MVE added; with the Cauchy MLE added]
ELLIPTICAL DISTRIBUTIONS
and
AFFINE EQUIVARIANCE
SPHERICALLY SYMMETRIC DISTRIBUTIONS
$Z = R\,U$, where $R$ is independent of $U$, $R > 0$,
and $U \sim$ uniform on the unit sphere in $\mathbb{R}^d$.
Density: $f(z) = g(z^T z)$
[Plot: spherically symmetric point cloud]
ELLIPTICALLY SYMMETRIC DISTRIBUTIONS
$X = AZ + \mu \sim E(\mu, \Sigma; g)$, where $\Sigma = AA^T$
$f(x) = |\Sigma|^{-1/2}\,g\!\left((x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
[Plot: elliptically symmetric point cloud]
AFFINE EQUIVARIANCE
Parameters of elliptical distributions:
$X \sim E(\mu, \Sigma; g) \Rightarrow X^* = BX + b \sim E(\mu^*, \Sigma^*; g)$
where $\mu^* = B\mu + b$ and $\Sigma^* = B\Sigma B^T$
Sample version:
$X_i \to X_i^* = BX_i + b$, $i = 1, \ldots, n$
$\hat\mu \to \hat\mu^* = B\hat\mu + b$ and $\hat V \to \hat V^* = B\hat V B^T$
Examples:
mean vector and covariance matrix;
M-estimates;
MVE.
ELLIPTICALLY SYMMETRIC DISTRIBUTIONS
$f(x; \mu, \Sigma) = |\Sigma|^{-1/2}\,g\!\left((x - \mu)^T \Sigma^{-1} (x - \mu)\right)$.
If second moments exist, the shape matrix $\Sigma \propto$ the covariance matrix.
For any affine equivariant scatter functional: $V(F) \propto \Sigma$.
That is, any affine equivariant scatter statistic estimates a matrix
proportional to the shape matrix $\Sigma$.
NON-ELLIPTICAL DISTRIBUTIONS
Different scatter matrices estimate different population quantities.
Analogy: for non-symmetric distributions,
the population mean $\ne$ the population median.
OTHER TOPICS
Projection-based multivariate approaches:
Tukey's data depth (or half-space depth). Tukey (1974).
The Stahel-Donoho estimate. Stahel (1981). Donoho (1982).
Projection estimates. Maronna, Stahel and Yohai (1994). Tyler (1996).
Computationally intensive.
TUKEY'S DEPTH
PART 5
ROBUSTNESS AND COMPUTER VISION
Independent development of robust statistics within the computer
vision / image understanding community.
Hough transform: $\rho = x\cos\theta + y\sin\theta$
$\Leftrightarrow y = a + bx$, where $a = \rho/\sin\theta$ and $b = -\cos\theta/\sin\theta$
That is, an intersection in feature space corresponds to the line connecting two data points.
Hough: estimation in data space becomes clustering in feature space.
Find the centers of the clusters.
Terminology:
Feature space = parameter space
Accumulator = elemental fit
Computation: RANSAC (Random Sample Consensus)
Randomly choose a subset from the accumulator (random elemental fits).
Check how many data points lie within a fixed neighborhood of the fit.
Alternative formulation of the Hough transform / RANSAC:
$\sum_{i=1}^n \rho(r_i)$ should be small, where
$\rho(r) = \begin{cases} 1, & |r| > R \\ 0, & |r| \le R \end{cases}$
That is, a redescending M-estimate of regression with known scale.
Note: relationship to LMS.
Vision: need e.g. a 90% breakdown point, i.e. tolerate 90% bad data.
Definition of residuals? The Hough transform approach does not distinguish between:
the regression line for regressing Y on X;
the regression line for regressing X on Y;
the orthogonal regression line.
(Note: small stochastic errors imply little difference.)
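A minimal RANSAC sketch for a line in 2-D, scoring random elemental fits (pairs of points) by their inlier count within a fixed band $R$; the 0-1 $\rho$ above counts exactly the points outside the band. The residuals here are vertical (Y on X), which is precisely the choice the remark above flags.

```python
# RANSAC for y = a + b x: best elemental fit by number of inliers within R.
import numpy as np

def ransac_line(x, y, R=0.5, n_trials=200, seed=3):
    rng = np.random.default_rng(seed)
    best_fit, best_count = None, -1
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                                  # vertical pair: skip
        b = (y[j] - y[i]) / (x[j] - x[i])             # exact (elemental) fit
        a = y[i] - b * x[i]
        count = int(np.sum(np.abs(y - a - b * x) <= R))
        if count > best_count:
            best_fit, best_count = (a, b), count
    return best_fit, best_count
```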
GENERAL PARADIGM
Line in 2-D: exact fit to all pairs.
Quadratic in 2-D: exact fit to all triples.
Conic sections: ellipse fitting,
$(x - \mu)^T A (x - \mu) = 1$
Linearizing: let $x^* = (x_1, x_2, x_1^2, x_2^2, x_1 x_2, 1)$
Ellipse: $a^T x^* = 0$
Exact fit to all subsets of size 5, as in the sketch below.
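A sketch of the elemental conic fit via this linearization: the coefficient vector $a$ is the null vector of the 5 x 6 matrix of lifted points.

```python
# Exact conic through 5 points: a solves M a = 0 for the lifted design M.
import numpy as np

def conic_through(points):
    p = np.asarray(points, dtype=float)       # shape (5, 2)
    x1, x2 = p[:, 0], p[:, 1]
    M = np.column_stack([x1, x2, x1**2, x2**2, x1 * x2, np.ones(5)])
    return np.linalg.svd(M)[2][-1]            # right-singular vector, M a ~ 0
```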
Hyperboloid fitting in 4-D
Epipolar geometry: exact fit to
5 pairs: calibrated cameras (location, rotation, scale);
8 pairs: uncalibrated cameras (focal point, image plane).
A hyperboloid surface in 4-D.
Applications: 2D to 3D, motion.
OUTLIERS = MISMATCHES
