
Sharp statistical tools

Statistics for extremes

Georg Lindgren

Lund University

October 18, 2012


SARMA
Background

Motivation

We want to predict outside the range of observations


Sums, averages and proportions: Normal distribution
- Central limit theorem
- Successes in large experiments
Extremes
- Normal distribution inappropriate
- Bulk of data may be misleading
- Extremes are rare
- Sparsity of data may be compensated by proper (extreme value) theory

Overview of contents

Topics
- Extreme value distributions
- Block maxima
- POT (Peaks Over Threshold)
- Estimation, return period, uncertainties
- Extremes with cyclic or linear trend
- Extremes with other covariates: CO2, NAO, AO, ...
Useful statistics packages
- The R environment: the packages extRemes and evd are needed for the computer experiments
- The WAFO package in Matlab and Python (downloadable from code.google.com)

A typical example: annual maximum of daily precipitation

[Figure: 100 years of yearly maximum of daily precipitation in Fort Collins; precipitation (in) against year, 1900 to 2000.]



A second typical example

Monthly maximum precipitation values give 12 times as many data, but the values depend on the season

[Figure: precipitation (in) against month, 1 to 12.]

Some basic facts about statistical extremes

Notation: M_n = max(X_1, X_2, ..., X_n) is the maximum of n observations of a random quantity.
Example: X_k = maximum precipitation in year k
Problem: Find the probability Prob(M_n > x) for large n and reasonable x, in particular when x > largest observed value!

Return period – return level

X_1, X_2, ... is a sequence of random quantities with common distribution, e.g. one observation per year (a mean, a yearly maximum, ..., anything).
The "100-year return value" x_100 is the value that is exceeded on average once in 100 years. This is the same as

$$\mathrm{Prob}(X_1 > x_{100}) = 1/100$$

and it gives

$$\mathrm{Prob}(M_{100} > x_{100}) = 1 - \left(1 - \frac{1}{100}\right)^{100} \approx 1 - 1/e = 0.63.$$

The probability that the 100-year return level is exceeded at least once during a 100-year period is 63%. The 1000-year value is exceeded with the same probability at least once during 1000 years, etc.
IF YEARS ARE INDEPENDENT OF EACH OTHER. – USUALLY,
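
The 63% figure can be checked numerically. The lines below are a small sketch in base R, assuming independent years as the slide does; the variable names are chosen here for illustration only.

```r
# Probability that the N-year return level is exceeded at least once
# during N independent years: 1 - (1 - 1/N)^N
N <- c(10, 100, 1000)
p_exceed <- 1 - (1 - 1 / N)^N

# Limiting value for large N
limit <- 1 - exp(-1)

print(round(cbind(N, p_exceed), 3))
print(round(limit, 3))
```

The probability depends only weakly on N, which is why the same 63% applies to the 100-year and the 1000-year values.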

Wave data from the Pacific Ocean

582 measurements of Hs = "significant wave height" at buoy 46005 in the Pacific. About 14 data per month, Dec-Feb, during 7 years.

[Figure: time series of Hs against observation number t, 0 to 600.]

Histogram of Pacific wave height

[Figure, two panels. Left: histogram of Hs. Right: CDF F(x) of Hs exceedances over 7 m, together with the CDF of an exponential distribution shifted by 7 m.]

Conclusion: the tail of the distribution has a more "standardized" distribution than the central part.
GEV and GPD

Goals for statistical extreme value analysis

- One-dimensional: From a series of observations (often in time), estimate the probability of high values, of the order of the largest observation; OR HIGHER, even MUCH HIGHER
- Make a statement about how uncertain the prediction is
- Identify possible non-stationarity in extremes, which can be different from non-stationarity in averages or standard deviation
- Find covariates that explain the occurrence of extreme values
- Utilize data as much as possible: balance between the number of data available and "bias"
- Multi-dimensional: find probabilities of combinations of extreme (or not so extreme) values in two or more series (research is going on)
- Extremes in random fields (even more research is going on)

The GEV and the GPD

Block maxima: Take one extreme value per time unit (e.g. day, month, year, ...). Ideal situation: stationary conditions, no trend.
Exceedance analysis: Take all values over a certain threshold. This gives more data, and data that are more representative of the global maximum than the bulk data.
Statistical extreme value analysis relies on two mathematical facts:
- Block maxima: The maximum of many observations has (approximately) a GEV distribution.
- Exceedance analysis: The exceedances over a high threshold have (approximately) a GPD distribution.

The "Extremal types theorem"

An "almost true" theorem: When n is large, the distribution of the maximum M_n is a GEV = Generalized Extreme Value distribution:

$$\mathrm{Prob}(M_n \le x) \approx \exp\left\{ -\left(1 + \xi\,\frac{x - \mu}{\psi}\right)_+^{-1/\xi} \right\}$$

where ξ = shape (type), ψ = scale, µ = location
ξ < 0 = an upper limit exists, ξ > 0 = a lower limit exists,
ξ = 0 = "Gumbel distribution", no limits exist
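
The formula is easy to evaluate directly. Below is a minimal base R sketch of the GEV distribution function; `pgev_sketch` is a hypothetical helper name (not a function from extRemes or evd), and the parameter values in the first call are the Fort Collins estimates quoted later in these slides.

```r
# GEV cdf with location mu, scale psi and shape xi.
# pgev_sketch is a hypothetical helper, not a package function.
pgev_sketch <- function(x, mu, psi, xi) {
  z <- (x - mu) / psi
  if (abs(xi) < 1e-8) {
    return(exp(-exp(-z)))        # Gumbel case, xi = 0
  }
  t <- pmax(1 + xi * z, 0)       # the "+" subscript: positive part
  exp(-t^(-1 / xi))
}

# Heavy tail (xi > 0): cdf at a few precipitation levels (in)
print(pgev_sketch(c(1, 2, 3), mu = 1.35, psi = 0.53, xi = 0.17))

# Light tail (xi < 0): upper limit at mu - psi/xi = 2, so the cdf is 1 beyond it
print(pgev_sketch(10, mu = 0, psi = 1, xi = -0.5))
```

The positive part takes care of the limits: for ξ < 0 the cdf equals 1 above the upper limit, for ξ > 0 it equals 0 below the lower limit.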

Light, Gumbel, or Heavy tails

[Figure, three rows of panels, each with a Gumbel probability plot of −log(−log(F)) against X: light tail (ξ < 0), "double" tail (ξ = 0), heavy tail (ξ > 0).]

The "Tail theorem"

Another "almost true" theorem: Almost all statistical distributions have a "tail" that is GPD, that is, a Generalized Pareto Distribution. Take only the observations that are above a fixed threshold u and let Y = X − u be the exceedance.

$$\mathrm{Prob}(Y \le y) \approx 1 - \left(1 + \xi\,\frac{y}{\sigma}\right)_+^{-1/\xi}$$

where ξ = shape (type), σ = scale, (µ = location = 0).
ξ = 0 = exponential tail, ξ > 0 = heavy tail, ξ < 0 = limited tail
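
To see what the three tail types mean in numbers, here is a small base R sketch of the GPD distribution function. `pgpd_sketch` is a hypothetical helper name, and the scale σ = 1 and the shape values ±0.2 are arbitrary illustration values, not taken from any data set in the slides.

```r
# GPD cdf for exceedances y >= 0, with scale sigma and shape xi.
# pgpd_sketch is a hypothetical helper, not a package function.
pgpd_sketch <- function(y, sigma, xi) {
  y <- pmax(y, 0)
  if (abs(xi) < 1e-8) {
    return(1 - exp(-y / sigma))                 # exponential tail, xi = 0
  }
  1 - pmax(1 + xi * y / sigma, 0)^(-1 / xi)     # positive part gives the limit for xi < 0
}

# Probability of an exceedance larger than y = 4 for the three tail types
xi_values <- c(limited = -0.2, exponential = 0, heavy = 0.2)
tail_prob <- sapply(xi_values, function(xi) 1 - pgpd_sketch(4, sigma = 1, xi = xi))
print(signif(tail_prob, 3))
```

With the same scale, the probability of a large exceedance grows quickly with ξ, which is why the shape estimate matters so much for extrapolation.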

GPD-tail in normal distribution

Normal tail is GPD with ξ = 0

[Figure, two panels. Top: normal distribution. Bottom: the tail > 2 of a normal distribution, F(x) against x; red = empirical cdf of exceedances over 2, blue = estimated GPD.]
Block maxima

Block maxima with ”block” = one year

Many observations over a year (hourly, semi-daily, daily): take the maximum. Assume a GEV distribution for the yearly maximum.
Estimate the parameters in the GEV by the "maximum likelihood" method in the R package extRemes (or WAFO):
fit <- gev.fit(data)
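
To show what "maximum likelihood" means here, the sketch below minimizes the GEV negative log-likelihood with base R's `optim`. It is not the extRemes implementation: the function name `gev_nll`, the simulated data and the starting values are all assumptions made for this illustration, and the case ξ = 0 is simply excluded.

```r
# Negative log-likelihood of the GEV, par = (mu, psi, xi).
# Hypothetical helper; assumes xi != 0 and returns a large
# penalty value outside the parameter space.
gev_nll <- function(par, x) {
  mu <- par[1]; psi <- par[2]; xi <- par[3]
  if (psi <= 0 || abs(xi) < 1e-6) return(1e10)
  t <- 1 + xi * (x - mu) / psi
  if (any(t <= 0)) return(1e10)
  length(x) * log(psi) + (1 + 1 / xi) * sum(log(t)) + sum(t^(-1 / xi))
}

# Simulated "yearly maxima" obtained by inverting the GEV cdf
set.seed(1)
mu0 <- 1.35; psi0 <- 0.53; xi0 <- 0.17
u <- runif(2000)
x <- mu0 + psi0 / xi0 * ((-log(u))^(-xi0) - 1)

# Nelder-Mead search from crude starting values
start <- c(mean(x), sd(x), 0.1)
fit_sketch <- optim(start, gev_nll, x = x)
print(round(fit_sketch$par, 2))
```

With 2000 simulated values the estimates should land close to the parameters used in the simulation; with 100 real yearly maxima the uncertainty is much larger, as the standard errors on the later slides show.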

100 years of precipitation in Fort Collins, CO

Fort Collins, CO, daily precipitation: http://ccc.atmos.colostate.edu/~odie/rain.html
- Time series of daily precipitation, 1900-1999
- Strong annual cycle: wet in late spring, dry in winter
- No long-term trend
- Recent flood, July 28, 1997

Annual daily maximum precipitation, Fort Collins

[Figure: annual maximum of daily precipitation (in) against year, 1900 to 2000.]

Seasons – can be handled by making separate analyses for each month

[Figure: precipitation (in) against month, 1 to 12.]

Some R-code

# Read the precipitation data (columns used: Year, Mn, Prec); Prec/100 gives inches
data <- read.table(file="FtCprec.prn", header=TRUE)
plot(data$Mn, data$Prec/100, ylab="Precipitation (in)", xlab="month")

# Yearly maxima computed from the data
YearlyMax <- aggregate(Prec ~ Year, data=data, max)
YearlyMax[,2] <- YearlyMax[,2]/100
plot(YearlyMax, type='l', col='blue')

# Fit a GEV to the annual maxima and show diagnostic plots
library(extRemes)
ftcanmax <- read.table(file="Ftcanmax.prn", header=TRUE)
fit <- gev.fit(ftcanmax$Prec/100)
gev.diag(fit)

Estimated parameter values + standard error

Parameter     Estimate   Standard error
Location µ    1.35       0.062
Scale ψ       0.53       0.049
Shape ξ       0.17       0.092

Is the GEV a Gumbel distribution?

Is the shape parameter ξ = 0.174 > 0 really significantly different from 0?
This can be tested by a likelihood ratio test:

$$\mathrm{Dev} = -2 \log \frac{\text{max likelihood under restricted model } (\xi = 0)}{\text{max likelihood under full model}}$$

If the restricted model is correct, Dev has approximately a chi-squared distribution;
d.f. = # parameters in full model − # parameters in restricted model
Test by extRemes (ξ = 0 = Gumbel):
fit0 <- gum.fit(ftcanmax$Prec/100)
Dev <- 2*(fit0$nllh - fit$nllh)
pval <- pchisq(Dev, 1, lower.tail=FALSE)   # = 0.038
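
For comparison, the same question can be asked with a rough Wald test, using only the shape estimate and standard error reported on these slides. This is a swapped-in check, not the likelihood ratio test above; the second part back-calculates the deviance that corresponds to the reported p-value 0.038.

```r
# Shape estimate and standard error as reported in the slides
xi_hat <- 0.174
se_xi  <- 0.092

# Wald test of xi = 0 (rough alternative to the likelihood ratio test)
z <- xi_hat / se_xi
p_wald <- 2 * pnorm(-abs(z))

# Deviance implied by the reported likelihood ratio p-value (1 d.f.)
dev_implied <- qchisq(0.038, df = 1, lower.tail = FALSE)

print(round(c(z = z, p_wald = p_wald, dev_implied = dev_implied), 3))
```

The Wald p-value comes out slightly above 0.05 while the likelihood ratio test gives 0.038, so the two tests land on opposite sides of the 5% level. The evidence for ξ > 0 is not overwhelming either way.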

A 95% confidence interval for the shape estimate is (0.009, 0.369)

gev.profxi(fit, fit$mle[3] -0.25, fit$mle[3] +0.25)

[Figure: profile log-likelihood of the shape parameter.]

Diagnostics and conclusions

We want return values for N = 10, 100, 1000 years in GEV:

$$x_N = \mu - \frac{\psi}{\xi}\left\{ 1 - \left(-\ln(1 - 1/N)\right)^{-\xi} \right\}$$

Estimated standard errors give confidence limits for return levels.
gev.diag(fit)
An example for the exercises:
fit.rl <- return.level(fit, rperiods=c(10,100), conf = 0.05)
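
The return level formula can be evaluated by hand from the estimated parameters. The sketch below uses the rounded Fort Collins estimates from the parameter table (µ = 1.35, ψ = 0.53, ξ = 0.17); `gev_return_level` is a hypothetical helper name, and the result is a point estimate only, without confidence limits.

```r
# N-year return level in a GEV with xi != 0 (hypothetical helper)
gev_return_level <- function(N, mu, psi, xi) {
  mu - psi / xi * (1 - (-log(1 - 1 / N))^(-xi))
}

mu <- 1.35; psi <- 0.53; xi <- 0.17   # rounded estimates from the slides
N <- c(10, 100, 1000)
x_N <- gev_return_level(N, mu, psi, xi)

# Consistency check: the GEV cdf at x_N should equal 1 - 1/N
cdf_at_xN <- exp(-(1 + xi * (x_N - mu) / psi)^(-1 / xi))

print(round(cbind(N, x_N, cdf_at_xN), 3))
```

The 100-year level is about 5 inches, which is above every value in the 100-year record shown earlier.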

GEV summary

[Figure, four diagnostic panels: probability plot (model against empirical), quantile plot (empirical against model), return level plot (return level against return period, log scale), density plot (f(z) against z).]

Trends in extremes – Water level in the Japan Sea

[Figure, two panels against time (h), 0 to 25. Top: water level (m). Bottom: maximum 5 min water level (m).]

Alt I: Normalize (simplistic)

Take the average and standard deviation for each 5-minute period, subtract and divide. Take the maximum over each 5-minute period, and fit a GEV.
Diagnostic plot:

[Figure, four panels: probability plot, density plot, residual quantile plot and residual probability plot, model (gev) against empirical. Fit method: PWM, fit p-value: 1.00.]

Alt II: fit with trend

Stationary GEV, parameters (µ, ψ, ξ):
yura5fit <- gev.fit(y5max[,1])
Est.: 13.362742 0.7083573 -0.1156161
S.e.: 0.0453072 0.0307655 0.0249203
Linear trend in location (mul=1), parameters (µ_0, µ_1, ψ, ξ):
yura5fit.mul <- gev.fit(y5max[,1], ydat=as.matrix(1:length(y5max[,1])), mul=1)
Est.: 13.90486754 -0.0036693 0.59705809 -0.0564354
S.e.: 0.06510006 0.00036877 0.02695204 0.03153350
Linear trend in scale (sigl=1), parameters (µ, ψ_0, ψ_1, ξ):
yura5fit.sigl <- gev.fit(y5max[,1], ydat=as.matrix(1:length(y5max[,1])), sigl=1)
Est.: 13.337740339 0.406178243 0.001745658 0.098749706
S.e.: 0.04233340 0.02916609 0.00004856 0.05472873
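
A quick way to read this output is to divide each trend coefficient by its standard error. The sketch below does that with the numbers printed above; it is a rough Wald-type check chosen for this illustration, and a likelihood ratio test between the fitted models would be the more careful comparison.

```r
# Trend estimates and standard errors as printed in the slides
mu_slope  <- c(est = -0.0036693,   se = 0.00036877)  # location trend (mul = 1)
psi_slope <- c(est =  0.001745658, se = 0.00004856)  # scale trend (sigl = 1)

# Ratio estimate / standard error (hypothetical helper)
wald_z <- function(p) unname(p["est"] / p["se"])

z_mu  <- wald_z(mu_slope)
z_psi <- wald_z(psi_slope)
print(round(c(z_mu = z_mu, z_psi = z_psi), 2))
```

Both ratios are far outside ±1.96, so neither trend coefficient looks like noise under the usual normal approximation.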
Exceedance analysis

The dilemma of statistical extremes

We want to predict events that have never been observed!
From 20 years of data, can one say something about the 100-year return value?

20 years of monthly data

[Figure: 20 years of monthly data, values between 0 and 5 against observation number.]

Use more high values

Waste of data to use only yearly maximum
- 20 years of data = 240 monthly data
- The smallest yearly maximum (year 7) is X_7 = 1.67. There are 42 monthly values greater than 1.67
- Can one use all 42?
- Or all those 48 greater than 1.5?
- Or the 84 values greater than 1?

High values are rare and occur randomly

Take a reasonably high level u (try many!)
Estimate the rate at which this level is exceeded per time unit (e.g. per year), λ = λ_u:

$$\lambda^* = \frac{\text{Observed number of exceedances over } u}{\text{Total observation time}}$$

u = 1.5 gives λ*_1.5 = 48/20 = 2.4
Reasonable to assume that N = the number of exceedances over 1.5 in any year has a Poisson distribution with mean λ = 2.4:

$$\mathrm{Prob}(N = k) = e^{-\lambda} \lambda^k / k!$$
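
With the estimated rate, the Poisson probabilities follow directly from base R's `dpois`. The lines below are a small illustration using the numbers on this slide (48 exceedances in 20 years); the variable names are chosen here.

```r
# Estimated exceedance rate per year for u = 1.5
lambda_hat <- 48 / 20

# Probability of k exceedances in a year, k = 0, ..., 6
k <- 0:6
p_k <- dpois(k, lambda_hat)

# Probability of a year with no exceedance at all
p_none <- exp(-lambda_hat)

print(round(rbind(k, p_k), 3))
print(round(p_none, 3))
```

About 9% of the years are expected to have no value above 1.5, so most years contribute at least one exceedance to the analysis.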

Generalized Pareto distribution - GPD

Exceedances over a high threshold are more representative of the global extremes than the bulk data
"Almost all distributions" have a Generalized Pareto tail, GPD
With Y = X − u = exceedance over level u:

$$\mathrm{Prob}(Y \le y) \approx 1 - \left(1 + \xi\,\frac{y}{\sigma}\right)_+^{-1/\xi}$$

Exponential tail: ξ = 0; Heavy tail: ξ > 0; Limited tail: ξ < 0

GPD-tail in normal distribution

Normal tail is GPD with ξ = 0

[Figure, two panels. Top: normal distribution. Bottom: the tail > 2 of a normal distribution, F(x) against x; red = empirical cdf of exceedances over 2, blue = estimated GPD.]

Poisson + GPD = GEV

N = number of exceedances Y_j = X_j − u over u is (approximately) Poisson distributed with expectation λ
The sizes of the exceedances Y_1, ..., Y_N have (approximately) a GPD
If M = yearly maximum = u + max(Y_1, ..., Y_N), then for x > u:

$$\mathrm{Prob}(M \le x) = \mathrm{Prob}(N = 0) + \sum_{n=1}^{\infty} \mathrm{Prob}(N = n,\ Y_1, \ldots, Y_n \le x - u) = \ldots = \exp\left\{ -\lambda \left(1 + \xi\,\frac{x - u}{\sigma}\right)_+^{-1/\xi} \right\} \qquad (1)$$
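
The omitted steps can be filled in as follows. This is a sketch under two added assumptions that the slide does not state explicitly: N is independent of the exceedances, and the exceedances are independent with common GPD distribution function G.

```latex
\begin{align*}
\mathrm{Prob}(M \le x)
  &= \sum_{n=0}^{\infty} e^{-\lambda}\,\frac{\lambda^{n}}{n!}\; G(x-u)^{n}
   = e^{-\lambda}\, e^{\lambda G(x-u)}
   = \exp\bigl\{-\lambda\,\bigl(1 - G(x-u)\bigr)\bigr\}, \\
1 - G(x-u)
  &= \left(1 + \xi\,\frac{x-u}{\sigma}\right)_+^{-1/\xi}.
\end{align*}
```

The term n = 0 in the sum is Prob(N = 0), and inserting the GPD tail in the last expression gives (1).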

Poisson + GPD = GEV, contd

(1) is a GEV distribution

$$\mathrm{Prob}(M \le x) = \exp\left\{ -\left(1 + \xi\,\frac{x - \mu}{\psi}\right)_+^{-1/\xi} \right\}$$

Translation from Poisson+GPD to GEV:

$$\psi = \sigma \lambda^{\xi}, \qquad \mu = u + \frac{\psi - \sigma}{\xi}$$

To get the maximum over n years, replace λ with nλ in (1)
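
The translation can be verified numerically. In the sketch below λ = 2.4 and u = 1.5 are taken from the earlier slide, while the GPD parameters σ = 0.8 and ξ = 0.1 are hypothetical values chosen only for this check.

```r
lambda <- 2.4; u <- 1.5      # rate and threshold from the slides
sigma <- 0.8; xi <- 0.1      # hypothetical GPD scale and shape

# Translation from Poisson + GPD to GEV parameters
psi <- sigma * lambda^xi
mu  <- u + (psi - sigma) / xi

# Compare formula (1) with the GEV cdf on a grid above the threshold
x <- seq(u + 0.1, 6, by = 0.5)
cdf_pot <- exp(-lambda * (1 + xi * (x - u) / sigma)^(-1 / xi))
cdf_gev <- exp(-(1 + xi * (x - mu) / psi)^(-1 / xi))

max_diff <- max(abs(cdf_pot - cdf_gev))
print(max_diff)
```

The two expressions agree up to rounding error, so the Poisson rate and the GPD parameters above a threshold determine the distribution of the yearly maximum.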



Choice of threshold

How to choose u? Note: assume GPD above threshold u
Diagnostic: A GPD has linear mean excess

$$E(X - u \mid X > u) = \frac{\sigma + \xi u}{1 - \xi}$$

Plot the mean of all excesses over u as a function of u. Take u as the smallest threshold for which the curve to the right is "linear"
The slope is ξ/(1 − ξ) if ξ < 1.
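
The empirical version of the mean excess is simple to compute. The sketch below uses simulated exponential data, an assumption made for illustration: the exponential distribution has ξ = 0 and σ = 1, so the mean excess should be roughly constant at 1 for every threshold. `mean_excess` is a hypothetical helper name.

```r
# Empirical mean excess over each threshold in u (hypothetical helper)
mean_excess <- function(x, u) {
  sapply(u, function(ui) mean(x[x > ui] - ui))
}

# Simulated data with exponential tail: xi = 0, sigma = 1
set.seed(1)
x <- rexp(5000, rate = 1)

u_grid <- seq(0, 2, by = 0.25)
me <- mean_excess(x, u_grid)
print(round(cbind(u_grid, me), 2))
# plot(u_grid, me, type = "b") would give the mean excess plot
```

For high thresholds few observations remain and the estimate becomes noisy, which is one reason the plot can be difficult to read in practice.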

Mean excess plot

Plot of E(X − u | X > u) for 20 years of monthly data:

[Figure: mean exceedance over threshold against threshold u, 1 to 4.]

Diagnostics in GPD analysis

A plot of mean excess can be hard to interpret
Alternative: Estimate a full GPD for different thresholds
If the tail above u_0 is GPD then all exceedances over u > u_0 are also GPD
- with the same shape parameter ξ
- but with a different scale parameter

$$\sigma_u = \sigma_{u_0} + \xi\,(u - u_0)$$

"Modified scale" = σ_u − ξ u should be constant, independent of u, when GPD is appropriate
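
The threshold stability can be checked directly from the GPD tail formula. The parameter values below (σ at u_0 = 1 equal to 0.8, ξ = 0.1, new threshold u = 2) are hypothetical, and `sgpd` is a helper defined here for the GPD survival function.

```r
# GPD survival function for xi != 0, valid while 1 + xi*y/sigma > 0
sgpd <- function(y, sigma, xi) (1 + xi * y / sigma)^(-1 / xi)

sigma0 <- 0.8; xi <- 0.1; u0 <- 1   # hypothetical tail above u0
u <- 2                              # a higher threshold
sigma_u <- sigma0 + xi * (u - u0)   # scale predicted for threshold u

# Exceedances over u, given that u is exceeded, starting from the tail above u0
y <- seq(0, 5, by = 0.5)
cond_surv <- sgpd(y + (u - u0), sigma0, xi) / sgpd(u - u0, sigma0, xi)
max_diff <- max(abs(cond_surv - sgpd(y, sigma_u, xi)))

# The modified scale is the same at both thresholds
modified_scale <- c(sigma0 - xi * u0, sigma_u - xi * u)
print(max_diff)
print(modified_scale)
```

In an analysis one would plot the estimated shape and modified scale against the threshold and look for the region where both are roughly flat.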

Estimated CDF for yearly maximum

From 20 yearly maxima: c = 0.14, µ = 2.81, a = 0.77

[Figure: empirical and GEV estimated cdf (PWM method), F(x) against x, 0 to 5, showing the true CDF for the yearly maximum and the CDF for the estimated GEV.]

Estimated CDF by POT method

84 exceedances over u = 1 and GPD estimate gives ξ = 0.04, b = 2.38, a = 0.93

[Figure: tail probability (log scale, 10^0 down to 10^-4) against x, 2 to 10: tail by POT method, true CDF, and tail by direct GEV estimation.]

Fatalities in English coal mines

Time and number of fatalities 1861 - 1962

[Figure: number of fatalities (0 to 450) against year, 1860 to 1960.]

GEV?

We try GEV on all data (not really motivated!):

[Figure: empirical and GEV estimated cdf (PWM method), F(x) against x, 0 to 500.]

Consider only big accidents – POT

There are 25 accidents with > 100 dead. Fit GPD to data > 100. 10% of these exceed 350.

[Figure: CDF for deaths > 100 and fitted GPD, F(x) against x, 100 to 600.]

Vargas, Venezuela, catastrophe 1999

On December 15, 1999, the Vargas province was hit by 410 mm rain.

Rain in Venezuela - GEV or Gumbel?

Gumbel distribution is GEV with shape parameter ξ = 0. Based on rain statistics 1951-1998 the maximum daily rain distribution was estimated with GEV: scale = 19.9, location = 49.2, shape = 0.16.
The shape parameter ξ is not significantly different from 0, so Gumbel could "perhaps" be used instead of GEV. The estimated 10000-year return value for daily rain is then x_10000 = 249 mm. In December 1999 the Vargas province was hit by 410 mm rain during one day.
From the full GEV the estimate of the 10000-year value would have been x_10000 = 468 mm, much closer to the real value. A 95% one-sided confidence interval is x_10000 < 1030 mm.
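
The GEV value can be reproduced from the stated parameters with the return level formula for the GEV, x_N = µ − (ψ/ξ){1 − (−ln(1 − 1/N))^(−ξ)}. The lines below are a sketch using the rounded estimates quoted above; the Gumbel value of 249 mm comes from a separate Gumbel fit whose parameters are not given in the slides, so it is not recomputed here.

```r
# GEV estimates for maximum daily rain, as quoted in the slides (mm)
mu <- 49.2; psi <- 19.9; xi <- 0.16

# 10000-year return level from the GEV return level formula
N <- 10000
x_gev <- mu - psi / xi * (1 - (-log(1 - 1 / N))^(-xi))
print(round(x_gev))   # close to the 468 mm quoted above
```

A shape parameter that is "not significantly different from 0" still almost doubles the estimated 10000-year value, which is the point of the example.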
Multivariate extremes

General recipe for bivariate extremes


Assume, for a sequence of days, X_1, X_2, ... is the wind speed and Y_1, Y_2, ... is the wave height measured at a North Sea platform.
The platform is designed for extreme waves. It is also designed for extreme winds. What about the combined effect of wind and waves (and current)?
Make marginal EVA and transform x and y to standard scale. Plot pairs of transformed extremes to examine joint extreme behavior. Three main types of dependence/non-dependence:
- Extremes come together
- Extremes just add
- Extremes come alone

[Figure: the three types illustrated in a plot with both axes from 0 to 2.]
Some literature

Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J. (2004), Statistics of Extremes: Theory and Applications. Wiley.
Coles, S. (2001), An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag.
Gilleland, E., Katz, R., Young, G. (Feb. 2012), Package 'extRemes'.
Nelsen, R.B. (2006), An Introduction to Copulas. Springer.
The WAFO group (2011), WAFO - a MATLAB toolbox for analysis of random waves and loads. Lund univ. http://www.maths.lth.se/matstat/wafo/ and http://code.google.com/p/wafo/
