Sharp Statistical Tools Statistics For Extremes: Georg Lindgren

Sharp statistical tools
Statistics for extremes
Georg Lindgren
Lund University
October 18, 2012

SARMA
Background
Motivation
We want to predict outside the range of observations

Sums, averages and proportions – Normal distribution
Central limit theorem
Successes in large experiments
Extremes
Normal distribution inappropriate
Bulk of data may be misleading
Extremes are rare
Sparsity of data may be compensated for by proper (extreme value) theory
Background
Overview of contents
Topics
Extreme value distributions
Block maxima
POT - Peaks Over Threshold
Estimation, return period, uncertainties
Extremes with cyclic or linear trend
Extremes with other covariates – CO2, NAO, AO, ...
Useful statistics packages
The R environment - need packages extRemes and evd for the computer experiments
The WAFO package in Matlab and Python (downloadable from code.google.com)
Background
A typical example: annual maximum of daily precipitation

[Figure: 100 years of yearly maximum of daily precipitation in Fort Collins; precipitation (in) against year, 1900 to 2000.]

Background
A second typical example
Monthly maximum precipitation values give 12 times as many data, but they vary with the season.
[Figure: monthly maximum precipitation (in) against month, 1 to 12.]
Background
Some basic facts about statistical extremes
Notation: M_n = max(X_1, X_2, ..., X_n) is the maximum of n observations of a random quantity.
Example: X_k = maximum precipitation in year k
Problem: Find the probability Prob(M_n > x) for large n and reasonable x, in particular when x > largest observed data!
Background
Return period – return level
X_1, X_2, ... is a sequence of random quantities with a common distribution, e.g. one observation per year (a mean, a yearly maximum, ..., anything).
The "100-year return value" x_100 is the value that is exceeded on average once in 100 years. This is the same as
\[ \mathrm{Prob}(X_1 > x_{100}) = 1/100, \]
which gives
\[ \mathrm{Prob}(M_{100} > x_{100}) = 1 - \left(1 - \frac{1}{100}\right)^{100} \approx 1 - 1/e = 0.63. \]
The probability that the 100-year return level is exceeded at least once during a 100-year period is 63%. The 1000-year value is exceeded with the same probability at least once during 1000 years, etc.
IF YEARS ARE INDEPENDENT OF EACH OTHER. – USUALLY,
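As a quick numerical check of the 63% figure, the following base-R sketch evaluates 1 − (1 − 1/N)^N for a few return periods, assuming independent years. The variable names are illustrative only.

```r
# Probability that the N-year return level is exceeded at least once
# in N independent years: 1 - (1 - 1/N)^N
N <- c(10, 100, 1000)
p_exceed <- 1 - (1 - 1/N)^N

# Limiting value as N grows
p_limit <- 1 - exp(-1)

print(round(p_exceed, 3))
print(round(p_limit, 3))
```

The probabilities depend only weakly on N and approach 1 − 1/e, which is why the same statement holds for the 1000-year value.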
Background
Wave data from the Pacific Ocean
582 measurements of Hs = "significant wave height" at buoy 46005 in the Pacific. About 14 data per month, Dec-Feb, during 7 years.
[Figure: Hs against observation number t, t from 0 to 600.]
Background
Histogram of Pacific wave height

[Figure, two panels: histogram of Hs; CDF F(x) of Hs exceedances over 7 m, compared with the CDF of an exponential distribution shifted by 7 m.]
Conclusion: the tail of the distribution has a more "standardized" distribution than the central part.
GEV and GPD
Goals for statistical extreme value analysis
One-dimensional: From a series of observations (often in time), estimate the probability of high values, of the order of the largest observation, OR HIGHER, even MUCH HIGHER
Make a statement about how uncertain the prediction is
Identify possible non-stationarity in extremes – can be different from
non-stationarity in averages or standard deviation
Find covariates that explain the occurrence of extreme values
Utilize data as much as possible – balance between number of data
available and ”bias”
Multi-dimensional: find probabilities of combinations of extreme (or
not so extreme) values in two or more series (research is going on)
Extremes in random fields (even more research is going on)
GEV and GPD
The GEV and the GPD
Block maxima: Take one extreme value per time unit (e.g. day, month, year, ...). Ideal situation: stationary conditions, no trend.
Exceedance analysis: Take all values over a certain threshold. This gives more data, and these data are more representative of the global maximum than the bulk data.
Statistical extreme value analysis relies on two mathematical facts:
Block maxima: The maximum of many observations has a GEV distribution.
Exceedance analysis: The exceedances have a GPD.
GEV and GPD
The ”Extremal types theorem”
An "almost true" theorem: When n is large, the distribution of the maximum M_n is a GEV = Generalized Extreme Value distribution:
\[ \mathrm{Prob}(M_n \le x) \approx \exp\left\{ -\left(1 + \xi\,\frac{x - \mu}{\psi}\right)_{+}^{-1/\xi} \right\} \]
where ξ = shape (type), ψ = scale, µ = location

ξ < 0 = an upper limit exists, ξ > 0 = a lower limit exists,
ξ = 0 = ”Gumbel distribution”, no limits exist
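To make the formula concrete, here is a minimal base-R sketch of the GEV distribution function, with the Gumbel case handled as the limit ξ → 0. The function name pgev_sketch is hypothetical (it is not part of extRemes), and the parameter values are arbitrary illustration values.

```r
# GEV cdf: exp(-(1 + xi*(x - mu)/psi)_+^(-1/xi)); Gumbel limit when xi = 0
pgev_sketch <- function(x, mu, psi, xi) {
  z <- (x - mu) / psi
  if (abs(xi) < 1e-12) {
    return(exp(-exp(-z)))          # Gumbel distribution, no limits
  }
  t <- pmax(1 + xi * z, 0)         # the "+" subscript: positive part
  exp(-t^(-1 / xi))
}

# Arbitrary illustration values
x <- c(-5, 0, 1, 5)
light  <- pgev_sketch(x, mu = 0, psi = 1, xi = -0.5)  # upper limit at mu - psi/xi = 2
gumbel <- pgev_sketch(x, mu = 0, psi = 1, xi = 0)
heavy  <- pgev_sketch(x, mu = 0, psi = 1, xi = 0.5)   # lower limit at mu - psi/xi = -2
```

With ξ < 0 the cdf reaches 1 at the upper limit, and with ξ > 0 it is 0 below the lower limit, which matches the description of the three types.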
GEV and GPD
Light, Gumbel, or Heavy tails
[Figure, three rows of panels including Gumbel probability plots of −log(−log(F)) against X: light tail (ξ < 0), "Double" tail (ξ = 0), heavy tail (ξ > 0).]
GEV and GPD
The ”Tail theorem”
Another "almost true" theorem: Almost all statistical distributions have a "tail" that is GPD = Generalized Pareto Distribution. Take only observations that are above a fixed threshold u and let Y = X − u be the exceedance. Then
\[ \mathrm{Prob}(Y \le y) \approx 1 - \left(1 + \xi\,\frac{y}{\sigma}\right)_{+}^{-1/\xi} \]
where ξ = shape (type), σ = scale, (µ = location = 0).
ξ = 0 = exponential tail, ξ > 0 = heavy tail, ξ < 0 = limited tail
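The three tail types can be compared numerically with a small base-R sketch of the GPD distribution function. The name pgpd_sketch is hypothetical and the values σ = 1, ξ = ±0.5 are arbitrary; the exponential case is coded as the limit ξ → 0.

```r
# GPD cdf for exceedances y >= 0: 1 - (1 + xi*y/sigma)_+^(-1/xi)
pgpd_sketch <- function(y, sigma, xi) {
  if (abs(xi) < 1e-12) {
    return(1 - exp(-y / sigma))    # exponential tail
  }
  t <- pmax(1 + xi * y / sigma, 0) # positive part
  1 - t^(-1 / xi)
}

y <- c(0.5, 1, 2, 4)
limited     <- pgpd_sketch(y, sigma = 1, xi = -0.5)  # tail ends at -sigma/xi = 2
exponential <- pgpd_sketch(y, sigma = 1, xi = 0)
heavy       <- pgpd_sketch(y, sigma = 1, xi = 0.5)

# Tail probabilities 1 - F(y): heavier tails decay more slowly
tail_prob <- rbind(limited = 1 - limited,
                   exponential = 1 - exponential,
                   heavy = 1 - heavy)
print(round(tail_prob, 4))
```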
GEV and GPD
GPD-tail in normal distribution
Normal tail is GPD with ξ = 0
[Figure, two panels: "Normal distribution" (histogram, x from −4 to 4); "The tail > 2 of a normal distribution", with the empirical cdf F(x) of exceedances over 2 (red) and the estimated GPD (blue).]
Block maxima
Block maxima with ”block” = one year
Many observations over a year (hourly, semi-daily, daily): take the maximum. Assume a GEV distribution for the yearly maximum.
Estimate parameters in GEV by ”Maximum likelihood” method in
R-package extRemes (or WAFO)
fit <- gev.fit(data)
Block maxima
100 years of precipitation in Fort Collins, CO
Fort Collins, CO, daily precipitation: http://ccc.atmos.colostate.edu/~odie/rain.html
Time series of daily precipitation, 1900 – 1999
Strong annual cycle – wet in late spring, dry in winter
No long-term trend
Recent flood, July 28, 1997
Block maxima
Annual daily maximum precipitation, Fort Collins
[Figure: annual maximum of daily precipitation (in) against year, 1900 to 2000.]
Block maxima
Seasons – can be handled by making separate analyses for each month
[Figure: precipitation (in) against month, 1 to 12.]
Block maxima
Some R-code
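# Daily precipitation; Prec is divided by 100 to get inches
data <- read.table(file="FtCprec.prn", header=TRUE)
plot(data$Mn, data$Prec/100, ylab="Precipitation (in)", xlab="month")
# Yearly maximum of daily precipitation
YearlyMax <- aggregate(Prec ~ Year, data=data, max)
YearlyMax[,2] <- YearlyMax[,2]/100
plot(YearlyMax, type='l', col='blue')
# gev.fit() and gev.diag() are assumed to be available after loading extRemes
library(extRemes)
ftcanmax <- read.table(file="Ftcanmax.prn", header=TRUE)
fit <- gev.fit(ftcanmax$Prec/100)   # maximum likelihood GEV fit
gev.diag(fit)                       # diagnostic plots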
Block maxima
Estimated parameter values + standard error
Parameter     Estimate   Standard error
Location µ    1.35       0.062
Scale ψ       0.53       0.049
Shape ξ       0.17       0.092
Block maxima
Is the GEV a Gumbel distribution?
Is the shape parameter estimate ξ = 0.174 > 0 really significantly different from 0?
This can be tested by a likelihood ratio test:
\[ \mathrm{Dev} = -2 \log \frac{\text{max likelihood under restricted model } (\xi = 0)}{\text{max likelihood under full model}} \]
If the restricted model is correct, Dev has approximately a chi-squared distribution with
d.f. = # parameters in full model − # parameters in restricted model
Test by extRemes (ξ = 0 = Gumbel):
fit0 <- gum.fit(ftcanmax$Prec/100)
Dev <- 2*(fit0$nllh - fit$nllh)
pval <- pchisq(Dev, 1, lower.tail=FALSE)   # = 0.038
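The test decision can be checked with base R alone. The sketch below does not refit the models; it only uses the reported p-value 0.038 to back out the implied deviance and compares it with the 5% critical value of the chi-squared distribution with 1 degree of freedom. Variable names are illustrative.

```r
# Full GEV has 3 parameters, Gumbel has 2, so d.f. = 1
dof <- 3 - 2
crit_5pct <- qchisq(0.95, dof)                 # 5% critical value

# Deviance implied by the reported p-value (back-calculated, not refitted)
p_reported <- 0.038
dev_implied <- qchisq(p_reported, dof, lower.tail = FALSE)

# The Gumbel restriction is rejected at the 5% level if Dev > critical value
reject_gumbel <- dev_implied > crit_5pct
print(c(critical = crit_5pct, implied_deviance = dev_implied))
```

A p-value of 0.038 is below 0.05, so at the 5% level the Gumbel restriction is rejected, though not by a wide margin.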
Block maxima
A 95% confidence interval for the shape estimate is (0.009, 0.369)
gev.profxi(fit, fit$mle[3] -0.25, fit$mle[3] +0.25)
[Figure: profile log-likelihood for the shape parameter.]
Block maxima
Diagnostics and conclusions
We want return values for N = 10, 100, 1000 years in GEV:
\[ x_N = \mu - \frac{\psi}{\xi}\left\{ 1 - \left(-\ln(1 - 1/N)\right)^{-\xi} \right\} \]
Estimated standard errors give confidence limits for return levels.
gev.diag(fit)
An example for the exercises:
fit.rl <- return.level(fit, rperiods=c(10,100), conf = 0.05)
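The return level formula can be evaluated directly in base R. The sketch below plugs in the rounded Fort Collins estimates from the parameter table (µ = 1.35, ψ = 0.53, ξ = 0.17); the function name gev_return_level is hypothetical, and the results are point estimates only, without the confidence limits that return.level provides.

```r
# GEV return level: x_N = mu - (psi/xi) * (1 - (-log(1 - 1/N))^(-xi)), xi != 0
gev_return_level <- function(N, mu, psi, xi) {
  y <- -log(1 - 1 / N)
  mu - (psi / xi) * (1 - y^(-xi))
}

# Rounded estimates from the parameter table
mu <- 1.35; psi <- 0.53; xi <- 0.17
N <- c(10, 100, 1000)
x_N <- gev_return_level(N, mu, psi, xi)

# Consistency check: the GEV cdf at x_N should equal 1 - 1/N
cdf_at_xN <- exp(-(1 + xi * (x_N - mu) / psi)^(-1 / xi))
print(round(x_N, 2))
```

With these rounded inputs the 10-year level is about 2.8 in and the 100-year level about 5.0 in, the latter above every value in the observed series, which is the extrapolation the method is meant for.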
Block maxima
GEV summary
[Figure: four diagnostic panels: probability plot (model against empirical), quantile plot (empirical against model), return level plot (return level against return period, 0.1 to 1000), density plot (f(z) against z).]
Block maxima
Trends in extremes – Water level in the Japan Sea
[Figure, two panels: water level (m) against time (h), 0 to 25 h; maximum 5 min water level (m) against time (h).]
Block maxima
Alt I: Normalize (simplistic)
Take average and standard deviation for each 5-minute period, subtract
and divide. Take maximum over 5-minute period, and fit a GPD.
Diagnostic plot:
[Figure: four panels: probability plot F(x) against x, density plot, residual quantile plot and residual probability plot (model (gev) against empirical). Fit method: PWM, Fit p-value: 1.00]
Block maxima
Alt II: fit with trend
# Stationary GEV; estimates in the order location, scale, shape
yura5fit <- gev.fit(y5max[,1])
Est.: 13.362742 0.7083573 -0.1156161
S.e.: 0.0453072 0.0307655 0.0249203
# Linear trend in location (mul=1); order: intercept, slope, scale, shape
yura5fit.mul <- gev.fit(y5max[,1], ydat=as.matrix(1:length(y5max[,1])), mul=1)
Est.: 13.90486754 -0.0036693 0.59705809 -0.0564354
S.e.: 0.06510006 0.00036877 0.02695204 0.03153350
# Linear trend in scale (sigl=1); order: location, intercept, slope, shape
yura5fit.sigl <- gev.fit(y5max[,1], ydat=as.matrix(1:length(y5max[,1])), sigl=1)
Est.: 13.337740339 0.406178243 0.001745658 0.098749706
S.e.: 0.04233340 0.02916609 0.00004856 0.05472873
Exceedance analysis
The dilemma of statistical extremes
We want to predict events that have never been observed!

From 20 years of data, can one say something about the 100-year return value?
Exceedance analysis
20 years of monthly data

[Figure: 20 years of monthly data; values from 0 to 5 against month number.]
Exceedance analysis
Use more high values
Waste of data to use only yearly maximum

20 years of data = 240 monthly data
The smallest yearly maximum (year 7) is X7 = 1.67. There are 42
monthly values greater than 1.67
Can one use all 42?
Or all those 48 greater than 1.5?
Or the 84 values greater than 1?
Exceedance analysis
High values are rare and occur randomly
Take a reasonably high level u (try many!)
Estimate the rate by which this level is exceeded per time unit (e.g. per year), λ = λ_u:
\[ \lambda^{*} = \frac{\text{Observed number of exceedances over } u}{\text{Total observation time}} \]
u = 1.5 gives λ*_{1.5} = 48/20 = 2.4
It is reasonable to assume that N = the number of exceedances over 1.5 in any year has a Poisson distribution with mean λ = 2.4:
\[ \mathrm{Prob}(N = k) = e^{-\lambda} \lambda^{k} / k! \]
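The Poisson model with the estimated rate can be tabulated in base R using dpois. This is only a sketch of the arithmetic with the numbers quoted above (48 exceedances in 20 years); variable names are illustrative.

```r
lambda <- 48 / 20                 # observed exceedances over u = 1.5 per year
k <- 0:6
p_k <- dpois(k, lambda)           # Prob(N = k) = exp(-lambda) * lambda^k / k!

p_none <- dpois(0, lambda)        # a year with no exceedance over 1.5
p_at_least_one <- 1 - p_none

print(round(p_k, 3))
print(round(p_at_least_one, 3))
```

Under this model a year without any exceedance over 1.5 has probability exp(−2.4), so most years are expected to contribute at least one exceedance.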
Exceedance analysis
Generalized Pareto distribution - GPD
Exceedances over a high threshold are more representative of the global extremes than the bulk data
"Almost all distributions" have a Generalized Pareto tail, GPD
With Y = X − u = exceedance over level u:
\[ \mathrm{Prob}(Y \le y) \approx 1 - \left(1 + \xi\,\frac{y}{\sigma}\right)_{+}^{-1/\xi} \]
Exponential tail: ξ = 0; Heavy tail: ξ > 0; Limited tail: ξ < 0
Exceedance analysis
GPD-tail in normal distribution
Normal tail is GPD with ξ = 0
[Figure, two panels: "Normal distribution" (histogram, x from −4 to 4); "The tail > 2 of a normal distribution", with the empirical cdf F(x) of exceedances over 2 (red) and the estimated GPD (blue).]
Exceedance analysis
Poisson + GPD = GEV
N = number of exceedances Y_j = X_j − u over u is (approximately) Poisson distributed with expectation λ
The sizes of the exceedances Y_1, ..., Y_N have (approximately) a GPD
If M = yearly maximum = u + max(Y_1, ..., Y_N), then for x > u:
\[ \mathrm{Prob}(M \le x) = \mathrm{Prob}(N = 0) + \sum_{n=1}^{\infty} \mathrm{Prob}(N = n,\ Y_1, \ldots, Y_n \le x - u) = \ldots = \exp\left\{ -\lambda \left(1 + \xi\,\frac{x - u}{\sigma}\right)_{+}^{-1/\xi} \right\} \qquad (1) \]
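The steps omitted between the sum and the final expression can be sketched as follows, assuming that N is Poisson with mean λ, that the exceedances Y_1, Y_2, ... are independent of N and of each other, and that each has the GPD distribution function F. These independence assumptions are not spelled out on the slide.

```latex
\begin{align*}
\mathrm{Prob}(M \le x)
  &= \sum_{n=0}^{\infty} e^{-\lambda}\,\frac{\lambda^{n}}{n!}\,F(x-u)^{n}
   = e^{-\lambda}\, e^{\lambda F(x-u)}
   = \exp\{-\lambda\,(1 - F(x-u))\},\\
1 - F(x-u) &= \Big(1 + \xi\,\frac{x-u}{\sigma}\Big)_{+}^{-1/\xi}.
\end{align*}
```

The n = 0 term of the sum is Prob(N = 0), and inserting the GPD tail in the last exponent gives (1).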
Exceedance analysis
Poisson + GPD = GEV, contd
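(1) is a GEV distribution
\[ \mathrm{Prob}(M \le x) = \exp\left\{ -\left(1 + \xi\,\frac{x - \mu}{\psi}\right)_{+}^{-1/\xi} \right\} \]
Translation from Poisson+GPD to GEV:
\[ \psi = \sigma \lambda^{\xi}, \qquad \mu = u + \frac{\psi - \sigma}{\xi} \]
To get maximum over n years, replace λ with nλ in (1)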
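The translation can be verified numerically: the Poisson+GPD expression (1) and the GEV form should give the same cdf. In the base-R sketch below, u = 1.5 and λ = 2.4 are taken from the earlier slide, while σ = 0.8 and ξ = 0.1 are made-up illustration values; the function name pot_to_gev is hypothetical, and the translation as written requires ξ ≠ 0.

```r
# Translate Poisson rate + GPD parameters into GEV parameters (xi != 0)
pot_to_gev <- function(u, sigma, xi, lambda) {
  psi <- sigma * lambda^xi
  mu  <- u + (psi - sigma) / xi
  c(mu = mu, psi = psi, xi = xi)
}

# u and lambda from the slides; sigma and xi are made-up illustration values
u <- 1.5; lambda <- 2.4; sigma <- 0.8; xi <- 0.1
gev <- pot_to_gev(u, sigma, xi, lambda)

x <- seq(2, 8, by = 0.5)
# Formula (1): Poisson + GPD
cdf_pot <- exp(-lambda * (1 + xi * (x - u) / sigma)^(-1 / xi))
# GEV form with the translated parameters
cdf_gev <- exp(-(1 + xi * (x - gev[["mu"]]) / gev[["psi"]])^(-1 / xi))
max_diff <- max(abs(cdf_pot - cdf_gev))

# Maximum over n = 10 years: replace lambda by n * lambda
gev_10yr <- pot_to_gev(u, sigma, xi, 10 * lambda)
```

The two cdfs agree up to rounding error, and the 10-year parameters show how location and scale shift when the block length grows.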

Exceedance analysis
Choice of threshold
How to choose u? Note: assume GPD above threshold u
Diagnostic: A GPD has linear mean excess
\[ E(X - u \mid X > u) = \frac{\sigma + \xi u}{1 - \xi} \]
Plot the mean of all excesses over u as a function of u. Take u as the smallest threshold for which the curve to the right is "linear"
The slope is ξ/(1 − ξ) if ξ < 1.
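The empirical mean excess is simple to compute in base R. The sketch below uses a deterministic stand-in for data (quantiles of a standard exponential distribution, i.e. ξ = 0 and σ = 1), for which the mean excess should be flat at about 1; the function name mean_excess and the threshold grid are illustrative choices.

```r
# Empirical mean excess: average of (x - u) over the observations with x > u
mean_excess <- function(x, u) {
  sapply(u, function(ui) mean(x[x > ui] - ui))
}

# Deterministic stand-in for data: quantiles of an exponential distribution
# (xi = 0, sigma = 1), so the mean excess should be flat at about sigma = 1
x <- qexp(ppoints(2000))
u <- seq(0.5, 3, by = 0.5)
me <- mean_excess(x, u)
print(round(me, 3))

# For real data one would plot me against u and look for a linear stretch:
# plot(u, me, type = "b", xlab = "threshold u", ylab = "mean excess")
```

A flat curve corresponds to slope ξ/(1 − ξ) = 0; for real data the estimate becomes noisy at high thresholds, where few observations remain.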
Exceedance analysis
Mean excess plot
Plot of E(X − u | X > u) for 20 years of monthly data:
[Figure: mean exceedance over threshold against threshold u, 1 to 4.]
Exceedance analysis
Diagnostics in GPD analysis
A plot of mean excess can be hard to interpret
Alternative: Estimate a full GPD for different thresholds
If the tail above u_0 is GPD then all exceedances over u > u_0 are also GPD
with the same shape parameter ξ
but with a different scale parameter
\[ \sigma_u = \sigma_{u_0} + \xi\,(u - u_0) \]
"Modified scale" = σ_u − ξu should be constant, independent of u, when GPD is appropriate
Exceedance analysis
Estimated CDF for yearly maximum
From 20 yearly maxima: c = 0.14, µ = 2.81, a = 0.77
[Figure: empirical and GEV estimated cdf (PWM method), F(x) against x: true CDF for yearly maximum and CDF for estimated GEV.]
Exceedance analysis
Estimated CDF by POT method
84 exceedances over u = 1 and GPD estimate gives ξ = 0.04, b = 2.38, a = 0.93
[Figure: tail probability on a log scale, 10^0 down to 10^-4, against x from 2 to 10: tail by POT method, true CDF, tail by direct GEV estimation.]
Exceedance analysis
Fatalities in English coal mines
[Figure: time and number of fatalities 1861 - 1962; number of fatalities (0 to 450) against year.]
Exceedance analysis
GEV?
We try GEV on all data (not really motivated!):
[Figure: empirical and GEV estimated cdf (PWM method), F(x) against x, 0 to 500.]
Exceedance analysis
Consider only big accidents – POT
There are 25 accidents with > 100 dead. Fit GPD to data > 100. 10% of these exceed 350.
[Figure: CDF for deaths > 100 and fitted GPD, F(x) against x, 100 to 600.]
Exceedance analysis
Vargas, Venezuela, catastrophe 1999
On December 15, 1999, the Vargas province was hit by 410 mm of rain.
Exceedance analysis
Rain in Venezuela - GEV or Gumbel?
Gumbel distribution is GEV with shape parameter ξ = 0. Based on rain statistics 1951-1998 the maximum daily rain distribution was estimated with GEV: scale = 19.9, location = 49.2, shape = 0.16.
The shape parameter ξ is not significantly different from 0, so Gumbel could "perhaps" be used instead of GEV. With Gumbel, the estimated 10000 year return value for daily rain is x10000 = 249 mm. In December 1999 the Vargas province was hit by 410 mm rain during one day.
From the full GEV the estimate of the 10000 year value would have been x10000 = 468 mm, much closer to the real value. A 95% one-sided confidence interval is x10000 < 1030 mm.
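The GEV figure can be reproduced from the quoted parameters with the return level formula. The base-R sketch below is only an arithmetic check with illustrative variable names. The Gumbel line reuses the GEV location and scale, which is an assumption made here for comparison: a separately fitted Gumbel model has its own estimates, so it need not give the 249 mm quoted above.

```r
# GEV parameters reported for daily rain maxima 1951-1998
mu <- 49.2; psi <- 19.9; xi <- 0.16
N <- 10000
y <- -log(1 - 1 / N)

# Full GEV return level
x_gev <- mu - (psi / xi) * (1 - y^(-xi))

# Gumbel formula (limit xi -> 0) with the SAME mu and psi; a refitted Gumbel
# model would have its own parameter estimates
x_gumbel_same_par <- mu - psi * log(y)

print(round(c(GEV = x_gev, Gumbel_same_par = x_gumbel_same_par)))
```

The comparison shows how strongly a shape parameter of only 0.16 affects an extrapolation this far into the tail.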
Multivariate extremes
General recipe for bivariate extremes

Assume, for a sequence of days, X_1, X_2, ... is the wind speed and Y_1, Y_2, ... is the wave height measured at a North Sea platform.
The platform is designed for extreme waves. It is also designed for extreme winds. What about the combined effect of wind and waves (and current)?
Make marginal EVA and transform x and y to standard scale. Plot pairs of transformed extremes to examine joint extreme behavior. Three main types of dependence/non-dependence:
Extremes come together
Extremes just add
Extremes come alone
[Figure: the three dependence types shown on standard scale, both axes from 0 to 2.]
Some literature
Some literature
Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J. (2004), Statistics of Extremes: Theory and Applications. Wiley.
Coles, S. (2001), An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag.
Gilleland, E., Katz, R., Young, G. (Feb. 2012), Package 'extRemes'.
Nelsen, R.B. (2006), An Introduction to Copulas. Springer.
The WAFO group (2011), WAFO – a MATLAB toolbox for analysis of random waves and loads. Lund univ. http://www.maths.lth.se/matstat/wafo/ and http://code.google.com/p/wafo/
