Taking first differences is a very useful tool for removing non-stationarity, but sometimes the differenced data will not appear stationary and it may be necessary to difference the data a second time:

y''_t = y'_t − y'_{t−1} = (y_t − y_{t−1}) − (y_{t−1} − y_{t−2}) = y_t − 2y_{t−1} + y_{t−2}
In practice, it is quite rare to proceed beyond second-order differences.
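The two differencing steps can be sketched in code. A minimal Python illustration (the series and function name are ours, chosen for the example):

```python
def difference(y, lag=1):
    """Return the lagged differences y_t - y_{t-lag} of a series."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

# A quadratic trend is non-stationary; its first differences still trend,
# but its second differences are constant.
y = [t * t for t in range(10)]   # y_t = t^2
d1 = difference(y)               # y'_t  = y_t - y_{t-1}
d2 = difference(d1)              # y''_t = y_t - 2*y_{t-1} + y_{t-2}
print(d2)                        # -> [2, 2, 2, 2, 2, 2, 2, 2]
```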
I. Rethemiotaki, E. Zervas, HOU DEEE12, December 2-4, 2012, Paris-France
Seasonal differencing

With non-stationary seasonal data, a seasonal difference must be applied. A seasonal difference is the difference between an observation and the corresponding observation from the previous year:

y'_t = y_t − y_{t−s}

where φ_1, ..., φ_p and θ_1, ..., θ_q are the parameters of the models, C is the constant, ε_t are white-noise error terms, and s is the number of periods per season.
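A seasonal difference is the same operation with lag s; a short Python sketch (monthly data, s = 12, artificial series chosen for illustration):

```python
import math

def seasonal_difference(y, s):
    """Return y'_t = y_t - y_{t-s}: each value minus the one a season earlier."""
    return [y[t] - y[t - s] for t in range(s, len(y))]

# A purely periodic series with period 12 is removed entirely
# by a single seasonal difference.
s = 12
y = [math.sin(2 * math.pi * t / s) for t in range(3 * s)]
d = seasonal_difference(y, s)
print(max(abs(v) for v in d))  # numerically zero
```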
ARIMA MODELS
The Box-Jenkins Approach
Advantages
Derived from solid mathematical-statistics foundations
ARIMA models are a family of models, and the BJ approach is a strategy for choosing the best model out of this family
It can be shown that an appropriate ARIMA model can produce optimal univariate forecasts
Disadvantages
Requires a large number of observations for model identification
Hard to explain and interpret to unsophisticated users
2.2.1 OUTLINE OF TIME SERIES
STATIONARY TIME SERIES
METHOD OF DIFFERENCING
THE WHITE NOISE MODEL
ARIMA MODELS
- DESCRIPTION OF THE MODELS
- GENERAL STRATEGY
Identification
Estimation and testing
Application
The Box-Jenkins model-building process
Model identification (differencing the series to achieve stationarity), then model estimation, then: is the model adequate? If no, modify the model and re-estimate; if yes, use it to produce forecasts.
The Box-Jenkins model-building process (cont.)
I. Identification
Data preparation
- Transform the data to stabilize the variance
- Difference the data to obtain a stationary series
Model selection
- Examine the data, ACF, and PACF to identify potential models (the autocorrelation and partial autocorrelation functions are valuable tools for investigating the properties of a time series)
The Box-Jenkins model-building process (cont.)
II. Estimation and testing
Estimation
- Estimate the parameters in potential models
- Select the best model using suitable criteria
Diagnostics
- Check the ACF/PACF of the residuals
- Do tests of the residuals
- Are the residuals white noise?
III. Application
Forecasting: use the best model
Identification Tools
Correlogram: a graph showing the ACF and the PACF at different lags.
Autocorrelation function (ACF): autocorrelations are statistical measures indicating how a time series is related to itself over time. The autocorrelation at lag 1 is the correlation between the original series y_t and the same series shifted by one period (y_{t-1}).
r_k = Σ_{t=1}^{n−k} (y_t − ȳ)(y_{t+k} − ȳ) / Σ_{t=1}^{n} (y_t − ȳ)²
Partial autocorrelation function (PACF): measures the correlation between (time series) observations that are k time periods apart, after controlling for correlations at intermediate lags (i.e., lags less than k). In other words, it is the correlation between Y_t and Y_{t−k} after removing the effects of the intermediate Y's.
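The sample autocorrelation r_k can be computed directly from the formula above; a minimal Python sketch:

```python
def acf(y, k):
    """Sample autocorrelation at lag k:
    r_k = sum_{t=1..n-k} (y_t - ybar)(y_{t+k} - ybar) / sum_{t=1..n} (y_t - ybar)^2
    """
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t + k] - ybar) for t in range(n - k))
    den = sum((v - ybar) ** 2 for v in y)
    return num / den

# A trending (non-stationary) series shows positive lag-1 autocorrelation.
print(acf([1, 2, 3, 4, 5], 1))  # -> 0.4
```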
EXAMPLE:
(a) A correlogram of a nonstationary time series
(b) A correlogram of a stationary time series after 1st-order differencing
I. Model identification
Plot the data.
Identify any unusual observations.
If necessary, transform the data to stabilize the variance (logarithmic or power transformation of the data).
Check the time series plot, ACF, and PACF of the data (possibly transformed) for stationarity.
If the time plot shows the data scattered horizontally around a constant mean, and the ACF and PACF drop to or near zero quickly, then the data are stationary.
Construction of the time series chart: identification of non-stationarity
[Figure: time series plot (concentration vs. time, observations 0-8000, values roughly 0-180) illustrating non-stationary behavior.]
I. Model identification
Use differencing to transform the data into a stationary series:
- For non-seasonal data, take first differences.
- For seasonal data, take seasonal differences.
Check the plots again; if they still appear non-stationary, take the differences of the differenced data.
FIRST DIFFERENCES OF THE DATA
Transform the data into a stationary series.
I. Model identification
When stationarity has been achieved, check the ACF and PACF plots for any pattern remaining. There are three possibilities:
AR or MA models
- No significant ACF after time lag q indicates that MA(q) may be appropriate.
- No significant PACF after time lag p indicates that AR(p) may be appropriate.
I. Model identification
Seasonality is present if the ACF and/or PACF at the seasonal lags are large and significant.
If no clear MA or AR model is suggested, a mixture model may be appropriate.
Theoretical Patterns of ACF and PACF

Type of Model | Typical Pattern of ACF | Typical Pattern of PACF
AR(p) | Decays exponentially or with a damped sine-wave pattern, or both | Significant spikes through lags p
MA(q) | Significant spikes through lags q | Declines exponentially
ARMA(p,q) | Exponential decay | Exponential decay
Theoretical Patterns of ACF and PACF: Examples
[Figures: ACF and PACF plots over lags 1-10 for MA(1), MA(2), AR(1), and ARMA processes.]
I. Model identification
The general non-seasonal model is known as ARIMA(p, d, q):
- p is the number of autoregressive terms.
- d is the number of differences.
- q is the number of moving-average terms.
The ARIMA models can be extended to handle seasonal components of a data series. The general shorthand notation is ARIMA(p, d, q)(P, D, Q)_s, where s is the number of periods per season.
I. Model identification
Note that if any of p, d, or q is equal to zero, the model can be written in a shorthand notation by dropping the unused part.
Examples:
ARIMA(2, 0, 0) = AR(2)
ARIMA(1, 0, 1) = ARMA(1, 1)
Example: ARIMA(1, 0, 0) time series data
The ACF and PACF can be used to identify an AR(1) model:
- The autocorrelations decay exponentially.
- There is a single significant partial autocorrelation.
[Figures: time series plot of the AR1 data series; autocorrelation and partial autocorrelation functions with 5% significance limits.]
Example: ARIMA(0, 0, 1) time series data
The ACF and PACF can be used to identify an MA(1) model:
- There is only one significant autocorrelation, at time lag 1.
- The partial autocorrelations decay exponentially.
[Figures: time series plot of the MA1 data series; autocorrelation and partial autocorrelation functions with 5% significance limits.]
Example: seasonal ARIMA (0,1,1)(0,1,1)_12
We take a seasonal difference and then difference the data again to achieve stationarity.
- The PACF shows exponential decay in values.
- The ACF shows a significant value at time lag 1; this suggests an MA(1) model.
- The ACF also shows a significant value at time lag 12; this suggests a seasonal MA(1).
[Figures: time series plot of the first difference of the seasonal series (monthly data, Jan 1964 - Jan 1973); ACF and PACF with 5% significance limits.]
II. Estimating the parameters
Once a tentative model has been selected, the parameters for the model (φ_1, φ_2, ..., φ_p) must be estimated.
e.g. ARIMA(1, 0, 0): y_t = C + φ_1 y_{t−1} + e_t
where φ_j is the jth autoregression parameter and e_t is the error term at time t.
One method of estimating the parameters is the maximum likelihood procedure.
II. Estimating the parameters
Maximum likelihood procedure:
- After the determination of the estimates and their standard errors, the t-values can be constructed.
- The parameters that are judged significantly different from zero are retained in the fitted model.
- The parameters that are not significantly different from zero are dropped from the model.
II. Estimation of AR, MA, and ARMA models
There are two criteria often used that reflect the closeness of fit of the model and the number of parameters estimated.
- One is the Akaike information criterion (AIC):
  AIC = -2*ln(likelihood) + 2*k
- The other is the Schwarz Bayesian criterion (SBC), also called the Bayesian information criterion (BIC):
  BIC = -2*ln(likelihood) + ln(N)*k
where:
k = the number of parameters estimated in the model (k = p + q + P + Q)
N = the number of observations
If we are considering several ARMA models, we choose the one with the lowest AIC or BIC.
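Both criteria are one-line computations once the maximized log-likelihood is available; a Python sketch (the candidate log-likelihoods below are made-up numbers for illustration):

```python
import math

def aic(loglik, k):
    """Akaike information criterion: AIC = -2*ln(likelihood) + 2*k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: BIC = -2*ln(likelihood) + ln(N)*k."""
    return -2.0 * loglik + math.log(n) * k

# Two hypothetical candidates fitted to N = 100 observations:
n = 100
candidates = {"ARMA(1,1)": (-100.0, 2), "ARMA(2,2)": (-99.0, 4)}
best = min(candidates, key=lambda m: aic(*candidates[m]))
print(best)  # -> ARMA(1,1): the extra parameters do not pay for the small gain
```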
II. Diagnostic checking
Before using the model for forecasting, it must be checked for adequacy. A model is adequate if the residuals left over after fitting the model are simply white noise. A test can also be applied to the residuals as an additional test of fit.
II. Diagnostic checking
Example: the residuals are white noise; the ACF and PACF are well within their two-standard-error limits.
[Figures: ACF and PACF of residuals for ISC over lags 1-16, with 5% significance limits.]
II. Diagnostic checking
The portmanteau statistics test whether the first m residual autocorrelations r_k are jointly zero:
Box-Pierce: Q = N Σ_{k=1}^{m} r_k²
Ljung-Box: LB = n(n+2) Σ_{k=1}^{m} r_k² / (n−k)
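The Ljung-Box statistic can be computed from the residual autocorrelations; a self-contained Python sketch reusing the sample-ACF formula from the identification slides:

```python
def acf(y, k):
    """Sample autocorrelation of y at lag k."""
    n = len(y)
    ybar = sum(y) / n
    num = sum((y[t] - ybar) * (y[t + k] - ybar) for t in range(n - k))
    return num / sum((v - ybar) ** 2 for v in y)

def ljung_box(residuals, m):
    """LB = n(n+2) * sum_{k=1..m} r_k^2 / (n-k).
    Large values (vs. a chi-square reference) reject white noise."""
    n = len(residuals)
    return n * (n + 2) * sum(acf(residuals, k) ** 2 / (n - k)
                             for k in range(1, m + 1))

# A strongly alternating "residual" series is far from white noise:
print(ljung_box([1, -1, 1, -1], 1))  # -> 4.5
```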
III. Application
Forecasting: use the model to forecast future values.
e.g.: the forecasts are generated by the following equation:
Ŷ_t = c + φ_1 Y_{t−1} + φ_2 Y_{t−2}
Ŷ_66 = 284.9 − 0.324(195) + 0.219(300) = 287.4
Ŷ_67 = 284.9 − 0.324(287.4) + 0.219(195) = 234.5
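Reading the slide's coefficients as c = 284.9, φ_1 = -0.324, φ_2 = 0.219 (the signs are our reading of the garbled figures; they reproduce the printed results), the two forecasts follow from the AR(2) recursion, with the first forecast fed back in as a lagged value for the second:

```python
def ar2_forecast(y_lag1, y_lag2, c=284.9, phi1=-0.324, phi2=0.219):
    """One-step AR(2) forecast: Y_t = c + phi1*Y_{t-1} + phi2*Y_{t-2}."""
    return c + phi1 * y_lag1 + phi2 * y_lag2

y64, y65 = 300.0, 195.0
y66 = ar2_forecast(y65, y64)  # 284.9 - 0.324*195 + 0.219*300
y67 = ar2_forecast(y66, y65)  # the Y_66 forecast becomes the lag-1 value
print(round(y66, 1), round(y67, 1))  # -> 287.4 234.5
```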
III. The predictive ability of the model
How can we test whether a forecast is accurate or not?
- Mean absolute error (MAE)
- Root mean squared error (RMSE)
- Mean absolute percentage error (MAPE)
where A_t is the actual value and F_t is the forecast value. The best model minimizes these error indicators.
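The three error indicators are easy to compute; a minimal Python sketch (the actual/forecast values are made up for illustration):

```python
import math

def mae(actual, forecast):
    """Mean absolute error: mean of |A_t - F_t|."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error: sqrt of the mean of (A_t - F_t)^2."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actuals must be nonzero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

A = [100.0, 200.0, 400.0]  # actual values A_t
F = [110.0, 190.0, 400.0]  # forecast values F_t
print(round(mae(A, F), 3), round(rmse(A, F), 3), round(mape(A, F), 2))
```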
Forecasting: use model to forecast:
Example
3. APPLICATION IN THE CASE OF ATHENS
DATA SOURCES
The Hellenic Ministry of Environment, Energy and Climate Change has a network of stations measuring the main air pollutants:
- Carbon monoxide (CO)
- Nitrogen monoxide (NO)
- Nitrogen dioxide (NO2)
- Sulfur dioxide (SO2)
- Ozone (O3)
- Particulate matter < 10 μm (PM10)
- Particulate matter < 2.5 μm (PM2.5)
RESULTS
1. Statistical analysis of the air pollutant carbon monoxide (CO) at 3 stations:
- Patision (urban-traffic station)
- Peristeri (urban station)
- Lycovrisi (suburban station)
Study of the descriptive statistical measures of annual, monthly, daily and hourly concentrations of carbon monoxide.
2. Time series analysis of CO in the year 2011 with ARIMA models.
1. STATISTICAL ANALYSIS:
Mean - Standard deviation
1. STATISTICAL ANALYSIS:
Relative standard deviation (coefficient of variation)
RSD of annual concentration
1. STATISTICAL ANALYSIS:
Skewness - Kurtosis
1. STATISTICAL ANALYSIS:
Median
1. STATISTICAL ANALYSIS:
Percentile - Patision station
1. STATISTICAL ANALYSIS:
Mean of monthly concentration - Patision station
[Figure: mean monthly CO concentration (mg/m3) at the Patision station, Jan-Dec, for 2001 and 2011.]
1. STATISTICAL ANALYSIS:
Mean of monthly concentration - Peristeri station
[Figure: mean monthly CO concentration (mg/m3) at the Peristeri station, Jan-Dec, for 2001 and 2011.]
1. STATISTICAL ANALYSIS:
Mean of daily concentration - Patision station
[Figure: mean daily CO concentration (mg/m3) at the Patision station, Monday-Sunday, for 2001 and 2011.]
1. STATISTICAL ANALYSIS:
Mean of daily concentration - Peristeri station
[Figure: mean daily CO concentration (mg/m3) at the Peristeri station, Monday-Sunday, for 2001 and 2011.]
1. STATISTICAL ANALYSIS:
Mean of hourly concentration - Patision station
[Figure: mean hourly CO concentration (mg/m3) at the Patision station, hours 1-24, for 2001 and 2011.]
1. STATISTICAL ANALYSIS:
Mean of hourly concentration - Peristeri station
[Figure: mean hourly CO concentration (mg/m3) at the Peristeri station, hours 1-24, for 2001 and 2011.]
2. TIME SERIES ANALYSIS:
Construction of the time series chart
2. TIME SERIES ANALYSIS:
Transform the data into a stationary series
We take a seasonal difference and then difference the data again (to achieve stationarity).
2. TIME SERIES ANALYSIS:
ACF and PACF
We check the ACF and PACF plots for any pattern remaining:
- The PACF shows exponential decay in values.
- The ACF shows a significant value at time lag 2; this suggests an MA(2) model.
- The ACF also shows a significant value at time lag 24; this suggests a seasonal MA(1).
A seasonal ARIMA model fits our data.
2. TIME SERIES ANALYSIS:
Akaike information criterion (AIC)
We use the Akaike information criterion (AIC) to find the best model for our data. We choose the one with the lowest AIC: ARIMA(1,1,2)x(1,1,1)_24
(order of AR = 1, order of differencing = 1, order of MA = 2; order of seasonal AR (SAR) = 1, order of seasonal differencing = 1, order of seasonal MA (SMA) = 1)
ARIMA(1,1,2)x(1,1,1)_24:
(1 − Φ_1 B^24)(1 − φ_1 B)(1 − B)(1 − B^24) x_t = (1 − θ_1 B − θ_2 B²)(1 − Θ_1 B^24) ε_t
where B^j x_t = x_{t−j} (B is called the backward-shift operator) and φ_1, ..., φ_p, θ_1, ..., θ_q, Φ_1, ..., Φ_P, Θ_1, ..., Θ_Q are the parameters of the model.
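The backshift algebra can be checked numerically: (1 − B)(1 − B^24) applied one factor at a time must equal the expanded operator x_t − x_{t−1} − x_{t−24} + x_{t−25}. A small Python sketch (integer series so the comparison is exact):

```python
def apply_diff(x, lag):
    """Apply (1 - B^lag) to a series: x_t - x_{t-lag}."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

s = 24
x = [3 * t + (t % s) ** 2 for t in range(5 * s)]  # trend plus a period-24 pattern

# (1 - B)(1 - B^24) x_t, applied one factor at a time...
combined = apply_diff(apply_diff(x, s), 1)
# ...equals the expanded operator x_t - x_{t-1} - x_{t-24} + x_{t-25}
expanded = [x[t] - x[t - 1] - x[t - s] + x[t - s - 1]
            for t in range(s + 1, len(x))]
print(combined == expanded)  # -> True
```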
2. TIME SERIES ANALYSIS:
Estimating the parameters
We use the maximum likelihood procedure to estimate the parameters of the model ARIMA(1,1,2)x(1,1,1)_24:

Parameter | Estimate
AR(1), φ_1 | 0,757063
MA(1), θ_1 | 0,861985
MA(2), θ_2 | 0,0795282
SAR(1), Φ_1 | 0,0622429
SMA(1), Θ_1 | 0,987624
2. TIME SERIES ANALYSIS:
Test of residuals
Before using the model for forecasting, it must be checked for adequacy. The residuals are white noise: the ACF is well within its two-standard-error limits.
2. TIME SERIES ANALYSIS:
Forecasting
We use the model to forecast future values.
[Figure: time sequence plot for CO, ARIMA(1,1,2)x(1,1,1)_24, showing actual values, forecasts, and 95.0% limits.]
2. TIME SERIES ANALYSIS:
THE PREDICTIVE ABILITY OF THE MODEL
The three statistics, mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE), measure the magnitude of the errors.
In this case, the model was estimated from the first 8699 data values; 50 data values at the end of the time series were withheld to validate the model. The table shows the error statistics for both the estimation and validation periods. If the results are considerably worse in the validation period, the model is not likely to perform as well as otherwise expected in forecasting the future.

Statistic | Estimation Period | Validation Period
RMSE | 0,47183 | 0,634806
MAE | 0,321131 | 0,491219
MAPE | 28,4324 | 23,6307
2. TIME SERIES ANALYSIS:
Forecasting
[Figure: time series forecast of CO (mg/m3) over the withheld values, actual vs. forecast.]
It can be seen that the predicted values produced by our proposed model follow the actual values of CO emissions closely. This not only shows that our proposed model is capable of forecasting CO emissions, but also speaks to the usefulness of the model.
2. TIME SERIES ANALYSIS: Forecasting
Relationship between actual and estimated values
[Figure: scatter plot of estimated vs. actual CO values (mg/m3, 2011), with fitted line y = 0,859x + 0,214 and R² = 0,831.]
CONCLUSIONS:
DESCRIPTIVE STATISTICAL ANALYSIS
From the application of descriptive statistical analysis of carbon monoxide in the case of Athens, we conclude that:
The mean and the standard deviation of the annual concentration of CO are greater at the urban-traffic Patision station. This is expected, since the biggest source of CO is road transport. There is a downward trend in the concentration of the pollutant over the years, due mostly to the replacement of old vehicles with new-technology ones.
The relative standard deviation is very large at the suburban station and is stable over time at all three stations.
The skewness of the annual concentration of CO is greater at the suburban Lykovrisi station. The asymmetry is positive at all three stations, showing that the tail of the distribution is to the right of the peak; this means that there are many values with high concentrations.
The kurtosis of the annual concentration of CO is positive at all three stations. The distribution is peaked, so the observed values are close to the average.
CONCLUSIONS:
DESCRIPTIVE STATISTICAL ANALYSIS
The median of the annual concentration of CO behaves similarly to the mean.
The mean monthly concentration of CO at both the Patision and Peristeri stations shows a decline over the last decade. Higher values of carbon monoxide are observed in the winter months, with a maximum in December, due to the weather and traffic conditions during this period. A decrease is observed in the summer months, due to weather conditions and the reduction in traffic during summer vacation.
Similarly, the mean daily concentration of CO at both the Patision and Peristeri stations shows a decline over the last decade. Higher values of carbon monoxide are observed midweek, while the concentrations of CO decrease during the weekend.
CO pollution shows two peaks: 08:00-10:00 and 20:00-22:00.
CONCLUSIONS:
TIME SERIES ANALYSIS
A non-stationary time series statistical model with trend and seasonal effects is used to predict future estimates of carbon monoxide emissions in Athens. The data set consists of hourly atmospheric CO concentrations in 2011.
The developed process is evaluated with various statistical criteria to attest to its quality.
Finally, the accuracy of the proposed model is tested by predicting and analyzing the CO emissions for 50 data values. The results show that the model predicts the future values well.
Application of descriptive statistical analysis and time series analysis on atmospheric pollution
Thank you very much for your attention
Efthimios Zervas
Hellenic Open University - Greece
zervas@eap.gr