Vous êtes sur la page 1sur 46

https://bookdown.

org/raghavendrajain/dengueforecasting/time-series-
analysis-in-r.html

Dengue Forecasting Project


Raghvendra Jain

08 October, 2016

1 Abstract
To create early warning system of dengue outbreaks, we
present a machine learning-based methodology capable of
providing real-time (“nowcast”) and forecast estimates of
dengue prediction in each of the fifty districts of Thailand by
leveraging data from multiple data sources. Using a set of
prediction variables we show an increasing prediction
accuracy of the model with an optimal combination of
predictors which include: meteorological data, clinical data,
lag variables of disease surveillance, socio-economic data
and the data encoding spatial dependence on dengue
transmission. We use generalized Generalized Additive
Models (GAMs) to fit the relationships between the predictors
and the clinical data of Dengue hemorrhagic fever (DHF) on
the basis of the data from 2008 to 2012. Using the data from
2013 to 2015 and a comparative set of prediction models we
evaluate the predictive ability of the fitted models according
to RMSE and SRMSE, BIC as well as AIC. We also show that
for the prediction of dengue outbreaks within a district, the
influence of dengue incidences and socio-economic data
from the surrounding districts is statistically significant,
possibly indicating the influence of movement patterns of
people and spatial heterogeneity of human activities on the
spread of the epidemic.

2 Hypothesis
H1:H1: To forecast dengue incidences in a particular district,
the influence of the data from past dengue incidences and
socio-economic data from its surrounding districts is
statistically significant.
H2:H2: To forecast dengue incidences, a data-driven
interpretable, non-parametric time-series forecasting
approach (e.g. Generalized Additive Models (GAMs)) is
statistically better than parametric modeling approaches
(e.g. ARIMA.)

H3:H3: To forecast dengue incidences, an ensemble


forecasting model with Bayesian Network and time-series
modeling approach is statistically better than the individual
models.

3 Exploratory Data Analysis


I import the important the following packages into R.
require(ggplot2) # used in plotting require(reshape)require(plotly)require(bnlearn) #
used for Bayesian Networksrequire(mgcv) # used for Generalized Additive Modeling
require(fpp)require(lubridate)require(bsts)require(dplyr)require(CausalImpact)require(
xtable)library(dplyr)library(tidyr)

3.1 The description of variables


The description of the data variables.

Descriptio
Variable Name
n
The rows
have the
codes of 50
districts
of Bangkok.
connected_district_code
The colums
tell its
connecions
to other
districts.
district_code Has
geocodes,
postal code
of each of
the
district,
their
names,
population
Descriptio
Variable Name
n
and area.
Has yearly
and daily
average
gardbage
district_garbage_data
collection
data for 3
consecutive
years.
Has
population
deographics
into age
groups in
each
district_population district,
their total
population,
number of
communities
and the
total area.

3.2 The diurnal temperature range (DTR)


DTR is the difference between the daily maximum and
minimum temperature.

The plot summary and visualization is given below:


AverageDTRInBangkok_2008.2015

Month 2015 2014 2013 2012 2011 2010 2009 2008

1 Jan 9.6 11.2 10.3 9.2 9.9 8.6 10.6 9.0

2 Feb 9.0 8.6 9.0 8.3 8.8 7.2 9.7 7.8

3 Mar 8.0 8.1 9.1 8.3 7.4 8.9 8.9 8.0

4 Apr 8.7 8.7 9.2 8.5 8.1 8.4 9.5 8.1

5 May 8.3 9.3 8.9 8.8 7.6 8.5 8.1 7.0

6 Jun 8.6 7.6 7.8 7.6 6.7 8.7 7.1 6.9

7 Jul 7.8 7.4 7.2 6.9 7.7 8.2 7.3 6.4

8 Aug 7.9 7.9 8.1 7.9 7.4 8.1 8.4 6.4


9 Sep 8.1 8.1 7.3 7.3 7.4 8.3 8.5 6.7

10 Oct 7.2 7.9 7.9 7.6 7.7 7.2 8.6 6.9

11 Nov 8.0 8.5 8.5 8.0 8.6 7.6 9.3 7.0

12 Dec 8.4 8.6 10.5 9.7 9.2 8.7 9.3 9.0

plot.ts(summary_DTR)

Plotting DTR for all the years on the same plot.


3.3 The average monthly rainfall in
Bangkok.
The plot summary for monthly rainfall and visualization is
given below:
AverageRainInBangkok_2008.2015

Month 2015 2014 2013 2012 2011 2010 2009 2008

1 Jan 3.5 0.000 32.206 43.354 0.630 99.9 0.0 62.1

2 Feb 16.8 2.512 4.098 18.654 37.342 2.9 0.0 69.3

3 Mar 183.9 19.314 35.918 31.888 136.716 14.6 30.2 3.6

4 Apr 128.9 25.358 64.366 51.914 136.662 17.3 359.6 180.8

5 May 82.5 125.410 111.540 108.984 246.128 279.3 463.4 257.9

6 Jun 495.0 125.936 206.946 123.418 196.584 198.8 219.3 163.6

7 Jul 220.8 122.270 142.212 176.626 288.680 348.7 175.7 221.8

8 Aug 50.5 218.006 228.214 190.436 292.804 343.1 354.0 172.1


9 Sep 352.4 203.436 312.220 548.736 266.538 409.5 351.8 335.2

10 Oct 334.2 206.998 314.090 236.996 337.502 256.3 264.2 399.3

11 Nov 34.9 51.506 69.286 106.022 1.880 30.6 46.5 36.7

12 Dec 42.7 25.926 2.060 12.256 0.584 22.7 7.3 0.0

plot.ts(summary_Rain)

Plotting average monthly rainfall for all the years on the


same plot.
3.4 The dengue incidences in Bangkok
Districts
The plot shows the Dengue hemorrhagic fever(DHF)
incidence peaked in 2013 and 2015.
The above plot shows that most of the DHF incidents were
reported in the month of October and November.
4 The Prediction Starts here
5 Time Series Analysis in R
First, let’s plot the data
We can see from this time series that there seems to be
seasonal variation in the number of dengue incidences per
month: there is a peak every winter, and a trough every
summer. Again, it seems that this time series could probably
be described using an additive model, as the seasonal
fluctuations are roughly constant in size over time and do not
seem to depend on the level of the time series, and the
random fluctuations also seem to be roughly constant in size
over time.

Thus, we don’t need to tranform the time series by


calculating the natural log of the original data.

5.1 Exponential Smoothing


5.2 Seasonal ARIMA Model
When a model is fit by manual setting of parameters.

I fit an ARIMA(0,1,1)(0,1,1)[12] model.


I fit an ARIMA(1,0,3)(1,1,1)[12] model.
5.3 A Bayesian Structural Time Series
Model
5.4 A Structured Bayesian Network
Approach
ME RMSE MAE MPE MAPE

Test set -1.063421 10.0441 6.664085 -60.43833 79.39199


accuracy(f = pred, x = bayesianDF.test[, "count"])

ME RMSE MAE MPE MAPE

Test set 8.765601 27.51189 16.08321 -87.3934 135.7163

5.5 Generalized Additive Model


Now, I analyse the dataset according using the Generalized
Additive Model.

The dataset consists of following entries:

1. Information of each district


1. identification codes
2. name of the district.
3. population (divided into various age bins)
4. area in square kms.
5. Number of communities
6. number of neighboring districts
2. Monthly DHF count in each district from 2008~2015
3. Monthly average rainfall in Bangkok from 2008~2015
4. Monthly Diurnal Temperature Range (DTR) in Bangkok from
2008~2015

The map of Bangkok is shown below.

There are 18 districts that are close to stream as shown in


the above map. Their names are:
[1] "Bang Su" "Dusit" "Bang Plad"

[4] "Phra Nakhon" "Bangkok Noi" "Bangkok Yai"

[7] "Thon Buri" "Khlong San" "Pom Pram Sattru"


[10] "Samphantawong" "Bang Rak" "Sathorn"

[13] "Bang Kho Laem" "Yannawa" "Rat Burana"

[16] "Khlong Toey" "Prakanong" "Bang Na"

5.6 Study Area 1


For the ease of experimentation, I use the data from first
district (indexed as 1 in the image of Bangkok) and perform
the our analyses. The name of the district is Phra Nakhon. As
you can see, it is located near the stream.
5.7 Prediction using the entire data with
GAM
5.8 Notes to myself:
 I need to create the lagged values for each district.
 Separate the data into training and testing datasets. Keep
the year variable flexible, so that I can experiment with year
variables
 Row bind all the training datasets i.e. the data from 50 BKK
districts
 Row bind all the data from the testing datasets similar to
above.
 make tests.

6 Introduction
This is an R Markdown document. Markdown is a simple
formatting syntax for authoring HTML, PDF, and MS Word
documents. For more details on using R Markdown
see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated


that includes both content as well as the output of any
embedded R code chunks within the document. You can
embed an R code chunk like this:
summary(cars)

speed dist

Min. : 4.0 Min. : 2.00

1st Qu.:12.0 1st Qu.: 26.00

Median :15.0 Median : 36.00

Mean :15.4 Mean : 42.98

3rd Qu.:19.0 3rd Qu.: 56.00

Max. :25.0 Max. :120.00

6.1 Including Plots


You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code
chunk to prevent printing of the R code that generated the
plot.

Vous aimerez peut-être aussi