
Term Paper in Stat 276: An Exposition of a Published Article

Angelita P. Tobias

About the Material


Field: Ecological Modelling
Title:

Application of a Random Forest algorithm to predict spatial distribution of the potential yield of Ruditapes philippinarum in the Venice lagoon, Italy (February 2011)

I. SUMMARY
The study applied a machine learning technique to model and predict the spatial distribution of the potential yield of the Manila clam. The Manila clam, scientifically known as Ruditapes
philippinarum, is an aquaculture species that is widely grown in the Venice lagoon (Italy). The model
was built to support management of the species by identifying suitable production and
harvestable grounds. The overall approach combined the Random Forest (RF) algorithm with
Geographic Information Systems (GIS) to develop a predictive model that captures the complex
relationship between yield and the nine environmental factors that are known to affect the yield or abundance of
the clam.
A total of 2,539 observations were collected and then split into a calibration set and a
validation set, commonly known as the train and test sets, respectively. The train set with n = 1,698
was used to build the random forest model, and the test set with n = 841 was used to assess
how accurately the model predicts unseen data. Before building the RF model, correlation
analysis was also conducted among the environmental factors through hierarchical clustering
using the squared Spearman correlation as the measure of similarity. The results show that the RF model
was able to explain the variability of the yield of R. philippinarum well, with r²(train) = 0.99 and r²(test) =
0.74.
II. DISCUSSION OF MATERIAL
1. Description of the data
The data set was collected in 2007 and consists of yield data and nine (9) environmental
factors that are known to affect the abundance of the Manila clam. These nine environmental
factors are:
Percentage of sand in the sediment (Sand)
Dissolved oxygen (DO)
Salinity (Sal)
Water speed (Speed)
Chlorophyll a (Chla)
Turbidity (Turb)
Residence time (RT)
Water temperature (T)
Water depth (WD)

Yield information is available for only ~65% of the production area, and since
yield data are usually divided into farming and fishing areas, biomass data of R.
philippinarum gathered from different studies were used to derive a spatial distribution
that serves as a proxy indicator of yield.
Moreover, data on the predictor variables were collected from different sources. Aside
from the scattered data sources, missing observations were another challenge before
modelling. The available observations were therefore interpolated onto the chosen grid
(100 m x 100 m cells) through ordinary kriging using the gstat package in R.
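
A minimal sketch of such an interpolation with the gstat package is shown below. It is illustrative only: the data frame obs, the prediction grid object grid, and the spherical variogram model are assumptions, not details taken from the study.

# Minimal sketch of ordinary kriging with gstat (illustrative assumptions:
# 'obs' is a data frame with coordinates x, y and one predictor, here Sand;
# 'grid' is the 100 m x 100 m prediction grid as a SpatialPixels object)
library(sp)
library(gstat)

coordinates(obs) = ~ x + y                 # promote to a SpatialPointsDataFrame

v     = variogram(Sand ~ 1, obs)           # empirical variogram of the predictor
v_fit = fit.variogram(v, vgm("Sph"))       # fit a spherical variogram model

kriged = krige(Sand ~ 1, obs, newdata = grid, model = v_fit)   # ordinary kriging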

The data set consists of 2,539 cells and was randomly split into a train set (n = 1,698) and
a test set (n = 841). The relationship between yield and the nine environmental factors
was modelled using the train set, and the model was validated using the test set.
2. Methodology
i. Hierarchical Clustering
In classical regression, such as simple and multiple linear regression, correlation
analysis is one of the initial steps before proceeding further with model building. In
this study, correlation analysis was conducted through hierarchical clustering
using the squared Spearman correlation (ρ²) as the similarity measure. In practice, aside
from k-means clustering, hierarchical clustering is also a common clustering
technique; it is characterized by a tree-like structure and can cluster in two
ways: agglomerative clustering or divisive clustering. The agglomerative
approach starts with each object representing an individual cluster. These
clusters are then sequentially merged according to their similarity. The divisive
approach starts with all objects initially merged into a single cluster, which is then
gradually split up based on dissimilarity.
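
A minimal sketch of this correlation analysis in R is given below. It reuses the column names from the R code at the end of this paper and treats 1 − ρ² as the distance between predictors; the average-linkage choice is an assumption, not the study's setting.

# Minimal sketch: hierarchical clustering of the predictors using the squared
# Spearman correlation as similarity (average linkage is an assumption)
predictors = newdata[, c("Sand", "DO", "Salinity", "Water_Speed", "Chlorophyla",
                         "Turbidity", "Resid_time", "Water_temp", "Water_depth")]

rho2 = cor(predictors, method = "spearman")^2    # squared Spearman correlation
d    = as.dist(1 - rho2)                         # highly correlated = small distance

hc = hclust(d, method = "average")               # agglomerative clustering
plot(hc, main = "Clustering of predictors by squared Spearman correlation")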
ii. Random Forest
To investigate the relationship between the site-specific yield of the Manila clam and
the nine environmental factors, the random forest algorithm was applied to the data.
The Random Forest algorithm is a machine learning technique that generates
predictions based on an ensemble of decision trees built from bootstrapped
training samples. It can be implemented in R using the randomForest package,
which requires three parameters: the number of bootstrapped samples (ntree), the
number of predictor variables tested at each node (mtry), and the minimum number
of observations at each terminal node (nodesize). A decision tree, in turn, is a
partitioning method that stratifies the data based on the optimal split chosen from
a random subset of predictors. A decision tree is called a regression tree if the
dependent variable Yi is numerical and a classification tree if it is categorical. In
this study, a regression tree was grown for each bootstrapped sample.
The random forest algorithm works as follows:
1) Draw bootstrapped samples. From the train set, ntree bootstrapped samples
are drawn, each containing approximately two-thirds of the observations in the
original train set. The roughly one-third not selected in the i-th draw is
treated as the out-of-bag data for that i-th bootstrapped sample.
2) Build the random forest. For each bootstrapped sample, grow an unpruned
decision tree. A random forest is then formed consisting of ntree trees.
3) Predict the out-of-bag elements. The i-th decision tree grown from the i-th
bootstrapped sample is used to predict the values of the dependent
variable for its out-of-bag elements, given the values of their predictor
variables. On average, each observation from the train set will be out-of-bag
in about one-third of the ntree bootstrapped samples, so each Yi is predicted
by averaging the predictions from the trees for which it was out-of-bag.
From the actual and predicted values, an estimate of the out-of-bag error rate
(ERR_OOB) can be derived, as sketched after this list.
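
As a small illustration of step 3, the out-of-bag quantities can be inspected directly from a fitted regression forest in R; the object name rf_model and the training response y_train below are assumptions.

# Minimal sketch: out-of-bag error of a fitted regression forest
# (assumes a fitted object 'rf_model' and the training response 'y_train')
oob_pred = rf_model$predicted                        # OOB prediction for each Yi
ERR_OOB  = mean((y_train - oob_pred)^2)              # out-of-bag mean squared error
r2_OOB   = 1 - ERR_OOB / mean((y_train - mean(y_train))^2)   # out-of-bag R-squared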

After parameter tuning, the optimal random forest can be used to predict the
observations in the validation set in order to check the predictive ability and
adequacy of the model before using it for further predictive work; a sketch of such a
tuning loop is given below.
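
The sketch below illustrates one simple way to tune the parameters by minimizing the out-of-bag error; the candidate values for mtry and nodesize are assumptions made for illustration, not the grid used in the study.

# Minimal sketch of parameter tuning by out-of-bag error
# ('calibration' is the train set defined in the R code at the end of this paper)
require(randomForest)

grid = expand.grid(mtry = c(2, 3, 5), nodesize = c(5, 10, 20))
grid$oob_mse = NA

for (i in seq_len(nrow(grid))) {
  fit = randomForest(yield ~ ., data = calibration, ntree = 700,
                     mtry = grid$mtry[i], nodesize = grid$nodesize[i])
  grid$oob_mse[i] = tail(fit$mse, 1)   # OOB mean squared error after all ntree trees
}

grid[which.min(grid$oob_mse), ]        # parameter pair with the lowest OOB error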

3. Results

Correlation analysis showed that there is a strong correlation between chlorophyll a
and turbidity, but both were still included in the random forest model.
The random forest model was able to capture the yield variability of R. philippinarum
with r²(train) = 0.99 on the train set, r²(OOB) = 0.93 on the out-of-bag elements and r²(test)
= 0.74 on the test set, indicating that the model predicts unseen data well.
4. Assessment of the Model
The paper used two methods that differ from the usual spatial analysis techniques:
hierarchical clustering for the correlation analysis and random forest for the
predictive modelling.
Random forest models have been studied and compared with other methodologies.
Their advantages and disadvantages are discussed in detail below.
i. Advantages
Its predictive ability has been shown to outperform that of other methodologies.
It can capture linear, non-linear and more complex relationships between the dependent and predictor variables.
It can still deliver efficient models even when there is little or no relationship between the dependent and predictor variables.
It performs well even in the presence of highly correlated predictor variables.
It does not require an assumption of normality and hence does not require validation of model assumptions.

Compared to spatial regression analysis, which requires spatial correlation and the
validation of multiple assumptions such as no multicollinearity, constant variance,
normality, independence and linearity, random forest proves to be simpler to execute
yet more efficient. It captures not only the behaviour of the related variables but also
their complex interactions with minimal validation steps; the main remaining question
is how high the predictive ability of the model is.

ii. Disadvantages
Specific to this study, the random forest tends to overestimate in some areas with low yield, i.e., areas with a low value of the dependent variable.
Finding the optimal parameter values may be challenging due to the large number of possible parameter combinations.
There is no closed-form representation of the model that shows how the variables interact, unlike spatial regression, which can be represented as Y = Xβ + ε.

III. CONCLUSION
Spatial data analysis is now being widely explored, and researchers from fields such as Biology,
Ecology and Fisheries, as well as other disciplines that work with spatial data, are beginning to shift to or
adopt new techniques. In recent years these have included machine learning techniques such as
decision trees, random forests and artificial neural networks. This study used a
random forest model. Compared to classical regression models, the random forest proved to
be a simple yet efficient predictive model: it does not assume normality or linearity but still
provides good predictions on unseen data. Before any modelling exercise, the pros and cons
of spatial regression techniques versus machine learning techniques should be weighed
properly.

IV. SIMULATION STUDY


1. Data Simulation Set up
Data simulation was performed in Excel using the ranges of values described in the table
below:

Variable   Range
Yield      500 to 1500 (based on p. 6 of the study)
Sand       assumed to range from 20% to 50%
DO         0 to 10 mg/L, but only 6 to 10 mg/L is ideal
Sal        varies from 32 to 37, with an average of 35
Speed      10 to 20 ft per hour
Chla       less than 3 is optimal, 3 to 7 is good, 7 to 15 is not desirable
Turb       around 10
RT         44.3 ± 12.2 days for the Eulerian residence time
T          33.8 degrees F
WD         the Venice lagoon has a mean depth of 1.5 m
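
Although the simulation for this paper was done in Excel, an equivalent draw can be sketched in R. The independent draws and the spreads chosen for Turb, RT, T and WD below are assumptions made only for illustration and may differ from the Excel set-up.

# Minimal sketch of an equivalent simulation in R (assumed independent draws
# within the stated ranges; not the exact Excel set-up used for this paper)
set.seed(276)
N = 2539

sim = data.frame(
  Sand        = runif(N, 20, 50),       # percentage of sand in the sediment
  DO          = runif(N, 6, 10),        # mg/L, within the ideal range
  Salinity    = runif(N, 32, 37),
  Water_Speed = runif(N, 10, 20),       # ft per hour
  Chlorophyla = runif(N, 0, 15),
  Turbidity   = rnorm(N, 10, 2),        # "around 10"; spread is an assumption
  Resid_time  = rnorm(N, 44.3, 12.2),   # Eulerian residence time in days
  Water_temp  = rnorm(N, 33.8, 1),      # degrees F; spread is an assumption
  Water_depth = runif(N, 0.5, 2.5),     # metres, around the 1.5 m mean depth
  yield       = runif(N, 500, 1500)
)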

2. Random Forest Model


The modelling process started with data simulation and data manipulation, which
included randomly splitting the data into a calibration data set and a validation data
set. A random forest model was then developed using the calibration data set and
tested using the validation data set.
Correlation analysis showed that chlorophyll a and turbidity are highly correlated, with a
correlation of 0.95, which is aligned with the study. Furthermore, the model was able to
explain the variability of the yield well, with an R-squared of 0.97.
To test the ability of the model to predict unseen data, the Mean Absolute Percent Error
(MAPE) was computed between the predicted and actual yield values in the
validation data set. A MAPE of 0.00033 is evidence that the random forest model did a
great job of capturing the relationship between the yield and the nine environmental factors.

Figure 1. Variable importance in the random forest model.

R Codes for Random Forest Modelling


# Extract data
data = read.csv("~/Documents/School/GIS/GIS DAta.csv")
colnames(data) = c("Site", "Si", "Sj", "Sand", "DO", "Salinity", "Water_Speed",
                   "Chlorophyla", "Turbidity", "Resid_time", "Water_temp", "Water_depth", "yield")

# Retain only the predictors and the dependent variable
newdata = data[, -1:-3]

# Validate that chlorophyll a and turbidity are correlated
corr_analysis = cor(newdata$Chlorophyla, newdata$Turbidity)

# Random forest execution
require(randomForest)
N = nrow(newdata)        # total number of observations
n_train = 1698           # number of observations in the calibration data set

# Randomly split into calibration and validation data sets
set.seed(276)
train_idx = sample(N, n_train)
calibration = newdata[train_idx, ]
validation  = newdata[-train_idx, ]
n_test = nrow(validation)

# Fit the random forest with the chosen parameter values
rf_model = randomForest(yield ~ ., data = calibration, ntree = 700, mtry = 3,
                        nodesize = 5, importance = TRUE)
importance(rf_model, type = 1)        # permutation variable importance
varImpPlot(rf_model, sort = TRUE)     # Figure 1
mean(rf_model$rsq)                    # average out-of-bag R-squared

# Compute MAPE on the validation data set
predictions = predict(rf_model, newdata = validation)
yield = validation$yield
MAPE = mean(abs(yield - predictions) / yield) * 100
MAPE

LINKS TO REFERENCES
http://www.vertexwaterfeatures.com/pond-water-chemistry
https://en.wikipedia.org/wiki/Lake_retention_time
http://europeforvisitors.com/venice/articles/venetian_lagoon.htm
https://sites.google.com/site/shyfem/application-1/lagoons/venice
