Angelita P. Tobias
I. SUMMARY
The study applied a machine learning technique to model the yield and predict the spatial
distribution of the potential yield of the Manila clam. The Manila clam, scientifically known as
Ruditapes philippinarum, is an aquaculture species that is widely grown in the Venice lagoon
(Italy). The model was built to support management of the species by identifying suitable
production and harvestable grounds. The overall approach combined the Random Forest (RF)
algorithm and Geographic Information Systems to develop a predictive model that captures the
complex relationship between yield and the nine environmental factors that are known to affect
the yield or abundance of the clam.
A total of 2,539 observations were collected and split into a calibration set and a validation
set, commonly known as the train and test sets respectively. The train set with n = 1,698
was used to build the random forest model and the test set with n = 841 was used to validate
the accuracy of the model in predicting unseen data. Before building the RF model, correlation
analysis was conducted among the environmental factors through hierarchical clustering
using squared Spearman correlation as the measure of similarity. The results show that the RF
model explained the variability of the yield of R. philippinarum well, with r² (train) = 0.99 and
r² (test) = 0.74.
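The two r² values above summarize how closely predicted yield matches observed yield on each set. The source does not state which definition of r² the study used; the minimal sketch below assumes the coefficient of determination, 1 - SS_res / SS_tot. The function name r_squared is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equally long")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared prediction errors
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: squared deviations of the observations from their mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; r-squared is undefined")
    return 1.0 - ss_res / ss_tot
```

Under this definition, a model that always predicts the mean of the observations scores 0 and a perfect model scores 1.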
II. DISCUSSION OF MATERIAL
1. Description of the data
The data set was collected in 2007 and consists of yield data and nine (9) environmental
factors that are known to affect the abundance of the Manila clam. These nine environmental
factors are:
Percentage of sand in the sediment (Sand)
Dissolved oxygen (DO)
Salinity (Sal)
Water Speed (Speed)
Chlorophyll a (Chla)
Turbidity (Turb)
Residence time (RT)
Water temperature (T)
Water depth (WD)
Information about yield is available for only ~65% of the production area, and since
yield data are usually divided into farming and fishing areas, biomass data of R.
philippinarum gathered from different studies were used as a proxy indicator to derive
the spatial distribution of yield.
Moreover, data on the predictor variables were collected from different sources. Aside
from the scattered data sources, missing observations were another challenge before
modelling. The available observations were therefore interpolated through ordinary kriging
onto the chosen grid (100 m x 100 m cells) using the gstat package in R.
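To illustrate what ordinary kriging does at a single grid cell, here is a minimal sketch in plain Python rather than the gstat package the study used. It solves the ordinary kriging system for the weights, which are constrained to sum to one, and returns the weighted average of the observed values. The exponential variogram and its sill, range and nugget values are assumptions made for illustration, as are the sample points. The study's fitted variogram is not given in this review.

```python
import math

def exp_variogram(h, sill=1.0, rng=500.0, nugget=0.0):
    """Exponential semivariogram (assumed model and parameters); gamma(0) = 0."""
    if h == 0:
        return 0.0
    return nugget + sill * (1.0 - math.exp(-h / rng))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ordinary_kriging(points, values, target, variogram=exp_variogram):
    """Return (estimate, weights) at `target` from observations at `points`."""
    n = len(points)
    # Left-hand side: pairwise semivariances, bordered by the sum-to-one constraint
    a = [[variogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Right-hand side: semivariances between each observation and the target
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    weights = solve(a, b)[:n]  # the last unknown is the Lagrange multiplier
    return sum(w * z for w, z in zip(weights, values)), weights

# Made-up observations at the corners of one 100 m x 100 m cell
pts = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
vals = [10.0, 12.0, 14.0, 16.0]
estimate, weights = ordinary_kriging(pts, vals, (50.0, 50.0))
```

Because the target sits at the centre of the square, the four observations receive equal weight and the estimate is their plain average. With a zero nugget, kriging at an observed location returns the observed value.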
The data set consists of 2,539 cells and was randomly split into a train set (n = 1,698) and
a test set (n = 841). The relationship between yield and the nine environmental factors
was modelled using the train set and the model was validated using the test set.
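The random split described above can be sketched as follows. The set sizes come from the text; the function name split_cells, the integer cell identifiers and the seed are hypothetical, and the review does not say how the study drew its split.

```python
import random

def split_cells(cell_ids, n_train, seed=0):
    """Shuffle the cell identifiers and cut them into train and test lists."""
    shuffled = list(cell_ids)
    if not 0 <= n_train <= len(shuffled):
        raise ValueError("n_train must be between 0 and the number of cells")
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]

# 2,539 grid cells split into 1,698 train cells and 841 test cells
train_ids, test_ids = split_cells(range(2539), n_train=1698)
```

This amounts to roughly a two-thirds to one-third split, with every cell falling in exactly one of the two sets.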
2. Methodology
i. Hierarchical Clustering
In classical regression, such as simple and multiple linear regression, correlation
analysis is one of the initial steps before proceeding with model building. In
this study, correlation analysis was conducted through hierarchical clustering
using the squared Spearman correlation (ρ²) as the similarity measure. In practice,
hierarchical clustering is, alongside k-means clustering, a common clustering
technique. It is characterized by a tree-like structure and can cluster in two
ways: agglomerative or divisive. The agglomerative approach starts with each
object representing an individual cluster. These clusters are then sequentially
merged according to their similarity. The divisive approach starts with all
objects merged into a single cluster, which is then gradually split up based on
dissimilarity.
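The agglomerative approach with squared Spearman correlation as the similarity can be sketched as below. The variables themselves are clustered, so two factors that rise or fall together monotonically are merged early. Average linkage is an assumption, since the review does not state the linkage rule the study used, and the toy values are made up for illustration.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation (assumes neither input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_sq(x, y):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    return pearson(ranks(x), ranks(y)) ** 2

def agglomerate(data):
    """Merge variables bottom-up; returns a list of (cluster_a, cluster_b, similarity)."""
    names = list(data)
    sim = {(a, b): spearman_sq(data[a], data[b])
           for a in names for b in names if a != b}
    clusters = [(name,) for name in names]  # start: one cluster per variable
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage (assumed): mean pairwise similarity
                pairs = [sim[(a, b)] for a in clusters[i] for b in clusters[j]]
                score = sum(pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Made-up values: DO falls steadily as Sal rises, Turb is unrelated to both
toy = {
    "Sal": [30, 31, 33, 35, 36],
    "DO": [8.0, 7.5, 7.1, 6.0, 5.5],
    "Turb": [3, 9, 1, 7, 5],
}
merges = agglomerate(toy)
```

In this toy example Sal and DO merge first because their ranks are perfectly (inversely) related, while Turb joins only at a much lower similarity.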
ii. Random Forest
To investigate the relationship between the site-specific yield of the Manila clam and
the nine environmental factors, the random forest algorithm was applied to the data.
The Random Forest algorithm is a machine learning technique that generates
predictions based on a set of unique decision trees built from bootstrapped
training samples. It can be implemented in R using the built in package
After parameter tuning yields the optimal random forest, the model can then be
used to predict the observations in the validation set, checking its predictive
ability and adequacy before it is used for further predictive work.
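The idea behind the algorithm can be sketched in plain Python: each tree is grown on a bootstrap sample, each split considers only a random subset of the features, and the forest prediction is the average over the trees. This is an illustrative toy and not the study's tuned R model. All function names, parameter values and the synthetic data below are assumptions.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(rows, targets, rng, n_try, max_depth=4, min_size=5):
    """Grow a regression tree; a leaf is a float, a split is (feature, threshold, left, right)."""
    mean = sum(targets) / len(targets)
    if max_depth == 0 or len(rows) < 2 * min_size:
        return mean
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):  # random feature subset
        for thr in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < thr]
            right = [t for r, t in zip(rows, targets) if r[f] >= thr]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, thr)
    if best is None:
        return mean
    _, f, thr = best
    left = [(r, t) for r, t in zip(rows, targets) if r[f] < thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] >= thr]
    return (f, thr,
            build_tree([r for r, _ in left], [t for _, t in left],
                       rng, n_try, max_depth - 1, min_size),
            build_tree([r for r, _ in right], [t for _, t in right],
                       rng, n_try, max_depth - 1, min_size))

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] < thr else right
    return node

def fit_forest(rows, targets, n_trees=25, n_try=1, seed=0):
    """Fit each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], rng, n_try))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)

# Synthetic data: the target depends only on the first of three features
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(3)] for _ in range(150)]
targets = [10.0 if r[0] > 0.5 else 0.0 for r in rows]
forest = fit_forest(rows, targets, n_trees=25, n_try=2)
high = predict_forest(forest, [0.9, 0.5, 0.5])
low = predict_forest(forest, [0.1, 0.5, 0.5])
```

Predicting the validation rows with predict_forest and comparing them with the observed values is the validation step described above.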
3. Results
variables.
It does not require the assumption of normality and hence does not require
validating that assumption for the model.
ii. Disadvantages
Specific to the study, random forest tends to overestimate in some areas
with low yield, that is, areas with low values of the dependent variable.
III. CONCLUSION
Spatial data analysis is now being explored, and researchers from fields such as Biology,
Ecology and Fisheries that work with spatial data are beginning to shift to or adopt new
techniques. In recent years, these have included machine learning techniques such as decision
trees, random forests and artificial neural networks. This study used a random forest model.
Compared to classical regression models, random forest proved to be a simple yet efficient
predictive model. It does not assume normality or linearity but still provides good predictions
on unseen data. Before any modelling exercise, the pros and cons of spatial regression
techniques and machine learning techniques should be weighed properly.
Range
500 to 1500 (based on p6 of the study)
Sand
DO
Sal
Speed
Chla
Turb
RT
T
WD
10 to 20 ft per hour
Less than 3 is optimal, 3-7 is good and 7-15 not desirable
around 10
44.3 ± 12.2 days for the Eulerian residence time
33.8 degree F
the Venice lagoon has a mean depth of 1.5 meters
LINKS TO REFERENCES
http://www.vertexwaterfeatures.com/pond-water-chemistry
https://en.wikipedia.org/wiki/Lake_retention_time
http://europeforvisitors.com/venice/articles/venetian_lagoon.htm
https://sites.google.com/site/shyfem/application-1/lagoons/venice