
Demystifying Random Forest
Tushar Chand Kapoor
Mar 3 · 7 min read

Content Contributors: Mehak Parashar, Syed Ikram, Tushar Chand Kapoor

Odds are high that the camera application you use for face recognition relies on Random Forest. Any guess what the bank you visited this morning uses to detect fraud? You guessed it right! Random Forests have become a popular technique for building predictive models. In this post, we will unveil the dynamics of random forest, its applications, and how you can build your own spam detector with random forests in Python using scikit-learn. So let's start, shall we?


The big question — What is the Random Forest?

Let's first understand this with an example before getting into a formal definition. Suppose it's time for dinner and your mother asks you and your sibling what she should cook. To get an answer from both of you, she plays a yes-and-no game. She starts by asking questions like: should the dinner include curry? She then investigates further: should it contain vegetables? By asking questions in this decisive manner, she tries different combinations and, as a result of these questions, is able to reach a conclusion about the final dish to prepare. Random forest works in a similar manner: you and your sibling are trees in a forest, your food preferences are different samples of data, and the dish is the decision the random forest makes.

Formally, Random Forest is a powerful ensemble method for building predictive models. It constructs a large number of de-correlated trees and averages their outputs, which reduces the variance and mitigates overfitting. It is one of the most widely used algorithms for both classification and regression.

Why is Random Forest a good predictive model?

The reason behind random forest's success lies in its simplicity. Random forest is a modification of the 'bagging' technique: instead of building a single decision tree as the predictive model on the entire data set, it builds a collection of de-correlated trees using a random selection of features and data points and then averages them, which reduces the variance. Usually, the deeper the trees, the lower the bias, as shown below.


Since all the trees are trained independently, when a data point d arrives it is sent to every tree in the forest. We then take the average of the class probabilities obtained from all the trees to predict the class of that point in the test set. It can be represented as follows:
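One common way to write this averaging (assuming the forest has T trees and p_t denotes the class-probability estimate of tree t) is:

p(c | d) = (1 / T) * Σ_{t=1..T} p_t(c | d)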


Before getting into the algorithm of Random Forest, let's understand how the split works in the decision trees inside the Random Forest.

Say we have a large collection of SMS messages. How likely is a message to be spam or not spam? Trees need a splitting criterion to separate these two classes. Imagine the SMS messages as points in a 2-D plane, where each point depicts an SMS as spam (orange) or not spam (blue). As each tree is built, we need to come up with nodes that split the points into two classes, producing a homogeneous distribution on each side of the split. This is done by taking some random data points and selecting the split threshold (an x or y coordinate) that maximizes the information gain. These splits can be vertical, horizontal or linear depending on the distribution of the data. In the figure below, the information gain is higher for the vertical split than for the horizontal split, and the histograms depict the same.
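To make this concrete, here is a small sketch (illustrative, not the article's notebook) that scores a candidate vertical and a candidate horizontal split of 2-D points by entropy-based information gain; the points, labels and thresholds are made-up examples:

import numpy as np

def entropy(labels):
    """Entropy of a label array (0 = pure, higher = more mixed)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(X, y, feature, threshold):
    """Gain from splitting the points on X[:, feature] <= threshold."""
    left = y[X[:, feature] <= threshold]
    right = y[X[:, feature] > threshold]
    n = len(y)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - children

# Toy data: the x-coordinate separates spam (1) from not spam (0) fairly well.
X = np.array([[1.0, 3.0], [1.5, 1.0], [2.0, 2.5], [6.0, 2.0], [7.0, 3.5], [6.5, 1.5]])
y = np.array([0, 0, 0, 1, 1, 1])

print(information_gain(X, y, feature=0, threshold=4.0))  # vertical split: higher gain
print(information_gain(X, y, feature=1, threshold=2.2))  # horizontal split: lower gain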


Algorithm:
Now let's see how the individual trees are built in a random forest:

1. For each individual tree, select a training sample Z* of size N at random from the training data using bootstrap sampling, i.e. drawing data points with replacement.

2. At each node of the tree, pick 'm' feature variables at random from the full set of 'p' feature variables. The best variable/split is chosen among these 'm' variables on the training sample. The decision at each node uses this split threshold, until a data point reaches a terminal node, which predicts its class.

3. The rest of the training data (the samples left out of Z*) is used as a validation set to calculate the error of the tree.


Finally, the predictions are aggregated across the ensemble by letting each tree in the forest vote, which gives a prediction with low variance and de-correlated output.
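As a rough sketch of how these steps fit together (not the library's internal implementation; it assumes X and y are NumPy arrays with integer class labels and leans on scikit-learn's DecisionTreeClassifier, whose max_features option performs the per-node feature subsetting from step 2):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=31, random_state=100):
    """Grow de-correlated trees: bootstrap the rows (step 1), subsample features per node (step 2)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.RandomState(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), size=len(X))             # bootstrap sample Z*, with replacement
        tree = DecisionTreeClassifier(max_features="sqrt",    # pick ~sqrt(p) features at each node
                                      random_state=rng.randint(1 << 30))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Aggregate the ensemble by majority vote (assumes non-negative integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees])           # shape (n_trees, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)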

Parameter Importance
To improve the confidence and predictions of a random forest, it is important to fine-tune its parameters.

• Maximum Depth — If the value of the maximum depth parameter D is large, the model tends to suffer from overfitting, whereas a small value leaves the trees too shallow, resulting in under-fitting. Therefore, one has to select an optimal value of D carefully.

random_forest_classifier = RandomForestClassifier(n_estimators=31, random_state=100, max_depth=D)

The chart below shows the effect of taking different values of D (depth) ranging from 10 to 200. As you can see, in our case a good increase in accuracy is attained when the depth of the trees reaches around 80, after which the graph stabilizes with minor up/down variation.
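A sketch of how such a depth sweep could be run (X and y are assumed to be the prepared feature matrix and labels; the grid of depths and the train/test split are illustrative assumptions, not the article's exact notebook):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out a test set once, then sweep the maximum depth D.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

accuracy_by_depth = {}
for D in range(10, 201, 10):
    clf = RandomForestClassifier(n_estimators=31, random_state=100, max_depth=D)
    clf.fit(X_train, y_train)
    accuracy_by_depth[D] = accuracy_score(y_test, clf.predict(X_test))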


• Number of trees — Generally, the more trees, the better the accuracy. There is no penalty for using more trees other than the extra computational cost. One way of choosing the number of trees (T) is to look at the out-of-bag error estimate as trees are added to the forest and to stop once the error rate starts plateauing.

random_forest_classifier = RandomForestClassifier(n_estimators=T, random_state=100)
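One way to monitor the out-of-bag error as trees are added, sketched here with scikit-learn's warm_start and oob_score options (X and y are assumed to be the training data):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(warm_start=True, oob_score=True, bootstrap=True, random_state=100)

oob_error_by_T = {}
for T in range(10, 201, 10):
    clf.set_params(n_estimators=T)            # warm_start keeps the trees already grown, only new ones are added
    clf.fit(X, y)
    oob_error_by_T[T] = 1.0 - clf.oob_score_  # stop adding trees once this plateaus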

• Max Features — This parameter specifies the size of the random subset of features considered when looking for the best split at each node. Increasing max features improves performance on the training set but introduces correlation between the trees, which we are trying to avoid. The optimal value (F) is usually picked as the square root of the total number of features, i.e. sqrt(n_features), which gives a good balance of performance, de-correlation between the trees and reduction of variance.

random_forest_classifier = RandomForestClassifier(n_estimators=31, random_state=100, max_features=F)

The chart below shows the effect of taking different values of T (number of trees) ranging from 2 to 200, for different feature-subsampling criteria. As you can see, taking sqrt(n_features) outperforms the other two in model accuracy. A general trend visible in all three cases is that, after a certain point, the graph becomes a straight line, signifying that adding more trees has no further effect other than increasing the training time of the model.
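A sketch of how the three curves could be generated, assuming the same X and y and 5-fold cross-validated accuracy ('sqrt', 'log2' and None are scikit-learn's built-in max_features choices; the grids are illustrative assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

accuracy_curves = {}
for F in ["sqrt", "log2", None]:              # None means every feature is considered at each split
    curve = []
    for T in range(2, 201, 10):
        clf = RandomForestClassifier(n_estimators=T, random_state=100, max_features=F)
        curve.append((T, cross_val_score(clf, X, y, cv=5).mean()))
    accuracy_curves[F] = curve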


• Split Criterion — Since the algorithm depends heavily on the splits it performs, close attention should be paid to the quality of those splits. We have two split criteria, as follows:

1. GINI Split — Here the GINI index is used as the metric for the cost function in classification, as in the CART (Classification and Regression Trees) algorithm, and it is used to evaluate candidate splits of the dataset. It gives an idea of how good or bad a split is at every node Q of the tree, and the split with the best GINIsplit value is chosen. The equations below are used to determine the split with the best GINIsplit.

2. Entropy — It signifies the degree of impurity of the distribution. Entropy is maximum when the records are equally distributed among all classes and minimum when all records belong to one class. The equations below are used to determine the split with the best GAINsplit.


where,
 l_i = number of data points at child node i produced by the split of Q.
 l = number of data points at node Q.
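For reference, the standard forms of these criteria (a reconstruction under the definitions above, with p_j the fraction of class-j records at a node and k the number of child nodes produced by the split) are:

GINI(Q) = 1 - Σ_j p_j², GINI_split = Σ_{i=1..k} (l_i / l) * GINI(i)

Entropy(Q) = - Σ_j p_j * log2(p_j), GAIN_split = Entropy(Q) - Σ_{i=1..k} (l_i / l) * Entropy(i)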

Summary
In summary, Random Forest is a powerful ensemble technique which
can be used for solving both classification and regression problems.
This algorithm generates random trees that are trained on different sub-samples of the data, which de-correlates them and gives an efficient model. Furthermore, when used with well-tuned parameters it can save large amounts of computation power and time.

References
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics, Springer, USA, 2008.

. . .

How to build a Spam Detector Classifier using Random Forest in Python with scikit-learn


Below is the code to train the model and generate the graphs used for parameter tuning. Also, here is the link to the Jupyter Notebook for the same.

The dataset can be downloaded by following this link.
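As a minimal stand-alone sketch of the same pipeline (assuming a CSV file named spam.csv with 'label' and 'message' columns, which is one common layout of the SMS spam dataset; the file name, column names and TF-IDF settings here are illustrative assumptions, not the notebook's exact code):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the SMS data: one column with the label (spam/ham), one with the message text.
data = pd.read_csv("spam.csv", encoding="latin-1")[["label", "message"]]
X_text, y = data["message"], (data["label"] == "spam").astype(int)

# Turn raw text into numeric features the forest can split on.
vectorizer = TfidfVectorizer(stop_words="english", max_features=3000)
X = vectorizer.fit_transform(X_text)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

clf = RandomForestClassifier(n_estimators=31, random_state=100, max_features="sqrt")
clf.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))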
