
Demystifying Random Forest
Tushar Chand Kapoor
Mar 3 · 7 min read

Content Contributors: Mehak Parashar, Syed Ikram, Tushar Chand Kapoor

Odds are high that the camera application you use for face recognition relies on Random Forest. Any guess what the bank you visited this morning uses to detect fraud? You guessed it right! Random Forests have become a popular technique for building predictive models. In this post, we will unveil the dynamics of random forest, its applications, and how you can build your own spam detector with random forests in Python using scikit-learn. So let's start, shall we?


The big question — What is the Random Forest?

Let's first understand this with an example before getting into a formal definition. Suppose it's time for dinner and your mother asks you and your sibling what she should cook. To get an answer from both of you, she plays a yes-and-no game. She starts by asking questions like: should the dinner include curry? She then investigates further: should it contain vegetables? By asking questions in this decisive manner, she tries different combinations and, as a result of these questions, is able to reach a conclusion about the final dish to prepare. Random forest works in a similar manner: you and your sibling are trees in a forest, your food preferences are different samples of data, and the dish is the decision the random forest makes.

Formally, Random Forest is a powerful ensemble method for building predictive models. It constructs a large number of de-correlated trees and averages their outputs, which reduces the variance and mitigates overfitting. It is one of the most widely used algorithms for both classification and regression.

Why is Random Forest a good predictive model?

The reason behind random forest's success lies in its simplicity. Random forest is a modification of the 'bagging' technique: instead of building a single decision tree as the predictive model on the entire data set, it builds a collection of de-correlated trees using a random selection of features and data points and then averages them, which reduces the variance. Usually, the deeper the trees, the lower the bias, as shown below.


Since all the trees are trained independently, when a data point d arrives it is sent to every tree in the forest. We then take the average of the class probabilities obtained from all the trees to predict the class of that point in the test set. It can be represented as follows:
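One common way to write this averaging (assuming the forest has T trees and p_t denotes the class-probability estimate of tree t) is:

p(c | d) = (1 / T) * Σ_{t=1..T} p_t(c | d)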


Before getting into the algorithm of Random Forest, let's understand how the split works in the decision trees inside the Random Forest.

Say we have a large collection of SMS messages. How likely is a message to be spam or not spam? Trees need a splitting criterion to separate these two classes. Imagine the SMS messages as points in a 2-D plane, where each point depicts an SMS as spam (orange) or not spam (blue). As each tree is built, we need to come up with nodes that split the points into two classes, producing a homogeneous distribution on each side of the split. This is done by taking some random data points and selecting the split threshold (an x or y coordinate) that maximizes the information gain. These splits can be vertical, horizontal or linear depending on the distribution of the data. In the figure below, the information gain is higher for the vertical split than for the horizontal split, and the histograms depict the same.
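To make this concrete, here is a small sketch (illustrative, not the article's notebook) that scores a candidate vertical and a candidate horizontal split of 2-D points by entropy-based information gain; the points, labels and thresholds are made-up examples:

import numpy as np

def entropy(labels):
    """Entropy of a label array (0 = pure, higher = more mixed)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(X, y, feature, threshold):
    """Gain from splitting the points on X[:, feature] <= threshold."""
    left = y[X[:, feature] <= threshold]
    right = y[X[:, feature] > threshold]
    n = len(y)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - children

# Toy data: the x-coordinate separates spam (1) from not spam (0) fairly well.
X = np.array([[1.0, 3.0], [1.5, 1.0], [2.0, 2.5], [6.0, 2.0], [7.0, 3.5], [6.5, 1.5]])
y = np.array([0, 0, 0, 1, 1, 1])

print(information_gain(X, y, feature=0, threshold=4.0))  # vertical split: higher gain
print(information_gain(X, y, feature=1, threshold=2.2))  # horizontal split: lower gain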


Algorithm:
Now let's see how the individual trees are built in a random forest:

1. For each individual tree, select a training sample Z* of size N at random from the training data using bootstrap sampling, i.e. drawing data points with replacement.

2. At each node of the tree, pick 'm' feature variables at random from the full set of 'p' feature variables. The best variable/split is chosen among these 'm' variables on the training sample. The decision at each node uses this split threshold, until a data point reaches a terminal node, which predicts its class.

3. The rest of the training data (the samples left out of Z*) is used as a validation set to calculate the error of the tree.


Finally, the predictions are aggregated across the ensemble by letting each tree in the forest vote, which gives a prediction with low variance and de-correlated output.
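As a rough sketch of how these steps fit together (not the library's internal implementation; it assumes X and y are NumPy arrays with integer class labels and leans on scikit-learn's DecisionTreeClassifier, whose max_features option performs the per-node feature subsetting from step 2):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=31, random_state=100):
    """Grow de-correlated trees: bootstrap the rows (step 1), subsample features per node (step 2)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.RandomState(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), size=len(X))             # bootstrap sample Z*, with replacement
        tree = DecisionTreeClassifier(max_features="sqrt",    # pick ~sqrt(p) features at each node
                                      random_state=rng.randint(1 << 30))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Aggregate the ensemble by majority vote (assumes non-negative integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees])           # shape (n_trees, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)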

Parameter Importance
To improve the confidence and predictions of a random forest, it is important to fine-tune its parameters.

• Maximum Depth — If the value of the maximum depth parameter D is large, the model tends to suffer from overfitting, whereas a small value leaves the trees too shallow, resulting in under-fitting. Therefore, one has to select an optimal value of D carefully.

random_forest_classifier = RandomForestClassifier(n_estimators=31, random_state=100, max_depth=D)

The chart below shows the effect of taking different values of D (depth) ranging from 10 to 200. As you can see, in our case a good increase in accuracy is attained when the depth of the trees reaches around 80, after which the graph stabilizes with minor up/down variation.
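A sketch of how such a depth sweep could be run (X and y are assumed to be the prepared feature matrix and labels; the grid of depths and the train/test split are illustrative assumptions, not the article's exact notebook):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out a test set once, then sweep the maximum depth D.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

accuracy_by_depth = {}
for D in range(10, 201, 10):
    clf = RandomForestClassifier(n_estimators=31, random_state=100, max_depth=D)
    clf.fit(X_train, y_train)
    accuracy_by_depth[D] = accuracy_score(y_test, clf.predict(X_test))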


• Number of trees — Generally, the more trees, the better the accuracy. There is no penalty for using more trees other than the extra computational cost. One way of choosing the number of trees (T) is to look at the out-of-bag error estimate as trees are added to the forest and to stop once the error rate starts plateauing.

random_forest_classifier = RandomForestClassifier(n_estimators=T, random_state=100)
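One way to monitor the out-of-bag error as trees are added, sketched here with scikit-learn's warm_start and oob_score options (X and y are assumed to be the training data):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(warm_start=True, oob_score=True, bootstrap=True, random_state=100)

oob_error_by_T = {}
for T in range(10, 201, 10):
    clf.set_params(n_estimators=T)            # warm_start keeps the trees already grown, only new ones are added
    clf.fit(X, y)
    oob_error_by_T[T] = 1.0 - clf.oob_score_  # stop adding trees once this plateaus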

• Max Features — This parameter specifies the size of the random subset of features considered when looking for the best split at each node. Increasing max features improves performance on the training set but introduces correlation between the trees, which we are trying to avoid. The optimal value (F) is usually picked as the square root of the total number of features, i.e. sqrt(n_features), which gives a good balance of performance, de-correlation between the trees and reduction of variance.

random_forest_classifier = RandomForestClassifier(n_estimators=31, random_state=100, max_features=F)

The chart below shows the effect of taking different values of T (number of trees) ranging from 2 to 200, for different feature-subsampling criteria. As you can see, taking sqrt(n_features) outperforms the other two in model accuracy. A general trend visible in all three cases is that, after a certain point, the graph becomes a straight line, signifying that adding more trees has no further effect other than increasing the training time of the model.
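A sketch of how the three curves could be generated, assuming the same X and y and 5-fold cross-validated accuracy ('sqrt', 'log2' and None are scikit-learn's built-in max_features choices; the grids are illustrative assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

accuracy_curves = {}
for F in ["sqrt", "log2", None]:              # None means every feature is considered at each split
    curve = []
    for T in range(2, 201, 10):
        clf = RandomForestClassifier(n_estimators=T, random_state=100, max_features=F)
        curve.append((T, cross_val_score(clf, X, y, cv=5).mean()))
    accuracy_curves[F] = curve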


• Split Criterion — Since the algorithm depends heavily on the splits it performs, close attention should be paid to the quality of those splits. We have two split criteria, as follows:

1. GINI Split — Here the GINI index is used as the metric for the cost function in classification, as in the CART (Classification and Regression Trees) algorithm, and it is used to evaluate candidate splits of the dataset. It gives an idea of how good or bad a split is at every node Q of the tree, and the split with the best GINIsplit value is chosen. The equations below are used to determine the split with the best GINIsplit.

2. Entropy — It signifies the degree of impurity of the distribution. Entropy is maximum when the records are equally distributed among all classes and minimum when all records belong to one class. The equations below are used to determine the split with the best GAINsplit.


where,
 l_i = number of data points at child node i produced by the split of Q.
 l = number of data points at node Q.
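For reference, the standard forms of these criteria (a reconstruction under the definitions above, with p_j the fraction of class-j records at a node and k the number of child nodes produced by the split) are:

GINI(Q) = 1 - Σ_j p_j², GINI_split = Σ_{i=1..k} (l_i / l) * GINI(i)

Entropy(Q) = - Σ_j p_j * log2(p_j), GAIN_split = Entropy(Q) - Σ_{i=1..k} (l_i / l) * Entropy(i)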

Summary
In summary, Random Forest is a powerful ensemble technique which
can be used for solving both classification and regression problems.
This algorithm generates random trees that are trained on different sub-samples of the data, which de-correlates them and gives an efficient model. Furthermore, when used with well-tuned parameters it can save large amounts of computation power and time.

References
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics, Springer, USA, 2008.

. . .

How to build a Spam Detector Classifier using Random Forest in Python with scikit-learn


Below is the code to train the model and generate the graphs used for parameter tuning. Also, here is the link to the Jupyter Notebook for the same.

The dataset can be downloaded by following this link.
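As a minimal stand-alone sketch of the same pipeline (assuming a CSV file named spam.csv with 'label' and 'message' columns, which is one common layout of the SMS spam dataset; the file name, column names and TF-IDF settings here are illustrative assumptions, not the notebook's exact code):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the SMS data: one column with the label (spam/ham), one with the message text.
data = pd.read_csv("spam.csv", encoding="latin-1")[["label", "message"]]
X_text, y = data["message"], (data["label"] == "spam").astype(int)

# Turn raw text into numeric features the forest can split on.
vectorizer = TfidfVectorizer(stop_words="english", max_features=3000)
X = vectorizer.fit_transform(X_text)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

clf = RandomForestClassifier(n_estimators=31, random_state=100, max_features="sqrt")
clf.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))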
