https://medium.com/sfu-big-data/demystifying-random-forest-1ed89e335fb3 1/14
29/03/2019 Demystifying Random Forest – SFU Big Data Science – Medium
Demystifying Random Forest
Tushar Chand Kapoor
Mar 3 · 7 min read
Odds are high that the camera application you use for face recognition relies on Random Forest. And any guess how the bank you visited this morning detects fraud? You guessed it right! Random Forests have become a popular technique for building predictive models. In this post, we will unveil the dynamics of random forests and their applications, and then show how you can build your own spam detector with random forests in Python using scikit-learn. So let's start, shall we?
The reason for the random forest's success lies in its simplicity. Random forest is a modification of the 'bagging' technique, in which many decision trees are each built on a bootstrap sample of the data set and their predictions are combined. Random Forest builds a collection of de-correlated trees by using a random selection of features as well as data points, and then averages them, which reduces the variance. Usually, the deeper the trees, the lower the bias, as shown below.
Since all the trees are trained independently, when a test point 'd' arrives it is sent through every tree in the forest. We then take the average of the class probabilities obtained from all the trees to predict the class of the point. This can be represented as follows:
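A standard way to write this averaging step (this is the usual formulation, assuming each of the T trees outputs a class-probability estimate p_t(c | v) for a test point v; the notation is not reproduced from the original figure):

```latex
p(c \mid v) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid v),
\qquad
\hat{c} = \arg\max_{c} \, p(c \mid v)
```

That is, the forest's probability for class c is the mean of the per-tree probabilities, and the predicted class is the one with the highest average.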
Algorithm:
Now let's see how the individual trees are built in the random forest.
1. Draw a bootstrap sample from the training data set.
2. For each node of the tree, pick 'm' feature variables at random from the set of 'p' feature variables. The best variable/split is chosen using only these 'm' variables. The decision at each node is made using this split threshold, until the data point reaches a terminal node carrying the prediction of its class.
3. The rest of the training sample, left out of the bootstrap sample, is used as a validation set to calculate the error of the tree.
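The steps above map directly onto scikit-learn's `RandomForestClassifier` (a minimal sketch with illustrative parameter values and synthetic data, not the original notebook): `bootstrap=True` draws the per-tree sample of step 1, `max_features` controls 'm' in step 2, and `oob_score=True` uses the held-out points of step 3 as a built-in validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data; the original post uses a spam data set.
X, y = make_classification(n_samples=500, n_features=20, random_state=100)

forest = RandomForestClassifier(
    n_estimators=31,      # number of trees in the forest
    max_features="sqrt",  # 'm' features tried at each split
    bootstrap=True,       # each tree trains on a bootstrap sample
    oob_score=True,       # left-out ("out-of-bag") points act as a validation set
    random_state=100,
)
forest.fit(X, y)

print(forest.oob_score_)  # out-of-bag accuracy estimate
```

The out-of-bag score gives an error estimate without a separate validation split, exactly because each tree never sees the points left out of its bootstrap sample.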
Parameter Importance
To improve the accuracy and reliability of a random forest, it is important to fine-tune its parameters.
random_forest_classifier = RandomForestClassifier(n_estimators=31, random_state=100, max_depth=D)
The chart below shows the effect of varying D (the tree depth) from 10 to 200. As you can see, in our case accuracy rises sharply until the depth reaches about 80, after which the graph stabilises with only minor up/down variation.
random_forest_classifier = RandomForestClassifier(n_estimators=T, random_state=100)
random_forest_classifier = RandomForestClassifier(n_estimators=31, random_state=100, max_features=F)
The chart below shows the effect of varying T (the number of trees) from 2 to 200 under different feature-subsampling criteria. As you can see, taking sqrt(n_features) outperforms the other two in terms of model accuracy. A general trend in all three cases is that beyond a certain point the curve flattens into a straight line, signifying that adding more trees has no further effect on accuracy and only increases the computation time of training the model.
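The trend described above can be reproduced with a loop like the following (a minimal sketch; the data set, parameter ranges, and plotting code in the original notebook may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data in place of the original spam data set.
X, y = make_classification(n_samples=1000, n_features=30, random_state=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

scores = {}
for max_features in ("sqrt", "log2", None):   # the three subsampling criteria
    for n_trees in (2, 10, 50, 100, 200):     # T ranging over 2-200
        clf = RandomForestClassifier(
            n_estimators=n_trees,
            max_features=max_features,
            random_state=100,
        )
        clf.fit(X_train, y_train)
        scores[(max_features, n_trees)] = clf.score(X_test, y_test)

# Plotting accuracy against n_trees per criterion shows the curves
# flattening once the forest is large enough.
print(scores)
```

`max_features=None` uses all features at every split, which is the plain-bagging baseline the sqrt and log2 criteria are compared against.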
where,
l_i = number of data points at child node i of Q,
l = number of data points at node Q.
Summary
In summary, Random Forest is a powerful ensemble technique that can be used to solve both classification and regression problems. The algorithm grows randomized trees, each trained on a different sub-sample of the data, which de-correlates them and yields an efficient model. Furthermore, with well-tuned parameters it can save large amounts of computation power and time.
References
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics, USA (2008).
. . .
Below is the code to train the model and generate the graphs for parameter tuning. Also, here is the link to the Jupyter Notebook for the same.
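The notebook itself is not reproduced here, but the spam detector promised in the introduction can be sketched along these lines (the toy messages and labels below are illustrative stand-ins, not the original data set):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; the original notebook trains on a real spam data set.
messages = [
    "win a free prize now", "lowest price guaranteed, click here",
    "are we still meeting for lunch", "see you at the seminar tomorrow",
    "free entry in a weekly draw", "can you review my report today",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham

spam_detector = make_pipeline(
    CountVectorizer(),  # bag-of-words features from the raw text
    RandomForestClassifier(n_estimators=31, random_state=100),
)
spam_detector.fit(messages, labels)

print(spam_detector.predict(["claim your free prize"]))
```

The pipeline turns each message into word counts and feeds them to the forest, so new messages can be classified directly from raw text.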