
Practical guide to implement machine learning with CARET package in R (with practice problem)
Introduction
One of the biggest challenges beginners in machine learning face is deciding which algorithms to learn and focus
on. In the case of R, the problem gets accentuated by the fact that various algorithms have different
syntax, different parameters to tune and different requirements on the data format. This can be too
much for a beginner.
So, how do you transform from a beginner to a data scientist building hundreds of models and
stacking them together? There certainly isn't any shortcut, but what I'll tell you today will make you
capable of applying hundreds of machine learning models without having to:
remember the different package names for each algorithm.
learn the syntax for applying each algorithm.
memorize the parameters to tune for each algorithm.

All this has been made possible by the years of effort that have gone into CARET (Classification
And REgression Training), which is one of the biggest projects in R. This package alone is almost all you
need to know for solving most supervised machine learning problems. It provides a uniform
interface to several machine learning algorithms and standardizes various other tasks such as data
splitting, pre-processing, feature selection, variable importance estimation, etc.
To get an in-depth overview of the various functionalities provided by Caret, you can refer to this article.
Today, we'll work on the Loan Prediction III problem to show you the power of the Caret package.
P.S. While Caret definitely simplifies the job to a degree, it cannot take away the hard work and
practice you need to put in to become a master at machine learning.

Table of Contents
1. Getting started
2. Pre-processing using Caret
3. Splitting the data using Caret
4. Feature selection using Caret
5. Training models using Caret
6. Parameter tuning using Caret
7. Variable importance estimation using Caret
8. Making predictions using Caret

1. Getting started
To put it in simple words, Caret is essentially a wrapper for 200+ machine learning algorithms.
Additionally, it provides several features which make it a one-stop solution for all the modeling needs
of supervised machine learning problems.
Caret tries not to load all the packages it depends upon at the start. Instead, it loads them only when the
packages are needed. But it does assume that you already have the corresponding algorithm packages installed on your
system.
To install Caret on your system, use the following command. Heads up: It might take some time:
> install.packages("caret", dependencies = c("Depends", "Suggests"))

Now, let's get started using the caret package on the Loan Prediction III problem:
#Loading caret package
library("caret")

#Loading training data


train<-read.csv("train_u6lujuX_CVtuZ9i.csv",stringsAsFactors = T)

#Looking at the structure of the training data


str(train)
#'data.frame': 614 obs. of 13 variables:
#$ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
#$ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#$ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#$ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#$ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#$ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#$ ApplicantIncome  : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
#$ CoapplicantIncome: num 0 1508 0 2358 0 ...
#$ LoanAmount       : int NA 128 66 120 141 267 95 158 168 349 ...
#$ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
#$ Credit_History   : int 1 1 1 1 1 1 1 0 1 1 ...
#$ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#$ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

In this problem, we have to predict the Loan Status of a person based on his/her profile.
2. Pre-processing using Caret
We need to pre-process our data before we can use it for modeling. Let's check whether the data has any
missing values:
sum(is.na(train))
#[1] 86

Next, let us use Caret to impute these missing values using the KNN algorithm. We will predict these
missing values based on the other attributes of each row. Also, we'll scale and center the numerical data
using the convenient preProcess() function in Caret.
#Imputing missing values using KNN.Also centering and scaling numerical columns
preProcValues <- preProcess(train, method = c("knnImpute","center","scale"))

library('RANN')
train_processed <- predict(preProcValues, train)
sum(is.na(train_processed))
#[1] 0
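preProcess() also supports several other methods besides knnImpute, center and scale. As a small side sketch (these lines are not part of the original code), median imputation and 0-1 range scaling could be swapped in instead:

#Alternative pre-processing: median imputation plus scaling every numeric column to [0, 1]
preProcValues_alt <- preProcess(train, method = c("medianImpute", "range"))
train_processed_alt <- predict(preProcValues_alt, train)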

It is also very easy to use one-hot encoding in Caret to create dummy variables for each level of a
categorical variable. But first, we'll convert the dependent variable to numeric.
#Converting outcome variable to numeric
train_processed$Loan_Status<-ifelse(train_processed$Loan_Status=='N',0,1)

id<-train_processed$Loan_ID
train_processed$Loan_ID<-NULL

#Checking the structure of processed train file


str(train_processed)
#'data.frame': 614 obs. of 12 variables:
#$ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
#$ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
#$ Dependents       : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
#$ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
#$ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
#$ ApplicantIncome  : num 0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
#$ CoapplicantIncome: num -0.554 -0.0387 -0.554 0.2518 -0.554 ...
#$ LoanAmount       : num 0.0162 -0.2151 -0.9395 -0.3086 -0.0632 ...
#$ Loan_Amount_Term : num 0.276 0.276 0.276 0.276 0.276 ...
#$ Credit_History   : num 0.432 0.432 0.432 0.432 0.432 ...
#$ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
#$ Loan_Status      : num 1 0 1 1 1 1 1 0 1 0 ...

Now, creating dummy variables using one hot encoding:


#Converting every categorical variable to numerical using dummy variables
dmy <- dummyVars(" ~ .", data = train_processed,fullRank = T)
train_transformed <- data.frame(predict(dmy, newdata = train_processed))

#Checking the structure of transformed train file


str(train_transformed)
#'data.frame': 614 obs. of 19 variables:
#$ Gender.Female : num 0 0 0 0 0 0 0 0 0 0 ...
#$ Gender.Male : num 1 1 1 1 1 1 1 1 1 1 ...
#$ Married.No : num 1 0 0 0 1 0 0 0 0 0 ...
#$ Married.Yes : num 0 1 1 1 0 1 1 1 1 1 ...
#$ Dependents.0 : num 1 0 1 1 1 0 1 0 0 0 ...
#$ Dependents.1 : num 0 1 0 0 0 0 0 0 0 1 ...
#$ Dependents.2 : num 0 0 0 0 0 1 0 0 1 0 ...
#$ Dependents.3. : num 0 0 0 0 0 0 0 1 0 0 ...
#$ Education.Not.Graduate : num 0 0 0 1 0 0 1 0 0 0 ...
#$ Self_Employed.No : num 1 1 0 1 1 0 1 1 1 1 ...
#$ Self_Employed.Yes : num 0 0 1 0 0 1 0 0 0 0 ...
#$ ApplicantIncome : num 0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
#$ CoapplicantIncome : num -0.554 -0.0387 -0.554 0.2518 -0.554 ...
#$ LoanAmount : num 0.0162 -0.2151 -0.9395 -0.3086 -0.0632 ...
#$ Loan_Amount_Term : num 0.276 0.276 0.276 0.276 0.276 ...
#$ Credit_History : num 0.432 0.432 0.432 0.432 0.432 ...
#$ Property_Area.Semiurban: num 0 0 0 0 0 0 0 1 0 1 ...
#$ Property_Area.Urban : num 1 0 1 1 1 1 1 0 1 0 ...
#$ Loan_Status : num 1 0 1 1 1 1 1 0 1 0 ...

#Converting the dependent variable back to categorical


train_transformed$Loan_Status<-as.factor(train_transformed$Loan_Status)

Here, fullRank = T creates only (n-1) dummy columns for a categorical column with n levels. This
works particularly well for two-level categorical predictors like gender or married (Male/Female, Yes/No),
because a single column can then represent one class with 0 and the other class with 1. See the short
sketch below.
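As a quick illustration of the effect of fullRank (a minimal sketch on a toy two-level factor, not part of the original code):

#A toy two-level factor
g <- data.frame(Gender = factor(c("Male", "Female", "Male")))

#fullRank = TRUE drops the baseline level, leaving a single 0/1 column
predict(dummyVars(~ ., data = g, fullRank = TRUE), newdata = g)

#fullRank = FALSE keeps one column per level
predict(dummyVars(~ ., data = g, fullRank = FALSE), newdata = g)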

3. Splitting data using caret


We'll be creating a cross-validation set from the training set to evaluate our model against. It is
important to rely more on the cross-validation set for the actual evaluation of your model; otherwise you
might end up overfitting the public leaderboard.
We'll use createDataPartition() to split our training data into two sets: 75% and 25%. Since our
outcome variable is categorical in nature, this function will make sure that the distribution of the outcome
variable classes is similar in both sets.
#Splitting training set into two parts based on outcome: 75% and 25%
index <- createDataPartition(train_transformed$Loan_Status, p=0.75, list=FALSE)
trainSet <- train_transformed[ index,]
testSet <- train_transformed[-index,]

#Checking the structure of trainSet


str(trainSet)
#'data.frame': 461 obs. of 19 variables:
#$ Gender.Female : num 0 0 0 0 0 0 0 0 0 0 ...
#$ Gender.Male : num 1 1 1 1 1 1 1 1 1 1 ...
#$ Married.No : num 1 0 0 0 1 0 0 0 0 0 ...
#$ Married.Yes : num 0 1 1 1 0 1 1 1 1 1 ...
#$ Dependents.0 : num 1 0 1 1 1 0 1 0 0 0 ...
#$ Dependents.1 : num 0 1 0 0 0 0 0 0 1 0 ...
#$ Dependents.2 : num 0 0 0 0 0 1 0 0 0 1 ...
#$ Dependents.3. : num 0 0 0 0 0 0 0 1 0 0 ...
#$ Education.Not.Graduate : num 0 0 0 1 0 0 1 0 0 0 ...
#$ Self_Employed.No : num 1 1 0 1 1 0 1 1 1 1 ...
#$ Self_Employed.Yes : num 0 0 1 0 0 1 0 0 0 0 ...
#$ ApplicantIncome : num 0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
#$ CoapplicantIncome : num -0.554 -0.0387 -0.554 0.2518 -0.554 ...
#$ LoanAmount : num 0.0162 -0.2151 -0.9395 -0.3086 -0.0632 ...
#$ Loan_Amount_Term : num 0.276 0.276 0.276 0.276 0.276 ...
#$ Credit_History : num 0.432 0.432 0.432 0.432 0.432 ...
#$ Property_Area.Semiurban: num 0 0 0 0 0 0 0 1 1 0 ...
#$ Property_Area.Urban : num 1 0 1 1 1 1 1 0 0 1 ...
#$ Loan_Status : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 1 1 2 ...

4. Feature selection using Caret


Feature selection is an extremely crucial part of modeling. To understand the importance of feature
selection and the various techniques used for it, I strongly recommend that you go
through my previous article. For now, we'll be using Recursive Feature Elimination, which is a wrapper
method to find the best subset of features to use for modeling.
#Feature selection using rfe in caret
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 3,
verbose = FALSE)
outcomeName<-'Loan_Status'
predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
Loan_Pred_Profile <- rfe(trainSet[,predictors], trainSet[,outcomeName],
rfeControl = control)
Loan_Pred_Profile
#Recursive feature selection
#Outer resampling method: Cross-Validated (10 fold, repeated 3 times)
#Resampling performance over subset size:
# Variables Accuracy  Kappa AccuracySD KappaSD Selected
#         4   0.7737 0.4127    0.03707 0.09962
#         8   0.7874 0.4317    0.03833 0.11168
#        16   0.7903 0.4527    0.04159 0.11526        *
#        18   0.7882 0.4431    0.03615 0.10812
#The top 5 variables (out of 16):
# Credit_History, LoanAmount, Loan_Amount_Term, ApplicantIncome, CoapplicantIncome
#Taking only the top 5 predictors
predictors<-c("Credit_History", "LoanAmount", "Loan_Amount_Term",
"ApplicantIncome", "CoapplicantIncome")
5. Training models using Caret
This is probably the part where Caret stands out from any other available package. It provides the
ability to implement 200+ machine learning algorithms using a consistent syntax. To get a list of all
the algorithms that Caret supports, you can use:
names(getModelInfo())
#[1] "ada" "AdaBag" "AdaBoost.M1"
"adaboost"
#[5] "amdai" "ANFIS" "avNNet"
"awnb"
#[9] "awtan" "bag" "bagEarth"
"bagEarthGCV"
#[13] "bagFDA" "bagFDAGCV" "bam"
"bartMachine"
#[17] "bayesglm" "bdk" "binda"
"blackboost"
#[21] "blasso" "blassoAveraged" "Boruta"
"bridge"
#.
#[205] "svmBoundrangeString" "svmExpoString" "svmLinear"
"svmLinear2"
#[209] "svmLinear3" "svmLinearWeights" "svmLinearWeights2"
"svmPoly"
#[213] "svmRadial" "svmRadialCost" "svmRadialSigma"
"svmRadialWeights"
#[217] "svmSpectrumString" "tan" "tanSearch"
"treebag"
#[221] "vbmpRadial" "vglmAdjCat" "vglmContRatio"
"vglmCumulative"
#[225] "widekernelpls" "WM" "wsrf"
"xgbLinear"
#[229] "xgbTree" "xyf"

To get more details of any model, you can refer here.


We can simply apply a large number of algorithms with a similar syntax. For example, to apply GBM,
random forest, neural network and logistic regression:
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm')
model_rf<-train(trainSet[,predictors],trainSet[,outcomeName],method='rf')
model_nnet<-train(trainSet[,predictors],trainSet[,outcomeName],method='nnet')
model_glm<-train(trainSet[,predictors],trainSet[,outcomeName],method='glm')

You can then proceed to tune the parameters in all of these algorithms using the parameter tuning
techniques described below.

6. Parameter tuning using Caret


It's extremely easy to tune parameters using Caret, and it is possible to customize almost every step of
the tuning process. By default, the resampling technique used for evaluating the performance of the model
with a given set of parameters is the bootstrap, but Caret provides alternatives such as k-fold, repeated
k-fold and leave-one-out cross-validation (LOOCV), which can be specified using trainControl(). In this
example, we'll be using 5-fold cross-validation repeated 5 times.
fitControl <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 5)
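For reference, the same trainControl() call can point at the other resampling schemes mentioned above. A small sketch (these lines are not part of the original code):

#Bootstrap (the default), plain k-fold cross-validation and leave-one-out cross-validation
fitControl_boot  <- trainControl(method = "boot", number = 25)
fitControl_cv    <- trainControl(method = "cv", number = 10)
fitControl_loocv <- trainControl(method = "LOOCV")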

If the search space for the parameters is not defined, Caret will try a small default grid of 3 values for each
tunable parameter and use the cross-validation results to find the best set of parameters for that algorithm.
Otherwise, there are two more ways to tune parameters:

6.1 Using tuneGrid
To find the parameters of a model that can be tuned, you can use:
modelLookup(model='gbm')

#  model         parameter                   label forReg forClass probModel
#1   gbm           n.trees   # Boosting Iterations   TRUE     TRUE      TRUE
#2   gbm interaction.depth          Max Tree Depth   TRUE     TRUE      TRUE
#3   gbm         shrinkage               Shrinkage   TRUE     TRUE      TRUE
#4   gbm    n.minobsinnode Min. Terminal Node Size   TRUE     TRUE      TRUE
#using grid search

#Creating grid
grid <- expand.grid(n.trees = c(10,20,50,100,500,1000),
                    shrinkage = c(0.01,0.05,0.1,0.5),
                    n.minobsinnode = c(3,5,10),
                    interaction.depth = c(1,5,10))

# training the model
model_gbm <- train(trainSet[,predictors], trainSet[,outcomeName], method='gbm',
                   trControl=fitControl, tuneGrid=grid)

# summarizing the model


print(model_gbm)

#Stochastic Gradient Boosting


#461 samples
#5 predictor
#2 classes: '0', '1'

#No pre-processing
#Resampling: Cross-Validated (5 fold, repeated 5 times)
#Summary of sample sizes: 368, 370, 369, 369, 368, 369, ...
#Resampling results across tuning parameters:

# shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa


#0.01 1 3 10 0.6876416 0.0000000
#0.01 1 3 20 0.6876416 0.0000000
#0.01 1 3 50 0.7982345 0.4423609
#0.01 1 3 100 0.7952190 0.4364383
#0.01 1 3 500 0.7904882 0.4342300
#0.01 1 3 1000 0.7913627 0.4421230
#0.01 1 5 10 0.6876416 0.0000000
#0.01 1 5 20 0.6876416 0.0000000
#0.01 1 5 50 0.7982345 0.4423609
#0.01 1 5 100 0.7943635 0.4351912
#0.01 1 5 500 0.7930783 0.4411348
#0.01 1 5 1000 0.7913720 0.4417463
#0.01 1 10 10 0.6876416 0.0000000
#0.01 1 10 20 0.6876416 0.0000000
#0.01 1 10 50 0.7982345 0.4423609
#0.01 1 10 100 0.7943635 0.4351912
#0.01 1 10 500 0.7939525 0.4426503
#0.01 1 10 1000 0.7948362 0.4476742
#0.01 5 3 10 0.6876416 0.0000000
#0.01 5 3 20 0.6876416 0.0000000
#0.01 5 3 50 0.7960556 0.4349571
#0.01 5 3 100 0.7934987 0.4345481
#0.01 5 3 500 0.7775055 0.4147204
#...
#0.50 5 10 100 0.7045617 0.2834696
#0.50 5 10 500 0.6924480 0.2650477
#0.50 5 10 1000 0.7115234 0.3050953
#0.50 10 3 10 0.7389117 0.3681917
#0.50 10 3 20 0.7228519 0.3317001
#0.50 10 3 50 0.7180833 0.3159445
#0.50 10 3 100 0.7172417 0.3189655
#0.50 10 3 500 0.7058472 0.3098146
#0.50 10 3 1000 0.7001852 0.2967784
#0.50 10 5 10 0.7266895 0.3378430
#0.50 10 5 20 0.7154746 0.3197905
#0.50 10 5 50 0.7063535 0.2984819
#0.50 10 5 100 0.7151012 0.3141440
#0.50 10 5 500 0.7108516 0.3146822
#0.50 10 5 1000 0.7147320 0.3225373
#0.50 10 10 10 0.7314871 0.3327504
#0.50 10 10 20 0.7150814 0.3081869
#0.50 10 10 50 0.6993723 0.2815981
#0.50 10 10 100 0.6977416 0.2719140
#0.50 10 10 500 0.7037864 0.2854748
#0.50 10 10 1000 0.6995610 0.2869718
#Accuracy was used to select the optimal model using the largest value.
#The final values used for the model were n.trees = 10, interaction.depth = 1, shrinkage = 0.05
#and n.minobsinnode = 3.
plot(model_gbm)

Thus, for all the parameter combinations that you listed in expand.grid(), a model will be created and
tested using cross-validation. The set of parameters with the best cross-validation performance will be
used to create the final model which you get at the end.
6.2 Using tuneLength
Instead of specifying exact values for each tuning parameter, we can simply ask Caret to try a given
number of possible values for each tuning parameter through tuneLength. Let's try an example with
tuneLength=10.
#using tune length
model_gbm <- train(trainSet[,predictors], trainSet[,outcomeName], method='gbm',
                   trControl=fitControl, tuneLength=10)
print(model_gbm)

#Stochastic Gradient Boosting


#461 samples
#5 predictor
#2 classes: '0', '1'

#No pre-processing
#Resampling: Cross-Validated (5 fold, repeated 5 times)
#Summary of sample sizes: 368, 369, 369, 370, 368, 369, ...
#Resampling results across tuning parameters:

# interaction.depth n.trees Accuracy Kappa


#1 50 0.7978084 0.4541008
#1 100 0.7978177 0.4566764
#1 150 0.7934792 0.4472347
#1 200 0.7904310 0.4424091
#1 250 0.7869714 0.4342797
#1 300 0.7830488 0.4262414
...
#10 100 0.7575230 0.3860319
#10 150 0.7479757 0.3719707
#10 200 0.7397290 0.3566972
#10 250 0.7397285 0.3561990
#10 300 0.7362552 0.3513413
#10 350 0.7340812 0.3453415
#10 400 0.7336416 0.3453117
#10 450 0.7306027 0.3415153
#10 500 0.7253854 0.3295929

#Tuning parameter 'shrinkage' was held constant at a value of 0.1
#Tuning parameter 'n.minobsinnode' was held constant at a value of 10
#Accuracy was used to select the optimal model using the largest value.
#The final values used for the model were n.trees = 50, interaction.depth = 2, shrinkage = 0.1
#and n.minobsinnode = 10.
plot(model_gbm)
Here, it keeps the shrinkage and n.minobsinnode parameters constant while varying n.trees and
interaction.depth over 10 values, and uses the best combination to train the final model.

7. Variable importance estimation using caret


Caret also makes variable importance estimates accessible via varImp() for any model.
Let's have a look at the variable importance for all four models that we created:
#Checking variable importance for GBM

#Variable Importance
varImp(object=model_gbm)
#gbm variable importance
#Overall
#Credit_History 100.000
#LoanAmount 16.633
#ApplicantIncome 7.104
#CoapplicantIncome 6.773
#Loan_Amount_Term 0.000

#Plotting variable importance for GBM


plot(varImp(object=model_gbm),main="GBM - Variable Importance")

#Checking variable importance for RF


varImp(object=model_rf)
#rf variable importance

#Overall
#Credit_History 100.00
#ApplicantIncome 73.46
#LoanAmount 60.59
#CoapplicantIncome 40.43
#Loan_Amount_Term 0.00

#Plotting variable importance for Random Forest


plot(varImp(object=model_rf),main="RF - Variable Importance")
#Checking variable importance for NNET
varImp(object=model_nnet)
#nnet variable importance

#Overall
#ApplicantIncome 100.00
#LoanAmount 82.87
#CoapplicantIncome 56.92
#Credit_History 41.11
#Loan_Amount_Term 0.00

#Plotting Variable importance for Neural Network


plot(varImp(object=model_nnet),main="NNET - Variable Importance")
#Checking variable importance for GLM
varImp(object=model_glm)
#glm variable importance

#Overall
#Credit_History 100.000
#CoapplicantIncome 17.218
#Loan_Amount_Term 12.988
#LoanAmount 5.632
#ApplicantIncome 0.000

#Plotting Variable importance for GLM


plot(varImp(object=model_glm),main="GLM - Variable Importance")
Clearly, the variable importance estimates of different models differ and can thus be used to get a
more holistic view of the importance of each predictor. Two main uses of variable importance from various
models are:
Predictors that are important for the majority of models represent genuinely important
predictors.
For ensembling, we should use predictions from models that have significantly different
variable importance, as their predictions are also expected to be different. One thing we must
make sure of, though, is that all of them are sufficiently accurate.

8. Predictions using Caret


For predicting the dependent variable on the testing set, Caret offers predict.train(). You need to
specify the model and the testing data. For classification problems, Caret also offers a type argument
which can be set to either "prob" or "raw". With type="raw", the predictions are just the outcome
classes for the testing data, while type="prob" gives the class probabilities for each observation.
Let's take a look at the predictions from our GBM model:
#Predictions
predictions<-predict.train(object=model_gbm,testSet[,predictors],type="raw")
table(predictions)
#predictions
#0 1
#28 125
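We can also request class probabilities instead of hard class labels, consistent with the type="prob" option described above (these lines are an added illustration, not part of the original code):

#Class probabilities for each observation in the test set
predictions_prob<-predict.train(object=model_gbm,testSet[,predictors],type="prob")
head(predictions_prob)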

Caret also provides a confusionMatrix function which will give the confusion matrix along with
various other metrics for your predictions. Here is the performance analysis of our GBM model:
confusionMatrix(predictions,testSet[,outcomeName])
#Confusion Matrix and Statistics
#Reference
#Prediction 0 1
#0 25 3
#1 23 102

#Accuracy : 0.8301
#95% CI : (0.761, 0.8859)
#No Information Rate : 0.6863
#P-Value [Acc > NIR] : 4.049e-05
#Kappa : 0.555
#Mcnemar's Test P-Value : 0.0001944
#Sensitivity : 0.5208
#Specificity : 0.9714
#Pos Pred Value : 0.8929
#Neg Pred Value : 0.8160
#Prevalence : 0.3137
#Detection Rate : 0.1634
#Detection Prevalence : 0.1830
#Balanced Accuracy : 0.7461
#'Positive' Class : 0

Additional Resources
Caret Package Homepage
Caret Package on CRAN
Caret Package Manual (PDF, all the functions)
A Short Introduction to the caret Package (PDF)
Open source project on GitHub (source code)
Here is a webinar by the creator of the Caret package himself.

End Notes
Caret is one of the most powerful and useful packages ever made in R. It alone has the capability to
fulfill all the needs of predictive modeling, from preprocessing to interpretation. Additionally, its
syntax is also very easy to use. If you use R, I encourage you to use Caret.
Caret is a very comprehensive package, and instead of covering all the functionalities that it offers, I
thought it would be a better idea to show an end-to-end implementation of Caret on a real hackathon
dataset. I have tried to cover as many functions in Caret as I could, but Caret has a lot more to offer.
To go more in depth, you might find the resources mentioned above very useful. Several of these
resources have been written by Max Kuhn (the creator of the caret package) himself.

Introduction to Feature Selection methods with an example (or how to select the right variables?)
Introduction
One of the best ways I use to learn machine learning is by benchmarking myself against the best data
scientists in competitions. It gives you a lot of insight into how you perform against the best on a level
playing field.
Initially, I used to believe that machine learning is going to be all about algorithms: know which one
to apply when, and you will come out on top. When I got there, I realized that was not the case: the
winners were using the same algorithms that a lot of other people were using.
Next, I thought surely these people would have better / superior machines. I discovered that is not the
case either. I saw competitions being won using a MacBook Air, which is not the best computational
machine. Over time, I realized that there are 2 things which distinguish winners from others in most
cases: Feature Creation and Feature Selection.
In other words, it boils down to creating variables which capture hidden business insights and then
making the right choices about which variables to use in your predictive models! Sadly or
thankfully, both of these skills require a ton of practice. There is also some art involved in creating new
features: some people have a knack for finding trends where other people struggle.
In this article, I will focus on one of the 2 critical parts of getting your models right: feature selection.
I will discuss in detail why feature selection plays such a vital role in creating an effective predictive
model.
Read on!

Table of Contents
1. Importance of Feature Selection
2. Filter Methods
3. Wrapper Methods
4. Embedded Methods
5. Difference between Filter and Wrapper methods
6. Walkthrough example

1. Importance of Feature Selection


Machine learning works on a simple rule: if you put garbage in, you will only get garbage out. By
garbage here, I mean noise in the data.
This becomes even more important when the number of features is very large. You need not use every
feature at your disposal for creating an algorithm. You can assist your algorithm by feeding in only
those features that are really important. I have myself witnessed feature subsets giving better results
than the complete set of features for the same algorithm. Or as Rohan Rao puts it: "Sometimes, less is
better!"
Not only in competitions, this can be very useful in industrial applications as well. You not only
reduce the training time and the evaluation time, you also have fewer things to worry about!
Top reasons to use feature selection are:
It enables the machine learning algorithm to train faster.
It reduces the complexity of a model and makes it easier to interpret.
It improves the accuracy of a model if the right subset is chosen.
It reduces overfitting.

Next, we'll discuss various methodologies and techniques that you can use to subset your feature space
and help your models perform better and more efficiently. So, let's get started.

2. Filter Methods

Filter methods are generally used as a preprocessing step. The selection of features is independent of
any machine learning algorithm. Instead, features are selected on the basis of their scores in various
statistical tests of their correlation with the outcome variable. Correlation is a subjective term here;
which measure is appropriate depends on whether the feature and the response are continuous or
categorical. Some commonly used measures are:
Pearson's Correlation: It is used as a measure for quantifying the linear dependence between two
continuous variables X and Y. Its value varies from -1 to +1. Pearson's correlation is given as
r = cov(X, Y) / (σ_X σ_Y).

LDA: Linear discriminant analysis is used to find a linear combination of features that
characterizes or separates two or more classes (or levels) of a categorical variable.
ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the fact that it
is operated using one or more categorical independent features and one continuous dependent
feature. It provides a statistical test of whether the means of several groups are equal or not.
Chi-Square: It is a statistical test applied to groups of categorical features to evaluate
the likelihood of correlation or association between them using their frequency distribution.
One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you
must deal with multicollinearity of features as well before training models for your data.
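As a quick illustration of filter-style scoring, here is a minimal sketch using base R and the built-in iris and mtcars datasets (this example is not from the original article):

#Pearson correlation between two continuous variables
cor(iris$Sepal.Length, iris$Petal.Length, method = "pearson")

#ANOVA: does a continuous feature differ across the levels of a categorical variable?
summary(aov(Sepal.Length ~ Species, data = iris))

#Chi-square test of association between two categorical variables
chisq.test(table(mtcars$cyl, mtcars$gear))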

3. Wrapper Methods

In wrapper methods, we try to use a subset of features and train a model using them. Based on the
inferences that we draw from the previous model, we decide to add or remove features from the
subset. The problem is essentially reduced to a search problem. These methods are usually
computationally very expensive.
Some common examples of wrapper methods are forward feature selection, backward feature
elimination, recursive feature elimination, etc.
Forward Selection: Forward selection is an iterative method in which we start with no
features in the model. In each iteration, we keep adding the feature which best improves our
model, until the addition of a new variable no longer improves the performance of the model
(a short sketch follows this list).
Backward Elimination: In backward elimination, we start with all the features and remove
the least significant feature at each iteration, which improves the performance of the model. We
repeat this until no improvement is observed on removal of features.
Recursive Feature Elimination: It is a greedy optimization algorithm which aims to find the
best performing feature subset. It repeatedly creates models and keeps aside the best or the
worst performing feature at each iteration. It constructs the next model with the remaining features
until all the features are exhausted. It then ranks the features based on the order of their
elimination.
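As promised above, here is a rough sketch of forward selection using step() from base R on the built-in mtcars data (this is an added illustration, not from the original article; step() selects features greedily by AIC rather than by cross-validated performance):

#Forward selection by AIC: start from an intercept-only model and add one predictor at a time
null_model <- lm(mpg ~ 1, data = mtcars)
full_model <- lm(mpg ~ ., data = mtcars)
forward_fit <- step(null_model, scope = formula(full_model), direction = "forward", trace = 0)
formula(forward_fit)  #the selected subset of predictors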
One of the best ways to implement feature selection with wrapper methods is to use the Boruta
package, which finds the importance of a feature by creating shadow features.
It works in the following steps:
1. Firstly, it adds randomness to the given data set by creating shuffled copies of all features
(which are called shadow features).
2. Then, it trains a random forest classifier on the extended data set and applies a feature
importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of
each feature where higher means more important.
3. At every iteration, it checks whether a real feature has a higher importance than the best of its
shadow features (i.e. whether the feature has a higher Z-score than the maximum Z-score of its
shadow features) and constantly removes features which are deemed highly unimportant.
4. Finally, the algorithm stops either when all features get confirmed or rejected or it reaches a
specified limit of random forest runs.
For more information on the implementation of the Boruta package, you can refer to this article.
For the implementation of Boruta in Python, you can refer to this article.
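For reference, here is a minimal usage sketch of the Boruta workflow described above, assuming a data frame train with an outcome column Y (as in the walkthrough example later in this article); these lines are not part of the original post:

library('Boruta')
set.seed(101)

#Run Boruta on all predictors; shadow features are created internally
boruta_out <- Boruta(Y ~ ., data = train, doTrace = 0)

#Resolve features left as "Tentative" and list the confirmed ones
boruta_final <- TentativeRoughFix(boruta_out)
getSelectedAttributes(boruta_final, withTentative = FALSE)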

4. Embedded Methods

Embedded methods combine the qualities of filter and wrapper methods. They are implemented by
algorithms that have their own built-in feature selection methods.
Some of the most popular examples of these methods are LASSO and ridge regression, which have
built-in penalization functions to reduce overfitting.
Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute value of
the magnitude of the coefficients.
Ridge regression performs L2 regularization, which adds a penalty equivalent to the square of the
magnitude of the coefficients.
For more details and implementation of LASSO and RIDGE regression, you can refer to this article.
Other examples of embedded methods are Regularized trees, Memetic algorithm, Random multinomial
logit.
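As a rough sketch of how the LASSO/ridge variant looks in practice, using the glmnet package on the built-in mtcars data (this example is not from the original article), the alpha argument switches between the two penalties:

library('glmnet')

x <- as.matrix(mtcars[, -1])  #predictors
y <- mtcars$mpg               #response

#alpha = 1 -> LASSO (L1 penalty), alpha = 0 -> ridge (L2 penalty)
lasso_fit <- cv.glmnet(x, y, alpha = 1)
ridge_fit <- cv.glmnet(x, y, alpha = 0)

#Coefficients at the cross-validated lambda; LASSO shrinks some of them exactly to zero,
#effectively selecting features
coef(lasso_fit, s = "lambda.min")
coef(ridge_fit, s = "lambda.min")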

5. Difference between Filter and Wrapper methods


The main differences between the filter and wrapper methods for feature selection are:
Filter methods measure the relevance of features by their correlation with the dependent variable,
while wrapper methods measure the usefulness of a subset of features by actually training a
model on it.
Filter methods are much faster than wrapper methods as they do not involve training models;
wrapper methods are computationally very expensive.
Filter methods use statistical tests to evaluate a subset of features, while wrapper
methods use cross-validation.
Filter methods might fail to find the best subset of features on many occasions, whereas wrapper
methods are more likely to find it.
Using the subset of features from the wrapper methods makes the model more prone to
overfitting, compared to using the subset of features from the filter methods.

6. Walkthrough example
Let's use wrapper methods for feature selection and see whether we can improve the accuracy of our
model by using an intelligently selected subset of features instead of using every feature at our
disposal.
We'll be using stock prediction data in which we'll predict whether the stock will go up or down based
on 100 predictors in R. This dataset contains 100 independent variables, X1 to X100, representing the
profile of a stock, and one outcome variable Y with two levels: 1 for a rise in the stock price and -1 for a
drop in the stock price.
To download the dataset, click here.
Let's start by applying a random forest using all the features in the dataset.
library('Metrics')
library('randomForest')
library('ggplot2')
library('ggthemes')
library('dplyr')
#set random seed
set.seed(101)
#loading dataset
data<-read.csv("train.csv",stringsAsFactors= T)
#checking dimensions of data
dim(data)
## [1] 3000 101
#specifying outcome variable as factor

data$Y<-as.factor(data$Y)
data$Time<-NULL
#dividing the dataset into train and test
train<-data[1:2000,]
test<-data[2001:3000,]
#applying Random Forest
model_rf<-randomForest(Y ~ ., data = train)

preds<-predict(model_rf,test[,-101])

table(preds)
##preds
## -1 1
##453 547
#checking AUC
auc(preds,test$Y)
##[1] 0.4522703

Now, instead of trying a large number of possible subsets through, say, forward selection or backward
elimination, we'll keep it simple and use only the top 20 features to build a random forest. Let's find
out if that can improve the performance of our model.
Let's look at the feature importance:
importance(model_rf)
#MeanDecreaseGini
##x1 8.815363
##x2 10.920485
##x3 9.607715
##x4 10.308006
##x5 9.645401
##x6 11.409772
##x7 10.896794
##x8 9.694667
##x9 9.636996
##x10 8.609218

##x87 8.730480
##x88 9.734735
##x89 10.884997
##x90 10.684744
##x91 9.496665
##x92 9.978600
##x93 10.479482
##x94 9.922332
##x95 8.640581
##x96 9.368352
##x97 7.014134
##x98 10.640761
##x99 8.837624
##x100 9.914497

Applying a random forest using only the 20 most important features:


model_rf<-randomForest(Y ~ X55+X11+X15+X64+X30
+X37+X58+X2+X7+X89
+X31+X66+X40+X12+X90
+X29+X98+X24+X75+X56,
data = train)

preds<-predict(model_rf,test[,-101])

table(preds)
##preds
##-1 1
##218 782
#checking AUC

auc(preds,test$Y)
##[1] 0.4767592

So, by using just the 20 most important features, we have improved the AUC from about 0.452 to 0.477. This
is just an example of how feature selection makes a difference. Not only have we improved the
score, but by using just 20 predictors instead of 100, we have also:
increased the interpretability of the model.
reduced the complexity of the model.
reduced the training time of the model.
End Notes
I believe that this article has given you a good idea of how you can perform feature selection to get the
best out of your models. These are the broad categories of techniques commonly used for feature selection,
and I hope you are now convinced of the potential uplift you can unlock in your models with feature
selection, along with its added benefits.

Simulating queueing systems with simmer


We are very pleased to announce that a new release of simmer, the Discrete-Event Simulator for R, is
on CRAN. There are quite a few changes and fixes, with the support of preemption as a star new
feature. Check out the complete set of release notes here.
Let's simmer for a bit and see how this package can be used to simulate queueing systems in a very
straightforward way.

The M/M/1 system


In Kendall's notation, an M/M/1 system has exponential arrivals (the first M), a single server (the 1)
with exponential service time (the second M) and an infinite queue (the implicit M/M/1/(infty)). For instance,
people arriving at an ATM at rate (lambda), waiting their turn in the street and withdrawing money at
rate (mu).
Let us remember the basic parameters of this system: the server utilization (rho = lambda/mu), the
average number of customers in the system (N) and the average time spent in the system (T), which are
well defined whenever (rho < 1). If that is not true, it means that the system is unstable: there are more
arrivals than the server is capable of handling, and the queue will grow indefinitely.
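For reference, the textbook M/M/1 expressions for these quantities, reconstructed here (the original post rendered them as formulas; they match the values computed below in mm1.N and mm1.T), are:

rho = lambda / mu
N = rho / (1 - rho)
T = N / lambda = 1 / (mu - lambda)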
The simulation of an M/M/1 system is quite simple using simmer. The trajectory-based design,
combined with magrittr's pipe, is very verbal and self-explanatory.
library(simmer)
set.seed(1234)

lambda <- 2
mu <- 4
rho <- lambda/mu # = 2/4

mm1.trajectory <- create_trajectory() %>%
  seize("resource", amount=1) %>%
  timeout(function() rexp(1, mu)) %>%
  release("resource", amount=1)

mm1.env <- simmer() %>%
  add_resource("resource", capacity=1, queue_size=Inf) %>%
  add_generator("arrival", mm1.trajectory, function() rexp(1, lambda)) %>%
  run(until=2000)
Our package provides convenience plotting functions to quickly visualise the usage of a resource over
time, for instance. Down below, we can see how the simulation converges to the theoretical average
number of customers in the system.
library(ggplot2)

# Evolution of the average number of customers in the system


graph <- plot_resource_usage(mm1.env, "resource", items="system")

# Theoretical value
mm1.N <- rho/(1-rho)
graph + geom_hline(yintercept=mm1.N)

It is also possible to visualise, for instance, the instantaneous usage of individual elements by playing
with the parameters items and steps.
plot_resource_usage(mm1.env, "resource", items=c("queue", "server"), steps=TRUE) +
xlim(0, 20) + ylim(0, 4)
We may obtain the time spent by each customer in the system and compare the average with the
theoretical expression.
mm1.arrivals <- get_mon_arrivals(mm1.env)
mm1.t_system <- mm1.arrivals$end_time - mm1.arrivals$start_time

mm1.T <- mm1.N / lambda


mm1.T ; mean(mm1.t_system)

## [1] 0.5

## [1] 0.5012594

It seems to match the theoretical value pretty well. But of course we are picky, so let's take a
closer look, just to be sure (and to learn more about simmer, why not). Replication can be done with
standard R tools:
library(parallel)

envs <- mclapply(1:1000, function(i) {
  simmer() %>%
    add_resource("resource", capacity=1, queue_size=Inf) %>%
    add_generator("arrival", mm1.trajectory, function() rexp(1, lambda)) %>%
    run(1000/lambda) %>%
    wrap()
})

Et voilà! Parallelizing has the shortcoming that we lose the underlying C++ objects when each thread
finishes, but the wrap() function does all the magic of retrieving the monitored data for us. Let's perform a
simple test:
library(dplyr)
t_system <- get_mon_arrivals(envs) %>%
mutate(t_system = end_time - start_time) %>%
group_by(replication) %>%
summarise(mean = mean(t_system))

t.test(t_system$mean)

##
## One Sample t-test
##
## data: t_system$mean
## t = 344.14, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.4953154 0.5009966
## sample estimates:
## mean of x
## 0.498156

Good news: the simulator works. Finally, an M/M/1 satisfies that the distribution of the time spent in
the system is, in turn, an exponential random variable with average (T).
qqplot(mm1.t_system, rexp(length(mm1.t_system), 1/mm1.T))
abline(0, 1, lty=2, col="red")

M/M/c/k systems
An M/M/c/k system keeps exponential arrivals and service times, but has more than one server in
general and a finite queue, which is often more realistic. For instance, a router may have several
processors to handle packets, and the in/out queues are necessarily finite.
This is the simulation of an M/M/2/3 system (2 servers, 1 position in the queue). Note that the trajectory is
identical to the M/M/1 case.
lambda <- 2
mu <- 4

mm23.trajectory <- create_trajectory() %>%
  seize("server", amount=1) %>%
  timeout(function() rexp(1, mu)) %>%
  release("server", amount=1)

mm23.env <- simmer() %>%
  add_resource("server", capacity=2, queue_size=1) %>%
  add_generator("arrival", mm23.trajectory, function() rexp(1, lambda)) %>%
  run(until=2000)

In this case, there are rejections when the queue is full.


mm23.arrivals <- get_mon_arrivals(mm23.env)
mm23.arrivals %>%
summarise(rejection_rate = sum(!finished)/length(finished))

## rejection_rate
## 1 0.02065614

Despite this, the time spent in the system still follows an exponential random variable, as in the M/M/1
case, but the average has dropped.
mm23.t_system <- mm23.arrivals$end_time - mm23.arrivals$start_time
# Comparison with M/M/1 times
qqplot(mm1.t_system, mm23.t_system)
abline(0, 1, lty=2, col="red")
How to create Beautiful, Interactive data visualizations using Plotly in R and Python?

Introduction
The greatest value of a picture is when it forces us to notice what we never expected to see.

John Tukey

Data visualization is an art as well as a science. It takes constant practice and effort to master the art of
data visualization. I always keep exploring how to make my visualizations more interesting and
informative. My main tool for creating these data visualizations had been ggplot2. When I started using
ggplot2, I was amazed by its power. I felt like I was now an evolved storyteller.
Then I realized that it is difficult to make interactive charts using ggplot2. So, if you want to show
something in 3 dimensions, you cannot look at it from various angles. So my exploration started again.
One of the best alternatives I found after spending hours was to learn D3.js. D3.js is a must-know
language if you really wish to excel in data visualization. Here, you can find a great resource to realize
the power of D3.js and master it.
But I realized that D3.js is not as popular in the data science community as it should be, probably
because it requires a different skill set (e.g. HTML, CSS and JavaScript).
Today, I am going to tell you something which will change the way you perform data visualizations in
the language / tool of your choice (R, Python, MATLAB, Perl, Julia, Arduino).

Table of Contents
1. What is Plotly?
2. Advantages and Disadvantages of Plotly
3. Steps for using Plotly
4. Setting up Data
5. Basic Visualizations
Bar Charts
Box Plots
Scatter Plots
Time Series Plots
6. Advanced Visualizations
Heat Maps
3D Scatter Plots
3D Surfaces
7. Using plotly with ggplot2
8. Different versions of Plotly

1. What is Plotly?
Plotly is one of the finest data visualization tools available built on top of visualization library D3.js,
HTML and CSS. It is created using Python and the Django framework. One can choose to create
interactive data visualizations online or use the libraries that plotly offers to create these visualizations
in the language/ tool of choice. It is compatible with a number of languages/ tools: R, Python,
MATLAB, Perl, Julia, Arduino.

2. Advantages and Disadvantages of Plotly.


Let's have a look at some of the advantages and disadvantages of Plotly:
Advantages:
It lets you create interactive visualizations built using D3.js without even having to know D3.js.
It provides compatibility with a number of different languages/tools: R, Python, MATLAB,
Perl, Julia, Arduino.
Using plotly, interactive plots can easily be shared online with multiple people.
Plotly can also be used by people with no technical background to create interactive plots by
uploading the data and using the plotly GUI.
Plotly is compatible with ggplot2 in R and Python.
It allows you to embed interactive plots in projects or websites using iframes or HTML.
The syntax for creating interactive plots using plotly is very simple as well.

Disadvantages:
The plots made using the plotly community version are always public and can be viewed by
anyone.
For the plotly community version, there is an upper limit on the API calls per day.
There is also a limited number of color palettes available in the community version, which acts as
an upper bound on the coloring options.

3. Steps for creating plots in Plotly.


Data visualization is an art with no hard and fast rules.
One simply should do what it takes to convey the message to the audience. Here is a series of typical
steps for creating interactive plots using plotly:
1. Getting the data to be used for creating the visualization and preprocessing it to convert it into the
desired format.
2. Calling the plotly API in the language/tool of your choice.
3. Creating the plot by specifying objectives like the data that is to be represented on each axis of
the plot, the most appropriate plot type (like histogram, boxplot, 3D surface), the color of data points
or lines in the plot, and other features. Here's a generalized format for basic plotting in R and
Python:
In R:
plot_ly(x, y, type, mode, color, size)

In Python:
plotly.plotly.iplot([plotly.graph_objs.type(x, y, mode, marker = dict(color, size))])

Where:
x = values for the x-axis
y = values for the y-axis
type = the kind of plot you want to create, like histogram, surface, box, etc.
mode = the format in which you want the data to be represented in the plot; possible values are
markers, lines, points.
color = values of the same length as x, y and z that represent the color of the data points or lines
in the plot.
size = values of the same length as x, y and z that represent the size of the data points or lines in
the plot.
4. Adding the layout fields like the plot title, axis titles/labels, axis title/label fonts, etc.
In R:
layout(plot, title, xaxis = list(title, titlefont), yaxis = list(title, titlefont))
In Python:
plotly.plotly.iplot(plot, plotly.graph_objs.Layout(title, xaxis = dict(title, titlefont), yaxis = dict(title, titlefont)))

Where
plot = the plotly object to be displayed
title = string containing the title of the plot
xaxis : title = title/ label for x-axis
xaxis : titlefont = font for title/ label of x-axis
yaxis : title = title/ label for y-axis
yaxis : titlefont = font for title/ label of y-axis

5. Plotly also allows you to share the plots with someone else in various formats. For this, one
needs to sign in to a plotly account. For sharing your plots you'll need the following credentials:
your username and your unique API key. Sharing the plots can be done as follows:
In R
Sys.setenv("plotly_username"="XXXX")
Sys.setenv("plotly_api_key"="YYYY")

#To post the plots online


plotly_POST(x = Plotting_Object)

#To plot the last plot you created, simply use this.
plotly_POST(x = last_plot(), filename = "XYZ")

In Python
#Setting plotly credentials
plotly.tools.set_credentials_file(username='XXXX', api_key='YYYY')

#To post plots online


plotly.plotly.plot(Plotting_Object)

Since R and Python are two of the most popular languages among data scientists, I'll be focusing on
creating interactive visualizations using these two languages.

4. Setting up Data
For performing a wide range of interactive data visualizations, I'll be using some publicly
available datasets. You can follow the code below to get the datasets that I'll be using during the
course of this article:

4.1 Iris Data


In R
#Loading iris dataset
data(iris)

#Structure of Iris dataset


str(iris)

## 'data.frame': 150 obs. of 5 variables:


## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1
1 ...

In Python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data)

iris_df.columns = ['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']
iris_df.columns

iris_df['Species'] = iris.target
iris_df['Species'] = iris_df['Species'].astype('category')
iris_df.dtypes

#Sepal.Length float64
#Sepal.Width float64
#Petal.Length float64
#Petal.Width float64
#Species category

#dtype: object
iris_df['Species'].replace(0,'setosa',inplace=True)
iris_df['Species'].replace(1,'versicolor',inplace=True)
iris_df['Species'].replace(2,'virginica',inplace=True)

4.2 International Airline Passengers Dataset


In R:
#Loading the data
data(AirPassengers)

#Structure of International Airline Passengers Time series Dataset


str(AirPassengers)
#Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...

In Python
You can get the international airline passengers dataset here.
#Loading the data
airline_data = pd.read_csv('international-airline-passengers.csv')
4.3 Volcano Dataset
In R
#Loading the data
data(volcano)

#Checking dimensions
dim(volcano)
## [1] 87 61

In Python
You can get the volcano dataset here.
#Loading the data
volcano_data = pd.read_csv('volcano.csv')

5. Basic Visualization
To get a good understanding of when you should use which plot, I recommend that you check out this
resource. Feel free to play around and explore these plots further. Here are a few things that you can try
in the coming plots:
hovering your mouse over the plot to view associated attributes
selecting a particular region on the plot using your mouse to zoom
resetting the axis
rotating the 3D images
5.1 Histograms

You can view the interactive plot here.

In R
library('plotly')

#attaching the variables


attach(iris)

#plotting a histogram with Sepal.Length variable and storing it in hist


hist<-plot_ly(x=Sepal.Length,type='histogram')

#defining labels and title using layout()


layout(hist,title = "Iris Dataset - Sepal.Length",
xaxis = list(title = "Sepal.Length"),
yaxis = list(title = "Count"))

In Python
import plotly.plotly as py
import plotly.graph_objs as go

data = [go.Histogram(x=iris.data[:,0])]

layout = go.Layout(
title='Iris Dataset - Sepal.Length',
xaxis=dict(title='Sepal.Length'),
yaxis=dict(title='Count')
)

fig = go.Figure(data=data, layout=layout)


py.iplot(fig)
5.2 Bar Charts

You can view the interactive plot here.

In R
#plotting a bar chart of the Species variable and storing it in bar_chart
bar_chart<-plot_ly(x=Species,type='histogram')

#defining labels and title using layout()


layout(bar_chart,title = "Iris Dataset - Species",
xaxis = list(title = "Species"),
yaxis = list(title = "Count"))

In Python
data = [go.Bar(x=['setosa','versicolor','virginica'],
               y=[iris_df.loc[iris_df['Species']=='setosa'].shape[0],
                  iris_df.loc[iris_df['Species']=='versicolor'].shape[0],
                  iris_df.loc[iris_df['Species']=='virginica'].shape[0]])]

layout = go.Layout(title='Iris Dataset - Species',


xaxis=dict(title='Iris Dataset - Species'),
yaxis=dict(title='Count')
)

fig = go.Figure(data=data, layout=layout)


py.iplot(fig)
5.3 Box Plots

You can view the interactive plot here.

In R
#plotting a Boxplot with Sepal.Length variable and storing it in box_plot
box_plot<-plot_ly(y=Sepal.Length,type='box',color=Species)

#defining labels and title using layout()


layout(box_plot,title = "Iris Dataset - Sepal.Length Boxplot",
yaxis = list(title = "Sepal.Length"))

In Python
data = [go.Box(y=iris_df.loc[iris_df["Species"]=='setosa','Sepal.Length'],name='Setosa'),
        go.Box(y=iris_df.loc[iris_df["Species"]=='versicolor','Sepal.Length'],name='Versicolor'),
        go.Box(y=iris_df.loc[iris_df["Species"]=='virginica','Sepal.Length'],name='Virginica')]

layout = go.Layout(title='Iris Dataset - Sepal.Length Boxplot',


yaxis=dict(title='Sepal.Length'))

fig = go.Figure(data=data, layout=layout)


py.iplot(fig)

5.4 Scatter Plots


Lets start with a simple scatter plot using iris dataset.
You can view the interactive plot here.

In R
#plotting a scatter plot with the Sepal.Length and Sepal.Width variables and storing it in scatter_plot1
scatter_plot1<-plot_ly(x=Sepal.Length,y=Sepal.Width,type='scatter',mode='markers')

#defining labels and title using layout()


layout(scatter_plot1,title = "Iris Dataset - Sepal.Length vs Sepal.Width",
xaxis = list(title = "Sepal.Length"),
yaxis = list(title = "Sepal.Width"))

In Python
data = [go.Scatter(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],mode =
'markers')]

layout = go.Layout(title='Iris Dataset - Sepal.Length vs Sepal.Width',


xaxis=dict(title='Sepal.Length'),
yaxis=dict(title='Sepal.Width'))

fig = go.Figure(data=data, layout=layout)

py.iplot(fig)

1. Let's go a step further and add another dimension (Species) using color.
You can view the interactive plot here.

In R
#plotting a scatter plot with the Sepal.Length and Sepal.Width variables, with color representing the Species, and storing it in scatter_plot2
scatter_plot2<-plot_ly(x=Sepal.Length,y=Sepal.Width,type='scatter',mode='markers',color = Species)

#defining labels and title using layout()


layout(scatter_plot2,title = "Iris Dataset - Sepal.Length vs Sepal.Width",
xaxis = list(title = "Sepal.Length"),
yaxis = list(title = "Sepal.Width"))

In Python
data = [go.Scatter(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],mode =
'markers', marker=dict(color = iris_df["Species"]))]

layout = go.Layout(title='Iris Dataset - Sepal.Length vs Sepal.Width',


xaxis=dict(title='Sepal.Length'),
yaxis=dict(title='Sepal.Width'))

fig = go.Figure(data=data, layout=layout)


py.iplot(fig)

2. We can add another dimension (Petal Length) to the plot by using the size of each data point in
the plot.
You can view the interactive plot here.
In R
#plotting a scatter plot with the Sepal.Length and Sepal.Width variables, with color representing the Species and size representing Petal.Length, and storing it in scatter_plot3
scatter_plot3<-plot_ly(x=Sepal.Length,y=Sepal.Width,type='scatter',mode='markers',color = Species,size=Petal.Length)

#defining labels and title using layout()


layout(scatter_plot3,title = "Iris Dataset - Sepal.Length vs Sepal.Width",
xaxis = list(title = "Sepal.Length"),
yaxis = list(title = "Sepal.Width"))

In Python
data = [go.Scatter(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],mode =
'markers', marker=dict(color = iris_df["Species"],size=iris_df["Petal.Length"]))]

layout = go.Layout(title='Iris Dataset - Sepal.Length vs Sepal.Width',


xaxis=dict(title='Sepal.Length'),
yaxis=dict(title='Sepal.Width'))

fig = go.Figure(data=data, layout=layout)


py.iplot(fig)
5.5 Time Series Plots

You can view the interactive plot here.

In R
#plotting a time series plot of the AirPassengers dataset and storing it in time_series
time_series<-plot_ly(x=time(AirPassengers),y=AirPassengers,type="scatter",mode="lines")

#defining labels and title using layout()
layout(time_series,title = "AirPassengers Dataset - Time Series Plot",
xaxis = list(title = "Time"),
yaxis = list(title = "Passengers"))

In Python
data = [go.Scatter(x=airline_data.iloc[:,0],y=airline_data.iloc[:,1])]
layout = go.Layout(
title='AirPassengers Dataset - Time Series Plot',
xaxis=dict(title='Time'),
yaxis=dict(title='Passengers'))

fig = go.Figure(data=data, layout=layout)


py.iplot(fig)

6. Advanced Visualization
Till now, we have got a grasp of how plotly can be beneficial for basic visualizations. Now let's shift
gears and see plotly in action for advanced visualizations.
6.1 Heat Maps

You can view the interactive plot here.

In R
plot_ly(z=~volcano,type="heatmap")

In Python
data = [go.Heatmap(z=volcano_data.values)]
fig = go.Figure(data=data)
py.iplot(fig)

6.2 3D Scatter Plots

You can view the interactive plot here.


In R
#Plotting the Iris dataset in 3D
plot_ly(x=Sepal.Length,y=Sepal.Width,z=Petal.Length,type="scatter3d",mode='markers',
        size=Petal.Width,color=Species)

In Python
data = [go.Scatter3d(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],z = iris_df["Petal.Length"],
                     mode = 'markers',
                     marker=dict(color = iris_df["Species"],size=iris_df["Petal.Width"]))]
fig = go.Figure(data=data)
py.iplot(fig)

6.3 3D Surfaces

You can view the interactive plot here.

In R
#Plotting the volcano 3D surface
plot_ly(z=~volcano,type="surface")

In Python
data = [go.Surface(z=volcano_data.values)]
fig = go.Figure(data=data)
py.iplot(fig)
7. Using plotly with ggplot2
ggplot2 is one of the best visualization libraries out there. The best part about plotly is that it can add
interactivity to ggplot2 objects through ggplotly(), which further enhances the plots. For learning more
about ggplot2, you can check out this resource.
Let's understand it better with an example in R.
#Loading required libraries
library('ggplot2')
library('ggmap')

#List of Countries for ICC T20 WC 2017


ICC_WC_T20 <- c("Australia",
"India",
"South Africa",
"New Zealand",
"Sri Lanka",
"England",
"Bangladesh",
"Pakistan",
"West Indies",
"Ireland",
"Zimbabwe",
"Afghanistan")

#extract geo location of these countries


countries <- geocode(ICC_WC_T20)

#map longitude and latitude in separate variables


nation.x <- countries$lon
nation.y <- countries$lat

#using ggplot to plot the world map


mapWorld <- borders("world", colour="grey", fill="lightblue")

#add data points to the world map


q<-ggplot() + mapWorld + geom_point(aes(x=nation.x, y=nation.y) ,color="red",
size=3)

#Using ggplotly() of plotly to add interactivity to ggplot objects.


ggplotly(q)
You can view the interactive plot here.

8. Different versions of Plotly.


Plotly offers four different versions, namely:
1. Community
2. Personal
3. Professional
4. On-Premise
Each of these versions is differentiated based on pricing and features. You can learn more about each
of the versions here. The community version is free to get started with and also provides decent capabilities.
But one major drawback of the community version is the inability to create private plots to share
online. If data security is a prominent concern for an individual or organisation, one of the Personal,
Professional or On-Premise versions should be chosen based on your needs. For the above examples, I
have used the community version.

End Notes
After going through this article, you should have a good grasp of how to create interactive plotly
visualizations in R as well as Python. I personally use plotly a lot and find it really useful. Combining
plotly with ggplot2 by using ggplotly() can give you the best visualizations in R or Python. But keep in
mind that plotly is not limited to R and Python only; there are a lot of other languages/tools that it supports
as well. I believe this article has inspired you to use plotly for your data visualization tasks.

Converting a factor to dummy variables and back in R
data(iris)
str(iris)
# OUTPUT:
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

### CONVERT THE FACTOR TO DUMMIES ###


library(caret)
dummies <- predict(dummyVars(~ Species, data = iris), newdata = iris)
head(dummies, n = 3)
# OUTPUT:
# Species.setosa Species.versicolor Species.virginica
# 1 1 0 0
# 2 1 0 0
# 3 1 0 0

### CONVERT DUMMIES TO THE FACTOR ###


header <- unlist(strsplit(colnames(dummies), '[.]'))[2 * (1:ncol(dummies))]
species <- factor(dummies %*% 1:ncol(dummies), labels = header)
str(species)
# OUTPUT:
# Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

### COMPARE THE ORIGINAL AND THE CALCULATED FACTORS ###


library(compare)
all.equal(species, iris$Species)
# OUTPUT:
# [1] TRUE

One data manipulation task that you need to do in pretty much any data analysis is recoding data. It's
almost never the case that the data are set up exactly the way you need them for your analysis.
In R, you can re-code an entire vector or array at once. To illustrate, let's set up a vector that has
missing values.
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
A
[1] 3 2 NA 5 3 7 NA NA 5 2 6

We can re-code all missing values by another number (such as zero) as follows:
A[ is.na(A) ] <- 0
A
[1] 3 2 0 5 3 7 0 0 5 2 6

Let's re-code all values less than 5 to the value 99.


A[ A < 5 ] <- 99
A
[1] 99 99 99 5 99 7 99 99 5 99 6

However, some re-coding tasks are more complex, particularly when you wish to re-code a categorical
variable or factor. In such cases, you might want to re-code an array with character elements to numeric
elements.
gender <- c("MALE","FEMALE","FEMALE","UNKNOWN","MALE")
gender
[1] "MALE" "FEMALE" "FEMALE" "UNKNOWN" "MALE"

Let's re-code males as 1 and females as 2. The following re-coding syntax is very useful because it
works in many practical situations; it involves repeated (nested) use of the ifelse() command.
ifelse(gender == "MALE", 1, ifelse(gender == "FEMALE", 2, 3))
[1] 1 2 2 3 1

The element with unknown gender was re-coded as 3. Make a note of this syntax: it's great for re-
coding within R programs, although for many levels a named lookup vector (shown below) can be tidier.
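For reference, here is an equivalent approach (my own illustration, not from the original text) that uses
a named lookup vector instead of deeply nested ifelse() calls:

# Named lookup vector: names are the old labels, values are the new codes
codes <- c(MALE = 1, FEMALE = 2)
recoded <- unname(codes[gender])   # index the lookup by the character values
recoded[is.na(recoded)] <- 3       # anything not in the lookup (e.g. "UNKNOWN") becomes 3
recoded
[1] 1 2 2 3 1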
Another example, this time using a rectangular array.
A <- data.frame(Gender = c("F", "F", "M", "F", "B", "M", "M"), Height
= c(154, 167, 178, 145, 169, 183, 176))
A
Gender Height
1 F 154
2 F 167
3 M 178
4 F 145
5 B 169
6 M 183
7 M 176

We have deliberately introduced an error where gender is misclassified as B. This one gets re-coded to
the value 99. Note that the Gender variable is located in the first column, or A[ ,1].
A[,1] <- ifelse(A[,1] == "M", 1, ifelse(A[,1] == "F", 2, 99))
A
Gender Height
1 2 154
2 2 167
3 1 178
4 2 145
5 99 169
6 1 183
7 1 176

You can use the same approach to code as many different levels as you need to. Let's re-code for four
different levels.
My last example is drawn from the films of the Lord of the Rings and the Hobbit.
The sets where Peter Jackson produced these films are just a short walk from where I live, so the
example is relevant for me.
S <- data.frame(SPECIES = c("ORC", "HOBBIT", "ELF", "TROLL", "ORC",
"ORC", "ELF", "HOBBIT"), HEIGHT
= c(194, 127, 178, 195, 149, 183, 176, 134))
S
SPECIES HEIGHT
1 ORC 194
2 HOBBIT 127
3 ELF 178
4 TROLL 195
5 ORC 149
6 ORC 183
7 ELF 176
8 HOBBIT 134

We now use nested ifelse commands to re-code Orcs as 1, Elves as 2, Hobbits as 3, and Trolls as 4.
S[,1] <- ifelse(S[,1] == "ORC", 1, ifelse(S[,1] == "ELF", 2,
ifelse(S[,1] == "HOBBIT", 3, ifelse(S[,1] == "TROLL", 4, 99))))
S
SPECIES HEIGHT
1 1 194
2 3 127
3 2 178
4 4 195
5 1 149
6 1 183
7 2 176
8 3 134

We can recode back to character just as easily.


S[,1] <- ifelse(S[,1] == 1, "ORC", ifelse(S[,1] == 2, "ELF",
ifelse(S[,1] == 3, "HOBBIT", ifelse(S[,1] == 4, "TROLL", 99))))
S
SPECIES HEIGHT
1 ORC 194
2 HOBBIT 127
3 ELF 178
4 TROLL 195
5 ORC 149
6 ORC 183
7 ELF 176
8 HOBBIT 134

Brief Walkthrough Of The dummyVars Function From {caret}
Practical walkthroughs on machine learning, data exploration and finding insight.
Resources
YouTube Companion Video
Full Source Code

Packages Used in this Walkthrough


{caret} - dummyVars function

As the name implies, the dummyVars function allows you to create dummy variables - in other words,
it translates text data into numerical data for modeling purposes.
If you are planning on doing predictive analytics or machine learning and want to use regression or any
other modeling technique that requires numerical data, you will need to transform your text data into
numbers; otherwise you run the risk of leaving a lot of information on the table.
In R, there are plenty of ways of translating text into numerical data. You can do it manually, use a base
function, such as matrix, or a packaged function like dummyVars from the caret package. One of the
big advantages of going with the caret package is that it's full of features, including hundreds of
algorithms and pre-processing functions. Once your data fits into caret's modular design, it can be run
through different models with minimal tweaking.
Let's look at a few examples of dummy variables. Say you have a survey question with 5 categorical
values, such as very unhappy, unhappy, neutral, happy and very happy.
survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very
happy'))
print(survey)
## service
## 1 very unhappy
## 2 unhappy
## 3 neutral
## 4 happy
## 5 very happy

You can easily translate this into a sequence of numbers from 1 to 5, where 3 means neutral and, in the
example of a linear model that thinks in fractions, 2.5 means somewhat unhappy and 4.88 means very
happy. So here we have successfully transformed this survey question into a continuous numerical scale
and do not need to add dummy variables - a simple rank column will do.
survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very
happy'), rank=c(1,2,3,4,5))
print(survey)

## service rank
## 1 very unhappy 1
## 2 unhappy 2
## 3 neutral 3
## 4 happy 4
## 5 very happy 5

So, the above could easily be used in a model that needs numbers and still represent that data
accurately using the rank variable instead of service. But this only works in specific situations
where you have somewhat linear and continuous-like data. What happens with categorical values such
as marital status, gender, alive?

Does it make sense to be a quarter female? Or half single? Even numerical data of a categorical nature
may require transformation. Take the zip code system. Does the half-way point between two zip codes
make geographical sense? Because that is how a regression model would use it.

It may work in a fuzzy-logic way, but it won't help in predicting much; therefore we need a more
precise way of translating these values into numbers so that they can be regressed by the model.
library(caret)
# check the help file for more details
?dummyVars

The dummyVars function breaks out unique values from a column into individual columns - if you
have 1000 unique values in a column, dummying them will add 1000 new columns to your data set (be
careful). Let's create a more complex data frame:
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))

And ask the dummyVars function to dummify it. The function takes a standard R formula: something
~ (broken down) by something else or groups of other things. So we simply use ~ . and dummyVars
will transform all character and factor columns (the function never transforms numeric columns) and
return the entire data set:
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

## id gender.female gender.male mood.happy mood.sad outcome


## 1 10 0 1 1 0 1
## 2 20 1 0 0 1 1
## 3 30 1 0 1 0 0
## 4 40 0 1 0 1 0
## 5 50 1 0 1 0 0

If you just want one column transform you need to include that column in the formula and it will return
a data frame based on that variable only:
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))

dmy <- dummyVars(" ~ gender", data = customers)


trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

## gender.female gender.male
## 1 0 1
## 2 1 0
## 3 1 0
## 4 0 1
## 5 1 0

The fullRank parameter is worth mentioning here. The general rule for creating dummy variables is
to have one less variable than the number of categories present, to avoid perfect collinearity (the dummy
variable trap). You basically want to avoid highly correlated variables, and it also saves space. If you
have a factor column comprised of two levels, male and female, then you don't need to transform it
into two columns; instead, you pick one of the variables and you are either female, if it's a 1, or male,
if it's a 0.
Let's turn on fullRank and try our data frame again:
customers <- data.frame(
id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))

dmy <- dummyVars(" ~ .", data = customers, fullRank=T)


trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

## id gender.male mood.sad outcome


## 1 10 1 0 1
## 2 20 0 1 1
## 3 30 0 0 0
## 4 40 1 1 0
## 5 50 0 0 0

As you can see, it picked 'male' and 'sad': if you are 0 in both columns, then you are 'female' and
'happy'.

Things to keep in mind:
Don't dummy a large data set full of zip codes; you more than likely don't have the computing
muscle to add an extra 43,000 columns to your data set.
You can dummify large, free-text columns. Before running the function, look for repeated
words or sentences, only take the top 50 of them and replace the rest with 'others'. This will
allow you to use that field without delving deeply into NLP (a small sketch of this idea follows below).
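Here is a minimal sketch of that idea (my own illustration with a made-up 'city' column; it is not part of
the original walkthrough): collapse infrequent levels into "other" before calling dummyVars.

# Hypothetical data: one high-cardinality text column
set.seed(1)
df <- data.frame(city = sample(c("NYC", "LA", "SF", paste0("rare_", 1:20)),
                               200, replace = TRUE,
                               prob = c(0.4, 0.3, 0.2, rep(0.005, 20))),
                 stringsAsFactors = FALSE)

# Keep the most frequent levels, lump everything else into "other"
top_levels <- names(sort(table(df$city), decreasing = TRUE))[1:3]
df$city <- factor(ifelse(df$city %in% top_levels, df$city, "other"))

# Now dummifying adds only a handful of columns instead of 23
dmy <- dummyVars(~ city, data = df)
head(data.frame(predict(dmy, newdata = df)))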

Full source code (also on GitHub):


survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very
happy'))
print(survey)

survey <- data.frame(service=c('very unhappy','unhappy','neutral','happy','very


happy'), rank=c(1,2,3,4,5))
print(survey)

library(caret)

?dummyVars # many options

customers <- data.frame(


id=c(10,20,30,40,50),
gender=c('male','female','female','male','female'),
mood=c('happy','sad','happy','sad','happy'),
outcome=c(1,1,0,0,0))

dmy <- dummyVars(" ~ .", data = customers)


trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)
print(str(trsf))
# works only on factors
customers$outcome <- as.factor(customers$outcome)

# transform just gender


dmy <- dummyVars(" ~ gender", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

# use fullRank to avoid the 'dummy trap'


dmy <- dummyVars(" ~ .", data = customers, fullRank=T)
trsf <- data.frame(predict(dmy, newdata = customers))
print(trsf)

Beginner's Guide on Web Scraping in R (using rvest) with hands-on example

Introduction
Data and information on the web is growing exponentially. All of us today use Google as our first
source of knowledge - be it for finding reviews about a place or for understanding a new term. All this
information is already available on the web.
With the amount of data available over the web, it opens new horizons of possibility for a Data
Scientist. I strongly believe web scraping is a must-have skill for any data scientist. In today's world,
all the data that you need is already available on the internet; the only thing limiting you from using it
is the ability to access it. With the help of this article, you will be able to overcome that barrier as well.
Most of the data available over the web is not readily usable. It is present in an unstructured format
(HTML format) and is not downloadable. Therefore, it requires knowledge & expertise to use this data.
In this article, I am going to take you through the process of web scraping in R. With this article, you
will gain expertise to use any type of data available over the internet.

Table of Contents
1. What is Web Scraping?
2. Why do we need Web Scraping?
3. Ways to scrape data
4. Pre-requisites
5. Scraping a web page using R
6. Analyzing scraped data from the web

1. What is Web Scraping?


Web scraping is a technique for converting the data present in unstructured format (HTML tags) over
the web into a structured format which can easily be accessed and used.
Almost all the main languages provide ways of performing web scraping. In this article, we'll use R
for scraping the data for the most popular feature films of 2016 from the IMDb website.
We'll get a number of features for each of the 100 popular feature films released in 2016. Also, we'll
look at the most common problems that one might face while scraping data from the internet because
of a lack of consistency in the website code, and look at how to solve these problems.
If you are more comfortable using Python, I'll recommend you go through this guide for getting
started with web scraping using Python.

2. Why do we need Web Scraping?


I am sure the first question that must have popped up in your head by now is Why do we need web
scraping? As I stated before, the possibilities with web scraping are immense.
To provide you with hands-on knowledge, we are going to scrape data from IMDb. Some other possible
applications that you can use web scraping for are:
Scraping movie rating data to create movie recommendation engines.
Scraping text data from Wikipedia and other sources for making NLP-based systems or
training deep learning models for tasks like topic recognition from the given text.
Scraping labeled image data from websites like Google, Flickr, etc. to train image classification
models.
Scraping data from social media sites like Facebook and Twitter for performing tasks like
sentiment analysis, opinion mining, etc.
Scraping user reviews and feedback from e-commerce sites like Amazon, Flipkart, etc.

3. Ways to scrape data

There are several ways of scraping data from the web. Some of the popular ways are:
Human Copy-Paste: This is a slow and inefficient way of scraping data from the web. It
involves humans themselves analyzing and copying the data to local storage.
Text pattern matching: Another simple yet powerful approach to extract information from
the web is by using the regular expression matching facilities of programming languages. You can
learn more about regular expressions here.
API Interface: Many websites like Facebook, Twitter, LinkedIn, etc. provide public and/or
private APIs which can be called using standard code for retrieving the data in the prescribed
format.
DOM Parsing: By using web browsers, programs can retrieve the dynamic content
generated by client-side scripts. It is also possible to parse web pages into a DOM tree, based on
which programs can retrieve parts of these pages.
We'll use the DOM parsing approach during the course of this article, and rely on the CSS selectors of
the webpage for finding the relevant fields which contain the desired information. But before we begin,
there are a few prerequisites that one needs in order to proficiently scrape data from any website.

4. Pre-requisites
The prerequisites for performing web scraping in R are divided into two buckets:
To get started with web scraping, you must have a working knowledge of the R language. If you are
just starting out or want to brush up the basics, I'll highly recommend following this learning path
in R. During the course of this article, we'll be using the rvest package in R, authored by
Hadley Wickham. You can access the documentation for the rvest package here. Make sure you
have this package installed. If you don't have it yet, you can use the following code to install it.
install.packages('rvest')

In addition, knowledge of HTML and CSS will be an added advantage. One of the best sources I
could find for learning HTML and CSS is this. I have observed that most Data Scientists
are not very sound with technical knowledge of HTML and CSS. Therefore, we'll be using an
open source software named Selector Gadget, which will be more than sufficient for anyone in
order to perform web scraping. You can access and download the Selector Gadget extension
here. Make sure that you have this extension installed by following the instructions from the
website. I have done the same. I'm using Google Chrome and I can access the extension in the
extension bar at the top right.
Using this, you can select the parts of any website and get the relevant tags that give you access to that
part by simply clicking on that part of the website. Note that this is a workaround to actually learning
HTML & CSS and doing it manually. But to master the art of web scraping, I'll highly recommend you
learn HTML & CSS in order to better understand and appreciate what's happening under the hood.

5. Scraping a web page using R


Now, lets get started with scraping the IMDb website for the 100 most popular feature films released
in 2016. You can access them here.
#Loading the rvest package
library('rvest')

#Specifying the url for desired website to be scrapped


url <- 'http://www.imdb.com/search/title?
count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website


webpage <- read_html(url)

Now, well be scraping the following data from this website.


Rank: The rank of the film from 1 to 100 on the list of 100 most popular feature films released
in 2016.
Title: The title of the feature film.
Description: The description of the feature film.
Runtime: The duration of the feature film.
Genre: The genre of the feature film.
Rating: The IMDb rating of the feature film.
Metascore: The metascore on IMDb website for the feature film.
Votes: Votes cast in favor of the feature film.
Gross_Earning_in_Mil: The gross earnings of the feature film in millions.
Director: The main director of the feature film. Note, in case of multiple directors, Ill take
only the first.
Actor: The main actor of the feature film. Note, in case of multiple actors, Ill take only the
first.
Here's a screenshot that shows how all these fields are arranged.
Step 1: Now, we will start with scraping the Rank field. For that, we'll use Selector Gadget to get
the specific CSS selector that encloses the rankings. You can click on the extension in your browser
and select the rankings field with your cursor.
Make sure that all the rankings are selected. You can select some more ranking sections in case you are
not able to get all of them, and you can also de-select sections by clicking on them, to make sure that
you only have those sections highlighted that you want to scrape in that go.

Step 2: Once you are sure that you have made the right selections, you need to copy the corresponding
CSS selector that you can view in the bottom center.
Step 3: Once you know the CSS selector that contains the rankings, you can use this simple R code to
get all the rankings:
#Using CSS selectors to scrap the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text


rank_data <- html_text(rank_data_html)

#Let's have a look at the rankings


head(rank_data)

[1] "1." "2." "3." "4." "5." "6."

Step 4: Once you have the data, make sure that it looks in the desired format. I am preprocessing my
data to convert it to numerical format.
#Data-Preprocessing: Converting rankings to numerical
rank_data<-as.numeric(rank_data)

#Let's have another look at the rankings


head(rank_data)

[1] 1 2 3 4 5 6

Step 5: Now you can clear the selector section and select all the titles. You can visually inspect that all
the titles are selected. Make any required additions and deletions with the help of your cursor. I have
done the same here.
Step 6: Again, I have the corresponding CSS selector for the titles - .lister-item-header a. I will use this
selector to scrape all the titles using the following code.
#Using CSS selectors to scrap the title section
title_data_html <- html_nodes(webpage,'.lister-item-header a')

#Converting the title data to text


title_data <- html_text(title_data_html)

#Let's have a look at the title


head(title_data)

[1] "Sing" "Moana" "Moonlight" "Hacksaw Ridge"

[5] "Passengers" "Trolls"

Step 7: In the following code, I have done the same thing for scraping the Description, Runtime, Genre,
Rating, Metascore, Votes, Gross_Earning_in_Mil, Director and Actor data.
#Using CSS selectors to scrap the description section
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')

#Converting the description data to text


description_data <- html_text(description_data_html)

#Let's have a look at the description data


head(description_data)

[1] "\nIn a city of humanoid animals, a hustling theater impresario's attempt to


save his theater with a singing competition becomes grander than he anticipates
even as its finalists' find that their lives will never be the same."

[2] "\nIn Ancient Polynesia, when a terrible curse incurred by the Demigod Maui
reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to
seek out the Demigod to set things right."

[3] "\nA chronicle of the childhood, adolescence and burgeoning adulthood of a


young, African-American, gay man growing up in a rough neighborhood of Miami."

[4] "\nWWII American Army Medic Desmond T. Doss, who served during the Battle of
Okinawa, refuses to kill people, and becomes the first man in American history to
receive the Medal of Honor without firing a shot."

[5] "\nA spacecraft traveling to a distant colony planet and transporting thousands
of people has a malfunction in its sleep chambers. As a result, two passengers are
awakened 90 years early."

[6] "\nAfter the Bergens invade Troll Village, Poppy, the happiest Troll ever born,
and the curmudgeonly Branch set off on a journey to rescue her friends.

#Data-Preprocessing: removing '\n'


description_data<-gsub("\n","",description_data)

#Let's have another look at the description data


head(description_data)

[1] "In a city of humanoid animals, a hustling theater impresario's attempt to save
his theater with a singing competition becomes grander than he anticipates even as
its finalists' find that their lives will never be the same."

[2] "In Ancient Polynesia, when a terrible curse incurred by the Demigod Maui
reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to
seek out the Demigod to set things right."

[3] "A chronicle of the childhood, adolescence and burgeoning adulthood of a young,
African-American, gay man growing up in a rough neighborhood of Miami."
[4] "WWII American Army Medic Desmond T. Doss, who served during the Battle of
Okinawa, refuses to kill people, and becomes the first man in American history to
receive the Medal of Honor without firing a shot."

[5] "A spacecraft traveling to a distant colony planet and transporting thousands
of people has a malfunction in its sleep chambers. As a result, two passengers are
awakened 90 years early."

[6] "After the Bergens invade Troll Village, Poppy, the happiest Troll ever born,
and the curmudgeonly Branch set off on a journey to rescue her friends."

#Using CSS selectors to scrap the Movie runtime section


runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

#Converting the runtime data to text


runtime_data <- html_text(runtime_data_html)

#Let's have a look at the runtime


head(runtime_data)

[1] "108 min" "107 min" "111 min" "139 min" "116 min" "92 min"

#Data-Preprocessing: removing mins and converting it to numerical

runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)

#Let's have another look at the runtime data

head(runtime_data)

[1] 108 107 111 139 116  92

#Using CSS selectors to scrap the Movie genre section


genre_data_html <- html_nodes(webpage,'.genre')

#Converting the genre data to text


genre_data <- html_text(genre_data_html)

#Let's have a look at the runtime


head(genre_data)

[1] "\nAnimation, Comedy, Family "

[2] "\nAnimation, Adventure, Comedy "

[3] "\nDrama "

[4] "\nBiography, Drama, History "

[5] "\nAdventure, Drama, Romance "

[6] "\nAnimation, Adventure, Comedy "

#Data-Preprocessing: removing \n
genre_data<-gsub("\n","",genre_data)

#Data-Preprocessing: removing excess spaces


genre_data<-gsub(" ","",genre_data)
#taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)

#Convering each genre from text to factor


genre_data<-as.factor(genre_data)

#Let's have another look at the genre data


head(genre_data)

[1] Animation Animation Drama Biography Adventure Animation

10 Levels: Action Adventure Animation Biography Comedy Crime Drama ... Thriller

#Using CSS selectors to scrap the IMDB rating section


rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

#Converting the ratings data to text


rating_data <- html_text(rating_data_html)

#Let's have a look at the ratings


head(rating_data)

[1] "7.2" "7.7" "7.6" "8.2" "7.0" "6.5"

#Data-Preprocessing: converting ratings to numerical


rating_data<-as.numeric(rating_data)

#Let's have another look at the ratings data


head(rating_data)

[1] 7.2 7.7 7.6 8.2 7.0 6.5

#Using CSS selectors to scrap the votes section


votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

#Converting the votes data to text


votes_data <- html_text(votes_data_html)

#Let's have a look at the votes data


head(votes_data)

[1] "40,603" "91,333" "112,609" "177,229" "148,467" "32,497"

#Data-Preprocessing: removing commas


votes_data<-gsub(",","",votes_data)

#Data-Preprocessing: converting votes to numerical


votes_data<-as.numeric(votes_data)

#Let's have another look at the votes data


head(votes_data)

[1] 40603 91333 112609 177229 148467 32497

#Using CSS selectors to scrap the directors section


directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

#Converting the directors data to text


directors_data <- html_text(directors_data_html)
#Let's have a look at the directors data
head(directors_data)

[1] "Christophe Lourdelet" "Ron Clements" "Barry Jenkins"

[4] "Mel Gibson" "Morten Tyldum" "Walt Dohrn"

#Data-Preprocessing: converting directors data into factors


directors_data<-as.factor(directors_data)

#Using CSS selectors to scrap the actors section


actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

#Converting the gross actors data to text


actors_data <- html_text(actors_data_html)

#Let's have a look at the actors data


head(actors_data)

[1] "Matthew McConaughey" "Auli'i Cravalho" "Mahershala Ali"

[4] "Andrew Garfield" "Jennifer Lawrence" "Anna Kendrick"

#Data-Preprocessing: converting actors data into factors


actors_data<-as.factor(actors_data)

But, I want you to closely follow what happens when I do the same thing for Metascore data.
#Using CSS selectors to scrap the metascore section
metascore_data_html <- html_nodes(webpage,'.metascore')

#Converting the runtime data to text


metascore_data <- html_text(metascore_data_html)

#Let's have a look at the metascore data

head(metascore_data)

[1] "59 " "81 " "99 " "71 " "41 "

[6] "56 "

#Data-Preprocessing: removing extra space in metascore


metascore_data<-gsub(" ","",metascore_data)

#Lets check the length of metascore data


length(metascore_data)

[1] 96

Step 8: The length of the Metascore data is 96, while we are scraping the data for 100 movies. The reason
this happened is that there are 4 movies which don't have the corresponding Metascore fields.
Step 9: This is a practical situation which can arise while scraping any website. Unfortunately, if we
simply add 4 NAs to the last 4 entries, it will map NA as the Metascore for movies 96 to 100, while in
reality the data is missing for some other movies. After a visual inspection, I found that the Metascore is
missing for movies 39, 73, 80 and 89. I have written the following loop to get around this problem.
for (i in c(39,73,80,89)){

  a <- metascore_data[1:(i-1)]

  b <- metascore_data[i:length(metascore_data)]

  metascore_data <- append(a, "NA")
  metascore_data <- append(metascore_data, b)
}

#Data-Preprocessing: converting metascore to numerical


metascore_data<-as.numeric(metascore_data)

#Let's have another look at length of the metascore data

length(metascore_data)

[1] 100

#Let's look at summary statistics


summary(metascore_data)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

23.00 47.00 60.00 60.22 74.00 99.00 4

Step 10: The same thing happens with the Gross variable, which represents the gross earnings of the movie
in millions. I have used the same solution to work my way around it:
#Using CSS selectors to scrap the gross revenue section
gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')

#Converting the gross revenue data to text


gross_data <- html_text(gross_data_html)

#Let's have a look at the gross data


head(gross_data)

[1] "$269.36M" "$248.04M" "$27.50M" "$67.12M" "$99.47M" "$153.67M"

#Data-Preprocessing: removing '$' and 'M' signs


gross_data<-gsub("M","",gross_data)

gross_data<-substring(gross_data,2,6)

#Let's check the length of gross data


length(gross_data)

[1] 86

#Filling missing entries with NA


for (i in c(17,39,49,52,57,64,66,73,76,77,80,87,88,89)){

  a <- gross_data[1:(i-1)]

  b <- gross_data[i:length(gross_data)]

  gross_data <- append(a, "NA")

  gross_data <- append(gross_data, b)
}

#Data-Preprocessing: converting gross to numerical


gross_data<-as.numeric(gross_data)

#Let's have another look at the length of gross data


length(gross_data)

[1] 100

summary(gross_data)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

0.08 15.52 54.69 96.91 119.50 530.70 14

Step 11: Now we have successfully scraped all the 11 features for the 100 most popular feature films
released in 2016. Let's combine them to create a dataframe and inspect its structure.
#Combining all the lists to form a data frame
movies_df <- data.frame(Rank = rank_data, Title = title_data,
                        Description = description_data, Runtime = runtime_data,
                        Genre = genre_data, Rating = rating_data,
                        Metascore = metascore_data, Votes = votes_data,
                        Gross_Earning_in_Mil = gross_data,
                        Director = directors_data, Actor = actors_data)

#Structure of the data frame

str(movies_df)

'data.frame': 100 obs. of 11 variables:

$ Rank : num 1 2 3 4 5 6 7 8 9 10 ...

$ Title : Factor w/ 99 levels "10 Cloverfield Lane",..: 66 53 54 32


58 93 8 43 97 7 ...

$ Description : Factor w/ 100 levels "19-year-old Billy Lynn is brought


home for a victory tour after a harrowing Iraq battle. Through flashbacks the film
shows what"| __truncated__,..: 57 59 3 100 21 33 90 14 13 97 ...

$ Runtime : num 108 107 111 139 116 92 115 128 111 116 ...

$ Genre : Factor w/ 10 levels "Action","Adventure",..: 3 3 7 4 2 3 1


5 5 7 ...

$ Rating : num 7.2 7.7 7.6 8.2 7 6.5 6.1 8.4 6.3 8 ...

$ Metascore : num 59 81 99 71 41 56 36 93 39 81 ...


$ Votes : num 40603 91333 112609 177229 148467 ...

$ Gross_Earning_in_Mil: num 269.3 248 27.5 67.1 99.5 ...

$ Director : Factor w/ 98 levels "Andrew Stanton",..: 17 80 9 64 67 95


56 19 49 28 ...

$ Actor : Factor w/ 86 levels "Aaron Eckhart",..: 59 7 56 5 42 6 64


71 86 3 ...

You have now successfully scraped the IMDb website for the 100 most popular feature films released
in 2016.

6. Analyzing scraped data from the web

Once you have the data, you can perform several tasks like analyzing the data, drawing inferences from
it, training machine learning models over this data, etc. I have gone on to create some interesting
visualizations out of the data we have just scraped. Follow the visualizations and answer the questions
given below. Post your answers in the comments section below.
library('ggplot2')

qplot(data = movies_df,Runtime,fill = Genre,bins = 30)


Question 1: Based on the above data, which movie from which Genre had the longest runtime?

ggplot(movies_df,aes(x=Runtime,y=Rating))+
geom_point(aes(size=Votes,col=Genre))

Question 2: Based on the above data, in the Runtime of 130-160 mins, which genre has the highest
votes?

ggplot(movies_df,aes(x=Runtime,y=Gross_Earning_in_Mil))+
geom_point(aes(size=Rating,col=Genre))
Question 3: Based on the above data, which genre has the highest average gross earnings for movies
with a runtime of 100 to 120 mins? (A sketch of how you could check this yourself follows below.)
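If you want to verify your answer to Question 3, here is one possible sketch (it assumes the movies_df
built above and the dplyr package; it is my own illustration, not part of the original exercise):

library(dplyr)

movies_df %>%
  filter(Runtime >= 100, Runtime <= 120) %>%   # keep movies in the 100-120 min range
  group_by(Genre) %>%                          # one row per genre
  summarise(Avg_Gross = mean(Gross_Earning_in_Mil, na.rm = TRUE)) %>%
  arrange(desc(Avg_Gross))                     # highest average gross first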

End Notes
I believe this article has given you a complete understanding of web scraping in R. Now, you also have
a fair idea of the problems which you might come across and how you can make your way around them.
As most of the data on the web is present in an unstructured format, web scraping is a really handy skill
for any data scientist.
Comprehensive Guide on t-SNE algorithm with
implementation in R & Python
Introduction
Imagine you get a dataset with hundreds of features (variables) and little understanding about the
domain the data belongs to. You are expected to identify hidden patterns in the data, and to explore and
analyze the dataset. And not just that: you have to find out whether there is a pattern in the data - is it
signal or is it just noise?
Does that thought make you uncomfortable? It made my hands sweat when I came across this situation
for the first time. Do you wonder how to explore a multidimensional dataset? It is one of the questions
frequently asked by many data scientists. In this article, I will take you through a very powerful way to
do exactly this.

What about PCA?


By now, some of you would be screaming "I'll use PCA for dimensionality reduction and
visualization." Well, you are right! PCA is definitely a good choice for dimensionality reduction and
visualization for datasets with a large number of features. But what if you could use something more
advanced than PCA? (If you don't know PCA, I would strongly recommend reading this article first.)
What if you could easily search for patterns in a non-linear way? In this article, I will tell you about a
newer algorithm called t-SNE (2008), which is much more effective than PCA (1933). I will take you
through the basics of the t-SNE algorithm first, and then will walk you through why t-SNE fits well
among dimensionality reduction algorithms.
You will also get hands-on knowledge of using t-SNE in both R and Python.
Read on!

Table of Contents
1. What is t-SNE?
2. What is dimensionality reduction?
3. How does t-SNE fit in the dimensionality reduction algorithm space
4. Algorithmic details of t-SNE
Algorithm
Time and Space Complexity
5. What does t-SNE actually do?
6. Use cases
7. t-SNE compared to other dimensionality reduction algorithm
8. Example Implementations
In R
Hyper parameter tuning
Code
Implementation Time
Interpreting Results
In Python
Hyper parameter tuning
Code
Implementation Time
9. Where and when to use
Data Scientist
Machine Learning Competition Enthusiast
Student
10. Common fallacies

1. What is t-SNE?
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction
algorithm used for exploring high-dimensional data. It maps multi-dimensional data to two or more
dimensions suitable for human observation. With the help of the t-SNE algorithm, you may have to plot
fewer exploratory data analysis plots the next time you work with high dimensional data.
2. What is dimensionality reduction?
In order to understand how t-SNE works, let's first understand what dimensionality reduction is.
Well, in simple terms, dimensionality reduction is the technique of representing multi-dimensional data
(data with multiple features having a correlation with each other) in 2 or 3 dimensions.
Some of you might ask why we need dimensionality reduction when we can plot the data using scatter
plots, histograms & boxplots and make sense of the pattern in the data using descriptive statistics.
Well, even if you can understand the patterns in the data and present them on simple charts, it is still
difficult for anyone without a statistics background to make sense of them. Also, if you have hundreds of
features, you would have to study thousands of charts before you could make sense of the data. (Read
more about dimensionality reduction here.)
With the help of a dimensionality reduction algorithm, you will be able to present the data explicitly; a
quick illustration follows below.
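As a quick illustration of the idea (my own example, not from the original article), the four numeric iris
measurements can be compressed to two principal components and plotted directly:

#Reducing the 4 iris measurements to 2 dimensions with PCA and plotting them
data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], col = iris$Species, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "iris: 4 dimensions reduced to 2")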

3. How does t-SNE fit in the dimensionality reduction algorithm space?
Now that you have an understanding of what dimensionality reduction is, let's look at how we can use
the t-SNE algorithm for reducing dimensions.
Following are a few dimensionality reduction algorithms that you can check out:
1. PCA (linear)
2. t-SNE (non-parametric/ nonlinear)
3. Sammon mapping (nonlinear)
4. Isomap (nonlinear)
5. LLE (nonlinear)
6. CCA (nonlinear)
7. SNE (nonlinear)
8. MVU (nonlinear)
9. Laplacian Eigenmaps (nonlinear)
The good news is that you need to study only two of the algorithms mentioned above to effectively
visualize data in lower dimensions - PCA and t-SNE.
Limitations of PCA
PCA is a linear algorithm. It will not be able to interpret complex polynomial relationships between
features. On the other hand, t-SNE is based on probability distributions with random walks on
neighborhood graphs to find the structure within the data.
A major problem with linear dimensionality reduction algorithms is that they concentrate on placing
dissimilar data points far apart in a lower dimensional representation. But in order to represent high
dimensional data on a low dimensional, non-linear manifold, it is important that similar data points be
represented close together, which is not what linear dimensionality reduction algorithms do. By now,
you should have a brief understanding of what PCA endeavors to do.
Local approaches seek to map nearby points on the manifold to nearby points in the low-dimensional
representation. Global approaches, on the other hand, attempt to preserve geometry at all scales, i.e.
mapping nearby points to nearby points and far away points to far away points.
It is important to know that most of the nonlinear techniques other than t-SNE are not capable of
retaining both the local and the global structure of the data at the same time.

4. Algorithmic details of t-SNE (optional read)
This section is for people interested in understanding the algorithm in depth. You can safely skip this
section if you do not want to go through the math in detail.
Let's understand why you should know about t-SNE and the algorithmic details of t-SNE. t-SNE is an
improvement on the Stochastic Neighbor Embedding (SNE) algorithm.

4.1 Algorithm
Step 1
Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances
between data points into conditional probabilities that represent similarities. The similarity of datapoint
$x_j$ to datapoint $x_i$ is the conditional probability, $p_{j|i}$, that $x_i$ would pick $x_j$ as its
neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered
at $x_i$.
For nearby datapoints, $p_{j|i}$ is relatively high, whereas for widely separated datapoints, $p_{j|i}$
will be almost infinitesimal (for reasonable values of the variance of the Gaussian, $\sigma_i$).
Mathematically, the conditional probability is given by

$$ p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} $$

where $\sigma_i$ is the variance of the Gaussian that is centered on datapoint $x_i$.
If you are not interested in the math, think about it this way: the algorithm starts by converting the
shortest distance (a straight line) between two points into a probability of similarity of the points,
where the similarity between points is the conditional probability that $x_i$ would pick $x_j$ as its
neighbor if neighbors were picked in proportion to their probability density under a Gaussian (normal
distribution) centered at $x_i$.

Step 2
For the low-dimensional counterparts $y_i$ and $y_j$ of the high-dimensional datapoints $x_i$ and
$x_j$, it is possible to compute a similar conditional probability, which we denote by $q_{j|i}$:

$$ q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)} $$

Note that $p_{i|i}$ and $q_{i|i}$ are set to zero, as we only want to model pairwise similarity.
In simple terms, steps 1 and 2 calculate the conditional probability of similarity between a pair of
points in
1. the high dimensional space, and
2. the low dimensional space.
For the sake of simplicity, try to understand this in detail.
Let us map a 3D space to a 2D space. What step 1 and step 2 are doing is calculating the probability of
similarity of points in the 3D space and the probability of similarity of points in the corresponding 2D
space.
Logically, the conditional probabilities $p_{j|i}$ and $q_{j|i}$ must be equal for a perfect representation
of the similarity of the datapoints in the two spaces, i.e. the difference between $p_{j|i}$ and $q_{j|i}$
must be zero for a perfect replication of the plot in high and low dimensions.
By this logic, SNE attempts to minimize this difference in conditional probabilities.

Step 3
Now here is the difference between the SNE and t-SNE algorithms.
To measure the minimization of the sum of differences of conditional probabilities, SNE minimizes the
sum of Kullback-Leibler divergences over all data points using a gradient descent method. We must
know that KL divergences are asymmetric in nature.
In other words, the SNE cost function focuses on retaining the local structure of the data in the map (for
reasonable values of the variance of the Gaussian in the high-dimensional space, $\sigma_i$).
Additionally, it is very difficult (computationally inefficient) to optimize this cost function.
So t-SNE also tries to minimize the sum of the differences in conditional probabilities, but it does so by
using the symmetric version of the SNE cost function, with simple gradients. Also, t-SNE employs a
heavy-tailed distribution in the low-dimensional space to alleviate both the crowding problem (the area
of the two-dimensional map that is available to accommodate moderately distant data points will not be
nearly large enough compared with the area available to accommodate nearby data points) and the
optimization problems of SNE.
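For reference, the symmetric cost function and the heavy-tailed low-dimensional similarity that t-SNE
minimizes are, as given in the original paper [1]:

$$ C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} $$

$$ q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} $$

Here a Student t-distribution with one degree of freedom (a Cauchy distribution) supplies the heavy
tails in the low-dimensional map.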

Step 4
If we look at the equation used to calculate the conditional probability, we have so far left the variance
out of the discussion. The remaining parameter to be selected is the variance $\sigma_i$ of the Gaussian
that is centered over each high-dimensional datapoint $x_i$. It is not likely that there is a single value of
$\sigma_i$ that is optimal for all data points in the data set, because the density of the data is likely to
vary. In dense regions, a smaller value of $\sigma_i$ is usually more appropriate than in sparser regions.
Any particular value of $\sigma_i$ induces a probability distribution, $P_i$, over all of the other data
points. This distribution has an entropy which increases as $\sigma_i$ increases. t-SNE performs a
binary search for the value of $\sigma_i$ that produces a $P_i$ with a fixed perplexity that is specified
by the user. The perplexity is defined as

$$ Perp(P_i) = 2^{H(P_i)} $$

where $H(P_i)$ is the Shannon entropy of $P_i$ measured in bits:

$$ H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i} $$

The perplexity can be interpreted as a smooth measure of the effective number of neighbors. The
performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and
50.
The minimization of the cost function is performed using gradient descent. Physically, the gradient may
be interpreted as the resultant force created by a set of springs between the map point $y_i$ and all
other map points $y_j$. All springs exert a force along the direction $(y_i - y_j)$. The spring between
$y_i$ and $y_j$ repels or attracts the map points depending on whether the distance between the two in
the map is too small or too large to represent the similarity between the two high-dimensional
datapoints. The force exerted by the spring between $y_i$ and $y_j$ is proportional to its length, and
also proportional to its stiffness, which is the mismatch $(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})$ between
the pairwise similarities of the data points and the map points [1].
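For completeness, the gradients that give rise to this spring interpretation are, from [1]: for SNE,

$$ \frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)\left(y_i - y_j\right) $$

and for t-SNE, with the heavy-tailed $q_{ij}$ shown earlier,

$$ \frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1} $$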

4.2 Time and Space Complexity
Now that we have understood the algorithm, it is time to analyze its performance. As you might have
observed, the algorithm computes pairwise conditional probabilities and tries to minimize the sum of
the differences of the probabilities in the higher and lower dimensions. This involves a lot of
calculations and computations, so the algorithm is quite heavy on system resources.
t-SNE has quadratic time and space complexity in the number of data points. This makes it particularly
slow and resource-draining when applied to data sets comprising more than 10,000 observations.
5. What does t-SNE actually do?
Now that we have looked into the mathematical description of how the algorithm works, let's sum up
what we have learned above. Here is a brief explanation of how t-SNE works.
It's quite simple actually: t-SNE, a non-linear dimensionality reduction algorithm, finds patterns in the
data by identifying observed clusters based on the similarity of data points with multiple features. But it
is not a clustering algorithm, it is a dimensionality reduction algorithm: because it maps the multi-
dimensional data to a lower dimensional space, the input features are no longer identifiable. Thus you
cannot make any inference based only on the output of t-SNE. So essentially it is mainly a data
exploration and visualization technique.
But t-SNE can be used in the process of classification and clustering by using its output as the input
features for other classification algorithms.

6. Use cases
You may ask, what are the use cases of such an algorithm. t-SNE can be used on almost all high
dimensional data sets. But it is extensively applied in Image processing, NLP, genomic data and speech
processing. It has been utilized for improving the analysis of brain and heart scans. Below are a few
examples:

6.1 Facial Expression Recognition
A lot of progress has been made on facial expression recognition (FER), and many algorithms like PCA
have been studied for it. But FER still remains a challenge due to the difficulties of dimension reduction
and classification. t-Distributed Stochastic Neighbor Embedding (t-SNE) is used for reducing the
high-dimensional data into a relatively low-dimensional subspace, and then other algorithms like
AdaBoostM2, Random Forests, Logistic Regression, Neural Networks and others are used as multi-
classifiers for the expression classification.
One such attempt performed facial expression recognition on the Japanese Female Facial Expression
(JAFFE) database with t-SNE and AdaBoostM2. Experimental results showed that this new combination
applied to FER gained better performance compared with traditional algorithms such as PCA, LDA,
LLE and SNE.[2]
The pipeline for implementing such a combination on the data could be as follows:
Preprocessing -> normalization -> t-SNE -> classification algorithm
The reported classification accuracies were:

             PCA     LDA     LLE     SNE     t-SNE
SVM          73.5%   74.3%   84.7%   89.6%   90.3%
AdaBoostM2   75.4%   75.9%   87.7%   90.6%   94.5%
6.2 Identifying Tumor Subpopulations (Medical Imaging)
Mass spectrometry imaging (MSI) is a technology that simultaneously provides the spatial distribution
of hundreds of biomolecules directly from tissue. Spatially mapped t-distributed stochastic neighbor
embedding (t-SNE) provides a nonlinear visualization of the data that is able to better resolve the
biomolecular intratumor heterogeneity.
In an unbiased manner, t-SNE can uncover tumor subpopulations that are statistically linked to patient
survival in gastric cancer and to metastasis status in primary tumors of breast cancer. Survival analysis
performed on each t-SNE cluster can provide significantly useful results.[3]

6.3 Text comparison using word vectors
Word vector representations capture many linguistic properties such as gender, tense, plurality and even
semantic concepts like "capital city of". Using dimensionality reduction, a 2D map can be computed
where semantically similar words are close to each other. This combination of techniques can be used
to provide a bird's-eye view of different text sources, including text summaries and their source
material. This enables users to explore a text source like a geographical map.[4]

7. t-SNE compared to other dimensionality reduction algorithms
While comparing the performance of t-SNE with other algorithms, we will compare them based on the
achieved accuracy rather than on the time and resource requirements relative to accuracy.
t-SNE outputs provide better results than PCA and other linear dimensionality reduction models. This
is because a linear method such as classical scaling is not good at modeling curved manifolds; it
focuses on preserving the distances between widely separated data points rather than on preserving the
distances between nearby data points.
The Gaussian kernel employed in the high-dimensional space by t-SNE defines a soft border between
the local and global structure of the data. And for pairs of data points that are close together relative to
the standard deviation of the Gaussian, the importance of modeling their separations is almost
independent of the magnitudes of those separations. Moreover, t-SNE determines the local
neighborhood size for each datapoint separately based on the local density of the data (by forcing each
conditional probability distribution to have the same perplexity) [1]. As a result, it tends to retain
structure better than the other non-linear dimensionality reduction algorithms.
8. Example Implementations
Let's implement the t-SNE algorithm on the MNIST handwritten digit database. This is one of the most
explored datasets for image processing.

1. In R
The Rtsne package has an implementation of t-SNE in R. The Rtsne package can be installed in R
using the following command typed in the R console:
install.packages("Rtsne")

Hyper parameter tuning
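The main knobs worth tuning in the Rtsne() call used below are summarized here as an annotated call
(argument names as documented in the Rtsne package; the values shown are illustrative defaults, not a
recommendation):

tsne <- Rtsne(train[,-1],
              dims = 2,          # output dimensionality (2 or 3 for plotting)
              perplexity = 30,   # effective number of neighbours, typically 5-50
              theta = 0.5,       # Barnes-Hut speed/accuracy trade-off (0 = exact t-SNE)
              max_iter = 500,    # number of optimization iterations
              pca = TRUE,        # run an initial PCA step on the input
              verbose = TRUE)    # print progress while fitting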

Code
MNIST data can be downloaded from the MNIST website and can be converted into a csv file with a
small amount of code. For this example, please download the following preprocessed MNIST data:
link
## Calling the installed package
library(Rtsne)

## Choose the train.csv file downloaded from the link above
train <- read.csv(file.choose())

## Curating the database for analysis with both t-SNE and PCA
Labels <- train$label
train$label <- as.factor(train$label)

## For plotting
colors = rainbow(length(unique(train$label)))
names(colors) = unique(train$label)

## Executing the algorithm on curated data
tsne <- Rtsne(train[,-1], dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500)
exeTimeTsne <- system.time(Rtsne(train[,-1], dims = 2, perplexity = 30,
                                 verbose = TRUE, max_iter = 500))

## Timing PCA on the same data for comparison
exeTimePca <- system.time(prcomp(train[,-1]))

## Plotting
plot(tsne$Y, t = 'n', main = "tsne")
text(tsne$Y, labels = train$label, col = colors[train$label])

Implementation Time
exeTimeTsne
user  system elapsed
118.037  0.000 118.006

exeTimePca
user  system elapsed
11.259  0.012  11.360

As can be seen, t-SNE takes considerably longer to execute than PCA on the same sample size of data.

Interpreting Results
The plots can be used for exploratory analysis. The output x & y co-ordinates, as well as the cost, can be
used as features in classification algorithms; a sketch follows below.
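As a minimal sketch of that idea (my own illustration, assuming the tsne object and Labels created
above, not part of the original article), the 2-D co-ordinates can be fed straight into a caret model:

## Using the 2-D t-SNE co-ordinates as features for a simple k-NN classifier
library(caret)
tsne_features <- data.frame(x = tsne$Y[, 1], y = tsne$Y[, 2],
                            label = as.factor(Labels))
knn_fit <- train(label ~ x + y, data = tsne_features, method = "knn",
                 trControl = trainControl(method = "cv", number = 5))
knn_fit
## Note: this embeds all points together; in a real pipeline you would need to
## handle unseen test data separately, since t-SNE has no out-of-sample mapping.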
2. In Python
An important thing to note is that pip install tsne produces an error. Installing the tsne package is
not recommended. The t-SNE algorithm can be accessed from the sklearn package instead.
Hyper parameter tuning

Code
The following code is taken from the sklearn examples on the sklearn website.
## importing the required packages
from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)

## Loading and curating the data
digits = datasets.load_digits(n_class=10)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30

## Function to scale and visualize the embedding vectors
def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    plt.figure()
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})
    if hasattr(offsetbox, 'AnnotationBbox'):
        ## only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(digits.data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                ## don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

#----------------------------------------------------------------------
## Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')

## Computing PCA
print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
               "Principal Components projection of the digits (time %.2fs)" %
               (time() - t0))

## Computing t-SNE
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time %.2fs)" %
               (time() - t0))

plt.show()

Implementation Time
t-SNE: 13.40 s
PCA: 0.01 s
9. Where and When to use t-SNE?
9.1 Data Scientist
For the data scientist, the main problem while using t-SNE is the black-box nature of the algorithm.
This impedes the process of providing inferences and insights based on the results. Also, another
problem with the algorithm is that it doesn't always produce a similar output on successive runs.
So how could you use the algorithm then? The best way to use the algorithm is for exploratory data
analysis. It will give you a very good sense of the patterns hidden inside the data. Its output can also be
used as input features for other classification & clustering algorithms.

9.2 Machine Learning Hacker
Reduce the dataset to 2 or 3 dimensions and stack this with a non-linear stacker, using a holdout set for
stacking/blending. You can then boost the t-SNE vectors using XGBoost to get better results.

9.3 Data Science Enthusiasts
For data science enthusiasts who are beginning to work with data science, this algorithm presents some
of the best opportunities in terms of research and performance enhancements. There have been a few
research papers attempting to improve the time complexity of the algorithm by utilizing linear
functions, but an optimal solution is still required. Implementing t-SNE for a variety of NLP problems
and image processing applications is relatively unexplored territory and has enough scope.

10. Common Fallacies
Following are a few common fallacies to avoid while interpreting the results of t-SNE:
1. For the algorithm to execute properly, the perplexity should be smaller than the number of
points. Also, the suggested perplexity is in the range of 5 to 50.
2. Sometimes, different runs with the same hyperparameters may produce different results.
3. Cluster sizes in any t-SNE plot must not be evaluated for standard deviation, dispersion or any
other similar measure. This is because t-SNE expands denser clusters and contracts sparser
clusters to even out cluster sizes. This is one of the reasons for the crisp and clear plots it
produces.
4. Distances between clusters may change, because global geometry is closely related to the
optimal perplexity, and in a dataset with many clusters of different sizes one perplexity cannot
optimize distances for all clusters.
5. Patterns may be found in random noise as well, so multiple runs of the algorithm with different
sets of hyperparameters must be checked before deciding if a pattern exists in the data.
6. Different cluster shapes may be observed at different perplexity levels (see the sketch after this
list).
7. Topology cannot be analyzed based on a single t-SNE plot; multiple plots must be observed
before making any assessment.

Reference
[1] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal
of Machine Learning Research 9(Nov):2579-2605, 2008.
[2] Jizheng Yi et al. Facial expression recognition based on t-SNE and AdaBoostM2. IEEE International
Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber,
Physical and Social Computing (2013).
[3] Walid M. Abdelmoula et al. Data-driven identification of prognostic tumor subpopulations using
spatially mapped t-SNE of mass spectrometry imaging data. PNAS, October 25, 2016, vol. 113, no. 43,
pp. 12244-12249.
[4] Hendrik Heuer. Text comparison using word vector representations and dimensionality reduction.
8th Eur. Conf. on Python in Science (EuroSciPy 2015).

End Notes
I hope you enjoyed reading this article. In this article, I have tried to explore all the aspects to help you
get started with t-SNE. I'm sure you must be excited to explore the t-SNE algorithm further and use it
on your own data.
Share your experience of working with the t-SNE algorithm and whether you think it's better than PCA.
If you have any doubts or questions, feel free to post them in the comments section.
