
Name: Swapnil Shashank Parkhe
IDS 575 Assignment - 3
UIN: 660014865

Ans-1) Partition of a two-dimensional feature space that could result from recursive binary splitting

Ans-2) Below are the definitions used; the corresponding code can be found in the submitted R-file

Definitions used (as per the ISLR book), where prob is the proportion of class-1 observations in a node:
• Gini_index = prob*(1-prob) + (1-prob)*prob
• Classify_error = 1 - pmax(prob, 1-prob)
• Cross_entropy = -(prob*log(prob) + (1-prob)*log(1-prob))
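
A minimal R sketch (not necessarily identical to the submitted R-file) of how these three measures can be plotted against prob:

# Impurity measures as functions of the class-1 proportion in a node
prob <- seq(0.001, 0.999, by = 0.001)   # avoid log(0) at the endpoints
gini_index     <- prob * (1 - prob) + (1 - prob) * prob
classify_error <- 1 - pmax(prob, 1 - prob)
cross_entropy  <- -(prob * log(prob) + (1 - prob) * log(1 - prob))

matplot(prob, cbind(gini_index, classify_error, cross_entropy),
        type = "l", lty = 1, col = c("red", "blue", "darkgreen"),
        xlab = "prob (class-1 proportion)", ylab = "Impurity measure")
legend("top", c("Gini index", "Classification error", "Cross-entropy"),
       lty = 1, col = c("red", "blue", "darkgreen"))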



Ans-3) Carseats data

(a) Data Split: The data are split into training and test sets in a 50:50 ratio (please see the R-code file)
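
A minimal sketch of this split, assuming the Carseats data from the ISLR package; the seed below is illustrative, not necessarily the one used in the submitted R-file:

library(ISLR)            # provides the Carseats data set
set.seed(1)              # illustrative seed
train_idx      <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
carseats_train <- Carseats[train_idx, ]
carseats_test  <- Carseats[-train_idx, ]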

(b) Fit Regression Tree:
The "rpart" package is used to fit and plot the tree (without tuning parameters for this example, letting rpart use its default parameters and hyperparameters).
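
A sketch of the default fit and its test MSE, assuming the train/test objects defined above (the rpart.plot package is one option for drawing the tree; the submitted R-file may plot it differently):

library(rpart)
library(rpart.plot)      # optional, only for a nicer tree drawing

# Fit a regression tree with rpart's default control parameters (no tuning)
fit_default <- rpart(Sales ~ ., data = carseats_train, method = "anova")
rpart.plot(fit_default)  # Fig1

# Test MSE of the default tree
pred_default <- predict(fit_default, newdata = carseats_test)
mean((pred_default - carseats_test$Sales)^2)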

Note: Please zoom in to see the relevant parts of the trees, or use the R code for better views (again, this tree is neither the optimal nor the maximal tree; it is simply what R builds with the default rpart parameters. Appropriate parameters are used in part (c) to first grow the tree maximally and then prune it).

Fig1: Decision tree using “rpart” (neither optimal nor maximal as made using default parameters)



Observations and interpretation of results:
• The tree is not an optimally grown tree.
• If the depth of the tree is not controlled, it can overfit or underfit: letting the tree grow to its maximal depth tends to cause overfitting, whereas restricting its growth too much can cause underfitting.

Test MSE (regression tree with default rpart parameters) = 4.8

(c) Use CV to determine the optimal tree parameter:
• The "rpart" package is first used to grow the maximal tree (with parameters chosen so the tree grows to full depth – please see the R code for details).
• To build the pruned tree, we first find the optimal complexity parameter (a.k.a. CP) using the inner CV functionality built into rpart (this step, also called parameter tuning via inner CV, produces an "x-relative-error", i.e. average cross-validation error, for a set of CP values in the "cptable" – please see the image below and the R code for details). The "cptable" is then inspected to determine the optimal CP (the optimal point is highlighted with a red ring; optimal CP = 0.04) corresponding to the minimum "x-relative-error". A code sketch of this workflow is given below.
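
A sketch of this workflow, assuming the objects defined above; the control values used to force a maximal tree are assumptions and may differ from the submitted R-file:

# Grow a (near-)maximal tree by removing the usual stopping rules
fit_max <- rpart(Sales ~ ., data = carseats_train, method = "anova",
                 control = rpart.control(cp = 0, minsplit = 2, xval = 10))

plotcp(fit_max)    # x-relative error vs CP (Fig2)
printcp(fit_max)   # the cptable produced by the inner CV

# CP with the minimum cross-validation error ("xerror")
best_cp <- fit_max$cptable[which.min(fit_max$cptable[, "xerror"]), "CP"]

# Prune the maximal tree at that CP to obtain the optimal tree (Fig4)
fit_pruned <- prune(fit_max, cp = best_cp)

# Test MSE of the maximal and the pruned tree
mean((predict(fit_max,    carseats_test) - carseats_test$Sales)^2)
mean((predict(fit_pruned, carseats_test) - carseats_test$Sales)^2)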

Fig2: CP Plot using “rpart” (X-relative error for different values of CP; Optimal CP=0.04)


(CP decreases from left to right)


Fig3: Maximal tree using “rpart” (using appropriate parameters like minsplit, CP, etc.)



Fig4: Optimal tree using “rpart” (using optimized CP obtained through CV processing)


Observations and interpretation of results:
• The tree is now an optimally grown tree (Fig4), built with the optimal complexity-parameter value corresponding to the minimum x-relative error from parameter tuning via inner CV on just the training dataset used for tree building (CP controls the depth of the tree).
• Pruning has improved the test MSE compared to the maximal tree (please see the values below). This happens because the maximal tree overfits, whereas the optimal tree does not.

Test MSE (on maximal tree)= 5.7
Test MSE (on pruned tree)= 5.3

Note: Please zoom in to see relevant parts of the trees or use R-code for better views

(d) Use Bagging to analyse the data:
• The "randomForest" package is used to perform bagging as a special case of a random forest where mtry = number of predictors in the dataset = 10 (for the Carseats data). A sketch of the fit is given below.
• ntree = 1000 is used here; the figure below shows that the error does not change much beyond ntree >= 100.
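
A sketch of the bagging fit, treating bagging as a random forest with mtry equal to all 10 predictors (the seed is illustrative):

library(randomForest)
set.seed(1)   # illustrative seed

fit_bag <- randomForest(Sales ~ ., data = carseats_train,
                        mtry = 10, ntree = 1000, importance = TRUE)

plot(fit_bag)   # error vs number of trees (Fig5)
mean((predict(fit_bag, carseats_test) - carseats_test$Sales)^2)   # test MSE
importance(fit_bag)   # %IncMSE and IncNodePurity (table below)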

Fig5: Bagging Error plot (Error w.r.t no. of trees for mtry=10)



Test MSE (using Bagging)= 2.6

Variable Importance (arranged in decreasing order of IncNodePurity):

Variable %IncMSE IncNodePurity
Price 73.5099743 506.188874
ShelveLoc 73.8780837 480.583304
Age 24.9530411 163.885279
CompPrice 21.8497254 128.600269
Advertising 24.3982165 118.801023
Income 5.1046129 86.656037
Population 3.2621868 76.786471
Education 0.9979023 45.954264
US 6.1677957 11.558938
Urban 0.1523434 5.701354

(e) Use Random Forest to analyse the data:
• The "randomForest" package is used with the rule-of-thumb value for regression problems, mtry = (number of predictors)/3 ≈ 3 (the Carseats data has 10 predictors). A sketch of the fit is given below.
• ntree = 1000 is used here; the figure below shows that the error does not change much beyond ntree >= 400.
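
The same workflow with mtry = 3, as a sketch under the assumptions above:

set.seed(1)   # illustrative seed
fit_rf <- randomForest(Sales ~ ., data = carseats_train,
                       mtry = 3, ntree = 1000, importance = TRUE)

plot(fit_rf)   # error vs number of trees (Fig6)
mean((predict(fit_rf, carseats_test) - carseats_test$Sales)^2)   # test MSE
importance(fit_rf)   # %IncMSE and IncNodePurity (table below)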

Fig6: Random Forest Error plot (Error w.r.t no. of trees for mtry=3)



Test MSE (using RandomForest)= 2.9

Variable Importance (arranged in decreasing order of IncNodePurity):

Variable %IncMSE IncNodePurity
Price 54.38094944 403.39521
ShelveLoc 51.87686972 351.28853
Age 20.01056489 189.99033
Advertising 18.52220032 139.30268
CompPrice 11.40028745 133.62955
Population 4.91033778 127.34497
Income 3.33676622 122.91471
Education -0.06187146 64.42161
US 7.25216899 26.46457
Urban 0.83772362 12.00037

Effect of the choice of m on the error rate: as m increases from 1 to 8, the error decreases (with a minimum around m = 7), and then increases slightly for m = 9 to 10.
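
One way to study this effect is to refit the forest for each value of m; the loop below is a sketch (seed and ntree as before):

set.seed(1)   # illustrative seed
test_mse_by_m <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = carseats_train, mtry = m, ntree = 1000)
  mean((predict(fit, carseats_test) - carseats_test$Sales)^2)
})
plot(1:10, test_mse_by_m, type = "b",
     xlab = "mtry (m)", ylab = "Test MSE")   # Fig7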

Fig7: Random Forest Error plot (Error w.r.t mtry with ntree=1000)



Ans-4) Hitters dataset [Note: Please check the R code for details]

(a) Data Treatment and Transformation: Removed rows where Salary is missing, and log-transformed the salaries

(b) Data Split: The first 200 rows form the training set and the remaining rows form the test set
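
A sketch of parts (a) and (b), assuming the Hitters data from the ISLR package:

library(ISLR)
hitters <- Hitters[!is.na(Hitters$Salary), ]   # drop rows with missing Salary
hitters$Salary <- log(hitters$Salary)          # log-transform the response

hitters_train <- hitters[1:200, ]              # first 200 rows as training data
hitters_test  <- hitters[-(1:200), ]           # remaining rows as test data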

(c) Boosting with 1000 trees for a range of lambda (shrinkage) values; plotting lambda on the x-axis and the training-set MSE on the y-axis
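
A sketch of the boosting runs over a grid of shrinkage values using the gbm package; the lambda grid, seed, and interaction.depth are assumptions, and the same loop also yields the test MSEs for part (d) and the relative influences for part (f):

library(gbm)
set.seed(1)                               # illustrative seed
lambdas <- 10^seq(-3, 0, by = 0.1)        # assumed grid of shrinkage values
train_mse <- test_mse <- numeric(length(lambdas))

for (i in seq_along(lambdas)) {
  fit_boost <- gbm(Salary ~ ., data = hitters_train, distribution = "gaussian",
                   n.trees = 1000, shrinkage = lambdas[i], interaction.depth = 1)
  train_mse[i] <- mean((predict(fit_boost, hitters_train, n.trees = 1000) -
                        hitters_train$Salary)^2)
  test_mse[i]  <- mean((predict(fit_boost, hitters_test,  n.trees = 1000) -
                        hitters_test$Salary)^2)
}

plot(lambdas, train_mse, type = "b",
     xlab = "Shrinkage (lambda)", ylab = "Training-set MSE")   # Fig8
# plot(lambdas, test_mse, ...) gives Fig9; summary(fit_boost) gives rel.inf (part f)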

Fig8: Training set MSE vs Lambda values



(d) Boosting with 1000 trees for a range of lambda (shrinkage) values; plotting lambda on the x-axis and the test-set MSE on the y-axis

Fig9: Test set MSE vs Lambda values



Note: In Fig9 above, the optimal lambda is 0.16, where the test MSE is 0.26

(e) Comparing the test MSE of Boosting with the test MSEs of Linear Regression and Ridge Regression

Test MSE (using Boosting)= 0.27
Test MSE (using Linear Regression)= 0.49
Test MSE (using Ridge Regression)= 0.46
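
A sketch of the two baseline fits, assuming glmnet for ridge regression (alpha = 0) with its penalty chosen by cross-validation:

# Linear regression baseline
fit_lm <- lm(Salary ~ ., data = hitters_train)
mean((predict(fit_lm, hitters_test) - hitters_test$Salary)^2)

# Ridge regression baseline via glmnet (alpha = 0 gives the ridge penalty)
library(glmnet)
x_train <- model.matrix(Salary ~ ., hitters_train)[, -1]
x_test  <- model.matrix(Salary ~ ., hitters_test)[, -1]
fit_ridge  <- cv.glmnet(x_train, hitters_train$Salary, alpha = 0)
pred_ridge <- predict(fit_ridge, newx = x_test, s = "lambda.min")
mean((pred_ridge - hitters_test$Salary)^2)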

(f) Variable Importance as per Boosting:

Variables rel.inf
CAtBat 15.2806675
CRuns 10.6631899
CHits 9.4053306
PutOuts 9.3891947
Walks 8.218591
CHmRun 6.0778593
Assists 5.9919862
Years 5.9746214
AtBat 4.7580801
RBI 4.0231737
CRBI 3.5175945
CWalks 3.4347325
HmRun 3.3532394
Hits 3.2644582
Runs 2.6499791
Errors 2.4091699
NewLeague 0.6056431
Division 0.5017385
League 0.4807503

(g) Use Bagging to analyse the data:

Test MSE (using Bagging)= 0.23


Fig10: Bagging Error plot (Error w.r.t no. of trees for mtry=19)
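
A sketch of the bagging fit on the Hitters training set, with mtry set to all 19 predictors (seed illustrative):

library(randomForest)   # already loaded above
set.seed(1)             # illustrative seed
fit_bag_hit <- randomForest(Salary ~ ., data = hitters_train,
                            mtry = 19, ntree = 1000)
plot(fit_bag_hit)   # error vs number of trees (Fig10)
mean((predict(fit_bag_hit, hitters_test) - hitters_test$Salary)^2)   # test MSE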
