
Victoria Puckett

BI and Data Mining with R

Decision Tree in R
1. The R commands used to create the tree model and image:
> library(tree)
> library(ISLR)
> data(Hitters)
> set.seed(1)
> Hitters.tree <- tree(League~HmRun+CHmRun+Hits+Years, data=Hitters)
> Hitters.tree
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 322 443.900 A ( 0.5435 0.4565 )
   2) HmRun < 16.5 244 338.100 N ( 0.4877 0.5123 ) *
   3) HmRun > 16.5 78 92.800 A ( 0.7179 0.2821 )
     6) CHmRun < 273 70 87.150 A ( 0.6857 0.3143 )
      12) CHmRun < 213.5 60 67.480 A ( 0.7500 0.2500 )
        24) HmRun < 27.5 44 55.040 A ( 0.6818 0.3182 ) *
        25) HmRun > 27.5 16 7.481 A ( 0.9375 0.0625 ) *
      13) CHmRun > 213.5 10 12.220 N ( 0.3000 0.7000 )
        26) Hits < 136 5 6.730 A ( 0.6000 0.4000 ) *
        27) Hits > 136 5 0.000 N ( 0.0000 1.0000 ) *
     7) CHmRun > 273 8 0.000 A ( 1.0000 0.0000 ) *
> plot(Hitters.tree, col=8)
> text(Hitters.tree)
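As an optional check (a sketch beyond what the assignment asks for), the fitted tree can also be used to classify the training data and its accuracy tabulated:
> tree.pred <- predict(Hitters.tree, Hitters, type="class")   #predicted League for each player
> table(Predicted=tree.pred, Actual=Hitters$League)            #confusion table on the training data
> mean(tree.pred==Hitters$League)                              #proportion classified correctly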

2. Yes, there is a redundant branch that is a candidate for pruning. I chose node 12, the split
CHmRun < 213.5, because both of its terminal nodes predict the American League, so the split is
redundant and does not add useful information for the average person viewing the decision tree.
> Hitters.tree2 <- snip.tree(Hitters.tree, nodes=c(12))
> plot(Hitters.tree2)
> text(Hitters.tree2)
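As a quick check (a sketch, not required by the assignment), summary() reports the misclassification rate before and after snipping, so it can be confirmed that pruning node 12 costs little or no accuracy:
> summary(Hitters.tree)    #misclassification rate of the full tree
> summary(Hitters.tree2)   #misclassification rate after snipping node 12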


K Nearest Neighbor in R
1. Prepare the data
Create a list of indices that randomly represent 70% of the tuples in Hitters:
> library(class)
> library(ISLR)
> training.size=.7
> set.seed(1)
> training.rows <- sample(1:nrow(Hitters), as.integer(training.size*nrow(Hitters)))
Normalize the attributes:
> Hitters2 <- scale(Hitters[,c(2,3,7,10)])
> head(Hitters2)
                        Hits      HmRun      Years       CHmRun
-Andy Allanson    -0.7539563 -1.1218446 -1.3081578 -0.793947035
-Alan Ashby       -0.4310614 -0.4329051  1.3308535 -0.005688022
-Alvin Davis       0.6237287  0.8301507 -0.9021560 -0.075240288
-Andre Dawson      0.8605183  1.0597972  0.7218509  1.802670892
-Andres Galarraga -0.3019034 -0.0884353 -1.1051569 -0.666434548
-Alfredo Griffin   1.4632555 -0.7773748  0.7218509 -0.585290238
2. Create knn models
> k1nn.results <- knn(train=Hitters2[training.rows,], test=Hitters2[-training.rows,],
cl=Hitters[training.rows,14], k=1)
> k5nn.results <- knn(train=Hitters2[training.rows,], test=Hitters2[-training.rows,],
cl=Hitters[training.rows,14], k=5)
> k10nn.results <- knn(train=Hitters2[training.rows,], test=Hitters2[-training.rows,],
cl=Hitters[training.rows,14], k=10)
> knn.results <- data.frame(Hitters[-training.rows, 14], k1nn.results, k5nn.results, k10nn.results)
> head(knn.results)
  Hitters..training.rows..14. k1nn.results k5nn.results k10nn.results
1                           A            N            N             N
2                           N            A            A             A
3                           N            N            N             N
4                           A            A            N             A
5                           A            N            N             N
6                           N            N            N             N
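To see where each model agrees and disagrees with the actual league (a sketch, not part of the original answer), a confusion table can be built for each k:
> table(Actual=Hitters[-training.rows,14], k1=k1nn.results)
> table(Actual=Hitters[-training.rows,14], k5=k5nn.results)
> table(Actual=Hitters[-training.rows,14], k10=k10nn.results)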

3. Interpret results
Setting k=1 gives the lowest misclassification rate: it correctly classifies 50.52% of the test
data, while k=5 correctly classifies 46.39% and k=10 correctly classifies the least, at 43.30%.
> k1.results <- sum(Hitters[-training.rows,14]==k1nn.results)/(nrow(Hitters)-length(training.rows))
> k1.results
[1] 0.5051546
> k5.results <- sum(Hitters[-training.rows,14]==k5nn.results)/(nrow(Hitters)-length(training.rows))
> k5.results
[1] 0.4639175
> k10.results <- sum(Hitters[-training.rows,14]==k10nn.results)/(nrow(Hitters)-length(training.rows))
> k10.results
[1] 0.4329897
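A possible extension (a sketch, with the set of k values chosen arbitrarily) is to sweep k over several values and plot the accuracy, rather than testing only k = 1, 5, and 10:
> ks <- c(1, 3, 5, 7, 10, 15, 20)
> accuracy <- sapply(ks, function(k) mean(knn(train=Hitters2[training.rows,],
test=Hitters2[-training.rows,], cl=Hitters[training.rows,14], k=k) == Hitters[-training.rows,14]))
> plot(ks, accuracy, type="b", xlab="k", ylab="proportion correctly classified")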


Logistic Classification in R
2a) Create Models
> library(ISLR)
> College$AcceptanceRate=(College$Accept/College$Apps) #creating new variable AcceptanceRate
> College.m1 <- glm(Private~Room.Board+Grad.Rate+Top10perc,College,family=binomial)
> College.m2 <- glm(Private~Outstate+PhD+AcceptanceRate,College,family=binomial)
> summary(College.m1)
Call:
glm(formula = Private ~ Room.Board + Grad.Rate + Top10perc, family = binomial,
data = College)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3160  -0.9142   0.5003   0.7658   1.9811
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.0688130 0.4816571 -8.448 < 2e-16 ***
Room.Board 0.0006951 0.0001066 6.519 7.08e-11 ***
Grad.Rate 0.0376178 0.0064845 5.801 6.58e-09 ***
Top10perc -0.0066575 0.0068743 -0.968 0.333
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 910.75 on 776 degrees of freedom
Residual deviance: 769.20 on 773 degrees of freedom
AIC: 777.2
Number of Fisher Scoring iterations: 5
> summary(College.m2)
Call:
glm(formula = Private ~ Outstate + PhD + AcceptanceRate, family = binomial,
data = College)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2216  -0.2611   0.1173   0.3338   3.1297
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.1552852  1.0826173   1.067    0.286
Outstate        0.0008488  0.0000658  12.900   <2e-16 ***
PhD            -0.1194898  0.0128988  -9.264   <2e-16 ***
AcceptanceRate  1.4311132  0.9200029   1.556    0.120
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 910.75 on 776 degrees of freedom
Residual deviance: 413.47 on 773 degrees of freedom
AIC: 421.47
Number of Fisher Scoring iterations: 7
2b) Comparison
In model 1, College.m1, the variables Room.Board and Grad.Rate are statistically significant in
explaining whether a school is private or public. Both coefficients are positive, so higher values
increase the probability that a college is classified as private.
In model 2, College.m2, the variables Outstate and PhD are statistically significant. Outstate is
positively related to the probability that a college is classified as private, while PhD is
negatively related.
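One optional way to gauge the size of these effects (a sketch, not part of the original answer) is to exponentiate the coefficients into odds ratios; values above 1 raise the odds that a school is private, values below 1 lower them:
> exp(coef(College.m1))   #odds ratios for model 1
> exp(coef(College.m2))   #odds ratios for model 2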
3b) Interpretation
I chose to create a new model by dropping the variable AcceptanceRate from College.m2.
> College.m3 <- glm(Private~Outstate+PhD,College,family=binomial)
> summary(College.m3)
Call:
glm(formula = Private ~ Outstate + PhD, family = binomial, data = College)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2417  -0.2651   0.1097   0.3377   3.1437
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3663216 0.7522037 3.146 0.00166 **
Outstate 0.0008603 0.0000662 12.996 < 2e-16 ***
PhD         -0.1225496  0.0127437  -9.617  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 910.75 on 776 degrees of freedom
Residual deviance: 415.90 on 774 degrees of freedom
AIC: 421.9

Number of Fisher Scoring iterations: 7
I chose this model by comparing the p-values of the three candidate models. Although College.m2
and College.m3 have essentially the same goodness-of-fit p-value and AIC, I chose College.m3
because it is simpler and includes only the significant variables.
> 1-pchisq(769.20,773) #College.m1
[1] 0.5318179
> 1-pchisq(413.47,773) #College.m2
[1] 1
> 1-pchisq(415.90,774) #College.m3
[1] 1
Using College.m3, the probability of a school being private with Outstate = 8000 and PhD = 70 is:

probability = e^(2.3663216 + 0.0008603(8000) - 0.1225496(70)) / (1 + e^(2.3663216 + 0.0008603(8000) - 0.1225496(70)))
            = e^0.6702 / (1 + e^0.6702)
            = 0.6616

so there is a 66.16% probability that the school is private.
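The same probability can be verified directly in R (a sketch; predict() with type="response" returns the fitted probability and should agree with the hand calculation above, roughly 0.66):
> predict(College.m3, newdata=data.frame(Outstate=8000, PhD=70), type="response")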


K-Means in R
1. Kmeans and R
> hw4 <- read.csv('homework4.csv')
> set.seed(1)
> edu.cluster2 <- kmeans(hw4[,c("SEC_EDU", "SCI_15", "MTH_15", "RED_15", "SAT_EDU")],
centers = 2)
> set.seed(1)
> edu.cluster3 <- kmeans(hw4[,c("SEC_EDU", "SCI_15", "MTH_15", "RED_15", "SAT_EDU")],
centers = 3)
> set.seed(1)
> edu.cluster4 <- kmeans(hw4[,c("SEC_EDU", "SCI_15", "MTH_15", "RED_15", "SAT_EDU")],
centers = 4)
2. Record the centroids and cluster variation
- For edu.cluster2, the cluster means are:
SEC_EDU SCI_15 MTH_15 RED_15 SAT_EDU
1 83.25833 506.8611 501.9722 496.7778 62.62778
2 67.99500 401.0500 395.7500 400.6000 58.44000
There are 36 members in the first cluster and 20 in cluster two.
- For edu.cluster3, the cluster means are:
SEC_EDU SCI_15 MTH_15 RED_15 SAT_EDU
1 85.51538 535.5385 535.3077 521.9231 64.70000
2 82.11250 488.6667 481.4167 480.8750 61.31250
3 67.09474 398.8421 393.3158 398.4211 58.46316
There are 13 members in cluster 1, 24 members in cluster 2, and 19 in cluster 3.
- For edu.cluster4, the cluster means are:
SEC_EDU SCI_15 MTH_15 RED_15 SAT_EDU
1 84.13000 540.7000 541.9000 528.1000 66.39000
2 85.78696 498.4783 491.6087 487.2174 60.51304
3 68.44000 438.5000 432.3000 440.8000 58.52000
4 66.03077 385.4615 379.7692 384.6923 60.19231
There are 10 members in cluster 1, 23 members in cluster 2, 10 members in cluster 3 and 13
members in cluster 4.
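These centroids and sizes can also be pulled straight from the kmeans objects (a sketch of the standard accessors), rather than read off the printed summaries shown below:
> edu.cluster2$centers; edu.cluster2$size
> edu.cluster3$centers; edu.cluster3$size
> edu.cluster4$centers; edu.cluster4$size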
Commands used
> edu.cluster2
K-means clustering with 2 clusters of sizes 36, 20
Cluster means:
SEC_EDU SCI_15 MTH_15 RED_15 SAT_EDU
1 83.25833 506.8611 501.9722 496.7778 62.62778
2 67.99500 401.0500 395.7500 400.6000 58.44000
Clustering vector:
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1
[52] 2 2 1 2 2
Within cluster sum of squares by cluster:
[1] 96836.42 62583.78
(between_SS / total_SS = 72.1 %)
Available components:
[1] "cluster" "centers" "totss"
"withinss" "tot.withinss" "betweenss" "size"
"iter"
[9] "ifault"
> edu.cluster3
K-means clustering with 3 clusters of sizes 13, 24, 19
Cluster means:
SEC_EDU SCI_15 MTH_15 RED_15 SAT_EDU
1 85.51538 535.5385 535.3077 521.9231 64.70000
2 82.11250 488.6667 481.4167 480.8750 61.31250
3 67.09474 398.8421 393.3158 398.4211 58.46316
Clustering vector:
 [1] 1 2 1 1 1 2 2 1 1 1 1 2 2 1 2 1 2 1 2 2 2 2 2 2 2 1 2 3 2 2 3 2 2 2 2 3 3 3 2 3 3 3 3 2 3 3 3 3 3 3 2
[52] 3 3 1 3 3
Within cluster sum of squares by cluster:
[1] 18344.60 31484.76 56367.36
(between_SS / total_SS = 81.4 %)
Available components:
[1] "cluster" "centers" "totss"
"withinss" "tot.withinss" "betweenss" "size"
"iter"
[9] "ifault"
> edu.cluster4
K-means clustering with 4 clusters of sizes 10, 23, 10, 13
Cluster means:
SEC_EDU SCI_15 MTH_15 RED_15 SAT_EDU
1 84.13000 540.7000 541.9000 528.1000 66.39000
2 85.78696 498.4783 491.6087 487.2174 60.51304
3 68.44000 438.5000 432.3000 440.8000 58.52000
4 66.03077 385.4615 379.7692 384.6923 60.19231
Clustering vector:
 [1] 1 2 1 2 1 2 2 1 1 1 1 2 3 2 2 1 2 1 2 2 2 2 2 2 2 2 2 4 2 2 3 2 3 2 2 4 3 4 2 3 3 4 3 3 3 4 4 4 4 4 3
[52] 4 4 1 4 4
Within cluster sum of squares by cluster:
[1] 12858.43 18214.72 13481.14 28711.58
(between_SS / total_SS = 87.2 %)
Available components:
[1] "cluster" "centers" "totss"
"withinss" "tot.withinss" "betweenss" "size"
"iter"
[9] "ifault"
3. Compare within-cluster variation
The within-cluster sum of squares for edu.cluster2 is 96,836.42 for the first cluster and
62,583.78 for the second cluster.
The within-cluster sum of squares for edu.cluster3 is 18,344.60 for the first cluster, 31,484.76
for the second cluster and 56,367.36 for the third cluster.
The within-cluster sum of squares for edu.cluster4 is 12,858.43 for the first cluster, 18,214.72
for the second cluster, 13,481.14 for the third cluster, and 28,711.58 for the fourth cluster.
As the number of clusters increases across the three cluster sets, the ratio between_SS/total_SS
continues to grow. This change is desirable because a larger between_SS relative to total_SS means
a smaller within_SS, which translates to less distance within clusters and greater separation
between clusters, a key goal of clustering.
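A common way to extend this comparison (a sketch; the range of k is an arbitrary choice) is an "elbow" plot of the total within-cluster sum of squares against the number of clusters:
> wss <- sapply(2:8, function(k) { set.seed(1); kmeans(hw4[,c("SEC_EDU", "SCI_15",
"MTH_15", "RED_15", "SAT_EDU")], centers=k)$tot.withinss })
> plot(2:8, wss, type="b", xlab="number of clusters k", ylab="total within-cluster SS")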
4. Data processing
Before running the kmeans function on a data set, it may be necessary to normalize or scale the
data so that no single attribute overwhelms the model. Identifying outliers and deciding whether
to include them is also important, since k-means is sensitive to outliers. Finally, missing (null)
values need to be handled, either removed or imputed, depending on how prevalent they are in the
data set being used.
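A minimal sketch of these preprocessing steps applied to the homework data (assuming rows with missing values are simply dropped; other choices, such as imputation, are possible):
> edu.data <- na.omit(hw4[,c("SEC_EDU", "SCI_15", "MTH_15", "RED_15", "SAT_EDU")])   #drop rows with nulls
> edu.scaled <- scale(edu.data)    #standardize so no single attribute dominates the distance
> set.seed(1)
> edu.cluster3s <- kmeans(edu.scaled, centers=3)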
