Analytics I
Mock Midterm
Case #1
This dataset contains state data, with 50 observations (one for each US state) and the following 8
variables:
• Population - the population estimate of the state in 1975
• Income - per capita income in 1974
• Illiteracy - illiteracy rates in 1970, as a percent of the population
• Life.Exp - the life expectancy in years of residents of the state in 1970
• Murder - the murder and non-negligent manslaughter rate per 100,000 population in 1976
• HS.Grad - percent of high-school graduates in 1970
• Frost - the mean number of days with minimum temperature below freezing from 1931–1960 in
the capital or a large city of the state
• Area - the land area (in square miles) of the state
Refer to the tables and reports from Appendix 1 to solve Case #1.
Calculate the sum of squared errors (SSE) between the life
expectancies predicted by a linear regression model and the actual life
expectancies (use all predictors).
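A minimal R sketch of this computation, assuming the exam data is R's built-in `state.x77` matrix (its first rows match the `str()` output in Appendix 1):

```r
# Assumption: the exam's "data" is R's built-in state.x77 matrix.
# data.frame() converts the column names to Population, Income, Illiteracy,
# Life.Exp, Murder, HS.Grad, Frost, and Area, as in Appendix 1.
data <- data.frame(state.x77)

# Fit a linear regression on all predictors, then compute the SSE
# between fitted and actual life expectancies.
model <- lm(Life.Exp ~ ., data = data)
sse <- sum((data$Life.Exp - predict(model))^2)
sse
```

Under that assumption, this reproduces the value reported at the end of Appendix 1.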
Calculate the sum of squared errors (SSE) between the life
expectancies predicted by this model and the actual life expectancies (use
the best 4 variables as predictors).
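A sketch of the four-variable fit, again assuming the built-in `state.x77` data; the coefficient table in Appendix 1 indicates the four retained predictors are Population, Murder, HS.Grad, and Frost:

```r
data <- data.frame(state.x77)  # assumption: exam data is R's built-in state.x77

# Keep only the four strongest predictors and recompute the SSE.
model4 <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = data)
sse4 <- sum((data$Life.Exp - predict(model4))^2)
sse4
```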
Use a CART model to predict Life.Exp, using all of the other variables as independent variables and the
default minbucket parameter. We are not as interested in predicting life expectancies for new
observations, so use all of the data to build the model. You should not use the method="class" argument,
since this is a regression tree. Which of these variables appear in the tree? Circle all that apply.
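One way to build and inspect the tree in R (a sketch; assumes the `rpart` package and, as above, that the exam data is R's built-in `state.x77`):

```r
library(rpart)

data <- data.frame(state.x77)  # assumption: exam data is R's built-in state.x77

# With method unspecified, rpart picks "anova" for a numeric outcome,
# i.e. a regression tree; minbucket stays at its default.
tree <- rpart(Life.Exp ~ ., data = data)

# The splitting variables appear in the tree's frame; "<leaf>" marks
# terminal nodes, so the remaining names are the variables in the tree.
unique(as.character(tree$frame$var))
```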
Set the minbucket parameter to 5 and recreate the tree. Which variables appear in this new tree?
Select all that apply.
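The same sketch with `minbucket = 5` (rpart forwards extra arguments like this to `rpart.control`), still assuming the built-in `state.x77` data:

```r
library(rpart)

data <- data.frame(state.x77)  # assumption: exam data is R's built-in state.x77

# A smaller minbucket allows leaves with as few as 5 observations,
# so the tree can grow deeper and may split on additional variables.
tree5 <- rpart(Life.Exp ~ ., data = data, minbucket = 5)
unique(as.character(tree5$frame$var))
```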
Recall the three previous trees: with default parameters, with minbucket = 5, and with cross-validation.
Given what you have learned about cross-validation, which of the three models would you expect to
perform best if we did use it for prediction on a test set? Circle the right option and provide a reason. For this
question, suppose we had actually set aside a few observations (states) in a test set and want to make
predictions for those states.
Reason:
The lower left leaf (or bucket) corresponds to the lowest predicted Life.Exp of 70. Observations in this
leaf correspond to states with area:
Case #2
A stock market is where buyers and sellers trade shares of a company, and is one of the most popular ways
for individuals and companies to invest money. The size of the world stock market is now estimated to be
in the trillions. The largest stock market in the world is the New York Stock Exchange (NYSE), located
in New York City. About 2,800 companies are listed on the NYSE. In this problem, we have the monthly
stock prices of five of these companies: IBM, General Electric (GE), Procter and Gamble, Coca Cola,
and Boeing.
The data frames are called "IBM", "GE", "ProcterGamble", "CocaCola", and "Boeing", respectively. Each
data frame has two variables, described as follows:
Date: the date of the stock price, always given as the first of the month.
StockPrice: the average stock price of the company in the given month.
All datasets have the same number of observations. Refer to the tables and reports from Appendix 2 to
solve Case #2.
What is the mean stock price of IBM over this time period?
What should you write to get the standard deviation of the stock
price of Procter & Gamble over this time period?
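The pattern is the same for any of the five data frames; below is a sketch on a tiny hypothetical stand-in data frame (the real `IBM` and `ProcterGamble` data are not reproduced here):

```r
# Hypothetical stand-in with the same two-column shape as the real data frames.
IBM <- data.frame(Date = as.Date(c("1970-01-01", "1970-02-01")),
                  StockPrice = c(360, 347))

mean(IBM$StockPrice)  # on the full data, summary(IBM) reports a mean of 144.38
sd(IBM$StockPrice)    # the same call on ProcterGamble$StockPrice answers this question
```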
Mock Midterm Exam – Eng. Analytics I
Around what year did Coca-Cola have its lowest stock price in this
time period?
In the time period shown in the plot for Coca Cola and Procter
and Gamble, which stock generally has lower values?
Which stock fell the most right after the technology bubble burst
in March 2000?
In the last two years of this time period (2004 and 2005) which
stock seems to be performing the best, in terms of increasing stock
price?
Appendix 1
> str(data)
'data.frame': 50 obs. of 8 variables:
$ Population: int 3615 365 2212 2110 21198 2541 3100 579 8277 4931 ...
$ Income : int 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : int 20 152 15 65 20 166 139 103 11 60 ...
$ Area : int 50708 566432 113417 51945 156361 103766 4862 1982 54090 ...
> cor(data)
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Population 1.00000000 0.2082276 0.10762237 -0.06805195 0.3436428 -0.09848975 -0.3321525 0.02254384
Income 0.20822756 1.0000000 -0.43707519 0.34025534 -0.2300776 0.61993232 0.2262822 0.36331544
Illiteracy 0.10762237 -0.4370752 1.00000000 -0.58847793 0.7029752 -0.65718861 -0.6719470 0.07726113
Life.Exp -0.06805195 0.3402553 -0.58847793 1.00000000 -0.7808458 0.58221620 0.2620680 -0.10733194
Murder 0.34364275 -0.2300776 0.70297520 -0.78084575 1.0000000 -0.48797102 -0.5388834 0.22839021
HS.Grad -0.09848975 0.6199323 -0.65718861 0.58221620 -0.4879710 1.00000000 0.3667797 0.33354187
Frost -0.33215245 0.2262822 -0.67194697 0.26206801 -0.5388834 0.36677970 1.0000000 0.05922910
Area 0.02254384 0.3633154 0.07726113 -0.10733194 0.2283902 0.33354187 0.0592291 1.00000000
Call:
rpart(formula = Life.Exp ~ ., data = data)
n= 50
> Predictions=predict(rpart(Life.Exp~.,data=data))
> sum((data$Life.Exp-Predictions)^2)
[1] 28.99848
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, data = data)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
Population 5.014e-05 2.512e-05 1.996 0.05201 .
Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> Predictions=predict(lm(Life.Exp~Population+Murder+HS.Grad+Frost,data=data))
> sum((data$Life.Exp-Predictions)^2)
[1] 23.30804
> fitcontrol=trainControl(method="cv",number=10)
> cartgrid=expand.grid(.cp=seq(0.01,0.2,0.01))
> train(Life.Exp~Area,data=data,method="rpart",trControl=fitcontrol,tuneGrid=cartgrid)
CART
50 samples, 1 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 46, 45, 44, 43, 46, 46, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01 1.231586 0.39845523 1.006585
0.02 1.243748 0.38909431 1.016449
0.03 1.243748 0.38909431 1.016449
0.04 1.246020 0.38659302 1.020503
0.05 1.246020 0.38659302 1.020503
0.06 1.257693 0.38542329 1.028342
0.07 1.284635 0.37765344 1.063739
0.08 1.295051 0.36309730 1.080484
0.09 1.295051 0.36309730 1.080484
0.10 1.295051 0.36309730 1.080484
0.11 1.393421 0.26634509 1.155116
0.12 1.393421 0.26634509 1.155116
0.13 1.393421 0.26634509 1.155116
0.14 1.408479 0.20191108 1.177991
0.15 1.391087 0.23690897 1.190359
0.16 1.391087 0.23690897 1.190359
0.17 1.368303 0.08957212 1.147769
0.18 1.369533 0.01795209 1.142731
0.19 1.356254 0.01795209 1.131768
0.20 1.328025 NaN 1.094530
RMSE was used to select the optimal model using the smallest value.
Call:
rpart(formula = Life.Exp ~ ., data = data, minbucket = 5)
n= 50
> Predictions=predict(rpart(Life.Exp~.,data=data,minbucket=5))
> sum((data$Life.Exp-Predictions)^2)
[1] 23.64283
Call:
rpart(formula = Life.Exp ~ ., data = data, cp = 0.12)
n= 50
> Predictions=predict(rpart(Life.Exp~.,data=data,cp=0.12))
> sum((data$Life.Exp-Predictions)^2)
[1] 32.86549
Call:
rpart(formula = Life.Exp ~ Area, data = data, minbucket = 1)
n= 50
> Predictions=predict(rpart(Life.Exp~Area,data=data,minbucket=1))
> sum((data$Life.Exp-Predictions)^2)
[1] 9.312442
> fitcontrol=trainControl(method="cv",number=10)
> cartgrid=expand.grid(.cp=seq(0.01,0.2,0.01))
> train(Life.Exp~.,data=data,method="rpart",trControl=fitcontrol,tuneGrid=cartgrid)
CART
50 samples, 7 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 46, 45, 44, 43, 46, 46, ...
Resampling results across tuning parameters:
RMSE was used to select the optimal model using the smallest value.
Call:
rpart(formula = Life.Exp ~ Area, data = data, cp = 0.01)
n= 50
> Predictions=predict(rpart(Life.Exp~Area,data=data,cp=0.01))
> sum((data$Life.Exp-Predictions)^2)
[1] 44.26817
Call:
lm(formula = Life.Exp ~ ., data = data)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***
Population 5.180e-05 2.919e-05 1.775 0.0832 .
Income -2.180e-05 2.444e-04 -0.089 0.9293
Illiteracy 3.382e-02 3.663e-01 0.092 0.9269
Murder -3.011e-01 4.662e-02 -6.459 8.68e-08 ***
HS.Grad 4.893e-02 2.332e-02 2.098 0.0420 *
Frost -5.735e-03 3.143e-03 -1.825 0.0752 .
Area -7.383e-08 1.668e-06 -0.044 0.9649
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> Predictions=predict(lm(Life.Exp~.,data=data))
> sum((data$Life.Exp-Predictions)^2)
[1] 23.29714
Appendix 2
> str(IBM)
'data.frame': 480 obs. of 2 variables:
$ Date : Date, format: "1970-01-01" "1970-02-01" ...
$ StockPrice: num 360 347 327 320 270 ...
> summary(IBM)
      Date              StockPrice
 Min.   :1970-01-01   Min.   : 43.40
 1st Qu.:1979-12-24   1st Qu.: 88.34
 Median :1989-12-16   Median :112.11
 Mean   :1989-12-15   Mean   :144.38
 3rd Qu.:1999-12-08   3rd Qu.:165.41
 Max.   :2009-12-01   Max.   :438.90
> summary(CocaCola)
      Date              StockPrice
 Min.   :1970-01-01   Min.   : 30.06
 1st Qu.:1979-12-24   1st Qu.: 42.76
 Median :1989-12-16   Median : 51.44
 Mean   :1989-12-15   Mean   : 60.03
 3rd Qu.:1999-12-08   3rd Qu.: 69.62
 Max.   :2009-12-01   Max.   :146.58
> summary(GE)
      Date              StockPrice
 Min.   :1970-01-01   Min.   :  9.294
 1st Qu.:1979-12-24   1st Qu.: 44.214
 Median :1989-12-16   Median : 55.812
 Mean   :1989-12-15   Mean   : 59.303
 3rd Qu.:1999-12-08   3rd Qu.: 72.226
 Max.   :2009-12-01   Max.   :156.844
> summary(Boeing)
      Date              StockPrice
 Min.   :1970-01-01   Min.   : 12.74
 1st Qu.:1979-12-24   1st Qu.: 34.64
 Median :1989-12-16   Median : 44.88
 Mean   :1989-12-15   Mean   : 46.59
 3rd Qu.:1999-12-08   3rd Qu.: 57.21
 Max.   :2009-12-01   Max.   :107.28
The next figure shows Date (x-axis) versus StockPrice (y-axis). The dashed line is Coca Cola; the solid line is
Procter and Gamble.
The next figure shows Date (x-axis) versus StockPrice (y-axis) from 1995 to 2005. The solid line is Procter
and Gamble, the dashed line is Coca Cola, the dotted line is IBM, the dotdash line is GE, and the
longdash line is Boeing.