
EIN6934 – Eng. Analytics I
Mock Midterm

Case #1
This dataset contains state data and has 50 observations (one for each US state) and the following 8
variables:
• Population - the population estimate of the state in 1975
• Income - per capita income in 1974
• Illiteracy - illiteracy rates in 1970, as a percent of the population
• Life.Exp - the life expectancy in years of residents of the state in 1970
• Murder - the murder and non-negligent manslaughter rate per 100,000 population in 1976
• HS.Grad - percent of high-school graduates in 1970
• Frost - the mean number of days with minimum temperature below freezing from 1931–1960 in
the capital or a large city of the state
• Area - the land area (in square miles) of the state

Refer to the tables and reports from Appendix 1 to solve Case #1.

What is the adjusted R-squared of the model? (Use all predictors)

Calculate the sum of squared errors (SSE) between the predicted life
expectancies using a linear regression model and the actual life
expectancies (Use all predictors)
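The variable names and values in Appendix 1 match R's built-in state.x77 dataset; assuming that is the source of the exam data, the full-model SSE can be sketched as follows (names are cleaned so that "Life Exp" becomes Life.Exp, as in the appendix):

```r
# Assumption: the exam dataset is R's built-in state.x77, with names cleaned
data <- as.data.frame(state.x77)
names(data) <- make.names(names(data))   # "Life Exp" -> "Life.Exp", "HS Grad" -> "HS.Grad"

# Fit a linear regression using all predictors
model <- lm(Life.Exp ~ ., data = data)

# SSE = sum of squared differences between actual and predicted values
sse <- sum((data$Life.Exp - predict(model))^2)
sse   # ~23.297, matching the value reported in Appendix 1
```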

What is the Multiple R-squared of the model? (Use the best 4 variables as predictors)

Calculate the sum of squared errors (SSE) between the predicted life
expectancies using this model and the actual life expectancies (use
the best 4 variables as predictors)

Which of the following is correct? Circle the correct option.


a) Trying different combinations of variables in linear regression is like trying different numbers of
splits in a tree - this controls the complexity of the model.
b) Using many variables in a linear regression is always better than using just a few.
c) The variables we removed were uncorrelated with Life.Exp

Use a CART model to predict Life.Exp using all of the other variables as independent variables. Use the
default minbucket parameter. We are not as interested in predicting life expectancies for new
observations, so use all of the data to build the model. You shouldn't use the method="class" argument,
since this is a regression tree. Which of these variables appear in the tree? Circle all that apply.

Population Murder Frost HS.Grad Area
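A sketch of how the tree above can be built and inspected with the rpart package (again assuming the exam data come from R's built-in state.x77; rpart is a recommended package shipped with standard R installs):

```r
library(rpart)

# Assumption: same dataset as Appendix 1
data <- as.data.frame(state.x77)
names(data) <- make.names(names(data))

# Regression tree with default parameters; no method="class" for a regression tree
tree <- rpart(Life.Exp ~ ., data = data)

# Variables used for splits appear in tree$frame$var (leaves show as "<leaf>")
unique(as.character(tree$frame$var))

# SSE of the tree's predictions, cf. Appendix 1
sum((data$Life.Exp - predict(tree))^2)   # ~28.998
```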


What is the SSE?

How does the SSE compare with that of the linear regression models?

Set the minbucket parameter to 5 and recreate the tree. Which variables appear in this new tree?
Select all that apply.

Population Murder Frost HS.Grad Area

Do you think the default minbucket parameter is smaller or larger than 5, based on the tree that was built? Why?

Create a tree that predicts Life.Exp using only Area, with the minbucket parameter set to 1.
What is the SSE of this newest tree?

A rule of thumb is that simpler models are more interpretable
and generalizable. We will now tune our regression tree to see
if we can improve the fit of our tree while keeping it as simple
as possible. We used cp varying over the range 0.01 to 0.20 in
increments of 0.01. The train function determines the
best cp value for a CART model using all of the available
independent variables and the entire dataset. What value of cp does the train function recommend?
(Remember that the train function tells you to pick the largest value of cp with the lowest error when there
are ties, and explains this at the bottom of the output.)
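The cross-validated tuning described above can be sketched as follows, mirroring the caret code shown in Appendix 1 (this assumes the caret package is installed and that the data come from state.x77; the cross-validation folds are random, so results will vary slightly from run to run):

```r
library(caret)   # assumption: caret is installed

data <- as.data.frame(state.x77)
names(data) <- make.names(names(data))

set.seed(1)   # fix the random fold assignment for reproducibility

# 10-fold cross-validation over cp = 0.01, 0.02, ..., 0.20
fitcontrol <- trainControl(method = "cv", number = 10)
cartgrid   <- expand.grid(.cp = seq(0.01, 0.20, 0.01))

# train reports RMSE per cp and selects the model with the smallest value
train(Life.Exp ~ ., data = data, method = "rpart",
      trControl = fitcontrol, tuneGrid = cartgrid)
```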

Recall the three previous trees: with default parameters, with minbucket = 5, and with cross-validation.
Given what you have learned about cross-validation, which of the three models would you expect to
perform best if we used it for prediction on a test set? Circle the right option and provide a reason. For this
question, suppose we had actually set aside a few observations (states) in a test set, and we want to make
predictions on those states.
Reason:

The first model
The second model
The model we just made with the "best" cp
In a previous question, we made a very complex tree using just Area. Now we use train with the same
parameters as before but just using Area as an independent variable to find the best cp value. We built a
new tree using just Area and this value of cp.

How many splits does the tree have?

The lower left leaf (or bucket) corresponds to the lowest predicted Life.Exp of 70. Observations in this
leaf correspond to states with area:

greater than or equal to _________ and area less than___________

Case #2
A stock market is where buyers and sellers trade shares of a company, and is one of the most popular ways
for individuals and companies to invest money. The size of the world stock market is now estimated to be
in the trillions. The largest stock market in the world is the New York Stock Exchange (NYSE), located
in New York City. About 2,800 companies are listed on the NYSE. In this problem, we have the monthly
stock prices of five of these companies: IBM, General Electric (GE), Procter and Gamble, Coca Cola,
and Boeing.

Data frames are called: "IBM", "GE", "ProcterGamble", "CocaCola", and "Boeing", respectively. Each
data frame has two variables, described as follows:

Date: the date of the stock price, always given as the first of the month.
StockPrice: the average stock price of the company in the given month.

All datasets have the same number of observations. Refer to the tables and reports from Appendix 2 to
solve Case #2.

How many observations are there in each data set?

What is the earliest year in our datasets?

What is the mean stock price of IBM over this time period?

What is the minimum stock price of General Electric (GE) over this time period?

What should you write to get the standard deviation of the stock
price of Procter & Gamble over this time period?

Around what year did Coca-Cola have its lowest stock price in this
time period?

In March of 2000, the technology bubble burst, and a stock market
crash occurred. Between Coca Cola and Procter and Gamble,
which company's stock dropped more?

Around 1983, the stock for one of these companies (Coca-Cola or
Procter and Gamble) was going up, while the other was going
down. Which one was going up?

In the time period shown in the plot for Coca Cola and Procter
and Gamble, which stock generally has lower values?

Which stock fell the most right after the technology bubble burst
in March 2000?

In October of 1997, there was a global stock market crash that
was caused by an economic crisis in Asia. Comparing September
1997 to November 1997, which companies saw a decreasing trend
in their stock price? (Indicate all that apply.)

In the last two years of this time period (2004 and 2005) which
stock seems to be performing the best, in terms of increasing stock
price?

For IBM, compare the monthly averages to the overall average
stock price. In which months has IBM historically had a higher
stock price (on average)? Indicate all that apply.

General Electric and Coca-Cola both have their highest average
stock price in the same month. Which month is this, and what are
their highest average values?

For the months of December and January, every company's
average stock price is higher in one month and lower in the other. In
which month are the stock prices lower?
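The monthly averages used in the questions above come from grouping prices by calendar month, as in the tapply calls shown in Appendix 2. A minimal self-contained sketch with hypothetical data (not the exam prices):

```r
# Hypothetical monthly price series; the real exam data are in Appendix 2
dates  <- seq(as.Date("2000-01-01"), by = "month", length.out = 36)
prices <- data.frame(Date = dates, StockPrice = seq_along(dates))

# Average stock price per calendar month; months() returns month names,
# so the result has one entry for each of the 12 months
monthly <- tapply(prices$StockPrice, months(prices$Date), mean)
length(monthly)   # 12
```

Comparing each entry of the result against the overall mean (mean(prices$StockPrice)) is the pattern used for the IBM question above.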

Appendix 1

> str(data)
'data.frame': 50 obs. of 8 variables:
$ Population: int 3615 365 2212 2110 21198 2541 3100 579 8277 4931 ...
$ Income : int 3624 6315 4530 3378 5114 4884 5348 4809 4815 4091 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : int 20 152 15 65 20 166 139 103 11 60 ...
$ Area : int 50708 566432 113417 51945 156361 103766 4862 1982 54090 ...

> cor(data)
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Population 1.00000000 0.2082276 0.10762237 -0.06805195 0.3436428 -0.09848975 -0.3321525 0.02254384
Income 0.20822756 1.0000000 -0.43707519 0.34025534 -0.2300776 0.61993232 0.2262822 0.36331544
Illiteracy 0.10762237 -0.4370752 1.00000000 -0.58847793 0.7029752 -0.65718861 -0.6719470 0.07726113
Life.Exp -0.06805195 0.3402553 -0.58847793 1.00000000 -0.7808458 0.58221620 0.2620680 -0.10733194
Murder 0.34364275 -0.2300776 0.70297520 -0.78084575 1.0000000 -0.48797102 -0.5388834 0.22839021
HS.Grad -0.09848975 0.6199323 -0.65718861 0.58221620 -0.4879710 1.00000000 0.3667797 0.33354187
Frost -0.33215245 0.2262822 -0.67194697 0.26206801 -0.5388834 0.36677970 1.0000000 0.05922910
Area 0.02254384 0.3633154 0.07726113 -0.10733194 0.2283902 0.33354187 0.0592291 1.00000000

Call:
rpart(formula = Life.Exp ~ ., data = data)
n= 50

> Predictions=predict(rpart(Life.Exp~.,data=data))
> sum((data$Life.Exp-Predictions)^2)

[1] 28.99848

lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, data = data)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
Population 5.014e-05 2.512e-05 1.996 0.05201 .
Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7197 on 45 degrees of freedom


Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
> Predictions=predict(lm(Life.Exp~Population+Murder+HS.Grad+Frost,data=data))
> sum((data$Life.Exp-Predictions)^2)

[1] 23.30804

> fitcontrol=trainControl(method="cv",number=10)
> cartgrid=expand.grid(.cp=seq(0.01,0.2,0.01))
> train(Life.Exp~Area,data=data,method="rpart",trControl=fitcontrol,tuneGrid=cartg
rid)
CART
50 samples, 1 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 46, 45, 44, 43, 46, 46, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01 1.231586 0.39845523 1.006585
0.02 1.243748 0.38909431 1.016449
0.03 1.243748 0.38909431 1.016449
0.04 1.246020 0.38659302 1.020503
0.05 1.246020 0.38659302 1.020503
0.06 1.257693 0.38542329 1.028342
0.07 1.284635 0.37765344 1.063739
0.08 1.295051 0.36309730 1.080484
0.09 1.295051 0.36309730 1.080484
0.10 1.295051 0.36309730 1.080484
0.11 1.393421 0.26634509 1.155116
0.12 1.393421 0.26634509 1.155116
0.13 1.393421 0.26634509 1.155116
0.14 1.408479 0.20191108 1.177991
0.15 1.391087 0.23690897 1.190359
0.16 1.391087 0.23690897 1.190359
0.17 1.368303 0.08957212 1.147769
0.18 1.369533 0.01795209 1.142731
0.19 1.356254 0.01795209 1.131768
0.20 1.328025 NaN 1.094530

RMSE was used to select the optimal model using the smallest value.

Call:
rpart(formula = Life.Exp ~ ., data = data, minbucket = 5)
n= 50

> Predictions=predict(rpart(Life.Exp~.,data=data,minbucket=5))
> sum((data$Life.Exp-Predictions)^2)

[1] 23.64283

Call:
rpart(formula = Life.Exp ~ ., data = data, cp = 0.12)
n= 50

> Predictions=predict(rpart(Life.Exp~.,data=data,cp=0.12))
> sum((data$Life.Exp-Predictions)^2)

[1] 32.86549

Call:
rpart(formula = Life.Exp ~ Area, data = data, minbucket = 1)
n= 50

> Predictions=predict(rpart(Life.Exp~Area,data=data,minbucket=1))
> sum((data$Life.Exp-Predictions)^2)

[1] 9.312442

> fitcontrol=trainControl(method="cv",number=10)
> cartgrid=expand.grid(.cp=seq(0.01,0.2,0.01))
> train(Life.Exp~.,data=data,method="rpart",trControl=fitcontrol,tuneGrid=cartgrid)
CART
50 samples, 7 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 46, 45, 44, 43, 46, 46, ...
Resampling results across tuning parameters:

cp RMSE Rsquared MAE


0.01 1.1375138 0.4734757 0.9362281
0.02 1.1375138 0.4734757 0.9362281
0.03 1.1375138 0.4734757 0.9362281
0.04 1.1231367 0.4734757 0.9286572
0.05 1.0817092 0.4923520 0.8907357
0.06 1.0709561 0.4923520 0.8822194
0.07 1.0149896 0.5275358 0.8371557
0.08 0.9933235 0.5436269 0.8117516
0.09 0.9933235 0.5436269 0.8117516
0.10 0.9933235 0.5436269 0.8117516
0.11 0.9933235 0.5436269 0.8117516
0.12 0.9933235 0.5436269 0.8117516
0.13 1.0568377 0.4988101 0.8515894
0.14 1.1221457 0.4758470 0.8944743
0.15 1.1399888 0.4642198 0.9140933
0.16 1.1244999 0.4642198 0.8983067
0.17 1.1244999 0.4642198 0.8983067
0.18 1.1463316 0.4642198 0.9112852
0.19 1.1463316 0.4642198 0.9112852
0.20 1.1463316 0.4642198 0.9112852

RMSE was used to select the optimal model using the smallest value.

Call:
rpart(formula = Life.Exp ~ Area, data = data, cp = 0.01)
n= 50

> Predictions=predict(rpart(Life.Exp~Area,data=data,cp=0.01))
> sum((data$Life.Exp-Predictions)^2)

[1] 44.26817

Call:
lm(formula = Life.Exp ~ ., data = data)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***
Population 5.180e-05 2.919e-05 1.775 0.0832 .
Income -2.180e-05 2.444e-04 -0.089 0.9293
Illiteracy 3.382e-02 3.663e-01 0.092 0.9269
Murder -3.011e-01 4.662e-02 -6.459 8.68e-08 ***
HS.Grad 4.893e-02 2.332e-02 2.098 0.0420 *
Frost -5.735e-03 3.143e-03 -1.825 0.0752 .
Area -7.383e-08 1.668e-06 -0.044 0.9649
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7448 on 42 degrees of freedom


Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922
F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10

> Predictions=predict(lm(Life.Exp~.,data=data))
> sum((data$Life.Exp-Predictions)^2)

[1] 23.29714

Appendix 2

> str(IBM)
'data.frame': 480 obs. of 2 variables:
$ Date : Date, format: "1970-01-01" "1970-02-01" ...
$ StockPrice: num 360 347 327 320 270 ...
> summary(IBM)
      Date              StockPrice
 Min.   :1970-01-01   Min.   : 43.40
 1st Qu.:1979-12-24   1st Qu.: 88.34
 Median :1989-12-16   Median :112.11
 Mean   :1989-12-15   Mean   :144.38
 3rd Qu.:1999-12-08   3rd Qu.:165.41
 Max.   :2009-12-01   Max.   :438.90

> summary(CocaCola)
      Date              StockPrice
 Min.   :1970-01-01   Min.   : 30.06
 1st Qu.:1979-12-24   1st Qu.: 42.76
 Median :1989-12-16   Median : 51.44
 Mean   :1989-12-15   Mean   : 60.03
 3rd Qu.:1999-12-08   3rd Qu.: 69.62
 Max.   :2009-12-01   Max.   :146.58

> summary(GE)
      Date              StockPrice
 Min.   :1970-01-01   Min.   :  9.294
 1st Qu.:1979-12-24   1st Qu.: 44.214
 Median :1989-12-16   Median : 55.812
 Mean   :1989-12-15   Mean   : 59.303
 3rd Qu.:1999-12-08   3rd Qu.: 72.226
 Max.   :2009-12-01   Max.   :156.844

> summary(Boeing)
      Date              StockPrice
 Min.   :1970-01-01   Min.   : 12.74
 1st Qu.:1979-12-24   1st Qu.: 34.64
 Median :1989-12-16   Median : 44.88
 Mean   :1989-12-15   Mean   : 46.59
 3rd Qu.:1999-12-08   3rd Qu.: 57.21
 Max.   :2009-12-01   Max.   :107.28

> tapply(IBM$StockPrice, months(IBM$Date), mean)
April August December February January July June March
152.1168 140.1455 140.7593 152.6940 150.2384 139.0670 139.0907 152.4327
May November October September
151.5022 138.0187 137.3466 139.0885

> tapply(Boeing$StockPrice, months(Boeing$Date), mean)
April August December February January July June March
47.04686 46.86311 46.17315 46.89223 46.51097 46.55360 47.38525 46.88208
May November October September
48.13716 45.14990 45.21603 46.30485

> tapply(GE$StockPrice, months(GE$Date), mean)
April August December February January July June March
64.48009 56.50315 59.10217 62.52080 62.04511 56.73349 56.46844 63.15055
May November October September
60.87135 57.28879 56.23897 56.23913

> tapply(ProcterGamble$StockPrice, months(ProcterGamble$Date), mean)
April August December February January July June March
77.68671 76.82266 78.29661 79.02575 79.61798 76.64556 77.39275 77.34761
May November October September
77.85958 78.45610 76.67903 76.62385

> tapply(CocaCola$StockPrice, months(CocaCola$Date), mean)
April August December February January July June March
62.68888 58.88014 59.73223 60.73475 60.36849 58.98346 60.81208 62.07135
May November October September
61.44358 59.10268 57.93887 57.60024

The next figure shows Dates (x) versus Stock price (y). The dashed line is Coca Cola and the solid line is
Procter and Gamble.

The next figure shows Dates (x) versus Stock price (y) from 1995 to 2005. The solid line is Procter &
Gamble, the dashed line is Coca Cola, the dotted line is IBM, the dotdash line is GE, and the
longdash line is Boeing.

