
DSC 2008 Business Analytics: Data and Decisions, Tutorial 5

This is a long tutorial. Try to do it with members of your Study Team (Project
Group). Please read the helpful last page of the answers when it is available for
download; it will wrap up this half module for you.
Q2 & Q4 are for tutorial discussions; answers for the rest are already included.
Again, BACT offers useful consulting before each tutorial class.
Please be reminded that Excel will not be tested in the Final Exam.
(1) Baseball team owners want to attract fans to games. The New York Mets
acquired the pitcher Pedro Martinez in 2005. Martinez is considered one of
the best pitchers of his era, having won the Cy Young Award three times.
Martinez had his own fans. Possibly, he attracted more fans to the ballpark
when he pitched at home, helping to justify his 4-year, US$53 million contract.
Is there really a Pedro effect in attendance? The data for the Mets home
games of the 2005 season are in Tut5-Q1-Baseball_attendance.xlsx. Perform
the regression of attendance on Weekend, Yankees, Rain Delay, Opening Day,
and Pedro Start.
Note that Weekend includes Friday, and that Rain Delay and Opening
Day each have only one observation equal to 1. We may let the variables-selection
procedure decide whether this matters.
See Tut5-Q1-RegressBaseball_attendance.xlsm, with the solution at Tut5-Q1RegressBaseball_attendanceSol.xlsm (no variable dropped), which shows Pedro
Start as the second to last in importance.
(a) All of these predictors are of a special kind. What are they called?
Indicator or dummy variables.
(b) What is the interpretation of the coefficient for Pedro Start?
On days when Pedro starts, about 5428 more fans come to the
ballpark on average, after allowing for the effects of games being on weekends, against
the Yankees, on opening day, or on a day with a rain delay.
(c) If we are primarily interested in Pedro's effect on attendance, why is it
important to have the other variables in the model?
The other variables remove effects that could disguise the true effect of
Pedro's starts on attendance.
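The point in (c) can be illustrated with a minimal sketch. The data below are hypothetical and noise-free (they are not the Mets data): Pedro is assumed to start disproportionately often on weekends, with assumed true effects of 8000 fans for a weekend game and 5000 for a Pedro start. The helper `ols` is a name introduced here for illustration only.

```python
import numpy as np

# Hypothetical, noise-free toy season (NOT the Mets data): 20 home games.
weekend = np.array([1] * 8 + [0] * 12)
# Assumption: Pedro starts 4 of the 8 weekend games but only 2 of the 12 weekday games.
pedro = np.array([1, 1, 1, 1, 0, 0, 0, 0] + [1, 1] + [0] * 10)
# Assumed true effects: 8000 extra fans on weekends, 5000 for a Pedro start.
attendance = 30000 + 8000 * weekend + 5000 * pedro


def ols(y, *columns):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


naive_pedro = ols(attendance, pedro)[1]              # Weekend left out
adjusted_pedro = ols(attendance, weekend, pedro)[2]  # Weekend included

print(f"Pedro coefficient without Weekend: {naive_pedro:.0f}")
print(f"Pedro coefficient with Weekend:    {adjusted_pedro:.0f}")
```

In this toy setup the regression that omits Weekend credits Pedro with roughly 8000 fans, because part of the weekend effect is attributed to him; including Weekend recovers the assumed 5000.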
(d) Could Pedro's agent claim, based on this regression, that his man attracts
more fans to the ballpark? What statistics should he cite?
Yes, the coefficient for Pedro Start is more than 5400, estimating an
additional 5400 fans when Pedro pitches, and its P-value for the coefficient
is 0.0088, which is quite small. We can be quite confident that when Pedro
starts, attendance increases.
(e) Are there ways to improve the model, and thus arrive at a different
conclusion?
If we construct dummies for Tues, Wed, Thurs, Fri and Sat, together with
the original dummy for Weekend (Tut5-Q1Baseball_attendance(answer).xlsx; we can't include Sun since the Weekend
dummy is the sum of the Fri, Sat and Sun dummies; Mon is taken as the
base/reference), then we have 10 dummies as Xs: Tut5-Q1RegressBaseball_attendance(answer).xlsm. Note that the Yankees only played
on 1 weekend; we use the last two interaction terms involving Yankees to
see if it mattered on which day during that weekend the Yankees played.
If we use variable selection, we arrive at a better model, with a Pedro
Start coefficient of only 4756 and a P-value of 0.015: not as dramatic, but
perhaps closer to the truth (Tut5-Q1RegressBaseball_attendance(answer)Sol.xlsm, after discovering that
combining Tues, Wed and Thurs reduced Significance F). [After variables
selection, if the coefficients of 2 dummies from the same categorical
variable are close, we can try to replace the 2 dummies by their sum (a
new dummy for both categories) and see if the model improves
(Significance F drops).]
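The idea of replacing two dummies by their sum can be sketched as follows. The six data points are hypothetical (not from the tutorial workbooks) and are chosen so that categories A and B have exactly the same mean; `overall_f` is a helper name introduced here. It computes the overall F statistic, whose p-value is what Excel reports as Significance F.

```python
import numpy as np

# Hypothetical toy data: 2 observations in the base category and 2 each in
# categories A and B, whose group means happen to be identical (both 20).
y = np.array([10.0, 12.0, 19.0, 21.0, 18.0, 22.0])
d_a = np.array([0, 0, 1, 1, 0, 0])
d_b = np.array([0, 0, 0, 0, 1, 1])


def overall_f(y, *dummies):
    """Overall F statistic of y regressed on an intercept plus the dummies."""
    X = np.column_stack([np.ones(len(y)), *dummies])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)        # residual sum of squares
    ssr = np.sum((y - y.mean()) ** 2) - sse  # regression sum of squares
    k = len(dummies)
    return (ssr / k) / (sse / (len(y) - k - 1))


f_separate = overall_f(y, d_a, d_b)  # two dummies: F on (2, 3) degrees of freedom
f_merged = overall_f(y, d_a + d_b)   # one pooled dummy: F on (1, 4) degrees of freedom
print(f_separate, f_merged)
```

Because the two categories share the same mean here, pooling them leaves the fit unchanged while using one fewer regressor, so the F statistic rises. With real data the coefficients are only approximately equal, so the improvement is not guaranteed and has to be checked.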
Tut5-Q1-RegressBaseball_attendance(answer)3a.xlsm was probably the
cleanest start, yielding the cleanest model in Tut5-Q1RegressBaseball_attendance(answer)3aSol.xlsm which clearly suggests
that Pedro only drew extra fans on Wednesdays and Saturdays.
Further efforts using an alternate (turned out to be the best) start Tut5-Q1RegressBaseball_attendance(answer)4.xlsm finally yielded Tut5-Q1RegressBaseball_attendance(answer)4Sol.xlsm (the best model overall)
which also suggests that Pedro only made a difference on Wednesdays and
Saturdays, but then he was good for another 10,000 fans!
However, it's not clear that such intense data-mining is justified on such
a small dataset.
So, what do you think Pedro's true worth is?
(2) Does the cost of making a movie depend on its audience? Using Tut5-Q2Movie_ratings.xlsx, plots were done in Tut5-Q2-Movie_ratings(plot).xlsx and
reproduced below.

[Figure: scatterplot of Budget (vertical axis, 0 to 180) against Run Time (horizontal axis, 60 to 180), with points by rating (R, PG-13, PG, G) and a linear trend line for each rating group.]

Movies with an R rating are coloured blue, those with a PG-13 rating are red,
those with a PG rating are green, and those with G are black. Trend lines are
also added for each group, using the same colours. (It is instructive to try to
reproduce the above plot in Excel.)
(a) Looking at the plots, in what ways is the relationship between run times
and budgets similar for the first 3 ratings groups?
(b) Looking at the plots, how do the costs of R-rated movies differ from those
of PG-13 and PG rated movies? Discuss both the slopes and the intercepts.
(c) The film King Kong, with a run time of 187 minutes, is the red point sitting
at the lower right. If it were omitted from this analysis, how might that
change your conclusions about PG-13 movies?
(d) Run the regression that is the equivalent of the plots; i.e., use interactions
with dummy variables. See if the model could be improved by omitting
some variables. Also investigate the effect of deleting the King-Kong case.
Discuss your findings.
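As a starting point for (d), the following minimal sketch shows how dummy variables and their interactions with Run Time reproduce one trend line per rating group inside a single regression. The nine data points are hypothetical (not the Movie_ratings file), only three ratings are used to keep it short, and taking R as the base category is an assumption made here.

```python
import numpy as np

# Hypothetical toy data (NOT the Movie_ratings file).
rating = np.array(["R", "R", "R", "PG-13", "PG-13", "PG-13", "PG", "PG", "PG"])
run_time = np.array([90.0, 110.0, 130.0, 95.0, 120.0, 150.0, 85.0, 100.0, 115.0])
budget = np.array([20.0, 35.0, 45.0, 40.0, 70.0, 95.0, 30.0, 42.0, 60.0])

base = "R"                # assumed reference category: it gets no dummy
others = ["PG-13", "PG"]
dummies = {r: (rating == r).astype(float) for r in others}

# Columns: intercept, Run Time, one dummy per non-base rating (shifts the
# intercept), one dummy * Run Time interaction per rating (shifts the slope).
X = np.column_stack(
    [np.ones(len(budget)), run_time]
    + [dummies[r] for r in others]
    + [dummies[r] * run_time for r in others]
)
coef, *_ = np.linalg.lstsq(X, budget, rcond=None)

# Slope for each group implied by the single regression.
slopes = {base: coef[1]}
for i, r in enumerate(others):
    slopes[r] = coef[1] + coef[2 + len(others) + i]

# The same slopes from a separate straight-line fit per group (the trend lines).
trend = {r: np.polyfit(run_time[rating == r], budget[rating == r], 1)[0]
         for r in [base] + others}

for r in [base] + others:
    print(r, round(float(slopes[r]), 4), round(float(trend[r]), 4))
```

The interaction coefficients measure how each group's slope differs from the base group's slope, which is what makes it possible to test whether a separate slope or intercept is needed at all.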
(3) The Texas Transportation Institute (tti.tamu.edu) studies traffic delays. They
estimate that in the year 2000, the 75 largest metropolitan areas experienced
3.6 billion vehicle hours of delay, resulting in 5.7 billion gallons of wasted fuel
and US$67.5 billion in lost productivity. That's about 0.7% of the U.S.'s GDP that
year. Tut5-Q3-Traffic_delays.xlsx, published by the Institute for the year 2001,
includes information on the Total Delay per Person (hours per year spent
delayed by traffic), the Average Arterial Road Speed (mph), the Average
Highway Road Speed (mph), and the Size of the city (Small, Medium, Large,
Very Large).
(a) Using Medium as the reference case, run the regression for Delay Per Person.
Why can't we include the dummies for all 4 sizes of cities?
This set of indicators uses Medium as the base case. The coefficients of
Small, Large and Very Large estimate the average change in the amount of
Delay/Person relative to the amount for Medium size cities. If there were
an indicator for Medium as well, the four dummies would sum to 1 for every
city and so be (perfectly) collinear with the intercept, and the coefficients
could not be uniquely determined.
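The collinearity can be checked numerically. This is a minimal sketch on hypothetical city sizes (not the Traffic_delays file): with an intercept and all four dummies, the design matrix has more columns than its rank, whereas dropping the Medium dummy restores full column rank.

```python
import numpy as np

# Hypothetical city sizes (NOT the Traffic_delays file).
size = np.array(["Small", "Medium", "Large", "Very Large",
                 "Medium", "Large", "Small", "Very Large"])
levels = ["Small", "Medium", "Large", "Very Large"]

intercept = np.ones(len(size))
all_dummies = np.column_stack([(size == s).astype(float) for s in levels])

# The four dummies sum to 1 in every row, i.e. they reproduce the intercept.
rows_sum_to_one = np.allclose(all_dummies.sum(axis=1), intercept)

X_all = np.column_stack([intercept, all_dummies])                # 5 columns
X_ref = np.column_stack([intercept, all_dummies[:, [0, 2, 3]]])  # Medium dropped

rank_all = np.linalg.matrix_rank(X_all)
rank_ref = np.linalg.matrix_rank(X_ref)
print(rows_sum_to_one)
print(rank_all, "of", X_all.shape[1], "columns")
print(rank_ref, "of", X_ref.shape[1], "columns")
```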
(b) Explain how the coefficients of Small, Large and Very Large account for the
size of the city in this model.
The coefficient for Small says that traffic delays are about 3.6 hours per
person per year lower in Small cities than in Medium cities, after allowing
for the effects of highway and arterial speeds; similarly for the coefficients
of Large and Very Large.
(c) See if including suitable transformations of the explanatory variables will
improve the fit of the model.
First, without any transformation. Starting from Tut5-Q3-RegressTraffic_delays.xlsm, if we apply backward stepwise regression using as
criterion the p-value for the F-stat (PR>F or Significance F), we find
that only Large is different from Medium; i.e., Small and Very Large, since
they were deleted, join Medium as a group (Tut5-Q3-RegressTraffic_delaysSol.xlsm). Is there a good explanation for this?
Now we add the transformations sq, sqrt, inverse & log10. This bit of data
mining, starting with 43 variables (Tut5-Q3-Traffic_delays(answer).xlsx and
Tut5-Q3-Regress-Traffic_delays2.xlsm) instead of just the original 5, shows
that a better model (Tut5-Q3-Regress-Traffic_delays2Sol.xlsm) has only the
3 variables 1/(HiWay MPH), 1/(Arterial MPH) and Large*(1/(Arterial MPH)),
after all the other variables (including the many transformations) were
deleted.
This suggests that delays are actually inversely proportional to the speeds
(hours per mile more appropriate than miles per hour), with a different
1/(Arterial MPH) slope for Large cities (but the same intercept, since Large
itself is not in the model). Note that the improvement in Significance F,
from 3.8E-21 (using the original 5 Xs) to 2.6E-21, is small though. Is it
worth the effort?
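Written out, the three-variable model described above has the form below. The coefficient symbols are notation introduced here for illustration; they do not appear in the workbooks.

```latex
\[
\text{Delay} = \beta_0
  + \beta_1 \cdot \frac{1}{\text{HiWay MPH}}
  + \beta_2 \cdot \frac{1}{\text{Arterial MPH}}
  + \beta_3 \cdot \left( \text{Large} \times \frac{1}{\text{Arterial MPH}} \right)
  + \varepsilon
\]
```

For Large cities (Large = 1) the slope on 1/(Arterial MPH) is \beta_2 + \beta_3, while for all other cities it is \beta_2. The intercept \beta_0 is shared by every city size because Large does not enter the model on its own.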
You would think that the inverse relationships would have been obvious
from plots of Delay vs HiWay MPH and Delay vs Arterial MPH. Look at the 2
plots at Cell BC55 of Tut5-Q3-Regress-Traffic_delays2.xlsm. Do they
suggest any inverse relationship? Goes to show how (un)informative is a
plot of y vs x (very difficult to discern any curvature in either scatterplot).

The 2 upper plots at Cell BC37 further show that, for the ranges involved,
1/MPH isn't much different from MPH, since both plots are almost linear
(especially the one for Arterial MPH). No wonder we can't tell the
difference! Apparently, however, SSS variables selection can make out the
difference. And now it even seems obvious, since the y-variable Delay, a
measure of time, clearly has more to do with hours-per-mile than its
inverse, MPH. But see the next model.
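The near-linearity of 1/MPH against MPH over a narrow range can be checked with a short sketch. The speed ranges below are assumptions chosen purely for illustration; they are not read from the data set.

```python
import numpy as np

# Assumed, purely illustrative narrow speed range in mph (not from the data).
mph = np.linspace(25.0, 35.0, 50)
hours_per_mile = 1.0 / mph

# Correlation between MPH and 1/MPH over the narrow range.
r_narrow = np.corrcoef(mph, hours_per_mile)[0, 1]

# Over a much wider assumed range the curvature of 1/x becomes visible.
wide = np.linspace(5.0, 65.0, 50)
r_wide = np.corrcoef(wide, 1.0 / wide)[0, 1]

print(f"narrow range: {r_narrow:.4f}")
print(f"wide range:   {r_wide:.4f}")
```

Over the narrow range the correlation is very close to -1, so a scatterplot against MPH and one against 1/MPH look almost like mirror images of each other; over the wider range the relationship is visibly curved.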
Starting from Tut5-Q3-Regress-Traffic_delays3.xlsm (with the same 68
variables as before, but in a different order), an even better theoretical
model is Tut5-Q3-Regress-Traffic_delays3Sol.xlsm with Significance F of
2.4E-21 (previous best was 2.6E-21), but it is harder to explain. Is Arterial
MPH untransformed because the plot had earlier shown it to be almost
linear with 1/Arterial MPH?
All in all, modelling can be confusing, although we must still finally present
a convincing summary to executives.
(d) How does this Question relate to the situation in Singapore?
Widening the perspective: in Singapore's context, this is the COE quota
problem: a balance between national GDP loss (as a result of jams) and
individuals' freedom to own cars (which might find expression in voting
behaviour).
(4) Case 10.2 (4e: p. 597, 5e: p. 530, 6e: p. 480): the data set Tut5-Q4-MidCity.xls has
data for 128 homes. Build the best regression model that you know how for
predicting Price. (Hints: try transformations of Xs; a brick house in
neighbourhood 3 might be special. ;) It's possible to achieve a Significance F
of 4.70E-51.)
