
This note is incomplete and subject to errors, mostly typos.

Chapter 6

In order to assess the relationship between a categorical variable and a quantitative dependent
variable, we introduce dummy (indicator) variables into the regression model.
A dummy variable assumes values 0 and 1, representing the absence and presence of a certain
characteristic, respectively.

For example, for a gender variable, we can define a dummy variable either as

    D_i = 1 if the i-th person is female, and D_i = 0 if the i-th person is male,      (6.1)

or as

    D_i = 1 if the i-th person is male, and D_i = 0 if the i-th person is female.
Consider the following model

    Y_i = β_0 + β_1 D_i + ε_i,

where Y is the annual expenditure on food and D is the dummy variable as defined in (6.1).

Then the mean food expenditure for males is

    E(Y_i | D_i = 0) = β_0,

and that for females is

    E(Y_i | D_i = 1) = β_0 + β_1.

We note that the slope coefficient β_1 associated with the dummy variable represents by how much
the mean of Y (food expenditure) for females differs from that for males.
It is often called the differential intercept coefficient.
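
As a quick numerical check (a minimal Python sketch with made-up expenditure figures, not data from the course), fitting the model with only the dummy variable reproduces the two group means and their difference:

import numpy as np

# hypothetical annual food expenditures; D = 1 for female, 0 for male, as in (6.1)
y = np.array([22.0, 25.0, 24.0, 23.0, 30.0, 33.0, 31.0, 32.0])
D = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# least-squares fit of Y = b0 + b1*D
X = np.column_stack([np.ones_like(D), D])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(b0, y[D == 0].mean())        # intercept equals the mean food expenditure of males
print(b0 + b1, y[D == 1].mean())   # b0 + b1 equals the mean food expenditure of females
print(b1)                          # differential intercept: female mean minus male mean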

Next, consider the following model, obtained by adding a quantitative variable X, for example the
after-tax income:

    Y_i = β_0 + β_1 D_i + β_2 X_i + ε_i.      (6.2)

Then the mean food expenditure for males is

    E(Y_i | D_i = 0, X_i) = β_0 + β_2 X_i,

and that for females is

    E(Y_i | D_i = 1, X_i) = (β_0 + β_1) + β_2 X_i.

These two regression lines are parallel with different intercepts; this is known as the parallel regression
model. The difference in the mean food expenditure remains constant regardless of the value
of X.

If the main interest is to compare the means of different categories, then the quantitative explanatory
variables in the model are called covariates or control variables. Models with both categorical
variables and quantitative variables as explanatory variables are called analysis-of-covariance
(ANCOVA) models.
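
A minimal sketch of fitting the parallel-regression (ANCOVA) model (6.2), assuming the statsmodels package is available and using simulated data in place of real income and expenditure figures:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100
D = rng.integers(0, 2, n)                      # 1 = female, 0 = male
X = rng.uniform(20, 80, n)                     # hypothetical after-tax income
Y = 5 + 3 * D + 0.4 * X + rng.normal(0, 2, n)  # parallel lines: same slope, shifted intercept

df = pd.DataFrame({"Y": Y, "D": D, "X": X})
fit = smf.ols("Y ~ D + X", data=df).fit()      # model (6.2)
print(fit.params)   # Intercept is b0, D is b1 (the constant gap between the lines), X is b2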

Next, consider the following model, obtained by adding the product term D_i X_i to model (6.2):

    Y_i = β_0 + β_1 D_i + β_2 X_i + β_3 D_i X_i + ε_i.      (6.3)

Then the mean food expenditure for males is

    E(Y_i | D_i = 0, X_i) = β_0 + β_2 X_i,

and that for females is

    E(Y_i | D_i = 1, X_i) = (β_0 + β_1) + (β_2 + β_3) X_i.

These two regression lines have different intercepts unless β_1 = 0, and different slopes unless β_3 = 0.
The product term D_i X_i is called the interaction term between D and X.
The difference in the mean food expenditure now depends on the value of X.

In model (6.3):
if β_1 = 0 and β_3 = 0, it is called coincident regression;
if β_1 ≠ 0 and β_3 = 0, it is called parallel regression;
if β_1 = 0 and β_3 ≠ 0, it is called concurrent regression;
if β_1 ≠ 0 and β_3 ≠ 0, it is called dissimilar regression.
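
A sketch of how one might fit model (6.3) and judge which of the four cases applies, again with simulated data and assuming statsmodels is available; the p-values of the t-tests on the coefficients of D and D:X are the usual basis for the decision:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 150
D = rng.integers(0, 2, n)
X = rng.uniform(20, 80, n)
Y = 5 + 3 * D + 0.4 * X + 0.2 * D * X + rng.normal(0, 2, n)  # both b1 and b3 nonzero here

df = pd.DataFrame({"Y": Y, "D": D, "X": X})
fit = smf.ols("Y ~ D + X + D:X", data=df).fit()   # model (6.3) with the interaction term

# both coefficients zero -> coincident; only D:X zero -> parallel;
# only D zero -> concurrent; both nonzero -> dissimilar
print(fit.pvalues[["D", "D:X"]])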



If a categorical variable has c categories, c − 1 dummy variables are needed. The category
without an indicator variable serves as the base (reference or benchmark) line. (If one
uses c dummy variables, then the intercept term should be set to zero.)

If one wants to incorporate a quarterly effect, one may define three dummy variables as

    D_{2i} = 1 if the i-th observation falls in quarter 2, and 0 otherwise,
    D_{3i} = 1 if the i-th observation falls in quarter 3, and 0 otherwise,
    D_{4i} = 1 if the i-th observation falls in quarter 4, and 0 otherwise.

In this example, since no dummy variable is defined for the first quarter, the regression line
for the first quarter serves as the base line, so that comparisons against the first quarter can
easily be made.
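
A small sketch of constructing such quarterly dummy variables with pandas; the column names are made up for illustration:

import pandas as pd

# hypothetical quarterly data
df = pd.DataFrame({"quarter": [1, 2, 3, 4, 1, 2, 3, 4],
                   "sales":   [10, 12, 14, 20, 11, 13, 15, 22]})

# one indicator column per quarter except the first, which serves as the base line
dummies = pd.get_dummies(df["quarter"], prefix="Q", drop_first=True).astype(int)
print(pd.concat([df, dummies], axis=1))
# columns Q_2, Q_3, Q_4 equal 1 in quarters 2, 3, 4 respectively and 0 otherwise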

Example:
Data in the Excel file named RealProp were collected as part of a citywide study of real estate
property valuation. They are observations on 60 parcels that sold in a particular calendar year
and neighborhood. The variables in the first four columns are: Market for the selling price of the
parcel, Sq.ft for the square feet of living area, Grade for the type of construction (coded 1 for high,
0 for medium, and -1 for low), and Assessed for the most recent assessed value on the city
assessor's books. The variables in the remaining columns are defined where appropriate. The
goal is to develop a model to predict the selling price of a parcel based on the size, grade, and
assessed value.

Define
Y: the selling price
X_1: the size of the living area
X_2: the grade for the type of construction (1 for high, 0 for medium, -1 for low)

Consider the following models

    Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + ε_i      (6.4)

and

    Y_i = β_0 + β_1 X_{1i} + β_2 D_{1i} + β_3 D_{2i} + ε_i      (6.5)

where D_1 and D_2 are dummy variables for the grade, for example D_{1i} = 1 if the grade is high
and 0 otherwise, and D_{2i} = 1 if the grade is medium and 0 otherwise.

1. Find the regression line for each grade under model (6.4).
2. Find the regression line for each grade under model (6.5).
3. Discuss the implication of using X_2 for the grade versus using the dummy
variables, and identify the case where (6.4) and (6.5) are identical.
4. How would you test whether (6.4) and (6.5) are identical? (See the sketch after this list.)
5. Find the degrees of freedom associated with the sums of squares in the ANOVA for each
model.
6. If the interaction term between the size and the grade is needed, what is the regression
model?
7. Fit the model with the interaction term(s) and test whether there is interaction.
8. Is the square of the size, i.e., X_1^2, necessary in the model?
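
A rough sketch of how models (6.4) and (6.5) might be fitted and compared in Python (questions 1-4); it assumes statsmodels is available and that the RealProp data have been exported to a CSV file, with the file name and the renamed column Sqft being illustrative:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# assumes the data described above are in a CSV file; file name and renaming are illustrative
df = pd.read_csv("RealProp.csv").rename(columns={"Sq.ft": "Sqft"})

fit_a = smf.ols("Market ~ Sqft + Grade", data=df).fit()      # (6.4): grade as the -1/0/1 score
fit_b = smf.ols("Market ~ Sqft + C(Grade)", data=df).fit()   # (6.5): grade via dummy variables

print(fit_a.params)
print(fit_b.params)
# (6.4) is the special case of (6.5) in which the high/medium and medium/low gaps are equal,
# so the two models can be compared with a nested-model F test:
print(anova_lm(fit_a, fit_b))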



Homework: 6.9, 6.19

What if the dependent variable is a categorical variable? Then the linear regression model may
not be appropriate, because the error term can no longer be assumed to be normally distributed and
the fitted values are not necessarily 0 or 1, and are often more than 1 or less than 0. Therefore, we
need to consider a model that overcomes these drawbacks.
To this end, we consider logistic regression, discussed in Section 12.6 on p. 386 of the textbook.

Logistic regression:

When the dependent (response) variable is a binary variable assuming the values 0 and 1, one often
considers a logistic regression model, in which the log of the odds ratio is modeled as a function of the
predictor (independent) variables, as opposed to modeling the response itself as in the textbook.
Specifically, let p_i be the probability of observing 1 for the i-th response variable, i.e.,
p_i = P(Y_i = 1). Then the odds ratio is p_i / (1 − p_i). The log of the odds ratio is
ln[p_i / (1 − p_i)], called the logit, and this is modeled as

    ln[p_i / (1 − p_i)] = β_0 + β_1 X_i.

Properties of the logit:
1. As p_i ranges from 0 to 1, the logit ranges from −∞ to +∞.
2. The logit is linear in X, but the probability p_i is not.
3. If the logit is positive, then the odds that Y equals 1 increase as the value of the
explanatory variable(s) increases. If the logit is negative, then the odds that Y equals 1
decrease as the value of the explanatory variable(s) increases.


This in turn yields the (non-linear) regression function

    p_i = exp(β_0 + β_1 X_i) / [1 + exp(β_0 + β_1 X_i)].

The advantage of this regression function is that the estimated regression function always lies
between 0 and 1, while the estimated linear regression function can be larger than 1 or less than 0.

If the estimated p_i is larger than 0.5, then the observation is classified as Y_i = 1, and if the
estimated p_i is smaller than 0.5, then the observation is classified as Y_i = 0.
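
A small sketch of this regression function and the 0.5 classification rule, with illustrative (not estimated) parameter values:

import numpy as np

def logistic(x, b0, b1):
    """Logistic regression function: P(Y = 1 | X = x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    eta = b0 + b1 * x
    return np.exp(eta) / (1.0 + np.exp(eta))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
p = logistic(x, b0=0.5, b1=1.2)      # illustrative parameter values
print(p)                             # always strictly between 0 and 1
print((p > 0.5).astype(int))         # classify as Y = 1 when the estimated probability exceeds 0.5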

As the regression function p_i = exp(β_0 + β_1 X_i) / [1 + exp(β_0 + β_1 X_i)] is (highly) nonlinear in
the parameters (as well as in the explanatory variables), we have a non-linear regression model.
In non-linear regression models, a closed-form solution for the parameter estimator is generally
not available. Therefore, numerical optimization is used to minimize the residual sum of squares
or to maximize the likelihood function. As this is beyond the scope of this course, we will rely on
computer software such as Minitab and Stata for the analysis.

Example: Refer to the bankruptcy data.

To fit a logistic regression model in Minitab:

Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model (specify the
response and predictor variables). To obtain the estimated p_i, choose Storage and check
Event Probability (the column with variable name EPRO# will contain the estimated p_i).
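
As a rough Python analogue of these steps (not the course's prescribed method), assuming statsmodels is available and using simulated data in place of the bankruptcy file, the model can be fitted by maximum likelihood and the event probabilities stored much like the EPRO# column:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated stand-in for the bankruptcy data: a binary response and one predictor
rng = np.random.default_rng(2)
n = 200
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-(-0.5 + 1.5 * x)))
y = rng.binomial(1, p)
df = pd.DataFrame({"y": y, "x": x})

fit = smf.logit("y ~ x", data=df).fit()    # maximum-likelihood fit of the logit model
print(fit.params)                          # estimated b0 and b1
df["p_hat"] = fit.predict(df)              # estimated event probabilities
print(df.head())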


Homework: 12.13 on p.399
