
This note is incomplete and subject to errors, mostly typos.

Chapter 6

In order to assess the relationship between a categorical variable and a quantitative dependent
variable, we introduce dummy (indicator) variables into the regression model.
A dummy variable assumes values 0 and 1, representing the absence and presence of a certain
characteristic, respectively.

For example, for a gender variable, we can define a dummy variable either as

    D_i = 1 if the i-th person is female, and D_i = 0 if the i-th person is male,      (6.1)

or as

    D_i = 1 if the i-th person is male, and D_i = 0 if the i-th person is female.
Consider the following model

    Y_i = β_0 + β_1 D_i + ε_i,

where Y is the annual expenditure on food and D is the dummy variable as defined in (6.1).

Then the mean food expenditure for males is

    E(Y_i | D_i = 0) = β_0,

and that for females is

    E(Y_i | D_i = 1) = β_0 + β_1.

We note that the slope coefficient β_1 associated with the dummy variable represents by how much
the mean of Y (food expenditure) for females differs from that for males.
It is often called the differential intercept coefficient.
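
As a quick numerical check (a minimal Python sketch with made-up expenditure figures, not data from the course), fitting the model with only the dummy variable reproduces the two group means and their difference:

import numpy as np

# hypothetical annual food expenditures; D = 1 for female, 0 for male, as in (6.1)
y = np.array([22.0, 25.0, 24.0, 23.0, 30.0, 33.0, 31.0, 32.0])
D = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# least-squares fit of Y = b0 + b1*D
X = np.column_stack([np.ones_like(D), D])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

print(b0, y[D == 0].mean())        # intercept equals the mean food expenditure of males
print(b0 + b1, y[D == 1].mean())   # b0 + b1 equals the mean food expenditure of females
print(b1)                          # differential intercept: female mean minus male mean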

Next, consider the following model, obtained by adding a quantitative variable X, for example the
after-tax income:

    Y_i = β_0 + β_1 D_i + β_2 X_i + ε_i.      (6.2)

Then the mean food expenditure for males is

    E(Y_i | D_i = 0, X_i) = β_0 + β_2 X_i,

and that for females is

    E(Y_i | D_i = 1, X_i) = (β_0 + β_1) + β_2 X_i.

These two regression lines are parallel with different intercepts; this is known as the parallel regression
model. The difference in the mean food expenditure remains constant regardless of the value
of X.

If the main interest is to compare the means of different categories, then the quantitative explanatory
variables in the model are called covariates or control variables. Models with both categorical
variables and quantitative variables as explanatory variables are called analysis-of-covariance
(ANCOVA) models.
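
A minimal sketch of fitting the parallel-regression (ANCOVA) model (6.2), assuming the statsmodels package is available and using simulated data in place of real income and expenditure figures:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100
D = rng.integers(0, 2, n)                      # 1 = female, 0 = male
X = rng.uniform(20, 80, n)                     # hypothetical after-tax income
Y = 5 + 3 * D + 0.4 * X + rng.normal(0, 2, n)  # parallel lines: same slope, shifted intercept

df = pd.DataFrame({"Y": Y, "D": D, "X": X})
fit = smf.ols("Y ~ D + X", data=df).fit()      # model (6.2)
print(fit.params)   # Intercept is b0, D is b1 (the constant gap between the lines), X is b2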

Next, consider the following model, obtained by adding the product term D_i X_i to model (6.2):

    Y_i = β_0 + β_1 D_i + β_2 X_i + β_3 D_i X_i + ε_i.      (6.3)

Then the mean food expenditure for males is

    E(Y_i | D_i = 0, X_i) = β_0 + β_2 X_i,

and that for females is

    E(Y_i | D_i = 1, X_i) = (β_0 + β_1) + (β_2 + β_3) X_i.

These two regression lines have different intercepts unless β_1 = 0, and different slopes unless β_3 = 0.
The product term D_i X_i is called the interaction term between D and X.
The difference in the mean food expenditure now depends on the value of X.

In model (6.3):
if β_1 = 0 and β_3 = 0, it is called coincident regression;
if β_1 ≠ 0 and β_3 = 0, it is called parallel regression;
if β_1 = 0 and β_3 ≠ 0, it is called concurrent regression;
if β_1 ≠ 0 and β_3 ≠ 0, it is called dissimilar regression.
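
A sketch of how one might fit model (6.3) and judge which of the four cases applies, again with simulated data and assuming statsmodels is available; the p-values of the t-tests on the coefficients of D and D:X are the usual basis for the decision:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 150
D = rng.integers(0, 2, n)
X = rng.uniform(20, 80, n)
Y = 5 + 3 * D + 0.4 * X + 0.2 * D * X + rng.normal(0, 2, n)  # both b1 and b3 nonzero here

df = pd.DataFrame({"Y": Y, "D": D, "X": X})
fit = smf.ols("Y ~ D + X + D:X", data=df).fit()   # model (6.3) with the interaction term

# both coefficients zero -> coincident; only D:X zero -> parallel;
# only D zero -> concurrent; both nonzero -> dissimilar
print(fit.pvalues[["D", "D:X"]])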



If a categorical variable has c categories, c − 1 dummy variables are needed. The category
without an indicator variable serves as the base (reference or benchmark) line. (If one
uses c dummy variables, then the intercept term should be set to zero.)

If one wants to incorporate a quarterly effect, one may define three dummy variables as

    D_{2i} = 1 if the i-th observation falls in quarter 2, and 0 otherwise,
    D_{3i} = 1 if the i-th observation falls in quarter 3, and 0 otherwise,
    D_{4i} = 1 if the i-th observation falls in quarter 4, and 0 otherwise.

In this example, since no dummy variable is defined for the first quarter, the regression line
for the first quarter serves as the base line, so that comparisons against the first quarter can
easily be made.
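
A small sketch of constructing such quarterly dummy variables with pandas; the column names are made up for illustration:

import pandas as pd

# hypothetical quarterly data
df = pd.DataFrame({"quarter": [1, 2, 3, 4, 1, 2, 3, 4],
                   "sales":   [10, 12, 14, 20, 11, 13, 15, 22]})

# one indicator column per quarter except the first, which serves as the base line
dummies = pd.get_dummies(df["quarter"], prefix="Q", drop_first=True).astype(int)
print(pd.concat([df, dummies], axis=1))
# columns Q_2, Q_3, Q_4 equal 1 in quarters 2, 3, 4 respectively and 0 otherwise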

Example:
Data in the Excel file named RealProp were collected as part of a citywide study of real estate
property valuation. They are observations on 60 parcels that sold in a particular calendar year
and neighborhood. The variables in the first four columns are: Market for the selling price of the
parcel, Sq.ft for the square feet of living area, Grade for the type of construction (coded 1 for high,
0 for medium, and -1 for low), and Assessed for the most recent assessed value on the city
assessor's books. The variables in the remaining columns are defined where appropriate. The
goal is to develop a model to predict the selling price of a parcel based on the size, grade, and
assessed value.

Define
Y: the selling price
X_1: the size of the living area
X_2: the grade for the type of construction (1 for high, 0 for medium, -1 for low)

Consider the following models

    Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + ε_i      (6.4)

and

    Y_i = β_0 + β_1 X_{1i} + β_2 D_{1i} + β_3 D_{2i} + ε_i      (6.5)

where D_1 and D_2 are dummy variables for the grade, for example D_{1i} = 1 if the grade is high
and 0 otherwise, and D_{2i} = 1 if the grade is medium and 0 otherwise.

1. Find the regression line for each grade under model (6.4).
2. Find the regression line for each grade under model (6.5).
3. Discuss the implication of using X_2 for the grade versus using the dummy
variables, and identify the case where (6.4) and (6.5) are identical.
4. How would you test whether (6.4) and (6.5) are identical? (See the sketch after this list.)
5. Find the degrees of freedom associated with the sums of squares in the ANOVA for each
model.
6. If the interaction term between the size and the grade is needed, what is the regression
model?
7. Fit the model with the interaction term(s) and test whether there is interaction.
8. Is the square of the size, i.e., X_1^2, necessary in the model?
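
A rough sketch of how models (6.4) and (6.5) might be fitted and compared in Python (questions 1-4); it assumes statsmodels is available and that the RealProp data have been exported to a CSV file, with the file name and the renamed column Sqft being illustrative:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# assumes the data described above are in a CSV file; file name and renaming are illustrative
df = pd.read_csv("RealProp.csv").rename(columns={"Sq.ft": "Sqft"})

fit_a = smf.ols("Market ~ Sqft + Grade", data=df).fit()      # (6.4): grade as the -1/0/1 score
fit_b = smf.ols("Market ~ Sqft + C(Grade)", data=df).fit()   # (6.5): grade via dummy variables

print(fit_a.params)
print(fit_b.params)
# (6.4) is the special case of (6.5) in which the high/medium and medium/low gaps are equal,
# so the two models can be compared with a nested-model F test:
print(anova_lm(fit_a, fit_b))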



Homework: 6.9, 6.19

What if the dependent variable is a categorical variable? Then the linear regression model may
not be appropriate, because the error term can no longer be assumed to be normally distributed and
the fitted values are not necessarily 0 or 1, and are often more than 1 or less than 0. Therefore, we
need to consider a model that overcomes these drawbacks.
To this end, we consider logistic regression, discussed in Section 12.6 on p. 386 of the textbook.

Logistic regression:

When the dependent (response) variable is a binary variable assuming the values 0 and 1, one often
considers a logistic regression model, in which the log of the odds ratio is modeled as a function of the
predictor (independent) variables, as opposed to modeling the response itself as in the textbook.
Specifically, let p_i be the probability of observing 1 for the i-th response variable, i.e.,
p_i = P(Y_i = 1). Then the odds ratio is p_i / (1 − p_i). The log of the odds ratio is
ln[p_i / (1 − p_i)], called the logit, and this is modeled as

    ln[p_i / (1 − p_i)] = β_0 + β_1 X_i.

Properties of the logit:
1. As p_i ranges from 0 to 1, the logit ranges from −∞ to +∞.
2. The logit is linear in X, but the probability p_i is not.
3. If the logit is positive, then the odds that Y equals 1 increase as the value of the
explanatory variable(s) increases. If the logit is negative, then the odds that Y equals 1
decrease as the value of the explanatory variable(s) increases.


This in turn yields the (non-linear) regression function

    p_i = exp(β_0 + β_1 X_i) / [1 + exp(β_0 + β_1 X_i)].

The advantage of this regression function is that the estimated regression function always lies
between 0 and 1, while the estimated linear regression function can be larger than 1 or less than 0.

If the estimated p_i is larger than 0.5, then the observation is classified as Y_i = 1, and if the
estimated p_i is smaller than 0.5, then the observation is classified as Y_i = 0.
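
A small sketch of this regression function and the 0.5 classification rule, with illustrative (not estimated) parameter values:

import numpy as np

def logistic(x, b0, b1):
    """Logistic regression function: P(Y = 1 | X = x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    eta = b0 + b1 * x
    return np.exp(eta) / (1.0 + np.exp(eta))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
p = logistic(x, b0=0.5, b1=1.2)      # illustrative parameter values
print(p)                             # always strictly between 0 and 1
print((p > 0.5).astype(int))         # classify as Y = 1 when the estimated probability exceeds 0.5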

As the regression function p_i = exp(β_0 + β_1 X_i) / [1 + exp(β_0 + β_1 X_i)] is (highly) nonlinear in
the parameters (as well as in the explanatory variables), we have a non-linear regression model.
In non-linear regression models, a closed-form solution for the parameter estimator is generally
not available. Therefore, numerical optimization is used to minimize the residual sum of squares
or to maximize the likelihood function. As this is beyond the scope of this course, we will rely on
computer software such as Minitab and Stata for the analysis.

Example: Refer to the bankruptcy data.

To fit a logistic regression model in Minitab:

Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model (specify the
response and predictor variables). To obtain the estimated p_i, choose Storage and check
Event Probability (the column with variable name EPRO# will contain the estimated p_i).
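
As a rough Python analogue of these steps (not the course's prescribed method), assuming statsmodels is available and using simulated data in place of the bankruptcy file, the model can be fitted by maximum likelihood and the event probabilities stored much like the EPRO# column:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated stand-in for the bankruptcy data: a binary response and one predictor
rng = np.random.default_rng(2)
n = 200
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-(-0.5 + 1.5 * x)))
y = rng.binomial(1, p)
df = pd.DataFrame({"y": y, "x": x})

fit = smf.logit("y ~ x", data=df).fit()    # maximum-likelihood fit of the logit model
print(fit.params)                          # estimated b0 and b1
df["p_hat"] = fit.predict(df)              # estimated event probabilities
print(df.head())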


Homework: 12.13 on p.399
