2003-2005, The Trustees of Indiana University Comparing Group Means: 1

http://www.indiana.edu/~statmath
Comparing Group Means: The T-test and One-way
ANOVA Using STATA, SAS, and SPSS

Hun Myoung Park

This document summarizes the method of comparing group means and illustrates how to
conduct the t-test and one-way ANOVA using STATA 9.0, SAS 9.1, and SPSS 13.0.

1. Introduction
2. Univariate Samples
3. Paired (dependent) Samples
4. Independent Samples with Equal Variances
5. Independent Samples with Unequal Variances
6. One-way ANOVA, GLM, and Regression
7. Conclusion


1. Introduction

The t-test and analysis of variance (ANOVA) compare group means. The mean of a variable to
be compared should be substantively interpretable. A t-test may examine gender differences in
average salary or racial (white versus black) differences in average annual income. The left-
hand side (LHS) variable to be tested should be interval or ratio, whereas the right-hand side
(RHS) variable should be binary (categorical).


1.1 T-test and ANOVA

While the t-test is limited to comparing means of two groups, one-way ANOVA can compare
more than two groups. Therefore, the t-test is considered a special case of one-way ANOVA.
These analyses do not, however, necessarily imply any causality (i.e., a causal relationship
between the left-hand and right-hand side variables). Table 1 compares the t-test and one-way
ANOVA.

Table 1. Comparison between the T-test and One-way ANOVA
                      T-test                                 One-way ANOVA
LHS (Dependent)       Interval or ratio variable             Interval or ratio variable
RHS (Independent)     Binary variable with only two groups   Categorical variable
Null Hypothesis       $\mu_1 = \mu_2$                        $\mu_1 = \mu_2 = \mu_3 = \cdots$
Prob. Distribution*   T distribution                         F distribution
* In the case of one degree of freedom on numerator, $F = t^2$.
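
The note under Table 1 can be checked numerically. The following is a minimal sketch in Python (not one of the three packages covered here) that rebuilds the one-way ANOVA F statistic from the two-group summary statistics reported for lung by smoke in the STATA output of Section 4.2 and compares it with the square of the t statistic reported there. The function name anova_f_two_groups is hypothetical.

```python
def anova_f_two_groups(n1, mean1, sd1, n2, mean2, sd2):
    """One-way ANOVA F statistic for two groups, from summary statistics."""
    grand_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    # Between-group sum of squares; 1 numerator degree of freedom (2 groups - 1)
    ss_between = n1 * (mean1 - grand_mean) ** 2 + n2 * (mean2 - grand_mean) ** 2
    # Within-group sum of squares; n1 + n2 - 2 denominator degrees of freedom
    ss_within = (n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
    ms_within = ss_within / (n1 + n2 - 2)
    return (ss_between / 1) / ms_within

# Obs, Mean, and Std. Dev. of the two smoke groups (Section 4.2 output)
f_stat = anova_f_two_groups(22, 16.98591, 3.164698, 22, 22.32045, 3.418151)
t_reported = -5.3714  # t statistic printed in the same .ttest output
print(round(f_stat, 2), round(t_reported ** 2, 2))
```

Both numbers should agree with the F value of 28.85 that the .oneway command prints for the same data in Section 4.1.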

The t-test assumes that samples are randomly drawn from normally distributed populations
with unknown population means. Otherwise, their means are no longer the best measures of
central tendency and the t-test will not be valid. The Central Limit Theorem says, however, that
the distributions of $\bar{y}_1$ and $\bar{y}_2$ are approximately normal when N is large. When $n_1 + n_2 \ge 30$, in
practice, you do not need to worry too much about the normality assumption.

You may numerically test the normality assumption using the Shapiro-Wilk W (N<=2000),
Shapiro-Francia W (N<=5000), Kolmogorov-Smirnov D (N>2000), and Jarque-Bera tests. If N
is small and the null hypothesis of normality is rejected, you may try such nonparametric
methods as the Kolmogorov-Smirnov test, Kruskal-Wallis test, Wilcoxon Rank-Sum Test, or
Log-Rank Test, depending on the circumstances.


1.2 T-test in SAS, STATA, and SPSS

In STATA, the .ttest and .ttesti commands are used to conduct t-tests, whereas
the .anova and .oneway commands perform one-way ANOVA. SAS has the TTEST
procedure for the t-test, but the UNIVARIATE and MEANS procedures also have options for the
t-test. SAS provides various procedures for the analysis of variance, such as the ANOVA, GLM,
and MIXED procedures. The ANOVA procedure can handle balanced data only, while the
GLM and MIXED procedures can analyze either balanced or unbalanced data (having the same or different
numbers of observations across groups). However, unbalanced data do not cause any
problems in the t-test and one-way ANOVA. In SPSS, the T-TEST, ONEWAY, and UNIANOVA
commands are used to perform the t-test and one-way ANOVA.

Table 2 summarizes STATA commands, SAS procedures, and SPSS commands that are
associated with t-test and one-way ANOVA.

Table 2. Related Procedures and Commands in STATA, SAS, and SPSS
                 STATA 9.0 SE                 SAS 9.1        SPSS 13.0
Normality Test   .sktest; .swilk; .sfrancia   UNIVARIATE     EXAMINE
Equal Variance   .oneway                      TTEST          T-TEST
Nonparametric    .ksmirnov; .kwallis          NPAR1WAY       NPAR TESTS
T-test           .ttest                       TTEST; MEANS   T-TEST
ANOVA            .anova; .oneway              ANOVA          ONEWAY
GLM*                                          GLM; MIXED     UNIANOVA
* The STATA .glm command is not used for the T test, but for the generalized linear model.


1.3 Data Arrangement

There are two types of data arrangement for t-tests (Figure 1). The first data arrangement has a
variable to be tested and a grouping variable to classify groups (0 or 1). The second,
appropriate especially for paired samples, has two variables to be tested. The two variables in
this type are not, however, necessarily paired nor balanced. SAS and SPSS prefer the first data
arrangement, whereas STATA can handle either type flexibly. Note that the numbers of
observations across groups are not necessarily equal.
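
The two arrangements can be converted into each other. The following is a minimal sketch in Python (not one of the packages covered here); the function names and the sample values are made up for illustration and are not part of the smoking data.

```python
def wide_to_long(variable1, variable2):
    """Stack two columns (second arrangement) into (value, group) rows (first arrangement)."""
    rows = [(value, 0) for value in variable1]
    rows += [(value, 1) for value in variable2]
    return rows

def long_to_wide(rows):
    """Split (value, group) rows back into one list per group."""
    group0 = [value for value, group in rows if group == 0]
    group1 = [value for value, group in rows if group == 1]
    return group0, group1

# Made-up values for illustration only
x = [2.1, 3.4, 2.8]
y = [4.0, 3.9]  # the two columns need not have the same length
long_rows = wide_to_long(x, y)
print(long_rows)
```

Stacking the columns this way is what SAS and SPSS require before an independent-sample t-test, while STATA can work with either layout.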

Figure 1. Two Types of Data Arrangement
  First type              Second type
  Variable   Group        Variable1   Variable2
  x          0            x           y
  x          0            x           y
  ...        ...          ...         ...
  y          1
  y          1
  ...        ...

The data set used here is adopted from J. F. Fraumeni's study on cigarette smoking and cancer
(Fraumeni 1968). The data are per capita numbers of cigarettes sold by 43 states and the
District of Columbia in 1960 together with death rates per hundred thousand people from
various forms of cancer. Two variables were added to categorize states into two groups. See the
appendix for the details.


2. Univariate Samples

The univariate-sample or one-sample t-test determines whether an unknown population mean
differs from a hypothesized value c that is commonly set to zero: $H_0: \mu = c$. The t statistic
follows Student's T probability distribution with n-1 degrees of freedom,

$t = \frac{\bar{y} - c}{s_{\bar{y}}} \sim t(n-1)$,

where y is a variable to be tested and n is the number of observations.[1]


Suppose you want to test if the population mean of the death rates from lung cancer is 20 per
100,000 people at the .01 significance level. Note the default significance level used in most
software is the .05 level.


2.1 T-test in STATA

The .ttest command conducts t-tests in an easy and flexible manner. For a univariate sample
test, the command requires that a hypothesized value be explicitly specified. The level()
option indicates the confidence level as a percentage. The 99 percent confidence level is
equivalent to the .01 significance level.

. ttest lung=20, level(99)

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [99% Conf. Interval]
---------+--------------------------------------------------------------------
    lung |      44    19.65318    .6374133    4.228122    17.93529    21.37108
------------------------------------------------------------------------------
    mean = mean(lung)                                             t =  -0.5441
Ho: mean = 20                                    degrees of freedom =       43

    Ha: mean < 20               Ha: mean != 20                 Ha: mean > 20
 Pr(T < t) = 0.2946         Pr(|T| > |t|) = 0.5892          Pr(T > t) = 0.7054

STATA first lists descriptive statistics of the variable lung. The mean and standard deviation
of the 44 observations are 19.653 and 4.228, respectively. The t statistic is -.544 = (19.653-20)
/ .6374. Finally, the degrees of freedom are 43 = 44-1.

There are three t-tests at the bottom of the output above. The first and third are one-tailed tests,
whereas the second is a two-tailed test. The t statistic -.544 and its large p-value do not reject
the null hypothesis that the population mean of the death rate from lung cancer is 20 at the .01
level. The mean of the death rate may be 20 per 100,000 people. Note that the hypothesized
value 20 falls into the 99 percent confidence interval 17.935-21.371.[2]
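
The relationship among the three p-values in the output can be sketched as follows. This is a small Python illustration (not STATA code) that relies only on the symmetry of the t distribution; the function name two_tailed_p is hypothetical.

```python
def two_tailed_p(p_lower, p_upper):
    """Two-tailed p-value from the two one-tailed p-values of a symmetric test."""
    return 2 * min(p_lower, p_upper)

# One-tailed p-values reported in the one-sample .ttest output above
p_lower, p_upper = 0.2946, 0.7054
p_two = two_tailed_p(p_lower, p_upper)
print(round(p_two, 4))              # compare with Pr(|T| > |t|) in the output
print(round(p_lower + p_upper, 4))  # the two one-tailed p-values sum to one
```

Because the reported p-values are rounded to four digits, the doubled value may differ from the printed two-tailed p-value in the last digit for other outputs.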


[1] $\bar{y} = \frac{\sum y_i}{n}$, $s^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$, and standard error $s_{\bar{y}} = \frac{s}{\sqrt{n}}$.
[2] The 99 percent confidence interval of the mean is $\bar{y} \pm t_{\alpha/2} s_{\bar{y}} = 19.653 \pm 2.695 \times .6374$, where the 2.695 is
the critical value with 43 degrees of freedom at the .01 level in the two-tailed test.

If you just have the aggregate data (i.e., the number of observations, mean, and standard
deviation of the sample), use the .ttesti command to replicate the t-test above. Note the
hypothesized value is specified at the end of the summary statistics.

. ttesti 44 19.65318 4.228122 20, level(99)
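
The same calculation from aggregate data can be sketched by hand. The following Python snippet (an illustration, not a replacement for .ttesti) applies the formulas of footnotes 1 and 2 to the summary statistics above; the function names are hypothetical, and the critical value 2.695 is typed in from footnote 2 because the Python standard library has no t distribution.

```python
import math

def one_sample_t(n, mean, sd, hypothesized):
    """Return (t statistic, standard error, degrees of freedom) from summary statistics."""
    se = sd / math.sqrt(n)
    return (mean - hypothesized) / se, se, n - 1

def confidence_interval(mean, se, critical_value):
    """Two-sided confidence interval: mean +/- critical value * standard error."""
    half_width = critical_value * se
    return mean - half_width, mean + half_width

t_stat, se, df = one_sample_t(44, 19.65318, 4.228122, 20)
# 2.695: two-tailed critical value for 43 df at the .01 level (footnote 2)
lower, upper = confidence_interval(19.65318, se, 2.695)
print(round(t_stat, 4), df, round(lower, 3), round(upper, 3))
```

The printed values should match the t statistic, degrees of freedom, and 99 percent confidence interval in the STATA output up to rounding of the critical value.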


2.2 T-test Using the SAS TTEST Procedure

The TTEST procedure conducts various types of t-tests in SAS. The H0 option specifies a
hypothesized value, whereas the ALPHA option specifies a significance level. If omitted, the default
values, zero and .05 respectively, are assumed.

PROC TTEST H0=20 ALPHA=.01 DATA=masil.smoking;
VAR lung;
RUN;

The TTEST Procedure

Statistics

Lower CL Upper CL Lower CL Upper CL
Variable N Mean Mean Mean Std Dev Std Dev Std Dev Std Err

lung 44 17.935 19.653 21.371 3.2994 4.2281 5.7989 0.6374


T-Tests

Variable DF t Value Pr > |t|

lung 43 -0.54 0.5892

The TTEST procedure reports descriptive statistics followed by a two-tailed t-test. You may
have a summary data set containing the values of a variable (lung) and their frequencies
(count). The FREQ statement of the TTEST procedure provides the solution for this case.

PROC TTEST H0=20 ALPHA=.01 DATA=masil.smoking;
VAR lung;
FREQ count;
RUN;


2.3 T-test Using the SAS UNIVARIATE and MEANS Procedures

The SAS UNIVARIATE and MEANS procedures also conduct a t-test for a univariate sample.
The UNIVARIATE procedure is basically designed to produce a variety of descriptive
statistics of a variable. Its MU0 option tells the procedure to perform a t-test using the
hypothesized value specified. The VARDEF=DF option specifies a divisor (degrees of freedom) used in
computing the variance (standard deviation).[3] The NORMAL option examines if the variable is
normally distributed.

PROC UNIVARIATE MU0=20 VARDEF=DF NORMAL ALPHA=.01 DATA=masil.smoking;
VAR lung;
RUN;

The UNIVARIATE Procedure
Variable: lung

Moments

N 44 Sum Weights 44
Mean 19.6531818 Sum Observations 864.74
Std Deviation 4.22812167 Variance 17.8770129
Skewness -0.104796 Kurtosis -0.949602
Uncorrected SS 17763.604 Corrected SS 768.711555
Coeff Variation 21.5136751 Std Error Mean 0.63741333


Basic Statistical Measures

Location Variability

Mean 19.65318 Std Deviation 4.22812
Median 20.32000 Variance 17.87701
Mode . Range 15.26000
Interquartile Range 6.53000


Tests for Location: Mu0=20

Test -Statistic- -----p Value------

Student's t t -0.5441 Pr > |t| 0.5892
Sign M 1 Pr >= |M| 0.8804
Signed Rank S -36.5 Pr >= |S| 0.6752


Tests for Normality

Test --Statistic--- -----p Value------

Shapiro-Wilk W 0.967845 Pr < W 0.2535
Kolmogorov-Smirnov D 0.086184 Pr > D >0.1500
Cramer-von Mises W-Sq 0.063737 Pr > W-Sq >0.2500
Anderson-Darling A-Sq 0.382105 Pr > A-Sq >0.2500


Quantiles (Definition 5)

Quantile Estimate

100% Max 27.270
99% 27.270
95% 25.950
90% 25.450
75% Q3 22.815
50% Median 20.320
25% Q1 16.285
10% 14.110
5% 12.120
1% 12.010
0% Min 12.010


Extreme Observations

-----Lowest---- ----Highest----

Value Obs Value Obs

12.01 39 25.45 16
12.11 33 25.88 1
12.12 30 25.95 27
13.58 10 26.48 18
14.11 36 27.27 8

[3] The VARDEF=N option uses N as a divisor, while VARDEF=WDF specifies the sum of weights minus one.

The third block of the output above reports a t statistic and its p-value. The fourth block
contains several normality test statistics. Since N is less than 2,000, you should read the
Shapiro-Wilk W, which does not reject the null hypothesis that lung is normally distributed (p=.2535).

The MEANS procedure also conducts t-tests using the T and PROBT options that request the t
statistic and its two-tailed p-value. The CLM option produces the two-tailed confidence interval
(or upper and lower limits). The MEAN, STD, and STDERR respectively print the sample mean,
standard deviation, and standard error.

PROC MEANS MEAN STD STDERR T PROBT CLM VARDEF=DF ALPHA=.01 DATA=masil.smoking;
VAR lung;
RUN;

The MEANS Procedure

Analysis Variable : lung

Lower 99% Upper 99%
Mean Std Dev Std Error t Value Pr > |t| CL for Mean CL for Mean

19.6531818 4.2281217 0.6374133 30.83 <.0001 17.9352878 21.3710758


The MEANS procedure does not, however, have an option to set the hypothesized value to
anything other than zero. Thus, the null hypothesis here is that the population mean of death
rate from lung cancer is zero. The t statistic 30.83 is (19.6532-0)/.6374. The large t statistic and
small p-value reject the null hypothesis, reporting a consistent conclusion.


2.4 T-test in SPSS

SPSS has the T-TEST command for t-tests. The /TESTVAL subcommand specifies the value
with which the sample mean is compared, whereas the /VARIABLES subcommand lists the variables to be tested.
Like STATA, SPSS specifies a confidence level rather than a significance level in the
/CRITERIA=CI() subcommand.

T-TEST
/TESTVAL = 20
/VARIABLES = lung
/MISSING = ANALYSIS
/CRITERIA = CI(.99) .


3. Paired (Dependent) Samples

When two variables are not independent, but paired, the difference of these two variables,
$d_i = y_{1i} - y_{2i}$, is treated as if it were a single sample. This test is appropriate for pre-post
treatment responses. The null hypothesis is that the true mean difference of the two variables is
$D_0$, $H_0: \mu_d = D_0$.[4] The difference is typically assumed to be zero unless explicitly specified.


3.1 T-test in STATA

In order to conduct a paired sample t-test, you need to list two variables separated by an equal
sign. The interpretation of the t-test remains almost unchanged. The t statistic -1.871 = (-10.1667-0)/5.4337
at 35 degrees of freedom does not reject the null hypothesis that the difference is zero
at the .05 level (p=.0697).

. ttest pre=post0, level(95)

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     pre |      36    176.0278    6.529723    39.17834    162.7717    189.2838
   post0 |      36    186.1944    7.826777    46.96066    170.3052    202.0836
---------+--------------------------------------------------------------------
    diff |      36   -10.16667    5.433655    32.60193   -21.19757    .8642387
------------------------------------------------------------------------------
     mean(diff) = mean(pre - post0)                               t =  -1.8711
 Ho: mean(diff) = 0                              degrees of freedom =       35

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.0349         Pr(|T| > |t|) = 0.0697          Pr(T > t) = 0.9651

Alternatively, you may first compute the difference between the two variables, and then
conduct a one-sample t-test. Note that the default confidence level, level(95), can be omitted.

. gen d=pre-post0
. ttest d=0
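
The equivalence between the paired t-test and a one-sample t-test on the differences can be sketched as follows. This is a Python illustration rather than STATA code; the function name paired_t and the short demo lists are made up, while the summary figures in the last check are the diff row of the output above.

```python
import math
import statistics

def paired_t(first, second, hypothesized_diff=0.0):
    """Paired t-test computed as a one-sample t-test on the differences."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    se_d = statistics.stdev(diffs) / math.sqrt(n)  # stdev uses the n - 1 divisor
    return (mean_d - hypothesized_diff) / se_d, n - 1

# Made-up scores for illustration only; not the pre/post0 data used above
pre_demo = [10, 12, 14, 16]
post_demo = [11, 14, 15, 20]
t_demo, df_demo = paired_t(pre_demo, post_demo)

# Check against the summary statistics of diff in the output above (n = 36)
t_doc = (-10.16667 - 0) / (32.60193 / math.sqrt(36))
print(round(t_demo, 4), df_demo, round(t_doc, 4))
```

The last value should reproduce the t statistic of the paired test from the mean and standard deviation of the differences alone.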


3.2 T-test in SAS

In the TTEST procedure, you have to use the PAIRED statement instead of the VAR statement. For the
output of the following procedure, refer to the end of this section.

PROC TTEST DATA=temp.drug;
PAIRED pre*post0;
RUN;


[4] $t_d = \frac{\bar{d} - D_0}{s_{\bar{d}}} \sim t(n-1)$, where $\bar{d} = \frac{\sum d_i}{n}$, $s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n-1}$, and $s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$
The PAIRED statement provides various ways of comparing variables using asterisk (*) and
colon (:) operators. The asterisk requests comparisons between each variable on the left with
each variable on the right. The colon requests comparisons between the first variable on the left
and the first on the right, the second on the left and the second on the right, and so forth.
Consider the following examples.

PROC TTEST;
PAIRED pre:post0;
PAIRED (a b)*(c d); /* Equivalent to PAIRED a*c a*d b*c b*d; */
PAIRED (a b):(c d); /* Equivalent to PAIRED a*c b*d; */
PAIRED (a1-a10)*(b1-b10);
RUN;

The first PAIRED statement is the same as PAIRED pre*post0. The second and the third
PAIRED statements contrast differences between asterisk and colon operators. The hyphen (-)
operator in the last statement indicates a1 through a10 and b1 through b10. Let us consider an
example of the PAIRED statement.

PROC TTEST DATA=temp.drug;
PAIRED (pre)*(post0-post1);
RUN;

The TTEST Procedure

Statistics

Lower CL Upper CL Lower CL Upper CL
Difference N Mean Mean Mean Std Dev Std Dev Std Dev Std Err

pre - post0 36 -21.2 -10.17 0.8642 26.443 32.602 42.527 5.4337
pre - post1 36 -30.43 -20.39 -10.34 24.077 29.685 38.723 4.9475


T-Tests

Difference DF t Value Pr > |t|

pre - post0 35 -1.87 0.0697
pre - post1 35 -4.12 0.0002

The first t statistic for pre versus post0 is identical to that of the previous section. The second
for pre versus post1 rejects the null hypothesis of no mean difference at the .01 level (p=.0002).

In order to use the UNIVARIATE and MEANS procedures, the difference between two paired
variables should be computed in advance.

DATA temp.drug2;
SET temp.drug;
d1 = pre - post0;
d2 = pre - post1;
RUN;

PROC UNIVARIATE MU0=0 VARDEF=DF NORMAL; VAR d1 d2; RUN;
PROC MEANS MEAN STD STDERR T PROBT CLM; VAR d1 d2; RUN;
PROC TTEST ALPHA=.05; VAR d1 d2; RUN;


3.3 T-test in SPSS

In SPSS, the PAIRS subcommand indicates a paired sample t-test.

T-TEST PAIRS = pre post0
/CRITERIA = CI(.95)
/MISSING = ANALYSIS .


4. Independent Samples with Equal Variances

You should check three assumptions first when testing the mean difference of two independent
samples. First, the samples are drawn from normally distributed populations with unknown
parameters. Second, the two samples are independent in the sense that they are drawn from
different populations and/or the elements of one sample are not related to those of the other
sample. Finally, the population variances of the two groups, $\sigma_1^2$ and $\sigma_2^2$, are equal.[5] If any one
of the assumptions is violated, the t-test is not valid.

An example here is to compare mean death rates from lung cancer between smokers and non-
smokers. Let us begin with discussing the equal variance assumption.


4.1 F test for Equal Variances

The folded form F test is widely used to examine whether two populations have the same
variance. The statistic is

$F = \frac{s_L^2}{s_S^2} \sim F(n_L - 1, n_S - 1)$,

where L and S respectively indicate groups
with larger and smaller sample variances. Unless the null hypothesis of equal variances is
rejected, the pooled variance estimate $s_{pool}^2$ is used. The null hypothesis of the independent
sample t-test is $H_0: \mu_1 - \mu_2 = D_0$.

$t = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{s_{pool}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$, where

$s_{pool}^2 = \frac{\sum (y_{1i} - \bar{y}_1)^2 + \sum (y_{2j} - \bar{y}_2)^2}{n_1 + n_2 - 2} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$.

When the assumption is violated, the t-test requires an approximation of the degrees of
freedom. The null hypothesis and other components of the t-test, however, remain unchanged.
Satterthwaite's approximation for the degrees of freedom is commonly used. Note that the
approximation is a real number, not an integer.

$t' = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \sim t(df_{Satterthwaite})$, where

$df_{Satterthwaite} = \frac{(n_1 - 1)(n_2 - 1)}{(n_1 - 1)(1 - c)^2 + (n_2 - 1)c^2}$ and $c = \frac{s_1^2 / n_1}{s_1^2 / n_1 + s_2^2 / n_2}$
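
Satterthwaite's approximation can be sketched from summary statistics alone. The following Python snippet is an illustration (the function name satterthwaite is hypothetical); it uses the group sizes, means, and standard deviations printed in the STATA outputs of Section 4.2, so the results can be compared with the Satterthwaite rows that SAS reports in Section 4.3 (41.8 and 82.6 degrees of freedom).

```python
import math

def satterthwaite(n1, mean1, sd1, n2, mean2, sd2, d0=0.0):
    """Return (t', df) for the unequal-variance t-test from summary statistics."""
    v1 = sd1 ** 2 / n1  # s_1^2 / n_1
    v2 = sd2 ** 2 / n2  # s_2^2 / n_2
    t_prime = ((mean1 - mean2) - d0) / math.sqrt(v1 + v2)
    c = v1 / (v1 + v2)
    df = ((n1 - 1) * (n2 - 1)) / ((n1 - 1) * (1 - c) ** 2 + (n2 - 1) * c ** 2)
    return t_prime, df

# lung by smoke (group 0 minus group 1)
t_lung, df_lung = satterthwaite(22, 16.98591, 3.164698, 22, 22.32045, 3.418151)
# kidney minus leukemia
t_death, df_death = satterthwaite(44, 2.794545, 0.5190799, 44, 6.829773, 0.6382589)
print(round(t_lung, 2), round(df_lung, 1))
print(round(t_death, 2), round(df_death, 1))
```

The degrees of freedom come out as real numbers, which is why the packages print values with a decimal part.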

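[5] $E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$, $Var(\bar{x}_1 - \bar{x}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$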



The SAS TTEST procedure and SPSS T-TEST command conduct F tests for equal variance.
SAS reports the folded form F statistic, whereas SPSS computes Levene's weighted F statistic.
In STATA, the .oneway command produces Bartlett's statistic for the equal variance test. The
following is an example of Bartlett's test that does not reject the null hypothesis of equal
variance.

. oneway lung smoke

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      313.031127      1   313.031127     28.85     0.0000
 Within groups      455.680427     42   10.849534
------------------------------------------------------------------------
    Total           768.711555     43   17.8770129

Bartlett's test for equal variances:  chi2(1) =   0.1216  Prob>chi2 = 0.727

STATA, SAS, and SPSS all compute Satterthwaite's approximation of the degrees of freedom.
In addition, the SAS TTEST procedure reports the Cochran-Cox approximation and the
STATA .ttest command provides Welch's degrees of freedom.


4.2 T-test in STATA

With the .ttest command, you have to specify a grouping variable (smoke in this example) in
the parentheses of the by() option.

. ttest lung, by(smoke) level(95)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      22    16.98591    .6747158    3.164698    15.58276    18.38906
       1 |      22    22.32045    .7287523    3.418151    20.80493    23.83598
---------+--------------------------------------------------------------------
combined |      44    19.65318    .6374133    4.228122    18.36772    20.93865
---------+--------------------------------------------------------------------
    diff |           -5.334545    .9931371               -7.338777   -3.330314
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -5.3714
Ho: diff = 0                                     degrees of freedom =       42

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Let us first check the equal variance. The F statistic is

$F = \frac{s_L^2}{s_S^2} = \frac{3.4182^2}{3.1647^2} = 1.17 \sim F(21, 21)$.

The degrees of freedom of the numerator and denominator are 21 (=22-1). The p-value of .7273,
virtually the same as that of Bartlett's test above, does not reject the null hypothesis of equal
variance. Thus, the t-test here is valid (t=-5.3714 and p<.0001).
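
The folded form F statistic can be sketched from the two standard deviations in the output above. The following Python snippet is an illustration only (the function name folded_f is hypothetical); it does not compute the p-value, since the Python standard library has no F distribution.

```python
def folded_f(n1, sd1, n2, sd2):
    """Folded form F: larger sample variance over the smaller one, with its degrees of freedom."""
    groups = sorted([(sd1 ** 2, n1), (sd2 ** 2, n2)], reverse=True)
    (var_large, n_large), (var_small, n_small) = groups
    return var_large / var_small, n_large - 1, n_small - 1

# Obs and Std. Dev. of the two smoke groups from the .ttest output above
f_lung, df_num, df_den = folded_f(22, 3.164698, 22, 3.418151)
print(round(f_lung, 2), df_num, df_den)
```

Sorting by variance makes the result independent of the order in which the two groups are supplied.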

$t = \frac{(16.9859 - 22.3205) - 0}{s_{pool}\sqrt{\frac{1}{22} + \frac{1}{22}}} = -5.3714 \sim t(22 + 22 - 2)$, where

$s_{pool}^2 = \frac{(22 - 1)3.1647^2 + (22 - 1)3.4182^2}{22 + 22 - 2} = 10.8497$

If only aggregate data of the two variables are available, use the .ttesti command and list the
number of observations, mean, and standard deviation of the two variables.

. ttesti 22 16.98591 3.164698 22 22.32045 3.418151, level(95)
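
The pooled-variance calculation from aggregate data can also be sketched by hand. The following Python snippet is an illustration, not a replacement for .ttesti; the function name pooled_t is hypothetical, and the inputs are the Obs, Mean, and Std. Dev. columns of the two-sample output above.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2, d0=0.0):
    """Equal-variance two-sample t-test from summary statistics."""
    df = n1 + n2 - 2
    var_pool = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(var_pool * (1 / n1 + 1 / n2))
    return ((mean1 - mean2) - d0) / se_diff, df, var_pool

t_stat, df, var_pool = pooled_t(22, 16.98591, 3.164698, 22, 22.32045, 3.418151)
print(round(t_stat, 4), df, round(var_pool, 4))
```

The pooled variance computed this way equals the within-group mean square in the .oneway output of Section 4.1; it differs from 10.8497 in the fourth decimal only because the hand calculation above uses standard deviations rounded to four digits.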

Suppose a data set is differently arranged (second type in Figure 1) so that one variable
smk_lung has data for smokers and the other non_lung for non-smokers. You have to use the
unpaired option to indicate that two variables are not paired. A grouping variable here is not
necessary. Compare the following output with what is printed above.

. ttest smk_lung=non_lung, unpaired

Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
smk_lung |      22    22.32045    .7287523    3.418151    20.80493    23.83598
non_lung |      22    16.98591    .6747158    3.164698    15.58276    18.38906
---------+--------------------------------------------------------------------
combined |      44    19.65318    .6374133    4.228122    18.36772    20.93865
---------+--------------------------------------------------------------------
    diff |            5.334545    .9931371                3.330313    7.338777
------------------------------------------------------------------------------
    diff = mean(smk_lung) - mean(non_lung)                        t =   5.3714
Ho: diff = 0                                     degrees of freedom =       42

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

This unpaired option is very useful since it enables you to conduct a t-test without additional
data manipulation. You may run the .ttest command with the unpaired option to compare
two variables, say leukemia and kidney, as independent samples in STATA. In SAS and
SPSS, however, you have to stack up two variables and generate a grouping variable before
t-tests.

. ttest leukemia=kidney, unpaired

Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
leukemia |      44    6.829773    .0962211    .6382589    6.635724    7.023821
  kidney |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
combined |      88    4.812159    .2249261    2.109994    4.365094    5.259224
---------+--------------------------------------------------------------------
    diff |            4.035227    .1240251                3.788673    4.281781
------------------------------------------------------------------------------
    diff = mean(leukemia) - mean(kidney)                          t =  32.5356
Ho: diff = 0                                     degrees of freedom =       86

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

The F statistic 1.5119 = (.6382589^2)/(.5190799^2) and its p-value (=.1797) do not reject the null
hypothesis of equal variance. The large t statistic 32.5356 rejects the null hypothesis that death
rates from leukemia and kidney cancers have the same mean.


4.3 T-test in SAS

The TTEST procedure by default examines the hypothesis of equal variances, and provides T
statistics for either case. The procedure by default reports Satterthwaite's approximation for the
degrees of freedom. Keep in mind that a variable to be tested is grouped by the variable that is
specified in the CLASS statement.

PROC TTEST H0=0 ALPHA=.05 DATA=masil.smoking;
CLASS smoke;
VAR lung;
RUN;

The TTEST Procedure

Statistics

Lower CL Upper CL Lower CL Upper CL
Variable smoke N Mean Mean Mean Std Dev Std Dev Std Dev

lung 0 22 15.583 16.986 18.389 2.4348 3.1647 4.5226
lung 1 22 20.805 22.32 23.836 2.6298 3.4182 4.8848
lung Diff (1-2) -7.339 -5.335 -3.33 2.7159 3.2939 4.1865


Statistics

Variable smoke Std Err Minimum Maximum

lung 0 0.6747 12.01 25.45
lung 1 0.7288 12.11 27.27
lung Diff (1-2) 0.9931


T-Tests

Variable Method Variances DF t Value Pr > |t|

lung Pooled Equal 42 -5.37 <.0001
lung Satterthwaite Unequal 41.8 -5.37 <.0001


Equality of Variances

Variable Method Num DF Den DF F Value Pr > F

lung Folded F 21 21 1.17 0.7273

The F test for equal variance does not reject the null hypothesis of equal variances. Thus, the
t-test labeled as Pooled should be referred to in order to get the t statistic -5.37 and its p-value (<.0001). If
the equal variance assumption is violated, the statistics of Satterthwaite and Cochran
should be read.

If you have a summary data set with the values of variables (lung) and their frequency (count),
specify the count variable in the FREQ statement.

PROC TTEST DATA=masil.smoking;
CLASS smoke;
VAR lung;
FREQ count;
RUN;

Now, let us compare the death rates from leukemia and kidney in the second data arrangement
type of Figure 1. As mentioned before, you need to rearrange the data set to stack up two
variables into one and generate a grouping variable (first type in Figure 1).

DATA masil.smoking2;
SET masil.smoking;
death = leukemia; leu_kid ='Leukemia'; OUTPUT;
death = kidney; leu_kid ='Kidney'; OUTPUT;
KEEP leu_kid death;
RUN;

PROC TTEST COCHRAN DATA=masil.smoking2; CLASS leu_kid; VAR death; RUN;

The TTEST Procedure

Statistics

Lower CL Upper CL Lower CL Upper CL
Variable leu_kid N Mean Mean Mean Std Dev Std Dev Std Dev Std Err

death Kidney 44 2.6367 2.7945 2.9524 0.4289 0.5191 0.6577 0.0783
death Leukemia 44 6.6357 6.8298 7.0238 0.5273 0.6383 0.8087 0.0962
death Diff (1-2) -4.282 -4.035 -3.789 0.5063 0.5817 0.6838 0.124


T-Tests

Variable Method Variances DF t Value Pr > |t|

death Pooled Equal 86 -32.54 <.0001
death Satterthwaite Unequal 82.6 -32.54 <.0001
death Cochran Unequal 43 -32.54 <.0001


Equality of Variances

Variable Method Num DF Den DF F Value Pr > F

death Folded F 43 43 1.51 0.1794

Compare this SAS output with that of STATA in the previous section.


4.4 T-test in SPSS

In the T-TEST command, you need to use the GROUPS subcommand in order to specify a
grouping variable. SPSS reports Levene's F .0000 that does not reject the null hypothesis of
equal variance (p=.995).

T-TEST GROUPS = smoke(0 1)
/VARIABLES = lung
/MISSING = ANALYSIS
/CRITERIA = CI(.95) .


5. Independent Samples with Unequal Variances

If the assumption of equal variances is violated, we have to compute the adjusted t statistic
using individual sample standard deviations rather than a pooled standard deviation. It is also
necessary to use the Satterthwaite, Cochran-Cox (SAS), or Welch (STATA) approximations of
the degrees of freedom. In this section, you compare mean death rates from kidney cancer
between the west (south) and east (north).


5.1 T-test in STATA

As discussed earlier, let us check equality of variances using the .oneway command. The
tabulate option produces a table of summary statistics for the groups.

. oneway kidney west, tabulate

            |          Summary of kidney
       west |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |       3.006    .3001298          20
          1 |   2.6183333   .59837219          24
------------+------------------------------------
      Total |   2.7945455   .51907993          44

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      1.63947758      1   1.63947758      6.92     0.0118
 Within groups      9.94661333     42   .236824127
------------------------------------------------------------------------
    Total           11.5860909     43   .269443975

Bartlett's test for equal variances:  chi2(1) =   8.6506  Prob>chi2 = 0.003

Bartlett's chi-squared statistic rejects the null hypothesis of equal variance at the .01 level. It is
appropriate to use the unequal option in the .ttest command, which calculates
Satterthwaite's approximation for the degrees of freedom.
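
Bartlett's statistic can be recomputed from the group sizes and standard deviations in the summary table. The following is a minimal Python sketch, assuming the textbook form of Bartlett's test with its small-sample correction; the function names are illustrative and are not part of STATA.

```python
import math

def bartlett_chi2(groups):
    """groups: list of (n, sd) pairs, one pair per group."""
    k = len(groups)
    n_total = sum(n for n, _ in groups)
    # Pooled variance: within-group sum of squares over its degrees of freedom.
    pooled_var = sum((n - 1) * sd ** 2 for n, sd in groups) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled_var) - sum(
        (n - 1) * math.log(sd ** 2) for n, sd in groups
    )
    # Small-sample correction factor of Bartlett's test.
    correction = 1 + (sum(1 / (n - 1) for n, _ in groups) - 1 / (n_total - k)) / (3 * (k - 1))
    return numerator / correction

def chi2_upper_tail_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Group sizes and standard deviations of kidney by west, taken from the table above.
chi2 = bartlett_chi2([(20, 0.3001298), (24, 0.59837219)])
p_value = chi2_upper_tail_df1(chi2)
print(round(chi2, 4), round(p_value, 3))
```

With two groups the statistic has one degree of freedom, so the p-value reduces to a complementary error function. The result should agree with the chi2(1) of about 8.65 and the p-value of about .003 reported by .oneway.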

Unlike the SAS TTEST procedure, the .ttest command cannot specify a hypothesized mean
difference $D_0$ other than zero. Thus, the null hypothesis is that the mean difference is zero.

. ttest kidney, by(west) unequal level(95)

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      20       3.006    .0671111    .3001298    2.865535    3.146465
       1 |      24    2.618333    .1221422    .5983722    2.365663    2.871004
---------+--------------------------------------------------------------------
combined |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
    diff |            .3876667     .139365                .1047722    .6705611
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.7817
Ho: diff = 0                     Satterthwaite's degrees of freedom =  35.1098

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9957         Pr(|T| > |t|) = 0.0086          Pr(T > t) = 0.0043

See Satterthwaite's approximation of 35.110 in the middle of the output. If you want to get
Welch's approximation, use the welch option together with the unequal option; without the
unequal option, welch is ignored.

. ttest kidney, by(west) unequal welch

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      20       3.006    .0671111    .3001298    2.865535    3.146465
       1 |      24    2.618333    .1221422    .5983722    2.365663    2.871004
---------+--------------------------------------------------------------------
combined |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
    diff |            .3876667     .139365                .1050824    .6702509
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.7817
Ho: diff = 0                             Welch's degrees of freedom =  36.2258

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9957         Pr(|T| > |t|) = 0.0085          Pr(T > t) = 0.0043

Satterthwaite's approximation of 35.1098 is slightly smaller than Welch's 36.2258. Again, keep
in mind that these approximations are not integers, but real numbers. The t statistic 2.7817 and
its p-value .0086 reject the null hypothesis of equal population means. The north and east have
larger death rates from kidney cancer per 100 thousand people than the south and west.
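
The two approximations differ only in how each squared standard error is weighted. The sketch below, in Python, assumes the usual Satterthwaite formula and the Welch formula documented for STATA's .ttest; the function name is illustrative, and the inputs are the group statistics shown in the output above.

```python
import math

def unequal_variance_t(n1, mean1, sd1, n2, mean2, sd2):
    """Return (t, Satterthwaite df, Welch df) from group summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2   # squared standard error of each mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df_satterthwaite = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    df_welch = -2 + (v1 + v2) ** 2 / (v1 ** 2 / (n1 + 1) + v2 ** 2 / (n2 + 1))
    return t, df_satterthwaite, df_welch

# west=0 (20 states) versus west=1 (24 states), kidney cancer death rates.
t, df_s, df_w = unequal_variance_t(20, 3.006, 0.3001298, 24, 2.618333, 0.5983722)
print(round(t, 4), round(df_s, 4), round(df_w, 4))
```

The t statistic is the same under both options; only the degrees of freedom, and therefore the p-value and confidence interval, change slightly.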

For aggregate data, use the .ttesti command with the necessary options.

. ttesti 20 3.006 .3001298 24 2.618333 .5983722, unequal welch

As mentioned earlier, the unpaired option of the .ttest command directly compares two
variables without data manipulation. The option treats the two variables as independent of each
other. The following is an example of the unpaired and unequal options.

. ttest bladder=kidney, unpaired unequal welch

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 bladder |      44    4.121136    .1454679    .9649249    3.827772      4.4145
  kidney |      44    2.794545    .0782542    .5190799    2.636731     2.95236
---------+--------------------------------------------------------------------
combined |      88    3.457841    .1086268    1.019009    3.241933    3.673748
---------+--------------------------------------------------------------------
    diff |            1.326591    .1651806                .9968919     1.65629
------------------------------------------------------------------------------
    diff = mean(bladder) - mean(kidney)                           t =   8.0312
Ho: diff = 0                             Welch's degrees of freedom =  67.0324

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

The F 3.4556 = (.9649249^2)/(.5190799^2) rejects the null hypothesis of equal variance
(p<.0001). If the welch option is omitted, Satterthwaite's degrees of freedom 65.9643 will be
produced instead.
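
The folded form F statistic is simply the larger sample variance divided by the smaller one. A small Python sketch follows; the helper name is illustrative, and the p-value is left out because it needs an F distribution function that the standard library does not provide.

```python
def folded_f(sd_a, n_a, sd_b, n_b):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1

# Bladder versus kidney cancer death rates (44 observations each).
f_cancer = folded_f(0.9649249, 44, 0.5190799, 44)
# Kidney cancer death rates by west (20 and 24 states).
f_region = folded_f(0.3001298, 20, 0.5983722, 24)
print(f_cancer, f_region)
```

Because the larger variance always goes in the numerator, the statistic is never below 1, and the numerator degrees of freedom belong to whichever group has the larger variance.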

For aggregate data, again, use the .ttesti command without the unpaired option.

. ttesti 44 4.121136 .9649249 44 2.794545 .5190799, unequal welch level(95)


5.2 T-test in SAS

The TTEST procedure reports statistics for cases of both equal and unequal variance. You may
add the COCHRAN option to compute the Cochran-Cox approximation for the degrees of freedom.

PROC TTEST COCHRAN DATA=masil.smoking;
CLASS west;
VAR kidney;
RUN;

The TTEST Procedure

Statistics

Lower CL Upper CL Lower CL Upper CL
Variable s_west N Mean Mean Mean Std Dev Std Dev Std Dev

kidney 0 20 2.8655 3.006 3.1465 0.2282 0.3001 0.4384
kidney 1 24 2.3657 2.6183 2.871 0.4651 0.5984 0.8394
kidney Diff (1-2) 0.0903 0.3877 0.685 0.4013 0.4866 0.6185

Statistics

Variable west Std Err Minimum Maximum

kidney 0 0.0671 2.34 3.62
kidney 1 0.1221 1.59 4.32
kidney Diff (1-2) 0.1473

T-Tests

Variable Method Variances DF t Value Pr > |t|

kidney Pooled Equal 42 2.63 0.0118
kidney Satterthwaite Unequal 35.1 2.78 0.0086
kidney Cochran Unequal . 2.78 0.0109

Equality of Variances

Variable Method Num DF Den DF F Value Pr > F

kidney Folded F 23 19 3.97 0.0034

F 3.9749 = (.5983722^2)/(.3001298^2) and p<.0034 reject the null hypothesis of equal
variances. Thus, individual sample standard deviations need to be used to compute the adjusted
t, and either Satterthwaite's or the Cochran-Cox approximation should be used in computing
the p-value. See the following computations.

$t' = \frac{3.006 - 2.6183}{\sqrt{\frac{.3001^2}{20} + \frac{.5984^2}{24}}} = 2.78187$,

$c = \frac{s_1^2/n_1}{s_1^2/n_1 + s_2^2/n_2} = \frac{.3001^2/20}{.3001^2/20 + .5984^2/24} = .2318$, and

$df_{Satterthwaite} = \frac{(n_1-1)(n_2-1)}{(n_1-1)(1-c)^2 + (n_2-1)c^2} = \frac{(20-1)(24-1)}{(20-1)(1-.2318)^2 + (24-1)(.2318)^2} = 35.1071$


The t statistic 2.78 rejects the null hypothesis of no difference in mean death rates between the
two regions (p<.0086).

Now, let us compare death rates from bladder and kidney cancers using SAS.

DATA masil.smoking3;
SET masil.smoking;
death = bladder; bla_kid ='Bladder'; OUTPUT;
death = kidney; bla_kid ='Kidney'; OUTPUT;
KEEP bla_kid death;
RUN;

PROC TTEST COCHRAN DATA=masil.smoking3; CLASS bla_kid; VAR death; RUN;

The TTEST Procedure

Statistics

Lower CL Upper CL Lower CL Upper CL
Variable bla_kid N Mean Mean Mean Std Dev Std Dev Std Dev Std Err

death Bladder 44 3.8278 4.1211 4.4145 0.7972 0.9649 1.2226 0.1455
death Kidney 44 2.6367 2.7945 2.9524 0.4289 0.5191 0.6577 0.0783
death Diff (1-2) 0.9982 1.3266 1.655 0.6743 0.7748 0.9107 0.1652


T-Tests

Variable Method Variances DF t Value Pr > |t|

death Pooled Equal 86 8.03 <.0001
death Satterthwaite Unequal 66 8.03 <.0001
death Cochran Unequal 43 8.03 <.0001


Equality of Variances

Variable Method Num DF Den DF F Value Pr > F

death Folded F 43 43 3.46 <.0001

Fortunately, the t-tests under equal and unequal variance in this case lead to the same conclusion
at the .01 level; that is, the means of the two death rates are not the same.


5.3 T-test in SPSS

Like SAS, SPSS also reports t statistics for cases of both equal and unequal variance. Note that
Levene's F 5.466 rejects the null hypothesis of equal variance at the .05 level (p<.024).

T-TEST GROUPS = west(0 1)
/VARIABLES = kidney
/MISSING = ANALYSIS
/CRITERIA = CI(.95) .


6. One-way ANOVA, GLM, and Regression

The t-test is a special case of one-way ANOVA. Thus, one-way ANOVA produces equivalent
results to those of the t-test. ANOVA examines mean differences using the F statistic, whereas
the t-test reports the t statistic. The one-way ANOVA (t-test), GLM, and linear regression
present essentially the same things in different ways.


6.1 One-way ANOVA

Consider the following ANOVA procedure. The CLASS statement is used to specify
categorical variables. The MODEL statement lists the variable to be compared and a grouping
variable, separating them with an equal sign.

PROC ANOVA DATA=masil.smoking;
CLASS smoke;
MODEL lung=smoke;
RUN;

The ANOVA Procedure

Dependent Variable: lung

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 1 313.0311273 313.0311273 28.85 <.0001
Error 42 455.6804273 10.8495340
Corrected Total 43 768.7115545

R-Square Coeff Var Root MSE lung Mean

0.407215 16.75995 3.293863 19.65318

Source DF Anova SS Mean Square F Value Pr > F
smoke 1 313.0311273 313.0311273 28.85 <.0001

The STATA .anova and .oneway commands also conduct one-way ANOVA.

. anova lung smoke

                  Number of obs =      44     R-squared     =  0.4072
                  Root MSE      = 3.29386     Adj R-squared =  0.3931

         Source |  Partial SS    df       MS           F     Prob > F
     -----------+----------------------------------------------------
          Model |  313.031127     1  313.031127      28.85     0.0000
                |
          smoke |  313.031127     1  313.031127      28.85     0.0000
                |
       Residual |  455.680427    42   10.849534
     -----------+----------------------------------------------------
          Total |  768.711555    43  17.8770129

In SPSS, the ONEWAY command is used.

ONEWAY lung BY smoke
/MISSING ANALYSIS .


6.2 Generalized Linear Model (GLM)

The SAS GLM and MIXED procedures and the SPSS UNIANOVA command also report the F
statistic for one-way ANOVA. Note that STATA's .glm command does not perform one-way
ANOVA.

PROC GLM DATA=masil.smoking;
CLASS smoke;
MODEL lung=smoke /SS3;
RUN;

The GLM Procedure

Dependent Variable: lung

Sum of
Source DF Squares Mean Square F Value Pr > F

Model 1 313.0311273 313.0311273 28.85 <.0001
Error 42 455.6804273 10.8495340
Corrected Total 43 768.7115545


R-Square Coeff Var Root MSE lung Mean

0.407215 16.75995 3.293863 19.65318


Source DF Type III SS Mean Square F Value Pr > F

smoke 1 313.0311273 313.0311273 28.85 <.0001

The MIXED procedure is used in much the same way as the GLM procedure. Its output is omitted here.

PROC MIXED; CLASS smoke; MODEL lung=smoke; RUN;

In SPSS, the UNIANOVA command estimates univariate ANOVA models using the GLM
method.

UNIANOVA lung BY smoke
/METHOD = SSTYPE(3)
/INTERCEPT = INCLUDE
/CRITERIA = ALPHA(.05)
/DESIGN = smoke .


6.3 Regression

The SAS REG procedure, STATA .regress command, and SPSS REGRESSION command
estimate linear regression models.

PROC REG DATA=masil.smoking;
MODEL lung=smoke;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: lung

Number of Observations Read 44
Number of Observations Used 44

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 313.03113 313.03113 28.85 <.0001
Error 42 455.68043 10.84953
Corrected Total 43 768.71155


Root MSE 3.29386 R-Square 0.4072
Dependent Mean 19.65318 Adj R-Sq 0.3931
Coeff Var 16.75995


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 16.98591 0.70225 24.19 <.0001
smoke 1 5.33455 0.99314 5.37 <.0001

Look at the results above. The intercept 16.9859 is the mean of the first group (smoke=0).
The coefficient of smoke is, in fact, the mean difference between the two groups with its
sign reversed (5.33455 = 22.3205 - 16.9859). Finally, the standard error of the coefficient is the
denominator of the independent sample t-test,

$.99314 = s_{pool}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 3.2939\sqrt{\frac{1}{22} + \frac{1}{22}}$,

where the pooled variance estimate is 10.8497 = 3.2939^2 (see section 4). Thus, the t 5.37 is
identical in magnitude to the t statistic of the independent sample t-test with equal variance.
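
The equivalence can be checked numerically on any two-group data set. The Python sketch below uses hypothetical toy values (not the smoking data) and fits the regression by hand, so every variable name in it is illustrative.

```python
import math
from statistics import mean, variance

# Hypothetical toy data, for illustration only (not from the smoking data set).
group0 = [12.0, 15.5, 14.0, 17.5, 16.0]          # outcome where the dummy is 0
group1 = [19.0, 22.5, 20.0, 24.5, 21.0, 23.0]    # outcome where the dummy is 1

x = [0] * len(group0) + [1] * len(group1)
y = group0 + group1
n = len(y)

# OLS estimates for y = b0 + b1*x.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the slope from the residual mean square.
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(rss / (n - 2) / sxx)

# Denominator of the pooled-variance t-test, computed from the groups directly.
n0, n1 = len(group0), len(group1)
s_pool = math.sqrt(((n0 - 1) * variance(group0) + (n1 - 1) * variance(group1)) / (n - 2))
se_ttest = s_pool * math.sqrt(1 / n0 + 1 / n1)

print(b0, mean(group0))                    # intercept equals the mean of group 0
print(b1, mean(group1) - mean(group0))     # slope equals the mean difference
print(se_b1, se_ttest)                     # the two standard errors coincide
```

Each printed pair should agree up to floating-point rounding, which is why the regression t for the dummy and the equal-variance t-test give the same magnitude.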

The STATA .regress command is quite simple. Note that a dependent variable precedes a list
of independent variables.

. regress lung smoke

      Source |       SS       df       MS              Number of obs =      44
-------------+------------------------------           F(  1,    42) =   28.85
       Model |  313.031127     1  313.031127           Prob > F      =  0.0000
    Residual |  455.680427    42   10.849534           R-squared     =  0.4072
-------------+------------------------------           Adj R-squared =  0.3931
       Total |  768.711555    43  17.8770129           Root MSE      =  3.2939

------------------------------------------------------------------------------
        lung |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   5.334545   .9931371     5.37   0.000     3.330314    7.338777
       _cons |   16.98591    .702254    24.19   0.000      15.5687    18.40311
------------------------------------------------------------------------------

The SPSS REGRESSION command looks complicated compared to the SAS REG procedure
and STATA .regress command.

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT lung
/METHOD=ENTER smoke.

Note that ANOVA, GLM, and regression report the same F (1, 42) 28.85, which is equivalent
to t (42) -5.3714. As long as the numerator has one degree of freedom, F is always t^2
(28.85 = (-5.3714)^2).
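
The identity can be verified from the numbers already reported. This short Python check recomputes F from the ANOVA sums of squares and t from the regression coefficient and its standard error; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom from the ANOVA table for lung by smoke.
ss_model, df_model = 313.0311273, 1
ss_error, df_error = 455.6804273, 42
f_stat = (ss_model / df_model) / (ss_error / df_error)

# Coefficient and standard error of smoke from the .regress output.
coef, std_err = 5.334545, 0.9931371
t_stat = coef / std_err

print(round(f_stat, 2), round(t_stat, 4), round(t_stat ** 2, 2))
```

The sign of t depends on which group is subtracted from which, but squaring removes it, so F is the same either way.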


7. Conclusion

The t-test is a basic statistical method for examining the mean difference between two groups.
One-way ANOVA can compare means of more than two groups. The number of observations
in individual groups does not matter in the t-test or one-way ANOVA; both balanced and
unbalanced data are fine. One-way ANOVA, GLM, and linear regression models all use the
variance-covariance structure in their analysis, but present the results in different ways.

Researchers must check four issues when performing t-tests. First, a variable to be tested
should be interval or ratio so that its mean is substantively meaningful. Do not, for example,
run a t-test to compare the mean of skin colors (white=0, yellow=1, black=2) between two
countries. If you have a latent variable measured by several Likert-scaled manifest variables,
first run a factor analysis to get that latent variable.

Second, examine the normality assumption before conducting a t-test. It is awkward to
compare means of variables that are not normally distributed. Figure 2 illustrates a normal
probability distribution on top and a Poisson distribution skewed to the right on the bottom.
Although the two distributions have the same mean and variance of 1, a comparison of their
means is not likely to be substantively interpretable. This is the rationale for conducting
normality tests such as the Shapiro-Wilk W, Shapiro-Francia W, and Kolmogorov-Smirnov D
tests. If the normality assumption is violated, try to use nonparametric methods.

Figure 2. Comparing Normal and Poisson Probability Distributions ($\sigma^2$=1 and $\mu$=1)




Third, check the equal variance assumption. You should be careful when comparing means of
normally distributed variables with different variances. You may conduct the folded form F test.
If the equal variance assumption is violated, compute the adjusted t and an approximation of the
degrees of freedom.

Finally, consider the types of t-tests, data arrangement, and functionalities available in each
statistical software package (e.g., STATA, SAS, and SPSS) to determine the best strategy for data
analysis (Table 3). The first data arrangement in Figure 1 is commonly used for independent
sample t-tests, whereas the second arrangement is appropriate for a paired sample test. Keep in
mind that the type II data sets in Figure 1 need to be reshaped into type I in SAS and SPSS.

Table 3. Comparison of T-test Functionalities of STATA, SAS and SPSS
                          STATA 9.0                SAS 9.1                  SPSS 13.0
Test for equal variance   Bartlett's chi-squared   Folded form F            Levene's weighted F
                          (.oneway command)        (TTEST procedure)        (T-TEST command)
Approximation of the      Satterthwaite's DF       Satterthwaite's DF       Satterthwaite's DF
degrees of freedom (DF)   Welch's DF               Cochran-Cox DF
Second Data Arrangement   var1=var2                Reshaping the data set   Reshaping the data set
Aggregate Data            .ttesti command          FREQ option              N/A

SAS has several procedures (e.g., TTEST, MEANS, and UNIVARIATE) and useful options for
t-tests. The STATA .ttest and .ttesti commands provide very flexible ways of handling
different data arrangements and aggregate data. Table 4 summarizes usages of options in these
two commands.

Table 4. Summary of the Usages of the .ttest and .ttesti Command Options
                                 Usage       by(group var)   unequal   welch   unpaired*
Univariate sample                var=c
Paired (dependent) sample        var1=var2
Equal variance (1 variable)      var         O
Equal variance (2 variables)**   var1=var2                                     O
Unequal variance (1 variable)    var         O               O         O
Unequal variance (2 variables)   var1=var2                   O         O       O
* The .ttesti command does not allow the unpaired option.
** The var1=var2 assumes the second type of data arrangement in Figure 1.

Appendix: Data Set

Literature: Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract:
Geographic Variations in the United States," Journal of the National Cancer Institute, 41(5):
1205-1211.

Data Source: http://lib.stat.cmu.edu

The data are per capita numbers of cigarettes smoked (sold) by 43 states and the District of
Columbia in 1960 together with death rates per 100 thousand people from various forms of
cancer. The variables used in this document are,

cigar =number of cigarettes smoked (hds per capita)
bladder =deaths per 100k people from bladder cancer
lung =deaths per 100k people from lung cancer
kidney =deaths per 100k people from kidney cancer
leukemia =deaths per 100k people from leukemia
smoke =1 for those whose cigarette consumption is larger than the median and 0 otherwise.
west =1 for states in the South or West and 0 for those in the North, East or Midwest.

The following are summary statistics and normality tests of these variables.

. sum cigar-leukemia

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       cigar |        44    24.91409    5.573286         14       42.4
     bladder |        44    4.121136    .9649249       2.86       6.54
        lung |        44    19.65318    4.228122      12.01      27.27
      kidney |        44    2.794545    .5190799       1.59       4.32
    leukemia |        44    6.829773    .6382589        4.9       8.28


. sfrancia cigar-leukemia

                  Shapiro-Francia W' test for normal data
    Variable |    Obs       W'          V'        z       Prob>z
-------------+--------------------------------------------------
       cigar |     44    0.93061      3.258     2.203    0.01381
     bladder |     44    0.94512      2.577     1.776    0.03789
        lung |     44    0.97809      1.029     0.055    0.47823
      kidney |     44    0.97732      1.065     0.120    0.45217
    leukemia |     44    0.97269      1.282     0.474    0.31759


. tab west smoke

           |         smoke
      west |         0          1 |     Total
-----------+----------------------+----------
         0 |         7         13 |        20
         1 |        15          9 |        24
-----------+----------------------+----------
     Total |        22         22 |        44

References

Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract: Geographic
Variations in the United States," Journal of the National Cancer Institute, 41(5): 1205-
1211.
Ott, R. Lyman. 1993. An Introduction to Statistical Methods and Data Analysis. Belmont, CA:
Duxbury Press.
SAS Institute. 2005. SAS/STAT User's Guide, Version 9.1. Cary, NC: SAS Institute.
SPSS. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.
STATA Press. 2005. STATA Reference Manual Release 9. College Station, TX: STATA Press.
Walker, Glenn A. 2002. Common Statistical Methods for Clinical Research with SAS
Examples. Cary, NC: SAS Institute.


Acknowledgements

I am grateful to Jeremy Albright, Takuya Noguchi, and Kevin Wilhite at the UITS Center for
Statistical and Mathematical Computing, Indiana University, who provided valuable comments
and suggestions.


Revision History

2003. First draft
2004. Second draft
2005. Third draft (Added data arrangements and conclusion).

2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 1
http://www.indiana.edu/~statmath
Regression Models for Event Count Data
Using SAS, STATA, and LIMDEP

Hun Myoung Park

This document summarizes regression models for event count data and illustrates how to
estimate individual models using SAS, STATA, and LIMDEP. Example models were tested in SAS
9.1, STATA 9.0, and LIMDEP 8.0.

1. Introduction
2. The Poisson Regression Model (PRM)
3. The Negative Binomial Regression Model (NBRM)
4. The Zero-Inflated Poisson Regression Model (ZIP)
5. The Zero-Inflated Negative Binomial Regression Model (ZINB)
6. Conclusion
7. Appendix


1. Introduction

An event count is the realization of a nonnegative integer-valued random variable (Cameron and
Trivedi 1998). Examples are the number of car accidents per month, thunder storms per year, and
wild fires per year. The ordinary least squares (OLS) method for event count data results in
biased, inefficient, and inconsistent estimates (Long 1997). Thus, researchers have developed
various nonlinear models that are based on the Poisson distribution and negative binomial
distribution.


1.1 Count Data Regression Models

The left-hand side (LHS) of the equation has event count data. Independent variables are, as in
the OLS, located at the right-hand side (RHS). These RHS variables may be interval, ratio, or
binary (dummy). Table 1 below summarizes the categorical dependent variable regression
models (CDVMs) according to the level of measurement of the dependent variable.

Table 1. Ordinary Least Squares and CDVMs
Model                          Dependent (LHS)           Method               Independent (RHS)
OLS    Ordinary least squares  Interval or ratio         Moment based method  A linear function of
CDVMs  Binary response         Binary (0 or 1)           Maximum likelihood   interval/ratio or binary
       Ordinal response        Ordinal (1st, 2nd, 3rd)   method               variables,
       Nominal response        Nominal (A, B, C)                              $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots$
       Event count data        Count (0, 1, 2, 3)

The Poisson regression model (PRM) and negative binomial regression model (NBRM) are basic
models for count data analysis. Either the zero-inflated Poisson (ZIP) or the zero-inflated
negative binomial regression model (ZINB) is used when there are many zero counts. Other
count models are developed to handle censored, truncated, or sample selected count data. This
document, however, focuses on the PRM, NBRM, ZIP, and ZINB.


1.2 Poisson Models versus Negative Binomial Models

The Poisson probability distribution, $P(y|\mu) = \frac{e^{-\mu}\mu^{y}}{y!}$, has the same mean and
variance (equidispersion), $Var(y) = E(y) = \mu$. As the mean of a Poisson distribution increases, the
probability of zeros decreases and the distribution approximates a normal distribution (Figure 1).
The Poisson distribution also has the strong assumption that events are independent. Thus, this
distribution does not fit well if $\mu$ differs across observations (heterogeneity) (Long 1997).
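
A short Python sketch can make equidispersion and the shrinking share of zeros concrete. It evaluates the Poisson probability function for the four means shown in Figure 1; the function name and the truncation of the support at 60 are choices made for this illustration.

```python
import math

def poisson_pmf(y, mu):
    """P(y | mu) = exp(-mu) * mu**y / y!"""
    return math.exp(-mu) * mu ** y / math.factorial(y)

for mu in (0.5, 1, 2, 5):
    p_zero = poisson_pmf(0, mu)
    # Mean and variance summed over a range that holds virtually all the mass.
    support = range(0, 60)
    m = sum(y * poisson_pmf(y, mu) for y in support)
    v = sum((y - m) ** 2 * poisson_pmf(y, mu) for y in support)
    print(mu, round(p_zero, 4), round(m, 4), round(v, 4))
```

For each mean the computed mean and variance coincide, and the probability of a zero count, exp(-mu), falls as the mean rises.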

The Poisson regression model (PRM) incorporates observed heterogeneity into the Poisson
distribution function, $Var(y_i|x_i) = E(y_i|x_i) = \mu_i = \exp(x_i\beta)$. As $\mu$ increases, the conditional
variance of y increases, the proportion of predicted zeros decreases, and the distribution around
the expected value becomes approximately normal (Long 1997). The conditional mean of the
errors is zero, but the variance of the errors is a function of independent variables,
$Var(\varepsilon|x) = \exp(x\beta)$. The errors are heteroscedastic. Thus, the PRM rarely fits in practice due to
overdispersion (Long 1997; Maddala 1983).

Figure 1. Poisson Probability Distribution with Means of .5, 1, 2, and 5


The negative binomial probability distribution is

$P(y_i|x_i) = \frac{\Gamma(y_i + v_i)}{y_i!\,\Gamma(v_i)} \left(\frac{v_i}{v_i + \mu_i}\right)^{v_i} \left(\frac{\mu_i}{v_i + \mu_i}\right)^{y_i}$,

where $\alpha = 1/v$ determines the degree of dispersion and $\Gamma$ is the gamma function.
As the dispersion parameter $\alpha$ increases, the variance of the negative binomial distribution also
increases, $Var(y_i|x_i) = \mu_i(1 + \mu_i/v_i)$. The negative binomial regression model (NBRM)
incorporates observed and unobserved heterogeneity into the conditional mean,
$\mu_i = \exp(x_i\beta + \varepsilon_i)$ (Long 1997). Thus, the conditional variance of y becomes larger than its
conditional mean, $E(y_i|x_i) = \mu_i$, which remains unchanged. Figure 2 illustrates how the
probabilities for small and larger counts increase in the negative binomial distribution as the
conditional variance of y increases, given $\mu = 3$.
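
The behavior described for Figure 2 can be reproduced numerically. The Python sketch below assumes the probability function written with v = 1/alpha, works on the log scale with math.lgamma to avoid overflow, and truncates the support at 2000, which is a choice made for this illustration.

```python
import math

def negbin_pmf(y, mu, alpha):
    """Negative binomial probability with mean mu and dispersion alpha (v = 1/alpha)."""
    v = 1 / alpha
    log_p = (math.lgamma(y + v) - math.lgamma(y + 1) - math.lgamma(v)
             + v * math.log(v / (v + mu)) + y * math.log(mu / (v + mu)))
    return math.exp(log_p)

mu = 3
for alpha in (0.01, 0.5, 1, 5):
    support = range(0, 2000)
    probs = [negbin_pmf(y, mu, alpha) for y in support]
    mean_y = sum(y * p for y, p in zip(support, probs))
    var_y = sum((y - mean_y) ** 2 * p for y, p in zip(support, probs))
    # The last column is the theoretical variance mu * (1 + alpha * mu).
    print(alpha, round(probs[0], 4), round(mean_y, 3), round(var_y, 3), mu * (1 + alpha * mu))
```

The mean stays at 3 for every alpha, while the variance and the probability of a zero count both grow with alpha. With alpha near zero the probabilities are close to those of a Poisson distribution with the same mean.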

Figure 2. Negative Binomial Probability Distribution with Alpha of .01, .5, 1, and 5


The PRM and NBRM, however, have the same mean structure. If $\alpha = 0$, the NBRM reduces to
the PRM (Cameron and Trivedi 1998; Long 1997).


1.3 Overdispersion

When $Var(y_i|x_i) > E(y_i|x_i)$, we are said to have overdispersion. Estimates of a PRM for
overdispersed data are unbiased, but inefficient with standard errors biased downward (Cameron
and Trivedi 1998; Long 1997). The likelihood ratio test is developed to examine the null
hypothesis of no overdispersion, $H_0: \alpha = 0$. The likelihood ratio follows the chi-squared
distribution with one degree of freedom, $LR = 2(\ln L_{NB} - \ln L_{Poisson}) \sim \chi^2(1)$. If the null
hypothesis is rejected, the NBRM is preferred to the PRM.

Zero-inflated models handle overdispersion by changing the mean structure to explicitly model
the production of zero counts (Long 1997). These models assume two latent groups. One is the
always-zero group and the other is the not-always-zero or sometime-zero group. Thus, zero
counts come from the former group and some of the latter group with a certain probability.

The likelihood ratio, $LR = 2(\ln L_{ZINB} - \ln L_{ZIP}) \sim \chi^2(1)$, tests $H_0: \alpha = 0$ to compare the ZIP
and ZINB. The PRM and ZIP, as well as the NBRM and ZINB, cannot, however, be compared by this
likelihood ratio, since the models in each pair are not nested. The Vuong statistic compares these
non-nested models. If V is greater than 1.96, the ZIP or ZINB is favored. If V is less than -1.96,
the PRM or NBRM is preferred (Long 1997).
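
As an illustration of how V is formed, the Python sketch below assumes the common definition of the Vuong statistic: the mean of the per-observation log-likelihood differences between the two models, divided by their standard deviation and multiplied by the square root of the sample size. The log-likelihood values are hypothetical, and the sample standard deviation (n-1 divisor) is an assumption; some references divide by n.

```python
import math
from statistics import mean, stdev

def vuong_statistic(loglik_model1, loglik_model2):
    """V from per-observation log-likelihoods of two non-nested models on the same data."""
    m = [l1 - l2 for l1, l2 in zip(loglik_model1, loglik_model2)]
    return math.sqrt(len(m)) * mean(m) / stdev(m)

# Hypothetical per-observation log-likelihoods, for illustration only.
ll_zip = [-0.9, -1.1, -0.4, -2.0, -0.7, -1.3, -0.5, -1.6]
ll_prm = [-1.2, -1.0, -0.9, -2.6, -0.8, -1.9, -0.6, -2.1]

v = vuong_statistic(ll_zip, ll_prm)
print(round(v, 2), "favors ZIP" if v > 1.96 else "favors PRM" if v < -1.96 else "inconclusive")
```

A value between -1.96 and 1.96 favors neither model, which is why the sketch reports that case separately.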


1.4 Estimation in SAS, STATA, and LIMDEP

The SAS GENMOD procedure estimates Poisson and negative binomial regression models.
STATA has individual commands (e.g., .poisson and .nbreg) for the corresponding count data
models. LIMDEP has Poisson$ and Negbin$ commands to estimate various count data models
including zero-inflated and zero-truncated models. Table 2 summarizes the procedures and
commands for count data regression models.

Table 2. Comparison of the Procedures and Commands for Count Data Models
Model                                     SAS 9.1   STATA 9.0   LIMDEP 8.0
Poisson Regression (PRM)                  GENMOD    .poisson    Poisson$
Negative Binomial Regression (NBRM)       GENMOD    .nbreg      Negbin$
Zero-Inflated Poisson (ZIP)               -         .zip        Poisson; Zip; Rh2$
Zero-Inflated Negative Binomial (ZINB)    -         .zinb       Negbin; Zip; Rh2$
Zero-Truncated Poisson (ZTP)              -         .ztp        Poisson; Truncation$
Zero-Truncated Negative Binomial (ZTNB)   -         .ztnb       Negbin; Truncation$

The example here examines how waste quotas (emps) and the strictness of policy implementation
(strict) affect the frequency of waste spill accidents of plants (accident).


1.5 Long and Freese's SPost Module

STATA users may take advantage of user-written modules such as SPost, written by J. Scott
Long and Jeremy Freese. The module allows researchers to conduct follow-up analyses of
various CDVMs including event count data models. See 2.3 for examples of major SPost
commands.

In order to install SPost, execute the following commands consecutively. For more details, visit J.
Scott Long's Web site at http://www.indiana.edu/~jslsoc/spost_install.htm.

. net from http://www.indiana.edu/~jslsoc/stata/

. net install spost9_ado, replace

. net get spost9_do, replace


2. The Poisson Regression Model

The SAS GENMOD procedure, STATA .poisson command, and LIMDEP Poisson$
command estimate the Poisson regression model (PRM).


2.1 PRM in SAS

SAS has the GENMOD procedure for the PRM. The /DIST=POISSON option tells SAS to use
the Poisson distribution.

PROC GENMOD DATA = masil.accident;
MODEL accident=emps strict /DIST=POISSON LINK=LOG;
RUN;

The GENMOD Procedure

Model Information

Data Set COUNT.WASTE
Distribution Poisson
Link Function Log
Dependent Variable Accident
Observations Used 778


Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 775 2827.2079 3.6480
Scaled Deviance 775 2827.2079 3.6480
Pearson Chi-Square 775 4944.9473 6.3806
Scaled Pearson X2 775 4944.9473 6.3806
Log Likelihood -667.2291

Algorithm converged.


Analysis Of Parameter Estimates

Standard Wald 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 0.3901 0.0467 0.2986 0.4816 69.84 <.0001
Emps 1 0.0054 0.0007 0.0040 0.0069 53.13 <.0001
Strict 1 -0.7042 0.0668 -0.8350 -0.5733 111.25 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

You will need to run a restricted model without regressors in order to conduct the likelihood ratio
test for goodness-of-fit, $LR = 2(\ln L_{Unrestricted} - \ln L_{Restricted}) \sim \chi^2(J)$, where J is the difference in
the number of regressors between the unrestricted and restricted models. The chi-squared
statistic is 124.8218 = 2 * [-667.2291 - (-729.6400)] (p<.0001).
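
The likelihood ratio arithmetic can also be checked outside SAS. The following is a minimal Python sketch and not part of the original SAS session. The function names lr_statistic and chi2_tail are illustrative, and the tail probability uses closed forms that hold only for one or two degrees of freedom.

```python
import math

def lr_statistic(ll_unrestricted, ll_restricted):
    """Likelihood ratio statistic: 2 * (lnL_unrestricted - lnL_restricted)."""
    return 2.0 * (ll_unrestricted - ll_restricted)

def chi2_tail(stat, df):
    """Upper-tail chi-squared probability (closed forms for df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch handles only df = 1 or df = 2")

# Log likelihoods reported by PROC GENMOD for the full and intercept-only models
lr = lr_statistic(-667.2291, -729.6400)
p_value = chi2_tail(lr, df=2)   # J = 2 regressors (emps, strict)
print(round(lr, 4), p_value)
```

The p-value is far below any conventional significance level, which is why it is reported as p<.0001 rather than as an exact figure.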

PROC GENMOD DATA = masil.accident;
MODEL accident= /DIST=POISSON LINK=LOG;
RUN;

The GENMOD Procedure

Model Information

Data Set MASIL.ACCIDENT
Distribution Poisson
Link Function Log
Dependent Variable accident


Number of Observations Read 778
Number of Observations Used 778


Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 777 2952.0297 3.7993
Scaled Deviance 777 2952.0297 3.7993
Pearson Chi-Square 777 4919.9745 6.3320
Scaled Pearson X2 777 4919.9745 6.3320
Log Likelihood -729.6400

Algorithm converged.


Analysis Of Parameter Estimates

Standard Wald 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 0.3168 0.0306 0.2568 0.3768 107.20 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.


2.2 PRM in STATA

STATA has the .poisson command for the PRM. This command provides likelihood ratio and
Pseudo R² statistics.

. poisson accident emps strict

Iteration 0:   log likelihood = -1821.5112
Iteration 1:   log likelihood = -1821.5101
Iteration 2:   log likelihood = -1821.5101

Poisson regression                                Number of obs   =        778
                                                  LR chi2(2)      =     124.82
                                                  Prob > chi2     =     0.0000
Log likelihood = -1821.5101                       Pseudo R2       =     0.0331

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        emps |   .0054186   .0007434     7.29   0.000     .0039615    .0068757
      strict |  -.7041664   .0667619   -10.55   0.000    -.8350174   -.5733154
       _cons |   .3900961   .0466787     8.36   0.000     .2986076    .4815846
------------------------------------------------------------------------------

Let us run a restricted model and then run the .display command in order to double check that
the likelihood ratio for goodness-of-fit is 124.8218.


. poisson accident

Iteration 0:   log likelihood =  -1883.921
Iteration 1:   log likelihood =  -1883.921

Poisson regression                                Number of obs   =        778
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood =  -1883.921                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .3168165   .0305995    10.35   0.000     .2568426    .3767904
------------------------------------------------------------------------------

. display 2 * (-1821.5101 - (-1883.921))
124.8218


2.3 Using the SPost Module in STATA

The SPost module provides useful commands for follow-up analyses of various categorical
dependent variable models. The .fitstat command calculates various goodness-of-fit statistics
such as log likelihood, McFadden's R² (or Pseudo R²), Akaike Information Criterion (AIC), and
Bayesian Information Criterion (BIC).

. quietly poisson accident emps strict

. fitstat

Measures of Fit for poisson of accident

Log-Lik Intercept Only:    -1883.921     Log-Lik Full Model:        -1821.510
D(775):                     3643.020     LR(2):                       124.822
                                         Prob > LR:                     0.000
McFadden's R2:                 0.033     McFadden's Adj R2:             0.032
Maximum Likelihood R2:         0.148     Cragg & Uhler's R2:            0.149
AIC:                           4.690     AIC*n:                      3649.020
BIC:                       -1515.943     BIC':                       -111.508
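
The .fitstat figures can be reproduced by hand from the two log likelihoods. The sketch below is in Python rather than STATA. It assumes the usual definitions of these measures (for example, D = -2 ln L and BIC = D - df * ln N). Those formulas are inferred from the output above and agree with it, but they are not taken from the SPost source code.

```python
import math

n = 778                      # number of observations
k = 3                        # estimated parameters: emps, strict, intercept
ll_full = -1821.5101         # log likelihood of the full model
ll_null = -1883.921          # log likelihood of the intercept-only model

deviance = -2.0 * ll_full                       # D(775)
lr = 2.0 * (ll_full - ll_null)                  # LR(2)
mcfadden = 1.0 - ll_full / ll_null              # McFadden's R2 (Pseudo R2)
mcfadden_adj = 1.0 - (ll_full - k) / ll_null    # McFadden's adjusted R2
ml_r2 = 1.0 - math.exp(-lr / n)                 # maximum likelihood R2
cragg_uhler = ml_r2 / (1.0 - math.exp(2.0 * ll_null / n))
aic = (deviance + 2 * k) / n                    # AIC per observation
bic = deviance - (n - k) * math.log(n)          # BIC based on the deviance
bic_prime = -lr + (k - 1) * math.log(n)         # BIC' based on the LR statistic

for label, value in [("D", deviance), ("LR", lr), ("McFadden", mcfadden),
                     ("McFadden adj", mcfadden_adj), ("ML R2", ml_r2),
                     ("Cragg-Uhler", cragg_uhler), ("AIC", aic),
                     ("AIC*n", aic * n), ("BIC", bic), ("BIC'", bic_prime)]:
    print(f"{label:13s} {value:12.3f}")
```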

The .listcoef command lists unstandardized coefficients (parameter estimates), factor and
percent changes, and standardized coefficients to help interpret regression results.

. listcoef, help

poisson (N=778): Factor Change in Expected Count

Observed SD: 2.9482675

----------------------------------------------------------------------
    accident |      b         z     P>|z|    e^b    e^bStdX      SDofX
-------------+--------------------------------------------------------
        emps |   0.00542    7.289   0.000   1.0054   1.2297    38.1548
      strict |  -0.70417  -10.547   0.000   0.4945   0.7031     0.5003
----------------------------------------------------------------------
       b = raw coefficient
       z = z-score for test of b=0
   P>|z| = p-value for z-test
     e^b = exp(b) = factor change in expected count for unit increase in X
 e^bStdX = exp(b*SD of X) = change in expected count for SD increase in X
   SDofX = standard deviation of X
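
The factor changes in the .listcoef table follow directly from the legend: e^b = exp(b) and e^bStdX = exp(b * SD of X). A minimal Python sketch of that arithmetic, using the coefficients and standard deviations printed above (the helper name factor_changes is illustrative):

```python
import math

def factor_changes(b, sd_x):
    """Return (factor change, percent change, factor change per SD of X)."""
    factor = math.exp(b)
    percent = 100.0 * (factor - 1.0)
    factor_sd = math.exp(b * sd_x)
    return factor, percent, factor_sd

# (coefficient, standard deviation of X) taken from the output above
estimates = {"emps": (0.0054186, 38.1548), "strict": (-0.7041664, 0.5003)}

results = {name: factor_changes(b, sd) for name, (b, sd) in estimates.items()}
for name, (factor, percent, factor_sd) in results.items():
    print(f"{name:7s} e^b={factor:.4f}  %change={percent:7.2f}  e^bStdX={factor_sd:.4f}")
```

Read this way, the strict policy cuts the expected count of accidents roughly in half, holding the waste quota constant.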

The .prtab command constructs a table of predicted values (events) for all combinations of
categorical variables listed. The following example shows that the predicted number of accidents
under the strict policy is .9172 at the mean waste quota (emps=42.0129).

. prtab strict

poisson: Predicted rates for accident

----------------------
   strict | Prediction
----------+-----------
        0 |     1.8547
        1 |     0.9172
----------------------

          emps     strict
x=   42.012853  .50771208
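
The predictions in the .prtab table are the exponentiated linear predictor, exp(xb), evaluated at the mean of emps and at each value of strict. A short Python sketch of that calculation, using the .poisson estimates from 2.2 (the function name predicted_rate is illustrative):

```python
import math

# Coefficients from the .poisson output in 2.2
B_CONS, B_EMPS, B_STRICT = 0.3900961, 0.0054186, -0.7041664
EMPS_MEAN = 42.012853        # sample mean of emps

def predicted_rate(emps, strict):
    """Expected number of accidents under the PRM: exp(xb)."""
    return math.exp(B_CONS + B_EMPS * emps + B_STRICT * strict)

rate_lenient = predicted_rate(EMPS_MEAN, 0)
rate_strict = predicted_rate(EMPS_MEAN, 1)
print(round(rate_lenient, 4), round(rate_strict, 4))
```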

The .prvalue command lists predicted values for a given set of values for the independent variables. For
example, the predicted probability of a zero count is .3996 at the mean waste quota under the
strict policy (strict=1). Note that the predicted rate of .917 is equivalent to .9172 in the .prtab
output above.

. prvalue, x(strict=1) maxcnt(5)

poisson: Predictions for accident

Predicted rate: .917      95% CI [.827 , 1.02]

Predicted probabilities:

  Pr(y=0|x):  0.3996   Pr(y=1|x):  0.3665
  Pr(y=2|x):  0.1681   Pr(y=3|x):  0.0514
  Pr(y=4|x):  0.0118   Pr(y=5|x):  0.0022

          emps     strict
x=   42.012853          1
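
The predicted probabilities are Poisson probabilities evaluated at the predicted rate, Pr(y|x) = exp(-mu) * mu^y / y!. The Python sketch below takes mu = .9172 from the output as given and is only a check of the arithmetic, not a substitute for .prvalue.

```python
import math

mu = 0.9172      # predicted rate at the mean of emps with strict=1

def poisson_prob(y, rate):
    """Poisson probability of observing exactly y events."""
    return math.exp(-rate) * rate ** y / math.factorial(y)

probs = [poisson_prob(y, mu) for y in range(6)]   # y = 0, 1, ..., 5 (maxcnt(5))
for y, p in enumerate(probs):
    print(f"Pr(y={y}|x): {p:.4f}")
```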

The most useful command is .prchange, which calculates marginal effects (changes) and
discrete changes. For instance, a one standard deviation change in waste quota centered on its
mean (the -+sd/2 column) will increase the expected number of accidents by .3841 under the
lenient policy (strict=0).

. prchange, x(strict=0)

poisson: Changes in Predicted Rate for accident

         min->max      0->1     -+1/2    -+sd/2  MargEfct
  emps     2.3070    0.0080    0.0101    0.3841    0.0101
strict    -0.9375   -0.9375   -1.3332   -0.6568   -1.3060

exp(xb):   1.8547

           emps   strict
    x=  42.0129        0
sd(x)=  38.1548  .500262
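
The discrete and centered changes can be verified as differences between two predicted rates. The following Python sketch assumes the centered columns (-+1/2 and -+sd/2) compare predictions half a unit, or half a standard deviation, below and above the mean. That reading reproduces the figures above, but it is inferred from the output rather than taken from the SPost documentation.

```python
import math

B_CONS, B_EMPS, B_STRICT = 0.3900961, 0.0054186, -0.7041664
EMPS_MEAN, EMPS_SD = 42.012853, 38.1548

def rate(emps, strict):
    """Expected count under the PRM: exp(xb)."""
    return math.exp(B_CONS + B_EMPS * emps + B_STRICT * strict)

# 0->1: discrete change when strict moves from 0 to 1 at the mean of emps
change_strict = rate(EMPS_MEAN, 1) - rate(EMPS_MEAN, 0)

# -+1/2 and -+sd/2: centered changes in emps under the lenient policy
change_unit = rate(EMPS_MEAN + 0.5, 0) - rate(EMPS_MEAN - 0.5, 0)
change_sd = rate(EMPS_MEAN + EMPS_SD / 2, 0) - rate(EMPS_MEAN - EMPS_SD / 2, 0)

print(round(change_strict, 4), round(change_unit, 4), round(change_sd, 4))
```

For the binary variable strict only the 0->1 column has a natural interpretation.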

SPost also includes the .prgen command, which computes a series of predictions by holding all
variables but one constant and allowing that variable to vary (Long and Freese 2003). These
SPost commands work with most categorical and count data models such
as .logit, .probit, .poisson, .nbreg, .zip, and .zinb.


2.4 PRM in LIMDEP

The LIMDEP Poisson$ command estimates the PRM. LIMDEP reports log likelihoods of both
the unrestricted and restricted models. Keep in mind that you must include ONE for the
intercept.

POISSON;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT$

+---------------------------------------------+
| Poisson Regression |
| Maximum Likelihood Estimates |
| Model estimated: Aug 24, 2005 at 04:56:45PM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1821.510 |
| Restricted log likelihood -1883.921 |
| Chi squared 124.8218 |
| Degrees of freedom 2 |
| Prob[ChiSqd > value] = .0000000 |
| Chi- squared = 4944.94781 RsqP= -.0051 |
| G - squared = 2827.20794 RsqD= .0423 |
| Overdispersion tests: g=mu(i) : 4.720 |
| Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3900961420 .46678663E-01 8.357 .0000
EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853
STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

SAS, STATA, and LIMDEP produce almost the same parameter estimates and standard errors
(Table 3). The log likelihood in SAS is different from that of STATA and LIMDEP (-667.2291
versus -1821.5101). This difference seems to come from the generalized linear model that the
GENMOD procedure uses. These log likelihoods are, however, equivalent in the sense that they
result in the same likelihood ratio.

Table 3. Summary of the Poisson Regression Model in SAS, STATA, and LIMDEP
Model                                   SAS 9.1      STATA 9.0     LIMDEP 8.0
Intercept                               .3901        .3901         .3901
                                        (.0467)      (.0467)       (.0467)
EMPS                                    .0054        .0054         .0054
                                        (.0007)      (.0007)       (.0007)
STRICT                                  -.7042       -.7042        -.7042
                                        (.0668)      (.0668)       (.0668)
Log Likelihood (unrestricted)           -667.2291    -1821.5101    -1821.510
Log Likelihood (restricted)             -729.6400    -1883.921     -1883.921
Likelihood Ratio for Goodness-of-fit    124.8218     124.82        124.8218
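
A quick way to see why the two sets of log likelihoods give the same likelihood ratio is to compare their differences. The Python sketch below shows that the SAS and STATA values differ by the same constant in both the unrestricted and the restricted model. One possible explanation, stated here as an assumption rather than something the output confirms, is that GENMOD omits a term that does not depend on the parameters (for the Poisson model, the sum of ln(y!)).

```python
# Log likelihoods reported for the Poisson regression model
sas = {"unrestricted": -667.2291, "restricted": -729.6400}
stata = {"unrestricted": -1821.5101, "restricted": -1883.921}

# Difference between the two packages for each model
offset = {model: sas[model] - stata[model] for model in sas}

# A constant offset cancels out of the likelihood ratio
lr_sas = 2.0 * (sas["unrestricted"] - sas["restricted"])
lr_stata = 2.0 * (stata["unrestricted"] - stata["restricted"])

print(offset)
print(round(lr_sas, 4), round(lr_stata, 4))
```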


3. The Negative Binomial Regression Model

The SAS GENMOD procedure, STATA .nbreg command, and LIMDEP Negbin$ command
estimate the negative binomial regression model (NBRM).

3.1 NBRM in SAS

The GENMOD procedure estimates the NBRM using the /DIST=NEGBIN option. Note that the
dispersion parameter is equivalent to the alpha in STATA and LIMDEP.

PROC GENMOD DATA = masil.accident;
MODEL accident=emps strict /DIST=NEGBIN LINK=LOG;
RUN;

The GENMOD Procedure

Model Information

Data Set COUNT.WASTE
Distribution Negative Binomial
Link Function Log
Dependent Variable Accident
Observations Used 778

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 775 589.7752 0.7610
Scaled Deviance 775 589.7752 0.7610
Pearson Chi-Square 775 845.6033 1.0911
Scaled Pearson X2 775 845.6033 1.0911
Log Likelihood 37.5628

Algorithm converged.

Analysis Of Parameter Estimates

Standard Wald 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 0.3851 0.1278 0.1345 0.6357 9.07 0.0026
Emps 1 0.0052 0.0023 0.0008 0.0096 5.29 0.0214
Strict 1 -0.6703 0.1671 -0.9978 -0.3427 16.09 <.0001
Dispersion 1 3.9554 0.3501 3.3254 4.7048

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.

The restricted model produces a log likelihood of 28.8627. Thus, the likelihood ratio for
goodness-of-fit is 17.4002 = 2 * (37.5628 - 28.8627) (p<.00017).

PROC GENMOD DATA = masil.accident;
MODEL accident= /DIST=NEGBIN LINK=LOG;
RUN;

The likelihood ratio for overdispersion is 1409.5838 =2 * (37.5628 - (-667.2291)).

3.2 NBRM in STATA

STATA has the .nbreg command for the NBRM. The command reports three log likelihood
statistics: for the PRM, restricted NBRM (constant-only model), and unrestricted NBRM (full
model), which make it easy to conduct likelihood ratio tests.

. nbreg accident emps strict

Fitting comparison Poisson model:

Iteration 0:   log likelihood = -1821.5112
Iteration 1:   log likelihood = -1821.5101
Iteration 2:   log likelihood = -1821.5101

Fitting constant-only model:

Iteration 0:   log likelihood = -1256.6761
Iteration 1:   log likelihood = -1152.6155
Iteration 2:   log likelihood = -1125.6643
Iteration 3:   log likelihood = -1125.4183
Iteration 4:   log likelihood = -1125.4183

Fitting full model:

Iteration 0:   log likelihood = -1117.1731
Iteration 1:   log likelihood = -1116.7201
Iteration 2:   log likelihood = -1116.7182
Iteration 3:   log likelihood = -1116.7182

Negative binomial regression                      Number of obs   =        778
                                                  LR chi2(2)      =      17.40
                                                  Prob > chi2     =     0.0002
Log likelihood = -1116.7182                       Pseudo R2       =     0.0077

------------------------------------------------------------------------------
    accident |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        emps |   .0051981   .0022595     2.30   0.021     .0007694    .0096267
      strict |  -.6702548   .1671191    -4.01   0.000    -.9978021   -.3427074
       _cons |   .3851111   .1278468     3.01   0.003      .134536    .6356861
-------------+----------------------------------------------------------------
    /lnalpha |    1.37509   .0885176                      1.201599    1.548582
-------------+----------------------------------------------------------------
       alpha |   3.955434   .3501257                       3.32543    4.704793
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0:  chibar2(01) = 1409.58 Prob>=chibar2 = 0.000
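
STATA estimates the log of the dispersion parameter (/lnalpha) and then transforms it back to alpha. The Python sketch below reproduces the alpha row from the /lnalpha row. The standard error calculation assumes a delta-method approximation, se(alpha) = alpha * se(ln alpha), which matches the reported value but is an inference from the output rather than a statement from the STATA manual.

```python
import math

ln_alpha = 1.37509                    # /lnalpha estimate
se_ln_alpha = 0.0885176               # standard error of /lnalpha
ci_ln_alpha = (1.201599, 1.548582)    # 95% confidence interval of /lnalpha

alpha = math.exp(ln_alpha)
se_alpha = alpha * se_ln_alpha                          # delta-method approximation
ci_alpha = tuple(math.exp(bound) for bound in ci_ln_alpha)

print(round(alpha, 4), round(se_alpha, 4), [round(b, 4) for b in ci_alpha])
```

Exponentiating the interval endpoints keeps the interval for alpha strictly positive, which is why it is not symmetric around 3.9554.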

The restricted model or constant-only model gives us a log likelihood of -1125.4183. Thus, the
likelihood ratio for goodness-of-fit is 17.4002 = 2 * [-1116.7182 - (-1125.4183)] (p<.00017).
The p-value is computed as follows (note that .disp or .di is an abbreviation of .display).

. disp chi2tail(2, 17.4002)
.00016657

The likelihood ratio test for overdispersion results in a chi-squared of 1409.5838 (p<.0001) and
rejects the null hypothesis of alpha=0. The statistically significant evidence of overdispersion
indicates that the NBRM is preferred to the PRM.

. di 2 * (-1116.7182 - (-1821.5101))
1409.5838

The p-value of the likelihood ratio for overdispersion is computed as,

. di chi2tail(1, 1409.5838)
1.74e-308

Now, let us calculate marginal effects (or changes) at the means of independent variables. You
should read the discrete change labeled 0->1 for the binary variable strict, since its
marginal change at the mean (.5077) is meaningless.

. prchange

nbreg: Changes in Predicted Rate for accident

         min->max      0->1     -+1/2    -+sd/2  MargEfct
  emps     1.5326    0.0055    0.0068    0.2585    0.0068
strict    -0.8931   -0.8931   -0.8885   -0.4383   -0.8721

exp(xb):   1.3011

           emps   strict
    x=  42.0129  .507712
sd(x)=  38.1548  .500262

3.3 NBRM in LIMDEP

LIMDEP has the Negbin$ command for the NBRM that reports the PRM as well. Note that the
standard errors of parameter estimates are slightly different from those of SAS and STATA. The
Marginal Effects$ and the Means$ subcommands compute marginal effects at the mean of
independent variables. You may not omit the Means$ subcommand.

NEGBIN;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT;
Marginal Effects;
Means$

+---------------------------------------------+
| Poisson Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 08, 2005 at 09:35:36AM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1821.510 |
| Restricted log likelihood -1883.921 |
| Chi squared 124.8218 |
| Degrees of freedom 2 |
| Prob[ChiSqd > value] = .0000000 |
| Chi- squared = 4944.94781 RsqP= -.0051 |
| G - squared = 2827.20794 RsqD= .0423 |
| Overdispersion tests: g=mu(i) : 4.720 |
| Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3900961420 .46678663E-01 8.357 .0000
EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853
STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Negative Binomial Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 08, 2005 at 09:35:36AM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1116.718 |
| Restricted log likelihood -1821.510 |
| Chi squared 1409.584 |
| Degrees of freedom 1 |
| Prob[ChiSqd > value] = .0000000 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3851110699 .12855240 2.996 .0027
EMPS .5198057234E-02 .22602075E-02 2.300 .0215 42.012853
STRICT -.6702547660 .16729839 -4.006 .0001 .50771208
Dispersion parameter for count data model
Alpha 3.955434012 .35680876 11.086 .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+-------------------------------------------+
| Partial derivatives of expected val. with |
| respect to the vector of characteristics. |
| They are computed at the means of the Xs. |
| Observations used for means are All Obs. |
| Conditional Mean at Sample Point 1.3011 |
| Scale Factor for Marginal Effects 1.3011 |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .5010628939 .19396434 2.583 .0098
EMPS .6763123170E-02 .29746591E-02 2.274 .0230 42.012853
STRICT -.8720595665 .22469308 -3.881 .0001 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Read the coefficients (.0068 and -.8721) to confirm that they are identical to the corresponding
marginal effects calculated in STATA.
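
The agreement between the two packages can be traced to a single scale factor. In a model with a log link the marginal effect of a regressor is its coefficient multiplied by the conditional mean exp(xb), which LIMDEP labels "Scale Factor for Marginal Effects". A minimal Python sketch using the NBRM estimates and sample means shown above (variable names are illustrative):

```python
import math

# NBRM coefficients and sample means from the output above
coef = {"_cons": 0.3851111, "emps": 0.0051981, "strict": -0.6702548}
means = {"emps": 42.012853, "strict": 0.50771208}

# Linear predictor and conditional mean evaluated at the sample means
xb = coef["_cons"] + sum(coef[name] * means[name] for name in means)
scale = math.exp(xb)

# Marginal effect at the means: coefficient times the conditional mean
marginal = {name: coef[name] * scale for name in means}

print(round(scale, 4), {name: round(value, 4) for name, value in marginal.items()})
```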

SAS, STATA, and LIMDEP produce almost the same parameter estimates and goodness-of-fit
statistics (Table 4). Note that SAS reports different log likelihoods, but the same likelihood ratio.

Table 4. Summary of the Negative Binomial Regression Model in SAS, STATA, and LIMDEP
Model                                   SAS 9.1      STATA 9.0     LIMDEP 8.0
Intercept                               .3851        .3851         .3851
                                        (.1278)      (.1278)       (.1286)
EMPS                                    .0052        .0052         .0052
                                        (.0023)      (.0023)       (.0023)
STRICT                                  -.6703       -.6703        -.6703
                                        (.1671)      (.1671)       (.1673)
Dispersion Parameter (Alpha)            3.9554       3.9554        3.9554
                                        (.3501)      (.3501)       (.3568)
Log Likelihood (unrestricted)           37.5628      -1116.7182    -1116.718
Log Likelihood (restricted)             28.8627      -1125.4183    -1125.418*
Likelihood Ratio for Goodness-of-fit    17.4002      17.40         17.4002
Likelihood Ratio for Overdispersion     1409.5838    1409.5838     1409.5838
* LIMDEP mistakenly reports the log likelihood of the unrestricted Poisson regression model.

The following plot compares the PRM and NBRM. Look at the predictions for zero counts of the
two models. As the likelihood ratio test indicates, the NBRM seems to fit these data better than
the PRM.

Figure 3. Comparison of the Poisson and Negative Binomial Regression Models

4. The Zero-Inflated Poisson Regression Model

STATA and LIMDEP have commands for the zero-inflated Poisson regression model (ZIP).


4.1 ZIP in STATA (.zip)

STATA has the .zip command to estimate the ZIP. The inflate() option specifies a list of
variables that determines whether the observed count is zero. The vuong option computes the
Vuong statistic to compare the ZIP and PRM.

. zip accident emps strict, inflate(emps strict) vuong

Fitting constant-only model:

Iteration 0:   log likelihood = -1627.0779
Iteration 1:   log likelihood = -1309.5825
Iteration 2:   log likelihood =  -1272.433
Iteration 3:   log likelihood = -1270.9543
Iteration 4:   log likelihood = -1270.9523
Iteration 5:   log likelihood = -1270.9523

Fitting full model:

Iteration 0:   log likelihood = -1270.9523
Iteration 1:   log likelihood = -1269.7219
Iteration 2:   log likelihood = -1269.7206
Iteration 3:   log likelihood = -1269.7206

Zero-inflated Poisson regression                  Number of obs   =        778
                                                  Nonzero obs     =        280
                                                  Zero obs        =        498

Inflation model = logit                           LR chi2(2)      =       2.46
Log likelihood  = -1269.721                       Prob > chi2     =     0.2918

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
accident     |
        emps |   -.000277   .0008633    -0.32   0.748     -.001969     .001415
      strict |  -.0923911   .0729023    -1.27   0.205    -.2352771    .0504948
       _cons |   1.361978   .0493222    27.61   0.000     1.265308    1.458647
-------------+----------------------------------------------------------------
inflate      |
        emps |  -.0109897   .0022678    -4.85   0.000    -.0154344    -.006545
      strict |   1.057031   .1767509     5.98   0.000     .7106059    1.403457
       _cons |    .488656   .1211099     4.03   0.000     .2512849     .726027
------------------------------------------------------------------------------
Vuong test of zip vs. standard Poisson:            z =     8.40  Pr>z = 0.0000

The restricted model is estimated with the intercept only.

. zip accident, inflate(emps strict)

The Vuong statistic at the bottom compares the ZIP and PRM. Since V = 8.40 is greater than
1.96, we conclude that the ZIP is preferred to the PRM.


4.2 ZIP in LIMDEP

The LIMDEP Poisson$ command needs to have the Zip and Rh2 subcommands. The Rh2 is
equivalent to the inflate() option in STATA. The Alg=Newton$ subcommand is needed to use
the Newton-Raphson algorithm because the default Broyden algorithm failed to converge (see
footnote 1).

POISSON;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT;
ZIP;
Rh2=ONE,EMPS,STRICT;
Alg=Newton$

+---------------------------------------------+
| Poisson Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 06, 2005 at 00:25:07PM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1821.510 |
| Restricted log likelihood -1883.921 |
| Chi squared 124.8218 |
| Degrees of freedom 2 |
| Prob[ChiSqd > value] = .0000000 |
| Chi- squared = 4944.94781 RsqP= -.0051 |
| G - squared = 2827.20794 RsqD= .0423 |
| Overdispersion tests: g=mu(i) : 4.720 |
| Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3900961420 .46678663E-01 8.357 .0000
EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853
STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)


Normal exit from iterations. Exit status=0.

+----------------------------------------------------------------------+
| Zero Altered Poisson Regression Model |
| Logistic distribution used for splitting model. |
| ZAP term in probability is F[tau x Z(i) ] |
| Comparison of estimated models |
| Pr[0|means] Number of zeros Log-likelihood |
| Poisson .27329 Act.= 498 Prd.= 212.6 -1821.51007 |
| Z.I.Poisson .64642 Act.= 498 Prd.= 502.9 -1259.88568 |
| Note, the ZIP log-likelihood is not directly comparable. |
| ZIP model with nonzero Q does not encompass the others. |
| Vuong statistic for testing ZIP vs. unaltered model is 9.5740 |
| Distributed as standard normal. A value greater than |
| +1.96 favors the zero altered Z.I.Poisson model. |
| A value less than -1.96 rejects the ZIP model. |
+----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Poisson/NB/Gamma regression model
Constant 1.361977491 .23944641E-01 56.880 .0000
EMPS -.2770010575E-03 .37770090E-03 -.733 .4633 42.012853
STRICT -.9239125073E-01 .33326502E-01 -2.772 .0056 .50771208
Zero inflation model
Constant .4886559537 .12210013 4.002 .0001
EMPS -.1098971050E-01 .22152492E-02 -4.961 .0000 42.012853
STRICT 1.057031399 .17715551 5.967 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Footnote 1: If you get a warning message of "Error: 806: Line search does not improve fn. Exit
iterations. Status=3" or "Error: 805: Initial iterations cannot improve function. Status=3", you may
change the optimization algorithm or increase the maximum number of iterations (e.g., Maxit=1000$).
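
The Pr[0|means] column in the comparison box can be reproduced from the two sets of coefficients. In the ZIP a zero arises either from the "always zero" regime, with logit probability psi, or from the Poisson regime, so Pr(y=0) = psi + (1 - psi) * exp(-mu). The Python sketch below evaluates both models at the sample means. It also assumes that the Prd. column is simply N times Pr[0|means], which matches the reported figures but is an inference from the output.

```python
import math

means = {"emps": 42.012853, "strict": 0.50771208}

# Coefficients from the outputs above (rounded as STATA reports them)
prm = {"_cons": 0.3900961, "emps": 0.0054186, "strict": -0.7041664}
zip_count = {"_cons": 1.361978, "emps": -0.000277, "strict": -0.0923911}
zip_inflate = {"_cons": 0.488656, "emps": -0.0109897, "strict": 1.057031}

def index(coef, x):
    """Linear predictor: intercept plus coefficient-weighted regressors."""
    return coef["_cons"] + sum(coef[name] * value for name, value in x.items())

def prm_prob_zero(x):
    return math.exp(-math.exp(index(prm, x)))

def zip_prob_zero(x):
    psi = 1.0 / (1.0 + math.exp(-index(zip_inflate, x)))   # Pr(always zero)
    mu = math.exp(index(zip_count, x))                     # Poisson mean
    return psi + (1.0 - psi) * math.exp(-mu)

p_prm = prm_prob_zero(means)
p_zip = zip_prob_zero(means)
n = 778
print(round(p_prm, 5), round(p_zip, 5), round(n * p_prm, 1), round(n * p_zip, 1))
```

With 498 observed zeros, the ZIP prediction is far closer than the PRM prediction, in line with the Vuong statistic.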

In order to estimate the restricted model, run the following command with only ONE in the
Rhs subcommand. The Rh2 subcommand remains unchanged.

POISSON;
Lhs=ACCIDENT;
Rhs=ONE;
ZIP; Alg=Newton;
Rh2=ONE,EMPS,STRICT$

Table 5 summarizes parameter estimates and goodness-of-fit statistics for the zero-inflated
Poisson model. STATA and LIMDEP report the same parameter estimates, but they produce
different standard errors and log likelihoods. In particular, LIMDEP returned a suspicious log
likelihood for the restricted model, and thus ended up with the implausible negative likelihood
ratio of -.0304. In addition, the Vuong statistics in STATA and LIMDEP are different.

Table 5. Summary of the Zero-Inflated Poisson Regression Model in STATA and LIMDEP
Model                                   SAS 9.1      STATA 9.0     LIMDEP 8.0
Intercept                                            1.3620        1.3620
                                                     (.0493)       (.0239)
EMPS                                                 -.0003        -.0003
                                                     (.0009)       (.0004)
STRICT                                               -.0924        -.0924
                                                     (.0729)       (.0333)
Intercept (Zero-inflated)                            .4887         .4887
                                                     (.1211)       (.1221)
EMPS (Zero-inflated)                                 -.0110        -.0110
                                                     (.0023)       (.0022)
STRICT (Zero-inflated)                               1.0570        1.0570
                                                     (.1768)       (.1772)
Log Likelihood (unrestricted)                        -1269.7206    -1259.8857
Log Likelihood (restricted)                          -1270.9523    -1259.8705
Likelihood Ratio for Goodness-of-fit                 2.46          -.0304
Vuong Statistic (ZIP versus PRM)                     8.40          9.5740
5. The Zero-Inflated NB Regression Model

STATA and LIMDEP can estimate the zero-inflated negative binomial regression model (ZINB).


5.1 ZINB in STATA (.zinb)

The STATA .zinb command estimates the ZINB. The vuong option computes the Vuong
statistic to compare the ZINB and NBRM.

. zinb accident emps strict, inflate(emps strict) vuong

Fitting constant-only model:

Iteration 0:   log likelihood = -1190.5117  (not concave)
Iteration 1:   log likelihood = -1106.9874
Iteration 2:   log likelihood = -1098.8642
Iteration 3:   log likelihood = -1095.3638
Iteration 4:   log likelihood = -1094.0237
Iteration 5:   log likelihood =  -1093.063
Iteration 6:   log likelihood = -1092.6216
Iteration 7:   log likelihood =  -1091.798
Iteration 8:   log likelihood = -1091.7332
Iteration 9:   log likelihood = -1091.7329
Iteration 10:  log likelihood = -1091.7329

Fitting full model:

Iteration 0:   log likelihood = -1091.7329
Iteration 1:   log likelihood = -1089.5565
Iteration 2:   log likelihood = -1089.5198
Iteration 3:   log likelihood = -1089.5198

Zero-inflated negative binomial regression        Number of obs   =        778
                                                  Nonzero obs     =        280
                                                  Zero obs        =        498

Inflation model = logit                           LR chi2(2)      =       4.43
Log likelihood  =  -1089.52                       Prob > chi2     =     0.1094

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
accident     |
        emps |  -.0004407   .0020554    -0.21   0.830    -.0044691    .0035877
      strict |  -.3251317   .1659173    -1.96   0.050    -.6503235    .0000602
       _cons |   .7763065   .1508037     5.15   0.000     .4807367    1.071876
-------------+----------------------------------------------------------------
inflate      |
        emps |  -.2087768   .0955122    -2.19   0.029    -.3959772   -.0215763
      strict |   7.562388   3.055775     2.47   0.013     1.573179     13.5516
       _cons |   .1032115   .3800045     0.27   0.786    -.6415835    .8480065
-------------+----------------------------------------------------------------
    /lnalpha |   .9252514   .1351387     6.85   0.000     .6603845    1.190118
-------------+----------------------------------------------------------------
       alpha |   2.522502   .3408876                      1.935536     3.28747
------------------------------------------------------------------------------
Vuong test of zinb vs. standard negative binomial: z =     4.13  Pr>z = 0.0000

The likelihood ratio statistic, 360.4024 = 2*(-1089.5198 - (-1269.721)), rejects the null hypothesis of no
overdispersion (p<.0001), indicating that the ZINB improves goodness-of-fit over the ZIP.
The Vuong statistic, 4.13 > 1.96, suggests that the ZINB is preferred to the NBRM.
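
The likelihood ratio arithmetic above can be reproduced with a few lines of Python (not one of the packages covered here). This is a minimal sketch: the function name `lr_test_1df` is hypothetical, and treating the statistic as chi-squared with one degree of freedom (the single extra dispersion parameter) is an assumption of the sketch, not something the software output states.

```python
import math

def lr_test_1df(ll_unrestricted, ll_restricted):
    """Return the likelihood ratio statistic and its chi-squared(1) p-value."""
    lr = 2.0 * (ll_unrestricted - ll_restricted)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value

# Log likelihoods quoted in the text: ZINB (unrestricted) and ZIP (restricted).
lr, p = lr_test_1df(-1089.5198, -1269.721)
print(round(lr, 4), p < 0.0001)
```

The same function applied to the LIMDEP log likelihoods (-1089.5198 and -1259.8857) gives the 340.7318 reported in Table 6.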


5.2 ZINB in LIMDEP

The LIMDEP Negbin$ command needs to have the Zip and Rh2 subcommands for the ZINB.
The following command produces the Poisson regression model, negative binomial model, and
zero-inflated negative binomial model. You may omit the Alg=Newton$ subcommand.

NEGBIN;
Lhs=ACCIDENT;
Rhs=ONE,EMPS,STRICT; Rh2=ONE,EMPS,STRICT;
ZIP; Alg=Newton$

+---------------------------------------------+
| Poisson Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 10, 2005 at 00:20:00AM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 8 |
| Log likelihood function -1821.510 |
| Restricted log likelihood -1883.921 |
| Chi squared 124.8218 |
| Degrees of freedom 2 |
| Prob[ChiSqd > value] = .0000000 |
| Chi- squared = 4944.94781 RsqP= -.0051 |
| G - squared = 2827.20794 RsqD= .0423 |
| Overdispersion tests: g=mu(i) : 4.720 |
| Overdispersion tests: g=mu(i)^2: 4.253 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3900961420 .46678663E-01 8.357 .0000
EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853
STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Negative Binomial Regression |
| Maximum Likelihood Estimates |
| Model estimated: Sep 10, 2005 at 00:20:00AM.|
| Dependent variable ACCIDENT |
| Weighting variable None |
| Number of observations 778 |
| Iterations completed 12 |
| Log likelihood function -1116.718 |
| Restricted log likelihood -1821.510 |
| Chi squared 1409.584 |
| Degrees of freedom 1 |
| Prob[ChiSqd > value] = .0000000 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant .3851110482 .12855240 2.996 .0027
EMPS .5198057322E-02 .22602075E-02 2.300 .0215 42.012853
STRICT -.6702547787 .16729839 -4.006 .0001 .50771208
Dispersion parameter for count data model
Alpha 3.955434128 .35680877 11.086 .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Normal exit from iterations. Exit status=0.

+----------------------------------------------------------------------+
| Zero Altered Neg.Binomial Regression Model |
| Logistic distribution used for splitting model. |
| ZAP term in probability is F[tau x Z(i) ] |
| Comparison of estimated models |
| Pr[0|means] Number of zeros Log-likelihood |
| Poisson .27329 Act.= 498 Prd.= 212.6 -1821.51007 |
| Neg. Bin. .32470 Act.= 498 Prd.= 252.6 -1116.71820 |
| Z.I.Neg_Bin .62918 Act.= 498 Prd.= 489.5 -1089.51977 |
| Note, the ZIP log-likelihood is not directly comparable. |
| ZIP model with nonzero Q does not encompass the others. |
| Vuong statistic for testing ZIP vs. unaltered model is 4.1270 |
| Distributed as standard normal. A value greater than |
| +1.96 favors the zero altered Z.I.Neg_Bin model. |
| A value less than -1.96 rejects the ZIP model. |
+----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Poisson/NB/Gamma regression model
Constant .7763063017 .15178042 5.115 .0000
EMPS -.4407244013E-03 .20262626E-02 -.218 .8278 42.012853
STRICT -.3251315411 .16179883 -2.009 .0445 .50771208
Dispersion parameter
Alpha 2.522502810 .29924002 8.430 .0000
Zero inflation model
Constant .1032103951 .37413759 .276 .7827
EMPS -.2087767804 .68774937E-01 -3.036 .0024 42.012853
STRICT 7.562389399 2.2216392 3.404 .0007 .50771208
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

To estimate the restricted model, run the following command. You have to use the
Alg=Newton$ subcommand to get the restricted model to converge.

Negbin;
Lhs=ACCIDENT;
Rhs=ONE; Rh2=ONE,EMPS,STRICT;
ZIP; Alg=Newton$

Table 6 summarizes parameter estimates and goodness-of-fit statistics for the zero-inflated
negative binomial regression model. STATA and LIMDEP report the same results except for
the standard errors and the likelihood ratio for overdispersion.

Table 6. Summary of the Zero-Inflated NBRM in STATA and LIMDEP
Model                                   SAS 9.1   STATA 9.0     LIMDEP 8.0
Intercept                                         .7763         .7763
                                                  (.1508)       (.1518)
EMPS                                              -.0004        -.0004
                                                  (.0021)       (.0020)
STRICT                                            -.3251        -.3251
                                                  (.1659)       (.1618)
Intercept (Zero-inflated)                         .1032         .1032
                                                  (.3800)       (.3741)
EMPS (Zero-inflated)                              -.2088        -.2088
                                                  (.0955)       (.0688)
STRICT (Zero-inflated)                            7.5624        7.5624
                                                  (3.0558)      (2.2216)
Dispersion Parameter (Alpha)                      2.5225        2.5225
                                                  (.3409)       (.2992)
Log Likelihood (unrestricted)                     -1089.5198    -1089.5198
Log Likelihood (restricted)                       -1091.7329    -1091.7329
Likelihood Ratio for Goodness-of-fit              4.43          4.43
Likelihood Ratio for Overdispersion               360.4024      340.7318*
Vuong Statistic (ZINB versus NBRM)                4.13          4.1270
* The likelihood ratio for overdispersion is 340.7318 = 2*(-1089.5198 - (-1259.8857)).

Figure 4. Comparison of the Zero-Inflated PRM and the Zero-Inflated NBRM

6. Conclusion

As with other econometric models, researchers must first examine the data generation process of a
dependent variable to understand its behavior. Careful researchers pay special attention to
excess zeros, censored and/or truncated counts, sample selection, and other particular patterns of
data generation, and then decide which model best describes the data generation process.

The Poisson regression model and negative binomial regression model have the same mean
structure, but they describe the behavior of a dependent variable in different ways. Zero-inflated
regression models integrate two different data generation processes to deal with overdispersion.
Truncated or censored regression models are appropriate when data are (left and/or right)
truncated or censored.

Researchers need to spend more time and effort interpreting the results substantively. Like other
categorical dependent variable models, count data models produce estimates that are difficult to
interpret intuitively. Reporting parameter estimates and goodness-of-fit statistics is not
sufficient. J. Scott Long (1997) and Long and Freese (2003) provide good examples of
meaningful count data model interpretations.

Regarding statistical software, I would recommend STATA for general count data models and
LIMDEP for special types of models. Although able to handle various models, LIMDEP does
not seem stable and reliable. The SAS GENMOD procedure estimates the Poisson regression
model and the negative binomial model, but it does not have easy ways of estimating other
models. We encourage SAS Institute to develop a dedicated procedure, say the CLIM (Count
and Limited Dependent Variable Model) procedure, to handle a variety of count data models.

Appendix: Data Set

The data set used here is a part of the data provided for David H. Good's class of the School of
Public and Environmental Affairs, Indiana University. Note that these data have been
manipulated for the sake of data security. The variables in the data set include:

1. emps: the size of the waste quotas
2. strict: strictness of policy implementation (1=strict)
3. accident: the frequency of waste spill accidents of a plant

The following output summarizes descriptive statistics of these variables. Note that the many zero
counts point to an overdispersion problem.


. summarize accident emps strict

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    accident |       778    1.372751    2.948267          0         31
        emps |       778    42.01285     38.1548          1        174
      strict |       778    .5077121    .5002621          0          1



. tab accident strict

           |        strict
  accident |         0          1 |     Total
-----------+----------------------+----------
         0 |       214        284 |       498
         1 |        41         29 |        70
         2 |        38         32 |        70
         3 |        28         13 |        41
         4 |        16         13 |        29
         5 |        10          3 |        13
         6 |        12          7 |        19
         7 |         4          3 |         7
         8 |         4          2 |         6
         9 |         3          2 |         5
        10 |         0          2 |         2
        11 |         3          1 |         4
        12 |         2          0 |         2
        13 |         1          0 |         1
        14 |         1          0 |         1
        15 |         3          0 |         3
        16 |         1          0 |         1
        17 |         0          2 |         2
        18 |         1          1 |         2
        21 |         0          1 |         1
        31 |         1          0 |         1
-----------+----------------------+----------
     Total |       383        395 |       778
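
The overdispersion remark can be checked directly from the frequency table. The following Python sketch (not one of the packages covered here) recomputes the mean and standard deviation of accident from the Total column and compares the observed share of zeros with the share a Poisson distribution with the same mean would imply. Using the variance-to-mean ratio and the Poisson zero probability as informal diagnostics is an assumption of this sketch, not a formal test.

```python
import math

# Frequency table copied from the Total column of `tab accident strict` above.
counts = {0: 498, 1: 70, 2: 70, 3: 41, 4: 29, 5: 13, 6: 19, 7: 7, 8: 6, 9: 5,
          10: 2, 11: 4, 12: 2, 13: 1, 14: 1, 15: 3, 16: 1, 17: 2, 18: 2,
          21: 1, 31: 1}

n = sum(counts.values())
mean = sum(k * f for k, f in counts.items()) / n
# Sample variance with the n - 1 denominator, as in the summarize output.
variance = sum(f * (k - mean) ** 2 for k, f in counts.items()) / (n - 1)

share_zero = counts[0] / n        # observed share of zero counts
poisson_zero = math.exp(-mean)    # zero share implied by a Poisson with this mean

print(n, round(mean, 6), round(math.sqrt(variance), 6))
print(round(variance / mean, 2), round(share_zero, 3), round(poisson_zero, 3))
```

A variance far above the mean and an observed zero share far above the Poisson-implied share are both consistent with the overdispersion and excess zeros discussed in the text.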
References

Allison, Paul D. 1991. Logistic Regression Using the SAS System: Theory and Application. Cary,
NC: SAS Institute.
Cameron, A. Colin, and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New
York: Cambridge University Press.
Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide. Plainview, New
York: Econometric Software.
Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent
Variables Using STATA, 2nd ed. College Station, TX: STATA Press.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables.
Advanced Quantitative Techniques in the Social Sciences. Sage Publications.
Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York:
Cambridge University Press.
SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute.
STATA Press. 2005. STATA Base Reference Manual, Release 9. College Station, TX: STATA
Press.


Acknowledgements

I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and
Mathematical Computing, Indiana University, who provided valuable comments and
suggestions.


Revision History

2003. First draft
2004. Second draft
2005. Third draft (Added LIMDEP examples)
2005 The Trustees of Indiana University (12/10/2005) Linear Regression Model for Panel Data: 1
http://www.indiana.edu/~statmath

Linear Regression Models for Panel Data Using SAS,
STATA, LIMDEP, and SPSS

Hun Myoung Park

This document summarizes linear regression models for panel data and illustrates how to
estimate each model using SAS 9.1, STATA 9.0, LIMDEP 8.0, and SPSS 13.0. This document
does not address nonlinear models (e.g., logit and probit models), but focuses on linear
regression models.

1. Introduction
2. Least Squares Dummy Variable Regression
3. Panel Data Models
4. The Fixed Group Effect Model
5. The Fixed Time Effect Model
6. The Fixed Group and Time Effect Model
7. Random Effect Models
8. The Poolability Test
9. Conclusion


1. Introduction

Panel data are both cross-sectional and longitudinal (time series). Some examples are the cumulative
General Social Survey (GSS) and Current Population Survey (CPS) data. Panel data may have
group effects, time effects, or both. These effects are analyzed by fixed effect and random
effect models.

1.1 Data Arrangement

A panel data set contains observations on n individuals (e.g., firms and states), each measured
at T points in time. In other words, each individual (subjects 1 through n) has T observations
(time periods 1 through T). Thus, the total number of observations is nT. Figure 1 illustrates the
data arrangement of a panel data set.

Figure 1. Data Arrangement of Panel Data
Group   Time   Variable1   Variable2   Variable3
1       1
1       2
...     ...
1       T
2       1
2       2
...     ...
2       T
...     ...
n       1
n       2
...     ...
n       T
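
The long-format arrangement described above can be sketched in Python (not one of the packages covered here). The function name `panel_index` and the dimensions n = 3 and T = 4 are hypothetical, chosen only to show that a balanced panel stacks T rows per group for nT rows in total.

```python
from itertools import product

def panel_index(n_groups, n_periods):
    """Return (group, time) pairs in long format: all periods of group 1, then group 2, ..."""
    groups = range(1, n_groups + 1)
    periods = range(1, n_periods + 1)
    return list(product(groups, periods))

# Hypothetical dimensions chosen only for illustration: n = 3 groups, T = 4 periods.
rows = panel_index(3, 4)
print(len(rows))   # number of rows equals n * T in a balanced panel
print(rows[:5])    # first rows: group 1 for periods 1..4, then group 2
```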

1.2 Fixed Effect versus Random Effect Models

Panel data models examine fixed and/or random effects using dummy variables. The
core difference between fixed and random effect models lies in the role of dummies. If
dummies are considered as a part of the intercept, it is a fixed effect model. In a random effect
model, the dummies act as an error term (see Table 1).

The fixed effect model examines group differences in intercepts, assuming the same slopes and
constant variance across groups. Fixed effect models use least squares dummy variable (LSDV),
within effect, and between effect estimation methods. Thus, ordinary least squares (OLS)
regressions with dummies, in fact, are fixed effect models.

Table 1. Fixed Effect and Random Effect Models
                   Fixed Effect Model                                      Random Effect Model
Functional form*   $y_{it} = (\alpha + \mu_i) + X_{it}'\beta + v_{it}$     $y_{it} = \alpha + X_{it}'\beta + (\mu_i + v_{it})$
Intercepts         Varying across group and/or time                        Constant
Error variances    Constant                                                Varying across group and/or time
Slopes             Constant                                                Constant
Estimation         LSDV, within effect, between effect                     GLS, FGLS
Hypothesis test    Incremental F test                                      Breusch-Pagan LM test
* $v_{it} \sim IID(0, \sigma_v^2)$

The random effect model, by contrast, estimates variance components for groups and error,
assuming the same intercept and slopes. The difference among groups (or time periods) lies in
the variance of the error term. This model is estimated by generalized least squares (GLS) when
the $\Omega$ matrix, a variance structure among groups, is known. The feasible generalized least
squares (FGLS) method is used to estimate the variance structure when $\Omega$ is not known. A
typical example is the groupwise heteroscedastic regression model (Greene 2003). There are
various estimation methods for FGLS including maximum likelihood methods and simulations
(Baltagi and Cheng 1994).

Fixed effects are tested by the (incremental) F test, while random effects are examined by the
Lagrange multiplier (LM) test (Breusch and Pagan 1980). If the null hypothesis is not rejected,
the pooled OLS regression is favored. The Hausman specification test (Hausman 1978)
compares fixed effect and random effect models. Table 1 compares the fixed effect and random
effect models.

Group effect models create dummies using grouping variables (e.g., country, firm, and race). If
one grouping variable is considered, it is called a one-way fixed or random group effects model.
Two-way group effect models have two sets of dummy variables, one for a grouping variable
and the other for a time variable.


1.3 Estimation and Software Issues

LSDV regression, the within effect model, the between effect model (group or time mean
model), GLS, and FGLS are fundamentally based on OLS in terms of estimation. Thus, any
procedure and command for OLS is good for the panel data models.

The REG procedure of SAS/STAT, STATA .regress (.cnsreg), LIMDEP regress$, and
SPSS regression commands all fit LSDV1 dropping one dummy and have options to
suppress the intercept (LSDV2). SAS, STATA, and LIMDEP can estimate OLS with
restrictions (LSDV3), but SPSS cannot. Note that the STATA .cnsreg command requires
the .constraint command that defines a restriction (Table 2).

Table 2. Procedures and Commands in SAS, STATA, LIMDEP, and SPSS
                    SAS 9.1            STATA 9.0      LIMDEP 8.0                        SPSS 13.0
Regression (OLS)    PROC REG           .regress       Regress$                          Regression
LSDV1               w/o a dummy        w/o a dummy    w/o a dummy                       w/o a dummy
LSDV2               /NOINT             noconstant     w/o One in Rhs                    /Origin
LSDV3               RESTRICT           .cnsreg        Cls:                              N/A
Fixed effect        TSCSREG /FIXONE    .xtreg w/ fe   Regress;Panel;Str=;Pds=;Fixed$    N/A
(within effect)     PANEL /FIXONE
Two-way fixed       TSCSREG /FIXTWO    N/A            Regress;Panel;Str=;Pds=;Fixed$    N/A
(within effect)     PANEL /FIXTWO
Between effect      PANEL /BTWNG       .xtreg w/ be   Regress;Panel;Str=;Pds=;Means$    N/A
                    PANEL /BTWNT
Random effect       TSCSREG /RANONE    .xtreg w/ re   Regress;Panel;Str=;Pds=;Random$   N/A
                    PANEL /RANONE
Two-way random      TSCSREG /RANTWO    N/A            Problematic                       N/A
                    PANEL /RANTWO

SAS, STATA, and LIMDEP also provide procedures (commands) that are designed to
estimate panel data models conveniently. SAS/ETS has the TSCSREG and PANEL procedures
to estimate one-way and two-way fixed and random effect models.[1] For the fixed effect model,
these procedures estimate LSDV1, which drops one of the dummy variables. For the random
effects model, they by default use the Fuller-Battese method (1974) to estimate variance
components for group, time, and error. These procedures also support other estimation methods
such as the Parks (1967) autoregressive model and the Da Silva moving average method.

The TSCSREG procedure can handle balanced data only, whereas the PANEL procedure is
able to deal with balanced and unbalanced data. The former provides one-way and two-way
fixed and random effect models, while the latter supports the between effect model and pooled
OLS regression as well. Despite advanced features of PANEL, output from the two procedures
looks alike.

The STATA .xtreg command estimates within effect (fixed effect) models with the fe option,
between effect models with the be option, and random effect models with the re option. This
command, however, does not fit the two-way fixed and random effect models. The LIMDEP
regress$ command with the panel; subcommand estimates panel data models, but this
command is not sufficiently stable. SPSS has limited ability to analyze panel data.

[1] SAS recently announced the PROC PANEL, an experimental procedure, for panel data models.

1.4 Data Sets

This document uses two data sets. The cross-sectional data set contains research and
development (R&D) expenditure data of the top 50 information technology firms presented in
OECD Information Technology Outlook 2004. The panel data set has cost data for U.S. airlines
(1970-1984) from Econometric Analysis (Greene 2003). See the Appendix for the details.



2. Least Squares Dummy Variable Regression

A dummy variable is a binary variable that is coded either 1 or zero. It is commonly used to
examine group and time effects in regression. Consider a simple model of regressing R&D
expenditure in 2002 on 2000 net income and firm type. The dummy variable d1 is set to 1 for
equipment and software firms and zero for telecommunication and electronics. The variable d2
is coded in the opposite way. Take a look at the data structure (Figure 2).

Figure 2. Dummy Variable Coding for Firm Type
+------------------------------------------------------------+
| firm           rnd   income   type              d1    d2   |
|------------------------------------------------------------|
| Samsung      2,500    4,768   Electronics        0     1   |
| AT&T           254    4,669   Telecom            0     1   |
| IBM          4,750    8,093   IT Equipment       1     0   |
| Siemens      5,490    6,528   Electronics        0     1   |
| Verizon          .   11,797   Telecom            0     1   |
| Microsoft    3,772    9,421   Service & S/W      1     0   |
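
The coding rule behind Figure 2 can be sketched in Python (not one of the packages covered here). The firm rows are copied from the figure; the set `EQUIPMENT_AND_SOFTWARE` and the helper `code_dummies` are hypothetical names, and mapping the "IT Equipment" and "Service & S/W" types to d1 = 1 is read off the figure rather than stated as a formal rule in the text.

```python
# Firm types treated as "equipment and software" (d1 = 1), as read from Figure 2.
EQUIPMENT_AND_SOFTWARE = {"IT Equipment", "Service & S/W"}

# (firm, type) pairs copied from Figure 2.
firms = [
    ("Samsung", "Electronics"),
    ("AT&T", "Telecom"),
    ("IBM", "IT Equipment"),
    ("Siemens", "Electronics"),
    ("Verizon", "Telecom"),
    ("Microsoft", "Service & S/W"),
]

def code_dummies(firm_type):
    """Return (d1, d2); d2 is the mirror image of d1, so d1 + d2 == 1."""
    d1 = 1 if firm_type in EQUIPMENT_AND_SOFTWARE else 0
    return d1, 1 - d1

coded = {firm: code_dummies(firm_type) for firm, firm_type in firms}
for firm, (d1, d2) in coded.items():
    print(f"{firm:<10} d1={d1} d2={d2}")
```

Because d1 + d2 equals 1 for every firm, the two dummies together duplicate the intercept column, which is why only one of them can enter a model that also has an intercept.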



2.1 Model 1 without a Dummy Variable

The ordinary least squares (OLS) regression without dummy variables, a pooled regression
model, assumes a constant intercept and slope regardless of firm types. In the following
regression equation, $\beta_0$ is the intercept; $\beta_1$ is the slope of net income in 2000; and
$\varepsilon_i$ is the error term.

Model 1: $R\&D_i = \beta_0 + \beta_1 income_i + \varepsilon_i$

The pooled model has an intercept of 1,482.697 and a slope of .223. For a $1 million increase
in net income, a firm is likely to increase R&D expenditure in 2002 by $.223 million.

. regress rnd income

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  1,    37) =    7.07
       Model |  15902406.5     1  15902406.5           Prob > F      =  0.0115
    Residual |  83261299.1    37  2250305.38           R-squared     =  0.1604
-------------+------------------------------           Adj R-squared =  0.1377
       Total |  99163705.6    38   2609571.2           Root MSE      =  1500.1

------------------------------------------------------------------------------
         rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .2230523   .0839066     2.66   0.012     .0530414    .3930632
       _cons |   1482.697   314.7957     4.71   0.000     844.8599    2120.533
------------------------------------------------------------------------------
Pooled model: R&D =1,482.697 +.223*income
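
As a quick check on how to read this output, the Python sketch below (not one of the packages covered here) recomputes the t statistic of income from its coefficient and standard error and uses the fact that, with a single regressor, the overall F equals the squared t. The helper `predict_rnd_pooled` and the income value 5,000 are hypothetical, chosen for illustration only.

```python
# Estimates copied from the pooled regression output above.
intercept, slope, slope_se = 1482.697, 0.2230523, 0.0839066

def predict_rnd_pooled(income):
    """Predicted 2002 R&D expenditure ($ million) from Model 1."""
    return intercept + slope * income

t_stat = slope / slope_se   # t statistic of the income coefficient
f_stat = t_stat ** 2        # equals the overall F when there is one regressor

# The income value 5,000 ($ million) is a hypothetical input.
print(round(t_stat, 2), round(f_stat, 2), round(predict_rnd_pooled(5000), 1))
```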

Despite moderate goodness of fit statistics such as F and t, this is a naive model. R&D
investment tends to vary across industries.



2.2 Model 2 with a Dummy Variable

You may assume that equipment and software firms have more R&D expenditure than other
types of companies. Let us take this group difference into account.[2] We have to drop one of the
two dummy variables in order to avoid perfect multicollinearity. That is, OLS does not work
with both dummies in a model. The $\delta_1$ in Model 2 is the coefficient that is valid in equipment
and software companies only.

Model 2: $R\&D_i = \beta_0 + \beta_1 income_i + \delta_1 d_{1i} + \varepsilon_i$

Unlike Model 1, this model results in two different regression equations for two groups. The
difference lies in the intercepts, but the slope remains unchanged.

. regress rnd income d1

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
       Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4

------------------------------------------------------------------------------
         rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d1 |   1006.626   479.3717     2.10   0.043     34.41498    1978.837
       _cons |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------
d1=1: R&D = 2,140.205 + .218*income = 1,133.579 + 1,006.626*1 + .218*income
d1=0: R&D = 1,133.579 + .218*income = 1,133.579 + 1,006.626*0 + .218*income

The slope .218 indicates a positive impact of two-year-lagged net income on a firm's R&D
expenditure. Equipment and software firms on average spend $1,007 million more on R&D
than telecommunication and electronics companies.
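
The two group equations follow mechanically from the Model 2 estimates. The Python sketch below (not one of the packages covered here) shows the arithmetic; the function name `predict_rnd_model2` is hypothetical and the coefficients are copied from the output above.

```python
# Estimates copied from the Model 2 output above.
cons, b_income, b_d1 = 1133.579, 0.2180066, 1006.626

def predict_rnd_model2(income, d1):
    """Predicted R&D ($ million); d1 = 1 for equipment and software firms."""
    return cons + b_d1 * d1 + b_income * income

intercept_d1 = predict_rnd_model2(0, 1)   # group intercept when d1 = 1
intercept_d0 = predict_rnd_model2(0, 0)   # group intercept when d1 = 0 (baseline)
print(round(intercept_d1, 3), round(intercept_d0, 3))
```

Whatever income value is plugged in, the gap between the two groups stays at the d1 coefficient, because the model forces a common slope.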


2.3 Visualization of Model 1 and 2

There is only a tiny difference in the slope (.223 versus .218) between Model 1 and Model 2.
The intercept 1,483 of Model 1, however, is quite different from 2,140 for equipment and
software companies and 1,134 for telecommunications and electronics in Model 2. This result
appears to support Model 2.

Figure 3 highlights differences between Model 1 and 2 more clearly. The black line (pooled) in
the middle is the regression line of Model 1; the red line at the top is one for equipment and
software companies (d1=1) in Model 2; finally the blue line at the bottom is for
telecommunication and electronics firms (d2=1 or d1=0).

[2] The dummy variable (firm types) and regressors (net income) may or may not be correlated.


Figure 3. Regression Lines of Model 1 and Model 2

This plot shows that Model 1 ignores the group difference and thus reports a misleading
intercept. The difference in the intercept between the two groups of firms looks substantial.
Moreover, the two models have similar slopes. Consequently, Model 2, which considers fixed
group effects, seems better than the simple Model 1. Compare goodness of fit statistics (e.g., F, t,
$R^2$, and SSE) of the two models. See Sections 3.2.2 and 4.7 for formal hypothesis testing.


2.4 Alternatives to LSDV1

The least squares dummy variable (LSDV) regression is ordinary least squares (OLS) with
dummy variables. The critical issue in LSDV is how to avoid perfect multicollinearity, the
so-called dummy variable trap. LSDV has three approaches to avoid getting caught in the
trap. They produce different parameter estimates of dummies, but their results are equivalent.

The first approach, LSDV1, drops a dummy variable as in Model 2 above. The second
approach, LSDV2, includes all dummies and, in turn, suppresses the intercept. The third
approach, LSDV3, includes the intercept and all dummies, and then imposes a restriction that
the sum of parameters of all dummies is zero. Take a look at the following functional forms to
compare these three LSDVs.

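LSDV1: $R\&D_i = \beta_0 + \beta_1 income_i + \delta_1 d_{1i} + \varepsilon_i$ or $R\&D_i = \beta_0 + \beta_1 income_i + \delta_2 d_{2i} + \varepsilon_i$
LSDV2: $R\&D_i = \beta_1 income_i + \delta_1 d_{1i} + \delta_2 d_{2i} + \varepsilon_i$
LSDV3: $R\&D_i = \beta_0 + \beta_1 income_i + \delta_1 d_{1i} + \delta_2 d_{2i} + \varepsilon_i$, subject to $\delta_1 + \delta_2 = 0$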


The main differences among these approaches exist in the meanings of the dummy variable
parameters. Each approach defines the coefficients of dummy variables in different ways
(Table 3). The parameter estimates in LSDV2 are actual intercepts of groups, making it easy to
interpret substantively. LSDV1 reports differences from the reference point (dropped dummy
variable). LSDV3 computes how far parameter estimates are away from the average group
effect. Accordingly, null hypotheses of t-tests in the three approaches are different. Keep in
mind that the $R^2$ of LSDV2 is not correct. Table 3 contrasts the three LSDVs.

Table 3. Three Approaches of Least Squares Dummy Variable Models
                         LSDV1: Drop one dummy              LSDV2: Suppress the intercept    LSDV3: Impose a restriction
Dummy included           $d_2^a, \ldots, d_d^a$             $d_1^*, \ldots, d_d^*$           $d_1^c, \ldots, d_d^c$
Intercept?               Yes                                No                               Yes
All dummy?               No ($d-1$)                         Yes ($d$)                        Yes ($d$)
Restriction?             No                                 No                               $\sum_i d_i^c = 0$ *
Meaning of coefficient   How far away from the reference    Fixed group effect               How far away from the average
                         point (dropped)?                                                    group effect?
Coefficients             $d_i^* = \alpha^a + d_i^a$,        $d_1^*, d_2^*, \ldots, d_d^*$    $d_i^* = \alpha^c + d_i^c$, where
                         $d_{dropped}^* = \alpha^a$                                          $\alpha^c = \frac{1}{d}\sum_i d_i^*$
$H_0$ of T-test          $d_i^* - d_{dropped}^* = 0$        $d_i^* = 0$                      $d_i^* - \frac{1}{d}\sum_i d_i^* = 0$
Source: David Good's Lecture (2004)
* This restriction reduces the number of parameters to be estimated, making the model identified.
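
The relationships in the Coefficients row can be illustrated numerically. The Python sketch below (not one of the packages covered here) starts from the two group intercepts implied by Model 2 (2,140.205 and 1,133.579) and converts them into the LSDV1 and LSDV3 parameterizations. The function names are hypothetical, and the sketch only rearranges estimates; it does not re-estimate any model.

```python
def lsdv2_to_lsdv1(group_intercepts, dropped):
    """Intercept = effect of the dropped group; others = difference from it."""
    base = group_intercepts[dropped]
    diffs = {g: v - base for g, v in group_intercepts.items() if g != dropped}
    return base, diffs

def lsdv2_to_lsdv3(group_intercepts):
    """Intercept = average group effect; coefficients = deviations (sum to zero)."""
    mean = sum(group_intercepts.values()) / len(group_intercepts)
    return mean, {g: v - mean for g, v in group_intercepts.items()}

# Actual group intercepts (the LSDV2 form) implied by the Model 2 estimates.
lsdv2 = {"d1": 2140.205, "d2": 1133.579}
print(lsdv2_to_lsdv1(lsdv2, dropped="d2"))
print(lsdv2_to_lsdv3(lsdv2))
```

Dropping d1 instead of d2 only changes the baseline and flips the sign of the remaining dummy coefficient; the fitted group intercepts are the same in all three forms.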


2.5 Estimating Three LSDVs

The SAS REG procedure, STATA .regress command, LIMDEP Regress$ command, and
SPSS Regression command all fit OLS and LSDVs. Let us estimate three LSDVs using SAS
and STATA.

2.5.1 LSDV 1 without a Dummy

LSDV 1 drops a dummy variable. The intercept is the actual parameter estimate of the dropped
dummy variable. The coefficient of the dummy included means how far its parameter estimate
is away from the reference point or baseline (i.e., the intercept).

Here we include d2 instead of d1 to see how a different reference point changes the result.
Check the sign of the dummy coefficient included and the intercept. Dropping other dummies
does not make any significant difference.

PROC REG DATA=masil.rnd2002;
MODEL rnd = income d2;
RUN;


The REG Procedure
Model: MODEL1
Dependent Variable: rnd

Number of Observations Read 50
Number of Observations Used 39
Number of Observations with Missing Values 11


Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 24987949 12493974 6.06 0.0054
Error 36 74175757 2060438
Corrected Total 38 99163706


Root MSE 1435.42248 R-Square 0.2520
Dependent Mean 2023.56410 Adj R-Sq 0.2104
Coeff Var 70.93536


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 2140.20468 434.48460 4.93 <.0001
income 1 0.21801 0.08032 2.71 0.0101
d2 1 -1006.62593 479.37174 -2.10 0.0428

d2=0: R&D =2,140.205 +.218*income =2,140.205 - 1,006.626*0 +.218*income
d2=1: R&D =1,133.579 +.218*income =2,140.205 - 1,006.626*1 +.218*income

2.5.2 LSDV 2 without the Intercept

LSDV 2 includes all dummy variables and suppresses the intercept. The STATA .regress
command has the noconstant option to fit LSDV2. The coefficients of dummies are actual
parameter estimates; thus, you do not need to compute intercepts of groups. This LSDV,
however, reports a wrong $R^2$ (.7135 instead of .2520).

. regress rnd income d1 d2, noconstant

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    36) =   29.88
       Model |   184685604     3  61561868.1           Prob > F      =  0.0000
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.7135
-------------+------------------------------           Adj R-squared =  0.6896
       Total |   258861361    39  6637470.79           Root MSE      =  1435.4

------------------------------------------------------------------------------
         rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d1 |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
          d2 |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------

d1=1: R&D =2,140.205 +.218*income
d2=1: R&D =1,133.579 +.218*income

2.5.3 LSDV 3 with a Restriction

LSDV 3 includes the intercept and all dummies and then imposes a restriction on the model.
The restriction is that the sum of all dummy parameters is zero. The STATA .constraint
command defines a constraint, while the .cnsreg command fits a constrained OLS using the
constraint() option. The number in parentheses indicates the constraint number defined in
the .constraint command.

. constraint 1 d1 + d2 = 0
. cnsreg rnd income d1 d2, constraint(1)

Constrained linear regression                          Number of obs =      39
                                                       F(  2,    36) =    6.06
                                                       Prob > F      =  0.0054
                                                       Root MSE      =  1435.4
 ( 1)  d1 + d2 = 0
------------------------------------------------------------------------------
         rnd |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d1 |    503.313   239.6859     2.10   0.043     17.20749    989.4184
          d2 |   -503.313   239.6859    -2.10   0.043    -989.4184   -17.20749
       _cons |   1636.892   310.0438     5.28   0.000     1008.094     2265.69
------------------------------------------------------------------------------
d1=1: R&D =2,140.205 +.218*income =1,637 +503 *1 +(-503)*0 +.218*income
d2=1: R&D =1,133.579 +.218*income =1,637 +503 *0 +(-503)*1 +.218*income

The intercept is the average of the actual parameter estimates: 1,636.892 = (2,140.205 + 1,133.579)/2.
In the SAS output below, the coefficient of RESTRICT is virtually zero and, in theory, should be zero.
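
The relationship between LSDV2 and LSDV3 can be checked by hand. The following is a minimal sketch, not part of the original analysis: it takes the two group intercepts reported by LSDV2 and derives the LSDV3 intercept and dummy coefficients under the restriction d1 + d2 = 0. The variable names are illustrative assumptions.

```python
# Group intercepts reported by LSDV2 (coefficients of d1 and d2).
lsdv2_intercepts = {"d1": 2140.205, "d2": 1133.579}

# Under the restriction d1 + d2 = 0, the LSDV3 intercept is the mean of the
# group intercepts, and each dummy coefficient is the deviation from that mean.
lsdv3_intercept = sum(lsdv2_intercepts.values()) / len(lsdv2_intercepts)
lsdv3_dummies = {g: a - lsdv3_intercept for g, a in lsdv2_intercepts.items()}

print(round(lsdv3_intercept, 3))
print({g: round(d, 3) for g, d in lsdv3_dummies.items()})
```

The results agree with the _cons, d1, and d2 estimates in the constrained regression output.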

PROC REG DATA=masil.rnd2002;
MODEL rnd = income d1 d2;
RESTRICT d1 + d2 = 0;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: rnd

NOTE: Restrictions have been applied to parameter estimates.


Number of Observations Read 50
Number of Observations Used 39
Number of Observations with Missing Values 11


Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 24987949 12493974 6.06 0.0054
Error 36 74175757 2060438
Corrected Total 38 99163706


Root MSE 1435.42248 R-Square 0.2520
Dependent Mean 2023.56410 Adj R-Sq 0.2104
Coeff Var 70.93536


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 1636.89172 310.04381 5.28 <.0001
income 1 0.21801 0.08032 2.71 0.0101
d1 1 503.31297 239.68587 2.10 0.0428
d2 1 -503.31297 239.68587 -2.10 0.0428
RESTRICT -1 1.81899E-12 0 . .

* Probability computed using beta distribution.

Table 4 compares how SAS, STATA, LIMDEP, and SPSS conduct LSDVs. SPSS is not able
to fit the LSDV3. In LIMDEP, the b(2) of the Cls: specification indicates the parameter estimate of the
second independent variable. In SPSS, pay attention to the /ORIGIN option for LSDV2.

Table 4. Estimating Three LSDVs Using SAS, STATA, LIMDEP, and SPSS

SAS
  LSDV 1:
    PROC REG;
    MODEL rnd = income d2;
    RUN;
  LSDV 2:
    PROC REG;
    MODEL rnd = income d1 d2 /NOINT;
    RUN;
  LSDV 3:
    PROC REG;
    MODEL rnd = income d1 d2;
    RESTRICT d1 + d2 = 0;
    RUN;

STATA
  LSDV 1:
    . regress rnd income d2
  LSDV 2:
    . regress rnd income d1 d2, noconstant
  LSDV 3:
    . constraint 1 d1 + d2 = 0
    . cnsreg rnd income d1 d2, constraint(1)

LIMDEP
  LSDV 1:
    REGRESS;
    Lhs=rnd;
    Rhs=ONE,income, d2$
  LSDV 2:
    REGRESS;
    Lhs=rnd;
    Rhs=income, d1, d2$
  LSDV 3:
    REGRESS;
    Lhs=rnd;
    Rhs=ONE,income, d1, d2;
    Cls: b(2)+b(3)=0$

SPSS
  LSDV 1:
    REGRESSION
    /MISSING LISTWISE
    /STATISTICS COEFF R ANOVA
    /CRITERIA=PIN(.05) POUT(.10)
    /NOORIGIN
    /DEPENDENT rnd
    /METHOD=ENTER income d2.
  LSDV 2:
    REGRESSION
    /MISSING LISTWISE
    /STATISTICS COEFF R ANOVA
    /CRITERIA=PIN(.05) POUT(.10)
    /ORIGIN
    /DEPENDENT rnd
    /METHOD=ENTER income d1 d2.
  LSDV 3:
    N/A

3. Panel Data Models

Panel data may have group effects, time effects, or both. These effects are either fixed or
random. A fixed effect model assumes differences in intercepts across groups or time
periods, whereas a random effect model explores differences in error variances. A one-way
model includes only one set of dummy variables (e.g., firm), while a two-way model considers
two sets of dummy variables (e.g., firm and year). Model 2 in Chapter 2, in fact, is a one-way
fixed group effect panel data model.


3.1 Functional Forms and Notation

The functional forms of one-way panel data models are as follows.

Fixed group effect model: $y_{it} = (\alpha + \mu_i) + X_{it}'\beta + v_{it}$, where $v_{it} \sim IID(0, \sigma_v^2)$
Random group effect model: $y_{it} = \alpha + X_{it}'\beta + (\mu_i + v_{it})$, where $v_{it} \sim IID(0, \sigma_v^2)$

The dummy variable is a part of the intercept in the fixed effect model and a part of the error in the
random effect model. $v_{it} \sim IID(0, \sigma_v^2)$ indicates that errors are independent and identically
distributed.

The notations used in this document are,

$\bar{y}_{i\bullet}$: dependent variable (DV) mean of group i.
$\bar{x}_{\bullet t}$: means of independent variables (IVs) at time t.
$\bar{y}_{\bullet\bullet}$ and $\bar{x}_{\bullet\bullet}$: overall means of the DV and IVs, respectively.
n: the number of groups or firms
T : the number of time periods
N=nT : total number of observations
k : the number of regressors excluding dummy variables
K=k+1 (including the intercept)


3.2 Fixed Effect Models

There are several strategies for estimating fixed effect models. The least squares dummy
variable model (LSDV) uses dummy variables, whereas the within effect model does not. These
two strategies produce identical slopes for the non-dummy independent variables. The between
effect model also does not use dummies, but produces different parameter estimates. Each
strategy has pros and cons (Table 5).

3.2.1 Estimations: LSDV, Within Effect, and Between Effect Model

As discussed in Chapter 2, LSDV is widely used because it is relatively easy to estimate and
interpret substantively. This LSDV, however, becomes problematic when there are many
groups or subjects in the panel data. If T is fixed and $N \to \infty$, only the coefficients of regressors
are consistent. The coefficients of dummy variables, $\alpha + \mu_i$, are not consistent since the
number of these parameters increases as N increases (Baltagi 2001). This is the so-called
incidental parameter problem. Under this circumstance, LSDV is useless, calling for another
strategy, the within effect model.

The within effect model does not use dummy variables, but uses deviations from group means.
Thus, this model is the OLS of $(y_{it} - \bar{y}_{i\bullet}) = (x_{it} - \bar{x}_{i\bullet})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{i\bullet})$ without an intercept.[3]
You do not need to worry about the incidental parameter problem any more. The parameter
estimates of regressors are identical to those of LSDV. The within effect model in turn has
several disadvantages.
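
The within transformation can be sketched in a few lines. The example below is an illustrative assumption rather than the document's own code: it handles a single regressor, uses a small made-up panel, and the function name `within_estimator` is hypothetical. It computes group means, takes deviations from them, runs OLS without an intercept, and then recovers each group intercept as the group mean of y minus the slope times the group mean of x.

```python
from collections import defaultdict

def within_estimator(groups, x, y):
    """Return the within slope and the recovered group intercepts."""
    # Step 1: group means of the regressor and the dependent variable.
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for g, xi, yi in zip(groups, x, y):
        totals[g][0] += xi
        totals[g][1] += yi
        totals[g][2] += 1
    means = {g: (sx / m, sy / m) for g, (sx, sy, m) in totals.items()}
    # Step 2: deviations of each observation from its group mean.
    xd = [xi - means[g][0] for g, xi in zip(groups, x)]
    yd = [yi - means[g][1] for g, yi in zip(groups, y)]
    # Step 3: OLS without an intercept on the transformed variables.
    slope = sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)
    # Group intercepts: ybar_g - slope * xbar_g.
    intercepts = {g: my - slope * mx for g, (mx, my) in means.items()}
    return slope, intercepts

# Hypothetical panel: two groups with intercepts 10 and 50 and a common slope of 2.
groups = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [12.0, 14.0, 16.0, 58.0, 60.0, 62.0]
slope, intercepts = within_estimator(groups, x, y)
print(slope, intercepts)
```

Because the made-up data contain no noise, the sketch recovers the common slope and both group intercepts exactly, which is the sense in which the within estimates match LSDV.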

Table 5. Three Strategies for Fixed Effect Models

                          LSDV1         Within Effect                    Between Effect
Functional form           $y_i = i\alpha_i + X_i\beta + \varepsilon_i$   $y_{it} - \bar{y}_{i\bullet} = (x_{it} - \bar{x}_{i\bullet})'\beta + \varepsilon_{it} - \bar{\varepsilon}_{i\bullet}$   $\bar{y}_{i\bullet} = \alpha + \bar{x}_{i\bullet}'\beta + \varepsilon_i$
Dummy                     Yes           No                               No
Dummy coefficient         Presented     Need to be computed              N/A
Transformation            No            Deviation from the group means   Group means
Intercept (estimation)    Yes           No                               No
$R^2$                     Correct       Incorrect
SSE                       Correct       Correct
MSE                       Correct       Smaller
Standard error of $\beta$ Correct       Incorrect (smaller)
$DF_{error}$              nT-n-k        nT-k (Larger)                    n-K
Observations              nT            nT                               n

Since this model does not report dummy coefficients, you need to compute them using the
formula $d_g^* = \bar{y}_{g\bullet} - \beta'\bar{x}_{g\bullet}$. Since no dummy is used, the within effect model has larger degrees
of freedom for error, resulting in a smaller MSE (mean square error) and incorrect (smaller)
standard errors of parameter estimates. Thus, you have to adjust the standard errors using the
formula

$$se_k^* = se_k \sqrt{\frac{df_{error}^{Within}}{df_{error}^{LSDV}}} = se_k \sqrt{\frac{nT-k}{nT-n-k}}.$$

Finally, $R^2$ of the within effect model is not
correct because the intercept is suppressed.
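
As a rough illustration of the degrees-of-freedom correction, the sketch below rescales a within-model standard error to the LSDV degrees of freedom. The function name `adjust_within_se` and the input standard error of 0.05 are hypothetical; only the formula comes from the text.

```python
import math

def adjust_within_se(se, n, T, k):
    """Rescale a within-model standard error to the LSDV error degrees of freedom."""
    df_within = n * T - k      # error df the within regression reports
    df_lsdv = n * T - n - k    # error df once the n group intercepts are counted
    return se * math.sqrt(df_within / df_lsdv)

# Hypothetical input: a within-model standard error of 0.05 with n=6, T=15, k=3.
corrected = adjust_within_se(0.05, n=6, T=15, k=3)
print(round(corrected, 5))
```

The corrected value is always at least as large as the reported one, since the within regression overstates the error degrees of freedom by n.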

The between group effect model, the so-called group mean regression, uses the group means of
the dependent and independent variables. Then, run OLS of $\bar{y}_{i\bullet} = \alpha + \bar{x}_{i\bullet}'\beta + \varepsilon_i$. The number of
observations decreases to n. This model uses aggregated data to test effects between groups (or
individuals), assuming no group and time effect. Table 5 contrasts LSDV, the within effect
model, and the between group model. In a two-way fixed effect model, LSDV2 and the between
effect model are not valid.


[3] You need to follow three steps: 1) compute group means of the dependent and independent variables; 2)
transform variables to get deviations of individual values from the group means; 3) run OLS with the transformed
variables without the intercept.

3.2.2 Testing Group Effects

The null hypothesis is that all dummy parameters except one are zero: $H_0: \mu_1 = \dots = \mu_{n-1} = 0$.
This hypothesis is tested by the F test, which is based on the loss of goodness-of-fit. The robust
model in the following formula is LSDV and the efficient model is the pooled regression.[4]

$$\frac{(e'e_{Efficient} - e'e_{Robust})/(n-1)}{(e'e_{Robust})/(nT-n-k)} = \frac{(R^2_{Robust} - R^2_{Efficient})/(n-1)}{(1 - R^2_{Robust})/(nT-n-k)} \sim F(n-1,\; nT-n-k)$$

If the null hypothesis is rejected, you may conclude that the fixed group effect model is better
than the pooled OLS model.
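
The F statistic can be computed directly from two sums of squared errors. The sketch below is an illustration, with a hypothetical function name; it plugs in the residual sums of squares that Chapter 4 of this document reports for the airline data (1.33544153 for the pooled OLS and .292622872 for LSDV1), with 90 observations on six airlines (so T = 15) and k = 3 regressors.

```python
def group_effect_f(sse_pooled, sse_lsdv, n, T, k):
    """F statistic for H0: all group dummy parameters except one are zero."""
    df1 = n - 1              # number of restrictions
    df2 = n * T - n - k      # error df of the LSDV (robust) model
    f = ((sse_pooled - sse_lsdv) / df1) / (sse_lsdv / df2)
    return f, df1, df2

f_stat, df1, df2 = group_effect_f(1.33544153, 0.292622872, n=6, T=15, k=3)
print(round(f_stat, 2), df1, df2)
```

An F value this large on (5, 81) degrees of freedom would reject the null hypothesis at conventional significance levels.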

3.2.3 Fixed Time Effect and Two-way Fixed Effect Models

For the fixed time effects model, you need to switch n and T, and i and t in the formulas.

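Model: $y_{it} = \alpha + \tau_t + X_{it}'\beta + \varepsilon_{it}$
Within effect model: $(y_{it} - \bar{y}_{\bullet t}) = (x_{it} - \bar{x}_{\bullet t})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{\bullet t})$
Dummy coefficients: $d_t^* = \bar{y}_{\bullet t} - \beta'\bar{x}_{\bullet t}$
Correct standard errors: $se_k^* = se_k \sqrt{\frac{df_{error}^{Within}}{df_{error}^{LSDV}}} = se_k \sqrt{\frac{Tn-k}{Tn-T-k}}$
Between effect model: $\bar{y}_{\bullet t} = \alpha + \bar{x}_{\bullet t}'\beta + \varepsilon_t$
$H_0: \tau_1 = \dots = \tau_{T-1} = 0$.
F-test: $\frac{(e'e_{Efficient} - e'e_{Robust})/(T-1)}{(e'e_{Robust})/(Tn-T-k)} \sim F(T-1,\; Tn-T-k)$.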

The fixed group and time effect model uses slightly different formulas. The within effect model
of this two-way fixed model has four approaches for LSDV (see 6.1 for details).

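Model: $y_{it} = \alpha + \mu_i + \tau_t + X_{it}'\beta + \varepsilon_{it}$.
Within effect model: $y_{it}^* = y_{it} - \bar{y}_{i\bullet} - \bar{y}_{\bullet t} + \bar{y}_{\bullet\bullet}$ and $x_{it}^* = x_{it} - \bar{x}_{i\bullet} - \bar{x}_{\bullet t} + \bar{x}_{\bullet\bullet}$.
Dummy coefficients: $d_g^* = (\bar{y}_{g\bullet} - \bar{y}_{\bullet\bullet}) - b'(\bar{x}_{g\bullet} - \bar{x}_{\bullet\bullet})$ and $d_t^* = (\bar{y}_{\bullet t} - \bar{y}_{\bullet\bullet}) - b'(\bar{x}_{\bullet t} - \bar{x}_{\bullet\bullet})$
Correct standard errors: $se_k^* = se_k \sqrt{\frac{df_{error}^{Within}}{df_{error}^{LSDV}}} = se_k \sqrt{\frac{nT-k}{nT-n-T-k+1}}$
$H_0: \mu_1 = \dots = \mu_{n-1} = 0$ and $\tau_1 = \dots = \tau_{T-1} = 0$.
F-test: $\frac{(e'e_{Efficient} - e'e_{Robust})/(n+T-2)}{(e'e_{Robust})/(nT-n-T-k+1)} \sim F[(n+T-2),\; (nT-n-T-k+1)]$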

[4] When comparing fixed effect and random effect models, the fixed effect estimates are considered as the robust
estimates and random effect estimates as the efficient estimates.

3.3 Random Effect Models

The one-way random group effect model is formulated as $y_{it} = \alpha + X_{it}'\beta + \mu_i + v_{it}$,
$w_{it} = \mu_i + v_{it}$, where $\mu_i \sim IID(0, \sigma_\mu^2)$ and $v_{it} \sim IID(0, \sigma_v^2)$. The $\mu_i$ are assumed independent of
$v_{it}$ and $X_{it}$, which are also independent of each other for all i and t. Remember that this
assumption is not necessary in the fixed effect model. The components of
$Cov(w_{it}, w_{js}) = E(w_{it} w_{js})$ are $\sigma_\mu^2 + \sigma_v^2$ if i=j and t=s, and $\sigma_\mu^2$ if i=j and $t \neq s$.[5]


A random effect model is estimated by generalized least squares (GLS) when the variance
structure is known and feasible generalized least squares (FGLS) when the variance is
unknown. Compared to fixed effect models, random effect models are relatively difficult to
estimate. This document assumes panel data are balanced.

3.3.1 Generalized Least Squares (GLS)

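When $\Omega$ is known (given), GLS based on the true variance components is BLUE and all the
feasible GLS estimators considered are asymptotically efficient as either n or T approaches
infinity (Baltagi 2001). The $\Omega$ matrix looks like,

$$\Omega_{T \times T} = \begin{bmatrix}
\sigma_\mu^2 + \sigma_v^2 & \sigma_\mu^2 & \dots & \sigma_\mu^2 \\
\sigma_\mu^2 & \sigma_\mu^2 + \sigma_v^2 & \dots & \sigma_\mu^2 \\
\dots & \dots & \dots & \dots \\
\sigma_\mu^2 & \sigma_\mu^2 & \dots & \sigma_\mu^2 + \sigma_v^2
\end{bmatrix}$$

In GLS, you just need to compute $\theta$ using the $\Omega$ matrix: $\theta = 1 - \sqrt{\frac{\sigma_v^2}{T\sigma_\mu^2 + \sigma_v^2}}$.[6] Then transform
variables as follows.

$y_{it}^* = y_{it} - \theta\bar{y}_{i\bullet}$
$x_{it}^* = x_{it} - \theta\bar{x}_{i\bullet}$ for all $X_k$
$\alpha^* = 1 - \theta$

Finally, run OLS with the transformed variables: $y_{it}^* = \alpha^* + x_{it}^{*\prime}\beta^* + \varepsilon_{it}^*$. Since $\Omega$ is often
unknown, FGLS is used more frequently than GLS.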

3.3.2 Feasible Generalized Least Squares (FGLS)


[5] This implies that $Corr(w_{it}, w_{js})$ is 1 if i=j and t=s, and $\sigma_\mu^2 / (\sigma_\mu^2 + \sigma_v^2)$ if i=j and $t \neq s$.
[6] If $\theta = 0$, run pooled OLS. If $\theta = 1$ and $\sigma_v^2 = 0$, then run the within effect model.

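If $\Omega$ is unknown, first you have to estimate $\theta$ using $\hat{\sigma}_\mu^2$ and $\hat{\sigma}_v^2$:

$$\hat{\theta} = 1 - \sqrt{\frac{\hat{\sigma}_v^2}{T\hat{\sigma}_\mu^2 + \hat{\sigma}_v^2}} = 1 - \sqrt{\frac{\hat{\sigma}_v^2}{T\hat{\sigma}_{between}^2}}.$$

The $\hat{\sigma}_v^2$ is derived from the SSE (sum of squares due to error) of the within effect model or
from the deviations of residuals from group means of residuals:

$$\hat{\sigma}_v^2 = \frac{SSE_{within}}{nT-n-k} = \frac{e'e_{within}}{nT-n-k} = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T}(v_{it} - \bar{v}_{i\bullet})^2}{nT-n-k},$$

where $v_{it}$ are the residuals of the LSDV1.

The $\hat{\sigma}_\mu^2$ comes from the between effect model (group mean regression):

$$\hat{\sigma}_\mu^2 = \hat{\sigma}_{between}^2 - \frac{\hat{\sigma}_v^2}{T}, \text{ where } \hat{\sigma}_{between}^2 = \frac{SSE_{between}}{n-K}.$$

Next, transform variables using $\hat{\theta}$ and then run OLS: $y_{it}^* = \alpha^* + x_{it}^{*\prime}\beta^* + \varepsilon_{it}^*$.

$y_{it}^* = y_{it} - \hat{\theta}\bar{y}_{i\bullet}$
$x_{it}^* = x_{it} - \hat{\theta}\bar{x}_{i\bullet}$ for all $X_k$
$\alpha^* = 1 - \hat{\theta}$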
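
The FGLS steps for the variance components can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `fgls_theta` is hypothetical, the SSE inputs are made-up numbers chosen to give round results, and no guard is included for a negative estimate of the group variance.

```python
import math

def fgls_theta(sse_within, sse_between, n, T, k):
    """Estimate the variance components and theta for a one-way random effect model."""
    K = k + 1                                   # regressors including the intercept
    sigma_v2 = sse_within / (n * T - n - k)     # from the within (LSDV1) residuals
    sigma_between2 = sse_between / (n - K)      # from the group mean regression
    sigma_mu2 = sigma_between2 - sigma_v2 / T
    theta = 1.0 - math.sqrt(sigma_v2 / (T * sigma_mu2 + sigma_v2))
    return sigma_v2, sigma_mu2, theta

# Hypothetical inputs: n=10 groups, T=8 periods, k=1 regressor.
sigma_v2, sigma_mu2, theta = fgls_theta(sse_within=69.0, sse_between=4.0, n=10, T=8, k=1)
print(sigma_v2, sigma_mu2, theta)
```

With the estimated theta in hand, each variable is transformed by subtracting theta times its group mean before running OLS.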

The estimation of the two-way random effect model is skipped here, since it is complicated.

3.3.3 Testing Random Effects (LM test)

The null hypothesis is that cross-sectional variance components are zero, $H_0: \sigma_\mu^2 = 0$. Breusch
and Pagan (1980) developed the Lagrange multiplier (LM) test (Greene 2003; Judge et al.
1988). In the following formula, $\bar{e}$ is the n × 1 vector of the group specific means of pooled
regression residuals, D is the matrix of group dummy variables, and $e'e$ is the SSE of the pooled OLS regression. The LM is distributed as
chi-squared with one degree of freedom.

$$LM_\mu = \frac{nT}{2(T-1)}\left[\frac{e'DD'e}{e'e} - 1\right]^2 = \frac{nT}{2(T-1)}\left[\frac{T^2\bar{e}'\bar{e}}{e'e} - 1\right]^2 \sim \chi^2(1).$$

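Baltagi (2001) presents the same LM test in a different way.

$$LM_\mu = \frac{nT}{2(T-1)}\left[\frac{\sum_i \left(\sum_t e_{it}\right)^2}{\sum_i \sum_t e_{it}^2} - 1\right]^2 = \frac{nT}{2(T-1)}\left[\frac{\sum_i \left(T\bar{e}_{i\bullet}\right)^2}{\sum_i \sum_t e_{it}^2} - 1\right]^2 \sim \chi^2(1).$$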
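
The summation form of the LM statistic translates directly into code. The sketch below is illustrative only: the function name is hypothetical, the residuals are a made-up balanced panel of two groups and two periods, and in practice the inputs would be the pooled OLS residuals.

```python
def breusch_pagan_lm(residuals):
    """LM statistic from pooled OLS residuals given as n groups of T values each."""
    n = len(residuals)
    T = len(residuals[0])
    # Sum over groups of the squared within-group residual totals.
    sum_sq_group_totals = sum(sum(group) ** 2 for group in residuals)
    # Sum of all squared residuals (the pooled SSE).
    sum_sq = sum(e * e for group in residuals for e in group)
    return n * T / (2.0 * (T - 1)) * (sum_sq_group_totals / sum_sq - 1.0) ** 2

# Hypothetical residuals: each group's residuals share a sign, suggesting a group effect.
e = [[1.0, 1.0], [-1.0, -1.0]]
lm = breusch_pagan_lm(e)
print(lm)
```

The statistic is compared with a chi-squared critical value with one degree of freedom (3.841 at the .05 level).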

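The two-way random effect model has the null hypothesis of $H_0: \sigma_\mu^2 = 0$ and $\sigma_v^2 = 0$, where $\sigma_v^2$
in this test denotes the variance component of the time effect. The LM
test combines two one-way random effect models for group and time,
$LM_{\mu v} = LM_\mu + LM_v \sim \chi^2(2)$.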

3.4 Hausman Test: Fixed Effects versus Random Effects

The Hausman specification test compares the fixed versus random effects under the null
hypothesis that the individual effects are uncorrelated with the other regressors in the model
(Hausman 1978). If they are correlated ($H_0$ is rejected), a random effect model produces biased
estimators, violating one of the Gauss-Markov assumptions; so a fixed effect model is preferred.
Hausman's essential result is that the covariance of an efficient estimator with its difference
from an inefficient estimator is zero (Greene 2003).

$$m = (b_{Robust} - b_{Efficient})'\,\hat{\Sigma}^{-1}(b_{Robust} - b_{Efficient}) \sim \chi^2(k),$$

where $\hat{\Sigma} = Var[b_{Robust} - b_{Efficient}] = Var(b_{Robust}) - Var(b_{Efficient})$ is the difference between the estimated
covariance matrix of the parameter estimates in the LSDV model (robust) and that of the
random effects model (efficient). It is notable that an intercept and dummy variables SHOULD
be excluded in computation.
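
A small sketch of the Hausman statistic follows. It is an illustration under stated assumptions, not the implementation used by any statistical package: the function names are hypothetical, the estimates and covariance matrices are made-up numbers for k = 2 slopes (intercept and dummies excluded), and the difference of the covariance matrices is assumed to be invertible.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z

def hausman(b_robust, b_efficient, var_robust, var_efficient):
    """m = d' Sigma^{-1} d with d = b_robust - b_efficient."""
    k = len(b_robust)
    d = [b_robust[i] - b_efficient[i] for i in range(k)]
    sigma = [[var_robust[i][j] - var_efficient[i][j] for j in range(k)]
             for i in range(k)]
    z = solve(sigma, d)  # z = Sigma^{-1} d
    return sum(d[i] * z[i] for i in range(k))

# Hypothetical fixed effect (robust) and random effect (efficient) results.
m_stat = hausman([1.0, 2.0], [0.8, 2.1],
                 [[0.05, 0.0], [0.0, 0.03]],
                 [[0.01, 0.0], [0.0, 0.02]])
print(round(m_stat, 6))
```

The statistic is compared with a chi-squared distribution with k degrees of freedom, here k = 2.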


3.5 Poolability Test

What is poolability? It asks if slopes are the same across groups or over time. Thus, the null
hypothesis of the poolability test is $H_0: \beta_{ik} = \beta_k$. Remember that slopes remain constant in
fixed and random effect models; only intercepts and error variances matter.

The poolability test is undertaken under the assumption of $\mu \sim N(0, s^2 I_{NT})$. This test uses the F
statistic,

$$F_{obs} = \frac{(e'e - \sum e_i'e_i)/[(n-1)K]}{\sum e_i'e_i/[n(T-K)]} \sim F[(n-1)K,\; n(T-K)],$$

where $e'e$ is the SSE of the pooled OLS and $e_i'e_i$ is the SSE of the OLS regression for group i. If the null hypothesis is
rejected, the panel data are not poolable. Under this circumstance, you may go to the random
coefficient model or hierarchical regression model.

Similarly, the null hypothesis of the poolability test over time is $H_0: \beta_{tk} = \beta_k$. The F-test is

$$F_{obs} = \frac{(e'e - \sum e_t'e_t)/[(T-1)K]}{\sum e_t'e_t/[T(n-K)]} \sim F[(T-1)K,\; T(n-K)],$$

where $e_t'e_t$ is the SSE of the OLS regression at time t.
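
The poolability F statistic across groups can be sketched as below. The function name and the SSE values are hypothetical; in practice the inputs would be the SSE of the pooled OLS and the SSEs of the separate OLS regressions for each group. For the test over time, swap the roles of n and T and pass the per-period SSEs.

```python
def poolability_f(sse_pooled, sse_by_group, T, K):
    """F statistic for H0: slopes are the same across groups."""
    n = len(sse_by_group)
    sse_unrestricted = sum(sse_by_group)   # sum of the group-specific SSEs
    df1 = (n - 1) * K
    df2 = n * (T - K)
    f = ((sse_pooled - sse_unrestricted) / df1) / (sse_unrestricted / df2)
    return f, df1, df2

# Hypothetical inputs: three groups, T=10 periods, K=2 parameters per regression.
f_obs, df1, df2 = poolability_f(10.0, [2.0, 3.0, 1.0], T=10, K=2)
print(f_obs, df1, df2)
```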



4. The Fixed Group Effect Model

The one-way fixed group model examines group differences in the intercepts. The LSDV for
this fixed model needs to create as many dummy variables as the number of groups or subjects.
When many dummies are needed, the within effect model is useful since it transforms variables
using group means to avoid dummies. The between effect model uses group means of variables.

4.1 The Pooled OLS Regression Model

Let us first consider the pooled model without dummy variables.

. regress cost output fuel load // pooled model

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  3,    86) = 2419.34
       Model |  112.705452     3  37.5684839           Prob > F      =  0.0000
    Residual |  1.33544153    86   .01552839           R-squared     =  0.9883
-------------+------------------------------           Adj R-squared =  0.9879
       Total |  114.040893    89  1.28135835           Root MSE      =  .12461

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .8827385   .0132545    66.60   0.000     .8563895    .9090876
        fuel |    .453977   .0203042    22.36   0.000     .4136136    .4943404
        load |   -1.62751    .345302    -4.71   0.000    -2.313948   -.9410727
       _cons |   9.516923   .2292445    41.51   0.000       9.0612    9.972645
------------------------------------------------------------------------------
cost =9.517 +.883*output +.454*fuel -1.628*load.

This model fits the data well (p<.0001 and $R^2$=.9883). We may, however, suspect fixed group
effects that produce different intercepts across groups. As discussed in Chapter 2, there are
three equivalent approaches of LSDV. They report identical parameter estimates of
regressors excluding dummies. Let us begin with LSDV1.

4.2 LSDV1 without a Dummy

LSDV1 drops a dummy variable to identify the model. LSDV1 produces correct ANOVA
information, goodness of fit, parameter estimates, and standard errors. As a consequence, this
approach is commonly used in practice. LSDV produces six regression equations for six groups
(airlines).

Group1: cost = 9.706 + .919*output + .417*fuel - 1.070*load
Group2: cost = 9.665 + .919*output + .417*fuel - 1.070*load
Group3: cost = 9.497 + .919*output + .417*fuel - 1.070*load
Group4: cost = 9.891 + .919*output + .417*fuel - 1.070*load
Group5: cost = 9.730 + .919*output + .417*fuel - 1.070*load
Group6: cost = 9.793 + .919*output + .417*fuel - 1.070*load

In SAS, the REG procedure fits the OLS regression model. Let us drop the last dummy g6, the
reference point.

PROC REG DATA=masil.airline;
MODEL cost = g1-g5 output fuel load;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: cost

Number of Observations Read 90
Number of Observations Used 90


Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 8 113.74827 14.21853 3935.79 <.0001
Error 81 0.29262 0.00361
Corrected Total 89 114.04089


Root MSE 0.06011 R-Square 0.9974
Dependent Mean 13.36561 Adj R-Sq 0.9972
Coeff Var 0.44970


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 9.79300 0.26366 37.14 <.0001
g1 1 -0.08706 0.08420 -1.03 0.3042
g2 1 -0.12830 0.07573 -1.69 0.0941
g3 1 -0.29598 0.05002 -5.92 <.0001
g4 1 0.09749 0.03301 2.95 0.0041
g5 1 -0.06301 0.02389 -2.64 0.0100
output 1 0.91928 0.02989 30.76 <.0001
fuel 1 0.41749 0.01520 27.47 <.0001
load 1 -1.07040 0.20169 -5.31 <.0001

Note that the parameter estimate of g6 is presented in the intercept (9.793). Other dummy
parameter estimates are computed with the reference point. The actual intercept of the group 1,
for example, is computed as 9.706 =9.793 +(-.087)*1 +(-.1283)*0 +(-.2960)*0 +(.0975)*0 +
(-.0630)*0, where 9.793 is the reference point.
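
The same arithmetic can be applied to every group at once. The short sketch below is illustrative (the variable names are assumptions); it adds each dummy coefficient from the SAS parameter estimates to the reference intercept to recover the six group intercepts.

```python
# SAS LSDV1 estimates: the intercept is the reference point (group 6).
reference_intercept = 9.79300
dummy_coefs = {1: -0.08706, 2: -0.12830, 3: -0.29598, 4: 0.09749, 5: -0.06301}

# Actual intercept of each group = reference intercept + its dummy coefficient.
intercepts = {g: reference_intercept + c for g, c in dummy_coefs.items()}
intercepts[6] = reference_intercept

for g in sorted(intercepts):
    print(f"Group{g}: {intercepts[g]:.3f}")
```

Up to rounding in the last digit, the printed intercepts match the six group equations listed at the start of this section.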

STATA has the .regress command for OLS regression (LSDV).

. regress cost g1-g5 output fuel load

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  8,    81) = 3935.79
       Model |   113.74827     8  14.2185338           Prob > F      =  0.0000
    Residual |  .292622872    81  .003612628           R-squared     =  0.9974
-------------+------------------------------           Adj R-squared =  0.9972
       Total |  114.040893    89  1.28135835           Root MSE      =  .06011

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          g1 |  -.0870617   .0841995    -1.03   0.304    -.2545924     .080469
          g2 |  -.1282976   .0757281    -1.69   0.094    -.2789728    .0223776
          g3 |  -.2959828   .0500231    -5.92   0.000     -.395513   -.1964526
          g4 |    .097494   .0330093     2.95   0.004     .0318159    .1631721
          g5 |   -.063007   .0238919    -2.64   0.010    -.1105443   -.0154697
      output |   .9192846   .0298901    30.76   0.000     .8598126    .9787565
        fuel |   .4174918   .0151991    27.47   0.000     .3872503    .4477333
        load |  -1.070396     .20169    -5.31   0.000    -1.471696   -.6690963
       _cons |   9.793004   .2636622    37.14   0.000     9.268399    10.31761
------------------------------------------------------------------------------

Now, run the LIMDEP Regress$ command to fit the LSDV1. Do not forget to include ONE
for the intercept in the Rhs;.

--> REGRESS;Lhs=COST;Rhs=ONE,G1,G2,G3,G4,G5,OUTPUT,FUEL,LOAD$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 |
| Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 |
| Fit: R-squared= .997434, Adjusted R-squared = .99718 |
| Model test: F[ 8, 81] = 3935.82, Prob value = .00000 |
| Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 |
| Autocorrel: Durbin-Watson Statistic = 1.02645, Rho = .48677 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant 9.793021272 .26366104 37.142 .0000
G1 -.8707201949E-01 .84199161E-01 -1.034 .3042 .16666667
G2 -.1283060033 .75727781E-01 -1.694 .0940 .16666667
G3 -.2959885994 .50022855E-01 -5.917 .0000 .16666667
G4 .9749253376E-01 .33009146E-01 2.954 .0041 .16666667
G5 -.6300770422E-01 .23891796E-01 -2.637 .0100 .16666667
OUTPUT .9192881432 .29889967E-01 30.756 .0000 -1.1743092
FUEL .4174910457 .15199071E-01 27.468 .0000 12.770359
LOAD -1.070395015 .20168924 -5.307 .0000 .56046016
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

What if you drop a different dummy variable, say g1, instead of g6? Since a different
reference point is applied, you will get different dummy coefficients. The other statistics, such
as goodness-of-fit measures, however, remain unchanged.

. regress cost g2-g6 output fuel load // LSDV1 dropping g1

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  8,    81) = 3935.79
       Model |   113.74827     8  14.2185338           Prob > F      =  0.0000
    Residual |  .292622872    81  .003612628           R-squared     =  0.9974
-------------+------------------------------           Adj R-squared =  0.9972
       Total |  114.040893    89  1.28135835           Root MSE      =  .06011

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          g2 |  -.0412359   .0251839    -1.64   0.105    -.0913441    .0088722
          g3 |  -.2089211   .0427986    -4.88   0.000    -.2940769   -.1237652
          g4 |   .1845557   .0607527     3.04   0.003     .0636769    .3054345
          g5 |   .0240547   .0799041     0.30   0.764    -.1349293    .1830387
          g6 |   .0870617   .0841995     1.03   0.304     -.080469    .2545924
      output |   .9192846   .0298901    30.76   0.000     .8598126    .9787565
        fuel |   .4174918   .0151991    27.47   0.000     .3872503    .4477333
        load |  -1.070396     .20169    -5.31   0.000    -1.471696   -.6690963
       _cons |   9.705942    .193124    50.26   0.000     9.321686     10.0902
------------------------------------------------------------------------------
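
Changing the reference group only shifts the dummy coefficients and the intercept by a constant. The sketch below illustrates this with a hypothetical helper, `rebase`, that converts the STATA estimates obtained with g6 as the reference into those with g1 as the reference.

```python
# LSDV1 estimates with g6 as the reference group (STATA output, g6 itself is 0).
cons_ref6 = 9.793004
dummies_ref6 = {"g1": -0.0870617, "g2": -0.1282976, "g3": -0.2959828,
                "g4": 0.097494, "g5": -0.063007, "g6": 0.0}

def rebase(cons, dummies, new_ref):
    """Re-express the intercept and dummy coefficients relative to a new reference group."""
    shift = dummies[new_ref]
    new_cons = cons + shift  # the intercept becomes that of the new reference group
    new_dummies = {g: c - shift for g, c in dummies.items() if g != new_ref}
    return new_cons, new_dummies

cons_ref1, dummies_ref1 = rebase(cons_ref6, dummies_ref6, "g1")
print(round(cons_ref1, 6), {g: round(c, 7) for g, c in dummies_ref1.items()})
```

The rebased values reproduce the coefficients of g2 through g6 and the _cons reported when g1 is dropped, which is why the slopes and fit statistics do not change.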

When you have not created dummy variables, take advantage of the .xi prefix command.[7]
Note that STATA by default drops the first dummy variable while the SAS TSCSREG and
PANEL procedures in 4.5.2 drop the last dummy.

. xi: regress cost i.airline output fuel load

i.airline         _Iairline_1-6       (naturally coded; _Iairline_1 omitted)

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  8,    81) = 3935.79
       Model |   113.74827     8  14.2185338           Prob > F      =  0.0000
    Residual |  .292622872    81  .003612628           R-squared     =  0.9974
-------------+------------------------------           Adj R-squared =  0.9972
       Total |  114.040893    89  1.28135835           Root MSE      =  .06011

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Iairline_2 |  -.0412359   .0251839    -1.64   0.105    -.0913441    .0088722
 _Iairline_3 |  -.2089211   .0427986    -4.88   0.000    -.2940769   -.1237652
 _Iairline_4 |   .1845557   .0607527     3.04   0.003     .0636769    .3054345
 _Iairline_5 |   .0240547   .0799041     0.30   0.764    -.1349293    .1830387
 _Iairline_6 |   .0870617   .0841995     1.03   0.304     -.080469    .2545924
      output |   .9192846   .0298901    30.76   0.000     .8598126    .9787565
        fuel |   .4174918   .0151991    27.47   0.000     .3872503    .4477333
        load |  -1.070396     .20169    -5.31   0.000    -1.471696   -.6690963
       _cons |   9.705942    .193124    50.26   0.000     9.321686     10.0902
------------------------------------------------------------------------------

4.3 LSDV2 without the Intercept

LSDV2 reports actual parameter estimates of the dummies. Because LSDV2 suppresses the
intercept, you will get incorrect F and $R^2$ statistics.

In the SAS REG procedure, you need to use the /NOINT option to suppress the intercept. Note
that the F value of 497,985 and $R^2$ of 1 are implausible.

PROC REG DATA=masil.airline;
MODEL cost = g1-g6 output fuel load /NOINT;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: cost

Number of Observations Read 90
Number of Observations Used 90


[7] The STATA .xi is used either as an ordinary command or a prefix command like .bysort. This command
creates dummies from a categorical variable specified in the term i. and then runs the command following the
colon.

NOTE: No intercept in model. R-Square is redefined.

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 9 16191 1799.03381 497985 <.0001
Error 81 0.29262 0.00361
Uncorrected Total 90 16192


Root MSE 0.06011 R-Square 1.0000
Dependent Mean 13.36561 Adj R-Sq 1.0000
Coeff Var 0.44970


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

g1 1 9.70594 0.19312 50.26 <.0001
g2 1 9.66471 0.19898 48.57 <.0001
g3 1 9.49702 0.22496 42.22 <.0001
g4 1 9.89050 0.24176 40.91 <.0001
g5 1 9.73000 0.26094 37.29 <.0001
g6 1 9.79300 0.26366 37.14 <.0001
output 1 0.91928 0.02989 30.76 <.0001
fuel 1 0.41749 0.01520 27.47 <.0001
load 1 -1.07040 0.20169 -5.31 <.0001

STATA uses the noconstant option to suppress the intercept. Note that noc is its abbreviation.

. regress cost g1-g6 output fuel load, noc

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  9,    81) =       .
       Model |  16191.3043     9  1799.03381           Prob > F      =  0.0000
    Residual |  .292622872    81  .003612628           R-squared     =  1.0000
-------------+------------------------------           Adj R-squared =  1.0000
       Total |  16191.5969    90  179.906633           Root MSE      =  .06011

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          g1 |   9.705942    .193124    50.26   0.000     9.321686     10.0902
          g2 |   9.664706    .198982    48.57   0.000     9.268794    10.06062
          g3 |   9.497021   .2249584    42.22   0.000     9.049424    9.944618
          g4 |   9.890498   .2417635    40.91   0.000     9.409464    10.37153
          g5 |   9.729997   .2609421    37.29   0.000     9.210804    10.24919
          g6 |   9.793004   .2636622    37.14   0.000     9.268399    10.31761
      output |   .9192846   .0298901    30.76   0.000     .8598126    .9787565
        fuel |   .4174918   .0151991    27.47   0.000     .3872503    .4477333
        load |  -1.070396     .20169    -5.31   0.000    -1.471696   -.6690963
------------------------------------------------------------------------------
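
The implausible fit statistics arise because, without an intercept, the total sum of squares is taken about zero rather than about the mean (SAS flags this with the note "R-Square is redefined"). The sketch below, an illustration with assumed variable names, recomputes both versions from the sums of squares printed in the LSDV1 and LSDV2 outputs.

```python
sse = 0.292622872             # residual sum of squares, identical in LSDV1 and LSDV2
sst_centered = 114.040893     # total SS about the mean (LSDV1 output)
sst_uncentered = 16191.5969   # total SS about zero (LSDV2 output)
df_model_lsdv2, df_error = 9, 81

r2_centered = 1 - sse / sst_centered        # the R-squared reported by LSDV1
r2_uncentered = 1 - sse / sst_uncentered    # the R-squared reported by LSDV2
f_uncentered = ((sst_uncentered - sse) / df_model_lsdv2) / (sse / df_error)

print(round(r2_centered, 4), round(r2_uncentered, 4), round(f_uncentered))
```

The residual sum of squares is the same in both models, so the parameter estimates and standard errors are unaffected; only the baseline used for the fit statistics differs.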

In LIMDEP, you need to drop ONE out of the Rhs; to suppress the intercept. Unlike SAS and
STATA, LIMDEP reports correct $R^2$ and F even in LSDV2.

--> REGRESS;Lhs=COST;Rhs=G1,G2,G3,G4,G5,G6,OUTPUT,FUEL,LOAD$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 |
| Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 |
| Fit: R-squared= .997434, Adjusted R-squared = .99718 |
| Model test: F[ 8, 81] = 3935.82, Prob value = .00000 |
| Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 |
| Model does not contain ONE. R-squared and F can be negative! |
| Autocorrel: Durbin-Watson Statistic = 1.02645, Rho = .48677 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
G1 9.705949253 .19312325 50.258 .0000 .16666667
G2 9.664715269 .19898117 48.571 .0000 .16666667
G3 9.497032673 .22495746 42.217 .0000 .16666667
G4 9.890513806 .24176245 40.910 .0000 .16666667
G5 9.730013568 .26094094 37.288 .0000 .16666667
G6 9.793021272 .26366104 37.142 .0000 .16666667
OUTPUT .9192881432 .29889967E-01 30.756 .0000 -1.1743092
FUEL .4174910457 .15199071E-01 27.468 .0000 12.770359
LOAD -1.070395015 .20168924 -5.307 .0000 .56046016
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)


4.4 LSDV3 with Restrictions

LSDV3 imposes a restriction that the sum of the dummy parameters is zero. The SAS REG
procedure uses the RESTRICT statement to impose restrictions.

PROC REG DATA=masil.airline;
MODEL cost = g1-g6 output fuel load;
RESTRICT g1 + g2 + g3 + g4 + g5 + g6 = 0;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: cost

NOTE: Restrictions have been applied to parameter estimates.


Number of Observations Read 90
Number of Observations Used 90


Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 8 113.74827 14.21853 3935.79 <.0001
Error 81 0.29262 0.00361
Corrected Total 89 114.04089


Root MSE 0.06011 R-Square 0.9974
Dependent Mean 13.36561 Adj R-Sq 0.9972
Coeff Var 0.44970


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 9.71353 0.22964 42.30 <.0001
g1 1 -0.00759 0.04562 -0.17 0.8683
g2 1 -0.04882 0.03798 -1.29 0.2023
g3 1 -0.21651 0.01606 -13.48 <.0001
g4 1 0.17697 0.01942 9.11 <.0001
g5 1 0.01647 0.03669 0.45 0.6547
g6 1 0.07948 0.04050 1.96 0.0532
output 1 0.91928 0.02989 30.76 <.0001
fuel 1 0.41749 0.01520 27.47 <.0001
load 1 -1.07040 0.20169 -5.31 <.0001
RESTRICT -1 3.01674E-15 1.51088E-10 0.00 1.0000*

* Probability computed using beta distribution.

The dummy coefficients are deviations from the averaged group effect (9.714). The actual
intercept of group 2, for example, is 9.665 = 9.714 + (-.049). Note that the RESTRICT estimate
of 3.01674E-15 above is virtually zero.
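
The relationship between LSDV2 and LSDV3 can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original analysis; it assumes the six group intercepts reported by the STATA LSDV2 output in Section 4.3, and the variable names are hypothetical.

```python
# LSDV2 group intercepts (coefficients of g1-g6 from the no-intercept regression)
lsdv2 = [9.705942, 9.664706, 9.497021, 9.890498, 9.729997, 9.793004]

# LSDV3 intercept: the average of the six group intercepts
avg = sum(lsdv2) / len(lsdv2)

# LSDV3 dummy coefficients: deviation of each group intercept from the average
dev = [g - avg for g in lsdv2]

print(round(avg, 4))                # averaged group effect
print([round(d, 4) for d in dev])   # deviations, which sum to zero by construction
print(round(avg + dev[1], 3))       # actual intercept of group 2
```

The average and the deviations should agree with the Intercept and g1-g6 estimates in the restricted regression up to rounding.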

In STATA, you have to use the .cnsreg command rather than .regress. The command,
however, does not provide an ANOVA table or goodness-of-fit statistics.

. constraint define 1 g1 + g2 + g3 + g4 + g5 + g6 = 0
. cnsreg cost g1-g6 output fuel load, constraint(1)

Constrained linear regression                          Number of obs =      90
                                                       F(  8,    81) = 3935.79
                                                       Prob > F      =  0.0000
                                                       Root MSE      =  .06011
 ( 1)  g1 + g2 + g3 + g4 + g5 + g6 = 0
------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          g1 |  -.0075859   .0456178    -0.17   0.868    -.0983509    .0831792
          g2 |  -.0488218   .0379787    -1.29   0.202    -.1243875    .0267439
          g3 |  -.2165069   .0160624   -13.48   0.000    -.2484661   -.1845478
          g4 |   .1769698   .0194247     9.11   0.000     .1383208    .2156189
          g5 |   .0164689   .0366904     0.45   0.655    -.0565335    .0894712
          g6 |   .0794759   .0405008     1.96   0.053     -.001108    .1600597
      output |   .9192846   .0298901    30.76   0.000     .8598126    .9787565
        fuel |   .4174918   .0151991    27.47   0.000     .3872503    .4477333
        load |  -1.070396     .20169    -5.31   0.000    -1.471696   -.6690963
       _cons |   9.713528    .229641    42.30   0.000     9.256614    10.17044
------------------------------------------------------------------------------


LIMDEP has the Cls$ subcommand to impose restrictions. Again, do not forget to include
ONE in the Rhs; list.

--> REGRESS;Lhs=COST;Rhs=ONE,G1,G2,G3,G4,G5,G6,OUTPUT,FUEL,LOAD;
Cls:b(1)+b(2)+b(3)+b(4)+b(5)+b(6)=0$

+-----------------------------------------------------------------------+
| Linearly restricted regression |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 |
| Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 |
| Fit: R-squared= .997434, Adjusted R-squared = .99718 |
| (Note: Not using OLS. R-squared is not bounded in [0,1] |
| Model test: F[ 8, 81] = 3935.82, Prob value = .00000 |
| Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 |
| Note, when restrictions are imposed, R-squared can be less than zero. |
| F[ 1, 80] for the restrictions = .0000, Prob = 1.0000 |
| Autocorrel: Durbin-Watson Statistic = 1.02645, Rho = .48677 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant 12.12205614 .27886962 43.469 .0000
G1 -2.416106889 .89836871E-01 -26.894 .0000 .16666667
G2 -2.457340873 .82929154E-01 -29.632 .0000 .16666667
G3 -2.625023469 .56175656E-01 -46.729 .0000 .16666667
G4 -2.231542336 .41557714E-01 -53.697 .0000 .16666667
G5 -2.392042574 .29995908E-01 -79.746 .0000 .16666667
G6 -2.329034870 .33569388E-01 -69.380 .0000 .16666667
OUTPUT .9192881432 .29889967E-01 30.756 .0000 -1.1743092
FUEL .4174910457 .15199071E-01 27.468 .0000 12.770359
LOAD -1.070395015 .20168924 -5.307 .0000 .56046016

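LSDV3 in LIMDEP reports different dummy coefficients. Because ONE is listed first in Rhs;,
b(1) through b(6) in the Cls: restriction appear to index the constant and the first five
dummies; the reported estimates indeed satisfy Constant + G1 + ... + G5 = 0 rather than
G1 + ... + G6 = 0. You may nevertheless draw actual intercepts of groups in a manner similar
to what you would do in SAS and STATA. The actual intercept of group 3, for example, is
9.497 = 12.122 + (-2.625).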
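
The recovery of group intercepts from the LIMDEP output can be sketched as follows. This is an illustrative Python check under the assumption that the coefficients are copied correctly from the restricted regression above; the names are hypothetical. It also checks which linear combination of the reported estimates is numerically zero.

```python
# Coefficients from the LIMDEP restricted regression
constant = 12.12205614
g = [-2.416106889, -2.457340873, -2.625023469,
     -2.231542336, -2.392042574, -2.329034870]

# Actual group intercepts: constant plus each dummy coefficient
intercepts = [constant + gi for gi in g]
print([round(a, 3) for a in intercepts])

# Which restriction do the reported numbers satisfy?
check_first_six = constant + sum(g[:5])   # constant + G1 + ... + G5
check_dummies = sum(g)                    # G1 + ... + G6
print(round(check_first_six, 6), round(check_dummies, 3))
```

The recovered intercepts should match the LSDV2 coefficients of G1 through G6 up to rounding, even though the dummy coefficients themselves differ from those of SAS and STATA.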


4.5 Within Group Effect Model

The within effect model does not use the dummies and thus has larger degrees of freedom,
smaller MSE, and smaller standard errors of parameters than those of LSDV. As a consequence,
you need to adjust the standard errors. This model does not report individual dummy coefficients
either. The SAS TSCSREG procedure and the LIMDEP Regress$ command report the adjusted
(correct) MSE, SEE (Root MSE), $R^2$, and standard errors.

4.5.1 Estimating the Within Effect Model

First, let us manually estimate the within group effect model in STATA. You need to compute
group means and transform dependent and independent variables using group means (log is
skipped here).

. egen gm_cost=mean(cost), by(airline) // compute group means
. egen gm_output=mean(output), by(airline)
. egen gm_fuel=mean(fuel), by(airline)
. egen gm_load=mean(load), by(airline)

You will get the following group means of variables.

     +-------------------------------------------------------+
     | airline    gm_cost   gm_output    gm_fuel     gm_load |
     |-------------------------------------------------------|
     |       1   14.67563    .3192696    12.7318    .5971917 |
     |       2   14.37247    -.033027   12.75171    .5470946 |
     |       3   13.37231   -.9122626   12.78972    .5845358 |
     |       4    13.1358   -1.635174   12.77803    .5476773 |
     |       5   12.36304   -2.285681    12.7921    .5664859 |
     |       6   12.27441    -2.49898    12.7788    .5197756 |
     +-------------------------------------------------------+

. gen gw_cost = cost - gm_cost // compute deviations from the group means
. gen gw_output = output - gm_output
. gen gw_fuel = fuel - gm_fuel
. gen gw_load = load - gm_load
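
The two steps above (group means, then deviations from them) can be sketched in Python on a toy panel. This is only an illustration under stated assumptions: the data are made up, there is a single regressor, and the helper name group_means is hypothetical.

```python
from collections import defaultdict

# Hypothetical toy panel: (group, x, y); values are made up for illustration
rows = [(1, 1.0, 3.1), (1, 2.0, 5.0), (1, 3.0, 6.9),
        (2, 1.0, 6.0), (2, 2.0, 8.1), (2, 3.0, 9.9)]

def group_means(rows, col):
    """Return {group: mean of column col}, mirroring egen ... mean(), by()."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r[0]] += r[col]
        counts[r[0]] += 1
    return {g: sums[g] / counts[g] for g in sums}

gm_x, gm_y = group_means(rows, 1), group_means(rows, 2)

# Deviations from the group means, mirroring the gen gw_* commands
gw = [(x - gm_x[g], y - gm_y[g]) for g, x, y in rows]

# OLS through the origin on the deviations: slope = sum(x*y) / sum(x*x)
slope = sum(dx * dy for dx, dy in gw) / sum(dx * dx for dx, _ in gw)
print(round(slope, 3))
```

Because the deviations have zero mean within each group, the regression on them needs no intercept, which is why the intercept is suppressed in the within effect model.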

Now, we are ready to run the within effect model. Keep in mind that you have to suppress the
intercept. Carefully check MSE, SEE, $R^2$, and standard errors.

. regress gw_cost gw_output gw_fuel gw_load, noc // within effect

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  3,    87) = 3871.82
       Model |  39.0683861     3  13.0227954           Prob > F      =  0.0000
    Residual |  .292622861    87  .003363481           R-squared     =  0.9926
-------------+------------------------------           Adj R-squared =  0.9923
       Total |   39.361009    90  .437344544           Root MSE      =    .058

------------------------------------------------------------------------------
     gw_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   gw_output |   .9192846    .028841    31.87   0.000       .86196    .9766092
     gw_fuel |   .4174918   .0146657    28.47   0.000     .3883422    .4466414
     gw_load |  -1.070396   .1946109    -5.50   0.000    -1.457206   -.6835858
------------------------------------------------------------------------------

You may compute group intercepts using $d_g^* = \bar{y}_{g\bullet} - \beta'\bar{x}_{g\bullet}$.
For example, the intercept of airline 5 is computed as
9.730 = 12.363 - {.919*(-2.286) + .417*12.792 + (-1.070)*.566}. In
order to get the correct standard errors, you need to adjust them using the ratio of degrees of
freedom of the within effect model and the LSDV. For example, the standard error of the
logged output is computed as .0299 = .0288*sqrt(87/81).
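
Both computations can be reproduced with the unrounded figures. The following Python sketch assumes the airline 5 group means and the within effect estimates shown above; the variable names are hypothetical.

```python
import math

# Group means for airline 5 and coefficients of the within effect model
gm_cost, gm_output, gm_fuel, gm_load = 12.36304, -2.285681, 12.7921, 0.5664859
b_output, b_fuel, b_load = 0.9192846, 0.4174918, -1.070396

# Group intercept: mean of y minus the fitted contribution of the regressor means
d5 = gm_cost - (b_output * gm_output + b_fuel * gm_fuel + b_load * gm_load)

# Degrees-of-freedom adjustment: 87 in the within effect model, 81 in LSDV
se_within = 0.028841
se_adjusted = se_within * math.sqrt(87 / 81)

print(round(d5, 3), round(se_adjusted, 4))
```

The adjusted standard error should agree with the LSDV standard error of output (.0298901) up to rounding.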

4.5.2 Using the SAS TSCSREG and PANEL Procedures

The TSCSREG and PANEL procedures of SAS/ETS allow users to fit the within effect model
conveniently. The procedures, in fact, report LSDV1, but you do not need to create dummy
variables and compute deviations from the group means. These procedures report correct MSE,
SEE, $R^2$, and standard errors, and conduct the F test for the fixed group effect as well.

PROC SORT DATA=masil.airline;
BY airline year;

PROC TSCSREG DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /FIXONE;
RUN;

The TSCSREG Procedure

Dependent Variable: cost

Model Description

Estimation Method FixOne
Number of Cross Sections 6
Time Series Length 15


Fit Statistics

SSE 0.2926 DFE 81
MSE 0.0036 Root MSE 0.0601
R-Square 0.9974


F Test for No Fixed Effects

Num DF Den DF F Value Pr > F

5 81 57.73 <.0001


Parameter Estimates

Standard
Variable DF Estimate Error t Value Pr > |t| Label

CS1 1 -0.08706 0.0842 -1.03 0.3042 Cross Sectional
Effect 1
CS2 1 -0.1283 0.0757 -1.69 0.0941 Cross Sectional
Effect 2
CS3 1 -0.29598 0.0500 -5.92 <.0001 Cross Sectional
Effect 3
CS4 1 0.097494 0.0330 2.95 0.0041 Cross Sectional
Effect 4
CS5 1 -0.06301 0.0239 -2.64 0.0100 Cross Sectional
Effect 5
Intercept 1 9.793004 0.2637 37.14 <.0001 Intercept
output 1 0.919285 0.0299 30.76 <.0001
fuel 1 0.417492 0.0152 27.47 <.0001
load 1 -1.0704 0.2017 -5.31 <.0001

Note that a data set needs to be sorted in advance by the variables that appear in the ID
statement of the TSCSREG and PANEL procedures. The following PANEL procedure returns the same
output.

PROC PANEL DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /FIXONE;
RUN;

4.5.3 Using STATA

The STATA .xtreg command fits the within group effect model without creating dummy
variables. The command reports correct standard errors and the F test for fixed group effects.
This command, however, does not provide an analysis of variance (ANOVA) table or correct
$R^2$ and F statistics. The .xtreg command should follow the .tsset command that specifies
grouping and time variables.

. tsset airline year
       panel variable:  airline, 1 to 6
        time variable:  year, 1 to 15

The fe option of .xtreg indicates the within effect model and i(airline) specifies airline as the
independent unit. Note that this command reports adjusted (correct) standard errors.

. xtreg cost output fuel load, fe i(airline) // within group effect

Fixed-effects (within) regression               Number of obs      =        90
Group variable (i): airline                     Number of groups   =         6

R-sq:  within  = 0.9926                         Obs per group: min =        15
       between = 0.9856                                        avg =      15.0
       overall = 0.9873                                        max =        15

                                                F(3,81)            =   3604.80
corr(u_i, Xb)  = -0.3475                        Prob > F           =    0.0000

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .9192846   .0298901    30.76   0.000     .8598126    .9787565
        fuel |   .4174918   .0151991    27.47   0.000     .3872503    .4477333
        load |  -1.070396     .20169    -5.31   0.000    -1.471696   -.6690963
       _cons |   9.713528    .229641    42.30   0.000     9.256614    10.17044
-------------+----------------------------------------------------------------
     sigma_u |   .1320775
     sigma_e |  .06010514
         rho |  .82843653   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(5, 81) =    57.73               Prob > F = 0.0000

The last line of the output tests the null hypothesis that all dummy parameters in LSDV1 are
zero (e.g., g1=0, g2=0, g3=0, g4=0, and g5=0). Note that the intercept of 9.714 is that of LSDV3.

4.5.4 Using LIMDEP

In LIMDEP, you have to specify the panel data model and stratification or time variables. The
Panel$ and Fixed$ subcommands mean a fixed effect panel data model. The Str$
subcommand specifies a stratification variable.

--> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=AIRLINE;Fixed$

+-----------------------------------------------------------------------+
| OLS Without Group Dummy Variables |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 4, Deg.Fr.= 86 |
| Residuals: Sum of squares= 1.335449522 , Std.Dev.= .12461 |
| Fit: R-squared= .988290, Adjusted R-squared = .98788 |
| Model test: F[ 3, 86] = 2419.33, Prob value = .00000 |
| Diagnostic: Log-L = 61.7699, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -4.122, Akaike Info. Crt.= -1.284 |
| Panel Data Analysis of COST [ONE way] |
| Unconditional ANOVA (No regressors) |
| Source Variation Deg. Free. Mean Square |
| Between 74.6799 5. 14.9360 |
| Residual 39.3611 84. .468584 |
| Total 114.041 89. 1.28136 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
OUTPUT .8827386341 .13254552E-01 66.599 .0000 -1.1743092
FUEL .4539777119 .20304240E-01 22.359 .0000 12.770359
LOAD -1.627507797 .34530293 -4.713 .0000 .56046016
Constant 9.516912231 .22924522 41.514 .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+-----------------------------------------------------------------------+
| Least Squares with Group Dummy Variables |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 |
| Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 |
| Fit: R-squared= .997434, Adjusted R-squared = .99718 |
| Model test: F[ 8, 81] = 3935.82, Prob value = .00000 |
| Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 |
| Estd. Autocorrelation of e(i,t) .573531 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
OUTPUT .9192881432 .29889967E-01 30.756 .0000 -1.1743092
FUEL .4174910457 .15199071E-01 27.468 .0000 12.770359
LOAD -1.070395015 .20168924 -5.307 .0000 .56046016
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

LIMDEP reports both the pooled OLS regression and the within effect model. Like the SAS
TSCSREG procedure, LIMDEP provides correct MSE, SEE, $R^2$, and standard errors.


4.6 Between Group Effect Model: Group Mean Regression

The between effect model uses aggregate information, group means of variables. In other
words, the unit of analysis is not an individual observation, but groups or subjects. The number
of observations drops from nT to n. This group mean regression produces goodness-of-fit
measures and parameter estimates different from those of LSDV and the within effect model.

Let us compute group means and run the OLS regression with them. The .collapse command
computes aggregate information and saves it into a new data set. Note that /// joins two
command lines.

. collapse (mean) gm_cost=cost (mean) gm_output=output (mean) gm_fuel=fuel (mean) ///
gm_load=load, by(airline)

. regress gm_cost gm_output gm_fuel gm_load

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  3,     2) =  104.12
       Model |  4.94698124     3  1.64899375           Prob > F      =  0.0095
    Residual |  .031675926     2  .015837963           R-squared     =  0.9936
-------------+------------------------------           Adj R-squared =  0.9841
       Total |  4.97865717     5  .995731433           Root MSE      =  .12585

------------------------------------------------------------------------------
     gm_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   gm_output |   .7824568   .1087646     7.19   0.019     .3144803    1.250433
     gm_fuel |  -5.523904   4.478718    -1.23   0.343    -24.79427    13.74647
     gm_load |  -1.751072   2.743167    -0.64   0.589    -13.55397    10.05182
       _cons |    85.8081   56.48199     1.52   0.268    -157.2143    328.8305
------------------------------------------------------------------------------
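
The collapse-then-regress logic can be illustrated on a toy panel. The following Python sketch is only an assumption-laden example: the data are made up, there is a single regressor, and it is not a substitute for the STATA commands above.

```python
from statistics import mean

# Hypothetical toy panel with n = 3 groups and T = 2 periods: (group, x, y)
rows = [(1, 1.0, 2.0), (1, 3.0, 4.2),
        (2, 4.0, 6.9), (2, 6.0, 9.1),
        (3, 7.0, 12.0), (3, 9.0, 14.2)]

# Collapse to one observation per group: nT = 6 rows become n = 3 group means
groups = sorted({g for g, _, _ in rows})
gm_x = [mean(x for g, x, _ in rows if g == grp) for grp in groups]
gm_y = [mean(y for g, _, y in rows if g == grp) for grp in groups]

# Simple OLS of the group means of y on the group means of x
xbar, ybar = mean(gm_x), mean(gm_y)
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(gm_x, gm_y))
         / sum((x - xbar) ** 2 for x in gm_x))
intercept = ybar - slope * xbar
print(len(rows), len(gm_x), round(slope, 3), round(intercept, 3))
```

With only n observations left, the between effect model has very few degrees of freedom, which is visible in the wide confidence intervals of the airline output.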

The SAS PANEL procedure has the /BTWNG and /BTWNT options to estimate the between
effect model. The TSCSREG procedure does not have these options.

PROC PANEL DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /BTWNG;
RUN;

The PANEL Procedure
Between Groups Estimates

Dependent Variable: cost

Model Description

Estimation Method BtwGrps
Number of Cross Sections 6
Time Series Length 15


Fit Statistics

SSE 0.0317 DFE 2
MSE 0.0158 Root MSE 0.1258
R-Square 0.9936


Parameter Estimates

Standard
Variable DF Estimate Error t Value Pr > |t| Label

Intercept 1 85.80901 56.4830 1.52 0.2681 Intercept
output 1 0.782455 0.1088 7.19 0.0188
fuel 1 -5.52398 4.4788 -1.23 0.3427
load 1 -1.75102 2.7432 -0.64 0.5886

The STATA .xtreg command has the be option to fit the between effect model. This
command, however, does not report the ANOVA table.

. xtreg cost output fuel load, be i(airline)

Between regression (regression on group means)  Number of obs      =        90
Group variable (i): airline                     Number of groups   =         6

R-sq:  within  = 0.8808                         Obs per group: min =        15
       between = 0.9936                                        avg =      15.0
       overall = 0.1371                                        max =        15

                                                F(3,2)             =    104.12
sd(u_i + avg(e_i.))=  .1258491                  Prob > F           =    0.0095

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .7824552   .1087663     7.19   0.019     .3144715    1.250439
        fuel |  -5.523978   4.478802    -1.23   0.343    -24.79471    13.74675
        load |  -1.751016    2.74319    -0.64   0.589    -13.55401    10.05198
       _cons |   85.80901   56.48302     1.52   0.268    -157.2178    328.8358
------------------------------------------------------------------------------

LIMDEP has the Means; subcommand to fit the between effect model.

--> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=AIRLINE;Means$

+-----------------------------------------------------------------------+
| Group Means Regression |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = YBAR(i.) Mean= 13.36560933 , S.D.= .9978636346 |
| Model size: Observations = 6, Parameters = 4, Deg.Fr.= 2 |
| Residuals: Sum of squares= .3167277206E-01, Std.Dev.= .12584 |
| Fit: R-squared= .993638, Adjusted R-squared = .98410 |
| Model test: F[ 3, 2] = 104.13, Prob value = .00953 |
| Diagnostic: Log-L = 7.2185, Restricted(b=0) Log-L = -7.9538 |
| LogAmemiyaPrCrt.= -3.635, Akaike Info. Crt.= -1.073 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
OUTPUT .7824472689 .10876126 7.194 .0000 .23025612E-11
FUEL -5.524437466 4.4786519 -1.234 .2174 .18642891
LOAD -1.750947653 2.7430470 -.638 .5233 .32541105
Constant 85.81483169 56.481148 1.519 .1287



4.7 Testing Fixed Group Effects (F-test)

How do we know whether there are fixed group effects? The null hypothesis is that all dummy
parameters except one are zero: $H_0: \mu_1 = \dots = \mu_{n-1} = 0$.

In order to conduct an F-test, let us take the SSE (e'e) of 1.3354 from the pooled OLS regression
and .2926 from the LSDVs (LSDV1 through LSDV3) or the within effect model. Alternatively,
you may draw $R^2$ of .9974 from LSDV1 or LSDV3 and .9883 from the pooled OLS. Do not,
however, use LSDV2 and the within effect model for $R^2$.

The F statistic is computed as

$$\frac{(1.3354 - .2926)/(6-1)}{(.2926)/(90-6-3)} = \frac{(.9974 - .9883)/(6-1)}{(1 - .9974)/(90-6-3)} \sim 57.7319\,[5, 81].$$

The large F statistic rejects the null hypothesis in favor of the fixed group effect model
(p<.0001).
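
The F statistic can be reproduced from either the SSEs or the R-squared values. The Python sketch below is a minimal check that assumes the unrounded figures reported in the LIMDEP output of Section 4.5.4; the variable names are hypothetical. With R-squared rounded to four decimals (.9974 and .9883), the second ratio drifts to roughly 56.7, so the six-decimal values are used here.

```python
# Inputs taken from the pooled OLS and LSDV outputs
sse_pooled, sse_lsdv = 1.335449522, 0.2926207777   # residual sums of squares
r2_pooled, r2_lsdv = 0.988290, 0.997434            # six-decimal R-squared values
n, T, k = 6, 15, 3                                 # groups, periods, regressors

df_num = n - 1            # number of dummy parameters restricted to zero
df_den = n * T - n - k    # residual degrees of freedom of LSDV

f_sse = ((sse_pooled - sse_lsdv) / df_num) / (sse_lsdv / df_den)
f_r2 = ((r2_lsdv - r2_pooled) / df_num) / ((1 - r2_lsdv) / df_den)
print(df_num, df_den, round(f_sse, 2), round(f_r2, 2))
```

Both versions should land close to the 57.73 reported by the SAS and STATA effect tests.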

The SAS TSCSREG and PANEL procedures and the STATA .xtreg command by default conduct
the F test. Alternatively, you may conduct the same test with LSDV1. In SAS, add the TEST
statement in the REG procedure and run the procedure again (other outputs are skipped).

PROC REG DATA=masil.airline;
MODEL cost = g1-g5 output fuel load;
TEST g1 = g2 = g3 = g4 = g5 = 0;
RUN;

The REG Procedure
Model: MODEL1

Test 1 Results for Dependent Variable cost

Mean
Source DF Square F Value Pr > F

Numerator 5 0.20856 57.73 <.0001
Denominator 81 0.00361

In STATA, run the .test command, a follow-up command for the Wald test, right after
estimating the model.

. quietly regress cost g1-g5 output fuel load // LSDV1
. test g1 g2 g3 g4 g5

( 1) g1 = 0
( 2) g2 = 0
( 3) g3 = 0
( 4) g4 = 0
( 5) g5 = 0

       F(  5,    81) =   57.73
            Prob > F =    0.0000



4.8 Summary

Table 6 summarizes the estimation of panel data models in SAS, STATA, and LIMDEP. The
SAS REG and TSCSREG procedures are generally preferred to STATA and LIMDEP
commands.

Table 6. Comparison of the Fixed Effect Model in SAS, STATA, and LIMDEP*

                   SAS 9.1               STATA 9.0                    LIMDEP 8.0
OLS estimation     PROC REG;             .regress (.cnsreg)           Regress$
LSDV1              Correct               Correct                      Correct (slightly different F)
LSDV2              Incorrect F,          Incorrect F,                 Correct (slightly different F),
                   (adjusted) $R^2$      (adjusted) $R^2$             correct $R^2$
LSDV3              Correct               .cnsreg command;             Correct (slightly different F),
                                         no $R^2$ or ANOVA table,     different dummy coefficients
                                         but F
Panel estimation   PROC TSCSREG;         .xtreg                       Regress; Panel$
                   PROC PANEL;
Estimation type    LSDV1                 Within and between effect    Within effect
SSE (e'e)          Correct               No                           Correct
MSE or SEE         Correct (adjusted)    No                           Correct (adjusted) SEE
Model test (F)     No                    Incorrect                    Slightly different F
(adjusted) $R^2$   Correct               Incorrect                    Correct
Intercept          Correct               LSDV3 intercept              No
Coefficients       Correct               Correct                      Correct
Standard errors    Correct (adjusted)    Correct (adjusted)           Correct (adjusted)
Effect test (F)    Yes                   Yes                          No
Between effect     Yes (PROC PANEL;)     Yes (the be option)          N/A

* Yes/No indicates whether the software reports the statistic. Correct/incorrect indicates whether the
statistic differs from that of the least squares dummy variable (LSDV) 1 model without a dummy variable.

5. The Fixed Time Effect Model

The fixed time effect model investigates how time affects the intercept using time dummy
variables. The logic and method are the same as those of the fixed group effect model.


5.1 Least Squares Dummy Variable Models

The least squares dummy variable (LSDV) model produces fifteen regression equations. This
section does not present all outputs, but one or two for each LSDV approach.

Time01: cost = 20.496 + .868*output - .484*fuel - 1.954*load
Time02: cost = 20.578 + .868*output - .484*fuel - 1.954*load
Time03: cost = 20.656 + .868*output - .484*fuel - 1.954*load
Time04: cost = 20.741 + .868*output - .484*fuel - 1.954*load
Time05: cost = 21.200 + .868*output - .484*fuel - 1.954*load
Time06: cost = 21.412 + .868*output - .484*fuel - 1.954*load
Time07: cost = 21.503 + .868*output - .484*fuel - 1.954*load
Time08: cost = 21.654 + .868*output - .484*fuel - 1.954*load
Time09: cost = 21.830 + .868*output - .484*fuel - 1.954*load
Time10: cost = 22.114 + .868*output - .484*fuel - 1.954*load
Time11: cost = 22.465 + .868*output - .484*fuel - 1.954*load
Time12: cost = 22.651 + .868*output - .484*fuel - 1.954*load
Time13: cost = 22.617 + .868*output - .484*fuel - 1.954*load
Time14: cost = 22.552 + .868*output - .484*fuel - 1.954*load
Time15: cost = 22.537 + .868*output - .484*fuel - 1.954*load

5.1.1 LSDV1 without a Dummy

Let us begin with the SAS REG procedure. A TEST statement, as in Section 4.7, would examine fixed time effects.

PROC REG DATA=masil.airline;
MODEL cost = t1-t14 output fuel load;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: cost

Number of Observations Read 90
Number of Observations Used 90


Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 17 112.95270 6.64428 439.62 <.0001
Error 72 1.08819 0.01511
Corrected Total 89 114.04089


Root MSE 0.12294 R-Square 0.9905
Dependent Mean 13.36561 Adj R-Sq 0.9882
Coeff Var 0.91981


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 22.53677 4.94053 4.56 <.0001
t1 1 -2.04096 0.73469 -2.78 0.0070
t2 1 -1.95873 0.72275 -2.71 0.0084
t3 1 -1.88103 0.72036 -2.61 0.0110
t4 1 -1.79601 0.69882 -2.57 0.0122
t5 1 -1.33693 0.50604 -2.64 0.0101
t6 1 -1.12514 0.40862 -2.75 0.0075
t7 1 -1.03341 0.37642 -2.75 0.0076
t8 1 -0.88274 0.32601 -2.71 0.0085
t9 1 -0.70719 0.29470 -2.40 0.0190
t10 1 -0.42296 0.16679 -2.54 0.0134
t11 1 -0.07144 0.07176 -1.00 0.3228
t12 1 0.11457 0.09841 1.16 0.2482
t13 1 0.07979 0.08442 0.95 0.3477
t14 1 0.01546 0.07264 0.21 0.8320
output 1 0.86773 0.01541 56.32 <.0001
fuel 1 -0.48448 0.36411 -1.33 0.1875
load 1 -1.95440 0.44238 -4.42 <.0001

The following are the corresponding STATA and LIMDEP commands for LSDV1 (outputs are
skipped).

. regress cost t1-t14 output fuel load

REGRESS;Lhs=COST;Rhs=ONE,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,OUTPUT,FUEL,LOAD$

5.1.2 LSDV2 without the Intercept

Let us use LIMDEP to fit LSDV2 because it reports correct (although slightly different) F and
$R^2$ statistics.

--> REGRESS;Lhs=COST;Rhs=T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,OUTPUT,FUEL,LOAD$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560929 , S.D.= 1.131971002 |
| Model size: Observations = 90, Parameters = 18, Deg.Fr.= 72 |
| Residuals: Sum of squares= 1.088190223 , Std.Dev.= .12294 |
| Fit: R-squared= .990458, Adjusted R-squared = .98820 |
| Model test: F[ 17, 72] = 439.62, Prob value = .00000 |
| Diagnostic: Log-L = 70.9837, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -4.010, Akaike Info. Crt.= -1.177 |
| Model does not contain ONE. R-squared and F can be negative! |
| Autocorrel: Durbin-Watson Statistic = 2.93900, Rho = -.46950 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
T1 20.49580478 4.2095283 4.869 .0000 .66666667E-01
T2 20.57803885 4.2215262 4.875 .0000 .66666667E-01
T3 20.65573100 4.2241771 4.890 .0000 .66666667E-01
T4 20.74075857 4.2457497 4.885 .0000 .66666667E-01
T5 21.19983202 4.4403312 4.774 .0000 .66666667E-01
T6 21.41162082 4.5386212 4.718 .0000 .66666667E-01
T7 21.50335085 4.5713968 4.704 .0000 .66666667E-01
T8 21.65402827 4.6228858 4.684 .0000 .66666667E-01
T9 21.82957108 4.6569062 4.688 .0000 .66666667E-01
T10 22.11380260 4.7926483 4.614 .0000 .66666667E-01
T11 22.46532734 4.9499089 4.539 .0000 .66666667E-01
T12 22.65133704 5.0085924 4.522 .0000 .66666667E-01
T13 22.61655508 4.9861391 4.536 .0000 .66666667E-01
T14 22.55222832 4.9559418 4.551 .0000 .66666667E-01
T15 22.53676562 4.9405321 4.562 .0000 .66666667E-01
OUTPUT .8677267843 .15408184E-01 56.316 .0000 -1.1743092
FUEL -.4844835367 .36410849 -1.331 .1875 12.770359
LOAD -1.954404328 .44237771 -4.418 .0000 .56046015
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

The following are the corresponding SAS REG procedure and STATA command for LSDV2
(outputs are skipped).

PROC REG DATA=masil.airline;
MODEL cost = t1-t15 output fuel load /NOINT;
RUN;

. regress cost t1-t15 output fuel load, noc

5.1.3 LSDV3 with a Restriction

In SAS, you need to use the RESTRICT statement to impose a restriction.

PROC REG DATA=masil.airline;
MODEL cost = t1-t15 output fuel load;
RESTRICT t1+t2+t3+t4+t5+t6+t7+t8+t9+t10+t11+t12+t13+t14+t15=0;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: cost

NOTE: Restrictions have been applied to parameter estimates.


Number of Observations Read 90
Number of Observations Used 90


Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 17 112.95270 6.64428 439.62 <.0001
Error 72 1.08819 0.01511
Corrected Total 89 114.04089


Root MSE 0.12294 R-Square 0.9905
Dependent Mean 13.36561 Adj R-Sq 0.9882
Coeff Var 0.91981


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 21.66698 4.62405 4.69 <.0001
t1 1 -1.17118 0.41783 -2.80 0.0065
t2 1 -1.08894 0.40586 -2.68 0.0090
t3 1 -1.01125 0.40323 -2.51 0.0144
t4 1 -0.92622 0.38177 -2.43 0.0178
t5 1 -0.46715 0.19076 -2.45 0.0168
t6 1 -0.25536 0.09856 -2.59 0.0116
t7 1 -0.16363 0.07190 -2.28 0.0258
t8 1 -0.01296 0.04862 -0.27 0.7907
t9 1 0.16259 0.06271 2.59 0.0115
t10 1 0.44682 0.17599 2.54 0.0133
t11 1 0.79834 0.32940 2.42 0.0179
t12 1 0.98435 0.38756 2.54 0.0132
t13 1 0.94957 0.36537 2.60 0.0113
t14 1 0.88524 0.33549 2.64 0.0102
t15 1 0.86978 0.32029 2.72 0.0083
output 1 0.86773 0.01541 56.32 <.0001
fuel 1 -0.48448 0.36411 -1.33 0.1875
load 1 -1.95440 0.44238 -4.42 <.0001
RESTRICT -1 -3.946E-15 . . .

* Probability computed using beta distribution.

In STATA, define the restriction with the .constraint command and specify the restriction
using the constraint() option of the .cnsreg command.

. constraint define 3 t1+t2+t3+t4+t5+t6+t7+t8+t9+t10+t11+t12+t13+t14+t15=0
. cnsreg cost t1-t15 output fuel load, constraint(3)

Constrained linear regression                          Number of obs =      90
                                                       F( 17,    72) =  439.62
                                                       Prob > F      =  0.0000
                                                       Root MSE      =  .12294
 ( 1)  t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9 + t10 + t11 + t12 + t13 + t14 + t15 = 0
------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |  -1.171179   .4178338    -2.80   0.007    -2.004115   -.3382422
          t2 |  -1.088945   .4058579    -2.68   0.009    -1.898008   -.2798816
          t3 |  -1.011252   .4032308    -2.51   0.014    -1.815078   -.2074266
          t4 |  -.9262249   .3817675    -2.43   0.018    -1.687265   -.1651852
          t5 |  -.4671515   .1907596    -2.45   0.017    -.8474239   -.0868791
          t6 |  -.2553627   .0985615    -2.59   0.012    -.4518415   -.0588839
          t7 |  -.1636326   .0718969    -2.28   0.026    -.3069564   -.0203088
          t8 |  -.0129552   .0486249    -0.27   0.791    -.1098872    .0839768
          t9 |   .1625876   .0627099     2.59   0.012     .0375776    .2875976
         t10 |   .4468191    .175994     2.54   0.013     .0959814    .7976568
         t11 |   .7983439   .3294027     2.42   0.018     .1416916    1.454996
         t12 |   .9843536   .3875583     2.54   0.013     .2117702    1.756937
         t13 |   .9495716   .3653675     2.60   0.011     .2212248    1.677918
         t14 |   .8852448   .3354912     2.64   0.010     .2164554    1.554034
         t15 |   .8697821   .3202933     2.72   0.008     .2312891    1.508275
      output |   .8677268   .0154082    56.32   0.000     .8370111    .8984424
        fuel |  -.4844835   .3641085    -1.33   0.188    -1.210321    .2413535
        load |  -1.954404   .4423777    -4.42   0.000    -2.836268    -1.07254
       _cons |   21.66698   4.624053     4.69   0.000      12.4491    30.88486
------------------------------------------------------------------------------

The following is the corresponding LIMDEP command for LSDV3 (output is skipped).

REGRESS;Lhs=COST;Rhs=ONE,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,OUTPUT,FUEL,LOAD;
Cls:b(1)+b(2)+b(3)+b(4)+b(5)+b(6)+b(7)+b(8)+b(9)+b(10)+b(11)+b(12)+b(13)+b(14)+b(15)=0$
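
The LSDV2 and LSDV3 outputs above are linked by simple arithmetic: under the restriction that the time dummy coefficients sum to zero, the LSDV3 intercept is the average of the LSDV2 year intercepts, and each LSDV3 dummy coefficient is the deviation of a year intercept from that average. The following Python sketch checks this with the coefficients reported above; the variable names are illustrative and not part of any package.

```python
# Time-dummy coefficients T1..T15 copied from the LSDV2 (no-intercept) output.
lsdv2 = [
    20.49580478, 20.57803885, 20.65573100, 20.74075857, 21.19983202,
    21.41162082, 21.50335085, 21.65402827, 21.82957108, 22.11380260,
    22.46532734, 22.65133704, 22.61655508, 22.55222832, 22.53676562,
]

# With sum(t1..t15) = 0, the LSDV3 intercept is the mean of the year intercepts.
lsdv3_intercept = sum(lsdv2) / len(lsdv2)

# Each LSDV3 dummy coefficient is a deviation from that mean.
lsdv3_dummies = [d - lsdv3_intercept for d in lsdv2]

print(round(lsdv3_intercept, 5))    # compare with Intercept / _cons in LSDV3
print(round(lsdv3_dummies[0], 5))   # compare with t1 in LSDV3
print(round(lsdv3_dummies[14], 5))  # compare with t15 in LSDV3
```

The deviations sum to zero by construction, which is exactly the restriction imposed in LSDV3.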


5.2 Within Time Effect Model

The within effect model for the fixed time effects needs to compute deviations from the time
means. Keep in mind that the intercept should be suppressed.
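
As a minimal illustration of the idea, the following Python sketch applies the within (time) transformation to a small hypothetical panel with a single regressor. The data, the year intercepts, and the true slope of 2 are all made up for illustration; they are not the airline data.

```python
from collections import defaultdict

# Hypothetical toy panel (year, x, y); y = 2*x + a year-specific intercept.
year_effect = {1: 10.0, 2: 12.0, 3: 15.0}
x_by_year = {1: [1.0, 2.0, 4.0], 2: [0.5, 3.0, 3.5], 3: [2.0, 5.0, 6.0]}
rows = [(t, x, 2.0 * x + year_effect[t]) for t, xs in x_by_year.items() for x in xs]

def time_means(rows, col):
    """Mean of column `col` within each year (analogue of egen ... by(year))."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[0]] += row[col]
        counts[row[0]] += 1
    return {t: sums[t] / counts[t] for t in sums}

mean_x, mean_y = time_means(rows, 1), time_means(rows, 2)

# Deviations from the time means (analogue of gen tw_* = var - tm_*).
xw = [x - mean_x[t] for t, x, _ in rows]
yw = [y - mean_y[t] for t, _, y in rows]

# OLS without an intercept on the transformed data.
slope = sum(a * b for a, b in zip(xw, yw)) / sum(a * a for a in xw)
print(round(slope, 6))
```

Because the year intercepts are removed by the demeaning, the regression on the transformed data recovers the slope without any dummy variables.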

5.2.1 Estimating the Time Effect Model

Let us manually estimate the fixed time effect model first.

. egen tm_cost = mean(cost), by(year) // compute time means
. egen tm_output = mean(output), by(year)
. egen tm_fuel = mean(fuel), by(year)
. egen tm_load = mean(load), by(year)

+---------------------------------------------------+
| year    tm_cost   tm_output    tm_fuel    tm_load |
|---------------------------------------------------|
|    1   12.36897   -1.790283   11.63606   .4788587 |
|    2   12.45963   -1.744389   11.66868   .4868322 |
|    3   12.60706   -1.577767   11.67494     .52358 |
|    4   12.77912   -1.443695   11.73193   .5244486 |
|    5   12.94143   -1.398122   12.26843   .5635266 |
|    6    13.0452   -1.393002   12.53826   .5541809 |
|    7   13.15965   -1.302416   12.62714   .5607425 |
|    8   13.29884   -1.222963   12.76768   .5670587 |
|    9    13.4651   -1.067003   12.86104   .6179098 |
|   10   13.70187   -.9023156   13.23183   .6233943 |
|   11   13.91324   -.9205539   13.66246   .5802577 |
|   12   14.05984   -.8641667   13.82315   .5856243 |
|   13   14.12841   -.7923916   13.75979   .5803183 |
|   14   14.23517   -.6428015   13.67403   .5804528 |
|   15   14.32062   -.5527684   13.62997   .5797168 |
+---------------------------------------------------+

. gen tw_cost = cost - tm_cost // transform variables
. gen tw_output = output - tm_output
. gen tw_fuel = fuel - tm_fuel
. gen tw_load = load - tm_load

. regress tw_cost tw_output tw_fuel tw_load, noc // within time effect

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  3,    87) = 2015.95
       Model |  75.6459391     3   25.215313           Prob > F      =  0.0000
    Residual |  1.08819023    87  .012507934           R-squared     =  0.9858
-------------+------------------------------           Adj R-squared =  0.9853
       Total |  76.7341294    90  .852601437           Root MSE      =  .11184

------------------------------------------------------------------------------
     tw_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   tw_output |   .8677268   .0140171    61.90   0.000     .8398663    .8955873
     tw_fuel |  -.4844836   .3312359    -1.46   0.147    -1.142851    .1738836
     tw_load |  -1.954404   .4024388    -4.86   0.000    -2.754295   -1.154514
------------------------------------------------------------------------------

If you want to get intercepts of years, use $d_t^* = \bar{y}_{\bullet t} - \beta'\bar{x}_{\bullet t}$.
For example, the intercept of year 7
is 21.503 = 13.1597 - {.8677*(-1.3024) + (-.4845)*12.6271 + (-1.9544)*.5607}. As discussed
previously, the standard errors of the within effect model need to be adjusted. For instance, the
correct standard error of fuel price is computed as .364 = .3312*sqrt(87/72).
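
The two calculations above can be reproduced with a few lines of Python. This is only a check of the arithmetic using the numbers reported in the outputs of this section; the variable names are illustrative.

```python
import math

# Within-effect slope estimates and the year-7 time means from the listings above.
b = {"output": 0.8677268, "fuel": -0.4844836, "load": -1.954404}
year7 = {"cost": 13.15965, "output": -1.302416, "fuel": 12.62714, "load": 0.5607425}

# Intercept of year 7: time mean of cost minus the fitted part at the time means.
intercept_7 = year7["cost"] - sum(b[name] * year7[name] for name in b)

# Degrees of freedom: the transformed regression uses 90 - 3 = 87,
# whereas the LSDV model uses 90 - 15 - 3 = 72.
n, k, T = 90, 3, 15
df_within = n - k
df_lsdv = n - T - k

# Adjusted standard error of fuel.
se_fuel_within = 0.3312359
se_fuel_adjusted = se_fuel_within * math.sqrt(df_within / df_lsdv)

print(round(intercept_7, 3), round(se_fuel_adjusted, 4))
```

The intercept agrees with the T7 coefficient of LSDV2, and the adjusted standard error agrees with the one reported by LSDV.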

5.2.2 Using the TSCSREG and PANEL procedures

You need to sort the data set by the variables (i.e., year and airline) that appear in the ID
statement of the TSCSREG and PANEL procedures.

PROC SORT DATA=masil.airline;
BY year airline;

PROC PANEL DATA=masil.airline;
ID year airline;
MODEL cost = output fuel load /FIXONE;
RUN;

The PANEL Procedure
Fixed One Way Estimates

Dependent Variable: cost

Model Description

Estimation Method FixOne
Number of Cross Sections 15
Time Series Length 6


Fit Statistics

SSE 1.0882 DFE 72
MSE 0.0151 Root MSE 0.1229
R-Square 0.9905


F Test for No Fixed Effects

Num DF Den DF F Value Pr > F

14 72 1.17 0.3178


Parameter Estimates

Standard
Variable DF Estimate Error t Value Pr > |t| Label

CS1 1 -2.04096 0.7347 -2.78 0.0070 Cross Sectional
Effect 1
CS2 1 -1.95873 0.7228 -2.71 0.0084 Cross Sectional
Effect 2
CS3 1 -1.88103 0.7204 -2.61 0.0110 Cross Sectional
Effect 3
CS4 1 -1.79601 0.6988 -2.57 0.0122 Cross Sectional
Effect 4
CS5 1 -1.33693 0.5060 -2.64 0.0101 Cross Sectional
Effect 5
CS6 1 -1.12514 0.4086 -2.75 0.0075 Cross Sectional
Effect 6
CS7 1 -1.03341 0.3764 -2.75 0.0076 Cross Sectional
Effect 7
CS8 1 -0.88274 0.3260 -2.71 0.0085 Cross Sectional
Effect 8
CS9 1 -0.70719 0.2947 -2.40 0.0190 Cross Sectional
Effect 9
CS10 1 -0.42296 0.1668 -2.54 0.0134 Cross Sectional
Effect 10
CS11 1 -0.07144 0.0718 -1.00 0.3228 Cross Sectional
Effect 11
CS12 1 0.114571 0.0984 1.16 0.2482 Cross Sectional
Effect 12
CS13 1 0.079789 0.0844 0.95 0.3477 Cross Sectional
Effect 13
CS14 1 0.015463 0.0726 0.21 0.8320 Cross Sectional
Effect 14
Intercept 1 22.53677 4.9405 4.56 <.0001 Intercept
output 1 0.867727 0.0154 56.32 <.0001
fuel 1 -0.48448 0.3641 -1.33 0.1875
load 1 -1.9544 0.4424 -4.42 <.0001

The following TSCSREG procedure gives the same output.

PROC TSCSREG DATA=masil.airline;
ID year airline;
MODEL cost = output fuel load /FIXONE;
RUN;

5.2.3 Using STATA

The STATA .xtreg command uses the fe option for the fixed effect model.

. xtreg cost output fuel load, fe i(year)

Fixed-effects (within) regression               Number of obs      =        90
Group variable (i): year                        Number of groups   =        15

R-sq:  within  = 0.9858                         Obs per group: min =         6
       between = 0.4812                                        avg =       6.0
       overall = 0.5265                                        max =         6

                                                F(3,72)            =   1668.37
corr(u_i, Xb)  = -0.1503                        Prob > F           =    0.0000

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .8677268   .0154082    56.32   0.000     .8370111    .8984424
        fuel |  -.4844835   .3641085    -1.33   0.188    -1.210321    .2413535
        load |  -1.954404   .4423777    -4.42   0.000    -2.836268    -1.07254
       _cons |   21.66698   4.624053     4.69   0.000      12.4491    30.88486
-------------+----------------------------------------------------------------
     sigma_u |   .8027907
     sigma_e |  .12293801
         rho |  .97708602   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(14, 72) =     1.17              Prob > F = 0.3178

5.2.4 Using LIMDEP

You need to pay attention to the Str=; subcommand for stratification.

--> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=YEAR;Fixed$

+-----------------------------------------------------------------------+
| OLS Without Group Dummy Variables |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 4, Deg.Fr.= 86 |
| Residuals: Sum of squares= 1.335449522 , Std.Dev.= .12461 |
| Fit: R-squared= .988290, Adjusted R-squared = .98788 |
| Model test: F[ 3, 86] = 2419.33, Prob value = .00000 |
| Diagnostic: Log-L = 61.7699, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -4.122, Akaike Info. Crt.= -1.284 |
| Panel Data Analysis of COST [ONE way] |
| Unconditional ANOVA (No regressors) |
| Source Variation Deg. Free. Mean Square |
| Between 37.3068 14. 2.66477 |
| Residual 76.7341 75. 1.02312 |
| Total 114.041 89. 1.28136 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
OUTPUT .8827386341 .13254552E-01 66.599 .0000 -1.1743092
FUEL .4539777119 .20304240E-01 22.359 .0000 12.770359
LOAD -1.627507797 .34530293 -4.713 .0000 .56046016
Constant 9.516912231 .22924522 41.514 .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+-----------------------------------------------------------------------+
| Least Squares with Group Dummy Variables |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 18, Deg.Fr.= 72 |
| Residuals: Sum of squares= 1.088193393 , Std.Dev.= .12294 |
| Fit: R-squared= .990458, Adjusted R-squared = .98820 |
| Model test: F[ 17, 72] = 439.62, Prob value = .00000 |
| Diagnostic: Log-L = 70.9836, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -4.010, Akaike Info. Crt.= -1.177 |
| Estd. Autocorrelation of e(i,t) .573531 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
OUTPUT .8677268093 .15408179E-01 56.316 .0000 -1.1743092
FUEL -.4844946699 .36410984 -1.331 .1868 12.770359
LOAD -1.954414378 .44237791 -4.418 .0000 .56046016
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+------------------------------------------------------------------------+
| Test Statistics for the Classical Model |
| |
| Model Log-Likelihood Sum of Squares R-squared |
| (1) Constant term only -138.35814 .1140409821D+03 .0000000 |
| (2) Group effects only -120.52864 .7673414157D+02 .3271354 |
| (3) X - variables only 61.76991 .1335449522D+01 .9882897 |
| (4) X and group effects 70.98362 .1088193393D+01 .9904579 |
| |
| Hypothesis Tests |
| Likelihood Ratio Test F Tests |
| Chi-squared d.f. Prob. F num. denom. Prob value |
| (2) vs (1) 35.659 14 .00117 2.605 14 75 .00404 |
| (3) vs (1) 400.256 3 .00000 2419.329 3 86 .00000 |
| (4) vs (1) 418.684 17 .00000 439.617 17 72 .00000 |
| (4) vs (2) 383.025 3 .00000 1668.364 3 72 .00000 |
| (4) vs (3) 18.427 14 .18800 1.169 14 72 .31776 |
+------------------------------------------------------------------------+


5.3 Between Time Effect Model

The between effect model regresses the time means of the dependent variable on those of the
independent variables. See also 3.2 and 4.6.

. collapse (mean) tm_cost=cost (mean) tm_output=output (mean) tm_fuel=fuel ///
(mean) tm_load=load, by(year)

. regress tm_cost tm_output tm_fuel tm_load // between time effect

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) = 4074.33
       Model |  6.21220479     3  2.07073493           Prob > F      =  0.0000
    Residual |  .005590631    11  .000508239           R-squared     =  0.9991
-------------+------------------------------           Adj R-squared =  0.9989
       Total |  6.21779542    14  .444128244           Root MSE      =  .02254

------------------------------------------------------------------------------
     tm_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   tm_output |   1.133337   .0512898    22.10   0.000     1.020449    1.246225
     tm_fuel |   .3342486   .0228284    14.64   0.000     .2840035    .3844937
     tm_load |  -1.350727   .2478264    -5.45   0.000    -1.896189   -.8052644
       _cons |   11.18505   .3660016    30.56   0.000     10.37949    11.99062
------------------------------------------------------------------------------
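
Since the between effect model is simply OLS on the 15 time means, it can be sketched in plain Python from the time means listed in 5.2.1. The helper `ols` below is a hypothetical function written for this illustration (it solves the normal equations by Gauss-Jordan elimination); because the listed means are rounded, the estimates should agree with the output above only up to rounding.

```python
# (cost, output, fuel, load) time means for years 1-15, copied from the listing in 5.2.1.
tm = [
    (12.36897, -1.790283, 11.63606, 0.4788587),
    (12.45963, -1.744389, 11.66868, 0.4868322),
    (12.60706, -1.577767, 11.67494, 0.52358),
    (12.77912, -1.443695, 11.73193, 0.5244486),
    (12.94143, -1.398122, 12.26843, 0.5635266),
    (13.0452, -1.393002, 12.53826, 0.5541809),
    (13.15965, -1.302416, 12.62714, 0.5607425),
    (13.29884, -1.222963, 12.76768, 0.5670587),
    (13.4651, -1.067003, 12.86104, 0.6179098),
    (13.70187, -0.9023156, 13.23183, 0.6233943),
    (13.91324, -0.9205539, 13.66246, 0.5802577),
    (14.05984, -0.8641667, 13.82315, 0.5856243),
    (14.12841, -0.7923916, 13.75979, 0.5803183),
    (14.23517, -0.6428015, 13.67403, 0.5804528),
    (14.32062, -0.5527684, 13.62997, 0.5797168),
]

def ols(X, y):
    """Solve the normal equations (X'X)b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        # Partial pivoting: bring the largest remaining entry of column c to the diagonal.
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [u - factor * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

X = [(1.0, out, fuel, load) for _, out, fuel, load in tm]   # 1.0 is the intercept column
y = [cost for cost, _, _, _ in tm]
b_cons, b_output, b_fuel, b_load = ols(X, y)
print(b_cons, b_output, b_fuel, b_load)
```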

The SAS PANEL procedure has the /BTWNT option to estimate the between effect model.

PROC PANEL DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /BTWNT;
RUN;

The PANEL Procedure
Between Time Periods Estimates

Dependent Variable: cost

Model Description

Estimation Method BtwTime
Number of Cross Sections 6
Time Series Length 15


Fit Statistics

SSE 0.0056 DFE 11
MSE 0.0005 Root MSE 0.0225
R-Square 0.9991

Parameter Estimates

Standard
Variable DF Estimate Error t Value Pr > |t| Label

Intercept 1 11.18504 0.3660 30.56 <.0001 Intercept
output 1 1.133335 0.0513 22.10 <.0001
fuel 1 0.334249 0.0228 14.64 <.0001
load 1 -1.35073 0.2478 -5.45 0.0002

You may use the be option in the STATA .xtreg command and the Means; subcommand in
LIMDEP (outputs are skipped).

. xtreg cost output fuel load, be i(year) // between time effect model

--> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=YEAR;Means$


5.4 Testing Fixed Time Effects

The null hypothesis is that all time dummy parameters except one are zero:
$H_0: \tau_1 = \dots = \tau_{T-1} = 0$. The F statistic is
$\frac{(1.3354 - 1.0882)/(15 - 1)}{(1.0882)/(6 \times 15 - 15 - 3)} \sim 1.1683\,[14, 72]$.
The p-value of .3180 does not reject the null hypothesis.
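
The F statistic can be checked by hand from the two residual sums of squares. The following Python sketch only repeats the arithmetic of the formula above, using the rounded SSE values reported for the pooled OLS (1.3354) and for the fixed time effect model (1.0882); the variable names are illustrative.

```python
sse_pooled = 1.3354   # pooled OLS without time dummies (restricted model)
sse_lsdv = 1.0882     # fixed time effect model (unrestricted model)

n_groups, n_years, k = 6, 15, 3
num_df = n_years - 1                          # 14 restrictions
den_df = n_groups * n_years - n_years - k     # 72 residual degrees of freedom

f_stat = ((sse_pooled - sse_lsdv) / num_df) / (sse_lsdv / den_df)
print(num_df, den_df, round(f_stat, 4))
```

A statistic this close to 1 is small relative to the F distribution with 14 and 72 degrees of freedom, which is consistent with the large p-value reported above.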

The SAS TSCSREG and PANEL procedures and the STATA .xtreg command conduct the
Wald test. You may get the same test using the TEST statement in LSDV1 and the
STATA .test command (the output is skipped).

PROC REG DATA=masil.airline;
MODEL cost = t1-t14 output fuel load;
TEST t1=t2=t3=t4=t5=t6=t7=t8=t9=t10=t11=t12=t13=t14=0;
RUN;

. quietly regress cost t1-t14 output fuel load
. test t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14

6. The Fixed Group and Time Effect Model

The two-way fixed model considers both group and time effects. This model thus needs two
sets of dummy variables, one for groups and one for time. LSDV2 and the between effect model
are not valid in this model.


6.1 Least Squares Dummy Variable Models

There are four approaches to avoid the perfect multicollinearity or the dummy variable trap.
You may not suppress the intercept under any circumstances.
- Drop one cross-section and one time-series dummy variable.
- Drop one cross-section dummy and impose a restriction on the time-series dummies of
  $\sum \tau_t = 0$.
- Drop one time-series dummy and impose a restriction on the cross-section dummies of
  $\sum \mu_g = 0$.
- Include all dummy variables and impose two restrictions on the cross-section and
  time-series dummies of $\sum \mu_g = 0$ and $\sum \tau_t = 0$.
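
The reason two dummies must be dropped (or two restrictions imposed) can be seen in a small Python sketch. The panel below is hypothetical (three groups, two years); it shows that the group dummies and the time dummies each add up to the intercept column, so the full set contains two exact linear dependencies.

```python
groups = ["A", "B", "C"]   # hypothetical cross-sections
years = [1, 2]             # hypothetical time periods
panel = [(g, t) for g in groups for t in years]

# One dummy column per group and one per year.
group_dummies = [[int(g == j) for j in groups] for g, _ in panel]
time_dummies = [[int(t == s) for s in years] for _, t in panel]
ones = [1] * len(panel)    # the intercept column

# Row by row, each full set of dummies sums to the intercept column.
group_sum = [sum(row) for row in group_dummies]
time_sum = [sum(row) for row in time_dummies]
print(group_sum == ones, time_sum == ones)

# Dummy parameters that remain once both dependencies are removed.
free_dummies = (len(groups) - 1) + (len(years) - 1)
airline_free_dummies = (6 - 1) + (15 - 1)   # 6 airlines and 15 years in this data set
print(free_dummies, airline_free_dummies)
```

For the airline data this leaves 5 + 14 = 19 dummy parameters, whichever of the four approaches is used.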



6.2 LSDV1 without Two Dummies

Let us first run LSDV1 using STATA.

. regress cost g1-g5 t1-t14 output fuel load

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F( 22,    67) = 1960.82
       Model |  113.864044    22  5.17563838           Prob > F      =  0.0000
    Residual |  .176848775    67  .002639534           R-squared     =  0.9984
-------------+------------------------------           Adj R-squared =  0.9979
       Total |  114.040893    89  1.28135835           Root MSE      =  .05138

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          g1 |   .1742825   .0861201     2.02   0.047     .0023861     .346179
          g2 |   .1114508   .0779551     1.43   0.157    -.0441482    .2670499
          g3 |   -.143511   .0518934    -2.77   0.007    -.2470907   -.0399313
          g4 |   .1802087   .0321443     5.61   0.000     .1160484    .2443691
          g5 |  -.0466942   .0224688    -2.08   0.042    -.0915422   -.0018463
          t1 |  -.6931382   .3378385    -2.05   0.044    -1.367467   -.0188098
          t2 |  -.6384366   .3320802    -1.92   0.059    -1.301271    .0243983
          t3 |  -.5958031   .3294473    -1.81   0.075    -1.253383    .0617764
          t4 |  -.5421537   .3189139    -1.70   0.094    -1.178708    .0944011
          t5 |  -.4730429   .2319459    -2.04   0.045    -.9360088   -.0100769
          t6 |  -.4272042     .18844    -2.27   0.027    -.8033319   -.0510764
          t7 |  -.3959783   .1732969    -2.28   0.025    -.7418804   -.0500762
          t8 |  -.3398463   .1501062    -2.26   0.027    -.6394596    -.040233
          t9 |  -.2718933   .1348175    -2.02   0.048    -.5409901   -.0027964
         t10 |  -.2273857   .0763495    -2.98   0.004      -.37978   -.0749914
         t11 |  -.1118032   .0319005    -3.50   0.001     -.175477   -.0481295
         t12 |   -.033641   .0429008    -0.78   0.436    -.1192713    .0519893
         t13 |  -.0177346   .0362554    -0.49   0.626    -.0901007    .0546315
         t14 |  -.0186451    .030508    -0.61   0.543    -.0795393     .042249
      output |   .8172487    .031851    25.66   0.000     .7536739    .8808235
        fuel |     .16861    .163478     1.03   0.306    -.1576935    .4949135
        load |  -.8828142   .2617373    -3.37   0.001    -1.405244   -.3603843
       _cons |   12.94004   2.218231     5.83   0.000     8.512434    17.36765
------------------------------------------------------------------------------

The following is the corresponding SAS REG procedure (outputs are skipped).

PROC REG DATA=masil.airline;
MODEL cost = g1-g5 t1-t14 output fuel load;
RUN;

The LIMDEP example is skipped here, since many dummy variables need to be listed in the
Regress$ command.


6.3 LSDV1 + LSDV3: Dropping a Dummy and Imposing a Restriction

In the second approach, you may drop either one group dummy or one time dummy. The
following drops one time dummy, includes all group dummies, and imposes a restriction on
group dummies.

PROC REG DATA=masil.airline;
MODEL cost = g1-g6 t1-t14 output fuel load;
RESTRICT g1 + g2 + g3 + g4 + g5 + g6 = 0;
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: cost

NOTE: Restrictions have been applied to parameter estimates.


Number of Observations Read 90
Number of Observations Used 90


Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 22 113.86404 5.17564 1960.82 <.0001
Error 67 0.17685 0.00264
Corrected Total 89 114.04089


Root MSE 0.05138 R-Square 0.9984
Dependent Mean 13.36561 Adj R-Sq 0.9979
Coeff Var 0.38439


Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 12.98600 2.22540 5.84 <.0001
g1 1 0.12833 0.04601 2.79 0.0069
g2 1 0.06549 0.03897 1.68 0.0975
g3 1 -0.18947 0.01561 -12.14 <.0001
g4 1 0.13425 0.01832 7.33 <.0001
g5 1 -0.09265 0.03731 -2.48 0.0155
g6 1 -0.04596 0.04161 -1.10 0.2733
t1 1 -0.69314 0.33784 -2.05 0.0441
t2 1 -0.63844 0.33208 -1.92 0.0588
t3 1 -0.59580 0.32945 -1.81 0.0750
t4 1 -0.54215 0.31891 -1.70 0.0938
t5 1 -0.47304 0.23195 -2.04 0.0454
t6 1 -0.42720 0.18844 -2.27 0.0266
t7 1 -0.39598 0.17330 -2.28 0.0255
t8 1 -0.33985 0.15011 -2.26 0.0268
t9 1 -0.27189 0.13482 -2.02 0.0477
t10 1 -0.22739 0.07635 -2.98 0.0040
t11 1 -0.11180 0.03190 -3.50 0.0008
t12 1 -0.03364 0.04290 -0.78 0.4357
t13 1 -0.01773 0.03626 -0.49 0.6263
t14 1 -0.01865 0.03051 -0.61 0.5432
output 1 0.81725 0.03185 25.66 <.0001
fuel 1 0.16861 0.16348 1.03 0.3061
load 1 -0.88281 0.26174 -3.37 0.0012
RESTRICT -1 -1.9387E-16 . . .

* Probability computed using beta distribution.

Alternatively, you may run the STATA .cnsreg command with the second constraint (output
is skipped).

. cnsreg cost g1-g6 t1-t14 output fuel load, constraint(2)

The following drops one group dummy and imposes a restriction on time dummies.

. cnsreg cost g1-g5 t1-t15 output fuel load, constraint(3)

Constrained linear regression                          Number of obs =      90
                                                       F( 22,    67) = 1960.82
                                                       Prob > F      =  0.0000
                                                       Root MSE      =  .05138
 ( 1)  t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9 + t10 + t11 + t12 + t13 + t14 + t15 = 0
------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          g1 |   .1742825   .0861201     2.02   0.047     .0023861     .346179
          g2 |   .1114508   .0779551     1.43   0.157    -.0441482    .2670499
          g3 |   -.143511   .0518934    -2.77   0.007    -.2470907   -.0399313
          g4 |   .1802087   .0321443     5.61   0.000     .1160484    .2443691
          g5 |  -.0466942   .0224688    -2.08   0.042    -.0915422   -.0018463
          t1 |  -.3740245    .191872    -1.95   0.055    -.7570026    .0089536
          t2 |  -.3193228   .1860877    -1.72   0.091    -.6907554    .0521097
          t3 |  -.2766893   .1833501    -1.51   0.136    -.6426576    .0892789
          t4 |  -.2230399   .1729671    -1.29   0.202    -.5682837    .1222038
          t5 |  -.1539291   .0864404    -1.78   0.079    -.3264649    .0186066
          t6 |  -.1080904   .0448591    -2.41   0.019    -.1976296   -.0185513
          t7 |  -.0768646   .0319336    -2.41   0.019    -.1406043   -.0131248
          t8 |  -.0207326   .0204506    -1.01   0.314     -.061552    .0200869
          t9 |   .0472205   .0290822     1.62   0.109    -.0108278    .1052688
         t10 |   .0917281   .0811525     1.13   0.262    -.0702531    .2537092
         t11 |   .2073105   .1491443     1.39   0.169    -.0903829    .5050039
         t12 |   .2854727   .1756365     1.63   0.109    -.0650993    .6360447
         t13 |   .3013791   .1660294     1.82   0.074     -.030017    .6327752
         t14 |   .3004686   .1536212     1.96   0.055    -.0061606    .6070978
         t15 |   .3191137   .1474883     2.16   0.034     .0247259    .6135015
      output |   .8172487    .031851    25.66   0.000     .7536739    .8808235
        fuel |     .16861    .163478     1.03   0.306    -.1576935    .4949135
        load |  -.8828142   .2617373    -3.37   0.001    -1.405244   -.3603843
       _cons |   12.62093   2.074302     6.08   0.000     8.480603    16.76125
------------------------------------------------------------------------------

You may run the following SAS REG procedure to get the same result (output is skipped).

PROC REG DATA=masil.airline; /* LSDV3 */
MODEL cost = g1-g5 t1-t15 output fuel load;
RESTRICT t1+t2+t3+t4+t5+t6+t7+t8+t9+t10+t11+t12+t13+t14+t15=0;
RUN;


6.4 LSDV3 with Two Restrictions

The last approach includes all group and time dummies and imposes two restrictions, one on
the group dummies and one on the time dummies.

. cnsreg cost g1-g6 t1-t15 output fuel load, constraint(2 3)

Constrained linear regression                          Number of obs =      90
                                                       F( 22,    67) = 1960.82
                                                       Prob > F      =  0.0000
                                                       Root MSE      =  .05138
 ( 1)  g1 + g2 + g3 + g4 + g5 + g6 = 0
 ( 2)  t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9 + t10 + t11 + t12 + t13 + t14 + t15 = 0
------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          g1 |   .1283264   .0460126     2.79   0.007     .0364849    .2201679
          g2 |   .0654947   .0389685     1.68   0.097    -.0122867    .1432761
          g3 |  -.1894671   .0156096   -12.14   0.000     -.220624   -.1583102
          g4 |   .1342526   .0183163     7.33   0.000      .097693    .1708121
          g5 |  -.0926504   .0373085    -2.48   0.016    -.1671184   -.0181824
          g6 |  -.0459561   .0416069    -1.10   0.273    -.1290038    .0370916
          t1 |  -.3740245    .191872    -1.95   0.055    -.7570026    .0089536
          t2 |  -.3193228   .1860877    -1.72   0.091    -.6907554    .0521097
          t3 |  -.2766893   .1833501    -1.51   0.136    -.6426576    .0892789
          t4 |  -.2230399   .1729671    -1.29   0.202    -.5682837    .1222038
          t5 |  -.1539291   .0864404    -1.78   0.079    -.3264649    .0186066
          t6 |  -.1080904   .0448591    -2.41   0.019    -.1976296   -.0185513
          t7 |  -.0768646   .0319336    -2.41   0.019    -.1406043   -.0131248
          t8 |  -.0207326   .0204506    -1.01   0.314     -.061552    .0200869
          t9 |   .0472205   .0290822     1.62   0.109    -.0108278    .1052688
         t10 |   .0917281   .0811525     1.13   0.262    -.0702531    .2537092
         t11 |   .2073105   .1491443     1.39   0.169    -.0903829    .5050039
         t12 |   .2854727   .1756365     1.63   0.109    -.0650993    .6360447
         t13 |   .3013791   .1660294     1.82   0.074     -.030017    .6327752
         t14 |   .3004686   .1536212     1.96   0.055    -.0061606    .6070978
         t15 |   .3191137   .1474883     2.16   0.034     .0247259    .6135015
      output |   .8172487    .031851    25.66   0.000     .7536739    .8808235
        fuel |     .16861    .163478     1.03   0.306    -.1576935    .4949135
        load |  -.8828142   .2617373    -3.37   0.001    -1.405244   -.3603843
       _cons |   12.66688   2.081068     6.09   0.000     8.513054    16.82071
------------------------------------------------------------------------------

The following SAS REG procedure gives you the same result (output is skipped).

PROC REG DATA=masil.airline;
MODEL cost = g1-g6 t1-t15 output fuel load;
RESTRICT g1 + g2 + g3 + g4 + g5 + g6 = 0;
RESTRICT t1+t2+t3+t4+t5+t6+t7+t8+t9+t10+t11+t12+t13+t14+t15=0;
RUN;


6.5 Two-way Within Effect Model

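The two-way within group and time effect model requires a transformation of the data set as
$y_{it}^* = y_{it} - \bar{y}_{i\bullet} - \bar{y}_{\bullet t} + \bar{y}_{\bullet\bullet}$ and
$x_{it}^* = x_{it} - \bar{x}_{i\bullet} - \bar{x}_{\bullet t} + \bar{x}_{\bullet\bullet}$.
The following commands do this task.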

. gen w_cost = cost - gm_cost - tm_cost + m_cost
. gen w_output = output - gm_output - tm_output + m_output
. gen w_fuel = fuel - gm_fuel - tm_fuel + m_fuel
. gen w_load = load - gm_load - tm_load + m_load

. tabstat cost output fuel load, stat(mean)

   stats |      cost    output      fuel      load
---------+----------------------------------------
    mean |  13.36561 -1.174309  12.77036  .5604602
--------------------------------------------------

Now, run the OLS with the transformed variables. Do not forget to suppress the intercept.

. regress w_cost w_output w_fuel w_load, noc // within effect

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  3,    87) =  307.86
       Model |  1.87739643     3  .625798811           Prob > F      =  0.0000
    Residual |  .176848774    87  .002032745           R-squared     =  0.9139
-------------+------------------------------           Adj R-squared =  0.9109
       Total |  2.05424521    90  .022824947           Root MSE      =  .04509

------------------------------------------------------------------------------
      w_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    w_output |   .8172487   .0279512    29.24   0.000     .7616927    .8728048
      w_fuel |     .16861   .1434621     1.18   0.243    -.1165364    .4537565
      w_load |  -.8828142   .2296907    -3.84   0.000    -1.339349    -.426279
------------------------------------------------------------------------------

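Note again that $R^2$, MSE, standard errors, and $DF_{error}$ are not correct. The dummy variable
coefficients are computed as
$d_g^{*} = (\bar{y}_{g\bullet} - \bar{y}_{\bullet\bullet}) - b'(\bar{x}_{g\bullet} - \bar{x}_{\bullet\bullet})$ and
$d_t^{*} = (\bar{y}_{\bullet t} - \bar{y}_{\bullet\bullet}) - b'(\bar{x}_{\bullet t} - \bar{x}_{\bullet\bullet})$.
The standard errors also need to be adjusted; for instance, the standard error of the load factor
is .2617=.2297*sqrt(87/67).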
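The same adjustment applies to every coefficient. The Python sketch below is a minimal illustration, assuming that the 87 degrees of freedom are 6*15-3 (the within regression without an intercept) and that the 67 are 6*15-6-15-3+1 (the DFE that the two-way LSDV leaves); the function name adjust_se is hypothetical.

```python
import math

n, T, k = 6, 15, 3                    # groups, periods, regressors
df_reported = n * T - k               # 87, used by OLS on the transformed variables
df_correct = n * T - n - T - k + 1    # 67, after counting the group and time dummies

def adjust_se(se_within: float) -> float:
    """Rescale a standard error from the within regression to the correct DF."""
    return se_within * math.sqrt(df_reported / df_correct)

se_output = adjust_se(0.0279512)
se_fuel = adjust_se(0.1434621)
se_load = adjust_se(0.2296907)
print(round(se_output, 6), round(se_fuel, 6), round(se_load, 6))
```

The adjusted values agree with the standard errors of output, fuel, and load reported by the two-way LSDV (.031851, .163478, and .2617373).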


6.6 Using the TSCSREG and PANEL Procedures

The SAS TSCSREG and PANEL procedures have the /FIXTWO option to fit the two-way
fixed effect model.

PROC TSCSREG DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /FIXTWO;
RUN;


The TSCSREG Procedure

Dependent Variable: cost

Model Description

Estimation Method FixTwo
Number of Cross Sections 6
Time Series Length 15


Fit Statistics

SSE 0.1768 DFE 67
MSE 0.0026 Root MSE 0.0514
R-Square 0.9984


F Test for No Fixed Effects

Num DF Den DF F Value Pr > F

19 67 23.10 <.0001


Parameter Estimates

Standard
Variable DF Estimate Error t Value Pr > |t| Label

CS1 1 0.174283 0.0861 2.02 0.0470 Cross Sectional
Effect 1
CS2 1 0.111451 0.0780 1.43 0.1575 Cross Sectional
Effect 2
CS3 1 -0.14351 0.0519 -2.77 0.0073 Cross Sectional
Effect 3
CS4 1 0.180209 0.0321 5.61 <.0001 Cross Sectional
Effect 4
CS5 1 -0.04669 0.0225 -2.08 0.0415 Cross Sectional
Effect 5
TS1 1 -0.69314 0.3378 -2.05 0.0441 Time Series
Effect 1
TS2 1 -0.63844 0.3321 -1.92 0.0588 Time Series
Effect 2
TS3 1 -0.5958 0.3294 -1.81 0.0750 Time Series
Effect 3
TS4 1 -0.54215 0.3189 -1.70 0.0938 Time Series
Effect 4
TS5 1 -0.47304 0.2319 -2.04 0.0454 Time Series
Effect 5
TS6 1 -0.4272 0.1884 -2.27 0.0266 Time Series
Effect 6
TS7 1 -0.39598 0.1733 -2.28 0.0255 Time Series
Effect 7
TS8 1 -0.33985 0.1501 -2.26 0.0268 Time Series
Effect 8
TS9 1 -0.27189 0.1348 -2.02 0.0477 Time Series
Effect 9
TS10 1 -0.22739 0.0763 -2.98 0.0040 Time Series
Effect 10
TS11 1 -0.1118 0.0319 -3.50 0.0008 Time Series
Effect 11
TS12 1 -0.03364 0.0429 -0.78 0.4357 Time Series
Effect 12
TS13 1 -0.01773 0.0363 -0.49 0.6263 Time Series
Effect 13
TS14 1 -0.01865 0.0305 -0.61 0.5432 Time Series
Effect 14
Intercept 1 12.94004 2.2182 5.83 <.0001 Intercept
output 1 0.817249 0.0319 25.66 <.0001
fuel 1 0.16861 0.1635 1.03 0.3061
load 1 -0.88281 0.2617 -3.37 0.0012

The STATA .xtreg command does not fit the two-way fixed or random effect model. The
following LIMDEP command fits the two-way fixed model. Note that this command has Str$
and Period$ specifications to specify stratification and time variables. This command presents
the pooled model and one-way group effect model as well, but reports the incorrect intercept in
the two-way fixed model, 12.667 (2.081).

REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=AIRLINE;Period=YEAR;Fixed$


6.7 Testing Fixed Group and Time Effects

The null hypothesis is that parameters of group and time dummies are zero:
$H_0: \mu_1 = \dots = \mu_{n-1} = 0$ and $\tau_1 = \dots = \tau_{T-1} = 0$, where $\mu_i$ and
$\tau_t$ denote the group and time dummy parameters. The F test compares the pooled regression and
two-way group and time effect model. The F statistic of 23.1085 rejects the null hypothesis at
the .01 significance level (p<.0001).

$\frac{(1.3354 - .1768)/(6 + 15 - 2)}{(.1768)/(6*15 - 6 - 15 - 3 + 1)} \sim 23.1085\,[19, 67]$
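The F statistic is easy to reproduce from the two SSEs. The following Python sketch is only a check of the arithmetic, using the rounded SSEs of the pooled model (1.3354) and the two-way fixed effect model (.1768) quoted in this section; the variable names are hypothetical.

```python
n, T, k = 6, 15, 3            # groups, periods, regressors
sse_pooled = 1.3354           # restricted model: pooled OLS
sse_two_way = 0.1768          # unrestricted model: group and time dummies

df_num = n + T - 2                 # 19 restrictions: (n-1) group and (T-1) time dummies
df_den = n * T - n - T - k + 1     # 67 error degrees of freedom in the two-way model

f_stat = ((sse_pooled - sse_two_way) / df_num) / (sse_two_way / df_den)
print(df_num, df_den, round(f_stat, 4))
```

The result is about 23.11 with 19 and 67 degrees of freedom, the same F value that the TSCSREG procedure reports for the test of no fixed effects.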
+


The SAS TSCSREG and PANEL procedures conduct the F-test for the group and time effects.
You may also run the following SAS REG procedure and .regress command to perform the
same test.

PROC REG DATA=masil.airline;
MODEL cost = g1-g5 t1-t14 output fuel load;
TEST g1=g2=g3=g4=g5=t1=t2=t3=t4=t5=t6=t7=t8=t9=t10=t11=t12=t13=t14=0;
RUN;

. quietly regress cost g1-g5 t1-t14 output fuel load
. test g1 g2 g3 g4 g5 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14

7. Random Effect Models

The random effects model examines how group and/or time affect error variances. This model
is appropriate for n individuals who were drawn randomly from a large population. This
chapter focuses on the feasible generalized least squares (FGLS) with variance component
estimation methods from Baltagi and Chang (1994), Fuller and Battese (1974), and Wansbeek
and Kapteyn (1989). [8]



7.1 The One-way Random Group Effect Model

When the omega matrix is not known, you have to estimate $\theta$ using the SSEs of the between
effect model (.0317) and the fixed effect model (.2926).

The variance component of error $\hat{\sigma}_{\varepsilon}^{2}$ is .00361263 = .292622872/(6*15-6-3)
The variance component of group $\hat{\sigma}_{u}^{2}$ is .01559712 = .031675926/(6-4) - .00361263/15

Thus, $\hat{\theta}$ is .87668488 = $1 - \sqrt{\frac{.00361263}{15 * .031675926/(6-4)}}$
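The three quantities can be recomputed in a few lines. The Python sketch below simply repeats the arithmetic above with the two SSEs as inputs; the variable names are hypothetical.

```python
import math

n, T, k = 6, 15, 3
sse_within = 0.292622872     # SSE of the fixed group effect model
sse_between = 0.031675926    # SSE that is divided by (6-4) above

sigma2_e = sse_within / (n * T - n - k)     # variance component of error
between_ms = sse_between / (n - k - 1)      # .031675926/(6-4)
sigma2_u = between_ms - sigma2_e / T        # variance component of group
theta = 1 - math.sqrt(sigma2_e / (T * between_ms))

print(round(sigma2_e, 8), round(sigma2_u, 8), round(theta, 8))
```

The theta obtained this way is the weight placed on the group means when the variables are transformed.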

Now, transform the dependent and independent variables including the intercept.

. gen rg_cost = cost - .87668488*gm_cost // transform variables
. gen rg_output = output - .87668488*gm_output
. gen rg_fuel = fuel - .87668488*gm_fuel
. gen rg_load = load - .87668488*gm_load
. gen rg_int = 1 - .87668488 // for the intercept

Finally, run the OLS with the transformed variables. Do not forget to suppress the intercept.
This is the groupwise heteroscedastic regression model (Greene 2003).

. regress rg_cost rg_int rg_output rg_fuel rg_load, noc

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  4,    86) =19642.72
       Model |  284.670313     4  71.1675783           Prob > F      =  0.0000
    Residual |  .311586777    86  .003623102           R-squared     =  0.9989
-------------+------------------------------           Adj R-squared =  0.9989
       Total |    284.9819    90  3.16646556           Root MSE      =  .06019

------------------------------------------------------------------------------
     rg_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      rg_int |   9.627911   .2101638    45.81   0.000     9.210119     10.0457
   rg_output |   .9066808   .0256249    35.38   0.000     .8557401    .9576215
     rg_fuel |   .4227784   .0140248    30.15   0.000      .394898    .4506587
     rg_load |    -1.0645   .2000703    -5.32   0.000    -1.462226   -.6667731
------------------------------------------------------------------------------

[8] Baltagi and Chang (1994) introduce various ANOVA estimation methods, such as a modified Wallace and
Hussain method, the Wansbeek and Kapteyn method, the Swamy and Arora method, and Henderson's method III.
They also discuss maximum likelihood (ML) estimators, restricted ML estimators, minimum norm quadratic
unbiased estimators (MINQUE), and minimum variance quadratic unbiased estimators (MIVQUE). Based on a
Monte Carlo simulation, they argue that ANOVA estimators are Best Quadratic Unbiased estimators of the
variance components for the balanced model, whereas ML, restricted ML, MINQUE, and MIVQUE are
recommended for the unbalanced models.


7.2 Estimations in SAS, STATA, and LIMDEP

The SAS TSCSREG and PANEL procedures have the /RANONE option to fit the one-way
random effect model. These procedures by default use the Fuller and Battese (1974) estimation
method, which produces slightly different estimates from FGLS.

PROC TSCSREG DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /RANONE;
RUN;

The TSCSREG Procedure

Dependent Variable: cost

Model Description

Estimation Method RanOne
Number of Cross Sections 6
Time Series Length 15


Fit Statistics

SSE 0.3090 DFE 86
MSE 0.0036 Root MSE 0.0599
R-Square 0.9923


Variance Component Estimates

Variance Component for Cross Sections 0.018198
Variance Component for Error 0.003613


Hausman Test for
Random Effects

DF m Value Pr > m

3 0.92 0.8209


Parameter Estimates

Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 9.637 0.2132 45.21 <.0001
output 1 0.908024 0.0260 34.91 <.0001
fuel 1 0.422199 0.0141 29.95 <.0001
load 1 -1.06469 0.1995 -5.34 <.0001

The PANEL procedure has the /VCOMP=WK option for the Wansbeek and Kapteyn (1989)
method, which is close to groupwise heteroscedastic regression. The BP option of the MODEL
statement, not available in the TSCSREG procedure, conducts the Breusch-Pagan LM test for
random effects. Note that the two procedures estimate the same variance component for error
(.0036) but a different variance component for groups (.0182 versus .0160).

PROC PANEL DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /RANONE BP VCOMP=WK;
RUN;

The PANEL Procedure
Wansbeek and Kapteyn Variance Components (RanOne)

Dependent Variable: cost

Model Description

Estimation Method RanOne
Number of Cross Sections 6
Time Series Length 15


Fit Statistics

SSE 0.3111 DFE 86
MSE 0.0036 Root MSE 0.0601
R-Square 0.9923


Variance Component Estimates

Variance Component for Cross Sections 0.016015
Variance Component for Error 0.003613


Hausman Test for
Random Effects

DF m Value Pr > m

2 1.63 0.4429


Breusch Pagan Test for Random
Effects (One Way)

DF m Value Pr > m

1 334.85 <.0001


Parameter Estimates

Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 9.629513 0.2107 45.71 <.0001
output 1 0.906918 0.0257 35.30 <.0001
fuel 1 0.422676 0.0140 30.11 <.0001
load 1 -1.06452 0.2000 -5.32 <.0001

The STATA .xtreg command has the re option to produce FGLS estimates. The .iis
command specifies the panel identification variable, such as a grouping or cross-section
variable that is used in the i() option.

. iis airline

. xtreg cost output fuel load, re i(airline) theta

Random-effects GLS regression                   Number of obs      =        90
Group variable (i): airline                     Number of groups   =         6

R-sq:  within  = 0.9925                         Obs per group: min =        15
       between = 0.9856                                        avg =      15.0
       overall = 0.9876                                        max =        15

Random effects u_i ~ Gaussian                   Wald chi2(3)       =  11091.33
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000
theta              = .87668503

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .9066805    .025625    35.38   0.000     .8564565    .9569045
        fuel |   .4227784   .0140248    30.15   0.000     .3952904    .4502665
        load |  -1.064499   .2000703    -5.32   0.000    -1.456629    -.672368
       _cons |   9.627909    .210164    45.81   0.000     9.215995    10.03982
-------------+----------------------------------------------------------------
     sigma_u |  .12488859
     sigma_e |  .06010514
         rho |  .81193816   (fraction of variance due to u_i)
------------------------------------------------------------------------------

The theta option reports the estimated theta (.8767). The sigma_u and sigma_e are square
roots of the variance components for groups and errors (.0156=.1249^2 and .0036=.0601^2).
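The reported rho and theta follow directly from sigma_u and sigma_e. The Python sketch below is only a check, assuming the usual one-way random effect definitions rho = sigma_u^2/(sigma_u^2 + sigma_e^2) and theta = 1 - sqrt(sigma_e^2/(T*sigma_u^2 + sigma_e^2)); the variable names are hypothetical.

```python
import math

T = 15                    # observations per airline
sigma_u = 0.12488859      # reported by .xtreg
sigma_e = 0.06010514      # reported by .xtreg

var_u = sigma_u ** 2      # variance component for groups
var_e = sigma_e ** 2      # variance component for errors

rho = var_u / (var_u + var_e)                        # fraction of variance due to u_i
theta = 1 - math.sqrt(var_e / (T * var_u + var_e))   # weight on the group means

print(round(var_u, 4), round(var_e, 4), round(rho, 6), round(theta, 6))
```

Both values agree with the .xtreg output (rho = .81193816 and theta = .87668503) up to rounding.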

In LIMDEP, you have to specify Panel$ and Het$ subcommands for the groupwise
heteroscedastic model. Note that LIMDEP presents the pooled OLS regression and least square
dummy variable model as well.

--> REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Str=AIRLINE;Het=AIRLINE$

+-----------------------------------------------------------------------+
| OLS Without Group Dummy Variables |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 4, Deg.Fr.= 86 |
| Residuals: Sum of squares= 1.335449522 , Std.Dev.= .12461 |
| Fit: R-squared= .988290, Adjusted R-squared = .98788 |
| Model test: F[ 3, 86] = 2419.33, Prob value = .00000 |
| Diagnostic: Log-L = 61.7699, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -4.122, Akaike Info. Crt.= -1.284 |
| Panel Data Analysis of COST [ONE way] |
| Unconditional ANOVA (No regressors) |
| Source Variation Deg. Free. Mean Square |
| Between 74.6799 5. 14.9360 |
| Residual 39.3611 84. .468584 |
| Total 114.041 89. 1.28136 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
OUTPUT .8827386341 .13254552E-01 66.599 .0000 -1.1743092
FUEL .4539777119 .20304240E-01 22.359 .0000 12.770359
LOAD -1.627507797 .34530293 -4.713 .0000 .56046016
Constant 9.516912231 .22924522 41.514 .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+-----------------------------------------------------------------------+
| Least Squares with Group Dummy Variables |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = COST Mean= 13.36560933 , S.D.= 1.131971444 |
| Model size: Observations = 90, Parameters = 9, Deg.Fr.= 81 |
| Residuals: Sum of squares= .2926207777 , Std.Dev.= .06010 |
| Fit: R-squared= .997434, Adjusted R-squared = .99718 |
| Model test: F[ 8, 81] = 3935.82, Prob value = .00000 |
| Diagnostic: Log-L = 130.0865, Restricted(b=0) Log-L = -138.3581 |
| LogAmemiyaPrCrt.= -5.528, Akaike Info. Crt.= -2.691 |
| Estd. Autocorrelation of e(i,t) .573531 |
| White/Hetero. corrected covariance matrix used. |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
OUTPUT .9192881432 .19105357E-01 48.117 .0000 -1.1743092
FUEL .4174910457 .13532534E-01 30.851 .0000 12.770359
LOAD -1.070395015 .21662097 -4.941 .0000 .56046016
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+------------------------------------------------------------------------+
| Test Statistics for the Classical Model |
| |
| Model Log-Likelihood Sum of Squares R-squared |
| (1) Constant term only -138.35814 .1140409821D+03 .0000000 |
| (2) Group effects only -90.48804 .3936109461D+02 .6548513 |
| (3) X - variables only 61.76991 .1335449522D+01 .9882897 |
| (4) X and group effects 130.08647 .2926207777D+00 .9974341 |
| |
| Hypothesis Tests |
| Likelihood Ratio Test F Tests |
| Chi-squared d.f. Prob. F num. denom. Prob value |
| (2) vs (1) 95.740 5 .00000 31.875 5 84 .00000 |
| (3) vs (1) 400.256 3 .00000 2419.329 3 86 .00000 |
| (4) vs (1) 536.889 8 .00000 3935.818 8 81 .00000 |
| (4) vs (2) 441.149 3 .00000 3604.832 3 81 .00000 |
| (4) vs (3) 136.633 5 .00000 57.733 5 81 .00000 |
+------------------------------------------------------------------------+
Error: 425: REGR;PANEL. Could not invert VC matrix for Hausman test.

+--------------------------------------------------+
| Random Effects Model: v(i,t) = e(i,t) + u(i) |
| Estimates: Var[e] = .361260D-02 |
| Var[u] = .119159D-01 |
| Corr[v(i,t),v(i,s)] = .767356 |
| Lagrange Multiplier Test vs. Model (3) = 334.85 |
| ( 1 df, prob value = .000000) |
| (High values of LM favor FEM/REM over CR model.) |
| Fixed vs. Random Effects (Hausman) = .00 |
| ( 3 df, prob value = 1.000000) |
| (High (low) values of H favor FEM (REM).) |
| Reestimated using GLS coefficients: |
| Estimates: Var[e] = .362491D-02 |
| Var[u] = .392309D-01 |
| Var[e] above is an average. Groupwise |
| heteroscedasticity model was estimated. |
| Sum of Squares .147779D+01 |
+--------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
OUTPUT .9041238041 .24615477E-01 36.730 .0000 -1.1743092
FUEL .4238986905 .13746498E-01 30.837 .0000 12.770359
LOAD -1.064558659 .19933132 -5.341 .0000 .56046016
Constant 9.610634379 .20277404 47.396 .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Like SAS TSCSREG and PANEL procedures, LIMDEP estimates a slightly different variance
component for groups (.0119), thus producing different parameter estimates. In addition, the
Hausman test is not successful in this example.


7.3 The One-way Random Time Effect Model

Let us compute $\theta$ using the SSEs of the between effect model (.0056) and the fixed effect
model (1.0882).

The variance component for error $\hat{\sigma}_{\varepsilon}^{2}$ is .01511375 = 1.08819022/(15*6-15-3)
The variance component for time $\hat{\sigma}_{v}^{2}$ is -.00201072 = .005590631/(15-4) - .01511375/6

The $\hat{\theta}$ is -1.226263 = $1 - \sqrt{\frac{.01511375}{6 * .005590631/(15-4)}}$

. gen rt_cost = cost - (-1.226263)*tm_cost // transform variables
. gen rt_output = output - (-1.226263)*tm_output
. gen rt_fuel = fuel - (-1.226263)*tm_fuel
. gen rt_load = load - (-1.226263)*tm_load
. gen rt_int = 1 - (-1.226263) // for the intercept

. regress rt_cost rt_int rt_output rt_fuel rt_load, noc


      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  4,    86) =       .
       Model |  79944.1804     4  19986.0451           Prob > F      =  0.0000
    Residual |  1.79271995    86  .020845581           R-squared     =  1.0000
-------------+------------------------------           Adj R-squared =  1.0000
       Total |  79945.9732    90  888.288591           Root MSE      =  .14438

------------------------------------------------------------------------------
     rt_cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      rt_int |   9.516098   .1489281    63.90   0.000     9.220038    9.812157
   rt_output |   .8883838   .0143338    61.98   0.000     .8598891    .9168785
     rt_fuel |   .4392731   .0129051    34.04   0.000     .4136186    .4649277
     rt_load |  -1.279176   .2482869    -5.15   0.000    -1.772754   -.7855982
------------------------------------------------------------------------------

However, a negative variance component for time is implausible, because a variance cannot be
negative. This section presents examples of procedures and commands for the one-way time
random effect model without outputs.
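It is worth checking the sign of the estimated variance component before transforming the variables. The following Python sketch is a minimal illustration under the same formulas as in this section; the function name one_way_components and the flag usable are hypothetical, not part of any package.

```python
import math

def one_way_components(sse_within, sse_between, n_units, obs_per_unit, k):
    """Return (error variance, effect variance, theta) from the two SSEs."""
    n_obs = n_units * obs_per_unit
    sigma2_error = sse_within / (n_obs - n_units - k)
    between_ms = sse_between / (n_units - k - 1)
    sigma2_effect = between_ms - sigma2_error / obs_per_unit
    theta = 1 - math.sqrt(sigma2_error / (obs_per_unit * between_ms))
    return sigma2_error, sigma2_effect, theta

# Time effect: the 15 years are the units, each observed for 6 airlines, 3 regressors
s2_e, s2_v, theta = one_way_components(1.08819022, 0.005590631, 15, 6, 3)
usable = s2_v > 0    # a negative estimate signals that the time effect is not supported

print(round(s2_e, 8), round(s2_v, 8), round(theta, 6), usable)
```

Here the estimate for time is negative and theta falls outside the interval from 0 to 1, so the transformed regression above should be read with caution.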

In SAS, use the TSCSREG or PANEL procedure with the /RANONE option.

PROC SORT DATA=masil.airline;
BY year airline;

PROC TSCSREG DATA=masil.airline;
ID year airline;
MODEL cost = output fuel load /RANONE;
RUN;

PROC PANEL DATA=masil.airline;
ID year airline;
MODEL cost = output fuel load /RANONE BP;
RUN;

In STATA, you have to switch the grouping and time variables using the .tsset command.

. tsset year airline
       panel variable:  year, 1 to 15
        time variable:  airline, 1 to 6

. xtreg cost output fuel load, re i(year) theta

In LIMDEP, you need to use the Period$ and Random$ subcommands.

REGRESS;Lhs=COST;Rhs=ONE,OUTPUT,FUEL,LOAD;Panel;Pds=15;Het=YEAR$


7.4 The Two-way Random Effect Model in SAS

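The random group and time effect model is formulated as
$y_{it} = \alpha + \beta' X_{ti} + u_i + v_t + \varepsilon_{it}$, where $u_i$ is the random group
effect and $v_t$ is the random time effect. Let us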
first estimate the two way FGLS using the SAS PANEL procedure with the /RANTWO option.
The BP2 option conducts the Breusch-Pagan LM test for the two-way random effect model.

PROC PANEL DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /RANTWO BP2;
RUN;


The PANEL Procedure
Fuller and Battese Variance Components (RanTwo)

Dependent Variable: cost

Model Description

Estimation Method RanTwo
Number of Cross Sections 6
Time Series Length 15


Fit Statistics

SSE 0.2322 DFE 86
MSE 0.0027 Root MSE 0.0520
R-Square 0.9829


Variance Component Estimates

Variance Component for Cross Sections 0.017439
Variance Component for Time Series 0.001081
Variance Component for Error 0.00264


Hausman Test for
Random Effects

DF m Value Pr > m

3 6.93 0.0741


Breusch Pagan Test for Random
Effects (Two Way)

DF m Value Pr > m

2 336.40 <.0001


Parameter Estimates

Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 9.362677 0.2440 38.38 <.0001
output 1 0.866448 0.0255 33.98 <.0001
fuel 1 0.436163 0.0172 25.41 <.0001
load 1 -0.98053 0.2235 -4.39 <.0001

Similarly, you may run the TSCSREG procedure with the /RANTWO option.

PROC TSCSREG DATA=masil.airline;
ID airline year;
MODEL cost = output fuel load /RANTWO;
RUN;


7.5 Testing Random Effect Models

The Breusch-Pagan Lagrange multiplier (LM) test is designed to test random effects. The null
hypothesis of the one-way random group effect model is that variances of groups are zero:
$H_0: \sigma_u^2 = 0$. If the null hypothesis is not rejected, the pooled regression model is appropriate.
The $e'e$ of the pooled OLS is 1.33544153 and $\bar{e}'\bar{e}$ is .0665147, where $\bar{e}$ is the
vector of group means of the pooled OLS residuals.

LM is 334.8496 = $\frac{6*15}{2(15-1)}\left[\frac{15^2 * .0665}{1.3354} - 1\right]^2 \sim \chi^2(1)$
with p<.0001.
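The LM statistic can be verified by hand. The Python sketch below is only a check of the arithmetic with the two residual sums of squares given above; it obtains the p-value from the identity that the upper tail of a chi-squared variable with one degree of freedom equals erfc(sqrt(x/2)). The function name chi2_sf_1df is hypothetical.

```python
import math

n, T = 6, 15
ee_pooled = 1.33544153      # e'e of the pooled OLS
ebar_ebar = 0.0665147       # sum of squared group means of the pooled residuals

lm_group = (n * T) / (2 * (T - 1)) * ((T ** 2 * ebar_ebar) / ee_pooled - 1) ** 2

def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

p_value = chi2_sf_1df(lm_group)
print(round(lm_group, 4), p_value)
```

The statistic is about 334.85 and the p-value is far below .0001.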

With the large chi-squared, we reject the null hypothesis in favor of the random group effect
model. The SAS PANEL procedure with the /BP option and the LIMDEP Panel$ and Het$
subcommands report the LM statistic. In STATA, run the .xttest0 command right after
estimating the one-way random effect model.

. quietly xtreg cost output fuel load, re i(airline)

. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects:

        cost[airline,t] = Xb + u[airline] + e[airline,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                    cost |   1.281358       1.131971
                       e |   .0036126       .0601051
                       u |   .0155972       .1248886

        Test:   Var(u) = 0
                              chi2(1) =   334.85
                          Prob > chi2 =     0.0000

The null hypothesis of the one-way random time effect is that variance components for time are
zero, $H_0: \sigma_v^2 = 0$. The following LM test uses Baltagi's formula. The small chi-squared of
1.5472 does not reject the null hypothesis at the .01 level.

LM is 1.5472 = $\frac{Tn}{2(n-1)}\left[\frac{\sum_t (n\bar{e}_{\bullet t})^2}{\sum_i \sum_t e_{it}^2} - 1\right]^2 = \frac{15*6}{2(6-1)}\left[\frac{.7817}{1.3354} - 1\right]^2 \sim \chi^2(1)$
with p=.2135
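The same check works for the time effect. The Python sketch below plugs the rounded sums of squares from the formula above (.7817 and 1.3354) into the statistic and computes the p-value for one degree of freedom from the complementary error function; the variable names are hypothetical.

```python
import math

n, T = 6, 15
sum_sq_time = 0.7817     # sum over t of (n * mean residual in period t)^2
ee_pooled = 1.3354       # e'e of the pooled OLS

lm_time = (T * n) / (2 * (n - 1)) * (sum_sq_time / ee_pooled - 1) ** 2

# Upper tail of a chi-squared variable with 1 degree of freedom: erfc(sqrt(x/2))
p_value = math.erfc(math.sqrt(lm_time / 2))
print(round(lm_time, 4), round(p_value, 4))
```

Because the inputs are rounded, the statistic matches 1.5472 only to about three decimal places; the p-value is about .21, so the null hypothesis of no random time effect is not rejected.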

. quietly xtreg cost output fuel load, re i(year)

. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects:

        cost[year,t] = Xb + u[year] + e[year,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                    cost |   1.281358       1.131971
                       e |   .0151138        .122938
                       u |          0              0

        Test:   Var(u) = 0
                              chi2(1) =     1.55
                          Prob > chi2 =     0.2135

The two way random effects model has the null hypothesis that variance components for
groups and time are all zero. The LM statistic with two degrees of freedom is 336.3968 =
334.8496 +1.5472 (p<.0001).
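For two degrees of freedom the p-value has a closed form, since the upper tail of a chi-squared variable with 2 degrees of freedom is exp(-x/2). The short Python sketch below only illustrates that arithmetic; the variable names are hypothetical.

```python
import math

lm_group = 334.8496      # one-way LM statistic for the group effect
lm_time = 1.5472         # one-way LM statistic for the time effect

lm_two_way = lm_group + lm_time

# Upper tail of a chi-squared variable with 2 degrees of freedom: exp(-x/2)
p_value = math.exp(-lm_two_way / 2)
print(round(lm_two_way, 4), p_value)
```

The sum is 336.3968, and the p-value is practically zero, so the joint null hypothesis is rejected even though the time component alone is not significant.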


7.6 Fixed Effects versus Random Effects

How do we compare a fixed effect model and its counterpart random effect model? The
Hausman specification test examines if the individual effects are uncorrelated with the other
regressors in the model. Since computation is complicated, let us conduct the test in STATA.

. tsset airline year
       panel variable:  airline, 1 to 6
        time variable:  year, 1 to 15

. quietly xtreg cost output fuel load, fe

. estimates store fixed_group

. quietly xtreg cost output fuel load, re

. hausman fixed_group .

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |   fix_group        .          Difference          S.E.
-------------+----------------------------------------------------------------
      output |    .9192846     .9066805        .0126041        .0153877
        fuel |    .4174918     .4227784       -.0052867        .0058583
        load |   -1.070396    -1.064499       -.0058974        .0255088
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =        2.12
                Prob>chi2 =      0.5469
                (V_b-V_B is not positive definite)

The Hausman statistic 2.12 is different from the PANEL procedure's 1.63 and Greene's (2003)
4.16. This is because SAS, STATA, and LIMDEP use different estimation methods and thus
produce slightly different parameter estimates. None of these tests, however, rejects the null
hypothesis, so the random effect model is favored.


7.7 Summary

Table 7 summarizes random effect estimations in SAS, STATA, and LIMDEP. The SAS
PANEL procedure is highly recommended.

Table 7. Comparison of the Random Effect Model in SAS, STATA, LIMDEP*

                     SAS 9.1                                   STATA 9.0        LIMDEP 8.0
Procedure/Command    PROC TSCSREG        PROC PANEL            .xtreg           Regress; Panel$
One-way              /RANONE             /RANONE VCOMP=WK      re               Str=;Pds=;Het;Random$
Two-way              /RANTWO             /RANTWO               No               Problematic
SSE (e'e)            Slightly different  Correct               No               No
MSE or SEE           Slightly different  Correct               No               No
Model test (F)       No                  No                    Wald test        No
(adjusted) R^2       Slightly different  Slightly different    Incorrect        No
Intercept            Slightly different  Correct               Correct          Slightly different
Coefficients         Slightly different  Correct               Correct          Slightly different
Standard errors      Slightly different  Correct               Correct          Slightly different
Variance for group   Slightly different  Correct               Correct (sigma)  Slightly different
Variance for error   Correct             Correct               Correct (sigma)  Correct
Theta                No                  No                    theta            No
Breusch-Pagan (LM)   No                  BP option             .xttest0         Yes
Hausman Test (H)     Incorrect           Yes                   .hausman         Yes (unstable)

* Yes/No means whether the software reports the statistics. Correct/incorrect indicates whether the statistics
are different from those of the groupwise heteroscedastic regression.

8. The Poolability Test

In order to conduct the poolability test, you need to run group by group OLS regressions and/or
time by time OLS regressions. If the null hypothesis is rejected, the panel data are not poolable.
In this case, you may consider the random coefficient model and hierarchical regression model.


8.1 Group by Group OLS Regression

In SAS, use the BY statement in the REG procedure. Do not forget to sort the data set in
advance.

PROC SORT DATA=masil.airline;
BY airline;

PROC REG DATA=masil.airline;
MODEL cost = output fuel load;
BY airline;
RUN;

In STATA, the if qualifier makes it easy to run group by group regressions.

. forvalues i= 1(1)6 { // run group by group regression
     display "OLS regression for group " `i'
     regress cost output fuel load if airline==`i'
  }

OLS regression for group 1

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) = 1843.46
       Model |  3.41824348     3  1.13941449           Prob > F      =  0.0000
    Residual |  .006798918    11  .000618083           R-squared     =  0.9980
-------------+------------------------------           Adj R-squared =  0.9975
       Total |   3.4250424    14  .244645886           Root MSE      =  .02486

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |    1.18318   .0968946    12.21   0.000     .9699164    1.396444
        fuel |   .3865867   .0181946    21.25   0.000     .3465406    .4266329
        load |  -2.461629   .4013571    -6.13   0.000     -3.34501   -1.578248
       _cons |     10.846   .2972551    36.49   0.000     10.19174    11.50025
------------------------------------------------------------------------------

OLS regression for group 2

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) = 3129.50
       Model |  6.47622084     3  2.15874028           Prob > F      =  0.0000
    Residual |  .007587838    11  .000689803           R-squared     =  0.9988
-------------+------------------------------           Adj R-squared =  0.9985
       Total |  6.48380868    14  .463129191           Root MSE      =  .02626

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   1.459104   .0792856    18.40   0.000     1.284597     1.63361
        fuel |   .3088958   .0272443    11.34   0.000     .2489315      .36886
        load |  -2.724785   .2376522   -11.47   0.000    -3.247854   -2.201716
       _cons |   11.97243   .4320951    27.71   0.000     11.02139    12.92346
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

OLS regression for group 3

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) =  608.10
       Model |  3.79286673     3  1.26428891           Prob > F      =  0.0000
    Residual |  .022869767    11   .00207907           R-squared     =  0.9940
-------------+------------------------------           Adj R-squared =  0.9924
       Total |   3.8157365    14  .272552607           Root MSE      =   .0456

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .7268305   .1554418     4.68   0.001     .3847054    1.068956
        fuel |   .4515127   .0381103    11.85   0.000     .3676324    .5353929
        load |  -.7513069   .6105989    -1.23   0.244    -2.095226    .5926122
       _cons |   8.699815   .8985786     9.68   0.000     6.722057    10.67757
------------------------------------------------------------------------------

OLS regression for group 4

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) =  777.86
       Model |  7.37252558     3  2.45750853           Prob > F      =  0.0000
    Residual |  .034752343    11  .003159304           R-squared     =  0.9953
-------------+------------------------------           Adj R-squared =  0.9940
       Total |  7.40727792    14   .52909128           Root MSE      =  .05621

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .9353749   .0759266    12.32   0.000     .7682616    1.102488
        fuel |   .4637263    .044347    10.46   0.000     .3661192    .5613333
        load |  -.7756708   .4707826    -1.65   0.128    -1.811856    .2605148
       _cons |   9.164608   .6023241    15.22   0.000     7.838902    10.49031
------------------------------------------------------------------------------

OLS regression for group 5

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) = 1999.89
       Model |  7.08313716     3  2.36104572           Prob > F      =  0.0000
    Residual |  .012986435    11  .001180585           R-squared     =  0.9982
-------------+------------------------------           Adj R-squared =  0.9977
       Total |  7.09612359    14  .506865971           Root MSE      =  .03436

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   1.076299   .0771255    13.96   0.000     .9065471    1.246051
        fuel |   .2920542   .0434213     6.73   0.000     .1964845    .3876239
        load |  -1.206847   .3336308    -3.62   0.004    -1.941163   -.4725305
       _cons |   11.77079   .7430078    15.84   0.000     10.13544    13.40614
------------------------------------------------------------------------------

OLS regression for group 6

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  3,    11) = 2602.49
       Model |  11.1173565     3  3.70578551           Prob > F      =  0.0000
    Residual |  .015663323    11  .001423938           R-squared     =  0.9986
-------------+------------------------------           Adj R-squared =  0.9982
       Total |  11.1330199    14  .795215705           Root MSE      =  .03774

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .9673393   .0321728    30.07   0.000     .8965275    1.038151
        fuel |   .3023258   .0308235     9.81   0.000     .2344839    .3701678
        load |   .1050328   .4767508     0.22   0.830    -.9442886    1.154354
       _cons |   10.77381   .4095921    26.30   0.000     9.872309    11.67532
------------------------------------------------------------------------------


8.2 Poolability Test across Groups

The null hypothesis of the poolability test across groups is $H_0: \beta_{ik} = \beta_k$.
The $e'e$ is 1.3354, the SSE of the pooled OLS regression. The $\sum e_i'e_i$ is
.1007 = .0068 + .0076 + .0229 + .0348 + .0130 + .0157.

Thus, the F statistic is
$\frac{(1.3354 - .1007) / [(6-1) \cdot 4]}{.1007 / [6(15-4)]} = 40.4812 \sim F[20, 66]$.

The large F statistic of 40.4812 rejects the null hypothesis of poolability (p<.0001). We
conclude that the panel data are not poolable with respect to group.
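
The arithmetic of this test can be checked with a few STATA scalars. The following is only a sketch that plugs in the rounded sums of squares quoted above; the scalar names are illustrative assumptions. Because the inputs are rounded, the result (about 40.46) differs slightly from the reported 40.4812, which presumably comes from unrounded values.

```stata
* Poolability across groups, using the rounded sums of squares in the text
* sse_pooled: e'e of the pooled OLS; sse_groups: sum of the six group SSEs
scalar sse_pooled = 1.3354
scalar sse_groups = .1007
* numerator d.f. = (6-1)*4 = 20, denominator d.f. = 6*(15-4) = 66
scalar df1 = (6-1)*4
scalar df2 = 6*(15-4)
scalar Fg = ((sse_pooled - sse_groups)/df1) / (sse_groups/df2)
display Fg
* upper-tail p-value of the F statistic
display Ftail(df1, df2, Fg)
```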


8.3 Poolability Test over Time

The null hypothesis of the poolability test over time is $H_0: \beta_{tk} = \beta_k$. The
sum of $e_t'e_t$ is computed from the 15 time by time regressions.

. di .044807673 + .023093978 + .016506613 + .012170358 + .014104542 + ///
.000469826 + .063648817 + .085430285 + .049329439 + .077112957 + ///
.029913538 + .087240016 + .143348297 + .066075346 + .037256216

.7505079

The F statistic is
$\frac{(1.3354 - .7505) / [(15-1) \cdot 4]}{.7505 / [15(6-4)]} = .4175 \sim F[56, 30]$.

The small F statistic does not reject the null hypothesis in favor of poolable panel data with
respect to time (p<.9991).
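
The same check applies to the test over time. This sketch again uses the rounded sums of squares from the text, with illustrative scalar names; the degrees of freedom follow from the divisors in the formula, (15-1)*4 = 56 for the numerator and 15*(6-4) = 30 for the denominator.

```stata
* Poolability over time, using the rounded sums of squares in the text
scalar sse_pooled = 1.3354
scalar sse_time   = .7505
scalar Ft = ((sse_pooled - sse_time)/((15-1)*4)) / (sse_time/(15*(6-4)))
* F statistic, about .4175
display Ft
* upper-tail p-value with the d.f. implied by the formula
display Ftail((15-1)*4, 15*(6-4), Ft)
```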


9. Conclusion

Panel data models investigate group and time effects using fixed effect and random effect
models. The fixed effect models ask how group and/or time affect the intercept, while the
random effect models analyze error variance structures affected by group and/or time. Slopes
are assumed unchanged in both fixed effect and random effect models.

Fixed effect models are estimated by least squares dummy variable (LSDV) regression, the
within effect model, and the between effect model. LSDV has three approaches to avoid perfect
multicollinearity. LSDV1 drops a dummy, LSDV2 suppresses the intercept, and LSDV3
includes all dummies and imposes restrictions instead. LSDV1 is commonly used since it
produces correct statistics. LSDV2 provides actual parameter estimates of group intercepts, but
reports incorrect $R^2$ and F statistics. Note that the dummy parameters of the three LSDV
approaches have different meanings and thus different t-tests.

The within effect model does not use dummy variables but deviations from the group means.
Thus, this model is useful when there are many groups and/or time periods in the panel data set
(no incidental parameter problem at all). The dummy parameter estimates need to be computed
afterward. Because of its larger degrees of freedom, the within effect model produces incorrect
MSE and standard errors of parameters. As a result, you need to adjust the standard errors to
conduct the correct t-tests.

Random effect models are estimated by generalized least squares (GLS) and feasible
generalized least squares (FGLS). When the variance structure is known, GLS is used. If it is
unknown, FGLS estimates theta. Parameter estimates may vary depending on estimation
methods.

Fixed effects are tested by the F-test and random effects by the Breusch-Pagan Lagrange
multiplier test. The Hausman specification test compares a fixed effect model and a random
effect model. If the null hypothesis of no correlation is rejected, the fixed effect model is
preferred. Poolability is tested by running group by group or time by time regressions.

Among the four statistical packages addressed in this document, I would recommend SAS and
STATA. In particular, the SAS PANEL procedure, although experimental now, provides
various ways of analyzing panel data. STATA is very handy to manipulate panel data, but it
does not fit two-way effect models. LIMDEP is able to estimate various panel data models, but
it is not stable enough. SPSS is not recommended for panel data models.

APPENDIX: Data sets

Data set 1: Data of the top 50 information technology firms presented in OECD Information
Technology Outlook 2004 (http://thesius.sourceoecd.org/).

firm =IT company name
type =type of IT firm
rnd =2002 R&D investment in current USD millions
income =2000 net income in current USD millions
d1 =1 for equipment and software firms and 0 for telecommunication and electronics

. tab type d1

                |          d1
   Type of Firm |         0          1 |     Total
----------------+----------------------+----------
        Telecom |        18          0 |        18
    Electronics |        17          0 |        17
   IT Equipment |         0          6 |         6
Comm. Equipment |         0          5 |         5
  Service & S/W |         0          4 |         4
----------------+----------------------+----------
          Total |        35         15 |        50


. sum rnd income

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         rnd |        39    2023.564    1615.417          0       5490
      income |        50     2509.78    3104.585       -732      11797


Data set 2: Cost data for U.S. airlines (1970-1984) presented in Greene (2003).
URL: http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm

airline =airline (six airlines)
year =year (fifteen years)
output0 =output in revenue passenger miles, index number
cost0 =total cost in $1,000
fuel0 =fuel price
load =load factor, the average capacity utilization of the fleet

. tsset
       panel variable:  airline, 1 to 6
        time variable:  year, 1 to 15

. sum output0 cost0 fuel0 load

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     output0 |        90    .5449946    .5335865    .037682    1.93646
       cost0 |        90     1122524     1192075      68978    4748320
       fuel0 |        90      471683    329502.9     103795    1015610
        load |        90    .5604602    .0527934    .432066    .676287



References

Baltagi, Badi H. 2001. Econometric Analysis of Panel Data. Wiley, John & Sons.
Baltagi, Badi H., and Young-Jae Chang. 1994. "Incomplete Panels: A Comparative Study of
Alternative Estimators for the Unbalanced One-way Error Component Regression
Model." Journal of Econometrics, 62(2): 67-89.
Breusch, T. S., and A. R. Pagan. 1980. "The Lagrange Multiplier Test and its Applications to
Model Specification in Econometrics." Review of Economic Studies, 47(1): 239-253.
Fox, John. 1997. Applied Regression Analysis, Linear Models, and Related Methods. Newbury
Park, CA: Sage.
Freund, Rudolf J., and Ramon C. Littell. 2000. SAS System for Regression, 3rd ed. Cary, NC:
SAS Institute.
Fuller, Wayne A., and George E. Battese. 1973. "Transformations for Estimation of Linear
Models with Nested-Error Structure." Journal of the American Statistical
Association, 68(343) (September): 626-632.
Fuller, Wayne A., and George E. Battese. 1974. "Estimation of Linear Models with Crossed-
Error Structure." Journal of Econometrics, 2: 67-78.
Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide, 4th ed.
Plainview, New York: Econometric Software.
Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
Hausman, J. A. 1978. "Specification Tests in Econometrics." Econometrica, 46(6): 1251-1271.
SAS Institute. 2004. SAS/ETS 9.1 User's Guide. Cary, NC: SAS Institute.
SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute.
http://www.sas.com/
STATA Press. 2005. STATA Base Reference Manual, Release 9. College Station, TX: STATA
Press.
STATA Press. 2005. STATA Longitudinal/Panel Data Reference Manual, Release 9. College
Station, TX: STATA Press.
STATA Press. 2005. STATA Time-Series Reference Manual, Release 9. College Station, TX:
STATA Press.
Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel
Data. Cambridge, MA: MIT Press.


Acknowledgements

I have to thank Dr. Heejoon Kang in the Kelley School of Business and Dr. David H. Good in
the School of Public and Environmental Affairs, Indiana University at Bloomington, for their
insightful lectures. I am also grateful to Jeremy Albright and Kevin Wilhite at the UITS Center
for Statistical and Mathematical Computing for comments and suggestions.


Revision History

2005.11 First draft
2003-2005, The Trustees of Indiana University Categorical Dependent Variable Models: 1
http://www.indiana.edu/~statmath
Categorical Dependent Variable Models
Using SAS, STATA, LIMDEP, and SPSS

Hun Myoung Park

This document summarizes regression models for categorical dependent variables and illustrates
how to estimate individual models using SAS 9.1, STATA 9.0, LIMDEP 8.0, and SPSS 13.0.

1. Introduction
2. The Binary Logit Model
3. The Binary Probit Model
4. Bivariate Logit/Probit Models
5. Ordered Logit/Probit Models
6. The Multinomial Logit Model
7. The Conditional Logit Model
8. The Nested Logit Model
9. Conclusion
10. Appendix


1. Introduction

The categorical variable here refers to a variable that is binary, ordinal, or nominal. Event count
data are discrete (categorical) but often considered continuous. When the dependent variable is
categorical, the ordinary least squares (OLS) method can no longer produce the best linear
unbiased estimator (BLUE); that is, OLS is biased and inefficient. Consequently, researchers
have developed various categorical dependent variable models (CDVMs). The nonlinearity of
CDVMs makes it difficult to interpret outputs, since the effect of a change in a variable depends
on the values of all other variables in the model (Long 1997).


1.1 Categorical Dependent Variable Models

In CDVMs, the left-hand side (LHS) variable or dependent variable is neither interval nor ratio,
but rather categorical. The level of measurement and data generation process (DGP) of a
dependent variable determines the proper type of CDVM. Thus, binary responses are modeled
with the binary logit and probit regressions, ordinal responses are formulated into the ordered
logit/probit regression models, and nominal responses are analyzed by multinomial logit,
conditional logit, or nested logit models. Independent variables on the right-hand side (RHS)
may be interval, ratio, or binary (dummy).

The CDVMs adopt the maximum likelihood (ML) estimation method, whereas OLS uses the
moment based method. The ML method requires assumptions about probability distribution
functions, such as the logistic function and the complementary log-log function. Logit models
use the standard logistic probability distribution, while probit models assume the standard
normal distribution. This document focuses on logit and probit models only. Table 1 summarizes
CDVMs in comparison with OLS.

Table 1. Ordinary Least Squares and CDVMs
Model                      Dependent (LHS)           Estimation           Independent (RHS)
OLS    Ordinary least      Interval or ratio         Moment based         A linear function of
       squares                                       method               interval/ratio or binary
CDVMs  Binary response     Binary (0 or 1)           Maximum              variables
       Ordinal response    Ordinal (1st, 2nd, 3rd)   likelihood           $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots$
       Nominal response    Nominal (A, B, C)         method
       Event count data    Count (0, 1, 2, 3)


1.2 Logit Models versus Probit Models

How do logit models differ from probit models? The core difference lies in the distribution of
errors. In the logit model, errors are assumed to follow the standard logistic distribution with
mean 0 and variance $\pi^2/3$, $\lambda(\varepsilon) = \frac{e^{\varepsilon}}{(1+e^{\varepsilon})^2}$.
The errors of the probit model are assumed to follow the standard normal distribution,
$\phi(\varepsilon) = \frac{1}{\sqrt{2\pi}} e^{-\varepsilon^2/2}$.

Figure 1. Comparison of the Standard Normal and Standard Logistic Probability Distributions
[Figure panels: PDF and CDF of the standard normal distribution; PDF and CDF of the standard
logistic distribution]

The probability density function (PDF) of the standard normal probability distribution has a
higher peak and thinner tails than the standard logistic probability distribution (Figure 1). The
standard logistic distribution looks as if someone has weighed down the peak of the standard
normal distribution and stretched its tails. As a result, the cumulative distribution function
(CDF) of the standard normal distribution is steeper in the middle than the CDF of the standard
logistic distribution and quickly approaches zero on the left and one on the right.
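
The difference in peaks and tails can be illustrated numerically. The following sketch evaluates both densities at zero and both upper-tail probabilities beyond 3; the evaluation point 3 is an arbitrary choice made here for illustration.

```stata
* Height of each density at zero (the peak)
display normalden(0)
display exp(0)/(1 + exp(0))^2
* Upper-tail probability beyond 3 under each distribution
display 1 - normal(3)
display 1 - invlogit(3)
```

The standard normal density is about .399 at zero against .25 for the standard logistic, while the logistic tail probability beyond 3 (about .047) is far larger than the normal one (about .001).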

The two models, of course, produce different parameter estimates. In binary response models,
the estimates of a logit model are roughly $\pi/\sqrt{3}$ (about 1.8) times larger than those of
the corresponding probit model. These estimators, however, are almost the same in terms of the
standardized impacts of independent variables and predictions (Long 1997).
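
The scale factor between the two sets of estimates comes from the error standard deviations: $\pi/\sqrt{3}$ for the standard logistic distribution versus 1 for the standard normal. As a quick sketch of the arithmetic:

```stata
* Standard deviation of the standard logistic distribution, pi/sqrt(3)
display _pi/sqrt(3)
* Its square is the error variance pi^2/3 assumed by the logit model
display _pi^2/3
```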

In general, logit models reach convergence in estimation fairly well. Some (multinomial) probit
models may take a long time to reach convergence, although the probit works well for bivariate
models.


1.3 Estimation in SAS, STATA, LIMDEP, and SPSS

SAS provides several procedures for CDVMs, such as LOGISTIC, PROBIT, GENMOD, QLIM,
MDC, and CATMOD. Since these procedures support various models, a CDVM can be
estimated by multiple procedures. For example, you may run a binary logit model using the
LOGISTIC, PROBIT, GENMOD, and QLIM procedures. The LOGISTIC and PROBIT
procedures of SAS/STAT have been commonly used, but the QLIM and MDC procedures of
SAS/ETS are noted for their advanced features.

Table 2. Procedures and Commands for CDVMs
           Model                         SAS 9.1                                 Stata 9.0          LIMDEP 8.0        SPSS 13.0
           OLS (Ordinary least squares)  REG                                     .regress           Regress$          Regression
Binary     Binary logit                  QLIM, GENMOD, LOGISTIC, PROBIT, CATMOD  .logit, .logistic  Logit$            Logistic regression
           Binary probit                 QLIM, GENMOD, LOGISTIC, PROBIT          .probit            Probit$           Probit
Bivariate  Bivariate logit               QLIM                                    -                  -                 -
           Bivariate probit              QLIM                                    .biprobit          Bivariateprobit$  -
Ordinal    Ordered logit                 QLIM, PROBIT, LOGISTIC                  .ologit            Ordered$, Logit$  Plum
           Generalized logit             -                                       .gologit*          -                 -
           Ordered probit                QLIM, PROBIT, LOGISTIC                  .oprobit           Ordered$          Plum
Nominal    Multinomial logit             CATMOD                                  .mlogit            Mlogit$, Logit$   Nomreg
           Conditional logit             MDC, PHREG                              .clogit            Clogit$, Logit$   Coxreg
           Nested logit                  MDC                                     .nlogit            Nlogit$**         -
           Multinomial probit            MDC                                     .mprobit           -                 -
* User-written commands written by Fu (1998) and Williams (2005)
** The Nlogit$ command is supported by NLOGIT 3.0, which is sold separately.

The QLIM (Qualitative and LImited dependent variable Model) procedure analyzes various
categorical and limited dependent variable regression models such as censored, truncated, and
sample-selection models. This QLIM procedure also handles Box-Cox regression and bivariate
probit and logit models. The MDC (Multinomial Discrete Choice) Procedure can estimate
multinomial probit, conditional logit, and nested (multinomial) logit models.

Unlike SAS, STATA has individualized commands for corresponding CDVMs. For example,
the .logit and .probit commands respectively fit the binary logit and probit models. The
LIMDEP Logit$ and Probit$ commands support a variety of CDVMs that are addressed in
Greene's Econometric Analysis (2003). SPSS supports some related commands for CDVMs but
has limited ability to analyze categorical data. Because of these limitations, SPSS outputs are
skipped here. Table 2 summarizes the procedures and commands for CDVMs.


1.4 Long and Freese's SPost Module

STATA users may take advantage of user-written modules such as J. Scott Long and Jeremy
Freese's SPost. The module allows researchers to conduct follow-up analyses of various CDVMs
including event count data models. See section 2.2 for major SPost commands.

In order to install SPost, execute the following commands consecutively. For more details, visit J.
Scott Long's Web site at http://www.indiana.edu/~jslsoc/spost_install.htm.

. net from http://www.indiana.edu/~jslsoc/stata/

. net install spost9_ado, replace

. net get spost9_do, replace

If you want to use Vincent Kang Fu's gologit (2000) and Richard Williams' gologit2 (2005)
for the generalized ordered logit model, type in the following.

. net search gologit

. net install gologit, from(http://www.stata.com/users/jhardin)

. net install gologit2, from(http://fmwww.bc.edu/RePEc/bocode/g)


2. The Binary Logit Regression Model

The binary logit model is represented as
$Prob(y=1|x) = \frac{\exp(x\beta)}{1+\exp(x\beta)} = \Lambda(x\beta)$, where $\Lambda$
indicates a link function, the cumulative standard logistic probability distribution function. This
chapter examines how car ownership (owncar) is affected by monthly income (income), age, and
gender (male). See the appendix for details about the data set.


2.1 Binary Logit in STATA (.logit)

STATA provides two equivalent commands for the binary logit model, which present the same
result in different ways. The .logit command produces coefficients with respect to the logit (log
of odds), while the .logistic command reports estimates as odds ratios.

. logistic owncar income age male

Logistic regression                               Number of obs   =        437
                                                  LR chi2(3)      =      18.24
                                                  Prob > chi2     =     0.0004
Log likelihood = -273.84758                       Pseudo R2       =     0.0322

------------------------------------------------------------------------------
      owncar | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .9898826   .5677504    -0.02   0.986     .3216431    3.046443
         age |   1.279626    .088997     3.55   0.000     1.116561    1.466505
        male |   1.513669   .3111388     2.02   0.044     1.011729    2.264633
------------------------------------------------------------------------------

. logit

In order to get the coefficients (log of odds), simply run .logit without any argument right
after the .logistic command. Or run an independent .logit command with all arguments.

. logit owncar income age male

Iteration 0:   log likelihood = -282.96512
Iteration 1:   log likelihood = -273.93537
Iteration 2:   log likelihood = -273.84761
Iteration 3:   log likelihood = -273.84758

Logistic regression                               Number of obs   =        437
                                                  LR chi2(3)      =      18.24
                                                  Prob > chi2     =     0.0004
Log likelihood = -273.84758                       Pseudo R2       =     0.0322

------------------------------------------------------------------------------
      owncar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   -.010169   .5735533    -0.02   0.986    -1.134313    1.113975
         age |   .2465678   .0695492     3.55   0.000     .1102539    .3828817
        male |   .4145366   .2055527     2.02   0.044     .0116606    .8174126
       _cons |  -4.682741   1.474519    -3.18   0.001    -7.572745   -1.792738
------------------------------------------------------------------------------

Note that a coefficient of .logit is the natural logarithm of the corresponding odds ratio
reported by .logistic. For example, .2465678 = log(1.279626).
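
The correspondence between the two outputs can be verified by hand. This sketch uses the estimates for age shown above; the last line relies on the delta-method relationship, standard error of the odds ratio = exp(b) times the standard error of b, which is stated here as background rather than taken from the text.

```stata
* A logit coefficient is the natural log of the odds ratio, and vice versa
display ln(1.279626)
display exp(.2465678)
* Delta-method standard error of the odds ratio for age: exp(b)*se(b)
display exp(.2465678)*.0695492
```

The last value should be close to the .088997 reported by .logistic.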

STATA has post-estimation commands that conduct follow-up analyses. The .predict
command computes predictions, residuals, or standard errors of the prediction and stores them
into a new variable.

. predict r, residual

The .test and .lrtest commands respectively conduct the Wald test and likelihood ratio test.

. test income age

 ( 1)  income = 0
 ( 2)  age = 0

           chi2(  2) =   12.57
         Prob > chi2 =    0.0019


2.2 Using the SPost Module in STATA

The SPost module provides useful follow-up analysis commands (ado files) for various
categorical dependent variable models (Long and Freese 2003). The .fitstat command
calculates various goodness-of-fit statistics such as the log likelihood, McFadden's $R^2$ (or
Pseudo $R^2$), the Akaike Information Criterion (AIC), and the Bayesian Information Criterion
(BIC).

. fitstat

Measures of Fit for logistic of owncar

Log-Lik Intercept Only:       -282.965   Log-Lik Full Model:           -273.848
D(433):                        547.695   LR(3):                          18.235
                                         Prob > LR:                       0.000
McFadden's R2:                   0.032   McFadden's Adj R2:               0.018
Maximum Likelihood R2:           0.041   Cragg & Uhler's R2:              0.056
McKelvey and Zavoina's R2:       0.059   Efron's R2:                      0.040
Variance of y*:                  3.495   Variance of error:               3.290
Count R2:                        0.638   Adj Count R2:                   -0.033
AIC:                             1.272   AIC*n:                         555.695
BIC:                         -2084.916   BIC':                            0.005

The likelihood ratio for goodness of fit is computed as,

. di 2*(-273.848 - (-282.965))
18.234
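
Several other entries of the .fitstat output follow from the two log likelihoods in the same way. The sketch below assumes the usual definitions (McFadden's $R^2$ as one minus the ratio of log likelihoods, and AIC*n as minus twice the log likelihood plus twice the number of parameters, here four); small differences from the output are due to rounding of the inputs.

```stata
* McFadden's R2 = 1 - lnL(full model)/lnL(intercept only)
display 1 - (-273.848)/(-282.965)
* AIC*n = -2*lnL(full model) + 2*k, with k = 4 estimated parameters
display -2*(-273.848) + 2*4
* AIC as reported by .fitstat divides AIC*n by the 437 observations
display (-2*(-273.848) + 2*4)/437
```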

The .listcoef command lists unstandardized coefficients (parameter estimates), factor and
percent changes, and standardized coefficients to help interpret results. The help option tells
how to read the outputs.

. listcoef, help

logistic (N=437): Factor Change in Odds

  Odds of: 1 vs 0

----------------------------------------------------------------------
      owncar |         b        z    P>|z|      e^b   e^bStdX    SDofX
-------------+--------------------------------------------------------
      income |  -0.01017   -0.018    0.986   0.9899    0.9982   0.1792
         age |   0.24657    3.545    0.000   1.2796    1.4876   1.6108
        male |   0.41454    2.017    0.044   1.5137    1.2279   0.4953
----------------------------------------------------------------------
       b = raw coefficient
       z = z-score for test of b=0
   P>|z| = p-value for z-test
     e^b = exp(b) = factor change in odds for unit increase in X
 e^bStdX = exp(b*SD of X) = change in odds for SD increase in X
   SDofX = standard deviation of X
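
As the legend indicates, the e^bStdX column is simply exp(b*SD of X). A short sketch for age, using the rounded values printed in the table:

```stata
* Factor change in odds for a one standard deviation increase in age
display exp(.24657*1.6108)
* Equivalent percent change in odds
display 100*(exp(.24657*1.6108) - 1)
```

The first value should be close to the 1.4876 listed for age, that is, the odds of owning a car rise by roughly 49 percent for a standard deviation increase in age.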

The .prtab command constructs a table of predicted values (events) for all combinations of
categorical variables listed. The following example shows that 60 percent of female and 70
percent of male students are likely to own cars, given the mean values of income and age.

. prtab male

logistic: Predicted probabilities of positive outcome for owncar

----------------------
     male | Prediction
----------+-----------
        0 |     0.6017
        1 |     0.6958
----------------------

       income        age       male
x=  .61683982  20.691076  .57208238

The .prvalue command lists predicted probabilities of positive and negative outcomes for a
given set of values for the independent variables. Note that both the .prtab and .prvalue
commands report the identical predicted probability that female students (male=0) own cars,
.6017, holding other variables at their means.

. prvalue, x(male=0) rest(mean)

logistic: Predictions for owncar

  Pr(y=1|x):          0.6017   95% ci: (0.5286,0.6706)
  Pr(y=0|x):          0.3983   95% ci: (0.3294,0.4714)

       income        age       male
x=  .61683982  20.691076          0
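
These predicted probabilities can be reproduced from the .logit coefficients by evaluating the logistic CDF at the linear predictor. The following sketch plugs in the estimates and the means of income and age reported above; the scalar name xb0 is an illustrative assumption.

```stata
* Linear predictor for a female student (male=0) at mean income and age
scalar xb0 = -4.682741 + (-.010169)*.61683982 + .2465678*20.691076
* Predicted probability for female students, about .6017
display invlogit(xb0)
* Male students: add the coefficient of male, about .6958
display invlogit(xb0 + .4145366)
```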

The most useful command is .prchange, which calculates marginal effects (changes) and
discrete changes at the given set of values of independent variables. The help option tells how to
read the outputs. For instance, the predicted probability that a male student owns a car is .094
(0->1) higher than that of a female student, holding other variables at their means.

. prchange, help

logit: Changes in Predicted Probabilities for owncar

          min->max      0->1     -+1/2    -+sd/2  MargEfct
  income   -0.0019   -0.0023   -0.0023   -0.0004   -0.0023
     age    0.4404    0.0032    0.0555    0.0893    0.0556
    male    0.0940    0.0940    0.0932    0.0462    0.0934

               0       1
Pr(y|x)   0.3430  0.6570

          income      age     male
    x=    .61684  20.6911  .572082
sd(x)=    .17918  1.61081  .495344

  Pr(y|x): probability of observing each y for specified x values
 Avg|Chg|: average of absolute value of the change across categories
 Min->Max: change in predicted probability as x changes from its minimum to
           its maximum
     0->1: change in predicted probability as x changes from 0 to 1
    -+1/2: change in predicted probability as x changes from 1/2 unit below
           base value to 1/2 unit above
   -+sd/2: change in predicted probability as x changes from 1/2 standard
           dev below base to 1/2 standard dev above
 MargEfct: the partial derivative of the predicted probability/rate with
           respect to a given independent variable
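
In the binary logit model the marginal effect of a variable is its coefficient multiplied by p(1-p), where p is the predicted probability at the chosen values. As a sketch, using the rounded Pr(y=1|x) = .6570 from the output above (the scalar name p is an illustrative assumption):

```stata
* Marginal effect in the logit model: b * p * (1 - p), evaluated at the means
scalar p = .6570
* age, about .0556
display .2465678*p*(1 - p)
* income, about -.0023
display -.010169*p*(1 - p)
* male, about .0934
display .4145366*p*(1 - p)
```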

The SPost module also includes the .prgen command, which computes a series of predictions
by holding all variables but one interval variable constant and allowing that variable to vary
(Long and Freese 2003).

. prgen income, from(.1) to(1.5) x(male=1) rest(median) generate(ppcar)

logistic: Predicted values as income varies from .1 to 1.5.

       income        age       male
x=  .58200002         21          1

The above command computes predicted probabilities that male students own cars when income
changes from $100 through $1,500, holding age at its median of 21, and stores them into a new
variable ppcar.


2.3 Using the SAS LOGISTIC and PROBIT Procedures

SAS has several procedures for the binary logit model, such as the LOGISTIC, PROBIT,
GENMOD, and QLIM procedures. The LOGISTIC procedure is commonly used for the binary
logit model, but the PROBIT procedure also estimates the binary logit. Let us first consider the
LOGISTIC procedure.

PROC LOGISTIC DESCENDING DATA = masil.students;
MODEL owncar = income age male;
RUN;


The LOGISTIC Procedure

Model Information

Data Set MASIL.STUDENTS
Response Variable owncar
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring


Number of Observations Read 437
Number of Observations Used 437


Response Profile

Ordered Total
Value owncar Frequency

1 1 284
2 0 153

Probability modeled is owncar=1.


Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.


Model Fit Statistics

Intercept
Intercept and
Criterion Only Covariates

AIC 567.930 555.695
SC 572.010 572.015
-2 Log L 565.930 547.695


Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq

Likelihood Ratio 18.2351 3 0.0004
Score 17.4697 3 0.0006
Wald 16.7977 3 0.0008


Analysis of Maximum Likelihood Estimates

Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -4.6827 1.4745 10.0855 0.0015
income 1 -0.0102 0.5736 0.0003 0.9859
age 1 0.2466 0.0695 12.5686 0.0004
male 1 0.4145 0.2056 4.0670 0.0437


Odds Ratio Estimates

Point 95% Wald
Effect Estimate Confidence Limits

income 0.990 0.322 3.046
age 1.280 1.117 1.467
male 1.514 1.012 2.265


Association of Predicted Probabilities and Observed Responses

Percent Concordant 58.9 Somers' D 0.246
Percent Discordant 34.3 Gamma 0.264
Percent Tied 6.8 Tau-a 0.112
Pairs 43452 c 0.623

The SAS LOGISTIC, PROBIT, and GENMOD procedures by default use the smaller value of the
dependent variable as success. Thus, the magnitudes of the coefficients remain the same, but their
signs are opposite to those of the QLIM procedure, STATA, and LIMDEP. The DESCENDING
option forces SAS to use the larger value as success. Alternatively, you may explicitly specify the
category of the successful event using the EVENT option as follows.

PROC LOGISTIC DESCENDING DATA = masil.students;
MODEL owncar(EVENT=1) = income age male;
RUN;

The SAS LOGISTIC procedure computes odds changes when independent variables increase by
the units specified in the UNITS statement. The SD below indicates a one standard deviation
increase in income and age; a number gives the change directly (e.g., -2 means a two-unit
decrease in the independent variable).

PROC LOGISTIC DESCENDING DATA = masil.students;
MODEL owncar = income age male;
UNITS income=SD age=SD;
RUN;

The UNITS statement adds the Adjusted Odds Ratios to the end of the output above. Note
that the odds changes of the two variables are identical to those under e^bStdX in the
previous SPost .listcoef output.

Adjusted Odds Ratios

Effect Unit Estimate

income 0.1792 0.998
age 1.6108 1.488
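
The adjusted odds ratios can be reproduced by hand, since the factor change in the odds for a change of delta units in x is exp(b*delta). The following Python lines are a small check under that formula, using the rounded coefficients and standard deviations printed in the outputs above; the variable names are chosen for illustration.

```python
import math

# Rounded logit coefficients and standard deviations (the "Unit" column above).
coef = {"income": -0.0102, "age": 0.2466}
sd = {"income": 0.1792, "age": 1.6108}

def odds_ratio(b, delta):
    """Factor change in the odds when x increases by delta units."""
    return math.exp(b * delta)

# Odds change for a one standard deviation increase in each variable.
adjusted = {name: odds_ratio(coef[name], sd[name]) for name in coef}

for name, value in adjusted.items():
    print(f"{name}: unit={sd[name]:.4f}  odds ratio={value:.3f}")
```

With delta set to 1, the same function gives the ordinary odds ratios listed under Odds Ratio Estimates.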

Now, let us use the PROBIT procedure to estimate the same binary logit model. The PROBIT
procedure requires the CLASS statement to list categorical variables. The /DIST=LOGISTIC
option indicates the probability distribution to be used in maximum likelihood estimation.

PROC PROBIT DATA = masil.students;
CLASS owncar;
MODEL owncar = income age male /DIST=LOGISTIC;
RUN;

Probit Procedure

Model Information

Data Set MASIL.STUDENTS
Dependent Variable owncar
Number of Observations 437
Name of Distribution Logistic
Log Likelihood -273.847577


Number of Observations Read 437
Number of Observations Used 437


Class Level Information

Name Levels Values

owncar 2 0 1


Response Profile

Ordered Total
Value owncar Frequency

1 0 153
2 1 284

PROC PROBIT is modeling the probabilities of levels of owncar having LOWER Ordered Values in
the response profile table.

Algorithm converged.


Type III Analysis of Effects

Wald
Effect DF Chi-Square Pr > ChiSq

income 1 0.0003 0.9859
age 1 12.5686 0.0004
male 1 4.0670 0.0437


Analysis of Parameter Estimates

Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 4.6827 1.4745 1.7927 7.5727 10.09 0.0015
income 1 0.0102 0.5736 -1.1140 1.1343 0.00 0.9859
age 1 -0.2466 0.0695 -0.3829 -0.1103 12.57 0.0004
male 1 -0.4145 0.2056 -0.8174 -0.0117 4.07 0.0437

Unlike LOGISTIC, PROBIT does not have the DESCENDING option. Thus, you have to switch
the signs of coefficients when comparing with those of STATA and LIMDEP. The PROBIT
procedure also does not have the UNITS statement to compute changes in odds.
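
The reason a sign switch is all that is needed follows from the symmetry of the logistic distribution: modeling Pr(y=0) with coefficients -b is the same as modeling Pr(y=1) with coefficients b. A minimal Python check, using the rounded estimates above and a hypothetical student profile invented for the example:

```python
import math

def logistic(z):
    """Cumulative logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

# Rounded estimates with owncar=1 as success: intercept, income, age, male.
b = [-4.6827, -0.0102, 0.2466, 0.4145]
# Hypothetical student (illustration only): constant, income, age, male.
x = [1.0, 0.6, 21.0, 1.0]

xb = sum(bi * xi for bi, xi in zip(b, x))
p_own = logistic(xb)    # Pr(owncar=1) from the coefficients as reported by LOGISTIC
p_not = logistic(-xb)   # Pr(owncar=0) from the sign-switched coefficients of PROBIT

print(f"Pr(owncar=1)={p_own:.4f}  Pr(owncar=0)={p_not:.4f}  sum={p_own + p_not:.4f}")
```

The same argument holds for the probit model because the standard normal distribution is also symmetric around zero.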


2.4 Using the SAS GENMOD and QLIM Procedures

The GENMOD procedure provides flexible methods for estimating generalized linear models. The
DISTRIBUTION (DIST) and LINK=LOGIT options respectively specify a probability
distribution and a link function.

PROC GENMOD DATA = masil.students DESC;
MODEL owncar = income age male /DIST=BINOMIAL LINK=LOGIT;
RUN;

The GENMOD Procedure

Model Information

Data Set MASIL.STUDENTS
Distribution Binomial
Link Function Logit
Dependent Variable owncar


Number of Observations Read 437
Number of Observations Used 437
Number of Events 284
Number of Trials 437


Response Profile

Ordered Total
Value owncar Frequency

1 1 284
2 0 153

PROC GENMOD is modeling the probability that owncar='1'.


Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 433 547.6952 1.2649
Scaled Deviance 433 547.6952 1.2649
Pearson Chi-Square 433 436.4352 1.0079
Scaled Pearson X2 433 436.4352 1.0079
Log Likelihood -273.8476


Algorithm converged.


Analysis Of Parameter Estimates

Standard Wald 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 -4.6827 1.4745 -7.5727 -1.7927 10.09 0.0015
income 1 -0.0102 0.5736 -1.1343 1.1140 0.00 0.9859
age 1 0.2466 0.0695 0.1103 0.3829 12.57 0.0004
male 1 0.4145 0.2056 0.0117 0.8174 4.07 0.0437
Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

If you have categorical (string) independent variables, list the variables in the CLASS statement
without creating dummy variables.

PROC GENMOD DATA = masil.students DESC;
CLASS male;
MODEL owncar = income age male /DIST=BINOMIAL LINK=LOGIT;
RUN;

Users may also provide their own link functions using the FWDLINK and INVLINK statements
instead of the LINK=LOGIT option.

PROC GENMOD DATA = masil.students DESC;
FWDLINK link=LOG(_MEAN_/(1-_MEAN_));
INVLINK invlink=1/(1+EXP(-1*_XBETA_));
MODEL owncar = income age male /DIST=BINOMIAL;
RUN;

All three GENMOD examples discussed so far produce identical results.

The QLIM procedure estimates not only logit and probit models, but also censored, truncated,
and sample-selection models. You may specify the characteristics of the dependent variable either
in the ENDOGENOUS statement or as an option of the MODEL statement.

PROC QLIM DATA=masil.students;
MODEL owncar = income age male;
ENDOGENOUS owncar ~ DISCRETE (DIST=LOGIT);
RUN;

Or,

PROC QLIM DATA=masil.students;
MODEL owncar = income age male /DISCRETE (DIST=LOGIT);
RUN;

The QLIM Procedure

Discrete Response Profile of owncar

Index Value Frequency Percent

1 0 153 35.01
2 1 284 64.99


Model Fit Summary

Number of Endogenous Variables 1
Endogenous Variable owncar
Number of Observations 437
Log Likelihood -273.84758
Maximum Absolute Gradient 9.63219E-6
Number of Iterations 8
AIC 555.69515
Schwarz Criterion 572.01489


Goodness-of-Fit Measures

Measure Value Formula

Likelihood Ratio (R) 18.235 2 * (LogL - LogL0)
Upper Bound of R (U) 565.93 - 2 * LogL0
Aldrich-Nelson 0.0401 R / (R+N)
Cragg-Uhler 1 0.0409 1 - exp(-R/N)
Cragg-Uhler 2 0.0563 (1-exp(-R/N)) / (1-exp(-U/N))
Estrella 0.0415 1 - (1-R/U)^(U/N)
Adjusted Estrella 0.0234 1 - ((LogL-K)/LogL0)^(-2/N*LogL0)
McFadden's LRI 0.0322 R / U
Veall-Zimmermann 0.071 (R * (U+N)) / (U * (R+N))
McKelvey-Zavoina 0.1699

N = # of observations, K = # of regressors

Algorithm converged.


Parameter Estimates

Standard Approx
Parameter Estimate Error t Value Pr > |t|

Intercept -4.682741 1.474519 -3.18 0.0015
income -0.010169 0.573553 -0.02 0.9859
age 0.246568 0.069549 3.55 0.0004
male 0.414537 0.205553 2.02 0.0437
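
The goodness-of-fit measures in the QLIM output can be recomputed from the formulas printed in its Formula column. The Python lines below are a sketch of that arithmetic; the intercept-only log likelihood (-282.96512) is taken from the STATA and LIMDEP outputs in this document, and the variable names are chosen for illustration.

```python
import math

logL = -273.84758    # log likelihood of the fitted model
logL0 = -282.96512   # log likelihood of the intercept-only model
N = 437              # number of observations

R = 2 * (logL - logL0)   # likelihood ratio
U = -2 * logL0           # upper bound of R

measures = {
    "Likelihood Ratio (R)": R,
    "Upper Bound of R (U)": U,
    "Aldrich-Nelson": R / (R + N),
    "Cragg-Uhler 1": 1 - math.exp(-R / N),
    "Cragg-Uhler 2": (1 - math.exp(-R / N)) / (1 - math.exp(-U / N)),
    "Estrella": 1 - (1 - R / U) ** (U / N),
    "McFadden's LRI": R / U,
    "Veall-Zimmermann": (R * (U + N)) / (U * (R + N)),
}

for name, value in measures.items():
    print(f"{name:22s} {value:10.4f}")
```

Adjusted Estrella and McKelvey-Zavoina are left out because they need quantities beyond the two log likelihoods and N.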

Finally, the CATMOD procedure fits the logit model to functions of categorical response
variables. This procedure, however, produces slightly different estimates from those of the
other procedures discussed so far and is, therefore, less recommended for the binary logit
model. The DIRECT statement specifies the interval or ratio variables used in the MODEL
statement. The /NOPROFILE option suppresses the display of the population profiles and the
response profiles.

PROC CATMOD DATA = masil.students;
DIRECT income age;
MODEL owncar = income age male /NOPROFILE;
RUN;


2.5 Binary Logit in LIMDEP (Logit$)

The Logit$ command in LIMDEP estimates various logit models. The dependent variable is
specified in the Lhs$ (left-hand side) subcommand and the list of independent variables in the
Rhs$ (right-hand side) subcommand. You have to include ONE explicitly for the intercept. The
Marginal Effects$ and Means$ subcommands compute marginal effects at the mean values of the
independent variables.

LOGIT;
Lhs=owncar;
Rhs=ONE,income,age,male;
Marginal Effects; Means$

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Multinomial Logit Model |
| Maximum Likelihood Estimates |
| Model estimated: Sep 17, 2005 at 05:31:28PM.|
| Dependent variable OWNCAR |
| Weighting variable None |
| Number of observations 437 |
| Iterations completed 5 |
| Log likelihood function -273.8476 |
| Restricted log likelihood -282.9651 |
| Chi squared 18.23509 |
| Degrees of freedom 3 |
| Prob[ChiSqd > value] = .3933723E-03 |
| Hosmer-Lemeshow chi-squared = 8.44648 |
| P-value= .39111 with deg.fr. = 8 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Characteristics in numerator of Prob[Y = 1]
Constant -4.682741385 1.4745190 -3.176 .0015
INCOME -.1016896029E-01 .57355331 -.018 .9859 .61683982
AGE .2465677833 .69549211E-01 3.545 .0004 20.691076
MALE .4145365774 .20555276 2.017 .0437 .57208238
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+--------------------------------------------------------------------+
| Information Statistics for Discrete Choice Model. |
| M=Model MC=Constants Only M0=No Model |
| Criterion F (log L) -273.84758 -282.96512 -302.90532 |
| LR Statistic vs. MC 18.23509 .00000 .00000 |
| Degrees of Freedom 3.00000 .00000 .00000 |
| Prob. Value for LR .00039 .00000 .00000 |
| Entropy for probs. 273.84758 282.96512 302.90532 |
| Normalized Entropy .90407 .93417 1.00000 |
| Entropy Ratio Stat. 58.11548 39.88039 .00000 |
| Bayes Info Criterion 565.93495 584.17004 624.05044 |
| BIC - BIC(no model) 58.11548 39.88039 .00000 |
| Pseudo R-squared .03222 .00000 .00000 |
| Pct. Correct Prec. 63.84439 .00000 50.00000 |
| Means: y=0 y=1 y=2 y=3 yu=4 y=5, y=6 y>=7 |
| Outcome .3501 .6499 .0000 .0000 .0000 .0000 .0000 .0000 |
| Pred.Pr .3501 .6499 .0000 .0000 .0000 .0000 .0000 .0000 |
| Notes: Entropy computed as Sum(i)Sum(j)Pfit(i,j)*logPfit(i,j). |
| Normalized entropy is computed against M0. |
| Entropy ratio statistic is computed against M0. |
| BIC = 2*criterion - log(N)*degrees of freedom. |
| If the model has only constants or if it has no constants, |
| the statistics reported here are not useable. |
+--------------------------------------------------------------------+

+-------------------------------------------+
| Partial derivatives of probabilities with |
| respect to the vector of characteristics. |
| They are computed at the means of the Xs. |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Characteristics in numerator of Prob[Y = 1]
Constant -1.055282283 .33183024 -3.180 .0015
INCOME -.2291632775E-02 .12925338 -.018 .9859 .61683982
AGE .5556544593E-01 .15534022E-01 3.577 .0003 20.691076
Marginal effect for dummy variable is P|1 - P|0.
MALE .9403411023E-01 .46726710E-01 2.012 .0442 .57208238
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+----------------------------------------+
| Fit Measures for Binomial Choice Model |
| Logit model for variable OWNCAR |
+----------------------------------------+
| Proportions P0= .350114 P1= .649886 |
| N = 437 N0= 153 N1= 284 |
| LogL = -273.84758 LogL0 = -282.9651 |
| Estrella = 1-(L/L0)^(-2L0/n) = .04153 |
+----------------------------------------+
| Efron | McFadden | Ben./Lerman |
| .03963 | .03222 | .56318 |
| Cramer | Veall/Zim. | Rsqrd_ML |
| .04010 | .07099 | .04087 |
+----------------------------------------+
| Information Akaike I.C. Schwarz I.C. |
| Criteria 1.27161 572.01489 |
+----------------------------------------+
Frequencies of actual & predicted outcomes
Predicted outcome has maximum probability.
Threshold value for predicting Y=1 = .5000
Predicted
------ ---------- + -----
Actual 0 1 | Total
------ ---------- + -----
0 21 132 | 153
1 26 258 | 284
------ ---------- + -----
Total 47 390 | 437

Note that the marginal effects above are identical to those of the SPost .prchange command in
section 2.2. LIMDEP computes discrete changes for binary variables like male.
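
The partial derivatives in the LIMDEP output can be verified by hand. For the logit model the marginal effect of a continuous variable is p(1-p)b evaluated at the means of the Xs, and for a dummy variable LIMDEP reports the discrete change P|1 - P|0. The following Python lines are a sketch of this calculation, not LIMDEP's own code; the coefficients and means are copied from the output above.

```python
import math

def logistic(z):
    """Cumulative logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

# Logit coefficients and sample means copied from the LIMDEP output above.
b = {"ONE": -4.682741385, "income": -0.01016896029,
     "age": 0.2465677833, "male": 0.4145365774}
means = {"ONE": 1.0, "income": 0.61683982, "age": 20.691076, "male": 0.57208238}

# Index function and predicted probability at the means of the Xs.
xb = sum(b[k] * means[k] for k in b)
p = logistic(xb)

# Continuous variables (and the constant): dP/dx = p * (1 - p) * b.
marginal = {k: p * (1 - p) * b[k] for k in ("ONE", "income", "age")}

# Dummy variable: discrete change P|1 - P|0 with the other variables at their means.
xb_rest = xb - b["male"] * means["male"]
marginal["male"] = logistic(xb_rest + b["male"]) - logistic(xb_rest)

for name, value in marginal.items():
    print(f"{name:7s} {value: .6f}")
```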


2.6 Binary Logit in SPSS

SPSS has the Logistic regression command for the binary logit model.

LOGISTIC REGRESSION VAR=owncar
/METHOD=ENTER income age male
/CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

Table 3 summarizes parameter estimates and goodness-of-fit statistics across procedures and
commands for the binary logit model. The estimates and their standard errors are almost
identical except for some rounding errors. As shown in Table 3, the QLIM and LOGISTIC
procedures are recommended for categorical dependent variables. Note that the PROBIT
procedure returns estimates with the opposite signs.

Table 3. Parameter Estimates and Goodness-of-fit Statistics of the Binary Logit Model
                    LOGISTIC      PROBIT      GENMOD        QLIM       STATA      LIMDEP
Intercept            -4.6827      4.6827     -4.6827     -4.6827     -4.6827     -4.6827
                    (1.4745)    (1.4745)    (1.4745)    (1.4745)    (1.4745)    (1.4745)
income                -.0102       .0102      -.0102      -.0102      -.0102      -.0102
                     (.5736)     (.5736)     (.5736)     (.5736)     (.5736)     (.5736)
age                    .2466      -.2466       .2466       .2466       .2466       .2466
                     (.0695)     (.0695)     (.0695)     (.0695)     (.0695)     (.0695)
male                   .4145      -.4145       .4145       .4145       .4145       .4145
                     (.2056)     (.2056)     (.2056)     (.2056)     (.2056)     (.2056)
Log likelihood      547.695*   -273.8476   -273.8476   -273.8476   -273.8476   -273.8476
Likelihood test      18.2351                              18.235       18.24     18.2351
Pseudo R2                                                  .0322       .0322       .0322
AIC                555.695**                          555.6952**                  1.2716
Schwarz              572.015                            572.0149                572.0150
BIC                                                                             565.9350
* The LOGISTIC procedure reports (-2*log likelihood).
** AIC*N
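
The information criteria in Table 3 differ across packages only in scaling. As a sketch of the arithmetic, assuming the usual definitions AIC = -2*logL + 2k and Schwarz = -2*logL + k*ln(N) with k = 4 estimated parameters, the reported values can be recovered from the log likelihood as follows.

```python
import math

logL = -273.84758   # log likelihood of the binary logit model
k = 4               # estimated parameters: intercept, income, age, male
N = 437             # number of observations

minus2logL = -2 * logL                 # what the LOGISTIC procedure reports
aic = minus2logL + 2 * k               # AIC as reported by LOGISTIC and QLIM
aic_per_obs = aic / N                  # AIC divided by N, as reported by LIMDEP
schwarz = minus2logL + k * math.log(N)

print(f"-2 Log L = {minus2logL:.3f}")
print(f"AIC      = {aic:.3f}  (AIC/N = {aic_per_obs:.4f})")
print(f"Schwarz  = {schwarz:.3f}")
```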
3. The Binary Probit Regression Model

The probit model is represented as $\mathrm{Prob}(y=1 \mid x) = \Phi(x\beta)$, where $\Phi$ indicates the
cumulative standard normal probability distribution function.


3.1 Binary Probit in STATA (.probit)

STATA has the .probit command to estimate the binary probit regression model.

. probit owncar income age male

Iteration 0:   log likelihood = -282.96512
Iteration 1:   log likelihood = -273.84832
Iteration 2:   log likelihood = -273.81741
Iteration 3:   log likelihood = -273.81741

Probit regression                                 Number of obs   =        437
                                                  LR chi2(3)      =      18.30
                                                  Prob > chi2     =     0.0004
Log likelihood = -273.81741                       Pseudo R2       =     0.0323

------------------------------------------------------------------------------
      owncar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .0005613   .3476842     0.00   0.999    -.6808873    .6820098
         age |   .1487005   .0409837     3.63   0.000      .068374    .2290271
        male |   .2579112   .1256085     2.05   0.040     .0117231    .5040993
       _cons |  -2.823671   .8730955    -3.23   0.001    -4.534907   -1.112435
------------------------------------------------------------------------------

In order to get standardized estimates and factor changes, run the SPost .listcoef command.

. listcoef

probit (N=437): Unstandardized and Standardized Estimates

 Observed SD: .47755228
   Latent SD: 1.0371456

-------------------------------------------------------------------------------
      owncar |       b         z     P>|z|    bStdX    bStdY   bStdXY     SDofX
-------------+-----------------------------------------------------------------
      income |   0.00056     0.002   0.999   0.0001   0.0005   0.0001    0.1792
         age |   0.14870     3.628   0.000   0.2395   0.1434   0.2309    1.6108
        male |   0.25791     2.053   0.040   0.1278   0.2487   0.1232    0.4953
-------------------------------------------------------------------------------

You may compute the marginal effects and discrete changes using the SPost .prchange command.

. prchange, x(income=1 age=21 male=0)

probit: Changes in Predicted Probabilities for owncar

          min->max      0->1     -+1/2    -+sd/2  MargEfct
 income     0.0002    0.0002    0.0002    0.0000    0.0002
    age     0.4900    0.0014    0.0567    0.0912    0.0567
   male     0.0937    0.0937    0.0981    0.0487    0.0984

               0       1
Pr(y|x)   0.3822  0.6178

          income      age     male
    x=         1       21        0
sd(x)=    .17918  1.61081  .495344
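
The figures in the .prchange output follow directly from the probit formula: the predicted probability is the standard normal CDF of the index, the marginal effect is the normal density of the index times the coefficient, and the 0->1 change is a difference of two predicted probabilities. The Python lines below are a rough check of these numbers using the STATA estimates above; the helper names are chosen for illustration.

```python
import math

def norm_cdf(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# Probit estimates from the STATA output above.
b = {"_cons": -2.823671, "income": 0.0005613, "age": 0.1487005, "male": 0.2579112}
# Values given in the x() option: income=1, age=21, male=0.
x = {"income": 1, "age": 21, "male": 0}

xb = b["_cons"] + sum(b[k] * x[k] for k in x)
p1 = norm_cdf(xb)                                   # Pr(owncar=1 | x)
marg = {k: norm_pdf(xb) * b[k] for k in x}          # MargEfct column
male_0_to_1 = norm_cdf(xb + b["male"]) - p1         # 0->1 change for male

print(f"Pr(y=1|x) = {p1:.4f}, Pr(y=0|x) = {1 - p1:.4f}")
for name, value in marg.items():
    print(f"MargEfct {name:6s} = {value:.4f}")
print(f"male 0->1 = {male_0_to_1:.4f}")
```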


3.2 Using the PROBIT and LOGISTIC Procedures

The PROBIT and LOGISTIC procedures estimate the binary probit model. Keep in mind that the
coefficients of PROBIT have the opposite signs.

PROC PROBIT DATA = masil.students;
CLASS owncar;
MODEL owncar = income age male;
RUN;


Probit Procedure

Model Information

Data Set MASIL.STUDENTS
Dependent Variable owncar
Number of Observations 437
Name of Distribution Normal
Log Likelihood -273.8174115


Number of Observations Read 437
Number of Observations Used 437


Class Level Information

Name Levels Values

owncar 2 0 1


Response Profile

Ordered Total
Value owncar Frequency

1 0 153
2 1 284

PROC PROBIT is modeling the probabilities of levels of owncar having LOWER Ordered Values in
the response profile table.


Algorithm converged.


Type III Analysis of Effects

Wald
Effect DF Chi-Square Pr > ChiSq

income 1 0.0000 0.9987
age 1 13.1644 0.0003
male 1 4.2160 0.0400


Analysis of Parameter Estimates

Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 2.8237 0.8731 1.1124 4.5349 10.46 0.0012
income 1 -0.0006 0.3477 -0.6820 0.6809 0.00 0.9987
age 1 -0.1487 0.0410 -0.2290 -0.0684 13.16 0.0003
male 1 -0.2579 0.1256 -0.5041 -0.0117 4.22 0.0400

To fit the probit model, the LOGISTIC procedure requires the normal probability distribution
as its link function (/LINK=PROBIT or /LINK=NORMIT).

PROC LOGISTIC DATA = masil.students DESC;
MODEL owncar = income age male /LINK=PROBIT;
RUN;

The LOGISTIC Procedure

Model Information

Data Set MASIL.STUDENTS
Response Variable owncar
Number of Response Levels 2
Model binary probit
Optimization Technique Fisher's scoring


Number of Observations Read 437
Number of Observations Used 437


Response Profile

Ordered Total
Value owncar Frequency

1 1 284
2 0 153

Probability modeled is owncar=1.


Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.


Model Fit Statistics

Intercept
Intercept and
Criterion Only Covariates

AIC 567.930 555.635
SC 572.010 571.955
-2 Log L 565.930 547.635


Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq

Likelihood Ratio 18.2954 3 0.0004
Score 17.4697 3 0.0006
Wald 17.4690 3 0.0006


Analysis of Maximum Likelihood Estimates

Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -2.8237 0.8796 10.3048 0.0013
income 1 0.000548 0.3496 0.0000 0.9987
age 1 0.1487 0.0413 12.9602 0.0003
male 1 0.2579 0.1257 4.2096 0.0402


Association of Predicted Probabilities and Observed Responses

Percent Concordant 57.8 Somers' D 0.249
Percent Discordant 32.9 Gamma 0.274
Percent Tied 9.3 Tau-a 0.113
Pairs 43452 c 0.624


3.3 Using the GENMOD and QLIM Procedures

The GENMOD procedure also estimates the binary probit model using the /DIST=BINOMIAL
and /LINK=PROBIT options in the MODEL statement.

PROC GENMOD DATA = masil.students DESC;
MODEL owncar = income age male /DIST=BINOMIAL LINK=PROBIT;
RUN;

The GENMOD Procedure

Model Information

Data Set MASIL.STUDENTS
Distribution Binomial
Link Function Probit
Dependent Variable owncar


Number of Observations Read 437
Number of Observations Used 437
Number of Events 284
Number of Trials 437


Response Profile

Ordered Total
Value owncar Frequency

1 1 284
2 0 153

PROC GENMOD is modeling the probability that owncar='1'.


Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 433 547.6348 1.2647
Scaled Deviance 433 547.6348 1.2647
Pearson Chi-Square 433 437.0270 1.0093
Scaled Pearson X2 433 437.0270 1.0093
Log Likelihood -273.8174

Algorithm converged.


Analysis Of Parameter Estimates

Standard Wald 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 -2.8237 0.8731 -4.5349 -1.1124 10.46 0.0012
income 1 0.0006 0.3477 -0.6809 0.6820 0.00 0.9987
age 1 0.1487 0.0410 0.0684 0.2290 13.16 0.0003
male 1 0.2579 0.1256 0.0117 0.5041 4.22 0.0400
Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

The QLIM procedure provides various goodness-of-fit statistics. The DIST=NORMAL option
indicates the normal probability distribution used in estimation.

PROC QLIM DATA=masil.students;
MODEL owncar = income age male /DISCRETE (DIST=NORMAL);
RUN;


The QLIM Procedure

Discrete Response Profile of owncar

Index Value Frequency Percent

1 0 153 35.01
2 1 284 64.99


Model Fit Summary

Number of Endogenous Variables 1
Endogenous Variable owncar
Number of Observations 437
Log Likelihood -273.81741
Maximum Absolute Gradient 3.82848E-8
Number of Iterations 10
AIC 555.63482
Schwarz Criterion 571.95456


Goodness-of-Fit Measures

Measure Value Formula

Likelihood Ratio (R) 18.295 2 * (LogL - LogL0)
Upper Bound of R (U) 565.93 - 2 * LogL0
Aldrich-Nelson 0.0402 R / (R+N)
Cragg-Uhler 1 0.041 1 - exp(-R/N)
Cragg-Uhler 2 0.0565 (1-exp(-R/N)) / (1-exp(-U/N))
Estrella 0.0417 1 - (1-R/U)^(U/N)
Adjusted Estrella 0.0235 1 - ((LogL-K)/LogL0)^(-2/N*LogL0)
McFadden's LRI 0.0323 R / U
Veall-Zimmermann 0.0712 (R * (U+N)) / (U * (R+N))
McKelvey-Zavoina 0.0702

N = # of observations, K = # of regressors

Algorithm converged.


Parameter Estimates

Standard Approx
Parameter Estimate Error t Value Pr > |t|

Intercept -2.823671 0.873096 -3.23 0.0012
income 0.000561 0.347684 0.00 0.9987
age 0.148701 0.040984 3.63 0.0003
male 0.257911 0.125608 2.05 0.0400


3.4 Binary Probit in LIMDEP (Probit$)

The LIMDEP Probit$ command estimates various probit models. Do not forget to include
ONE for the intercept.

PROBIT;
Lhs=owncar;
Rhs=ONE,income,age,male$

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Binomial Probit Model |
| Maximum Likelihood Estimates |
| Model estimated: Sep 17, 2005 at 10:28:56PM.|
| Dependent variable OWNCAR |
| Weighting variable None |
| Number of observations 437 |
| Iterations completed 4 |
| Log likelihood function -273.8174 |
| Restricted log likelihood -282.9651 |
| Chi squared 18.29542 |
| Degrees of freedom 3 |
| Prob[ChiSqd > value] = .3822542E-03 |
| Hosmer-Lemeshow chi-squared = 8.18372 |
| P-value= .41573 with deg.fr. = 8 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant -2.823670829 .87309548 -3.234 .0012
INCOME .5612515407E-03 .34768423 .002 .9987 .61683982
AGE .1487005234 .40983697E-01 3.628 .0003 20.691076
MALE .2579111914 .12560848 2.053 .0400 .57208238
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+----------------------------------------+
| Fit Measures for Binomial Choice Model |
| Probit model for variable OWNCAR |
+----------------------------------------+
| Proportions P0= .350114 P1= .649886 |
| N = 437 N0= 153 N1= 284 |
| LogL = -273.81741 LogL0 = -282.9651 |
| Estrella = 1-(L/L0)^(-2L0/n) = .04166 |
+----------------------------------------+
| Efron | McFadden | Ben./Lerman |
| .03984 | .03233 | .56327 |
| Cramer | Veall/Zim. | Rsqrd_ML |
| .04016 | .07121 | .04100 |
+----------------------------------------+
| Information Akaike I.C. Schwarz I.C. |
| Criteria 1.27148 571.95456 |
+----------------------------------------+
Frequencies of actual & predicted outcomes
Predicted outcome has maximum probability.
Threshold value for predicting Y=1 = .5000
Predicted
------ ---------- + -----
Actual 0 1 | Total
------ ---------- + -----
0 5 148 | 153
1 8 276 | 284
------ ---------- + -----
Total 13 424 | 437


3.5 Binary Probit in SPSS

SPSS has the Probit command to fit the binary probit model. This command requires a variable
(e.g., n in the following example) with a constant value of 1.

COMPUTE n=1.
PROBIT owncar OF n WITH income age male
/LOG NONE /MODEL PROBIT
/PRINT FREQ /CRITERIA ITERATE(20) STEPLIMIT(.1).

Table 4 summarizes parameter estimates and goodness-of-fit statistics produced. Note that the
LOGISTIC procedure reports slightly different estimates and standard errors. I would
recommend the SAS QLIM procedure, STATA, and LIMDEP for the binary probit model.

Table 4. Parameter Estimates and Goodness-of-fit Statistics of the Binary Probit Model
                    LOGISTIC      PROBIT      GENMOD        QLIM       STATA      LIMDEP
Intercept            -2.8237      2.8237     -2.8237     -2.8237     -2.8237     -2.8237
                     (.8796)     (.8731)     (.8731)     (.8731)     (.8731)     (.8731)
income                 .0005      -.0006       .0006       .0006       .0006       .0006
                     (.3496)     (.3477)     (.3477)     (.3477)     (.3477)     (.3477)
age                    .1487      -.1487       .1487       .1487       .1487       .1487
                     (.0413)     (.0410)     (.0410)     (.0410)     (.0410)     (.0410)
male                   .2579      -.2579       .2579       .2579       .2579       .2579
                     (.1257)     (.1256)     (.1256)     (.1256)     (.1256)     (.1256)
Log likelihood      547.635*   -273.8174   -273.8174   -273.8174   -273.8174   -273.8174
Likelihood test      18.2954                              18.295       18.30     18.2954
Pseudo R2                                                  .0323       .0323       .0323
AIC                555.635**                          555.6348**                  1.2715
Schwarz              571.955                            571.9546                571.9546
BIC
* The LOGISTIC procedure reports (-2*log likelihood).
** AIC*N




4. Bivariate Probit/Logit Regression Models

Bivariate regression models have two equations, one for each of two dependent variables. This
chapter explains the bivariate regression model with two binary dependent variables. Like the
seemingly unrelated regression (SUR) model, bivariate probit/logit models assume that the
independent, identically distributed errors are correlated across the two equations (Greene 2003).

The bivariate probit model, although relatively time-consuming to estimate, is more likely to
converge than the bivariate logit model. SAS supports both the bivariate probit and logit models,
while STATA and LIMDEP estimate the bivariate probit model. Here we consider a model for car
ownership (owncar) and housing type (offcamp).


4.1 Bivariate Probit in STATA (.biprobit)

STATA has the .biprobit command to estimate the bivariate probit model. The two dependent
variables precede a set of independent variables.

. biprobit owncar offcamp income age male

Fitting comparison equation 1:

Iteration 0:   log likelihood = -282.96512
Iteration 1:   log likelihood = -273.84832
Iteration 2:   log likelihood = -273.81741
Iteration 3:   log likelihood = -273.81741

Fitting comparison equation 2:

Iteration 0:   log likelihood =  -54.97403
Iteration 1:   log likelihood = -45.919608
Iteration 2:   log likelihood = -43.685448
Iteration 3:   log likelihood =  -43.32265
Iteration 4:   log likelihood = -43.309675
Iteration 5:   log likelihood = -43.309654

Comparison:    log likelihood = -317.12707

Fitting full model:

Iteration 0:   log likelihood = -317.12707
Iteration 1:   log likelihood = -307.15684
Iteration 2:   log likelihood = -306.49535
Iteration 3:   log likelihood = -306.46018
Iteration 4:   log likelihood = -306.45493
Iteration 5:   log likelihood = -306.45408
Iteration 6:   log likelihood = -306.45395
Iteration 7:   log likelihood = -306.45392

Bivariate probit regression                       Number of obs   =        437
                                                  Wald chi2(6)    =      30.13
Log likelihood = -306.45392                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
owncar       |
      income |  -.0017168    .347905    -0.00   0.996    -.6835982    .6801645
         age |   .1492475   .0409238     3.65   0.000     .0690383    .2294568
        male |   .2594624   .1255633     2.07   0.039     .0133628     .505562
       _cons |  -2.834625   .8719679    -3.25   0.001    -4.543651   -1.125599
-------------+----------------------------------------------------------------
offcamp      |
      income |   .7519064   .8254937     0.91   0.362    -.8660316    2.369844
         age |   .5895658    .149221     3.95   0.000      .297098    .8820336
        male |   .3939644   .2834889     1.39   0.165    -.1616637    .9495925
       _cons |  -10.34593   2.947501    -3.51   0.000    -16.12293   -4.568938
-------------+----------------------------------------------------------------
     /athrho |   2.387522   27.20167     0.09   0.930    -50.92678    55.70182
-------------+----------------------------------------------------------------
         rho |   .9832658   .9027811                            -1           1
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0:     chi2(1) =  21.3463    Prob > chi2 = 0.0000
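
Three numbers in the .biprobit output are linked by simple arithmetic: rho is the hyperbolic tangent of /athrho, the comparison log likelihood is the sum of the two separate probit log likelihoods, and the likelihood-ratio test of rho=0 compares the full model with that comparison value. The Python lines below are a sketch that reproduces them from the figures printed above.

```python
import math

# Values copied from the STATA .biprobit output above.
athrho = 2.387522      # estimated inverse hyperbolic tangent of rho
ll_eq1 = -273.81741    # separate probit log likelihood for owncar
ll_eq2 = -43.309654    # separate probit log likelihood for offcamp
ll_full = -306.45392   # log likelihood of the bivariate probit model

rho = math.tanh(athrho)                # back-transform to the correlation
ll_comparison = ll_eq1 + ll_eq2        # log likelihood under rho = 0
lr_chi2 = 2 * (ll_full - ll_comparison)

print(f"rho = {rho:.7f}")
print(f"comparison log likelihood = {ll_comparison:.5f}")
print(f"LR chi2(1) = {lr_chi2:.4f}")
```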


4.2 Bivariate Probit in SAS

The SAS QLIM procedure is able to estimate both the bivariate logit and probit models. You
need to provide two equations that may or may not have different sets of independent variables.

PROC QLIM DATA=masil.students;
MODEL owncar = income age male;
MODEL offcamp = income age male;
ENDOGENOUS owncar offcamp ~ DISCRETE(DIST=NORMAL);
RUN;

Or, simply,

PROC QLIM DATA=masil.students;
MODEL owncar offcamp = income age male /DISCRETE;
RUN;

The QLIM Procedure

Discrete Response Profile of owncar

Index Value Frequency Percent

1 0 153 35.01
2 1 284 64.99


Discrete Response Profile of offcamp

Index Value Frequency Percent

1 0 12 2.75
2 1 425 97.25


Model Fit Summary

Number of Endogenous Variables 2
Endogenous Variable owncar offcamp
Number of Observations 437
Log Likelihood -306.45392
Maximum Absolute Gradient 2.16967E-6
Number of Iterations 27
AIC 628.90784
Schwarz Criterion 661.54730

Algorithm converged.


Parameter Estimates

Standard Approx
Parameter Estimate Error t Value Pr > |t|

owncar.Intercept -2.834511 0.871964 -3.25 0.0012
owncar.income -0.001723 0.347904 -0.00 0.9960
owncar.age 0.149243 0.040924 3.65 0.0003
owncar.male 0.259462 0.125563 2.07 0.0388
offcamp.Intercept -10.345002 2.947054 -3.51 0.0004
offcamp.income 0.751837 0.825398 0.91 0.3624
offcamp.age 0.589515 0.149197 3.95 <.0001
offcamp.male 0.393859 0.283458 1.39 0.1647
_Rho 0.999990 0 . .


4.3 Bivariate Probit in LIMDEP (Bivariateprobit$)

LIMDEP has the Bivariateprobit$ command to estimate the bivariate probit model. The Lhs$
subcommand lists the two binary dependent variables, whereas Rh1$ and Rh2$ respectively
indicate the independent variables for the two dependent variables. In this model, do not
switch the order of the dependent variables (Lhs=owncar,offcamp;); keeping the order shown
below avoids convergence problems.

BIVARIATEPROBIT;
Lhs=offcamp,owncar;
Rh1=ONE,income,age,male;
Rh2= ONE,income,age,male$

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| FIML Estimates of Bivariate Probit Model |
| Maximum Likelihood Estimates |
| Model estimated: Sep 17, 2005 at 10:36:25PM.|
| Dependent variable OFFOWN |
| Weighting variable None |
| Number of observations 437 |
| Iterations completed 35 |
| Log likelihood function -306.4539 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index equation for OFFCAMP
Constant -10.34508235 3.6592558 -2.827 .0047
INCOME .7518407011 .85274898 .882 .3780 .61683982
AGE .5895189160 .18572787 3.174 .0015 20.691076
MALE .3938599470 .29308051 1.344 .1790 .57208238
Index equation for OWNCAR
Constant -2.834513147 .84825468 -3.342 .0008
INCOME -.1723102966E-02 .34222451 -.005 .9960 .61683982
AGE .1492426338 .39739762E-01 3.755 .0002 20.691076
MALE .2594618946 .12565094 2.065 .0389 .57208238
Disturbance correlation
RHO(1,2) .9941311591 .73338053E+09 .000 1.0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Joint Frequency Table: Columns=OWNCAR
Rows =OFFCAMP

(N) = Count of Fitted Values

0 1 TOTAL

0 12 0 12
( 0) ( 0) ( 0)

1 141 284 425
( 0) ( 437) ( 437)

TOTAL 153 284 437
( 0) ( 437) ( 437)

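SAS, STATA, and LIMDEP produce almost the same parameter estimates, with slight differences
after the decimal point. The standard errors of SAS and STATA also agree closely, while those
of LIMDEP differ somewhat more (for example, 3.659 versus 2.947 for the offcamp constant).
The estimated correlation is close to its upper bound of 1 in all three packages, so its
reported standard error is not reliable.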


4.4 Bivariate Logit in SAS

The QLIM procedure also estimates the bivariate logit model using the DIST=LOGIT option.
Unfortunately, SAS fails to fit this model to the data.

PROC QLIM DATA=masil.students;
MODEL owncar = income age male;
MODEL offcamp = income age male;
ENDOGENOUS offcamp owncar ~ DISCRETE(DIST=LOGIT);
RUN;

Or,

PROC QLIM DATA=masil.students;
MODEL owncar offcamp = income age male /DISCRETE(DIST=LOGIT);
RUN;

5. Ordered Logit/Probit Regression Models

Suppose we have an ordinal dependent variable such as the degree of illegal parking (0=none,
1=sometimes, and 2=often). The ordered logit and probit models rely on the parallel regression
assumption, which is often violated in practice.


5.1 Ordered Logit/Probit in STATA (.ologit and .oprobit)

STATA has the .ologit and .oprobit commands to estimate the ordered logit and probit
models, respectively.

. ologit parking income age male

Iteration 0:   log likelihood = -103.78713
Iteration 1:   log likelihood = -92.739147
Iteration 2:   log likelihood = -90.036393
Iteration 3:   log likelihood = -89.861679
Iteration 4:   log likelihood = -89.860105
Iteration 5:   log likelihood = -89.860105

Ordered logistic regression                       Number of obs   =        437
                                                  LR chi2(3)      =      27.85
                                                  Prob > chi2     =     0.0000
Log likelihood = -89.860105                       Pseudo R2       =     0.1342

------------------------------------------------------------------------------
     parking |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |  -.5140709   1.283192    -0.40   0.689    -3.029082     2.00094
         age |  -.7362588   .1894339    -3.89   0.000    -1.107542   -.3649752
        male |  -1.227092   .4705859    -2.61   0.009    -2.149423   -.3047605
-------------+----------------------------------------------------------------
       /cut1 |  -12.74479   3.787616                     -20.16839   -5.321203
       /cut2 |  -10.83295   3.801685                     -18.28412   -3.381786
------------------------------------------------------------------------------

STATA estimates the cut points $\tau_m$, reported as /cut1 and /cut2, assuming $\beta_0 = 0$
(Long and Freese 2003). This parameterization is different from that of SAS and LIMDEP,
which assume $\tau_1 = 0$.
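
With the intercept fixed at zero, the cut points alone locate the boundaries between categories, so predicted probabilities follow directly from the estimates. The sketch below is an illustration in Python, not STATA output: it evaluates the ordered logit probabilities at the sample means, which are assumed here to be the values in the "Mean of X" column that LIMDEP reports for the same data in Section 5.5.

```python
import math

def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients and cut points copied from the .ologit output above.
beta = {"income": -0.5140709, "age": -0.7362588, "male": -1.227092}
cut1, cut2 = -12.74479, -10.83295

# Sample means (assumption: taken from LIMDEP's "Mean of X" column).
x_mean = {"income": 0.61683982, "age": 20.691076, "male": 0.57208238}

# Linear index without an intercept, because STATA fixes the intercept at zero.
xb = sum(beta[name] * x_mean[name] for name in beta)

p_none = logistic_cdf(cut1 - xb)
p_sometimes = logistic_cdf(cut2 - xb) - logistic_cdf(cut1 - xb)
p_often = 1.0 - logistic_cdf(cut2 - xb)

for label, p in [("none", p_none), ("sometimes", p_sometimes), ("often", p_often)]:
    print(f"Pr(parking = {label}) = {p:.4f}")
```

The three probabilities sum to one by construction, and the first category dominates, in line with the response frequencies of the parking variable.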

. oprobit parking income age male

Iteration 0:   log likelihood = -103.78713
Iteration 1:   log likelihood = -90.990455
Iteration 2:   log likelihood = -89.496288
Iteration 3:   log likelihood = -89.430915
Iteration 4:   log likelihood = -89.430754

Ordered probit regression                         Number of obs   =        437
                                                  LR chi2(3)      =      28.71
                                                  Prob > chi2     =     0.0000
Log likelihood = -89.430754                       Pseudo R2       =     0.1383

------------------------------------------------------------------------------
     parking |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |  -.1869839   .6116037    -0.31   0.760    -1.385705    1.011737
         age |  -.3594853   .0924817    -3.89   0.000     -.540746   -.1782246
        male |  -.5867871   .2205253    -2.66   0.008    -1.019009   -.1545655
-------------+----------------------------------------------------------------
       /cut1 |  -6.000986   1.869046                     -9.664248   -2.337724
       /cut2 |  -5.118676   1.862909                     -8.769911   -1.467442
------------------------------------------------------------------------------
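
The LR chi2 and Pseudo R2 entries in both outputs can be recovered from the iteration logs, since iteration 0 reports the log likelihood of the intercept-only model. The following Python sketch (an illustration, not part of the STATA session) assumes the usual definitions: the likelihood ratio statistic 2(lnL - lnL0) and McFadden's pseudo R-squared 1 - lnL/lnL0.

```python
# Log likelihoods copied from the iteration logs above.
ll_null = -103.78713                      # iteration 0 (intercept-only model)
ll_model = {"ologit": -89.860105, "oprobit": -89.430754}

results = {}
for name, ll in ll_model.items():
    lr_chi2 = 2.0 * (ll - ll_null)        # likelihood ratio statistic
    pseudo_r2 = 1.0 - ll / ll_null        # McFadden's pseudo R-squared
    results[name] = (lr_chi2, pseudo_r2)
    print(f"{name}: LR chi2(3) = {lr_chi2:.2f}, pseudo R2 = {pseudo_r2:.4f}")
```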


5.2 The Parallel Assumption and the Generalized Ordered Logit Model

The .brant command of SPost works only after the .ologit command. It tests the parallel
regression assumption of the ordinal regression model. The output is skipped here.

. quietly ologit parking income male

. brant

The parallel regression assumption is often violated. If this is the case, you may use the
multinomial regression model or estimate the generalized ordered logit model (GOLM) using
either the .gologit command written by Fu (1998) or the .gologit2 command by Williams
(2005). Note that Fu's module does not impose the restriction
$(\tau_j - x\beta_j) \geq (\tau_{j-1} - x\beta_{j-1})$ (Long's class note 2003).

. gologit2 parking income age male, autofit

------------------------------------------------------------------------------
Testing parallel lines assumption using the .05 level of significance...

Step  1:  male meets the pl assumption (P Value = 0.9901)
Step  2:  income meets the pl assumption (P Value = 0.8958)
Step  3:  age meets the pl assumption (P Value = 0.7964)
Step  4:  All explanatory variables meet the pl assumption

Wald test of parallel lines assumption for the final model:

 ( 1)  [0]male - [1]male = 0
 ( 2)  [0]income - [1]income = 0
 ( 3)  [0]age - [1]age = 0

           chi2(  3) =    0.04
         Prob > chi2 =    0.9982

An insignificant test statistic indicates that the final model
does not violate the proportional odds/ parallel lines assumption

If you re-estimate this exact same model with gologit2, instead
of autofit you can save time by using the parameter

pl(male income age)

------------------------------------------------------------------------------

Generalized Ordered Logit Estimates               Number of obs   =        437
                                                  Wald chi2(3)    =      21.74
                                                  Prob > chi2     =     0.0001
Log likelihood = -89.860105                       Pseudo R2       =     0.1342

 ( 1)  [0]male - [1]male = 0
 ( 2)  [0]income - [1]income = 0
 ( 3)  [0]age - [1]age = 0
------------------------------------------------------------------------------
     parking |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
0            |
      income |  -.5140709   1.283192    -0.40   0.689    -3.029082     2.00094
         age |  -.7362588   .1894339    -3.89   0.000    -1.107543   -.3649752
        male |  -1.227092   .4705859    -2.61   0.009    -2.149423   -.3047605
       _cons |   12.74479   3.787616     3.36   0.001     5.321202    20.16839
-------------+----------------------------------------------------------------
1            |
      income |  -.5140709   1.283192    -0.40   0.689    -3.029082     2.00094
         age |  -.7362588   .1894339    -3.89   0.000    -1.107543   -.3649752
        male |  -1.227092   .4705859    -2.61   0.009    -2.149423   -.3047605
       _cons |   10.83295   3.801686     2.85   0.004     3.381785    18.28412
------------------------------------------------------------------------------


5.3 Ordered Logit in SAS

The QLIM, LOGISTIC, and PROBIT procedures estimate ordered logit and probit models. As
shown in Tables 3 and 4, the QLIM procedure is most recommended. Note that the
DIST=LOGISTIC option requests the logit model.

PROC QLIM DATA=masil.students;
MODEL parking = income age male /DISCRETE (DIST=LOGISTIC);
RUN;

The QLIM Procedure

Discrete Response Profile of parking

Index Value Frequency Percent

1 0 413 94.51
2 1 20 4.58
3 2 4 0.92


Model Fit Summary

Number of Endogenous Variables 1
Endogenous Variable parking
Number of Observations 437
Log Likelihood -89.86011
Maximum Absolute Gradient 8.14046E-7
Number of Iterations 23
AIC 189.72021
Schwarz Criterion 210.11988


Goodness-of-Fit Measures

Measure Value Formula

Likelihood Ratio (R) 27.854 2 * (LogL - LogL0)
Upper Bound of R (U) 207.57 - 2 * LogL0
Aldrich-Nelson 0.0599 R / (R+N)
Cragg-Uhler 1 0.0618 1 - exp(-R/N)
Cragg-Uhler 2 0.1633 (1-exp(-R/N)) / (1-exp(-U/N))
Estrella 0.0662 1 - (1-R/U)^(U/N)
Adjusted Estrella 0.0418 1 - ((LogL-K)/LogL0)^(-2/N*LogL0)
McFadden's LRI 0.1342 R / U
Veall-Zimmermann 0.1861 (R * (U+N)) / (U * (R+N))
McKelvey-Zavoina 0.6462

N = # of observations, K = # of regressors

Algorithm converged.


Parameter Estimates

Standard Approx
Parameter Estimate Error t Value Pr > |t|

Intercept 12.744794 3.787615 3.36 0.0008
income -0.514071 1.283192 -0.40 0.6887
age -0.736259 0.189434 -3.89 0.0001
male -1.227092 0.470586 -2.61 0.0091
_Limit2 1.911842 0.468050 4.08 <.0001
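
Most of the goodness-of-fit measures listed above depend only on R, U, and N, so they can be recomputed from the formulas that QLIM prints next to them. The Python sketch below is an illustration under that reading of the output; the Adjusted Estrella and McKelvey-Zavoina measures are left out because they need quantities not shown in the table.

```python
import math

# Inputs copied from the QLIM output above.
R = 27.854     # likelihood ratio, 2 * (LogL - LogL0)
U = 207.57     # upper bound of R, -2 * LogL0
N = 437        # number of observations

measures = {
    "Aldrich-Nelson": R / (R + N),
    "Cragg-Uhler 1": 1 - math.exp(-R / N),
    "Cragg-Uhler 2": (1 - math.exp(-R / N)) / (1 - math.exp(-U / N)),
    "Estrella": 1 - (1 - R / U) ** (U / N),
    "McFadden's LRI": R / U,
    "Veall-Zimmermann": (R * (U + N)) / (U * (R + N)),
}
for name, value in measures.items():
    print(f"{name:<18}{value:.4f}")
```

McFadden's LRI computed this way is the same quantity that STATA reports as Pseudo R2 for the ordered logit model.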

The SAS QLIM procedure estimates the intercept and $\tau_2$, assuming $\tau_1 = 0$. The estimated
intercept of SAS is equivalent to (0 - /cut1) in STATA. The _Limit2 of SAS is the difference
between the cut points of STATA, 1.91184 = -10.83295 - (-12.74479).
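
The conversion between the two parameterizations is simple arithmetic. As a small illustration in Python (the function name is hypothetical), the STATA cut points from Section 5.1 can be mapped to the intercept and _Limit2 that QLIM reports; the same mapping applies to the ordered probit estimates.

```python
def stata_to_qlim(cut1, cut2):
    """Map STATA cut points (intercept fixed at 0) to the SAS QLIM
    normalization (first threshold fixed at 0)."""
    intercept = 0.0 - cut1     # SAS Intercept
    limit2 = cut2 - cut1       # SAS _Limit2
    return intercept, limit2

# Cut points copied from the .ologit and .oprobit outputs in Section 5.1.
logit_intercept, logit_limit2 = stata_to_qlim(-12.74479, -10.83295)
probit_intercept, probit_limit2 = stata_to_qlim(-6.000986, -5.118676)

print(f"ordered logit:  Intercept = {logit_intercept:.5f}, _Limit2 = {logit_limit2:.5f}")
print(f"ordered probit: Intercept = {probit_intercept:.6f}, _Limit2 = {probit_limit2:.6f}")
```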

The SAS LOGISTIC and PROBIT procedures are also used to estimate the ordered logit and
probit models. These procedures recognize binary or ordinal response models by examining the
dependent variable.

PROC LOGISTIC DATA = masil.students DESC;
MODEL parking = income age male /LINK=LOGIT;
RUN;

Like the STATA .ologit command, the LOGISTIC procedure fits the model assuming the
intercept is zero. The parameter estimates and standard errors are slightly different from those of
the QLIM procedure and the .ologit command. Other parts of the output are skipped.

Analysis of Maximum Likelihood Estimates

Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 2 1 10.8324 3.8112 8.0784 0.0045
Intercept 1 1 12.7444 3.8021 11.2354 0.0008
income 1 -0.5142 1.2908 0.1587 0.6904
age 1 -0.7362 0.1900 15.0221 0.0001
male 1 -1.2271 0.4709 6.7902 0.0092

PROC PROBIT DATA = masil.students;
CLASS parking;
MODEL parking = income age male /DIST=LOGISTIC;
RUN;

The PROBIT procedure returns almost the same results as the QLIM procedure except for the
signs of the estimates. Other parts of the output are skipped.

Analysis of Parameter Estimates

Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 -12.7448 3.7876 -20.1684 -5.3212 11.32 0.0008
Intercept2 1 1.9118 0.4680 0.9945 2.8292 16.68 <.0001
income 1 0.5141 1.2832 -2.0009 3.0291 0.16 0.6887
age 1 0.7363 0.1894 0.3650 1.1075 15.11 0.0001
male 1 1.2271 0.4706 0.3048 2.1494 6.80 0.0091


5.4 Ordered Probit in SAS

The QLIM procedure by default estimates a probit model. The DIST=NORMAL, the default
option, may be omitted.

PROC QLIM DATA=masil.students;
MODEL parking = income age male /DISCRETE (DIST=NORMAL);
RUN;


The QLIM Procedure

Discrete Response Profile of parking

Index Value Frequency Percent

1 0 413 94.51
2 1 20 4.58
3 2 4 0.92


Model Fit Summary

Number of Endogenous Variables 1
Endogenous Variable parking
Number of Observations 437
Log Likelihood -89.43075
Maximum Absolute Gradient 4.69307E-6
Number of Iterations 17
AIC 188.86151
Schwarz Criterion 209.26117


Goodness-of-Fit Measures

Measure Value Formula

Likelihood Ratio (R) 28.713 2 * (LogL - LogL0)
Upper Bound of R (U) 207.57 - 2 * LogL0
Aldrich-Nelson 0.0617 R / (R+N)
Cragg-Uhler 1 0.0636 1 - exp(-R/N)
Cragg-Uhler 2 0.1682 (1-exp(-R/N)) / (1-exp(-U/N))
Estrella 0.0683 1 - (1-R/U)^(U/N)
Adjusted Estrella 0.0439 1 - ((LogL-K)/LogL0)^(-2/N*LogL0)
McFadden's LRI 0.1383 R / U
Veall-Zimmermann 0.1915 (R * (U+N)) / (U * (R+N))
McKelvey-Zavoina 0.3011

N = # of observations, K = # of regressors

Algorithm converged.


Parameter Estimates

Standard Approx
Parameter Estimate Error t Value Pr > |t|

Intercept 6.000986 1.869053 3.21 0.0013
income -0.186984 0.611605 -0.31 0.7598
age -0.359485 0.092482 -3.89 0.0001
male -0.586787 0.220526 -2.66 0.0078
_Limit2 0.882310 0.196555 4.49 <.0001
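
The information criteria in the two Model Fit Summary tables make it easy to compare the ordered logit and probit fits. The output does not print the formulas, so the Python sketch below assumes the standard definitions, AIC = -2 lnL + 2k and Schwarz criterion = -2 lnL + k ln(N), with k = 5 estimated parameters; these assumptions reproduce the printed values.

```python
import math

n_obs = 437
n_params = 5   # Intercept, income, age, male, and _Limit2

# Log likelihoods copied from the two QLIM Model Fit Summary tables.
log_lik = {"ordered logit": -89.86011, "ordered probit": -89.43075}

criteria = {}
for model, ll in log_lik.items():
    aic = -2.0 * ll + 2.0 * n_params
    sbc = -2.0 * ll + n_params * math.log(n_obs)   # Schwarz criterion
    criteria[model] = (aic, sbc)
    print(f"{model}: AIC = {aic:.5f}, Schwarz = {sbc:.5f}")
```

Both criteria are slightly smaller for the ordered probit model, although the difference is too small to favor one model strongly.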

The QLIM procedure and the .oprobit command produce almost the same result except for the
$\tau_2$ estimate. The _Limit2 of SAS is the difference between the cut points of STATA,
.88231 = -5.118676 - (-6.000986).

The PROBIT and LOGISTIC procedures also estimate the ordered probit model. Keep in mind
that the signs of the coefficients are reversed in the PROBIT procedure.

PROC LOGISTIC DATA = masil.students DESC;
MODEL parking = income age male /LINK=PROBIT;
RUN;

Analysis of Maximum Likelihood Estimates

Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 2 1 5.1181 1.8373 7.7601 0.0053
Intercept 1 1 6.0004 1.8441 10.5872 0.0011
income 1 -0.1869 0.6160 0.0921 0.7615
age 1 -0.3595 0.0908 15.6767 <.0001
male 1 -0.5868 0.2203 7.0941 0.0077

PROC PROBIT DATA = masil.students;
CLASS parking;
MODEL parking = income age male /DIST=NORMAL;
RUN;

Analysis of Parameter Estimates

Standard 95% Confidence Chi-
Parameter DF Estimate Error Limits Square Pr > ChiSq

Intercept 1 -6.0010 1.8691 -9.6643 -2.3377 10.31 0.0013
Intercept2 1 0.8823 0.1966 0.4971 1.2675 20.15 <.0001
income 1 0.1870 0.6116 -1.0117 1.3857 0.09 0.7598
age 1 0.3595 0.0925 0.1782 0.5407 15.11 0.0001
male 1 0.5868 0.2205 0.1546 1.0190 7.08 0.0078


5.5 Ordered Logit/Probit in LIMDEP (Ordered$)

The LIMDEP Ordered$ command estimates ordered logit and probit models. The Logit$
subcommand runs the ordered logit model.

ORDERED;
Lhs=parking;
Rhs=ONE,income,age,male;
Logit$

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Ordered Probability Model |
| Maximum Likelihood Estimates |
| Model estimated: Sep 18, 2005 at 05:53:44PM.|
| Dependent variable PARKING |
| Weighting variable None |
| Number of observations 437 |
| Iterations completed 13 |
| Log likelihood function -89.86011 |
| Restricted log likelihood -103.7871 |
| Chi squared 27.85404 |
| Degrees of freedom 3 |
| Prob[ChiSqd > value] = .3896741E-05 |
| Underlying probabilities based on Logistic |
| Cell frequencies for outcomes |
| Y Count Freq Y Count Freq Y Count Freq |
| 0 413 .945 1 20 .045 2 4 .009 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant 12.74479424 3.7876161 3.365 .0008
INCOME -.5140708643 1.2831923 -.401 .6887 .61683982
AGE -.7362588281 .18943391 -3.887 .0001 20.691076
MALE -1.227091964 .47058590 -2.608 .0091 .57208238
Threshold parameters for index
Mu(1) 1.911841923 .46804996 4.085 .0000

+---------------------------------------------------------------------------+
| Cross tabulation of predictions. Row is actual, column is predicted. |
| Model = Logistic . Prediction is number of the most probable cell. |
+-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Actual|Row Sum| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 0| 413| 413| 0| 0|
| 1| 20| 20| 0| 0|
| 2| 4| 4| 0| 0|
+-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|Col Sum| 437| 437| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

LIMDEP and SAS QLIM produce the same results for the ordered logit model. Note that
_Limit2 in SAS is equivalent to Mu(1), the threshold parameter, in LIMDEP.
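
The cross tabulation of predictions above deserves a closer look, because every observation is assigned to category 0. The short Python sketch below (an illustration only) computes the share of correct predictions from that table and compares it with the share of the most frequent category.

```python
# Cross tabulation copied from the LIMDEP output above:
# rows are actual categories, columns are predicted categories.
crosstab = [
    [413, 0, 0],   # actual 0
    [20, 0, 0],    # actual 1
    [4, 0, 0],     # actual 2
]

total = sum(sum(row) for row in crosstab)
correct = sum(crosstab[i][i] for i in range(len(crosstab)))
accuracy = correct / total

# Share of the most frequent actual category, i.e. what always guessing
# that category would achieve.
modal_share = max(sum(row) for row in crosstab) / total

print(f"correctly predicted: {correct} of {total} ({accuracy:.4f})")
print(f"modal category share: {modal_share:.4f}")
```

The two shares coincide, so in terms of classification the model does no better than always predicting the modal category, even though the coefficients of age and male are statistically significant.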

The ordered probit model is estimated by the Ordered$ command without the Logit$
subcommand, since the command fits the ordered probit model by default. The output is
comparable to that of the QLIM procedure.

ORDERED;
Lhs=parking;
Rhs=ONE,income,age,male$

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Ordered Probability Model |
| Maximum Likelihood Estimates |
| Model estimated: Sep 18, 2005 at 05:55:42PM.|
| Dependent variable PARKING |
| Weighting variable None |
| Number of observations 437 |
| Iterations completed 11 |
| Log likelihood function -89.43075 |
| Restricted log likelihood -103.7871 |
| Chi squared 28.71275 |
| Degrees of freedom 3 |
| Prob[ChiSqd > value] = .2572557E-05 |
| Underlying probabilities based on Normal |
| Cell frequencies for outcomes |
| Y Count Freq Y Count Freq Y Count Freq |
| 0 413 .945 1 20 .045 2 4 .009 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant 6.000985035 1.8690536 3.211 .0013
INCOME -.1869836008 .61160494 -.306 .7598 .61683982
AGE -.3594852294 .92482090E-01 -3.887 .0001 20.691076
MALE -.5867870572 .22052578 -2.661 .0078 .57208238
Threshold parameters for index
Mu(1) .8823095981 .19655461 4.489 .0000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+---------------------------------------------------------------------------+
| Cross tabulation of predictions. Row is actual, column is predicted. |
| Model = Probit . Prediction is number of the most probable cell. |
+-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Actual|Row Sum| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 0| 413| 413| 0| 0|
| 1| 20| 20| 0| 0|
| 2| 4| 4| 0| 0|
+-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|Col Sum| 874| 437| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+


5.6 Ordered Logit/Probit in SPSS

The PLUM command estimates the ordered logit and probit models in SPSS. The threshold points
in SPSS are equivalent to the cut points in STATA.

PLUM parking WITH income age male
/CRITERIA = CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5)
PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
/LINK = LOGIT /PRINT = FIT PARAMETER SUMMARY .

PLUM parking WITH income age male
/CRITERIA = CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5)
PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
/LINK = PROBIT /PRINT = FIT PARAMETER SUMMARY .
6. The Multinomial Logit Regression Model

Suppose we have a nominal dependent variable such as the mode of transportation (walk, bike,
bus, and car). The multinomial logit and conditional logit models are commonly used; the
multinomial probit model is not often used, mainly due to the practical difficulty in estimation.
However, STATA does have the .mprobit command to fit the model.

In the multinomial logit model, the independent variables contain characteristics of individuals,
while they are the attributes of the choices in the conditional logit model. In other words, the
conditional logit estimates how alternative-specific, not individual-specific, variables affect the
likelihood of observing a given outcome (Long 2003). Therefore, data need to be appropriately
arranged in advance.
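
To make the data arrangement concrete, the following Python sketch contrasts the two layouts. All records, the travel-time attribute, and the variable names are hypothetical and serve only to illustrate the reshaping; they are not taken from the students data set.

```python
# Hypothetical individual-level records in the "wide" layout used by the
# multinomial logit: one row per person, individual-specific regressors.
wide = [
    {"id": 1, "income": 0.5, "choice": "bus"},
    {"id": 2, "income": 0.9, "choice": "car"},
]

# Hypothetical alternative-specific attribute (travel time in minutes).
travel_time = {"walk": 40, "bike": 20, "bus": 15, "car": 10}

# "Long" layout for the conditional logit: one row per person-alternative
# pair, with a 0/1 indicator marking the alternative actually chosen.
long_rows = [
    {
        "id": person["id"],
        "alternative": alt,
        "time": minutes,
        "chosen": int(person["choice"] == alt),
    }
    for person in wide
    for alt, minutes in travel_time.items()
]

for row in long_rows:
    print(row)
```

Each person contributes as many rows as there are alternatives, and exactly one of those rows has chosen equal to 1.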


6.1 Multinomial Logit/Probit in STATA (.mlogit and .mprobit)

STATA has the .mlogit command for the multinomial logit model. The base() option indicates
the value of the dependent variable to be used as the base category for the estimation. By
default STATA uses the most frequent outcome as the base, which is 0 in these data, so you may
omit base(0) here.

. mlogit transmode income age male, base(0)

Iteration 0:   log likelihood = -444.84113
Iteration 1:   log likelihood = -411.18604
Iteration 2:   log likelihood = -406.36474
Iteration 3:   log likelihood =  -406.3251
Iteration 4:   log likelihood = -406.32509

Multinomial logistic regression                   Number of obs   =        437
                                                  LR chi2(9)      =      77.03
                                                  Prob > chi2     =     0.0000
Log likelihood = -406.32509                       Pseudo R2       =     0.0866

------------------------------------------------------------------------------
   transmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1            |
      income |   4.018021    1.34443     2.99   0.003     1.382986    6.653057
         age |   .1915917   .1392928     1.38   0.169    -.0814172    .4646006
        male |   .2582886   .4039971     0.64   0.523    -.5335311    1.050108
       _cons |  -6.903473    2.97678    -2.32   0.020    -12.73785   -1.069091
-------------+----------------------------------------------------------------
2            |
      income |   8.951041   1.338539     6.69   0.000     6.327552    11.57453
         age |   .1374997   .1451938     0.95   0.344    -.1470749    .4220742
        male |   .1573179   .4191014     0.38   0.707    -.6641057    .9787415
       _cons |  -9.091051   3.088123    -2.94   0.003    -15.14366   -3.038442
-------------+----------------------------------------------------------------
3            |
      income |   4.210485   1.032024     4.08   0.000     2.187755    6.233215
         age |   .3457236   .0995071     3.47   0.001     .1506932     .540754
        male |   .5402549   .2769887     1.95   0.051    -.0026329    1.083143
       _cons |  -8.388756   2.135792    -3.93   0.000    -12.57483   -4.202681
------------------------------------------------------------------------------
(transmode==0 is the base outcome)

Let us see what happens if the base outcome changes. As shown in the following, the parameter
estimates and standard errors change, whereas the goodness-of-fit remains unchanged. The two
.mlogit commands with different bases fit the same model but present the results differently. The
SAS CATMOD procedure in the next section uses the largest value as the base outcome.

. mlogit transmode income age male, base(3)

Iteration 0:   log likelihood = -444.84113
Iteration 1:   log likelihood = -411.18604
Iteration 2:   log likelihood = -406.36474
Iteration 3:   log likelihood =  -406.3251
Iteration 4:   log likelihood = -406.32509

Multinomial logistic regression                   Number of obs   =        437
                                                  LR chi2(9)      =      77.03
                                                  Prob > chi2     =     0.0000
Log likelihood = -406.32509                       Pseudo R2       =     0.0866

------------------------------------------------------------------------------
   transmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
0            |
      income |  -4.210485   1.032024    -4.08   0.000    -6.233215   -2.187755
         age |  -.3457236   .0995071    -3.47   0.001     -.540754   -.1506932
        male |  -.5402549   .2769887    -1.95   0.051    -1.083143    .0026329
       _cons |   8.388756   2.135792     3.93   0.000     4.202681    12.57483
-------------+----------------------------------------------------------------
1            |
      income |  -.1924639    1.00606    -0.19   0.848    -2.164305    1.779377
         age |   -.154132   .1131506    -1.36   0.173    -.3759031    .0676392
        male |  -.2819663   .3443963    -0.82   0.413    -.9569706    .3930379
       _cons |   1.485283   2.430912     0.61   0.541    -3.279216    6.249783
-------------+----------------------------------------------------------------
2            |
      income |   4.740556   .9447126     5.02   0.000     2.888953    6.592158
         age |  -.2082239   .1164954    -1.79   0.074    -.4365507    .0201028
        male |   -.382937   .3490247    -1.10   0.273    -1.067013    .3011389
       _cons |  -.7022953   2.460119    -0.29   0.775     -5.52404    4.119449
------------------------------------------------------------------------------
(transmode==3 is the base outcome)
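
The two sets of coefficients are linked by a simple subtraction: the coefficients relative to a new base equal the original coefficients minus those of the new base outcome. The Python sketch below (the function name is hypothetical) applies this to the base(0) estimates and recovers the base(3) coefficients shown above; the standard errors cannot be obtained this way because they also depend on covariances that the output does not display.

```python
# Coefficients copied from the base(0) output: outcome -> (income, age, male, _cons).
base0 = {
    0: (0.0, 0.0, 0.0, 0.0),                          # base outcome, all zeros
    1: (4.018021, 0.1915917, 0.2582886, -6.903473),
    2: (8.951041, 0.1374997, 0.1573179, -9.091051),
    3: (4.210485, 0.3457236, 0.5402549, -8.388756),
}

def rebase(coefs, new_base):
    """Re-express multinomial logit coefficients relative to a new base outcome."""
    ref = coefs[new_base]
    return {
        outcome: tuple(b - r for b, r in zip(values, ref))
        for outcome, values in coefs.items()
    }

base3 = rebase(base0, new_base=3)
for outcome, values in base3.items():
    print(outcome, [round(v, 6) for v in values])
```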

The SPost .mlogtest command conducts a variety of statistical tests for the multinomial logit
model. This command supports not only Wald and likelihood ratio tests, but also Hausman and
Small-Hsiao tests for the independence of irrelevant alternatives (IIA) assumption. The
.mlogtest command works with the .mlogit command only.

. mlogtest, hausman smhsiao base

**** Hausman tests of IIA assumption

 Ho: Odds(Outcome-J vs Outcome-K) are independent of other alternatives.

 Omitted |      chi2   df   P>chi2   evidence
---------+------------------------------------
    0    |     0.260    8    1.000   for Ho
    1    |    -3.307    8    1.000   for Ho
    2    |    -0.319    8    1.000   for Ho
    3    |     2.315    8    0.970   for Ho
----------------------------------------------

**** Small-Hsiao tests of IIA assumption

 Ho: Odds(Outcome-J vs Outcome-K) are independent of other alternatives.

 Omitted |  lnL(full)  lnL(omit)    chi2   df   P>chi2   evidence
---------+---------------------------------------------------------
    0    |   -120.685   -116.139   9.092    4    0.059   for Ho
    1    |   -131.938   -128.574   6.728    4    0.151   for Ho
    2    |   -155.078   -150.308   9.540    4    0.049   against Ho
    3    |    -71.735    -67.571   8.327    4    0.080   for Ho
-------------------------------------------------------------------
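
The Small-Hsiao statistics can be checked by hand, since each chi2 equals -2[lnL(full) - lnL(omit)]. The Python sketch below is an illustration rather than part of SPost; it uses the closed-form upper-tail probability of the chi-squared distribution that holds for 4 degrees of freedom, exp(-x/2)(1 + x/2), and applies the .05 level to label the evidence.

```python
import math

def chi2_sf_df4(x):
    """Upper-tail probability of a chi-squared variable with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# (lnL(full), lnL(omit)) copied from the Small-Hsiao table above.
log_liks = {
    0: (-120.685, -116.139),
    1: (-131.938, -128.574),
    2: (-155.078, -150.308),
    3: (-71.735, -67.571),
}

results = {}
for omitted, (ll_full, ll_omit) in log_liks.items():
    chi2 = -2.0 * (ll_full - ll_omit)
    p_value = chi2_sf_df4(chi2)
    verdict = "against Ho" if p_value < 0.05 else "for Ho"
    results[omitted] = (chi2, p_value, verdict)
    print(f"omitted {omitted}: chi2 = {chi2:.3f}, p = {p_value:.3f}, {verdict}")
```

Small discrepancies in the third decimal place can arise because the log likelihoods in the table are rounded. Note that the p-values for outcomes 0 and 2 sit on either side of .05, so the evidence on the IIA assumption is borderline.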

The STATA .mprobit command fits the multinomial probit model. The model took longer
to converge than the multinomial logit model.

. mprobit transmode income age male

Iteration 0:   log likelihood =  -425.2053
Iteration 1:   log likelihood = -407.95972
Iteration 2:   log likelihood = -406.38652
Iteration 3:   log likelihood = -406.38431
Iteration 4:   log likelihood = -406.38431

Multinomial probit regression                     Number of obs   =        437
                                                  Wald chi2(9)    =      64.47
Log likelihood = -406.38431                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
   transmode |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_outcome_2   |
      income |   2.651949   .8328566     3.18   0.001      1.01958    4.284318
         age |   .1501467   .0903614     1.66   0.097    -.0269584    .3272519
        male |   .1967795    .262047     0.75   0.453    -.3168232    .7103822
       _cons |  -5.075328   1.953866    -2.60   0.009    -8.904835   -1.245822
-------------+----------------------------------------------------------------
_outcome_3   |
      income |   5.757611   .8443105     6.82   0.000     4.102793    7.412429
         age |   .1218625   .0942662     1.29   0.196    -.0628959    .3066209
        male |   .1662947   .2772189     0.60   0.549    -.3770444    .7096339
       _cons |  -6.547874   2.031953    -3.22   0.001    -10.53043   -2.565319
-------------+----------------------------------------------------------------
_outcome_4   |
      income |   2.751622   .6936632     3.97   0.000     1.392067    4.111177
         age |   .2760071    .074178     3.72   0.000     .1306208    .4213933
        male |   .4232271   .2086763     2.03   0.043     .0142289    .8322252
       _cons |  -6.375609   1.598767    -3.99   0.000    -9.509134   -3.242083
------------------------------------------------------------------------------
(transmode=0 is the base outcome)
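
The z, P>|z|, and confidence interval columns follow directly from each coefficient and its standard error. As an illustration only (the helper name `wald_summary` is hypothetical and not a STATA command), this sketch recomputes the income row of _outcome_2 using the standard normal distribution.

```python
import math

def wald_summary(coef, se, z_crit=1.959964):
    """z statistic, two-sided normal p-value, and 95% confidence limits."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p, coef - z_crit * se, coef + z_crit * se

# income coefficient and standard error for _outcome_2 in the output above
z, p, lower, upper = wald_summary(2.651949, 0.8328566)
print(f"z = {z:.2f}, p = {p:.3f}, 95% CI = [{lower:.5f}, {upper:.6f}]")
```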


6.2 Multinomial Logit in SAS

SAS has the CATMOD procedure for the multinomial logit model. In the CATMOD procedure,
the RESPONSE statement is used to specify the functions of response probabilities.

PROC CATMOD DATA = masil.students;
DIRECT income age male;
RESPONSE LOGITS;
MODEL transmode = income age male /NOPROFILE;
RUN;


The CATMOD Procedure

Data Summary

Response transmode Response Levels 4
Weight Variable None Populations 414
Data Set STUDENTS Total Frequency 437
Frequency Missing 0 Observations 437


Maximum Likelihood Analysis

Maximum likelihood computations converged.


Maximum Likelihood Analysis of Variance

Source DF Chi-Square Pr > ChiSq

Intercept 3 15.91 0.0012
income 3 45.72 <.0001
age 3 14.66 0.0021
male 3 4.73 0.1927

Likelihood Ratio 1E3 778.33 1.0000


Analysis of Maximum Likelihood Estimates

Function Standard Chi-
Parameter Number Estimate Error Square Pr > ChiSq

Intercept 1 8.3888 2.1358 15.43 <.0001
2 1.4853 2.4309 0.37 0.5412
3 -0.7023 2.4601 0.08 0.7753
income 1 -4.2105 1.0320 16.65 <.0001
2 -0.1925 1.0061 0.04 0.8483
3 4.7406 0.9447 25.18 <.0001
age 1 -0.3457 0.0995 12.07 0.0005
2 -0.1541 0.1132 1.86 0.1731
3 -0.2082 0.1165 3.19 0.0739
male 1 -0.5403 0.2770 3.80 0.0511
2 -0.2820 0.3444 0.67 0.4129
3 -0.3829 0.3490 1.20 0.2726

As mentioned before, the CATMOD procedure uses the largest value of the dependent variable
as the base outcome. Accordingly, you need to compare the above with the STATA output
produced with the base(3) option. The two outputs are the same except for the likelihood ratio.


6.3 Multinomial Logit in LIMDEP (Mlogit$)

In LIMDEP, you may use either the Mlogit$ or simply the Logit$ command to fit the
multinomial logit model. Both commands produce identical results. Like STATA, LIMDEP
by default uses the smallest value as the base outcome.

MLOGIT;
Lhs=transmod;
Rhs=ONE,income,age,male$

Or, use the old style command.

LOGIT;
Lhs=transmod;
Rhs=ONE,income,age,male$

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Multinomial Logit Model |
| Maximum Likelihood Estimates |
| Model estimated: Sep 19, 2005 at 09:19:23AM.|
| Dependent variable TRANSMOD |
| Weighting variable None |
| Number of observations 437 |
| Iterations completed 6 |
| Log likelihood function -406.3251 |
| Restricted log likelihood -444.8411 |
| Chi squared 77.03209 |
| Degrees of freedom 9 |
| Prob[ChiSqd > value] = .0000000 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Characteristics in numerator of Prob[Y = 1]
Constant -6.903472671 2.9767801 -2.319 .0204
INCOME 4.018021309 1.3444306 2.989 .0028 .61683982
AGE .1915916653 .13929281 1.375 .1690 20.691076
MALE .2582886230 .40399708 .639 .5226 .57208238
Characteristics in numerator of Prob[Y = 2]
Constant -9.091051495 3.0881231 -2.944 .0032
INCOME 8.951040805 1.3385396 6.687 .0000 .61683982
AGE .1374996725 .14519378 .947 .3436 20.691076
MALE .1573178944 .41910143 .375 .7074 .57208238
Characteristics in numerator of Prob[Y = 3]
Constant -8.388756169 2.1357918 -3.928 .0001
INCOME 4.210485161 1.0320242 4.080 .0000 .61683982
AGE .3457236198 .99507140E-01 3.474 .0005 20.691076
MALE .5402549359 .27698869 1.950 .0511 .57208238
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+--------------------------------------------------------------------+
| Information Statistics for Discrete Choice Model. |
| M=Model MC=Constants Only M0=No Model |
| Criterion F (log L) -406.32509 -444.84113 -605.81064 |
| LR Statistic vs. MC 77.03209 .00000 .00000 |
| Degrees of Freedom 9.00000 .00000 .00000 |
| Prob. Value for LR .00000 .00000 .00000 |
| Entropy for probs. 406.32509 444.84113 605.81064 |
| Normalized Entropy .67071 .73429 1.00000 |
| Entropy Ratio Stat. 398.97109 321.93900 .00000 |
| Bayes Info Criterion 867.36958 944.40167 1266.34067 |
| BIC - BIC(no model) 398.97109 321.93900 .00000 |
| Pseudo R-squared .08658 .00000 .00000 |
| Pct. Correct Prec. 64.98856 .00000 25.00000 |
| Means: y=0 y=1 y=2 y=3 yu=4 y=5, y=6 y>=7 |
| Outcome .1648 .0892 .0961 .6499 .0000 .0000 .0000 .0000 |
| Pred.Pr .1648 .0892 .0961 .6499 .0000 .0000 .0000 .0000 |
| Notes: Entropy computed as Sum(i)Sum(j)Pfit(i,j)*logPfit(i,j). |
| Normalized entropy is computed against M0. |
| Entropy ratio statistic is computed against M0. |
| BIC = 2*criterion - log(N)*degrees of freedom. |
| If the model has only constants or if it has no constants, |
| the statistics reported here are not useable. |
+--------------------------------------------------------------------+
Frequencies of actual & predicted outcomes
Predicted outcome has maximum probability.

Predicted
------ -------------------- + -----
Actual 0 1 2 3 | Total
------ -------------------- + -----
0 5 0 0 67 | 72
1 0 0 1 38 | 39
2 0 0 1 41 | 42
3 6 0 0 278 | 284
------ -------------------- + -----
Total 11 0 2 424 | 437
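
The percent correctly predicted reported in the information statistics box (64.98856) can be recovered from the diagonal of this table. A short sketch with the counts copied from the table above:

```python
# Rows are actual outcomes 0-3, columns are predicted outcomes 0-3
confusion = [
    [5, 0, 0, 67],
    [0, 0, 1, 38],
    [0, 0, 1, 41],
    [6, 0, 0, 278],
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(map(sum, confusion))
pct_correct = 100.0 * correct / total
print(f"{correct} of {total} correct = {pct_correct:.5f} percent")
```

Note that 284 of the 437 students actually chose outcome 3, so predicting outcome 3 for everyone would yield the same percentage.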

Note that the variable name TRANSMOD was truncated because LIMDEP allows up to eight
characters for a variable name. LIMDEP and STATA produce the same results for the
multinomial logit model.
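
Switching the base outcome does not change the fitted model; it only re-expresses the coefficients. The sketch below, under the assumption that the usual multinomial logit identity applies (coefficients relative to a new base equal the old coefficients minus those of the new base), converts the LIMDEP estimates (base outcome 0) into estimates relative to outcome 3 and recovers the SAS CATMOD values in Section 6.2. The function name `rebase` is hypothetical.

```python
# LIMDEP estimates for outcomes 1, 2, 3 relative to base outcome 0
base0 = {
    1: {"const": -6.903472671, "income": 4.018021309, "age": 0.1915916653, "male": 0.2582886230},
    2: {"const": -9.091051495, "income": 8.951040805, "age": 0.1374996725, "male": 0.1573178944},
    3: {"const": -8.388756169, "income": 4.210485161, "age": 0.3457236198, "male": 0.5402549359},
}

def rebase(coefs, old_base, new_base):
    """Re-express multinomial logit coefficients relative to new_base."""
    names = list(next(iter(coefs.values())))
    full = dict(coefs)
    full[old_base] = {name: 0.0 for name in names}  # base coefficients are zero
    return {
        outcome: {name: full[outcome][name] - full[new_base][name] for name in names}
        for outcome in sorted(full) if outcome != new_base
    }

base3 = rebase(base0, old_base=0, new_base=3)
for outcome, values in base3.items():
    print(outcome, {name: round(value, 4) for name, value in values.items()})
```

Outcomes 0, 1, and 2 in `base3` correspond to function numbers 1, 2, and 3 in the CATMOD output; for example, the income coefficients come out as -4.2105, -0.1925, and 4.7406.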


6.4 Multinomial Logit in SPSS

SPSS has the Nomreg command to estimate the multinomial logit model. Like SAS, SPSS by
default uses the largest value as the base outcome.

NOMREG transmode WITH income age male
/CRITERIA CIN(95) DELTA(0) MXITER(100) MXSTEP(5) CHKSEP(20) LCONVERGE(0)
PCONVERGE(0.000001) SINGULAR(0.00000001)
/MODEL /INTERCEPT INCLUDE /PRINT PARAMETER SUMMARY LRT .

7. The Conditional Logit Regression Model

Imagine a choice among four travel modes: air flight, train, bus, and car. The data set and model
here are adopted from Greene (2003). The model examines how the generalized cost measure
(cost), terminal waiting time (time), and household income (income) affect the choice.

These independent variables are not characteristics of subjects (individuals), but attributes of the
alternatives. Thus, the data arrangement of the conditional logit model is different from that of
the multinomial logit model (Figure 2).

Figure 2. Data Arrangement for the Conditional Logit Model
+---------------------------------------------------------------------------+
| subject   mode   choice   air   train   bus   cost   time  income  air_inc |
|---------------------------------------------------------------------------|
|       1      1        0     1       0     0     70     69      35       35 |
|       1      2        0     0       1     0     71     34      35        0 |
|       1      3        0     0       0     1     70     35      35        0 |
|       1      4        1     0       0     0     30      0      35        0 |
|       2      1        0     1       0     0     68     64      30       30 |
|---------------------------------------------------------------------------|
|       2      2        0     0       1     0     84     44      30        0 |
|       2      3        0     0       0     1     85     53      30        0 |
|       2      4        1     0       0     0     50      0      30        0 |
|       3      1        0     1       0     0    129     69      40       40 |
|       3      2        0     0       1     0    195     34      40        0 |


The example data set has four observations per subject, each of which contains attributes of
using air flight, train, bus, and car. The dependent variable choice is coded 1 only if a subject
chooses that travel mode. The four dummy variables, air, train, bus, and car, flag
the corresponding modes of transportation. See the appendix for details about the data set.
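
The arrangement in Figure 2 can be sketched as an expansion of one record per subject into one row per travel mode. This is only an illustration: the wide input (dictionaries of cost and time by mode) and the function `to_long` are hypothetical, while the values for subject 1 are taken from Figure 2.

```python
MODES = ["air", "train", "bus", "car"]  # mode codes 1 to 4, as in the appendix

def to_long(subject, chosen, cost, time, income):
    """Expand one subject's record into one row per travel mode."""
    rows = []
    for code, name in enumerate(MODES, start=1):
        air = int(name == "air")
        rows.append({
            "subject": subject,
            "mode": code,
            "choice": int(name == chosen),  # 1 only for the chosen mode
            "air": air,
            "train": int(name == "train"),
            "bus": int(name == "bus"),
            "cost": cost[name],
            "time": time[name],
            "income": income,
            "air_inc": air * income,        # interaction of air and income
        })
    return rows

rows = to_long(
    subject=1, chosen="car",
    cost={"air": 70, "train": 71, "bus": 70, "car": 30},
    time={"air": 69, "train": 34, "bus": 35, "car": 0},
    income=35,
)
for row in rows:
    print(row)
```

Each subject contributes exactly one row with choice equal to 1, which is what the conditional logit commands in the following sections rely on when grouping by subject.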


7.1 Conditional Logit in STATA (.clogit)

STATA has the .clogit command to estimate the conditional logit model. The group() option
specifies the variable (e.g., identification number) that identifies unique individuals.

. clogit choice air train bus cost time air_inc, group(subject)

Iteration 0:   log likelihood =  -205.8187
Iteration 1:   log likelihood = -199.23679
Iteration 2:   log likelihood = -199.12851
Iteration 3:   log likelihood = -199.12837
Iteration 4:   log likelihood = -199.12837

Conditional (fixed-effects) logistic regression   Number of obs   =        840
                                                  LR chi2(6)      =     183.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -199.12837                       Pseudo R2       =     0.3160

------------------------------------------------------------------------------
      choice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         air |   5.207443   .7790551     6.68   0.000     3.680523    6.734363
       train |   3.869043   .4431269     8.73   0.000      3.00053    4.737555
         bus |   3.163194   .4502659     7.03   0.000     2.280689    4.045699
        cost |  -.0155015    .004408    -3.52   0.000     -.024141    -.006862
        time |  -.0961248   .0104398    -9.21   0.000    -.1165865   -.0756631
     air_inc |    .013287   .0102624     1.29   0.195    -.0068269     .033401
------------------------------------------------------------------------------

Let us run the .listcoef command to compute factor changes in odds. For a one unit increase
in the waiting time for a given travel mode, for example, we can expect the odds of using that
travel mode to decrease by about 9 percent (a factor of .9084), holding other variables constant.

. listcoef

clogit (N=840): Factor Change in Odds

  Odds of: 1 vs 0

--------------------------------------------------
      choice |         b         z     P>|z|       e^b
-------------+------------------------------------
         air |   5.20744     6.684     0.000  182.6265
       train |   3.86904     8.731     0.000   47.8965
         bus |   3.16319     7.025     0.000   23.6460
        cost |  -0.01550    -3.517     0.000    0.9846
        time |  -0.09612    -9.207     0.000    0.9084
     air_inc |   0.01329     1.295     0.195    1.0134
--------------------------------------------------
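
The e^b column is simply the exponential of each coefficient, and the percent change in odds is 100*(e^b - 1). A minimal sketch using the .clogit coefficients reported in Section 7.1 (the variable names here are illustrative only):

```python
import math

# Coefficients copied from the .clogit output
coefs = {
    "air": 5.207443, "train": 3.869043, "bus": 3.163194,
    "cost": -0.0155015, "time": -0.0961248, "air_inc": 0.013287,
}

factor_change = {name: math.exp(b) for name, b in coefs.items()}
percent_change = {name: 100.0 * (f - 1.0) for name, f in factor_change.items()}

for name in coefs:
    print(f"{name:8s} e^b = {factor_change[name]:9.4f}  "
          f"percent change in odds = {percent_change[name]:8.2f}")
```

For time, the factor is about .9084, that is, a decrease of roughly 9.2 percent in the odds per additional unit of waiting time.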


7.2 Conditional Logit in SAS

SAS has the MDC procedure to fit the conditional logit model. The TYPE=CLOGIT option requests
the conditional logit model; the ID statement specifies the identification variable; and
NCHOICE=4 indicates that there are four choices of the travel mode.

PROC MDC DATA=masil.travel;
MODEL choice = air train bus cost time air_inc /TYPE=CLOGIT NCHOICE=4;
ID subject;
RUN;

The MDC Procedure

Conditional Logit Estimates

Algorithm converged.


Model Fit Summary

Dependent Variable choice
Number of Observations 210
Number of Cases 840
Log Likelihood -199.12837
Maximum Absolute Gradient 2.73152E-8
Number of Iterations 5
Optimization Method Newton-Raphson
AIC 410.25674
Schwarz Criterion 430.33938


Discrete Response Profile

Index CHOICE Frequency Percent

0 1 58 27.62
1 2 63 30.00
2 3 30 14.29
3 4 59 28.10


Goodness-of-Fit Measures

Measure Value Formula

Likelihood Ratio (R) 183.99 2 * (LogL - LogL0)
Upper Bound of R (U) 582.24 - 2 * LogL0
Aldrich-Nelson 0.467 R / (R+N)
Cragg-Uhler 1 0.5836 1 - exp(-R/N)
Cragg-Uhler 2 0.6225 (1-exp(-R/N)) / (1-exp(-U/N))
Estrella 0.6511 1 - (1-R/U)^(U/N)
Adjusted Estrella 0.6212 1 - ((LogL-K)/LogL0)^(-2/N*LogL0)
McFadden's LRI 0.316 R / U
Veall-Zimmermann 0.6354 (R * (U+N)) / (U * (R+N))

N = # of observations, K = # of regressors


Conditional Logit Estimates

Parameter Estimates

Standard Approx
Parameter DF Estimate Error t Value Pr > |t|

air 1 5.2074 0.7791 6.68 <.0001
train 1 3.8690 0.4431 8.73 <.0001
bus 1 3.1632 0.4503 7.03 <.0001
cost 1 -0.0155 0.004408 -3.52 0.0004
time 1 -0.0961 0.0104 -9.21 <.0001
air_inc 1 0.0133 0.0103 1.29 0.1954
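
The goodness-of-fit measures in the MDC output can be reproduced from the formulas printed next to them. The sketch below is an illustration under one assumption that is not stated in the output: the null log likelihood LogL0 is taken to be N*ln(1/4), that is, equal probabilities for the four alternatives, which reproduces the reported upper bound U = 582.24. N is the number of subjects (210) and K the number of regressors (6).

```python
import math

n, k, n_choices = 210, 6, 4
logl = -199.12837
logl0 = n * math.log(1.0 / n_choices)   # assumed null model: equal shares

r = 2.0 * (logl - logl0)                # likelihood ratio (R)
u = -2.0 * logl0                        # upper bound of R (U)

measures = {
    "Aldrich-Nelson": r / (r + n),
    "Cragg-Uhler 1": 1.0 - math.exp(-r / n),
    "Cragg-Uhler 2": (1.0 - math.exp(-r / n)) / (1.0 - math.exp(-u / n)),
    "Estrella": 1.0 - (1.0 - r / u) ** (u / n),
    "Adjusted Estrella": 1.0 - ((logl - k) / logl0) ** (-2.0 / n * logl0),
    "McFadden's LRI": r / u,
    "Veall-Zimmermann": (r * (u + n)) / (u * (r + n)),
}

print(f"R = {r:.2f}, U = {u:.2f}")
for name, value in measures.items():
    print(f"{name:18s} {value:.4f}")
```

McFadden's LRI (0.316) is the same quantity that STATA reports as Pseudo R2 in the .clogit output.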

Alternatively, you may use the PHREG procedure that estimates the Cox proportional hazards
model for survival data and the conditional logit model.

In order to make the data set consistent with the survival analysis data, you need to create a
failure time variable, failure = 1 - choice. The identification variable is specified in the
STRATA statement. The NOSUMMARY option suppresses the display of the event and
censored observation frequencies.

PROC PHREG DATA=masil.travel NOSUMMARY;
STRATA subject;
MODEL failure*choice(0)=air train bus cost time air_inc;
RUN;

The PHREG Procedure

Model Information

Data Set MASIL.TRAVEL
Dependent Variable failure
Censoring Variable choice
Censoring Value(s) 0
Ties Handling BRESLOW


Number of Observations Read 840
Number of Observations Used 840


Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.


Model Fit Statistics

Without With
Criterion Covariates Covariates

-2 LOG L 582.244 398.257
AIC 582.244 410.257
SBC 582.244 430.339


Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq

Likelihood Ratio 183.9869 6 <.0001
Score 173.4374 6 <.0001
Wald 103.7695 6 <.0001


Analysis of Maximum Likelihood Estimates

Parameter Standard Hazard
Variable DF Estimate Error Chi-Square Pr > ChiSq Ratio

air 1 5.20743 0.77905 44.6799 <.0001 182.625
train 1 3.86904 0.44313 76.2343 <.0001 47.896
bus 1 3.16319 0.45027 49.3530 <.0001 23.646
cost 1 -0.01550 0.00441 12.3671 0.0004 0.985
time 1 -0.09612 0.01044 84.7778 <.0001 0.908
air_inc 1 0.01329 0.01026 1.6763 0.1954 1.013

While the MDC procedure reports t statistics, the PHREG procedure reports chi-squared
statistics, which are the squares of those t statistics (e.g., 12.3671 is approximately (-3.52)^2;
the small gap is due to rounding). The PHREG procedure presents the hazard ratio in the last
column of the output, which is equivalent to the factor change under the e^b column of the
SPost .listcoef command.
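
The link between the two SAS outputs can be checked numerically. This sketch, offered as an illustration rather than as SAS's internal computation, squares the t statistic for cost (using the unrounded coefficient and standard error from the LIMDEP output in Section 7.3), and rebuilds AIC and SBC from the log likelihood with K=6 parameters and N=210 subjects.

```python
import math

# cost coefficient and standard error (full precision, from the LIMDEP output)
b_cost, se_cost = -0.01550152532, 0.0044079931

t_value = b_cost / se_cost
wald_chi2 = t_value ** 2                # chi-square reported by PHREG
hazard_ratio = math.exp(b_cost)         # same as the factor change e^b

logl, k, n = -199.12837, 6, 210
minus2logl = -2.0 * logl
aic = minus2logl + 2 * k
sbc = minus2logl + k * math.log(n)      # Schwarz criterion

print(f"t = {t_value:.2f}, chi2 = {wald_chi2:.4f}, hazard ratio = {hazard_ratio:.3f}")
print(f"-2 LOG L = {minus2logl:.3f}, AIC = {aic:.3f}, SBC = {sbc:.3f}")
```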


7.3 Conditional Logit in LIMDEP (Clogit$)

LIMDEP fits the conditional logit model using either the Clogit$ or the Logit$ command. The
Clogit$ command has the Choices$ subcommand to list the choices available.

CLOGIT;
Lhs=choice;
Rhs=air,train,bus,cost,time,air_inc;
Choices=air,train,bus,car$

Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Discrete choice (multinomial logit) model |
| Maximum Likelihood Estimates |
| Model estimated: Sep 19, 2005 at 09:20:39PM.|
| Dependent variable Choice |
| Weighting variable None |
| Number of observations 210 |
| Iterations completed 6 |
| Log likelihood function -199.1284 |
| Log-L for Choice model = -199.12837 |
| R2=1-LogL/LogL* Log-L fncn R-sqrd RsqAdj |
| Constants only -283.7588 .29825 .29150 |
| Response data are given as ind. choice. |
| Number of obs.= 210, skipped 0 bad obs. |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] |
+---------+--------------+----------------+--------+---------+
AIR 5.207443299 .77905514 6.684 .0000
TRAIN 3.869042702 .44312685 8.731 .0000
BUS 3.163194212 .45026593 7.025 .0000
COST -.1550152532E-01 .44079931E-02 -3.517 .0004
TIME -.9612479610E-01 .10439847E-01 -9.207 .0000
AIR_INC .1328702625E-01 .10262407E-01 1.295 .1954
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

The Clogit$ command has the Ias$ subcommand to conduct the Hausman test for the IIA
assumption (e.g., Ias=air,bus$). Unfortunately, the subcommand does not work in this model
because the Hessian is not positive definite.

The Logit$ command takes the panel data analysis approach. The Pds$ subcommand specifies
the number of time periods. The two commands produce the same result.

LOGIT;
Lhs=choice;
Rhs=air,train,bus,cost,time,air_inc;
Pds=4$

+--------------------------------------------------+
| Panel Data Binomial Logit Model |
| Number of individuals = 210 |
| Number of periods = 4 |
| Conditioning event is the sum of CHOICE |
| Distribution of sums over the 4 periods: |
| Sum 0 1 2 3 4 5 6 |
| Number 0 210 0 0 0 5 10 |
| Pct. .00100.00 .00 .00 .00 .00 .00 |
+--------------------------------------------------+
Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Logit Model for Panel Data |
| Maximum Likelihood Estimates |
| Model estimated: Sep 19, 2005 at 09:21:58PM.|
| Dependent variable CHOICE |
| Weighting variable None |
| Number of observations 840 |
| Iterations completed 6 |
| Log likelihood function -199.1284 |
| Hosmer-Lemeshow chi-squared = 251.24482 |
| P-value= .00000 with deg.fr. = 8 |
| Fixed Effects Logit Model for Panel Data |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] |
+---------+--------------+----------------+--------+---------+
AIR 5.207443299 .77905514 6.684 .0000
TRAIN 3.869042702 .44312685 8.731 .0000
BUS 3.163194212 .45026593 7.025 .0000
COST -.1550152532E-01 .44079931E-02 -3.517 .0004
TIME -.9612479610E-01 .10439847E-01 -9.207 .0000
AIR_INC .1328702625E-01 .10262407E-01 1.295 .1954
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)


7.4 Conditional Logit in SPSS

Like the SAS PHREG procedure, the SPSS Coxreg command, which was designed for survival
analysis data, provides a backdoor way of estimating the conditional logit model.

COXREG failure WITH air train bus cost time air_inc
/STATUS=choice(1)
/STRATA=subject.

8. The Nested Logit Regression Model

Now, consider a nested structure of choices. When the IIA assumption is violated, one of the
alternatives is the nested (multinomial) logit model. This chapter replicates the nested logit
model discussed in Greene (2003).

$$P(choice, branch) = P(choice \mid branch) \times P(branch)$$

$$P(choice \mid branch) = P_{child}(\alpha_1 air + \alpha_2 train + \alpha_3 bus + \beta_1 cost + \beta_2 time)$$

$$P(branch) = P_{parent}(\gamma_{income}\, air\_inc + \tau_{fly} IV_{fly} + \tau_{ground} IV_{ground})$$


8.1 Nested Logit in STATA (.nlogit)

The STATA .nlogit command estimates the nested multinomial logit model. First you need to
create a variable based on the specification of the tree using the .nlogitgen command. From the
top, the parent-level has fly and ground branches; the fly branch of the child-level has air flight
(1); the ground branch has train (2), bus (3), and car (4).

. nlogitgen tree = mode(fly: 1, ground: 2 | 3 | 4)

new variable tree is generated with 2 groups
label list lb_tree
lb_tree:
           1 fly
           2 ground

The .nlogittree command displays the tree structure defined by the .nlogitgen command.

. nlogittree mode tree

tree structure specified for the nested logit model

 top --> bottom

   tree     mode
--------------------------
    fly        1
 ground        2
               3
               4

The .nlogit command consists of three parts. The dependent or choice variable follows the
command. Utility functions of the parent and child-levels are then specified. The group() option
specifies an identification or grouping variable.

. nlogit choice (mode=air train bus cost time) (tree=air_inc), ///
group(subject) notree nolog

Nested logit regression
Levels             =          2                 Number of obs      =       840
Dependent variable =     choice                 LR chi2(8)         =  194.9313
Log likelihood     = -193.65615                 Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mode         |
         air |   6.042255   1.198907     5.04   0.000     3.692441     8.39207
       train |   5.064679   .6620317     7.65   0.000     3.767121    6.362237
         bus |   4.096302   .6151582     6.66   0.000     2.890614     5.30199
        cost |  -.0315888   .0081566    -3.87   0.000    -.0475754   -.0156022
        time |  -.1126183   .0141293    -7.97   0.000    -.1403111   -.0849254
-------------+----------------------------------------------------------------
tree         |
     air_inc |   .0153337   .0093814     1.63   0.102    -.0030534    .0337209
-------------+----------------------------------------------------------------
(incl. value |
 parameters) |
tree         |
        /fly |   .5859993   .1406199     4.17   0.000     .3103894    .8616092
     /ground |   .3889488   .1236623     3.15   0.002     .1465753    .6313224
------------------------------------------------------------------------------
LR test of homoskedasticity (iv = 1): chi2(2)=   10.94   Prob > chi2 = 0.0042
------------------------------------------------------------------------------

The notree option suppresses the display of the tree structure, and the nolog option suppresses
the iteration log of the log likelihood. Note that /// joins the next command line with the
current line.
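
The likelihood ratio test at the bottom of the .nlogit output appears to compare the nested logit with the conditional logit of Section 7.1, in which both inclusive value parameters equal one. Under that reading, the statistic is twice the difference of the two log likelihoods, with two degrees of freedom. A minimal sketch (for df=2 the chi-squared upper tail is exactly exp(-x/2)):

```python
import math

logl_nested = -193.65615       # .nlogit log likelihood
logl_conditional = -199.12837  # .clogit log likelihood (Section 7.1)

lr_chi2 = 2.0 * (logl_nested - logl_conditional)
p_value = math.exp(-lr_chi2 / 2.0)  # chi-squared upper tail with df = 2

print(f"LR chi2(2) = {lr_chi2:.2f}, Prob > chi2 = {p_value:.4f}")
```

The result agrees with the reported chi2(2) of 10.94 and p-value of 0.0042, so the restriction iv = 1 is rejected at the .01 level.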


8.2 Nested Logit in SAS

The SAS MDC procedure fits the conditional logit model as well as the nested multinomial logit
model. For the nested logit model, you have to use the UTILITY statement to specify utility
functions of the parent (level 2) and child level (level 1), and the NEST statement to construct
the decision-tree structure. Note that 2 3 4 @ 2 means that three nodes (2, 3, and 4) at the child
level fall under branch 2 at the parent level.

PROC MDC DATA=masil.travel;
MODEL choice = air train bus cost time air_inc /TYPE=NLOGIT CHOICE=(mode);
ID subject;
UTILITY U(1,) = air train bus cost time,
U(2, 1 2) = air_inc;
NEST LEVEL(1) = (1 @ 1, 2 3 4 @ 2),
LEVEL(2) = (1 2 @ 1);
RUN;

The MDC Procedure

Nested Logit Estimates

Algorithm converged.


Model Fit Summary

Dependent Variable choice
Number of Observations 210
Number of Cases 840
Log Likelihood -193.65615
Maximum Absolute Gradient 0.0000147
Number of Iterations 15
Optimization Method Newton-Raphson
AIC 403.31230
Schwarz Criterion 430.08916


Discrete Response Profile

Index mode Frequency Percent

0 1 58 27.62
1 2 63 30.00
2 3 30 14.29
3 4 59 28.10


Goodness-of-Fit Measures

Measure Value Formula

Likelihood Ratio (R) 194.93 2 * (LogL - LogL0)
Upper Bound of R (U) 582.24 - 2 * LogL0
Aldrich-Nelson 0.4814 R / (R+N)
Cragg-Uhler 1 0.6048 1 - exp(-R/N)
Cragg-Uhler 2 0.6451 (1-exp(-R/N)) / (1-exp(-U/N))
Estrella 0.6771 1 - (1-R/U)^(U/N)
Adjusted Estrella 0.6485 1 - ((LogL-K)/LogL0)^(-2/N*LogL0)
McFadden's LRI 0.3348 R / U
Veall-Zimmermann 0.655 (R * (U+N)) / (U * (R+N))

N = # of observations, K = # of regressors


Nested Logit Estimates

Parameter Estimates

Standard Approx
Parameter DF Estimate Error t Value Pr > |t|

air_L1 1 6.0423 1.1989 5.04 <.0001
train_L1 1 5.0646 0.6620 7.65 <.0001
bus_L1 1 4.0963 0.6152 6.66 <.0001
cost_L1 1 -0.0316 0.008156 -3.87 0.0001
time_L1 1 -0.1126 0.0141 -7.97 <.0001
air_inc_L2G1 1 0.0153 0.009381 1.63 0.1022
INC_L2G1C1 1 0.5860 0.1406 4.17 <.0001
INC_L2G1C2 1 0.3890 0.1237 3.15 0.0017

The /fly and /ground in the STATA output above are equivalent to the INC_L2G1C1 and
INC_L2G1C2 in the SAS output. SAS and STATA produce the same result.


9. Conclusion

The appropriate type of categorical dependent variable model (CDVM) is determined largely by
the level of measurement of the dependent variable. The level of measurement should be,
however, considered in conjunction with your theory and research questions (Long 1997). You
must also examine the data generation process (DGP) of a dependent variable to understand its
behavior. Sophisticated researchers pay special attention to censoring, truncation, sample
selection, and other particular patterns of the DGP.

If your dependent variable is a binary variable, you may use the binary logit or probit regression
model. For ordinal responses, try to fit the ordered logit/probit regression models. If you have a
nominal response variable, investigate the DGP carefully and then choose one of the multinomial
logit, conditional logit, and nested logit models. In order to use the conditional logit and nested
logit, the data set requires a different setup.

You should check the key assumptions of the CDVMs when fitting the models. Examples are the
parallel regression assumption in the ordered logit model and the independence of irrelevant
alternatives (IIA) assumption in the multinomial logit model. You may conduct the Brant test
and Hausman test for these assumptions.

Since CDVMs are nonlinear, they produce estimates that are difficult to interpret intuitively.
Consequently, researchers need to spend more time and effort interpreting the results
substantively. Reporting parameter estimates and goodness-of-fit statistics is not sufficient. J.
Scott Long (1997) and Long and Freese (2003) provide good examples of meaningful
interpretations using predicted probabilities, factor changes in odds, and marginal/discrete
changes of predicted probabilities.

Regarding statistical software for CDVMs, I would recommend the SAS QLIM and MDC
procedures of SAS/ETS (see Tables 3 and 4). SAS has other procedures such as LOGISTIC,
GENMOD, and PROBIT for CDVMs, but the QLIM procedure seems best for binary and
ordinal response models, and the MDC procedure is good for nominal dependent variable models.
I also strongly recommend STATA with SPost, since it has various useful commands for
CDVMs such as .prchange, .listcoef, and .prtab. I encourage SAS Institute to develop
additional statements similar to those SPost commands.

LIMDEP supports various CDVMs addressed in Greene (2003) but does not seem stable and
reliable. Thus, I recommend LIMDEP for CDVMs that SAS and STATA do not support. SPSS is
not currently recommended for CDVMs.
Appendix: Data Sets

The first data set, students, is a subset of data provided for David H. Good's class in the School
of Public and Environmental Affairs (SPEA). The data were manipulated for the sake of data
security.

owncar: 1 if a student owns a car
parking: Illegal parking (0=none, 1=sometimes, and 2=often)
offcamp: 1 if a student lives off-campus
transmode: the mode of transportation (0=walk, 1=bike, 2=bus, 3=car)
age: student's age
income: monthly income
male: 1 for male and 0 for female


. tab male owncar

           |        owncar
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |        76        111 |       187
         1 |        77        173 |       250
-----------+----------------------+----------
     Total |       153        284 |       437


. tab male offcamp

           |        offcamp
      male |         0          1 |     Total
-----------+----------------------+----------
         0 |         7        180 |       187
         1 |         5        245 |       250
-----------+----------------------+----------
     Total |        12        425 |       437


. tab male parking

           |             parking
      male |         0          1          2 |     Total
-----------+---------------------------------+----------
         0 |       170         13          4 |       187
         1 |       243          7          0 |       250
-----------+---------------------------------+----------
     Total |       413         20          4 |       437


. tab male transmode

           |                  transmode
      male |         0          1          2          3 |     Total
-----------+--------------------------------------------+----------
         0 |        38         18         20        111 |       187
         1 |        34         21         22        173 |       250
-----------+--------------------------------------------+----------
     Total |        72         39         42        284 |       437

. sum income age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       437    .6168398      .17918         .4      1.227
         age |       437    20.69108    1.610812         18         29


The second data set, travel, on travel mode choice is adopted from Greene (2003). You may get
the data from http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm

subject: identification number
mode: 1=Air, 2=Train, 3=Bus, 4=Car
choice: 1 if the travel mode is chosen
time: terminal waiting time, 0 for car
cost: generalized cost measure
income: household income
air_inc: interaction of air flight and household income, air*income
air: 1 for the air flight mode, 0 for others
train: 1 for the train mode, 0 for others
bus: 1 for the bus mode, 0 for others
car: 1 for the car mode, 0 for others
failure: failure time variable, 1-choice

. tab choice mode

           |                    mode
    choice |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
         0 |       152        147        180        151 |       630
         1 |        58         63         30         59 |       210
-----------+--------------------------------------------+----------
     Total |       210        210        210        210 |       840


. sum time income air_inc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        time |       840    34.58929    24.94861          0         99
      income |       840    34.54762    19.67604          2         72
     air_inc |       840    8.636905    17.91206          0         72
References

Allison, Paul D. 1991. Logistic Regression Using the SAS System: Theory and Application. Cary,
NC: SAS Institute.
Greene, William H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide. Plainview, New
York: Econometric Software.
Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
Long, J. Scott, and Jeremy Freese. 2003. Regression Models for Categorical Dependent
Variables Using STATA, 2nd ed. College Station, TX: STATA Press.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables.
Advanced Quantitative Techniques in the Social Sciences. Sage Publications.
Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. New York:
Cambridge University Press.
SAS Institute. 2004. SAS/STAT 9.1 User's Guide. Cary, NC: SAS Institute.
SPSS Inc. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.
STATA Press. 2004. STATA Base Reference Manual, Release 8. College Station, TX: STATA
Press.
Stokes, Maura E., Charles S. Davis, and Gary G. Koch. 2000. Categorical Data Analysis Using
the SAS System, 2nd ed. Cary, NC: SAS Institute.



Acknowledgements

I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and
Mathematical Computing for comments and suggestions. I also thank J. Scott Long in Sociology
and David H. Good in the School of Public and Environmental Affairs, Indiana University, for
their insightful lectures and data set.



Revision History

2003. First draft
2004. Second draft
2005. Third draft (Added bivariate logit/probit models and the nested logit model with
LIMDEP examples).
