Least-squares estimates of the regression coefficients:
$\hat{\beta} = (X'X)^{-1} X'Y$, where $X$ is the $n \times p$ design matrix and $Y$ is the $n \times 1$ response vector.
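To make the matrix formula concrete, here is a minimal sketch in plain Python (outside SAS) that evaluates the least-squares estimate for a straight-line fit by writing out the 2x2 normal equations. The function name `ols_line` and the four data points are made up for illustration; they are not part of the original slides.

```python
# Least-squares estimate b = (X'X)^{-1} X'Y for a straight-line fit,
# written out with plain sums so each matrix entry is visible.

def ols_line(x, y):
    """Return (intercept, slope) from the 2x2 normal equations."""
    n = len(x)
    # X has a column of ones and a column of x values.
    sx = sum(x)
    sxx = sum(v * v for v in x)
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'Y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X'X is singular: x has no variation")
    # (X'X)^{-1} = (1/det) * [[sxx, -sx], [-sx, n]]
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Hypothetical points that lie exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = ols_line(x, y)
print(intercept, slope)  # prints 1.0 2.0
```

With more than one predictor the same formula applies, but the inverse is normally obtained by solving the normal equations numerically rather than by a closed-form expression.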
PROC GLM:
It uses the method of least squares to fit general linear models
relating one or several continuous dependent variables to one or
several independent variables.
Strengths:
direct specification of polynomial effects
ease of specifying categorical effects (PROC GLM automatically
generates dummy variables for class variables)
Weaknesses:
No collinearity diagnostics
No influence diagnostics
No scatter plots
Fits only one model at a time
Regression procedures PROC GLM
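The automatic dummy variables mentioned under Strengths can be pictured with a short Python sketch. It assumes GLM-style coding, in which a class variable contributes one 0/1 indicator column per level; the function name `glm_dummies` and the sample values are hypothetical.

```python
def glm_dummies(values):
    """Return (levels, rows): one 0/1 indicator column per distinct level.

    Levels are taken in sorted order. Every level gets its own column,
    so together with an intercept the columns are linearly dependent.
    """
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical class variable with three levels.
levels, rows = glm_dummies(["b", "a", "c", "a"])
print(levels)  # prints ['a', 'b', 'c']
for row in rows:
    print(row)
```

Each row contains exactly one 1, which marks the level that observation belongs to.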
PROC REG: Provides the most general analysis capabilities
handles multiple regression models
provides nine model-selection methods
allows interactive changes both in the model and in the data used to
fit the model
allows linear equality restrictions on parameters
tests linear hypotheses and multivariate hypotheses
produces collinearity diagnostics, influence diagnostics, and partial
regression leverage plots
saves estimates, predicted values, residuals, confidence limits, and
other diagnostic statistics in output SAS data sets
generates plots of data and of various statistics
Regression procedures PROC REG
PROC REG <DATA=SAS-dataset> ;
MODEL dependent-variable = predictors /
selection=method R CLI CLM ;
PLOT r.*p. ;
RUN ;
QUIT;
*Create the data set before running PROC REG;
DATA insurance;
input time size type @@;
sizetype=size*type; *interaction term;
datalines;
17 151 0 26 92 0 21 175 0 30 31 0 22 104 0
0 277 0 12 210 0 19 120 0 4 290 0 16 238 0
28 164 1 15 272 1 11 295 1 38 68 1 31 85 1
21 224 1 20 166 1 13 305 1 30 124 1 14 246 1
;
RUN;
*Regression using proc reg ;
PROC REG data=insurance;
model time = size type sizetype
/selection=none;
RUN;
*Refit interactively without the interaction term;
delete sizetype;
print;
RUN;
*Residuals and observed values against predicted values;
plot r.*p. time*p.;
RUN;
QUIT;
SELECTION: specifies the model-selection method (forward, backward, etc.).
DELETE: deletes variables from the model.
PRINT: prints the analysis results.
PLOT: produces diagnostic plots.
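What the DELETE step does to the fit can be checked by hand. The sketch below, in plain Python, fits the insurance data with and without the sizetype column and compares the two error sums of squares with an extra-sum-of-squares F statistic. The F comparison is a standard check added here for illustration, not something the slides compute, and the helper names `solve` and `fit` are hypothetical.

```python
# Insurance data from the DATA step, as (time, size, type) triples.
data = [
    (17, 151, 0), (26, 92, 0), (21, 175, 0), (30, 31, 0), (22, 104, 0),
    (0, 277, 0), (12, 210, 0), (19, 120, 0), (4, 290, 0), (16, 238, 0),
    (28, 164, 1), (15, 272, 1), (11, 295, 1), (38, 68, 1), (31, 85, 1),
    (21, 224, 1), (20, 166, 1), (13, 305, 1), (30, 124, 1), (14, 246, 1),
]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit(rows, y):
    """Least-squares fit via the normal equations; returns (coefficients, SSE)."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, sse

y = [float(t) for t, _, _ in data]
full = [[1.0, s, ty, s * ty] for _, s, ty in data]  # intercept, size, type, sizetype
reduced = [row[:3] for row in full]                 # model after DELETE sizetype

beta_full, sse_full = fit(full, y)
beta_reduced, sse_reduced = fit(reduced, y)

# Extra-sum-of-squares F with 1 numerator and n - 4 denominator degrees of freedom.
n, p_full = len(y), 4
f_stat = (sse_reduced - sse_full) / (sse_full / (n - p_full))
print("full model coefficients:   ", beta_full)
print("reduced model coefficients:", beta_reduced)
print("F for dropping sizetype:   ", f_stat)
```

A small F value suggests the interaction term adds little, which is the situation in which deleting sizetype is reasonable.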
*Polynomial regression using proc reg;
PROC REG data=USPopulation;
var YearSq;
model Population=Year / selection=none;
plot r.*p. ;
RUN;
add YearSq;
print;
plot / cframe=ligr;
RUN;
plot (Population predicted. u95. l95.)*Year / overlay cframe=ligr;
RUN;
QUIT;
ODS html;
ODS graphics on;
PROC REG data=USPopulation;
Linear: model Population=Year;
Quadratic: model Population=Year YearSq;
RUN;
QUIT;
ODS graphics off;
ODS html close;
Logistic regression procedures
Logistic models
Binary logistic model: dichotomous response outcomes
e.g.: presence or absence of an event
PROC LOGISTIC fits this model by default, using the logit link.
Ordinal logistic model: ordinal response variable with more than two
ordered categories
e.g.: a 5-point Likert scale
PROC LOGISTIC fits the proportional odds model with CLOGIT link.
Multinomial logistic model: nominal response variables with more than
two categories
e.g.: different types of programs in school
PROC LOGISTIC fits the generalized logit model if you specify the GLOGIT link.
Binary logistic model:
$\pi_i = E(y_i \mid x_i)$, $g(\pi) = \operatorname{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta' X$
Ordinal logistic model:
$g(\Pr(Y \le i \mid X)) = \alpha_i + \beta' X, \quad i = 1, \ldots, k$
Multinomial logistic model:
$\log\left(\frac{\Pr(Y = i \mid X)}{\Pr(Y = k+1 \mid X)}\right) = \alpha_i + \beta_i' X, \quad i = 1, \ldots, k$
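The three model equations differ only in how a linear predictor is turned into probabilities. The following Python sketch shows that mapping for each case. The function names are hypothetical, and the last response category (k+1) is assumed to be the reference for the generalized logit model.

```python
import math

def logit(p):
    """Binary model link: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: probability for a linear predictor eta = alpha + beta'x."""
    return 1.0 / (1.0 + math.exp(-eta))

def cumulative_logit_probs(alphas, eta):
    """Ordinal model: Pr(Y <= i) = inv_logit(alpha_i + eta), i = 1..k.

    alphas must be increasing; eta stands for the shared term beta'x.
    Returns the k+1 individual category probabilities.
    """
    cum = [inv_logit(a + eta) for a in alphas] + [1.0]
    return [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, len(cum))]

def generalized_logit_probs(etas):
    """Multinomial model: etas[i] = alpha_i + beta_i'x for categories 1..k.

    Category k+1 is the reference, with linear predictor 0.
    Returns k+1 probabilities, reference last.
    """
    expo = [math.exp(e) for e in etas]
    denom = 1.0 + sum(expo)
    return [e / denom for e in expo] + [1.0 / denom]

print(inv_logit(0.0))                              # prints 0.5
print(cumulative_logit_probs([-1.0, 0.0, 1.0], 0.0))
print(generalized_logit_probs([0.0, 0.0]))
```

In the ordinal case the slope vector is shared by all cumulative logits and only the intercepts differ, whereas the generalized logit model has a separate slope vector for each non-reference category.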
PROC LOGISTIC <DATA=SAS-dataset> ;
CLASS variables;
MODEL dependent-variable = predictors / options;
RUN ;
Regression procedures PROC LOGISTIC
Binary logistic model
Variable name   Variable information
age             Age in years
ed              Level of education
                1 = didn't complete high school, 2 = high school degree,
                3 = college degree, 4 = undergraduate, 5 = postgraduate
employ          Years with current employer
address         Years at current address
income          Household income in thousands
debtinc         Debt to income ratio (*100)
creddebt        Credit card debt in thousands
othdebt         Other debts in thousands
default         Previously defaulted (1 = Yes; 0 = No)
Goal: identify persons with a high chance of defaulting on a bank loan.
The data are 700 records from the bank database (bankloan).
*Binary logistic model;
PROC MEANS data=sas2.bankloan;
var age employ address income debtinc creddebt othdebt;
class default;
RUN;
PROC LOGISTIC data=sas2.bankloan;
class ed(ref='1') / param=ref;
model default(event='1')= ed age employ address income debtinc
creddebt othdebt
/ selection=stepwise slentry=0.3 slstay=0.35 details
rsquare lackfit;
output out=bankloanpred p=prob lower=lcl upper=ucl xbeta=logit;
ods output parameterestimates=bankloanest;
RUN;
SELECTION: specifies model selection methods.
SLENTRY=0.3 : a significance level of 0.3 is required to allow a variable into the model.
SLSTAY=0.35: a significance level of 0.35 is required for a variable to stay in the model.
DETAILS: produces a detailed account of the variable selection process.
RSQUARE: produces generalized R-square measure.
LACKFIT: produces Hosmer and Lemeshow goodness-of-fit test for the final selected model.
PARAM=REF: specifies the reference cell coding.
REF: specifies the reference level for a categorical predictor.
EVENT: specifies the event category of the binary response, that is, the level whose probability is modeled.
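Reference cell coding, as requested by PARAM=REF with ref='1' for ed, can be sketched in Python as follows. The function name `reference_coding` and the sample education values are hypothetical; the point is only that the reference level gets no column of its own and is represented by a row of zeros.

```python
def reference_coding(values, ref):
    """Return (kept_levels, rows) using reference cell coding.

    One 0/1 column is created for every level except `ref`;
    observations at the reference level get all zeros.
    """
    levels = sorted(set(values))
    if ref not in levels:
        raise ValueError("reference level does not occur in the data")
    kept = [level for level in levels if level != ref]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows

# Hypothetical education codes 1..5 with level 1 as the reference.
kept, rows = reference_coding([1, 3, 5, 2, 4, 1], ref=1)
print(kept)  # prints [2, 3, 4, 5]
for row in rows:
    print(row)
```

Under this coding each coefficient for ed compares one education level with the reference level.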
Proportional Odds Model for Ordinal Logistic Model
Goal: identify factors that influence a person's income category.
*Ordinal logistic model;
DATA income;
set sas2.bankloan;
*Test for missing first: a missing value compares less than any number;
if missing(income) then inccat=.;
else if income < 20 then inccat=1;
else if income < 30 then inccat=2;
else if income < 40 then inccat=3;
else if income < 50 then inccat=4;
else inccat=5;
RUN;
PROC LOGISTIC data=income;
class ed(ref='5') / param=ref;
model inccat = age ed employ address debtinc / link=clogit;
output out=incpred p=prob xbeta=linp;
RUN;
LINK: specifies the link function.
CLOGIT: cumulative logits.
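The income bands used to build inccat can be written as a small Python function to check the boundaries. This is a sketch, with `None` standing in for a SAS missing value and the function name `income_category` chosen for illustration. In SAS a missing numeric value compares less than any number, so a test such as income<20 is true for missing income; the sketch therefore handles the missing case before any comparison.

```python
def income_category(income):
    """Map household income (in thousands) to the ordinal category 1..5.

    Returns None when income is missing (None).
    """
    if income is None:
        return None
    if income < 20:
        return 1
    if income < 30:
        return 2
    if income < 40:
        return 3
    if income < 50:
        return 4
    return 5

for value in [19.9, 20, 39.5, 50, None]:
    print(value, income_category(value))
```

The lower bound of each band is inclusive, so an income of exactly 20 falls in category 2 and an income of exactly 50 in category 5.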
Generalized Logits Model for Multinomial Logistic Model
*Multinomial logistic model;
DATA school;
input school program style $ count;
datalines;
1 1 self 10
1 1 team 17
1 1 class 26
1 2 self 5
1 2 team 12
1 2 class 50
2 1 self 21
2 1 team 17
2 1 class 26
2 2 self 16
2 2 team 12
2 2 class 36
3 1 self 15
3 1 team 15
3 1 class 16
3 2 self 12
3 2 team 12
3 2 class 20
;
Goal: identify differences in study style among
schools and programs.
PROC LOGISTIC data=school;
freq count;
class school program / order=data;
model style = school program school*program / link=glogit;
output out=progstat p=prob;
ods output parameterestimates=progest;
RUN;
PROC FREQ data=progstat;
format prob 5.4;
tables school*program*_level_*prob /list nopercent nocum;
RUN;
DATA progodd;
set progest;
odds=exp(estimate);
RUN;
PROC PRINT data=progodd;
var response estimate odds;
RUN;
LINK: specifies the link function.
GLOGIT: generalized logit function.
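Because the model contains school, program and their interaction, it has as many parameters as there are independent cell proportions, so its predicted probabilities should match the observed proportions in each school-by-program cell. The Python sketch below computes those proportions and the corresponding generalized logits directly from the counts in the school data set. Using "class" as the reference style is an assumption made for illustration; which level SAS treats as the reference depends on how the response levels are ordered. The helper names are hypothetical.

```python
import math

# Counts from the school data set, keyed by (school, program).
counts = {
    (1, 1): {"self": 10, "team": 17, "class": 26},
    (1, 2): {"self": 5, "team": 12, "class": 50},
    (2, 1): {"self": 21, "team": 17, "class": 26},
    (2, 2): {"self": 16, "team": 12, "class": 36},
    (3, 1): {"self": 15, "team": 15, "class": 16},
    (3, 2): {"self": 12, "team": 12, "class": 20},
}

def proportions(cell):
    """Observed proportion of each style within one school-by-program cell."""
    total = sum(cell.values())
    return {style: n / total for style, n in cell.items()}

def generalized_logits(cell, reference):
    """log(Pr(style) / Pr(reference)) for every non-reference style."""
    return {style: math.log(n / cell[reference])
            for style, n in cell.items() if style != reference}

for key in sorted(counts):
    logits = generalized_logits(counts[key], reference="class")
    odds = {style: math.exp(value) for style, value in logits.items()}
    print(key, proportions(counts[key]), odds)
```

Exponentiating a generalized logit gives the odds of that style relative to the reference style within the cell; the odds=exp(estimate) step in the DATA progodd block applies the same transformation to the estimated model parameters.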
Resources and books
Regression methods
Applied Regression Analysis, Linear Models, and Related Methods by John Fox
Regression Analysis by Example by Chatterjee, Hadi and Price
An Introduction to Generalized Linear Models, Second Edition by Annette J. Dobson
Logistic regression and categorical data analysis
Applied Logistic Regression, Second Edition by David Hosmer and Stanley Lemeshow
An Introduction to Categorical Data Analysis by Alan Agresti
CAC statistical consultation support:
CAC statistical WIKI page:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/SAS.aspx
Statistical consultation service: lsun@smu.edu.sg