
EPID 602

Winter 2019
Homework 3: Linear Regression

Honor Code

I pledge on my honor that:


I have completed all steps of the attached homework on my own,
I have not used any unauthorized materials while completing this homework, and
I have not given anyone else access to my homework.

Please electronically sign the following honor code below:

Your Name: ___________Stacy Huang________________________

Your Student ID: ____________61460380___________________

Signature and date:

______Stacy Huang 3/8/19_______________________________________

In the homework assignments we will be using data from the Child Health and Development
Studies to answer the question: Is girls’ age of menarche affected by their mothers’ cigarette
smoking during pregnancy? The data sets are located on the Epid 602 Canvas site. In the third
assignment we will use linear regression to explore the associations between mothers’
cigarette smoking during pregnancy and daughters’ ages at menarche.

Please remember that you may consult with each other while working on this assignment but
you must run your own code, write your answers in your own words, and submit your own
assignment. Please upload an electronic Word document (no PDFs will be accepted) of your
assignment, with your code pasted in on the last page, via Canvas no later than 1 PM on
3/15/19. Only output that is relevant to the questions should be pasted in below each question.
Please do not change the numbering or ordering of the questions.

Be sure that: 1) Your SAS code runs from start to finish.
2) Your results make sense (check your sample size and look for unreasonable,
unlikely, or impossible answers).
3) Your code is well commented (the top of your file should include the
homework number and your name, each question should be identified in the
code, and each new task should be described by comments) and formatted
(indentation and carriage returns should be used to improve readability). 5% will
be deducted if either of these two tasks is not completed.

1. (1 point) Remember that in HW1 you created a permanent library in the Private folder of
your IFS space called chds. Save a copy of the SAS dataset hw3and4 (located in the CHDS
folder on the Canvas site) to the folder you are using for your permanent library. This is the
dataset we created in HW2; now we’re going to use it for our HW3 analyses. If you like
using formats, also include a separate libname statement to reference your format library
from HW2. Remember that all files and libraries should be located in your Private folder on
your IFS space or similar.

2. (4 points) We’re going to conduct a complete-case analysis that excludes observations with
missing data (we’ll talk about other options for handling missing data later in the semester).
a. Create a new dataset based on hw3and4 containing only observations with nonmissing
information for the outcome (TEENMENS), exposure (MOMCIGS), and covariates
(MOMMENS3, PARITY3, MOMED3, INCOME3, and RACE3).

NOTE: There were 1003 observations read from the data set CHDS.HW3AND4.
NOTE: The data set WORK.hw34 has 826 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 0.08 seconds
cpu time 0.03 seconds

b. How many observations are in the new dataset? What percent of observations in the
original dataset hw3and4 contained missing information?

826 observations. (1003-826)/1003 = 0.18, so 18% of observations in the original
dataset contained missing information.
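
As a cross-check on these counts, a minimal sketch could tally complete cases directly from the original dataset. It assumes all seven analysis variables are numeric, as the missing-value comparisons in the pasted code imply; the counter names n_total, n_complete, and pct_missing are hypothetical.

```sas
/* Sketch: count complete cases and the percent of observations excluded.
   Assumes the seven analysis variables are numeric.
   n_total, n_complete, and pct_missing are illustrative names. */
DATA _NULL_;
    SET chds.hw3and4 END=last;
    n_total + 1;
    /* NMISS returns the number of missing numeric arguments */
    IF NMISS(TEENMENS, MOMCIGS, MOMMENS3, PARITY3, MOMED3, INCOME3, RACE3) = 0
        THEN n_complete + 1;
    IF last THEN DO;
        pct_missing = 100 * (n_total - n_complete) / n_total;
        PUT n_total= n_complete= pct_missing= 5.1;
    END;
RUN;
```

The counts are written to the SAS log, so they can be compared with the NOTE lines pasted under question 2a.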

3. (7 points) We will start with a simple linear regression.


a. In your new dataset, use PROC GLM or PROC REG to run a simple linear regression of
the daughters’ ages at menarche (TEENMENS) on the simplified variable of mothers’
smoking during pregnancy (MOMCIGS).
• In PROC GLM, do not include a CLASS statement; in PROC REG, do not create dummy
variables for the values of MOMCIGS.
• Ask SAS to include the parameter estimates and confidence intervals in the output.

b. What does this model assume about the shape of the association between age at
menarche and mother’s smoking during pregnancy?

This model assumes that the shape of the association between age at menarche and
mother’s smoking during pregnancy is a straight line (linear).

c. Report and interpret the effect estimate for mother’s smoking during pregnancy.
Include the 95% confidence interval. Also interpret the p-value of the effect estimate.

Beta = -0.06113, 95% CI = (-0.13564, 0.01337), p-value = 0.1077

For every 1-unit increase in smoking category, mean age at menarche decreases by
0.06113 years (95% CI: a decrease of 0.13564 years to an increase of 0.01337 years).

The p-value (0.1077) is greater than 0.05, so we fail to reject the null hypothesis that
mother’s smoking during pregnancy and age at menarche are not associated.
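
The question allows PROC REG as an alternative to PROC GLM. A minimal sketch of the same simple regression in PROC REG, assuming the complete-case dataset is named hw34 as in the code at the end of this document, might look like this:

```sas
/* Sketch: the same simple linear regression in PROC REG.
   CLB requests confidence limits for the parameter estimates. */
PROC REG DATA=hw34;
    MODEL TEENMENS = MOMCIGS / CLB;
RUN;
QUIT;
```

Because MOMCIGS enters as a single numeric term in both procedures, the slope and its confidence interval should match the PROC GLM output.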

4. (15 points) Now we’ll add a CLASS statement.


a. Repeat the simple linear regression, but this time include a CLASS statement in PROC
GLM or use dummy variables in PROC REG. Use the mothers who did not smoke at all
during pregnancy as the referent group.
• So, do include a CLASS statement in PROC GLM or use dummy variables in PROC
REG. Use the mothers who did not smoke at all during pregnancy as the referent
group.
• Ask SAS to include the parameter estimates and confidence intervals in the output.
• In addition, save the predicted values, residuals, studentized residuals, and Cook’s
distances in a separate dataset.

b. How do our assumptions regarding the shape of the association between age at
menarche and mother’s smoking during pregnancy differ between this model and the
model in question 3?

With this model, we no longer assume that the association between age at
menarche and mother’s smoking during pregnancy follows a straight line. Each smoking
category gets its own mean, so the shape of the association may be non-linear.

c. Report and interpret each effect estimate for mother’s smoking during pregnancy.
Include the 95% confidence intervals.

MOMCIGS 0: 12.9987 (12.8945, 13.1029). The expected mean age at menarche of girls
whose mothers did not smoke during pregnancy is 12.9987 years.

MOMCIGS 1: -0.1338 (-0.3792, 0.1116). Girls whose mothers smoked 1-9 cigarettes
per day during pregnancy are expected to be 0.1338 years younger at menarche, on
average, than girls whose mothers did not smoke during pregnancy (reference
group).

MOMCIGS 2: 0.0364 (-0.2972, 0.3699). Girls whose mothers smoked 10-19
cigarettes per day during pregnancy are expected to be 0.0364 years older at menarche,
on average, than girls whose mothers did not smoke during pregnancy
(reference group).

MOMCIGS 3: -0.2157 (-0.4529, 0.0216). Girls whose mothers smoked >=20
cigarettes per day during pregnancy are expected to be 0.2157 years younger at
menarche, on average, than girls whose mothers did not smoke during pregnancy
(reference group).
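
The PROC REG route mentioned in part a can be sketched with indicator (dummy) variables. This assumes MOMCIGS is coded 0 to 3 with no missing values in the complete-case dataset hw34; the dataset name hw34_dummies and the variable names cigs1 to cigs3 are hypothetical.

```sas
/* Sketch: dummy variables for MOMCIGS with category 0 as the referent.
   hw34_dummies, cigs1, cigs2, and cigs3 are illustrative names. */
DATA hw34_dummies;
    SET hw34;
    cigs1 = (MOMCIGS = 1);   /* 1 if 1-9 cigarettes per day, else 0   */
    cigs2 = (MOMCIGS = 2);   /* 1 if 10-19 cigarettes per day, else 0 */
    cigs3 = (MOMCIGS = 3);   /* 1 if >=20 cigarettes per day, else 0  */
RUN;

PROC REG DATA=hw34_dummies;
    MODEL TEENMENS = cigs1 cigs2 cigs3 / CLB;
RUN;
QUIT;
```

With this coding the intercept is the mean for category 0 and each dummy coefficient is the difference from that referent, which is the same parameterization reported above.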

d. Given these results, do you think using the model from question 3 is a reasonable
choice? Why or why not?

I do not think the model from question 3 is a reasonable choice, because these results
suggest a non-linear association between age at menarche and smoking during
pregnancy. The effect estimates and LSMEANS show that mean age at menarche differs
little across the smoking categories. The pattern is also not monotonic: mean age at
menarche decreases slightly from smoking category 0 to 1, increases slightly from
category 1 to 2, and decreases slightly from category 2 to 3.
Therefore, the model from question 4 is better since it does not assume linearity.

e. Report and interpret the R2 for this model. Does it seem consistent with the effect
estimates?

R2 = 0.004901
This means that 0.4901% of the variation in age at menarche is explained by the
model with smoking during pregnancy as a categorical variable. Yes, this is consistent
with the effect estimates: the differences between smoking categories are small and
every confidence interval includes 0, so smoking category explains very little of the
variation in age at menarche.

f. Use PROC UNIVARIATE to examine the distribution of the studentized residuals. Include
a histogram and qqplot and paste the plots below. Describe the plots with respect to
normality and interpret the tests for normality from the output. Do you think we have
violated the linear regression assumption of normality of the errors?

From the histogram, we can see that the studentized residuals are approximately
normally distributed. The QQ-plot also suggests the studentized residuals are normally
distributed because most of the points follow the straight line, with only a few small
deviations. All the tests for normality (i.e., Shapiro-Wilk, Kolmogorov-
Smirnov, etc.) have p-values less than 0.05, which means the null hypothesis
that the studentized residuals have a normal distribution is rejected. However,
these tests are sensitive to small deviations, especially in large samples, and with a
sample as large as the one in this study we can afford some departure from normality.
It is therefore better to check the normality assumption with graphs such as histograms
or QQ-plots of the residuals. Since the histogram and QQ-plot suggest the studentized
residuals are approximately normal, we can conclude that the linear regression
assumption of normality of the errors was not violated.

5. (16 points) If we were conducting this analysis in real life, we would go through the model-
building steps we discussed in class to choose which of our potential confounders to include
in adjusted models. To make things simpler for the homework assignment, we’re going to
skip that step and all run the same adjusted model.

a. Run a multiple regression adjusted for mother’s age at menarche (MOMMENS3),
mother’s parity (PARITY3), mother’s education (MOMED3), family income (INCOME3),
and child’s race (RACE3).
• Include a CLASS statement or create dummy variables for all variables. Use the
mothers who did not smoke at all during pregnancy as the referent group for
MOMCIGS. Don’t worry too much about the referent groups for the other
variables—this can be important for interpretation, but we will not be interpreting

the results for the other variables in this assignment. Changing the referent group
for the other variables will not affect the results for MOMCIGS.
• Ask SAS to include the parameter estimates and confidence intervals in the output.
• Save the predicted values, residuals, studentized residuals, and Cook’s distances in a
separate dataset.
• Also include a partial F test for each of the independent variables. In PROC GLM, this
is the Type III SS results. In PROC REG, you can do this with TEST statements.
• Finally, ask SAS to provide a panel of diagnostic plots. Depending on the version of
SAS you use, you may have to turn ods graphics on before running the regression
and turn it off afterwards. (Hint: In PROC GLM, you can get diagnostic plots by
including PLOT=DIAGNOSTICS in the PROC GLM statement.)

b. Do these estimates differ substantially from the simple regression estimates? Do you
think the unadjusted results were confounded by the other variables?

Yes, these estimates differ substantially from the simple regression estimates. Two of
the estimates (MOMCIGS 1 and MOMCIGS 3) changed by more than 10% compared to
the simple regression model. Therefore, I think the unadjusted results were
confounded by the other variables.
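
The 10% change-in-estimate comparison can be made explicit with a short data step. This is only a sketch: the dataset name pct_change and its variable names are hypothetical, and the estimates are typed in from the table in question 7.

```sas
/* Sketch: percent change from the unadjusted to the adjusted estimate.
   pct_change, category, unadj, adj, and pct are illustrative names;
   the values are copied from Table 3. */
DATA pct_change;
    INPUT category $ unadj adj;
    pct = 100 * (adj - unadj) / unadj;
    DATALINES;
1-9   -0.1338 -0.0541
10-19  0.0364  0.0343
>=20  -0.2157 -0.1724
;
RUN;

PROC PRINT DATA=pct_change;
    FORMAT pct 6.1;
RUN;
```

In magnitude, the changes work out to roughly 60% for 1-9 cigarettes per day, 6% for 10-19, and 20% for >=20, which is why only two of the three categories cross the 10% threshold.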

c. Report and interpret the R2 for this model. How does it differ from the simple regression
model?

R2 = 0.08469
This means that 8.469% of the variation in age at menarche is explained by the linear
model with smoking during pregnancy, adjusting for mother’s age at menarche,
mother’s parity, mother’s education, family income, and child’s race. This R2 is still
very low, but it is higher than the R2 from the simple regression model (0.004901), so
the multiple regression model explains more of the variation. Note that R2 cannot
decrease when covariates are added, so a higher R2 alone does not show that the model
is better.

d. Report and interpret the partial F-statistic and associated p-value for mother’s smoking
during pregnancy. According to the partial F-tests, which covariates are statistically
significantly associated with the outcome at the 95% confidence level when the other
variables are included in the model?

F = 0.77 p-value = 0.5088


The small F-statistic corresponds to a p-value that is not significant. Therefore, we fail
to reject the null hypothesis that smoking during pregnancy and age at menarche are
not associated (adjusting for mother’s age at menarche, mother’s parity, mother’s
education, family income, and child’s race). Based on the partial F-tests, covariates
that are statistically significantly associated with the outcome include mother’s age at
menarche and child’s race.

e. Paste the diagnostics plots below. According to the diagnostics plots, does the
assumption of normality of the errors hold? Explain your answer.

Yes, the assumption of normality of the errors holds. The histogram of the residuals
appears approximately normal, and in the QQ-plot most of the points
follow the straight line with only slight deviation.

f. According to the diagnostics plots, does the assumption of homoscedasticity hold?
Explain your answer.

Yes, the assumption of constant variance holds. The plots of the residuals vs. predicted
value and Rstudent vs. predicted value both show a random scatter of points with no
clear pattern, which indicates constant variance.

6. (8 points) We’re going to check for influential observations.

a. According to the diagnostics plots, are there any observations with a very high Cook’s
distance relative to the other observations? How many?

Yes, 46 observations have a very high Cook’s distance relative to the
other observations (Cook’s distance greater than 4/n = 4/826).

b. According to your separate output dataset, how many observations have a studentized
residual value with an absolute value greater than 3?

4 observations have a studentized residual value with an absolute value greater than
3.
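
Both counts can be obtained in one step instead of reading them off printed output. The sketch below assumes regout2 is the output dataset from question 5 with variables cd and rstud, and reuses the 4/826 cutoff from the pasted code; the column names n_high_cookd and n_large_rstud are hypothetical.

```sas
/* Sketch: count flagged observations in the question 5 output dataset.
   A comparison evaluates to 1 (true) or 0 (false), so SUM counts matches.
   n_high_cookd and n_large_rstud are illustrative names. */
PROC SQL;
    SELECT SUM(cd > 4/826)     AS n_high_cookd,
           SUM(ABS(rstud) > 3) AS n_large_rstud
    FROM regout2;
QUIT;
```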

c. Conduct a sensitivity analysis by repeating the multiple regression, omitting the
observations with large-magnitude residuals you identified in part b. Report the new
effect estimates for mother’s smoking during pregnancy. Did the results change
substantially?

MOMCIGS 1: -0.04806
MOMCIGS 2: 0.04299
MOMCIGS 3: -0.1355

Yes, the results changed substantially. All three effect estimates changed by more than
10% compared with the adjusted model in question 5.
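
An alternative to listing observation IDs by hand is to filter on the saved residuals. This is a sketch only, assuming regout2 still holds the question 5 output (the original variables plus rstud) and that no rstud values are missing; the output dataset name regout_sens is hypothetical.

```sas
/* Sketch: sensitivity analysis that drops observations with |rstud| > 3.
   Assumes regout2 is the OUTPUT dataset from question 5.
   regout_sens is an illustrative name. */
PROC GLM DATA=regout2;
    WHERE ABS(rstud) <= 3;
    CLASS MOMCIGS (REF = '0: do not smoke') MOMMENS3 PARITY3 MOMED3 INCOME3
          RACE3;
    MODEL TEENMENS = MOMCIGS MOMMENS3 PARITY3 MOMED3 INCOME3 RACE3 /
          SOLUTION clparm;
    OUTPUT out=regout_sens p=predict_sens rstudent=rstud_sens;
RUN;
QUIT;
```

Filtering on the residual keeps the exclusion rule visible in the code, so the analysis does not depend on a typed list of IDs staying in sync with the data.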

7. (9 points) Now we’ll create a table using our model results.


a. Use your results from questions 4 and 5 to fill in the table below. We are using results
from the models including all observations, regardless of your conclusions from
question 6. Don’t forget to fill in the footnotes.

Table 3. Mean differences in child age at menarche, Child Health and Development Study,
California, pregnancy years 1959–1966ᵃ

                             Unadjusted                          Adjustedᵇ
Maternal prenatal smoking    Difference  95% Confidence          Difference  95% Confidence
(cigarettes/day)                         Interval                            Interval
0                             0.0                                 0.0
1–9                          -0.1338     (-0.3792, 0.1116)       -0.0541     (-0.2952, 0.1870)
10–19                         0.0364     (-0.2972, 0.3699)        0.0343     (-0.2910, 0.3597)
≥ 20                         -0.2157     (-0.4529, 0.0216)       -0.1724     (-0.4070, 0.0623)

ᵃ A sample size of 826 was used for both the adjusted and unadjusted regression models.
ᵇ Variables adjusted for in the multiple linear regression model were mother’s age at menarche,
mother’s parity, mother’s education, family income, and child’s race.

b. Write a short (2–5 sentences) summary of the results of your analysis. Include effect
estimates and 95% confidence intervals as appropriate. The goal is to present a full and
accurate picture of your results while also remaining concise.

Based on my analysis, the relationship between daughters’ age at menarche and mothers’
smoking during pregnancy does not appear to be linear. In Table 3, the effect estimates
for the smoking categories do not change monotonically relative to the reference
category (did not smoke) in either the unadjusted or the adjusted model. In both models,
the confidence interval for each smoking category includes 0, so no smoking category
differs significantly from the reference group in mean age at menarche. Comparing the
unadjusted and adjusted models, the effect estimates for 1-9 cigarettes per day and
>=20 cigarettes per day changed by more than 10%, suggesting that the association
between smoking during pregnancy and age at menarche is confounded by other variables
in the model. The 1-9 cigarettes per day estimate changed from -0.1338 (95% CI:
-0.3792, 0.1116) in the unadjusted model to -0.0541 (95% CI: -0.2952, 0.1870) in the
adjusted model. The >=20 cigarettes per day estimate changed from -0.2157 (95% CI:
-0.4529, 0.0216) in the unadjusted model to -0.1724 (95% CI: -0.4070, 0.0623) in the
adjusted model. The most likely confounders are mother’s age at menarche and child’s
race, because they were the covariates significantly associated with daughters’ age at
menarche in the partial F-tests.

SAS Code

/*******************************************
CLASS: EPID 602
SEC: 4
ASSIG: Homework 3
NAME: Stacy Huang
DATE: 3-15-19
*******************************************/

/*********** QUESTION 1 ***********/

/* Save a copy of the SAS dataset hw3and4 to the folder you are using for
your permanent library. */

LIBNAME chds 'M:\Private\epid 602 hw3';
OPTIONS fmtsearch=(chds);

PROC CONTENTS DATA=chds.hw3and4;
RUN;

/* Creating and applying formats from homework 2. */

PROC FORMAT LIBRARY=chds;
    VALUE smoking 0 = '0: do not smoke'
                  1 = '1: 1-9 cigarettes per day'
                  2 = '2: 10-19 cigarettes per day'
                  3 = '3: >=20 cigarettes per day'
                  . = 'Missing or Unknown Info';
    VALUE MOMMENS 1 = '1: Less than 12 years old'
                  2 = '2: 12-13 years old'
                  3 = '3: 14 years or older'
                  . = 'Missing or Unknown Info';
    /* Each label is kept on one line so no line break ends up inside the string. */
    VALUE MOMED   1 = '1: High school degree or less (includes mothers who attended trade school only)'
                  2 = '2: High school degree plus additional education, but no college degree'
                  3 = '3: College degree or registered nurse'
                  . = 'Missing or Unknown Info';
    VALUE INCOME  1 = '1: Less than 5000'
                  2 = '2: 5000 to 10000 (not including)'
                  3 = '3: 10000 or more'
                  . = 'Missing or Unknown Info';
    VALUE PARITY  1 = '1: Zero previous pregnancies'
                  2 = '2: 1-2 previous pregnancies'
                  3 = '3: 3 or more previous pregnancies'
                  . = 'Missing or Unknown Info';
    VALUE RACE    1 = '1: White'
                  2 = '2: Black'
                  3 = '3: Other race'
                  . = 'Missing or Unknown Info';
RUN;

DATA chds.hw3and4;
SET chds.hw3and4;

FORMAT MOMCIGS smoking. MOMMENS3 MOMMENS. MOMED3 MOMED. INCOME3 INCOME.
PARITY3 PARITY. RACE3 RACE.;
RUN;

/*********** QUESTION 2A ***********/

/* Create a new dataset based on hw3and4 containing only observations with
nonmissing information for the outcome (TEENMENS), exposure (MOMCIGS), and
covariates. */

DATA hw34;
SET chds.hw3and4;
IF TEENMENS = . OR MOMCIGS = . OR MOMMENS3 = . OR PARITY3 = . OR MOMED3
= . OR INCOME3 = . OR RACE3 = . THEN DELETE;
RUN;

/*********** QUESTION 3A ***********/

/* Run a simple linear regression of TEENMENS on MOMCIGS. */

PROC GLM DATA=hw34;
    MODEL TEENMENS = MOMCIGS / clparm;
RUN;
QUIT;

/*********** QUESTION 4A ***********/

/* Repeat the simple linear regression, but this time include a CLASS
statement in PROC GLM. */

ODS GRAPHICS ON;
PROC GLM DATA=hw34 plot=meanplot(cl);
CLASS MOMCIGS (REF = '0: do not smoke');
MODEL TEENMENS = MOMCIGS / SOLUTION clparm;
LSMEANS MOMCIGS;
OUTPUT out=regout p=predict residual=resid student=student
rstudent=rstud cookd=cd;
RUN;
QUIT;
ODS GRAPHICS OFF;

/*********** QUESTION 4F ***********/

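/* Examine the distribution of the studentized residuals (rstud) saved in
question 4A, with tests for normality, a histogram, and a QQ-plot. */

PROC UNIVARIATE DATA=regout NORMAL PLOT;
    VAR rstud;
    HISTOGRAM rstud;
    QQPLOT rstud / normal (mu=est sigma=est);
RUN;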

/*********** QUESTION 5A ***********/

/* Run a multiple regression adjusted for MOMMENS3, PARITY3, MOMED3, INCOME3,
RACE3. Both plot requests are combined in a single PLOTS= option. */

ODS GRAPHICS ON;
PROC GLM DATA=hw34 plots=(diagnostics meanplot(cl));
CLASS MOMCIGS (REF = '0: do not smoke') MOMMENS3 PARITY3 MOMED3 INCOME3
RACE3;
MODEL TEENMENS = MOMCIGS MOMMENS3 PARITY3 MOMED3 INCOME3 RACE3 /
SOLUTION clparm;
LSMEANS MOMCIGS;
OUTPUT out=regout2 p=predict residual=resid student=student
rstudent=rstud cookd=cd;
RUN;
QUIT;
ODS GRAPHICS OFF;

/*********** QUESTION 6A ***********/

/* Ran PROC MEANS to determine how many observations have high Cook's
distance relative to the other observations (cutoff: 4/n, with n = 826). */

PROC MEANS DATA=regout2;
    WHERE cd > 4/826;
    VAR cd;
RUN;

/*********** QUESTION 6B ***********/

/* Used PROC PRINT to determine how many observations have a studentized
residual value with an absolute value greater than 3. */

PROC PRINT DATA=regout2;
    VAR rstud;
    WHERE rstud < -3 or rstud > 3;
    ID woman;
RUN;

/*********** QUESTION 6C ***********/

/* Conduct a sensitivity analysis by repeating the multiple regression,
omitting the observations with large-magnitude residuals from part b.
Output goes to regout3 so the question 5 results in regout2 are not overwritten. */

ODS GRAPHICS ON;
PROC GLM DATA=hw34 plots=(diagnostics meanplot(cl));
    CLASS MOMCIGS (REF = '0: do not smoke') MOMMENS3 PARITY3 MOMED3 INCOME3
          RACE3;
    MODEL TEENMENS = MOMCIGS MOMMENS3 PARITY3 MOMED3 INCOME3 RACE3 /
          SOLUTION clparm;
    WHERE woman not in (8228, 2994, 6568, 3028);
    LSMEANS MOMCIGS;
    OUTPUT out=regout3 p=predict residual=resid student=student
           rstudent=rstud cookd=cd;
RUN;
QUIT;
ODS GRAPHICS OFF;
