Final Review2
Fall2014
Dr. Ghamsary
Part I: A researcher begins by asking whether there is a relationship between whether a
person has arteriosclerosis and whether they have high cholesterol. Here high cholesterol is
defined as a total cholesterol level of 250 mg/dL or above. She examines 25 patients with
arteriosclerosis and finds that 20 have high cholesterol. She also examines 25 controls who do
not have arteriosclerosis and finds that 10 have high cholesterol. A contingency table analysis of
her data is shown below. Use it to answer the questions that follow.
Arteriosclerosis \ Cholesterol    High    Low
Yes                               20      5
No                                10      15
Suppose you wanted to show that high cholesterol is associated with increased risk of
arteriosclerosis. Explain what the p-value for this test would be and why.
Based on the OR and its CI (an interval that excludes 1 implies significance at the 5% level),
I would predict that the p-value of this test would be less than 5%.
Estimate the odds ratio for having arteriosclerosis among people who have high cholesterol
versus people who don't. Show your work.
OR = [p1/(1-p1)] / [p2/(1-p2)] = (20/10) / (5/15) = (20*15)/(10*5) = 6
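The arithmetic above can be checked with a short script. This is only a sketch: it takes the counts from the problem statement (20 of 25 cases and 10 of 25 controls have high cholesterol), and the log-based (Woolf) confidence interval is an assumed method, so the statistical printout may report a slightly different interval. The function name `odds_ratio_with_ci` is made up for this illustration.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and log-based (Woolf) CI for a 2x2 table.

    a = cases with high cholesterol,    b = cases with low cholesterol,
    c = controls with high cholesterol, d = controls with low cholesterol.
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Counts from the problem statement.
or_hat, lower, upper = odds_ratio_with_ci(20, 5, 10, 15)
print(f"OR = {or_hat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With these counts the interval lies entirely above 1, which is consistent with predicting a p-value below 5%.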
(Optional Bonus) Suppose that the researcher had fit a logistic regression to these data instead
of doing a contingency table analysis, with Y being whether or not the person got
arteriosclerosis (Y=1 for yes and Y=0 for no) and X being whether the person had high
cholesterol (X=1 for yes and X=0 for no). Find the estimated logistic regression equation (i.e.
find b0 and b1) and explain your reasoning.
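One way to reason about the bonus question: with a single binary predictor the logistic model reproduces the observed odds in each column of the table, so b0 is the log odds of arteriosclerosis when X=0 and b1 is the log odds ratio. The sketch below works this out from the counts in the problem statement; the variable names are illustrative, not taken from any printout.

```python
import math

# Counts from the problem statement (cases: 20 high / 5 low; controls: 10 high / 15 low).
high_yes, low_yes = 20, 5    # arteriosclerosis = yes
high_no, low_no = 10, 15     # arteriosclerosis = no

# X = 0 (low cholesterol): log odds of Y = 1.
b0 = math.log(low_yes / low_no)
# X = 1 (high cholesterol): log odds of Y = 1 equals b0 + b1,
# so b1 is the log of the odds ratio.
b1 = math.log(high_yes / high_no) - b0

def predicted_probability(x):
    """Fitted P(Y = 1 | X = x) from the logistic equation."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, exp(b1) = {math.exp(b1):.2f}")
print(predicted_probability(0), predicted_probability(1))
```

The fitted probabilities match the observed proportions in the sample (5/20 for low cholesterol, 20/30 for high cholesterol). Because the sample is a case-control sample, these proportions should not be read as population risks.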
ALS434
Part II: Recognizing that there are many factors that affect whether a person gets
arteriosclerosis, the researcher records additional information about her subjects. Her response
variable is whether or not a person has arteriosclerosis (Y=1 for yes and Y=0 for no). Her
possible predictor variables are age (in years), weight (in pounds), blood cholesterol level
(now measured in mg/dL, not as a categorical variable) and whether the person has a family
history of coronary artery disease (1=yes, 0=no). A correlation table and the printouts for one
of her logistic regression models are shown below. Use them to answer the following:
(1) Find the probability that a 50-year-old with a cholesterol level of 250 and no family history of
coronary artery disease would have arteriosclerosis. Show your work.
(2) Give a brief interpretation of the odds ratio for the family history variable.
(3) Give a brief interpretation of the confidence interval for the odds ratio of the age variable. What
does this interval tell you about the usefulness of the age variable in this model?
(4) Give your best estimate of and a 95% confidence interval for the odds ratio comparing the
likelihood of arteriosclerosis for a person with high cholesterol (250 mg/dL) to an otherwise
equivalent person with normal cholesterol (200 mg/dL).
(5) Weight is known to be a risk factor for arteriosclerosis, and our researcher even collected data about
it. Is it likely to be a good idea for her to add it to this model? Explain. What would you have to do to
tell for sure if it was a good idea?
(6) Suppose we wanted to add an interaction term between cholesterol level and family history to the
model. Explain how we would define the variable and what it would tell us if the variable were
significant.
(7) (Optional Bonus) In part (2) you explained the meaning of the odds ratio for the family history
variable. Can you say based on this odds ratio that a person with a family history of coronary artery
disease has a probability of getting arteriosclerosis that is 6.87 times as high as that of a person
without a family history? If yes, explain why. If not, give a counter-example.
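The calculations these questions call for can be sketched as follows. The regression printout is not reproduced here, so every coefficient and standard error in the code is a HYPOTHETICAL placeholder (except the family history coefficient, which is set to log(6.87) to match the odds ratio quoted in the bonus question). Substitute the printout values before relying on any number.

```python
import math

def inv_logit(z):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL coefficients (placeholders, not from the printout).
b_intercept = -9.0
b_age = 0.03
b_chol = 0.02
se_chol = 0.008
b_family = math.log(6.87)  # matches the quoted odds ratio of 6.87

# Predicted probability: age 50, cholesterol 250, no family history.
log_odds = b_intercept + b_age * 50 + b_chol * 250 + b_family * 0
prob = inv_logit(log_odds)

# Odds ratio for a 50 mg/dL difference (250 vs 200): scale the
# coefficient and its CI endpoints by 50 before exponentiating.
delta = 50
or_50 = math.exp(delta * b_chol)
ci_50 = (math.exp(delta * (b_chol - 1.96 * se_chol)),
         math.exp(delta * (b_chol + 1.96 * se_chol)))

# Counter-example: an odds ratio is not a ratio of probabilities.
def prob_after_odds_ratio(p0, odds_ratio):
    odds = p0 / (1 - p0) * odds_ratio
    return odds / (1 + odds)

p0 = 0.5                                  # baseline probability (hypothetical)
p1 = prob_after_odds_ratio(p0, 6.87)
risk_ratio = p1 / p0

print(f"P(arteriosclerosis) = {prob:.4f}")
print(f"OR per 50 mg/dL = {or_50:.2f}, 95% CI = ({ci_50[0]:.2f}, {ci_50[1]:.2f})")
print(f"baseline {p0:.2f} -> {p1:.3f}, probability ratio = {risk_ratio:.2f}")
```

The counter-example shows why the answer to the bonus question is no: with a baseline probability of 0.5, multiplying the odds by 6.87 cannot multiply the probability by 6.87, since that would exceed 1. The odds ratio approximates the probability ratio only when the outcome is rare.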
Source  SS  df  MS  F  P-value
Between Group
Within Group  80
Total  11681.4
Based on these data, is there evidence that any of the group means are different from each other?
Justify your answer by performing an appropriate hypothesis test. Be sure to state the null and
alternative hypotheses, both mathematically and in words, and give the p-value and your real-world
conclusions.
Suppose that instead of doing an ANOVA we had fit a regression model to this data using the
vegetarian diet group as the reference group. Write down the estimated regression equation
we would have obtained.
What percentage of the variability in cholesterol levels is explained by the treatment group to which a
subject belonged? Show your calculations or explain your reasoning.
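The mechanics behind the ANOVA questions above can be sketched in a few lines. Only the total sum of squares (11681.4) comes from the table. The between-group sum of squares, the group means, the fifth group name, and the reading of the 80 in the table as the within-group degrees of freedom are all ASSUMPTIONS made for illustration.

```python
# Total SS is from the ANOVA table; everything else below is HYPOTHETICAL.
ss_total = 11681.4
ss_between = 3000.0          # hypothetical
df_within = 80               # assumes the 80 in the table is the within-group df

# Hypothetical group means (mg/dL); "high_dose_med" is an assumed group name.
group_means = {
    "vegetarian": 190.0,
    "low_fat": 200.0,
    "low_dose_med": 185.0,
    "high_dose_med": 180.0,
    "control": 210.0,
}

k = len(group_means)
df_between = k - 1
ss_within = ss_total - ss_between
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

# Share of variability explained by treatment group.
r_squared = ss_between / ss_total

# Regression with the vegetarian group as reference: the intercept is the
# reference mean and each coefficient is that group's mean minus it.
reference = "vegetarian"
intercept = group_means[reference]
coefficients = {g: m - intercept for g, m in group_means.items() if g != reference}

print(f"F = {f_stat:.3f} on ({df_between}, {df_within}) df, R^2 = {r_squared:.3f}")
print(f"intercept = {intercept}, coefficients = {coefficients}")
```

The p-value is the upper tail area of the F distribution at `f_stat`; if SciPy is available, `scipy.stats.f.sf(f_stat, df_between, df_within)` is one way to obtain it.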
Below is a table showing test statistics and p-values for pairwise comparisons of the
different group means for this ANOVA. Use it to help answer the following:
The test comparing the vegetarian diet group to the low fat diet group is missing. State the null and
alternative hypotheses mathematically and in words, compute the test statistic and an approximate
p-value, and explain your real-world conclusions. (Note: Make sure you carefully show your calculation
of the standard error.)
Which pairs of means are significantly different from one another at the 0.05 level without
adjusting for multiple testing? Explain briefly.
According to the Bonferroni method, what significance level should you use for the individual tests
for differences of means to get an overall significance level of 0.05? Explain briefly. Use your
answer to repeat part (6), adjusting for multiple comparisons. Indicate any results that have changed.
The researcher is interested in comparing the average cholesterol level of people in the two diet
groups with that of the people in the low dose medication group. Write down an appropriate linear
combination, L, for the comparison she wishes to do. Give your best estimate of L and the
corresponding standard error and use these numbers to find a 95% confidence interval for L. Give a
brief interpretation of your interval and explain whether the researcher can conclude there is a
difference in efficacy between diets and the low dose medication in reducing cholesterol levels.
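The Bonferroni adjustment and the linear combination can both be sketched with small helper functions. All inputs in the example are HYPOTHETICAL: the number of groups, group means, group sizes, mean squared error, and the critical t value (about 1.99 for 80 degrees of freedom) are placeholders to be replaced with values from the printouts. The function names are made up for this illustration.

```python
import math

def bonferroni_alpha(k, overall_alpha=0.05):
    """Per-test significance level for all pairwise comparisons of k groups."""
    m = k * (k - 1) // 2          # number of pairwise comparisons
    return m, overall_alpha / m

def contrast_ci(means, sizes, coefs, mse, t_crit):
    """Estimate, standard error and CI for L = sum(c_i * mean_i)."""
    estimate = sum(c * m for c, m in zip(coefs, means))
    se = math.sqrt(mse * sum(c ** 2 / n for c, n in zip(coefs, sizes)))
    return estimate, se, (estimate - t_crit * se, estimate + t_crit * se)

# Bonferroni with a hypothetical k = 5 groups.
m, alpha_per_test = bonferroni_alpha(5)

# Average of the two diet groups minus the low dose medication group.
means = [190.0, 200.0, 185.0]     # vegetarian, low fat, low dose (hypothetical)
sizes = [17, 17, 17]              # hypothetical group sizes
coefs = [0.5, 0.5, -1.0]
estimate, se, ci = contrast_ci(means, sizes, coefs, mse=100.0, t_crit=1.99)

print(f"{m} comparisons, per-test alpha = {alpha_per_test:.4f}")
print(f"L = {estimate:.2f}, SE = {se:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

A pairwise comparison is the special case with coefficients 1 and -1, for which the standard error reduces to sqrt(MSE * (1/n1 + 1/n2)). If the interval for L excludes 0, the data suggest a difference between the diets and the low dose medication.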
There are factors other than treatment group which could affect a person's cholesterol level.
Thus, the researcher has fit a multiple regression of cholesterol level on treatment group, age, weight
and whether or not the person has a family history of coronary artery disease (1=yes and 0=no). Use
the STATA multiple regression printout to answer the remaining parts of the question.
In terms of percentage of variability explained and accuracy of predictions, does this model do a better
job than the simple ANOVA from parts 1-8? Explain briefly what numbers you are looking at to
answer this question. Does this model make good predictions? Explain.
After adjusting for age, weight, and family history, does it appear that any of the diets or medication
doses have a significant impact on cholesterol levels compared to the control group? Briefly justify
your answer.
Your answer to part (10) is different from what you found in parts (6) and (7). Explain what has happened
and what it implies about whether the researcher performed a properly randomized study.
(a) You want to test whether there is a relationship between gender, X, and whether or not a person gets
colon cancer, Y. For any of the method(s) you choose, are there restrictions on when you can use them?
(b) You want to know whether there is a relationship between a person's risk of a heart attack, Y, and
whether they have a family member who has had a heart attack, X, after adjusting for their age (in
years) and weight (in pounds).
(c) You want to know whether people lose more weight on the Atkins diet, the Zone diet, or the Weight
Watchers diet. In addition to specifying the model(s) you could use, state the basic assumptions you
make when you fit those model(s).
(d) You are interested in knowing whether nervousness about going to the doctor's office affects people's
blood pressure readings, so you take a group of people and record their blood pressures at home
(X) and in the doctor's office (Y), both in mm Hg, and fit a model to see if there is a relationship
between X and Y. In addition to stating what method(s) you would use, explain how you would check
whether going to the doctor's office actually was associated with higher blood pressure readings.
(e) You repeat the study from part (d), but instead of recording the actual blood pressures you simply record
whether or not the person had high blood pressure (above 120 systolic) in each setting.
(f) You are studying whether a group of genes is associated with elevated risk of breast cancer. For each
person you record whether or not they have cancer, Y, and an expression level, Xj, for each gene (a
continuous measure).
(g) For the scenario in part (f), imagine that you wanted to identify the subset of genes tested that were actually
related to cancer. If you were testing 10 genes, what approach would you use to ensure the overall accuracy
of your answers? If you were testing 10000 genes? Explain your choice in each case. (Note: You do not
need to re-specify the modeling technique(s) from part (f).)
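For the multiple testing question, one common line of reasoning is that a Bonferroni correction (controlling the chance of any false positive) is workable for 10 tests but becomes very conservative for 10000, where controlling the false discovery rate is a frequently used alternative. The sketch below contrasts Bonferroni with the Benjamini-Hochberg step-up procedure. The p-values are HYPOTHETICAL and the function names are illustrative; this is not presented as the intended exam answer.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank r whose sorted p-value satisfies p_(r) <= (r / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject

# Hypothetical p-values for 10 genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("Bonferroni rejections:", sum(bonf))
print("Benjamini-Hochberg rejections:", sum(bh))
```

Benjamini-Hochberg never rejects fewer hypotheses than Bonferroni at the same level, and the gap typically widens as the number of tests grows, which is why the choice matters more for 10000 genes than for 10.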