
ALS434

Final Review 2

Fall 2014

Dr. Ghamsary

Part I: The researcher begins by asking whether there is a relationship between whether a
person has arteriosclerosis and whether they have high cholesterol. Here high cholesterol is
defined as a total cholesterol level of 250 mg/deciliter or above. She examines 25 patients with
arteriosclerosis and finds that 20 have high cholesterol. She also examines 25 controls who do
not have arteriosclerosis and finds that 10 have high cholesterol. A contingency table analysis of
her data is shown below. Use it to answer the questions that follow.

Arteriosclerosis \ Cholesterol    High    Low
Yes                                 20      5
No                                  10     15

Does there appear to be a statistically significant relationship between arteriosclerosis and
cholesterol level? Briefly explain your reasoning. (You do not need to write out all the details
of the test.)
The odds ratio is 6, with a confidence interval of (1.69, 21.24).
Because the interval does not contain 1, we can say with confidence that there is a significant
relationship between arteriosclerosis and cholesterol level.
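The odds ratio and interval in the answer above can be checked from the four cell counts. The sketch below assumes the interval is a 95% Wald interval computed on the log-odds scale, which comes out close to the reported bounds; the function name `odds_ratio_wald_ci` is illustrative and not part of the original printout.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a, b: patients with arteriosclerosis who have high / low cholesterol
    c, d: controls who have high / low cholesterol
    z:    normal critical value (1.96 assumes a 95% interval)
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR): square root of the summed reciprocal counts
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Counts from the problem statement: 20 of 25 cases and 10 of 25 controls
or_hat, lower, upper = odds_ratio_wald_ci(20, 5, 10, 15)
print(round(or_hat, 2), round(lower, 2), round(upper, 2))
```

The interval is symmetric on the log scale, so it is asymmetric around 6 once exponentiated.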
Explain as precisely as you can the meaning of the p-value for the contingency table test.
Your answer should be specific to this context (arteriosclerosis and cholesterol) and
incorporate the relevant numeric value(s).
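The p-value is the probability of seeing an association at least as strong as the one observed
(20 of 25 patients with arteriosclerosis versus 10 of 25 controls having high cholesterol) if
arteriosclerosis and cholesterol level were in fact unrelated. A small p-value therefore indicates
a significant relationship between cholesterol level and arteriosclerosis.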
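The numeric p-value is not reproduced in this copy of the review. As a rough check, it can be approximated from the counts in the problem statement. The sketch below assumes a Pearson chi-squared test with 1 degree of freedom and no continuity correction; the original printout may have used a different variant, so the result is indicative only.

```python
import math

def chi2_test_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value for a 2x2 table (1 df)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# 20 of 25 cases and 10 of 25 controls have high cholesterol
stat, p_value = chi2_test_2x2(20, 5, 10, 15)
print(round(stat, 3), round(p_value, 4))
```

A p-value well below 0.05 here is consistent with the confidence interval for the odds ratio excluding 1.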

Suppose you wanted to prove that high cholesterol was associated with increased risk of
arteriosclerosis. Explain what the p-value for this test would be and why.
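Based on the odds ratio and its confidence interval, I would predict that the p-value of this test
would be less than 0.05. Because the alternative is one-sided and the observed association is in
the hypothesized direction, it would be half of the two-sided p-value.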

Estimate the odds ratio for having arteriosclerosis among people who have high cholesterol
versus people who don't. Show your work.
OR = [p1/(1-p1)] / [p2/(1-p2)] = (20/5) / (10/15) = (20*15) / (5*10) = 6
(Optional Bonus) Suppose that the researcher had fit a logistic regression to this data instead
of doing a contingency table analysis with Y being whether or not the person got
arteriosclerosis (Y=1 for yes and Y=0 for no) and X being whether the person had high
cholesterol (X=1 for yes and X=0 for no). Find the estimated logistic regression equation (i.e.
find b0 and b1) and explain your reasoning.
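One way to reason about the bonus question: with a single binary predictor, the fitted logistic model reproduces the observed proportions, so b0 is the log odds of arteriosclerosis when X=0 and b1 is the log odds ratio. The sketch below uses the counts from the problem statement; the variable names are illustrative.

```python
import math

# Counts from the problem statement
cases_high, cases_low = 20, 5          # arteriosclerosis with X=1 / X=0
controls_high, controls_low = 10, 15   # no arteriosclerosis with X=1 / X=0

odds_x0 = cases_low / controls_low     # odds of arteriosclerosis when X=0
odds_x1 = cases_high / controls_high   # odds of arteriosclerosis when X=1

b0 = math.log(odds_x0)                 # intercept: log odds at X=0
b1 = math.log(odds_x1 / odds_x0)       # slope: log of the odds ratio

def predicted_prob(x):
    """Fitted probability of arteriosclerosis for X = x (0 or 1)."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(round(b0, 4), round(b1, 4))
print(round(predicted_prob(0), 4), round(predicted_prob(1), 4))
```

Exponentiating b1 recovers the odds ratio of 6 found in the contingency table analysis.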

Part II: Recognizing that there are many factors that affect whether a person gets
arteriosclerosis, the researcher records additional information about her subjects. Her response
variable is whether or not a person has arteriosclerosis (Y=1 for yes and Y=0 for no). Her
possible predictor variables are age (in years), weight (in pounds), blood cholesterol level (now
measured in mg/dL, not as a categorical variable) and whether the person has a family history of
coronary artery disease (1 = yes, 0 = no). A correlation table and the printouts for one of her
logistic regression models are shown below. Use them to answer the following:

(1) Find the probability that a 50-year-old with a cholesterol level of 250 mg/dL and no family
history of coronary artery disease would have arteriosclerosis. Show your work.
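The logistic regression printout is not reproduced in this copy, so the calculation can only be sketched in general form: plug the predictor values into the fitted linear equation, then convert the log odds to a probability. The intercept and the age and cholesterol coefficients below are HYPOTHETICAL placeholders chosen for illustration; only the family history odds ratio of 6.87 appears in the review.

```python
import math

def logistic_probability(intercept, coefs, values):
    """Convert a fitted logistic equation into a predicted probability."""
    log_odds = intercept + sum(coefs[name] * values[name] for name in coefs)
    return 1 / (1 + math.exp(-log_odds))

# HYPOTHETICAL coefficients: replace with the values from the Stata printout
hypothetical_intercept = -10.0
hypothetical_coefs = {
    "age": 0.05,                        # hypothetical
    "cholesterol": 0.03,                # hypothetical
    "family_history": math.log(6.87),   # log of the odds ratio quoted in the review
}

patient = {"age": 50, "cholesterol": 250, "family_history": 0}
p = logistic_probability(hypothetical_intercept, hypothetical_coefs, patient)
print(round(p, 3))
```

With these placeholder numbers the log odds are zero, so the probability is one half; the real answer depends on the printed coefficients.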
(2) Give a brief interpretation of the odds ratio for the family history variable.

(3) Give a brief interpretation of the confidence interval for the odds ratio of the age variable.
What does this interval tell you about the usefulness of the age variable in this model?

(4) Give your best estimate of and a 95% confidence interval for the odds ratio comparing the
likelihood of arteriosclerosis for a person with high cholesterol (250 mg/dL) to an otherwise
equivalent person with normal cholesterol (200 mg/dL).
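For a 50 mg/dL difference, the per-unit log odds coefficient and its confidence limits are each multiplied by 50 before exponentiating. The sketch below shows the arithmetic; the coefficient and standard error are HYPOTHETICAL, since the printout is not reproduced here, and the function name is illustrative.

```python
import math

def odds_ratio_for_change(beta, se, delta, z=1.96):
    """Odds ratio and 95% CI for a change of `delta` units in a predictor.

    beta, se: per-unit log odds coefficient and its standard error
    """
    point = math.exp(delta * beta)
    lower = math.exp(delta * (beta - z * se))
    upper = math.exp(delta * (beta + z * se))
    return point, lower, upper

# HYPOTHETICAL cholesterol coefficient and standard error, for illustration only
point, lower, upper = odds_ratio_for_change(beta=0.03, se=0.01, delta=250 - 200)
print(round(point, 2), round(lower, 2), round(upper, 2))
```

Equivalently, the per-unit odds ratio and each end of its interval can be raised to the 50th power.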
(5) Weight is known to be a risk factor for arteriosclerosis, and our researcher even collected data
about it. Is it likely to be a good idea for her to add it to this model? Explain. What would you
have to do to tell for sure if it was a good idea?
(6) Suppose we wanted to add an interaction term between cholesterol level and family history to
the model. Explain how we would define the variable and what it would tell us if the variable were
significant.

(7) (Optional Bonus) In part (2) you explained the meaning of the odds ratio for the family history
variable. Can you say based on this odds ratio that a person with a family history of coronary artery
disease has a probability of getting arteriosclerosis that is 6.87 times as high as that of a person
without a family history? If yes, explain why. If not, give a counter-example.
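One way to explore this question is to convert the odds ratio back into probabilities for a few baseline risks. In the sketch below only the odds ratio of 6.87 comes from the question; the baseline probabilities are hypothetical values chosen for illustration.

```python
def prob_with_history(p_without, odds_ratio=6.87):
    """Probability for a person with family history, given the probability without."""
    odds_without = p_without / (1 - p_without)
    odds_with = odds_ratio * odds_without
    return odds_with / (1 + odds_with)

# Hypothetical baseline probabilities for a person without family history
for p0 in (0.01, 0.20, 0.50):
    p1 = prob_with_history(p0)
    print(p0, round(p1, 3), round(p1 / p0, 2))
```

The ratio of probabilities is close to the odds ratio only when the baseline probability is small. At a baseline of 0.50 the probability ratio is below 2, and it could never reach 6.87, because that would require a probability above 1.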

Part III: Analysis of Varying Medications


Now our researcher turns to analyzing methods of reducing cholesterol levels. She is interested in the
relative merits of diets versus cholesterol-lowering medications. For each of 65 subjects who began the
study with high cholesterol she records total blood cholesterol level (in mg per deciliter) after 6 months
of participation in the study. The patients are divided into G=5 groups: a control group (C) which
receives a placebo, a vegetarian diet group (V), a low fat diet group (LF), a low dose medication group
(LD) and a high dose medication group (HD). STATA printouts below show the group means, standard
deviations, and group sizes, along with an ANOVA table which seems to be missing a few numbers. Use
this information to answer the questions on the following pages.

Source           SS         df    MS    F    P-value
Between Group
Within Group                      80
Total            11681.4

(1) Fill in the ANOVA table.
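The missing entries follow from the sample size, the number of groups and the two numbers that survive in the table. The sketch below rests on an ASSUMPTION: that the value 80 in the Within Group row is the within-group mean square. The flattened table does not make its column clear, so treat the results as conditional on that reading. The function name is illustrative.

```python
def fill_anova(n_total, n_groups, ss_total, ms_within):
    """Fill a one-way ANOVA table from N, G, SS(total) and MS(within)."""
    df_between = n_groups - 1
    df_within = n_total - n_groups
    ss_within = ms_within * df_within
    ss_between = ss_total - ss_within
    ms_between = ss_between / df_between
    return {
        "df_between": df_between,
        "df_within": df_within,
        "df_total": n_total - 1,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ms_between": ms_between,
        "f_stat": ms_between / ms_within,
        # Share of total variability explained by treatment group
        "r_squared": ss_between / ss_total,
    }

# 65 subjects, 5 groups, total SS from the table; MS(within) = 80 is an assumption
table = fill_anova(n_total=65, n_groups=5, ss_total=11681.4, ms_within=80)
for name, value in table.items():
    print(name, round(value, 3))
```

The p-value is the upper tail of an F distribution with the between and within degrees of freedom, evaluated at the F statistic; it is left out here because it needs a statistics library or an F table.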

(2) Based on these data, is there evidence that any of the group means are different from each other?
Justify your answer by performing an appropriate hypothesis test. Be sure to state the null and
alternative hypotheses, both mathematically and in words, give the p-value, and your real-world
conclusions.

(3) Suppose that instead of doing an ANOVA we had fit a regression model to this data using the
vegetarian diet group as the reference group. Write down the estimated regression equation
we would have obtained.
(4) What percentage of the variability in cholesterol levels is explained by the treatment group to
which a subject belonged? Show your calculations or explain your reasoning.

Below is a table showing test statistics and p-values for pairwise comparisons of the
different group means for this ANOVA. Use it to help answer the following:

(5) The test comparing the vegetarian diet group to the low fat diet group is missing. State the null
and alternative hypotheses mathematically and in words, compute the test statistic and an approximate
p-value, and explain your real-world conclusions. (Note: Make sure you carefully show your calculation
of the standard error.)
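The standard error for a pairwise comparison after ANOVA uses the pooled within-group mean square rather than the two groups' own standard deviations. The sketch below shows the calculation; the group means and sizes are HYPOTHETICAL because the printout is not reproduced in this copy, and the within-group mean square of 80 is an assumed reading of the ANOVA table.

```python
import math

def pairwise_t(mean1, n1, mean2, n2, ms_within):
    """t statistic for comparing two group means using the pooled ANOVA error."""
    se = math.sqrt(ms_within * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2) / se
    return t_stat, se

# HYPOTHETICAL means and group sizes for the V and LF groups
t_stat, se = pairwise_t(mean1=220.0, n1=13, mean2=228.0, n2=13, ms_within=80)
print(round(t_stat, 3), round(se, 3))
```

The statistic is compared with a t distribution on the within-group degrees of freedom, not on n1 + n2 - 2.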
(6) Which pairs of means are significantly different from one another at the 0.05 level without
adjusting for multiple testing? Explain briefly.

(7) According to the Bonferroni method, what significance level should you use for the individual
tests for differences of means to get an overall significance level of 0.05? Explain briefly. Use your
answer to repeat part (6), adjusting for multiple comparisons. Indicate any results that have changed.
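The Bonferroni adjustment divides the overall significance level by the number of tests. The sketch below assumes that every pair of the 5 groups is compared; if the pairwise table covers fewer comparisons, the divisor changes accordingly.

```python
from math import comb

def bonferroni_alpha(n_groups, overall_alpha=0.05):
    """Per-test significance level when all pairs of groups are compared."""
    n_tests = comb(n_groups, 2)   # number of distinct pairs
    return n_tests, overall_alpha / n_tests

n_tests, alpha_each = bonferroni_alpha(5)
print(n_tests, alpha_each)
```

Each pairwise p-value is then compared with the per-test level rather than with 0.05.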
(8) The researcher is interested in comparing the average cholesterol level of people in the two diet
groups with that of the people in the low dose medication group. Write down an appropriate linear
combination, L, for the comparison she wishes to do. Give your best estimate of L and the
corresponding standard error and use these numbers to find a 95% confidence interval for L. Give a
brief interpretation of your interval and explain whether the researcher can conclude there is a
difference in efficacy between diets and the low dose medication in reducing cholesterol levels.
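One natural choice of contrast is L = (mu_V + mu_LF)/2 - mu_LD, with weights 1/2, 1/2 and -1. The sketch below computes the estimate, its standard error and a confidence interval. The group means and sizes are HYPOTHETICAL, the within-group mean square of 80 is an assumed reading of the ANOVA table, and the critical value of 2.0 assumes 60 within-group degrees of freedom.

```python
import math

def contrast_ci(weights, means, sizes, ms_within, t_crit):
    """Estimate, standard error and CI for a linear combination of group means."""
    estimate = sum(w * m for w, m in zip(weights, means))
    se = math.sqrt(ms_within * sum(w ** 2 / n for w, n in zip(weights, sizes)))
    return estimate, se, (estimate - t_crit * se, estimate + t_crit * se)

# Order: V, LF, LD. Means and sizes are HYPOTHETICAL placeholders.
estimate, se, interval = contrast_ci(
    weights=[0.5, 0.5, -1.0],
    means=[220.0, 228.0, 210.0],
    sizes=[13, 13, 13],
    ms_within=80,
    t_crit=2.0,   # approximate 97.5th percentile of t with 60 df
)
print(round(estimate, 2), round(se, 3), [round(x, 2) for x in interval])
```

If the interval excludes 0, the data suggest a difference between the average of the two diets and the low dose medication.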

Obviously there are factors other than treatment group which could affect a person's cholesterol level.
Thus, the researcher has fit a multiple regression of cholesterol level on treatment group, age, weight
and whether or not the person has a family history of coronary artery disease (1 = yes and 0 = no). Use
the STATA multiple regression printout to answer the remaining parts of the question.

(9) In terms of percentage of variability explained and accuracy of predictions, does this model do a
better job than the simple ANOVA from parts 1-8? Explain briefly what numbers you are looking at to
answer this question. Does this model make good predictions? Explain.
(10) After adjusting for age, weight, and family history, does it appear that any of the diets or
medication doses have a significant impact on cholesterol levels compared to the control group? Briefly
justify your answer.

(11) Your answer to part (10) is different from what you found in parts (6) and (7). Explain what has
happened and what it implies about whether the researcher performed a properly randomized study.

Part IV: Model Matching Madness


It is important to understand what types of statistical techniques can be used to analyze different kinds
of situations and what assumptions you have to make when you use those techniques. Below is a list
of statistical techniques, each labeled with a number. Each piece of this problem describes an analysis
scenario. Your job is to indicate which technique(s) can be used to analyze the situation described by
giving the appropriate number(s) and to answer the supplementary questions about the assumptions. It
is possible for there to be more than one appropriate technique and it is possible for none of the
techniques listed to be applicable. Unless otherwise indicated you do not need to explain your choices,
although it could help you if you make a mistake. Note that we will deduct points for incorrect
techniques, so it is not to your advantage to guess lots of extra answers in the hopes of hitting the
right one.
(1) Simple linear regression
(2) Multiple linear regression
(3) Analysis of Variance
(4) Logistic regression
(5) Fisher's exact test
(6) Contingency table analysis (chi-squared test)
(7) McNemar's test
(8) Z test for difference of proportions

(a) You want to test whether there is a relationship between gender, X, and whether or not a person
gets colon cancer, Y. For any of the method(s) you chose, are there restrictions on when you can use
them?

(b) You want to know whether there is a relationship between a person's risk of a heart attack, Y, and
whether they have a family member who has had a heart attack, X, after adjusting for their age (in
years) and weight (in pounds).

(c) You want to know whether people lose more weight on the Atkins diet, the Zone diet, or the Weight
Watchers diet. In addition to specifying the model(s) you could use, state the basic assumptions you
make when you fit those model(s).
(d) You are interested in knowing whether nervousness about going to the doctor's office affects
people's blood pressure readings, so you take a group of people and record their blood pressures at
home (X) and in the doctor's office (Y), both in mm Hg, and fit a model to see if there is a
relationship between X and Y. In addition to stating what method(s) you would use, explain how you
would check whether going to the doctor's office actually was associated with higher blood pressure
readings.

(e) You repeat the study from part (d), but instead of recording the actual blood pressures you simply
record whether or not the person had high blood pressure (above 120 systolic) in each setting.

(f) You are studying whether a group of genes is associated with elevated risk of breast cancer. For
each person you record whether or not they have cancer, Y, and an expression level, Xj, for each gene
(a continuous measure).
(g) For the scenario in part (f), imagine that you wanted to identify the subset of genes tested that
were actually related to cancer. If you were testing 10 genes, what approach would you use to ensure
the overall accuracy of your answers? If you were testing 10000 genes? Explain your choice in each
case. (Note: You do not need to re-specify the modeling technique(s) from part (f).)
