1041 Partc S2 15

CHAPTER 3. PAST EXAM PAPERS
Final Exam 2011

NOVEMBER 2011
Use a separate book clearly marked Question 1
1. [15 marks] A study looked at how accurately people keep time, and how this varies
with age and occupation. This was done by asking randomly selected people for the
time on their watch. Time difference (in seconds) was calculated by comparing their
watch time with the actual exact time (adjusting for any deliberate differences, e.g.
if they intended to set their watch to be 5 minutes early).
The results presented below give the time difference (in seconds) for retirees and
for students. We would like to know if there is a difference between how accurately
students and retirees set their watches.
            mean    sd       n
Students    32.2    116.6    45
Retirees    78.8    98.6     13

[Figure: plot of Time difference (seconds) for Retirees and Students.]
(a)
i. What type of graph is presented above?

ii. Briefly summarise what this graph tells us about how time difference varies
between retirees and students.
(b) Use a hypothesis test to find out if there is evidence that, on average, students
and retirees differ in how accurately their watches keep time. In your answer:
i. State clearly the hypotheses $H_0$ and $H_a$.
ii. State the test statistic and its distribution assuming $H_0$ is true.
iii. Give an expression for the P-value, and find this P-value as accurately as
you can using tables.
iv. State your conclusion in simple language.
(c)
i. What assumptions were required for your hypothesis test to be valid?

ii. Do you think these assumptions were valid? In your answer, refer to the
appropriate information and analyses in the above.

2. [15 marks] An article published in Nature Genetics in 2009 looked at the relationship
between an individual's genotype and whether or not they responded to drug
therapy for Hepatitis C infection (Response status). Below is a summary of the
data for one particular marker.
                 Response status
Genotype    Responder    Non-responder
AA          3            11
Aa          38           97
aa          88           62
While their genotypes are different, Aa individuals are expected to behave the same
(in terms of response status) as AA individuals.
(a)
i. Find the marginal distributions of response status and genotype.

ii. Find the expected number of individuals of genotype AA with response
status Responder.
(b)
i. Which type of hypothesis test could you carry out to examine whether
there is a relationship between response status and genotype?
ii. State two conditions necessary for this test to be valid.
iii. Where appropriate, check that these assumptions hold, or explain what
could be done when collecting data to ensure these assumptions were satisfied.
(c) In order to carry out the hypothesis test, the data were rearranged as follows:
Genotype          Responder    Non-responder
A* (AA or Aa)     41           108
aa                88           62
i. Why were data rearranged like this prior to analysis?

ii. Use this new table to carry out a test of whether there is a relationship
between response status and genotype. In your answer:
• State clearly the hypotheses $H_0$ and $H_a$.
• Calculate the test statistic, showing your working.
• State the distribution of the test statistic (including any degrees of
  freedom) assuming $H_0$ is true.
• Give an expression for the P-value, and find this P-value as accurately
  as you can using tables.
• State your conclusion in simple language.
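
The test statistic asked for in part (c)ii can be checked numerically. The following is a minimal sketch, assuming the chi-square statistic from the formulae sheet is the intended test; the variable names are illustrative and not part of the exam. It stops at the statistic and its degrees of freedom, because the P-value is meant to be read from tables.

```python
# Observed counts from the rearranged table in part (c):
# rows = genotype (A*, aa), columns = (Responder, Non-responder).
observed = [[41, 108],
            [88, 62]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# expected = row total * column total / n, as on the formulae sheet.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# X^2 = sum over all cells of (observed - expected)^2 / expected.
x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))

# Degrees of freedom for an r x c table: (r - 1)(c - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"n = {n}, X^2 = {x2:.2f} on {df} degree(s) of freedom")
```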

3. [15 marks] The United States and Japan are countries with large populations that
are currently struggling with large government debts. This question will examine
whether these countries are examples of a more general pattern, that is, whether
there is a relationship between population size and government debt (as a
proportion of GDP) in general. Below are some relevant analyses, using data from the
International Monetary Fund (IMF) for all 28 developed countries with populations
over one million.
[Figure: scatterplot of Government Debt (%GDP) [log scale] against Population (millions) [log scale], with Australia labelled.]
Call:
lm(formula = log(Debt) ~ log(Population))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.70588    0.23584  15.714 8.64e-15 ***
Population   0.17134    0.07782   2.202   0.0368 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4958 on 26 degrees of freedom
Multiple R-squared: 0.1571, Adjusted R-squared: 0.1247
F-statistic: 4.847 on 1 and 26 DF, p-value: 0.03676
[Figure: two diagnostic plots, each with Australia labelled. "Residual vs fitted values" (Residuals against Fitted values) and "Normal quantile plot of residuals" (Standardized residuals against Theoretical Quantiles).]
(a) Is there evidence of a relationship between government debt and population
size? Refer to the relevant parts of the above analyses in your answer.
(b) How strong is the relationship between government debt and population size?
Refer to relevant parts of the above analyses in your answer.
(c) Construct a 95% confidence interval for the true regression slope.
(d) Is linear regression appropriate for this dataset? In your answer refer to any
specific assumptions that are made, whether or not they seem reasonable, and
what output you used in coming to this conclusion.
(e) Both variables were $\log_e$-transformed prior to analyses. Why do you think data
were transformed?
(f) Australia has a population of 22.8 million and a government debt of 24.2%.
On the log-scale used for analyses, this means that the values of x and y for
Australia were:
x = 3.13
y = 3.19
Also, note from the regression output that the fitted regression line was:
y = 3.71 + 0.171x
i. Use this information to calculate the residual for Australia.
ii. Out of all developed countries with populations over one million, Australia
is the country with the smallest residual value. Explain in simple terms
what this tells us about Australia.
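
The arithmetic behind parts (c) and (f)i can be sketched as follows. This is an illustrative check only: the critical value `t_star` is an assumption, taken as the usual tabulated 97.5th percentile of the t distribution with 26 degrees of freedom, and the variable names are not part of the exam.

```python
# Slope estimate and its standard error, read from the regression output.
b1, se_b1 = 0.17134, 0.07782

# Assumption: t* for 95% confidence with 26 degrees of freedom, from t tables.
t_star = 2.056

# Part (c): 95% confidence interval for the true slope, b1 +/- t* x SE.
ci_low = b1 - t_star * se_b1
ci_high = b1 + t_star * se_b1

# Part (f)i: residual = observed y minus fitted y, using the rounded line
# y = 3.71 + 0.171x and the log-scale values given for Australia.
x_aus, y_aus = 3.13, 3.19
fitted_aus = 3.71 + 0.171 * x_aus
residual_aus = y_aus - fitted_aus

print(f"95% CI for slope: ({ci_low:.3f}, {ci_high:.3f})")
print(f"Residual for Australia: {residual_aus:.2f}")
```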

4. [15 marks]
(a) In this question we will consider the number of cars owned by Australian
households, using data from the Australian Bureau of Statistics 2006 census.
Below is the probability distribution of the number of registered motor vehicles
per household, as in the 2006 census:
Number of cars    0    1       2       3
Probability       ?    0.39    0.35    0.16
The proportion of households with more than 3 cars was zero (to two decimal
places).
i. A number is missing from the above table. Find this missing number.
ii. Find the mean number of registered motor vehicles per household in 2006.
(b) To look at the question of whether car ownership is different in 2011 than it
was in 2006, a sample of 200 households is taken to estimate the mean number
of registered motor vehicles per household, in 2011.
i. How would you sample Australian households in order to estimate the
mean number of registered motor vehicles in 2011?
ii. Explain why you suggest this sampling design, including in your reasoning
at least one advantage of the selected design.
(c) We need to understand the properties of the sample mean ($\bar{X}$) in order to use
it to make inferences about the true mean ($\mu$).
i. What type of distribution does the sample mean number of registered cars
($\bar{X}$) come from, when calculated from 200 households?
ii. Give a reason why you think the sample mean comes from this distribution.
iii. An important result which we use for inference about means is that the
standard deviation (or standard error) of $\bar{X}$, $\sigma_{\bar{X}}$, is:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
Show where this formula comes from, including in your answer any assumptions
you need to make to derive the above formula.
iv. What proportion of sample means would you expect to fall within two
standard errors of the true mean?
(d) The 2011 sample of 200 households had a sample mean of 1.49 cars per household, with a standard deviation of 0.71. Using this sample:
i. Estimate the standard error of the mean.
ii. Construct a 95% confidence interval for the true mean.
iii. Is there evidence that the mean number of cars in 2011 is different to the
mean from the 2006 census? Use your confidence interval to answer this
question.
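
The calculations in parts (a) and (d) can be sketched in a few lines. This is a rough check under one stated assumption: with n = 200 the normal critical value 1.96 is used for the 95% interval (a t critical value from tables would be very slightly larger). Variable names are illustrative.

```python
from math import sqrt

# Part (a): the probabilities must sum to 1 (more than 3 cars is treated as 0).
known = {1: 0.39, 2: 0.35, 3: 0.16}
p_zero = 1 - sum(known.values())
distribution = {0: p_zero, **known}

# Mean of a discrete random variable: sum of x * p over all values.
mean_2006 = sum(x * p for x, p in distribution.items())

# Part (d): standard error of the mean, s / sqrt(n).
n, xbar, s = 200, 1.49, 0.71
se = s / sqrt(n)

# Assumption: z* = 1.96 for 95% confidence, reasonable for a sample this large.
z_star = 1.96
ci_low, ci_high = xbar - z_star * se, xbar + z_star * se

print(f"P(0 cars) = {p_zero:.2f}, 2006 mean = {mean_2006:.2f}")
print(f"SE = {se:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print("2006 mean inside interval:", ci_low <= mean_2006 <= ci_high)
```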
FORMULAE SHEET
1. Mean and standard deviation. For a set of n observations $\{x_1, x_2, \ldots, x_n\}$, the
sample mean and standard deviation are:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
For a discrete random variable X which takes values $x_1, x_2, \ldots, x_k$ with
probabilities $p_1, p_2, \ldots, p_k$, the mean and variance are:
$$\mu_X = x_1 p_1 + x_2 p_2 + \ldots + x_k p_k$$
$$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \ldots + (x_k - \mu_X)^2 p_k$$

2. Correlation and regression. Consider a set of n paired observations
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
The correlation is
$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
The least squares regression line for y on x is
$$\hat{y} = b_0 + b_1 x \quad \text{where } b_1 = r\,\frac{s_y}{s_x} \text{ and } b_0 = \bar{y} - b_1 \bar{x}.$$
$$T = \frac{b_1 - \beta_1}{SE_{b_1}}$$
has a t-distribution with $n - 2$ degrees of freedom.

3. Binomial Probabilities. If $X \sim B(n, p)$ then for $x = 0, 1, \ldots, n$
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}.$$
X has mean $np$ and variance $np(1 - p)$.
For large n, $\hat{p}$ is approximately $N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$.

4. Central Limit Theorem. For a simple random sample from a population with
mean $\mu$ and standard deviation $\sigma$, when n is large,
$\bar{X}$ is approximately $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.

5. One sample. For a simple random sample from $N(\mu, \sigma)$
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

6. Paired samples. When we have a set of n independent pairs of observations, and
the paired differences $X \sim N(\mu, \sigma)$,
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

7. Two independent simple random samples. When we have two independent
simple random samples from normal populations with a common standard deviation
(i.e. $\sigma_1 = \sigma_2 = \sigma$),
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$$

8. Two way tables. For an $r \times c$ table of observed counts of n independent events,
$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \quad \text{where} \quad \text{expected} = \frac{\text{row total} \times \text{column total}}{n}$$
Under the hypothesis of independence of variables, $X^2$ has a $\chi^2$
distribution with $(r - 1)(c - 1)$ degrees of freedom.
Final Exam 2012

NOVEMBER 2012
1. [15 marks] Below are some data from 11 occasions on which world records were
set in the women's 100 metre sprint, since 1960.
Call:
lm(formula = dat$Time ~ dat$Year)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11838 -0.02490  0.01570  0.04625  0.08607

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  67.8543   5.480807   12.38 5.90e-07 ***
dat$Year     -0.0288   0.002775  -10.38 2.63e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple R-squared: 0.9229, Adjusted R-squared: 0.9143
F-statistic: 107.7 on 1 and 9 DF, p-value: 2.628e-06
(a) How strong is the relationship between world record time and year? Include
an appropriate numerical summary in your answer.
(b) Construct a 95% confidence interval for the true regression slope.
(c) Explain in words what this confidence interval tells us about how the women's
100 metre world record is changing over time.
(d) Is linear regression appropriate for this dataset? In your answer refer to any
specific assumptions that are made, whether or not they seem reasonable, and
what output you used in coming to this conclusion.
(e) Use the above regression analyses to predict the world record for the women's
100 metres in 2012.
(f) Your prediction in the previous question was wrong by a relatively large amount:
the world record is currently 10.49 seconds (it has not changed since 1988).
Use ideas learnt in MATH1041 to identify the main reason why your prediction
was so wrong.
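
The numbers needed for parts (b) and (e) follow directly from the regression output. A minimal sketch is given below; the critical value `t_star` is an assumption, taken as the usual tabulated 97.5th percentile of the t distribution with 9 degrees of freedom, and the variable names are illustrative.

```python
# Coefficients and slope standard error, read from the regression output.
b0, b1, se_b1 = 67.8543, -0.0288, 0.002775

# Assumption: t* for 95% confidence with 9 degrees of freedom, from t tables.
t_star = 2.262

# Part (b): 95% confidence interval for the true slope (seconds per year).
ci_low = b1 - t_star * se_b1
ci_high = b1 + t_star * se_b1

# Part (e): predicted world record time in 2012 from the fitted line.
predicted_2012 = b0 + b1 * 2012

# Part (f): gap between the prediction and the actual record of 10.49 seconds.
error_2012 = 10.49 - predicted_2012

print(f"95% CI for slope: ({ci_low:.4f}, {ci_high:.4f})")
print(f"Predicted 2012 record: {predicted_2012:.2f} s (actual is higher by {error_2012:.2f} s)")
```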

2. [15 marks] We are interested in answering the research question: Did Australian
swimmers under-perform at the London Games?
One way to address this question is to compare the times (in seconds) swum by Australian swimmers in their final individual Games swim (Games) to the times they
swam in the same event at the national trials, held four months earlier (Trial). Differences between these times are calculated using the formula Difference=Games-Trial,
meaning that a swimmer has under-performed if Difference is positive.
Below is the table of swimming times for 17 swimmers, and some graphs constructed
to summarise results.
ID  Event                   Trial     Games     Difference
A   Men's 50m Free           21.92     21.98      0.06
B   Men's 100m Free          47.10     47.53      0.43
C   Men's 100m Back          53.98     53.55     -0.43
D   Men's 100m Breast        59.91     58.93     -0.98
E   Men's 200m Free         106.88    106.93      0.05
F   Men's 200m Back         117.90    118.02      0.12
G   Women's 100m Free        53.85     53.47     -0.38
H   Women's 100m Back        59.41     59.29     -0.12
I   Women's 100m Back        59.28     58.68     -0.60
J   Women's 100m Breast      67.64     66.95     -0.69
K   Women's 100m Breast      66.88     67.74      0.86
L   Women's 200m Free       115.99    115.81     -0.18
M   Women's 200m Breast     146.51    146.00     -0.51
N   Women's 200m Breast     146.31    147.38      1.07
O   Women's 200m Free       116.04    117.68      1.64
P   Women's 200m Back       127.83    127.43     -0.40
Q   Women's 200m IM         129.38    129.55      0.17
n                            17        17        17
x̄                            88.048    88.054     0.006
s                            39.21     39.40      0.680

(a) Which of the above graphs (labelled I-III) best answers the research question?
(b) Briefly summarise what this graph tells us about how Australian swimmers
performed at the Games, as compared to the trials.
(c) Making any necessary assumptions, use a hypothesis test to find out if there
is evidence that, on average, Australian swimmers under-performed at the
Games. In your answer:
i. State clearly the hypotheses $H_0$ and $H_a$.
ii. Calculate the test statistic, showing your working.
iii. State the distribution of the test statistic assuming $H_0$ is true.
iv. Give an expression for the P-value, and find this P-value as accurately as
you can using tables.
v. State your conclusion in simple language.
(d)
i. What assumptions were required for your hypothesis test to be valid?

ii. Do you think these assumptions were valid? In your answer, refer to the
appropriate information and analyses in the above.
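
The test statistic in part (c) can be reproduced from the Difference column of the table. The sketch below assumes a paired (one-sample) t statistic on the differences is the intended approach, as on the formulae sheet; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Difference = Games - Trial for swimmers A to Q, copied from the table.
differences = [0.06, 0.43, -0.43, -0.98, 0.05, 0.12, -0.38, -0.12, -0.60,
               -0.69, 0.86, -0.18, -0.51, 1.07, 1.64, -0.40, 0.17]

n = len(differences)
d_bar = mean(differences)   # sample mean of the differences
s_d = stdev(differences)    # sample standard deviation (n - 1 denominator)

# Paired t statistic for H0: mean difference = 0, with n - 1 degrees of freedom.
se = s_d / sqrt(n)
t_stat = (d_bar - 0) / se
df = n - 1

print(f"n = {n}, mean = {d_bar:.3f}, s = {s_d:.3f}")
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The mean and standard deviation agree with the summary rows of the table, which is a useful check that the differences were copied correctly.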

3. [15 marks] Consider again the research question: Did Australian swimmers
under-perform at the London Games?
A different way to answer this question is by comparing the medals won in the
London 2012 and Beijing 2008 Games, across all 32 swimming events. Historically,
Australians tend to perform better at some of these events than at others.
Two tables, labelled (I) and (II), summarise the data in two different ways.
(I)
             Beijing    London    Total
Medal won    19         10        29
No medal     13         22        35
Total        32         32

(II)
                        London
Beijing        Medal won    No medal    Total
Medal won      7            12          19
No medal       3            10          13
Total          10           22          32
Table (I) records the number of events in which a medal was won or not won across
each of the 32 events, separately for each Games.
Table (II) cross-classifies the data, recording the number of events for which a
medal was won at both Games, neither, or only one of the two Games.
(a) Which table better answers the research question?
Use study design as your main consideration in answering this question.
(b) Briefly explain how the research question is better answered by the table you
chose in part 3a).
(c) Hence test whether significantly fewer medals were won in the pool at London
than at Beijing. In your answer:
(d)
i. State two conditions necessary for this test to be valid.

ii. Where possible, use the above data to check that these assumptions hold.

4. [15 marks]
Let's say you are at the shops and by chance you meet three Australian swimmers
from the London Games. Let X be the number, out of the three swimmers, that
won a medal at the London Games. Below is the probability distribution of X:
X              0       1       2       3
Probability    0.23    0.45    0.27    0.05
(a) Draw a probability histogram of X.

(b) Find $\mu_X$, the mean number of medallists.
(c) Explain in words what $\mu_X$ means.
(d) Assume that the government, in its wisdom, gave all swimmers $1,000 prizemoney for competing at the Games, but gave them $10,000 if they won a
medal. Let Y be the total prize money (or spending money, depending on your
perspective) of your three swimmers. We can calculate Y using the formula:
Y = 3000 + 9000X
i. Show that this is the correct formula for Y .
ii. Find $\mu_Y$, the expected spending money of the three swimmers you meet
at the shops.
(e)
i. Find the probability that at least one of the people you meet won a medal.
ii. Assume now that you notice a medal hanging around one of the swimmers'
necks. Given this information, what is the chance that all swimmers won
medals?
(f) The three swimmers you meet at the shops can be considered as a simple
random sample of the 47 Australian swimmers at the London Games. Of these
47 swimmers, 18 won a medal.
Use this information to verify that P (X = 3) = 0.05, to two decimal places.
(g) If the three swimmers at the shops were male, then the probability distribution
of X would have been as follows:
X              0       1       2       3
Probability    0.39    0.46    0.14    0.01
Is gender of the swimmers you meet independent of whether or not they won
any medals? Include reasons.
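
Several of the calculations in parts (b) to (f) are short enough to check by machine. The sketch below works from the first probability table for X; part (f) assumes, as the question states, that the three swimmers are a simple random sample without replacement from the 47, of whom 18 won a medal. Variable names are illustrative.

```python
from math import comb

# Probability distribution of X from the first table in this question.
x_values = [0, 1, 2, 3]
probs = [0.23, 0.45, 0.27, 0.05]

# Part (b): mean of a discrete random variable.
mu_x = sum(x * p for x, p in zip(x_values, probs))

# Part (d)ii: Y = 3000 + 9000X is linear in X, so its mean is 3000 + 9000 * mu_x.
mu_y = 3000 + 9000 * mu_x

# Part (e)i: P(at least one medallist) = 1 - P(X = 0).
p_at_least_one = 1 - probs[0]

# Part (e)ii: P(X = 3 | X >= 1) = P(X = 3) / P(X >= 1).
p_all_given_one = probs[3] / p_at_least_one

# Part (f): choose 3 of the 18 medallists, out of all ways to choose 3 of 47.
p_x3 = comb(18, 3) / comb(47, 3)

print(f"mu_X = {mu_x:.2f}, mu_Y = {mu_y:.0f}")
print(f"P(X >= 1) = {p_at_least_one:.2f}, P(X = 3 | X >= 1) = {p_all_given_one:.3f}")
print(f"P(X = 3) from counting = {p_x3:.4f}")
```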
FORMULAE SHEET
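1. Mean and standard deviation. For a set of n observations $\{x_1, x_2, \ldots, x_n\}$, the
sample mean and standard deviation are:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
For a discrete random variable X which takes values $x_1, x_2, \ldots, x_k$ with
probabilities $p_1, p_2, \ldots, p_k$, the mean and variance are:
$$\mu_X = x_1 p_1 + x_2 p_2 + \ldots + x_k p_k$$
$$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \ldots + (x_k - \mu_X)^2 p_k$$

2. Correlation and regression. Consider a set of n paired observations
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
The correlation is
$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
The least squares regression line for y on x is
$$\hat{y} = b_0 + b_1 x \quad \text{where } b_1 = r\,\frac{s_y}{s_x} \text{ and } b_0 = \bar{y} - b_1 \bar{x}.$$
$$T = \frac{b_1 - \beta_1}{SE_{b_1}}$$
has a t-distribution with $n - 2$ degrees of freedom.

3. Binomial Probabilities. If $X \sim B(n, p)$ then for $x = 0, 1, \ldots, n$
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}.$$
X has mean $np$ and variance $np(1 - p)$.
For large n, $\hat{p}$ is approximately $N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$.

4. Central Limit Theorem. For a simple random sample from a population with
mean $\mu$ and standard deviation $\sigma$, when n is large,
$\bar{X}$ is approximately $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.

5. One sample. For a simple random sample from $N(\mu, \sigma)$
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

6. Paired samples. When we have a set of n independent pairs of observations, and
the paired differences $X \sim N(\mu, \sigma)$,
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

7. Two independent simple random samples. When we have two independent
simple random samples from normal populations with a common standard deviation
(i.e. $\sigma_1 = \sigma_2 = \sigma$),
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$$

8. Two way tables. For an $r \times c$ table of observed counts of n independent events,
$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \quad \text{where} \quad \text{expected} = \frac{\text{row total} \times \text{column total}}{n}$$
Under the hypothesis of independence of variables, $X^2$ has a $\chi^2$
distribution with $(r - 1)(c - 1)$ degrees of freedom.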
Final Exam 2013

NOVEMBER 2013
1. [15 marks] The following data, scatterplot and regression analysis concern the
number of live births per 1000 women in the USA, starting in 1965.
Year   1965   1970   1975   1980   1985   1990   1995   2000   2005   2009
Rate   19.4   18.4   14.8   15.9   15.6   16.4   14.8   14.4   14.0   13.5
[Figure: scatterplot of Rate against Year.]
Figure 3.1: Total live births per 1000 women
## Call:
## lm(formula = Rate ~ Year)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -2.260 -0.310  0.085  0.640  1.250
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 231.1398    46.4169    4.98   0.0011 **
## Year         -0.1084     0.0234   -4.64   0.0017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.729, Adjusted R-squared: 0.695
(a)
i. What is the explanatory variable and what is the response variable?

ii. For each of these variables, state whether it is categorical or quantitative.

(b) Describe the general trend shown in the data.
(c) The residuals (to 2 decimal places) are computed to be:
res = residuals(birth.reg)
res
##     1     2     3     4     5     6     7     8     9    10
##  1.25  0.79 -2.26 -0.62 -0.38  0.96 -0.10  0.05  0.19  0.12
i. Explain how to calculate the first residual.

ii. Draw a plot of residuals against the explanatory variable.
iii. Is linear regression an appropriate model here for predicting the birthrate?
Explain your answer with reference to your plot and to the value of $r^2$.
(d)
i. Estimate the birthrate in 1978.

ii. In 1978 the actual birthrate was 15.0. How close was the model in its
prediction?
(e)
i. Use the regression line to predict the birthrate in 2025.

ii. Comment on your confidence in this prediction.
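
The fitted line, the first residual and the two predictions can be reproduced from the data table. This is a minimal sketch that computes the least squares line by hand in Python rather than with R's `lm`; the variable names and the helper `predict` are illustrative. Small differences from the printed residuals can arise from rounding.

```python
# Data from the table at the start of this question.
years = [1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2009]
rates = [19.4, 18.4, 14.8, 15.9, 15.6, 16.4, 14.8, 14.4, 14.0, 13.5]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(rates) / n

# Least squares slope and intercept: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, rates))
sxx = sum((x - x_bar) ** 2 for x in years)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar


def predict(year):
    """Fitted birth rate for a given year (hypothetical helper)."""
    return b0 + b1 * year


# Part (c)i: a residual is the observed value minus the fitted value.
first_residual = rates[0] - predict(years[0])

# Parts (d) and (e): interpolation within the data, extrapolation beyond it.
pred_1978 = predict(1978)
pred_2025 = predict(2025)

print(f"slope = {b1:.4f}, intercept = {b0:.2f}")
print(f"first residual = {first_residual:.2f}")
print(f"1978: {pred_1978:.2f} (actual 15.0), 2025: {pred_2025:.2f}")
```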

2. [15 marks] In a flour mill, flour is packaged in small packets. It is intended that
each packet should contain 500g of flour.
Packets are filled by a machine where the amount of flour put in a packet is a random
variable with a standard deviation of $\sigma = 2$g. In order that packets should not, in
general, be underfilled, the random variable is set to have a mean of $\mu = 501$g.
(a) A random box of 36 packets of flour is delivered to a customer.
i. Determine the probability that the packets in the box are not, on average,
underweight. That is, determine the probability that the average weight
of the packets is at least 500g.
ii. Hence, determine the probability that the total weight of the 36 packets
of flour is at least 18kg.
iii. What assumptions did you need to make to determine these probabilities?
(b) Due to faulty machines, many of the flour packets are mis-labelled. There
are two different faults that occur. Some claim to contain 5500g (rather than
500g). In some packets the use-by-date is unreadable. A random sample of
100 flour packets is checked to see whether they are faulty. The results are
summarised in the 2 by 2 table of frequencies below.
                    Unreadable use-by-date    Clear use-by-date
Incorrect 5500g     12                        8
Correct 500g        25                        55
i. What proportion of the sample of 100 packets is mis-labelled as containing
5500g (rather than 500g) flour?
ii. Determine a 95% confidence interval for the true population proportion
of flour packets which are mis-labelled as containing 5500g (rather than
500g) flour.
iii. A $\chi^2$ test is carried out to investigate whether the two types of faults in the
labels occur independently. The $\chi^2$ statistic is calculated to be $X^2 = 5.674$.
You are not required to show or explain how this value of $X^2 = 5.674$ is
calculated.
A. State the null and alternative hypotheses for this test.
B. Assuming the null hypothesis is true, what is the distribution of the
test statistic?
C. Determine the P-value.
D. State your conclusion in plain language.
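
The probabilities and the interval in this question can be checked with Python's standard library. The sketch rests on stated assumptions: the sample mean of 36 packets is treated as approximately normal (Central Limit Theorem), 1.96 is used as the 95% normal critical value, and the P-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Part (a)i: mean weight of 36 packets, approximately N(mu, sigma / sqrt(n)).
mu, sigma, n_box = 501, 2, 36
xbar_dist = NormalDist(mu, sigma / sqrt(n_box))
p_mean_at_least_500 = 1 - xbar_dist.cdf(500)
# Part (a)ii is the same event: total >= 18 kg means mean >= 500 g.

# Part (b)i and ii: proportion mis-labelled as 5500g, with a 95% interval.
n_sample = 100
p_hat = (12 + 8) / n_sample
se_p = sqrt(p_hat * (1 - p_hat) / n_sample)
z_star = 1.96  # assumption: standard normal critical value for 95%
ci_low, ci_high = p_hat - z_star * se_p, p_hat + z_star * se_p

# Part (b)iii C: P(chi-square with 1 df > 5.674) = P(|Z| > sqrt(5.674)).
x2 = 5.674
p_value = 2 * (1 - NormalDist().cdf(sqrt(x2)))

print(f"P(mean >= 500g) = {p_mean_at_least_500:.4f}")
print(f"p_hat = {p_hat:.2f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"P-value = {p_value:.4f}")
```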

3. [15 marks] Systolic Blood Pressure (SBP) measurements are monitored on patients
in a hospital intensive care unit. Past experience suggests that SBP levels among
young adults are lower, on average, than the SBP of the elderly. Data were collected
for this purpose from 13 young adults (x) and from 15 elderly patients (y). These
data are shown below in Table 3.1.
Table 3.1: Summary of Blood pressure (SBP) of patients.
x:  100  100  104  104  112  130  130  136  140  140  142  146  156
y:   95  100  100  110  110  122  130  135  136  138  140  141  162  190  195
n_x = 13    x̄ = 126.15    s_x = 19.57
n_y = 15    ȳ = 133.6     s_y = 30.34
Figure 3.2: Diagnostic plots
[Figure: three panels labelled (I), (II) and (III): plots of X quantiles and Y quantiles, and a histogram (Frequency) of SBP values for combined data.]
(a) Which group of patients has the larger variability in SBP, the young adults
or the elderly? Identify the numerical summaries that show this and explain
what data point(s) could be the cause of this larger variability.
(b) Making any necessary assumptions, use a hypothesis test to find out if there
is any evidence that, on average, the elderly patients' SBP is higher than the
young adults' SBP in the intensive care unit. In your answer:
(c)
i. What assumptions were required for the hypothesis test in (3b) to be valid?
ii. Explain whether these assumptions are valid. In your answer, refer to any
appropriate information given or explain what further information might
be required.

4. [15 marks]
The United States Census Bureau collected demographic information about the
number of times people have been married. Let X denote the number of times an
individual has been married. The probabilities associated with X are given in Table
3.2 below.
Table 3.2: Probability distribution of X.
x            0    1        2         3
P(X = x)     ?    0.549    0.1185    0.0315
(a)
i. Determine the missing value in the above table. That is, what is the chance
of not being married?
ii. Given the information, compute $\mu_X$, the mean of the distribution.
iii. Using the table and the result computed in (4(a)i), construct a probability
histogram and comment on its features.
(b) Consider individuals that have been married at least once. That is, consider
the values of X, where X > 0.
i. Using Table 3.2, list the set of possible outcomes for X if it is known
already that this particular person has been married at least once.
ii. Create a new conditional probability distribution table for the values considered in (4(b)i). That is, create a table of the conditional probabilities
P (X = k|X > 0) for all possible values of k.
iii. Compute the conditional mean, $\mu_{X|X>0}$, using the table you created in
(4(b)ii).
(c) It is suggested that individuals of different genders have a different marriage
distribution. Data were collected from 238,818 individuals that can be used to
investigate this claim.
Table 3.3: Frequency table of X across gender.
              Number of Marriages
          0        1        2        3
Male      38213    60561    13432    3589
Female    33283    70881    14915    3944
i. Compute the marginal distribution of the number of marriages and the
marginal distribution of gender from the counts in Table 3.3.
ii. Show that the two variables, number of marriages and gender, are not
independent.
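
Part (c) can be explored numerically from the counts in Table 3.3. In the sketch below the column labels 0 to 3 for the number of marriages are an assumption (they match the values of X in Table 3.2), and the comparison of conditional proportions is one possible way to show dependence, not necessarily the intended one. Variable names are illustrative.

```python
# Counts from Table 3.3; columns are assumed to be 0, 1, 2, 3 marriages.
marriages = [0, 1, 2, 3]
counts = {
    "Male": [38213, 60561, 13432, 3589],
    "Female": [33283, 70881, 14915, 3944],
}

total = sum(sum(row) for row in counts.values())

# Marginal distribution of gender: row totals divided by the grand total.
gender_marginal = {g: sum(row) / total for g, row in counts.items()}

# Marginal distribution of number of marriages: column totals / grand total.
column_totals = [sum(col) for col in zip(*counts.values())]
marriage_marginal = [c / total for c in column_totals]

# Under independence, P(X = 0 | Male) and P(X = 0 | Female) would both equal
# the marginal P(X = 0). Compare the two conditional proportions.
p0_male = counts["Male"][0] / sum(counts["Male"])
p0_female = counts["Female"][0] / sum(counts["Female"])

print("Total:", total)
print("Gender marginal:", {g: round(p, 4) for g, p in gender_marginal.items()})
print("Marriage marginal:", [round(p, 4) for p in marriage_marginal])
print(f"P(X=0 | Male) = {p0_male:.4f}, P(X=0 | Female) = {p0_female:.4f}")
```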
FORMULAE SHEET
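1. Mean and standard deviation of a sample. For a set of n observations
$\{x_1, x_2, \ldots, x_n\}$, the sample mean and standard deviation are:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

2. Correlation and regression. Consider a set of n paired observations
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
The correlation is
$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
The least squares regression line for y on x is
$$\hat{y} = b_0 + b_1 x \quad \text{where } b_1 = r\,\frac{s_y}{s_x} \text{ and } b_0 = \bar{y} - b_1 \bar{x}.$$
Under the appropriate assumptions,
$$T = \frac{b_1 - \beta_1}{SE_{b_1}} \sim t(n - 2).$$

3. Mean and standard deviation of a discrete random variable. For a discrete
random variable X which takes values $x_1, x_2, \ldots, x_k$ with probabilities
$p_1, p_2, \ldots, p_k$, the mean and variance are:
$$\mu_X = x_1 p_1 + x_2 p_2 + \ldots + x_k p_k$$
$$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \ldots + (x_k - \mu_X)^2 p_k$$

4. Binomial Probabilities. If $X \sim B(n, p)$ then for $x = 0, 1, \ldots, n$
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}.$$
X has mean $\mu_X = np$ and variance $\sigma_X^2 = np(1 - p)$.
For large n, $\hat{p}$ is approximately $N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$.

5. Central Limit Theorem. For a random sample from a population with mean $\mu$
and standard deviation $\sigma$, when n is large,
$\bar{X}$ is approximately $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.

6. One sample. For a random sample from $N(\mu, \sigma)$
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1).$$

7. Paired samples. When we have a set of n independent pairs of observations, and
the paired differences $X \sim N(\mu, \sigma)$,
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1).$$

8. Two independent random samples. When we have two independent random
samples from normal populations with a common standard deviation
(i.e. $\sigma_1 = \sigma_2 = \sigma$),
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$$

9. Two way tables. For an $r \times c$ table of observed counts of n independent events,
$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \quad \text{where} \quad \text{expected} = \frac{\text{row total} \times \text{column total}}{n}$$
Under the hypothesis of independence of variables, $X^2 \sim \chi^2_{(r-1)(c-1)}$.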
Final Exam 2014

NOVEMBER 2014
1. [15 marks]
A research claim is made that the brain weight and body weight of vertebrate
animals follow a linear relationship. To investigate the claim, a random sample of
15 animals was taken and their typical body weights (in kg) and brain weights (in
g) were measured and displayed in Figure 3.3. The data point for humans is labelled.
[Figure: scatterplot of Brain weight (g) against Body weight (kg), with the Human data point labelled.]
Figure 3.3: Brain weight vs Body weight in a sample of 15 vertebrates
(a) Describe the relationship between body and brain weights in the data as
displayed in the scatterplot in Figure 3.3.
(b) A linear regression is fitted to the data and a partial output summary is given
below.
## Call:
## lm(formula = brain ~ body, data = MedAnimals)
##
## Residual standard error: 317 on 13 degrees of freedom
A researcher views the above analysis output and states that "The linear
relationship between body weight and brain weight is weak." Justify their
statement by including an appropriate numerical summary and explain how this
numerical summary shows the linear relationship is weak.

[Figure: four diagnostic panels. For each dataset, "Residuals vs fitted values" (Residuals against Fitted values) and "Normal Quantile plot of residuals" (Standardised residuals). The Human data point is labelled in the plots for the full dataset.]
Figure 3.4: Diagnostic plots. Row 1: with humans, Row 2: without humans.
(c) To see the impact of the human data point on the linear regression, the
human data point is removed from the dataset and the linear regression is
recalculated. The summary output is given below and diagnostic plots for the
regression with and without the human data point are given in Figure 3.4.
## Call:
## lm(formula = brain ~ body, data = MedAnimals[-human, ])
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  135.502     41.032    3.30   0.0063
## body           0.917      0.162    5.67   0.0001
##
## Residual standard error: 112 on 12 degrees of freedom
i. Is linear regression appropriate for the full dataset, the reduced dataset
(without the human data) or for both of them? In your answer refer to
any assumptions that are made, whether or not they seem reasonable, and
what output or diagnostic plots you used in coming to this conclusion.
ii. Assume that linear regression is appropriate for the reduced dataset.
A. Compute a 99% confidence interval for the expected change in brain
weight in grams for each kg increase of body weight.
B. Explain in plain English what your computed interval says about the
average brain weights of vertebrates.
iii. The two least-squares regression lines for the full dataset and reduced
dataset are given in Figure 3.5. The solid line is calculated using the full
dataset while the dashed line is calculated using the data with the human
data point removed. With reference to the least-squares method, explain
why the solid line has tilted upwards compared to the dashed line.
[Figure: scatterplot of Brain weight (g) against Body weight (kg) with two fitted lines; the Human data point is labelled.]
Figure 3.5: Solid line: least squares line for the full data. Dashed line: least squares line
for the reduced data.

2. [15 marks]
(a) Three students meet for lunch every day during session and go to the Whitehouse, the Roundhouse, or the Quad. Each day during session, their choice
is made by chance: They each flip a fair coin and see how many of the three
coins show Heads and how many show Tails.
If they get 3 Heads, they eat at the Whitehouse.
If they get 3 Tails, they eat at the Roundhouse.
Otherwise, (if they get 2 Heads 1 Tail, or 2 Tails 1 Head) they go to the
Quad.
i. On a random day during session what is the probability they eat at the
Whitehouse?
ii. At the beginning of one week they discuss how often they might be eating
at the Whitehouse over the 5 lunch-times (Monday to Friday). Let X be
the number of lunches at the Whitehouse in a random week.
A. What is the distribution of X? State the name of the distribution and
the value of any parameters.
B. State the mean and variance of X, that is, state $\mu_X$ and $\sigma_X^2$.
iii. Determine the probability that in a random week (Monday to Friday) they
eat at the Whitehouse just once, and on 4 days they eat elsewhere.
iv. In the study break the students still meet for lunch, but the Roundhouse is
closed. They determine their lunch place in the same way, but if they get
3 Tails then they can't now go to the Roundhouse so they flip the coins
again, possibly several times, until they get a different outcome. They
then eat at the Whitehouse (if they get 3 Heads) and otherwise they go
to the Quad.
On a random day during study break what is the probability they eat at
the Whitehouse?
(b) Assume that the distribution of weights of people is approximately normal and
that the mean is 75 kg and standard deviation is 9 kg. Lifts in the Red Centre
at UNSW have notices saying that the maximum number of people to go in
the lift is 13 and the maximum weight is 1000 kg.
i. What is the distribution of the total weight of 13 random people? State
the name of the distribution and the value of any parameters.
ii. Suppose that 13 people get into the lift in the Red Centre.
A. Estimate the probability that the total weight of the 13 people exceeds
1000kg.
B. What assumptions did you make to estimate this probability?
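
The probabilities in this question can be checked with a short script. The sketch below assumes the three coin flips are independent and fair, that days are independent (so X is binomial), and that the 13 weights are independent draws from the stated normal distribution; these are the assumptions a written answer would also need to state. Variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

# Part (a)i: three fair coins all showing Heads.
p_whitehouse = (1 / 2) ** 3

# Part (a)ii and iii: X ~ B(5, 1/8) over the five lunches in a week.
n_days = 5
mean_x = n_days * p_whitehouse
var_x = n_days * p_whitehouse * (1 - p_whitehouse)
p_exactly_once = comb(n_days, 1) * p_whitehouse * (1 - p_whitehouse) ** (n_days - 1)

# Part (a)iv: re-flipping on 3 Tails conditions on "not 3 Tails".
p_three_tails = (1 / 2) ** 3
p_whitehouse_break = p_whitehouse / (1 - p_three_tails)

# Part (b): the total of 13 independent N(75, 9) weights is
# N(13 * 75, 9 * sqrt(13)).
n_people, mu, sigma = 13, 75, 9
total_dist = NormalDist(n_people * mu, sigma * sqrt(n_people))
p_over_limit = 1 - total_dist.cdf(1000)

print(f"P(Whitehouse) = {p_whitehouse}, mean = {mean_x}, variance = {var_x}")
print(f"P(X = 1) = {p_exactly_once:.4f}")
print(f"P(Whitehouse in study break) = {p_whitehouse_break:.4f}")
print(f"P(total > 1000 kg) = {p_over_limit:.4f}")
```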

3. [15 marks] A car rental company has two offices, one at the airport and one in the
city centre. The company wants to investigate the difference, on average, in monthly
revenue for the two offices. They collate the monthly revenue for each month over
a year.
The monthly revenue data is displayed in the following table.
Month        Airport ($)    City Centre ($)    Difference ($)
July         283592         188010              95582
August       269620         197874              71746
September    312220         193954             118266
October      300679         210545              90134
November     217889         212116               5773
December     381030         277022             104008
January      232288         239715              -7427
February     186285         197761             -11476
March        230672         256650             -25978
April        248172         182655              65517
May          221898         146602              75296
June         257073         149663             107410
x̄            261780         204380              57404
s             52290          38873              52278
There are some graphs of this data given in Figure 4 on page 64.
(a) Carry out a hypothesis test to see if there is significant evidence that there is
a difference in expected revenue per month. In your answer:
i. State the hypotheses H0 and Ha.
ii. State the test statistic.
State the distribution of the test statistic assuming H0 is true.
Determine the observed value of the test statistic.
iii. Give an expression for the P-value.
Determine this P-value as accurately as you can from the tables.
iv. State your conclusion in plain language.
(b) What assumptions did you need to make for your hypothesis test to be valid?
Do you think these assumptions were reasonable?
In your answer refer to the graphs and diagnostic plots on the next page, stating
carefully which plots you are using.
(c)
i. Determine a 95% confidence interval for the mean difference in revenue
per month between the airport and city offices.
ii. Explain in simple language what this confidence interval means.
[Figure 3.6: Diagnostic plots for car rental data. Panels: normal quantile plots of difference, of City revenue and of Airport revenue; difference vs month; Airport vs City; histogram of combined revenue data from Airport and City; boxplots of difference, of City revenue and of Airport revenue.]

4. [15 marks]
A survey of primary school children was conducted to investigate whether geographical location had an impact on the importance the children placed on goals
of academic grades, sporting skill or social popularity. A random sample of 478
students was obtained and each child was asked to identify their most important
goal from the selection. The results of the survey are summarised in Table 3.4 which
lists the counts of children by their identified goal and location.
Table 3.4: Observed counts of Location and Goals for primary school children

                       Goal
Location     Grades   Popular   Sports
Rural            57        50       42
Suburban         87        42       22
Urban           103        49       26

Using the information in Table 3.4, answer the questions below.

(a) Carry out a statistical hypothesis test to examine if there is a relationship
between Location and Goal for these students. You may use the fact that
$X^2 = 18.8276$ and you do NOT need to compute the value yourself. In your
answer:
i. Find the marginal distributions of Location and Goal.
ii. Assuming that the Location and Goal variables are independent, find
the expected number of individuals who live in a suburban area and have
a goal of sports. (Note, you are NOT asked to find the whole table of
expected counts.)
iii. Clearly state the hypotheses H0 and Ha.
iv. State the exact distribution of $X^2$ assuming that H0 is true.
v. Give an expression for the P-value, and provide bounds on its value.
vi. State your conclusion in simple language.
(b) State what conditions are necessary for the hypothesis test in 4a) to be valid.

(c) A claim is made that rural children are more likely to have an interest in sports
compared to urban children. Denote by p1 the true proportion of children in
rural areas who have a main goal of sports, and by p2 the true proportion of
children in urban areas who have a main goal of sports.
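Under some conditions, the distribution of the sample proportions $\hat{p}_1$ and $\hat{p}_2$
can be approximated with
$$\hat{p}_1 \overset{\text{approx}}{\sim} N\left(p_1, \sqrt{\frac{p_1(1-p_1)}{149}}\right) \quad\text{and}\quad \hat{p}_2 \overset{\text{approx}}{\sim} N\left(p_2, \sqrt{\frac{p_2(1-p_2)}{178}}\right).$$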
i. What are these conditions that need to be satisfied so that $\hat{p}_1$ and $\hat{p}_2$ can
be well approximated with Normal distributions?
ii. Are these conditions satisfied here?
iii. Give an approximate 95% confidence interval estimate of the difference in
proportions for urban and rural children. That is, compute an approximate
confidence interval estimate of $p_1 - p_2$.
iv. Using this confidence interval, determine if there is evidence that rural
children are more likely to have a goal of sports compared to urban children.
FORMULAE SHEET
1. Mean and standard deviation of a sample. For a set of n observations
{x1, x2, ..., xn}, the sample mean and standard deviation are:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
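2. Correlation and regression. Consider a set of n paired observations
{(x1, y1), (x2, y2), ..., (xn, yn)}.
The correlation is
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
The least squares regression line for y on x is
$$\hat{y} = b_0 + b_1 x, \quad\text{where } b_1 = r\,\frac{s_y}{s_x} \text{ and } b_0 = \bar{y} - b_1\bar{x}.$$
Under the appropriate assumptions,
$$T = \frac{b_1 - \beta_1}{SE_{b_1}} \sim t(n-2).$$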
3. Mean and standard deviation of a discrete random variable. For a discrete
random variable X which takes values x1, x2, ..., xk with probabilities p1, p2, ..., pk, the mean and variance are:
$$\mu_X = x_1 p_1 + x_2 p_2 + \ldots + x_k p_k$$
$$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \ldots + (x_k - \mu_X)^2 p_k$$
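4. Binomial probabilities. If $X \sim B(n, p)$ then for x = 0, 1, ..., n
$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}.$$
X has mean $\mu_X = np$ and variance $\sigma_X^2 = np(1-p)$.
The sample proportion $\hat{p} = X/n$ has mean $p$ and standard deviation $\sqrt{\frac{p(1-p)}{n}}$.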
5. Central Limit Theorem. For a random sample from a population with mean $\mu$
and standard deviation $\sigma$, when n is large,
$$\bar{X} \overset{\text{approx}}{\sim} N\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
6. One sample. For a random sample from $N(\mu, \sigma)$,
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1).$$

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1).$$
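8. Two independent random samples. When we have two independent random
samples from normal populations with a common standard deviation (i.e. $\sigma_1 = \sigma_2 = \sigma$),
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$$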

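9. Two way tables. For an r × c table of observed counts of n independent events,
$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}, \quad\text{where expected} = \frac{\text{row total} \times \text{column total}}{n}.$$
Under the hypothesis of independence of variables, $X^2 \sim \chi^2_{(r-1)(c-1)}$.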
Chapter 4
Past Exam Solutions
This chapter contains solutions to all of the past exam papers presented in the previous
chapter. In some cases a marking guide is also included, so that you can check what your
raw mark would have been on any given exam, to get an indication of whether you are
on track!
In the past, the occasional typographic error has been found in solutions. If you think
you have found such an error, then please let your course authority know.
Final Exam 2011 Solutions

NOVEMBER 2011 SOLUTIONS
1. [15 marks]
(a)
i. Comparative boxplots or side-by-side boxplots

ii. Suggests that retirees' watches are slower than students' (more positive
time difference, on average).
(b) Two-sample t-test:

i. $H_0: \mu_{student} = \mu_{retiree}$ vs $H_a: \mu_{student} \neq \mu_{retiree}$
ii.
$$s_p^2 = \frac{(n_{student} - 1)s_{student}^2 + (n_{retiree} - 1)s_{retiree}^2}{n_{student} + n_{retiree} - 2} = \frac{(45 - 1)\,116.6^2 + (13 - 1)\,98.6^2}{45 + 13 - 2} = 112.98^2$$
and so the test statistic is
$$t = \frac{\bar{x}_{student} - \bar{x}_{retiree}}{s_p\sqrt{\frac{1}{n_{student}} + \frac{1}{n_{retiree}}}} = \frac{32.2 - 78.8}{112.98\sqrt{\frac{1}{45} + \frac{1}{13}}} = -1.31$$
which comes from a t distribution with 45 + 13 - 2 = 56 degrees of freedom
if H0 is true.
iii. The P-value is $2 \times P(t(56) > 1.31)$, which is between $2 \times 0.05 = 0.1$ and
$2 \times 0.1 = 0.2$.
iv. There is no evidence that retirees' and students' watches are off by different
amounts, on average.
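The pooled two-sample calculation in part (b) can be checked numerically. The following is a minimal Python sketch, assuming only the summary statistics quoted in the question; the function name `pooled_t` is illustrative and not part of the original solutions.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (pooled sd, t statistic, degrees of freedom) for a pooled two-sample t-test."""
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the two sample variances
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    # Test statistic for H0: the two population means are equal
    t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return sp, t, df

# Students: mean 32.2, sd 116.6, n = 45; retirees: mean 78.8, sd 98.6, n = 13
sp, t, df = pooled_t(32.2, 116.6, 45, 78.8, 98.6, 13)
print(round(sp, 2), round(t, 2), df)
```

The P-value itself still has to be bounded from t tables (or computed with a statistics package), since the Python standard library has no t distribution.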
(c)
i.
- Independence (of observations across samples and in each sample)
- Normality
- Equal variance in each sample
ii.
- Independence assumptions OK because subjects were randomly selected (from different populations, students and retirees)
- Normality OK because total sample size is large (and there is no sign of skew or major outliers)
- Equal variance is OK because sample standard deviations are similar.
2. [15 marks]
(a)
i.
                 Response status
Genotype   Responder   Non-responder   Total
AA                 3              11      14
Aa                38              97     135
aa                88              62     150
Total            129             170     299

ii.
$$\text{expected count} = \frac{14 \times 129}{n} = \frac{14 \times 129}{299} = 6.04$$
(b) i. Chi-square test for independence
ii.
- Independence of observations
- All expected counts exceed 10
iii.
- Independence: sample randomly
- Expected counts: one is 6.04, so they do not all exceed 10
(c)
i. To ensure that all expected counts exceed 10.

ii. H0 : there is no association between genotype and response status.
Ha : there is an association.
The table of expected counts is

                 Response status
Genotype   Responder   Non-responder   Total
A*              64.3            84.7     149
aa              64.7            85.3     150
Total            129             170     299

and so the test statistic is
$$X^2 = \sum_i \frac{(\text{Observed count}_i - \text{Expected count}_i)^2}{\text{Expected count}_i}$$
$$= \frac{(41 - 64.3)^2}{64.3} + \frac{(108 - 84.7)^2}{84.7} + \frac{(88 - 64.7)^2}{64.7} + \frac{(62 - 85.3)^2}{85.3}$$
$$= 8.4 + 6.4 + 8.4 + 6.4 = 29.6$$
which comes from a chi-square distribution with $(2 - 1) \times (2 - 1) = 1$
degree of freedom.
The P-value is $P(\chi^2_1 > 29.6) < 0.005$.
So there is strong evidence of an association between genotype and
response status.
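The expected counts and the chi-square statistic above can be reproduced from the observed 2 × 2 table alone. This is a minimal Python sketch under that assumption; the helper name `chi_square_statistic` is hypothetical, and the observed counts are the ones appearing in the working above (A* combines genotypes AA and Aa).

```python
def chi_square_statistic(observed):
    """Return (X^2, degrees of freedom, expected counts) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected

# Rows: A* (AA and Aa combined), aa; columns: responder, non-responder
observed = [[41, 108], [88, 62]]
x2, df, expected = chi_square_statistic(observed)
print(round(x2, 1), df, [[round(e, 1) for e in row] for row in expected])
```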
3. [15 marks]
(a) Yes: the P-value for a test of no relationship is 0.0368, so there is some
evidence of a relationship.
(b) $r^2 = 0.1571$, so the relationship is weak.
(c) A 95% CI for $\beta_1$ is:
$$b_1 \pm t^* SE_{b_1} = 0.171 \pm 2.056 \times 0.07782 = 0.171 \pm 0.160 = (0.011, 0.331)$$
(d)
- Independence: can't check, so we'll let that one slide.
- Linearity: residual plot shows no pattern, so OK.
- Normality: normal quantile plot fairly good.
- Equal variance: residual plot shows no pattern, so OK.
(e) To satisfy assumptions.
(Or: to take multiplicative size variables and put them on an additive scale.)
(f)
i.
$$\text{residual} = y - \hat{y} \approx 3.19 - (3.71 + 0.171 \times 3.13) \approx -1.06$$
ii. Australia has an unusually small amount of government debt, given its
population size.

4. [15 marks]
(a)
i. $P(X = 0) = 1 - 0.39 - 0.35 - 0.16 = 0.10$.
ii.
$$\mu_X = \sum_{\text{all } x} x_k p_k = 0 \times 0.10 + 1 \times 0.39 + 2 \times 0.35 + 3 \times 0.16 = 1.57$$
(b)
i. for applying a sampling design from MATH1041:

- A simple random sample of Australian households.
- A stratified random sample (stratifying by city/country, by state, or by other potentially useful strata).
- Multi-stage (cluster) sample (sample suburb/postcode/etc then households).
ii. for a sensible justification:
- SRS: removes selection bias (and guarantees independence).
- Stratified random sample: controls for variation in geographic location.
- Multi-stage sample: more feasible logistically.
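(c)
i. Normal.
ii. The Central Limit Theorem.
iii.
$$\sigma_{\sum_i X_i / n} = \sqrt{\frac{1}{n^2}\,\sigma^2_{\sum_i X_i}} = \sqrt{\frac{1}{n^2}\sum_i \sigma^2} \quad\text{(if the } X_i \text{ are independent)}$$
$$= \sqrt{\frac{1}{n^2}\, n\sigma^2} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$$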
iv. 95% (68-95-99.7 rule), or using tables, 95.44%.

(d)
i.
$$\frac{0.71}{\sqrt{200}} \approx 0.05$$
ii. A 95% CI for the true mean is:
$$\bar{x} \pm t^* \frac{s}{\sqrt{n}} = 1.49 \pm 1.98 \times 0.05 = 1.49 \pm 0.10 = (1.39, 1.59)$$
iii. No: the mean number of cars could be 1.57 (it's inside the CI).

1. [15 marks]
(a) R2 = 0.9229, so the relationship is strong (92% of the variation in Y is explained
by regression against X).
(b) A 95% CI for $\beta_1$ is:
$$(b_1 - t^* SE_{b_1},\ b_1 + t^* SE_{b_1}) = (-0.0288 - 2.262 \times 0.002775,\ -0.0288 + 2.262 \times 0.002775)$$
$$= (-0.0288 - 0.0063,\ -0.0288 + 0.0063) = (-0.0351, -0.0225)$$
$t^*$ is the value from the $t(n - 2) = t(9)$ distribution such that $P(-t^* < t(9) < t^*) = 0.95$.
(c) We are 95% confident that the expected reduction in world record is between
0.022 and 0.035 of a second per year.
(d) There are four assumptions:
- Linearity (reasonable since no pattern on residual vs fits plot)
- Normality (OK? Questionable? from normal quantile plot)
- Independence (can't be checked from data)
- Equal variance (OK since no pattern on residual vs fits plot)

(e) The predicted time (in seconds) is:
$$\hat{y} = 67.8543 - 0.0288 \times 2012 = 9.91$$
(f) We are extrapolating, that is, predicting well outside of the range of the data, and
the relationship may not be linear outside of the data range.
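The slope interval in (b) and the prediction in (e) are simple arithmetic on the regression output. Here is a small Python sketch of both, assuming the estimates quoted in the solution (intercept 67.8543, slope -0.0288, standard error 0.002775) and the table value t* = 2.262; the function names are illustrative only.

```python
def slope_ci(b1, se_b1, t_star):
    """Confidence interval for a regression slope: b1 +/- t* x SE(b1)."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

def predict(b0, b1, x):
    """Fitted value from the least squares line y-hat = b0 + b1 * x."""
    return b0 + b1 * x

# Values taken from the solution above; t* = 2.262 is read from t(9) tables
lower, upper = slope_ci(-0.0288, 0.002775, 2.262)
y_2012 = predict(67.8543, -0.0288, 2012)
print(round(lower, 4), round(upper, 4), round(y_2012, 2))
```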

2. [15 marks]
(a) (III).
(b) The graph suggests the differences are centered around zero, with some values
negative (as much as 1 second) and some positive (as much as 1.6 seconds). So
overall, it seems athletes performed similarly at London as at the trials, with
some performing better and some performing worse.
(c)
i. $H_0: \mu = 0$, $H_a: \mu > 0$, where $\mu$ is the average difference, London minus Trial.
ii. This is a paired t-test so the formula is:
$$t = \frac{\bar{x}}{s/\sqrt{n}} = \frac{0.006}{0.680/\sqrt{17}} = 0.036$$
iii. Under H0, t comes from a $t(n - 1) = t(16)$ distribution.
iv. $P(t(16) > 0.036) > 0.25$
v. There is no evidence against the null hypothesis, that is, there is no evidence that the athletes under-performed at the Olympics.
(d)
i. Independence and normality (of differences).
ii.
- Independence (of differences): not guaranteed, since there is no information on whether or not this was an SRS (although perhaps reasonable anyway).
- Normality: from the boxplot, data look symmetric so this is reasonable (since n > 15 we only needed a fair approximation anyway).
3. [15 marks]
(a) Table (II)
(b)
- The data are paired and this table was constructed taking this into account.
- This table allows us to control for variation across events in how well Australians perform.
(c)
i. $H_0: p = 0.5$, $H_a: p < 0.5$, where p is the probability of a medal at
London, out of the events where Australia medalled at one games but not
both.
ii.
$$z = \frac{\hat{p} - 0.5}{\sqrt{0.5(1 - 0.5)/n}} = \frac{3/15 - 0.5}{\sqrt{0.5 \times 0.5/15}} = -2.32$$
iii. Under H0, z comes from a N(0, 1) distribution.
iv. $P(Z < -2.32) = 0.0102$
v. There is some evidence against H0, that is, there is some evidence that
swimmers under-performed at London (in the sense that they won fewer
medals than at Beijing).
(d)
i. Independence and normality (of $\hat{p}$)
ii. $np_0 = 15 \times 0.5 = 7.5 < 10$, so n was not large enough to guarantee that
this assumption was satisfied.
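The one-proportion z statistic in (c) can be verified with the Python standard library. This is a hedged sketch, assuming the counts used in the solution (3 medals at London out of 15 discordant events); `one_proportion_z` is a made-up helper name.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, p0):
    """z statistic for H0: p = p0, using the null standard error sqrt(p0(1 - p0)/n)."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se0

z = one_proportion_z(3, 15, 0.5)
# Lower-tail P-value for Ha: p < 0.5, from the standard normal distribution
p_value = NormalDist().cdf(z)
print(round(z, 2), round(p_value, 3))
```

Because the code keeps z unrounded, its P-value differs slightly from the table value 0.0102, which is read off at z = -2.32.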

4. [15 marks]
(a) [Figure: probability histogram of medals won. Horizontal axis: number of medals (0 to 3); vertical axis: probability (0 to 0.4).]
(b)
$$\mu_X = x_1 p_1 + x_2 p_2 + \ldots + x_k p_k = 0 \times 0.23 + 1 \times 0.45 + 2 \times 0.27 + 3 \times 0.05 = 1.14$$
(c) In the long run, we would expect that 1.14 out of three London swimmers
would be medallists.
(d)
i. If we let X = number of swimmers with a medal, then there are (3 - X)
swimmers without a medal. Swimmers that won a medal get $10,000 while
the swimmers that didn't win a medal get $1,000. So the payout would
be Y = 10000 X + 1000 (3 - X) = 3000 + 9000 X.
ii.
$$\mu_Y = 3000 + 9000\mu_X = 3000 + 9000 \times 1.14 = 13260$$
(e)
i.
$$P(X \geq 1) = 1 - 0.23 = 0.77$$
ii.
$$P(X = 3 \mid X \geq 1) = \frac{P(X = 3, X \geq 1)}{P(X \geq 1)} = \frac{P(X = 3)}{P(X \geq 1)} = \frac{0.05}{0.77} = \frac{5}{77} \approx 0.065$$
(f)
$$P(X = 3) = \frac{\binom{18}{3}\binom{29}{0}}{\binom{47}{3}} = \frac{18 \times 17 \times 16}{47 \times 46 \times 45} \approx 0.05$$
(g) Independent events satisfy P (B|A) = P (B), that is, the probability does not
change on conditioning. But the distribution of medals did change when we
conditioned on male swimmers, so gender is not independent of whether or not
they won a medal.
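The calculations in parts (b) to (f) all follow from the probability table for X. The sketch below, in Python, reproduces them under the assumption that the distribution is the one used in the solution (0.23, 0.45, 0.27, 0.05); the counts 18, 29 and 47 are taken as they appear in solution (f).

```python
from math import comb

# Probability distribution of X = number of medallists, as used in the solution
pmf = {0: 0.23, 1: 0.45, 2: 0.27, 3: 0.05}

# Mean of a discrete random variable: sum of x * P(X = x)
mean_x = sum(x * p for x, p in pmf.items())

# Payout Y = 3000 + 9000 X is linear in X, so its mean is 3000 + 9000 * mean_x
mean_payout = 3000 + 9000 * mean_x

# Conditional probability P(X = 3 | X >= 1) = P(X = 3) / P(X >= 1)
p_at_least_one = 1 - pmf[0]
p_three_given_medal = pmf[3] / p_at_least_one

# Part (f): choose 3 from the group of 18 and none from the group of 29
p_all_three = comb(18, 3) * comb(29, 0) / comb(47, 3)

print(round(mean_x, 2), round(mean_payout), round(p_three_given_medal, 3), round(p_all_three, 3))
```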

1. [15 marks]
(a)
i. Year is the explanatory variable and birth rate is the response variable.
ii. Both are quantitative variables.
(b) There is a linear trend and it is decreasing.
(c)
i. residual = $y - \hat{y}$. So, substituting the first y value, y = 19.4, and the first
fitted value into the equation gives the answer.
ii. [Figure: plot of residuals against Year (1970 to 2010).]
iii. Linear regression seems appropriate for prediction since $r^2 = 0.7284$ is
relatively high and there is no pattern in the residual plot.
(d)
i. At x = 1978,
$$\hat{y} = 231.1398 - 0.1084x = 231.1398 - 0.1084 \times 1978 = 16.7246$$
ii. Residual calculation at x = 1978,
$$\text{residual} = y - \hat{y} = 15 - 16.7246 = -1.7246$$
(e)
i. Prediction at x = 2025,
$$\hat{y} = 231.1398 - 0.1084x = 231.1398 - 0.1084 \times 2025 = 11.6298$$
ii. It is dangerous to extrapolate.

2. [15 marks]

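(a)
i. $\bar{X}$ is a normal random variable with $\bar{X} \sim N\left(\mu = 501, \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{36}} = \frac{1}{3}\right)$.
Computing the probability,
$$P(\bar{X} \geq 500) = P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \geq \frac{500 - 501}{1/3}\right) = P(Z \geq -3) = P(Z \leq 3) = 0.9987$$
ii. The total weight can be written $T = \sum_{i=1}^{36} X_i \sim N(\mu_T = 501 \times 36 = 18036,\ \sigma_T = 6 \times 2 = 12)$. The probability,
$$P(T \geq 18000) = P\left(\frac{T - \mu_T}{\sigma_T} \geq \frac{18000 - 18036}{12}\right) = P(Z \geq -3) = P(Z \leq 3) = 0.9987$$
iii. Assumptions are that data are independent with identical mean and variance. This then means the CLT can be used.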
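Both probabilities in part (a) reduce to the same standard normal tail. The following Python sketch checks this with `statistics.NormalDist`, assuming the values used in the working above (mean 501, standard deviation 2, n = 36); the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 501, 2, 36

# Sample mean: N(mu, sigma / sqrt(n))
mean_dist = NormalDist(mu, sigma / math.sqrt(n))
p_mean = 1 - mean_dist.cdf(500)      # P(X-bar >= 500)

# Total of n observations: N(n * mu, sqrt(n) * sigma)
total_dist = NormalDist(n * mu, math.sqrt(n) * sigma)
p_total = 1 - total_dist.cdf(18000)  # P(T >= 18000)

print(round(p_mean, 4), round(p_total, 4))
```

The two events are the same event written in different units (a mean of at least 500 is a total of at least 18000), which is why the probabilities agree.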
(b)
i. Proportion using table,
$$\hat{p} = \text{proportion mislabelled as 5500g} = \frac{20}{100} = 0.2 = 20\%$$
ii. $\hat{p} \sim N\left(p, \sqrt{p(1 - p)/n}\right)$. So we use the following interval,
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 0.2 \pm 1.96\sqrt{\frac{0.2(1 - 0.2)}{100}} = 0.2 \pm 1.96 \times 0.04 = (0.1216, 0.2784)$$
iii. Hypotheses are,
H0: The two faults are not associated (independent)
vs
Ha: The two faults are associated (not independent)
Distribution of $X^2$: under H0, $X^2 \sim \chi^2_1$.
Bounding the p-value,
$$\text{P-value} = P(\chi^2_1 \geq 5.674) \leq P(\chi^2_1 \geq 5.02) = 0.025 < 0.05$$
Since the p-value is small, we reject the null hypothesis in favour of the
alternative, i.e. we have evidence against the hypothesis that the faults
are independent.
3. [15 marks]
(a) The elderly have the larger variability (sy > sx ). Two elderly patients have
very high SBP (190 and 195) compared to everyone else which would make the
variability of the elderly patients larger.
(b)
i. The null and alternative hypotheses are $H_0: \mu_x = \mu_y$ vs $H_a: \mu_x < \mu_y$,
where $\mu_x$ and $\mu_y$ are the average SBP for the young adults and the elderly
respectively.
ii. This is a two-sample t-test so the formula is:
$$t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}$$
with pooled sd
$$s_p = \sqrt{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}} = \sqrt{\frac{(13 - 1) \times 19.57^2 + (15 - 1) \times 30.34^2}{13 + 15 - 2}} = 25.93$$
$$t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}} = \frac{126.15 - 133.6}{25.93\sqrt{\frac{1}{13} + \frac{1}{15}}} = -0.76$$
iii. Under H0, t comes from a $t_{n_x + n_y - 2} = t_{26}$ distribution.
iv. P-value $= P(t_{26} < -0.76) = P(t_{26} > 0.76)$. From the table this means
that 0.2 < P-value < 0.25.
v. There is no evidence to suggest that the SBP of the young adults is lower
than the SBP of the elderly, on average.
(c)
i. The required assumptions for the test are
- The sd of both patient groups are equal to each other.
- The sample means are approximately normal (or all observations are normal).
- All the observations are independent.
ii. Checking the validity of these
- Normality: can check the QQ plots in (I) and (II). Seems approximately satisfied overall. The QQ plot for X looks OK. The QQ plot for Y has slight right skewness (last two values too large).
- Equal variance: from the summary, y has a larger sd, $s_y = 30.34$ compared to $s_x = 19.57$. However, this falls within the rule of thumb that $s_y/s_x < 2$.
- Independence: not guaranteed or verifiable from the question, although the patients are different people.

4. [15 marks]
(a)
i. Computing the correct missing value,
$$P(X = 0) = 1 - P(X \neq 0) = 1 - 0.55 - 0.12 - 0.03 = 0.3$$
ii.
$$\mu_X = \sum_{x=0}^{3} x P(X = x) = 0 \times 0.3 + 1 \times 0.55 + 2 \times 0.12 + 3 \times 0.03 = 0.88$$
iii. [Figure: probability histogram of X. Horizontal axis: number of times people have been married (0 to 3); vertical axis: probability (0 to 0.5).]
(b)
i. Possible values are X = 1, 2 or 3.
ii. Conditional probabilities, P(A|B) = P(A and B)/P(B). So here, P(X > 0) = 1 - 0.3 = 0.7. Then the conditional probabilities are,

x                   1        2        3
P(X = x | X > 0)    0.7854   0.1695   0.0451

iii. The conditional mean is,
$$\mu_{X \mid X > 0} = \sum_{x=0}^{3} x P(X = x \mid X > 0) = 1 \times 0.7854 + 2 \times 0.1695 + 3 \times 0.0451 = 1.2597$$
i. The marginal table

Number of Marriages
Male
Female
38213 60561
33283 70881
13432 3589
14915 3944
Total
115795
123023
Total 71496 131442 28347 7533 238818

ii. Computing the expected counts under independence and comparing with
the observed counts, we see they do not match.

                 0           1           2          3
Male      34666.06    63731.91    13744.53     3652.5
Female    36829.94    67710.09    14602.47     3880.5

Alternatively one can compare the conditional to unconditional distributions of X (given male, M, or female, F). The conditional distributions are
in Tables 4.1 and 4.2.

x               0       1       2       3
P(X = x | F)    0.27    0.575   0.121   0.032
Table 4.1: Conditional probability distribution of X|F.

x               0       1       2       3
P(X = x | M)    0.33    0.523   0.116   0.031
Table 4.2: Conditional probability distribution of X|M.

The conditional probabilities of number of marriages given a gender (either
M or F) do not match the unconditional probabilities (probability of
number of marriages ignoring gender).
Alternatively again, one can choose to do a $\chi^2$-test. We wish to test the
hypotheses,
H0: There is no association between Gender and Number of Marriages
vs
Ha: There is an association between Gender and Number of Marriages
The $\chi^2$ statistic is $X^2$ given by
$$X^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
$$X^2 = \frac{(38213 - 34666)^2}{34666} + \frac{(33283 - 36830)^2}{36830} + \frac{(60561 - 63732)^2}{63732} + \frac{(70881 - 67710)^2}{67710}$$
$$+ \frac{(13432 - 13745)^2}{13745} + \frac{(14915 - 14602)^2}{14602} + \frac{(3589 - 3653)^2}{3653} + \frac{(3944 - 3880)^2}{3880} \approx 1027$$
Under the null hypothesis, $X^2$ behaves like a $\chi^2_3$ distribution ($r = 2$, $c = 4 \Rightarrow (r - 1)(c - 1) = 1 \times 3 = 3$).
So the P-value is,
$$\text{P-value} = P(\chi^2_3 \geq 1026.7033) < 0.005$$
The P-value is small so we reject H0. That is, we have evidence against the
null hypothesis. There is evidence to suggest that there is an association
between Gender and Number of Marriages.
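All three approaches in part (c) (expected counts, conditional distributions, and the chi-square statistic) can be reproduced from the observed table. The following is a minimal Python sketch under that assumption; the variable names are illustrative and not part of the original solutions.

```python
# Observed counts: rows are gender, columns are number of marriages 0, 1, 2, 3
observed = {
    "Male":   [38213, 60561, 13432, 3589],
    "Female": [33283, 70881, 14915, 3944],
}

row_totals = {g: sum(counts) for g, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Conditional distribution of X given gender, and the unconditional distribution
conditional = {g: [c / row_totals[g] for c in counts] for g, counts in observed.items()}
unconditional = [c / n for c in col_totals]

# Expected counts under independence and the chi-square statistic
expected = {g: [row_totals[g] * c / n for c in col_totals] for g in observed}
x2 = sum((o - e) ** 2 / e
         for g in observed
         for o, e in zip(observed[g], expected[g]))
df = (len(observed) - 1) * (len(col_totals) - 1)

print([round(p, 3) for p in conditional["Male"]])
print([round(p, 3) for p in conditional["Female"]])
print([round(p, 3) for p in unconditional])
print(round(x2, 1), df)
```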

1. [15 marks]
(a) Possible linear relationship, points are increasing and with moderate linear
strength, but there is also an outlier.
(b) $R^2 = 0.148001$ (or the adjusted $R^2 = 0.08$); this implies that only 14.8% (or
8%) of the variation in brain weight is explained by the linear regression on
body weight. Therefore the linear relationship seems weak.
(c)
i. The reduced dataset. It is the only dataset that has
- Equal variance of residuals (equal spread in residual plot for reduced dataset)
- Normality of residuals (linear pattern in quantile plot for reduced dataset)
- Linearity of data (no pattern in reduced dataset residual plot)
- (Bonus reason): increase in $R^2$ when humans not included.
The full dataset has an outlier which ruins the diagnostic criteria above.
ii. A.
$$b_1 \pm t^* SE(b_1) = 0.917 \pm 3.055 \times 0.162 = 0.917 \pm 0.495 = (0.422, 1.412)$$
B. We are 95% confident that the expected increase in brain weight (in
g) for each unit increase in body weight (in kg) is between 0.422 and
1.412 g.
iii. The human point is an outlier. The least squares regression line minimises
the square distance of each point to the line. A point far away will have a
lot of influence since the line wants to minimise that large distance.
2. [15 marks]
(a)
i. Note that each time they toss the three coins the probability of 3 heads is
1/8 = 1/2 × 1/2 × 1/2 because the outcomes of the three coin flips are independent.
So the probability they eat at the Whitehouse is 1/8.
(Similarly the probability of three tails is 1/8, so the probability they eat at
the Roundhouse is 1/8. And the probability of 2 heads and a tail, or 2 tails
and a head, is 6/8 = 3/4, so the probability they eat at the Quad is 3/4.)
ii. There are 5 days, and each time the probability of eating at the Whitehouse
is p = 1/8. The outcomes of the n = 5 days are independent. So the
distribution of X is binomial, with n = 5 and p = 1/8, that is
$$X \sim B\left(5, \tfrac{1}{8}\right).$$
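Furthermore,
$$\mu_X = np = 5 \times \frac{1}{8} = \frac{5}{8}$$
and
$$\sigma_X^2 = np(1 - p) = 5 \times \frac{1}{8} \times \frac{7}{8} = \frac{35}{64}.$$
iii. This is just the probability P(X = 1), where $X \sim B(5, \tfrac{1}{8})$.
$$P(X = 1) = \binom{5}{1}\left(\tfrac{1}{8}\right)^1\left(1 - \tfrac{1}{8}\right)^4 = 0.3663635 = 0.37 \text{ (to 2 decimal places).}$$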
iv. There were a great many different ways of answering this, and some really
wonderful correct answers given by students to this part.
One of the many possibilities is:
The probability we are looking for is the conditional probability of 3 heads,
given 3 tails has not occurred:
$$P(3H \mid 3T \text{ has not occurred}) = \frac{P(3H)}{1 - P(3T)} = \frac{1/8}{7/8} = \frac{1}{7}$$
(b)
i. The total weight (in kg) has a normal distribution with mean 13 × 75 and
variance 13 × 9², that is the total weight,
$$T \sim N\left(13 \times 75, \sqrt{13 \times 9^2}\right) = N\left(975, \sqrt{1053}\right) = N(975, 32.45).$$
Note that this means that $Z = \frac{T - 975}{32.45} \sim N(0, 1)$.
ii.
$$P(T > 1000) = P\left(Z > \frac{1000 - 975}{32.45}\right) = P(Z > 0.77) = 0.2206 \text{ (from tables).}$$
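The binomial and normal calculations in this question can be checked with the Python standard library. This is a sketch under the assumptions stated in the solution (p = 1/8 per day over 5 independent days; weights normal with mean 75 kg and sd 9 kg, independent across the 13 people); `binomial_pmf` is an illustrative helper.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Part (a): X = number of days (out of 5) they eat at the Whitehouse
n, p = 5, 1 / 8
mean_x = n * p               # 5/8
var_x = n * p * (1 - p)      # 35/64
p_once = binomial_pmf(1, n, p)

# Part (a) iv: 3 heads given that 3 tails did not occur
p_whitehouse_break = (1 / 8) / (1 - 1 / 8)

# Part (b): total weight of 13 people is N(13 * 75, sqrt(13) * 9)
total = NormalDist(13 * 75, sqrt(13) * 9)
p_overweight = 1 - total.cdf(1000)

print(round(p_once, 7), round(p_whitehouse_break, 4), round(p_overweight, 4))
```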
3. [15 marks]
(a)
i. Let $\mu$ be the mean difference in revenue in dollars (Airport minus City) in a random
month. $H_0: \mu = 0$, $H_a: \mu \neq 0$.
ii. This is a paired t-test, so it uses the mean and sd from the differences.
$$T = \frac{\bar{X}}{s/\sqrt{n}}.$$
When H0 is true, $T \sim t_{n-1} = t_{12-1} = t_{11}$.
The observed value of the test statistic is
$$T = \frac{57404}{52278/\sqrt{12}} = 3.8038 \text{ (to 4 decimal places).}$$
iii. P-value $= 2 \times P(T > 3.80)$ when $T \sim t_{11}$.
From tables, 0.001 < P(T > 3.80) < 0.0025,
so 0.002 < P-value < 0.005.
iv. This is a very (very) small P-value, so there is very strong evidence that
there is a difference in revenue between the two offices.

(b) We need to assume that the data values (the differences) for the different
months are independent. (The plot of difference vs month does not show any
particular trend.)
We need to assume that the differences are approximately normally distributed. The quantile plot of the differences is approximately linear so this
assumption is plausible. (Could comment on boxplot or histogram of difference
but need to include comment on the qqplot.)
(c)
i. A 95% confidence interval for the mean difference in monthly revenue
(airport minus city) is given by
$$\left(\bar{x} - t^* \frac{s}{\sqrt{n}},\ \bar{x} + t^* \frac{s}{\sqrt{n}}\right)$$
where $t^*$ for 95% confidence from t(11) is $t^* = 2.201$. This gives
$$\left(57404 - 2.201 \times \frac{52278}{\sqrt{12}},\ 57404 + 2.201 \times \frac{52278}{\sqrt{12}}\right) = (24188, 90620)$$
ii. We are 95% confident that the mean difference in revenue is between
$24188 and $90620; that is, we are 95% confident that on average the revenue at the airport depot is between $24188 and $90620 more than that
at the city depot.
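The paired analysis can be reproduced directly from the monthly revenue table in the question. The Python sketch below recomputes the differences, their mean and standard deviation, the observed t statistic and the 95% interval; it assumes the table value t* = 2.201 for t(11), since the standard library has no t distribution.

```python
import math
import statistics

# Monthly revenue (July to June) from the question
airport = [283592, 269620, 312220, 300679, 217889, 381030,
           232288, 186285, 230672, 248172, 221898, 257073]
city = [188010, 197874, 193954, 210545, 212116, 277022,
        239715, 197761, 256650, 182655, 146602, 149663]

# Paired design: analyse the month-by-month differences (airport minus city)
diffs = [a - c for a, c in zip(airport, city)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)   # sample sd (divides by n - 1)
se = sd_d / math.sqrt(n)

t_obs = mean_d / se              # test statistic for H0: mean difference = 0
t_star = 2.201                   # assumed: 97.5th percentile of t(11), from tables
ci = (mean_d - t_star * se, mean_d + t_star * se)

print(round(mean_d), round(sd_d), round(t_obs, 4), tuple(round(x) for x in ci))
```

Note that three of the differences (January to March) are negative, which matters for both the mean and the standard deviation.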
4. [15 marks]
(a)
i. The correct marginal counts for Goal and Location are below.

                       Goal
Location     Grades   Popular   Sports   Total
Rural            57        50       42     149
Suburban         87        42       22     151
Urban           103        49       26     178
Total           247       141       90     478

ii.
$$\text{Expected count} = \frac{\text{Row total} \times \text{column total}}{\text{overall total}} = \frac{151 \times 90}{478} = \frac{13590}{478} = 28.4309623.$$
iii. H0: Goal and location are not associated (independent).
Ha: Goal and location are associated (not independent).
iv. Under H0, $X^2 \sim \chi^2_4$.
v. P-value $= P(\chi^2_4 \geq 18.8276) < 0.005$ from table.
vi. The small P-value is evidence for the claim that goal and location are related.
(b) Counts need to be independent and all expected counts > 10.
(c)
i. Require $n_i\hat{p}_i > 10$ and $n_i(1 - \hat{p}_i) > 10$ for i = 1, 2, and sampled individuals
need to be independent.
ii. Checking bounds, $n_1\hat{p}_1 = 42 > 10$, $n_1(1 - \hat{p}_1) = 107 > 10$ and $n_2\hat{p}_2 = 26 > 10$, $n_2(1 - \hat{p}_2) = 152 > 10$. Independence would need to be checked by
looking at the sampling protocols.
iii. $\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2, \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\right)$. From output, $\hat{p}_1 - \hat{p}_2 = 0.2818792 - 0.1460674 = 0.1358118$. Estimate
$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = 0.0453794,$$
with $z^* = 1.96$, so the margin of error is $m = 0.0889435$, and the
95% CI for $p_1 - p_2$ is $0.1358118 \pm 0.0889435 = (0.0468682, 0.2247553)$.
iv. The interval is entirely above zero, suggesting evidence that rural children
are more likely to have a goal of sports.
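The two-proportion interval in (c) iii can be recomputed from the counts in Table 3.4. The following is a small Python sketch, assuming 42 of 149 rural children and 26 of 178 urban children chose sports and using z* = 1.96; `two_proportion_ci` is a hypothetical helper name.

```python
import math

def two_proportion_ci(x1, n1, x2, n2, z_star=1.96):
    """Approximate CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    margin = z_star * se
    return diff, se, (diff - margin, diff + margin)

# Rural: 42 of 149 chose sports; urban: 26 of 178 chose sports
diff, se, ci = two_proportion_ci(42, 149, 26, 178)
print(round(diff, 7), round(se, 7), tuple(round(x, 7) for x in ci))
```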
