            Students   Retirees
n              45         13
sd           116.6       98.6
[Figure: plot comparing the Students and Retirees samples]
(a)
(b) Use a hypothesis test to find out if there is evidence that, on average, students
and retirees differ in how accurately their watches keep time. In your answer:
i. State clearly the hypotheses H0 and Ha .
ii. State the test statistic and its distribution assuming H0 is true.
iii. Give an expression for the P -value, and find this P -value as accurately as
you can using tables.
iv. State your conclusion in simple language.
(c)
                Response status
Genotype    Responder   Non-responder
AA              3            11
Aa             38            97
aa             88            62
While their genotypes are different, Aa individuals are expected to behave the same
(in terms of response status) as AA individuals.
(a)
(b)
i. Which type of hypothesis test could you carry out to examine whether
there is a relationship between response status and genotype?
ii. State two conditions necessary for this test to be valid.
iii. Where appropriate, check that these assumptions hold, or explain what
could be done when collecting data to ensure these assumptions were satisfied.
(c) In order to carry out the hypothesis test, the data were rearranged as follows:
Genotype          Responder   Non-responder
A* (AA or Aa)        41           108
aa                   88            62
[Figure: scatterplot with the data point for Australia labelled]
Call:
lm(formula = log(Debt) ~ log(Population))
Coefficients:
(Intercept)
Population
--Signif. codes:
[Figure: diagnostic plots: residuals against fitted values, and standardized residuals against theoretical quantiles, with Australia labelled]
Number of vehicles          0      1      2      3
Proportion of households    ?    0.39   0.35   0.16
The proportion of households with more than 3 cars was zero (to two decimal
places).
i. A number is missing from the above table. Find this missing number.
ii. Find the mean number of registered motor vehicles per household in 2006.
(b) To look at the question of whether car ownership is different in 2011 than it
was in 2006, a sample of 200 households is taken to estimate the mean number
of registered motor vehicles per household, in 2011.
i. How would you sample Australian households in order to estimate the
mean number of registered motor vehicles in 2011?
ii. Explain why you suggest this sampling design, including in your reasoning
at least one advantage of the selected design.
(c) We need to understand the properties of the sample mean ($\bar{X}$) in order to use
it to make inferences about the true mean ($\mu$).
i. What type of distribution does the sample mean number of registered cars
($\bar{X}$) come from, when calculated from 200 households?
ii. Give a reason why you think the sample mean comes from this distribution.
iii. An important result which we use for inference about means is that the
standard deviation (or standard error) of $\bar{X}$, $\sigma_{\bar{X}}$, is:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
Show where this formula comes from, including in your answer any assumptions you need to make to derive the above formula.
iv. What proportion of sample means would you expect to fall within two
standard errors of the true mean?
(d) The 2011 sample of 200 households had a sample mean of 1.49 cars per household, with a standard deviation of 0.71. Using this sample:
i. Estimate the standard error of the mean.
ii. Construct a 95% confidence interval for the true mean.
iii. Is there evidence that the mean number of cars in 2011 is different to the
mean from the 2006 census? Use your confidence interval to answer this
question.
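FORMULAE SHEET
$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \ldots + (x_k - \mu_X)^2 p_k$
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
$b_1 = r \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$.
$T = \frac{b_1 - \beta_1}{SE_{b_1}}$ has a $t$ distribution with $n - 2$ degrees of freedom.
For large $n$, $\hat{p}$ is approximately $N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$.
4. Central Limit Theorem. For a simple random sample from a population with
mean $\mu$ and standard deviation $\sigma$, when $n$ is large,
$\bar{X}$ is approximately $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
$\frac{\bar{X} - \mu}{S/\sqrt{n}}$ has a $t$ distribution with $n - 1$ degrees of freedom.
$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ has a $t(n_1 + n_2 - 2)$ distribution.
$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ has a chi-square distribution with $(r-1)(c-1)$ degrees of freedom.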
Call:
lm(formula = dat$Time ~ dat$Year)

Residuals:
     Min       1Q   Median       3Q      Max
-0.11838 -0.02490  0.01570  0.04625  0.08607

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  67.8543   5.480807   12.38 5.90e-07 ***
dat$Year     -0.0288   0.002775  -10.38 2.63e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(a) How strong is the relationship between world record time and year? Include
an appropriate numerical summary in your answer.
(b) Construct a 95% confidence interval for the true regression slope.
(c) Explain in words what this confidence interval tells us about how the women's
100 metre world record is changing over time.
(d) Is linear regression appropriate for this dataset? In your answer refer to any
specific assumptions that are made, whether or not they seem reasonable, and
what output you used in coming to this conclusion.
(e) Use the above regression analyses to predict the world record for the women's
100 metres in 2012.
(f) Your prediction in the previous question was wrong by a relatively large amount:
the world record is currently 10.49 seconds (it has not changed since 1988).
Use ideas learnt in MATH1041 to identify the main reason why your prediction
was so wrong.
(I)
            Medal won   No medal   Total
Beijing        19          13        32
London         10          22        32
Total          29          35

(II)
                           London
                     Medal won   No medal   Total
Beijing  Medal won       7          12        19
         No medal        3          10        13
         Total          10          22        32
Table (I) records the number of events in which a medal was won or not won across
each of the 32 events, separately for each Games.
Table (II) cross-classifies the data, recording the number of events for which a
medal was won at both Games, neither, or only one of the two Games.
(a) Which table better answers the research question?
Use study design as your main consideration in answering this question.
(b) Briefly explain how the research question is better answered by the table you
chose in part 3(a).
(c) Hence test whether significantly fewer medals were won in the pool at London
than at Beijing. In your answer:
i. State clearly the hypotheses H0 and Ha .
ii. Calculate the test statistic, showing your working.
iii. State the distribution of the test statistic assuming H0 is true.
iv. Give an expression for the P -value, and find this P -value as accurately as
you can using tables.
v. State your conclusion in simple language.
(d)
X              0      1      2      3
Probability   0.23   0.45   0.27   0.05
i. Find the probability that at least one of the people you meet won a medal.
ii. Assume now that you notice a medal hanging around one of the swimmers'
necks. Given this information, what is the chance that all swimmers won
medals?
(f) The three swimmers you meet at the shops can be considered as a simple
random sample of the 47 Australian swimmers at the London Games. Of these
47 swimmers, 18 won a medal.
Use this information to verify that P (X = 3) = 0.05, to two decimal places.
(g) If the three swimmers at the shops were male, then the probability distribution
of X would have been as follows:
X              0      1      2      3
Probability   0.39   0.46   0.14   0.01
Is gender of the swimmers you meet independent of whether or not they won
any medals? Include reasons.
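FORMULAE SHEET
$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \ldots + (x_k - \mu_X)^2 p_k$
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
$b_1 = r \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$.
$T = \frac{b_1 - \beta_1}{SE_{b_1}}$ has a $t$ distribution with $n - 2$ degrees of freedom.
For large $n$, $\hat{p}$ is approximately $N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$.
4. Central Limit Theorem. For a simple random sample from a population with
mean $\mu$ and standard deviation $\sigma$, when $n$ is large,
$\bar{X}$ is approximately $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
$\frac{\bar{X} - \mu}{S/\sqrt{n}}$ has a $t$ distribution with $n - 1$ degrees of freedom.
$\frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ has a $t(n_1 + n_2 - 2)$ distribution.
$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ has a chi-square distribution with $(r-1)(c-1)$ degrees of freedom.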
Year   1965  1970  1975  1980  1985  1990  1995  2000  2005  2009
Rate   19.4  18.4  14.8  15.9  15.6  16.4  14.8  14.4  14.0  13.5
[Figure: scatterplot of Rate against Year]
Call:
lm(formula = Rate ~ Year)

Residuals:
   Min     1Q Median     3Q    Max
-2.260 -0.310  0.085  0.640  1.250

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 231.1398    46.4169    4.98   0.0011 **
Year         -0.1084     0.0234   -4.64   0.0017 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.05 on 8 degrees of freedom
Multiple R-squared:  0.729, Adjusted R-squared:  0.695
F-statistic: 21.5 on 1 and 8 DF,  p-value: 0.00166
(a)
Observation     1      2      3      4      5      6      7      8      9     10
Residual      1.25   0.79  -2.26  -0.62  -0.38   0.96  -0.10   0.05   0.19   0.12
(e)
Incorrect 5500g
12
8
correct 500g
25
55
test.
Young adults (x): 100 100 104 104 112 130 130 136 140 140 142 146 156
Elderly (y): 95 100 100 110 110 122 130 135 136 138 140 141 162 190 195
$n_x = 13$, $\bar{x} = 126.15$, $s_x = 19.57$
$n_y = 15$, $\bar{y} = 133.6$, $s_y = 30.34$
[Figure: (I) X quantiles against theoretical quantiles; (II) Y quantiles against theoretical quantiles; (III) frequency histogram of SBP values for combined data]
(a) Which group of patients has the larger variability in SBP, the young adults
or the elderly? Identify the numerical summaries that show this and explain
what data point(s) could be the cause of this larger variability.
(b) Making any necessary assumptions, use a hypothesis test to find out if there
is any evidence that, on average, the elderly patients' SBP is higher than the
young adults' SBP in the intensive care unit. In your answer:
i. State clearly the hypotheses H0 and Ha .
ii. Calculate the test statistic, showing your working.
iii. State the distribution of the test statistic assuming H0 is true.
iv. Give an expression for the P -value, and find this P -value as accurately as
you can using tables.
v. State your conclusion in simple language.
(c)
i. What assumptions were required for the hypothesis test in (3b) to be valid?
ii. Explain whether these assumptions are valid. In your answer, refer to any
appropriate information given or explain what further information might
be required.
(a)
X              0      1        2        3
Probability    ?    0.549   0.1185   0.0315
i. Determine the missing value in the above table. That is, what is the chance
of not being married?
ii. Given the information, compute $\mu_X$, the mean of the distribution.
iii. Using the table and the result computed in (4(a)i), construct a probability
histogram and comment on its features.
(b) Consider individuals that have been married at least once. That is, consider
the values of X, where X > 0.
i. Using Table 3.2, list the set of possible outcomes for X if it is known
already that this particular person has been married at least once.
ii. Create a new conditional probability distribution table for the values considered in (4(b)i). That is, create a table of the conditional probabilities
P (X = k|X > 0) for all possible values of k.
iii. Compute the conditional mean, $\mu_{X|X>0}$, using the table you created in
(4(b)ii).
(c) It is suggested that the individuals of different genders have a different marriage
distribution. Data was collected from 238,818 individuals that can be used to
investigate this claim.
Table 3.3: Frequency table of X across gender.
Number of Marriages
Male
Female
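FORMULAE SHEET
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
$T = \frac{b_1 - \beta_1}{SE_{b_1}} \sim t(n - 2)$.
$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \ldots + (x_k - \mu_X)^2 p_k$
For large $n$, $\hat{p}$ is approximately $N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$.
5. Central Limit Theorem. For a random sample from a population with mean
$\mu$ and standard deviation $\sigma$, when $n$ is large,
$\bar{X}$ is approximately $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1)$.
$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$
$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \sim \chi^2_{(r-1)(c-1)}$.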
[Figure 1: scatterplot of the body and brain weight data, with the Human data point labelled]
(a) Describe the relationship between body and brain weights in the data as displayed in the scatterplot in Figure 1.
(b) A linear regression is fitted to the data and a partial output summary is given
below.
Call:
lm(formula = brain ~ body, data = MedAnimals)
Residual standard error: 317 on 13 degrees of freedom
Multiple R-squared:  0.148, Adjusted R-squared:  0.0825
F-statistic: 2.26 on 1 and 13 DF, p-value: 0.157
A researcher views the above analysis output and states that "The linear relationship between body weight and brain weight is weak." Justify their statement by including an appropriate numerical summary and explain how this
numerical summary shows the linear relationship is weak.
[Diagnostic plots: residuals against fitted values, and standardised residuals against theoretical quantiles, with the Human data point labelled]
Figure 3.4: Diagnostic plots. Row 1: with humans, Row 2: without humans.
(c) To see the impact of the human data point on the linear regression, the
human data point is removed from the dataset and the linear regression recalculated. The summary output is given below and diagnostic plots for the
regression with and without the human data point are given in Figure 2.
Call:
lm(formula = brain ~ body, data = MedAnimals[-human, ])

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  135.502     41.032    3.30   0.0063
body           0.917      0.162    5.67   0.0001

Residual standard error: 112 on 12 degrees of freedom
Multiple R-squared:  0.728, Adjusted R-squared:  0.705
i. Is linear regression appropriate for the full dataset, the reduced dataset
(without the human data) or for both of them? In your answer refer to
any assumptions that are made, whether or not they seem reasonable, and
what output or diagnostic plots you used in coming to this conclusion.
ii. Assume that linear regression is appropriate for the reduced dataset.
A. Compute a 99% confidence interval for the expected change in brain
weight in grams for each kg increase of body weight.
B. Explain in plain English what your computed interval says about the
average brain weights of vertebrates.
iii. The two least-squares regression lines for the full dataset and reduced
dataset are given in Figure 3. The solid line is calculated using the full
dataset while the dashed line is calculated using the data with the human
data point removed. With reference to the least-squares method, explain
why the solid line has tilted upwards compared to the dashed line.
[Figure 3: scatterplot of the body and brain weight data with the two least-squares lines, with the Human data point labelled]
Otherwise, (if they get 2 Heads 1 Tail, or 2 Tails 1 Head) they go to the
Quad.
i. On a random day during session what is the probability they eat at the
Whitehouse?
ii. At the beginning of one week they discuss how often they might be eating
at the Whitehouse over the 5 lunch-times (Monday to Friday). Let X be
the number of lunches at the Whitehouse in a random week.
A. What is the distribution of X? State the name of the distribution and
the value of any parameters.
B. State the mean and variance of X, that is state $\mu_X$ and $\sigma_X^2$.
iii. Determine the probability that in a random week (Monday to Friday) they
eat at the Whitehouse just once, and on 4 days they eat elsewhere.
iv. In the study break the students still meet for lunch, but the Roundhouse is
closed. They determine their lunch place in the same way, but if they get
3 Tails then they can't now go to the Roundhouse so they flip the coins
again, possibly several times, until they get a different outcome. They
then eat at the Whitehouse (if they get 3 Heads) and otherwise they go
to the Quad.
On a random day during study break what is the probability they eat at
the Whitehouse?
(b) Assume that the distribution of weights of people is approximately normal and
that the mean is 75 kg and standard deviation is 9 kg. Lifts in the Red Centre
at UNSW have notices saying that the maximum number of people to go in
the lift is 13 and the maximum weight is 1000 kg.
i. What is the distribution of the total weight of 13 random people? State
the name of the distribution and the value of any parameters.
ii. Suppose that 13 people get in to the lift in the Red Centre.
A. Estimate the probability that the total weight of the 13 people exceeds
1000kg.
B. What assumptions did you make to estimate this probability?
204380
38873
57404
52278
There are some graphs of this data given in Figure 4 on page 64.
(a) Carry out a hypothesis test to see if there is significant evidence that there is
a difference in expected revenue per month. In your answer
i. State clearly the hypotheses H0 and Ha .
ii. State the test statistic.
State the distribution of the test statistic assuming H0 is true.
Determine the observed value of the test statistic.
iii. Give an expression for the P -value.
Determine this P -value as accurately as you can from the tables.
iv. State your conclusion in plain language.
(b) What assumptions did you need to make for your hypothesis test to be valid?
Do you think these assumptions were reasonable?
In your answer refer to the graphs and diagnostic plots on the next page, stating
carefully which plots you are using.
(c)
[Figure 4: plots of the revenue data, including "Airport vs City", "difference vs month", "Boxplot of difference", a frequency histogram, and normal quantile plots (sample quantiles against theoretical quantiles)]
                    Goal
Location    Grades   Popular   Sports
Rural         57       50        42
Suburban      87       42        22
Urban        103       49        26
i. What are these conditions that need to be satisfied so that $\hat{p}_1$ and $\hat{p}_2$ can
be well approximated with Normal distributions?
ii. Are these conditions satisfied here?
iii. Give an approximate 95% confidence interval estimate of the difference in
proportions for urban and rural children. That is, compute an approximate
confidence interval estimate of $p_1 - p_2$.
iv. Using this confidence interval, determine if there is evidence that rural
children are more likely to have a goal of sports compared to urban children.
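FORMULAE SHEET
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
$T = \frac{b_1 - \beta_1}{SE_{b_1}} \sim t(n - 2)$.
$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \ldots + (x_k - \mu_X)^2 p_k$
$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$.
X has mean $\mu_X = np$ and variance $\sigma_X^2 = np(1 - p)$.
For large $n$, $\hat{p}$ is approximately $N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$.
5. Central Limit Theorem. For a random sample from a population with mean
$\mu$ and standard deviation $\sigma$, when $n$ is large,
$\bar{X}$ is approximately $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1)$.
$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1 + n_2 - 2)$
$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \sim \chi^2_{(r-1)(c-1)}$.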
Chapter 4
Past Exam Solutions
This chapter contains solutions to all of the past exam papers presented in the previous
chapter. In some cases a marking guide is also included, so that you can check what your
raw mark would have been on any given exam, to get an indication of whether you are
on track!
In the past, the occasional typographic error has been found in solutions. If you think
you have found such an error, then please let your course authority know.
1. [15 marks]
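(a)
$s_p^2 = \frac{(n_{student} - 1)s_{student}^2 + (n_{retiree} - 1)s_{retiree}^2}{n_{student} + n_{retiree} - 2}$
$t = \frac{\bar{x}_{student} - \bar{x}_{retiree}}{s_p\sqrt{\frac{1}{n_{student}} + \frac{1}{n_{retiree}}}} = \frac{32.2 - 78.8}{112.98\sqrt{\frac{1}{45} + \frac{1}{13}}} = -1.31$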
i.
Independence assumptions OK because subjects were randomly selected (from different populations, students and retirees)
Normality OK because total sample size is large (and there is no sign of skew or major outliers)
Equal variance is OK because sample standard deviations are similar.
2. [15 marks]
(a)
i.
                Response status
Genotype    Responder   Non-responder   Total
AA              3            11           14
Aa             38            97          135
aa             88            62          150
Total         129           170
ii.
(c)
$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 29.6$
which comes from a chi-square distribution with $(2 - 1) \times (2 - 1) = 1$
degree of freedom.
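The value 29.6 can be checked numerically. The following is a minimal sketch, assuming the rearranged 2 x 2 table from part (c) of the question (A* versus aa, Responder versus Non-responder); the function name `chi_square_statistic` is a hypothetical helper, not something taken from the course material.

```python
# Observed counts: rows are A* (AA or Aa) and aa,
# columns are Responder and Non-responder.
observed = [[41, 108], [88, 62]]


def chi_square_statistic(table):
    """Return the Pearson X^2 statistic and its degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence: row total x column total / n.
            expected = row_totals[i] * col_totals[j] / n
            x2 += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df


x2, df = chi_square_statistic(observed)
print(round(x2, 1), df)
```

The P-value would then be read from chi-square tables with `df` degrees of freedom, as in the written solution.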
2
1
3. [15 marks]
(a) Yes, the P-value for a test of no relationship is 0.0368, so there is some
evidence of a relationship.
(b) $r^2 = 0.1571$ so the relationship is weak.
(c) A 95% CI for $\beta_1$ is:
$b_1 \pm t^* \times SE_{b_1} = 0.171 \pm 2.056 \times 0.07782$
$= 0.171 \pm 0.160$
$= (0.011, 0.331)$
(d)
i.
residual = y
y ' 3.19
1.06
ii. Australia has an unusually small amount of government debt, given its
population size.
i. $P(X = 0) = 1 - 0.39 - 0.35 - 0.16 = 0.10$.
ii. $\mu_X = \sum_{\text{all } x} x\,P(X = x)$
(b)
i. Normal.
ii. The Central Limit Theorem.
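iii.
$\sigma_{\bar{X}} = \sigma_{\sum_i X_i / n} = \sqrt{\left(\frac{1}{n}\right)^2 \sigma^2_{\sum_i X_i}}$
$= \sqrt{\left(\frac{1}{n}\right)^2 \sum_i \sigma^2}$ if the $X_i$ are independent
$= \sqrt{\frac{1}{n^2}\, n \sigma^2} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$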
i. $\frac{0.71}{\sqrt{200}} \approx 0.05$
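As a check on this arithmetic, and as an illustration of how the standard error feeds into an interval, here is a minimal sketch. It assumes the normal critical value 1.96 in place of the exact t(199) value from tables, so the interval it produces is an approximation rather than the official worked answer.

```python
import math

n = 200             # households sampled in 2011
sample_mean = 1.49  # cars per household
sample_sd = 0.71

# Estimated standard error of the mean: s / sqrt(n).
standard_error = sample_sd / math.sqrt(n)

# Assumption: z* = 1.96 is used as an approximation to the t(199) critical value.
z_star = 1.96
lower = sample_mean - z_star * standard_error
upper = sample_mean + z_star * standard_error

print(round(standard_error, 4), round(lower, 2), round(upper, 2))
```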
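(b) A 95% CI for $\beta_1$ is:
$b_1 \pm t^* \times SE_{b_1} = -0.0288 \pm 2.262 \times 0.002775 = (-0.0351, -0.0225)$,
where $t^*$ is the value from the $t(n - 2)$ distribution such that $P(-t^* < T < t^*) = 0.95$.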
(c) We are 95% confident that the expected reduction in world record is between
0.022 and 0.035 of a second per year.
(d) There are four assumptions:
Linearity (reasonable since no pattern on residual vs fits plot)
Normality (OK? Questionable? from normal quantile plot)
Independence (can't be checked from data)
(f) We are extrapolating, that is, predicting well outside of the range of the data, and
the relationship may not be linear outside of the data range.
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{0.006}{0.680/\sqrt{17}} = 0.036$,
which comes from a $t(n - 1) = t(16)$ distribution.
3. [15 marks]
(a) Table (II)
(b)
The data are paired and this table was constructed taking this into account.
This table allows us to control for variation across events in how well
Australians perform
(c)
$\sqrt{0.5 \times 0.5/15}$
2.32) = 0.0102
v. There is some evidence against H0, that is, there is some evidence that
swimmers under-performed at London (in the sense that they won fewer
medals than at Beijing).
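The working for part (c) is only partly legible above, so the following is a sketch of one calculation that is consistent with the surviving fragments (a standard error of $\sqrt{0.5 \times 0.5/15}$ and a P-value near 0.01). It assumes the test is a one-sample test of a proportion on the 15 discordant events in Table (II); the helper `normal_cdf` is a hypothetical name, and the exact normal probability differs slightly from the table-based value.

```python
import math

# Discordant events in Table (II): a medal at only one of the two Games.
beijing_only = 12  # medal at Beijing, no medal at London
london_only = 3    # medal at London, no medal at Beijing
n_discordant = beijing_only + london_only

# Under H0 a discordant event is equally likely to favour either Games (p = 0.5).
p_hat = london_only / n_discordant
standard_error = math.sqrt(0.5 * 0.5 / n_discordant)
z = (p_hat - 0.5) / standard_error


def normal_cdf(value):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(value / math.sqrt(2)))


# One-sided P-value for the alternative that fewer medals were won at London.
p_value = normal_cdf(z)
print(round(z, 2), round(p_value, 3))
```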
(d)
[Figure: probability histogram of the number of medals]
(b)
$\mu_X = x_1 p_1 + x_2 p_2 + \ldots + x_k p_k = 0 \times 0.23 + 1 \times 0.45 + 2 \times 0.27 + 3 \times 0.05 = 1.14$
(c) In the long run, we would expect that 1.14 out of three London swimmers
would be medallists.
(d)
(e)
i. $P(X \ge 1) = 1 - 0.23 = 0.77$
ii. $P(X = 3 \mid X \ge 1) = \frac{P(X = 3, X \ge 1)}{P(X \ge 1)} = \frac{P(X = 3)}{P(X \ge 1)} = \frac{0.05}{0.77} = \frac{5}{77} \approx 0.065$
(f)
$P(X = 3) = \frac{\binom{18}{3}\binom{29}{0}}{\binom{47}{3}} = \frac{18 \times 17 \times 16}{47 \times 46 \times 45} \approx 0.05$
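The same counting argument gives the whole distribution of X. This is a minimal sketch, assuming three swimmers are drawn without replacement from the 47 (18 medallists, 29 non-medallists); `prob_medallists` is a hypothetical helper name.

```python
from math import comb

total_swimmers = 47
medallists = 18
non_medallists = total_swimmers - medallists  # 29
sample_size = 3


def prob_medallists(k):
    """P(X = k): exactly k medallists among the three swimmers met."""
    ways = comb(medallists, k) * comb(non_medallists, sample_size - k)
    return ways / comb(total_swimmers, sample_size)


distribution = [round(prob_medallists(k), 2) for k in range(sample_size + 1)]
print(distribution)  # [0.23, 0.45, 0.27, 0.05]
```

The rounded values agree with the probability table given in the question.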
(g) Independent events satisfy P (B|A) = P (B), that is, the probability does not
change on conditioning. But the distribution of medals did change when we
conditioned on male swimmers, so gender is not independent of whether or not
they won a medal.
i. Year is the explanatory variable and birth rate is the response variable
ii. Both are quantitative variables.
(c)
[Figure: residuals plotted against Year]
ii.
i. At x = 1978,
$\hat{y} = 231.1398 - 0.1084x = 231.1398 - 0.1084 \times 1978 = 16.7246$
$y - \hat{y} = 15 - 16.7246 = -1.7246$
i. Prediction at x = 2025,
$\hat{y} = 231.1398 - 0.1084x$
$= 231.1398 - 0.1084 \times 2025$
$= 11.6298$
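These two calculations use the fitted line from the R output, with intercept 231.1398 and slope -0.1084. A minimal sketch, assuming an observed value of 15 at 1978 as used in the solution; `fitted_rate` is a hypothetical helper name.

```python
intercept = 231.1398
slope = -0.1084


def fitted_rate(year):
    """Fitted value from the least-squares line Rate = intercept + slope * Year."""
    return intercept + slope * year


# Residual = observed - fitted, taking the observed value at 1978 to be 15.
residual_1978 = 15 - fitted_rate(1978)

# Prediction for 2025, which lies outside the observed range of years (1965 to 2009).
prediction_2025 = fitted_rate(2025)

print(round(residual_1978, 4), round(prediction_2025, 4))
```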
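$= P(Z \ge -3)$ (since $\frac{2}{\sqrt{36}} = \frac{1}{3}$)
$= P(Z \le 3)$
$= 0.9987$
ii. The total weight can be written $T = \sum_{i=1}^{36} X_i \sim N(\mu_T = 501 \times 36 = 18036,\ \sigma_T = 6 \times 2 = 12)$. The probability,
$P(T \ge 18000) = P\left(\frac{T - \mu_T}{\sigma_T} \ge \frac{18000 - 18036}{12}\right)$
$= P(Z \ge -3)$
$= P(Z \le 3)$
$= 0.9987$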
iii. Assumptions are that data are independent with identical mean and variance. This then means the CLT can be used.
(b)
ii. pb N (p,
20
= 0.2 = 20%
100
p
p(1
Distribution of X 2 0
P-Value = P (
2
1.
2
1
2
1
Since the p-value is small, we reject the null hypothesis in favour of the
alternative. i.e. we have evidence against the hypothesis that the faults
are independent.
3. [15 marks]
(a) The elderly have the larger variability (sy > sx ). Two elderly patients have
very high SBP (190 and 195) compared to everyone else which would make the
variability of the elderly patients larger.
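(b)
$t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}$
with pooled sd
$s_p = \sqrt{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}} = \sqrt{\frac{(13 - 1) \times 19.57^2 + (15 - 1) \times 30.34^2}{13 + 15 - 2}} = 25.93$
$t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}} = \frac{126.15 - 133.6}{25.93\sqrt{\frac{1}{13} + \frac{1}{15}}} = -0.76$
When H0 is true, $t$ comes from a $t_{n_x + n_y - 2} = t_{26}$ distribution.
iv. P-Value $= P(t_{26} < -0.76) = P(t_{26} > 0.76)$. From the table this means
that 0.2 < P-Value < 0.25.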
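The pooled standard deviation and test statistic can be reproduced from the summary statistics alone. This is a minimal sketch; the variable names are illustrative, and x is taken to be the young adults and y the elderly, as in the solution.

```python
import math

n_x, mean_x, sd_x = 13, 126.15, 19.57  # young adults
n_y, mean_y, sd_y = 15, 133.6, 30.34   # elderly

# Pooled standard deviation, which assumes equal population variances.
df = n_x + n_y - 2
pooled_sd = math.sqrt(((n_x - 1) * sd_x**2 + (n_y - 1) * sd_y**2) / df)

# Two-sample t statistic for H0: the two population means are equal.
t_stat = (mean_x - mean_y) / (pooled_sd * math.sqrt(1 / n_x + 1 / n_y))

print(df, round(pooled_sd, 2), round(t_stat, 2))  # 26 25.93 -0.76
```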
v. There is no evidence to suggest that the SBP of the young adults is lower
than the SBP of the elderly, on average.
(c)
The sample means are approximately normal (or all observations are normal).
All the observations are independent.
Normality: can check the QQ plots in (I) and (II). Seems approximately satisfied overall. The QQ plot for X looks OK. The QQ plot for Y has slight right skewness (last two values too large).
Equal variance: from the summary, y has a larger sd, $s_y = 30.34$ compared to $s_x = 19.57$. However, this falls within the rule of thumb that $s_y/s_x < 2$.
Independence: not guaranteed or verifiable from the question, although the patients are different people.
$P(X = 0) = 1 - P(X \neq 0) = 1 - 0.55 - 0.12 - 0.03 = 0.3$
ii.
$\mu_X = \sum_{x=0}^{3} x\,P(X = x) = 0.88$
iii.
[Figure: Probability Histogram of X, showing probability against number of times people have been married]
(b)
$\mu_{X|X>0} = \sum_{x=0}^{3} x\,P(X = x \mid X > 0) = 1 \times 0.7854 + 2 \times 0.1695 + 3 \times 0.0451 = 1.2597$
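The conditional probabilities 0.7854, 0.1695 and 0.0451 come from dividing each P(X = k) by P(X > 0). A minimal sketch of that renormalisation, using the unrounded probabilities from the question's table (so the last digits can differ slightly from the hand calculation); the variable names are illustrative.

```python
# P(X = k) for k = 1, 2, 3 as given in the question's table.
pmf_positive = {1: 0.549, 2: 0.1185, 3: 0.0315}

p_positive = sum(pmf_positive.values())  # P(X > 0)
p_zero = 1 - p_positive                  # the missing table entry, P(X = 0)

# Conditional distribution: P(X = k | X > 0) = P(X = k) / P(X > 0).
conditional = {k: p / p_positive for k, p in pmf_positive.items()}

mean = sum(k * p for k, p in pmf_positive.items())
conditional_mean = sum(k * p for k, p in conditional.items())

print(round(p_zero, 3), round(mean, 2), round(conditional_mean, 2))
```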
(c)
Number of Marriages       0       1       2      3     Total
Male                  38213   60561   13432   3589    115795
Female                33283   70881   14915   3944    123023
Number of Marriages       0       1       2       3
Male                   0.33   0.523   0.116   0.031
Female                 0.27   0.575   0.121   0.032
statistic is $X^2$ given by
$X^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
2
3
2
3
distribution (r = 2,
The P-value is small so we reject H0 . That is, we have evidence against the
null hypothesis. There is evidence to suggest that there is an association
between Gender and Number of Marriages.
2. [15 marks]
(a)
i. Note that each time they toss the three coins the probability of 3 heads is
$\frac{1}{8} = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2}$ because the outcomes of the three coin flips are independent.
So the probability they eat at the Whitehouse is $\frac{1}{8}$.
(Similarly the probability of three tails is $\frac{1}{8}$ so the probability they eat at
the Roundhouse is $\frac{1}{8}$. And the probability of 2 heads and a tail, or 2 tails
and a head is $\frac{6}{8} = \frac{3}{4}$, so the probability they eat at the Quad is $\frac{3}{4}$.)
ii. There are 5 days, and each time the probability of eating at the Whitehouse
is $p = \frac{1}{8}$. The outcomes of the $n = 5$ days are independent. So the
distribution of X is binomial, with $n = 5$ and $p = \frac{1}{8}$, that is
$X \sim B(5, \frac{1}{8})$.
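Furthermore,
$\mu_X = np = 5 \times \frac{1}{8} = \frac{5}{8}$
and
$\sigma_X^2 = n\,p\,(1 - p) = 5 \times \frac{1}{8} \times \frac{7}{8} = \frac{35}{64}$.
iii. $P(X = 1) = \binom{5}{1} \left(\frac{1}{8}\right)^1 \left(1 - \frac{1}{8}\right)^4$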
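The binomial quantities in parts ii and iii can be evaluated exactly. A minimal sketch using exact fractions, assuming $X \sim B(5, \frac{1}{8})$ as stated in the solution; `binomial_pmf` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb

n = 5               # lunches in a week
p = Fraction(1, 8)  # probability of 3 heads on any one day


def binomial_pmf(k):
    """P(X = k) for X ~ B(n, p), computed exactly."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


mean = n * p                # 5/8
variance = n * p * (1 - p)  # 35/64
p_once = binomial_pmf(1)    # exactly one Whitehouse lunch in the week

print(mean, variance, p_once, round(float(p_once), 4))
```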
iv. There were a great many different ways of answering this, and some really
wonderful correct answers given by students to this part.
One of the many possibilities is:
The probability we are looking for is the conditional probability of 3 heads,
given 3 tails has not occurred:
$P(3H \mid 3T \text{ has not occurred}) = \frac{P(3H)}{1 - P(3T)} = \frac{1/8}{7/8} = \frac{1}{7}$
(b)
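i. The total weight (in kg) has a normal distribution with mean $13 \times 75$ and
variance $13 \times 9^2$, that is the total weight,
$T \sim N(13 \times 75, 13 \times 9^2) = N(975, 1053)$, so the standard deviation of $T$ is $\sqrt{1053} \approx 32.45$.
$P(T > 1000) = P\left(\frac{T - 975}{32.45} > \frac{1000 - 975}{32.45}\right) = P\left(Z > \frac{1000 - 975}{32.45}\right)$, where $Z \sim N(0, 1)$.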
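The standardisation above can be evaluated without tables. A minimal sketch, assuming the 13 weights are independent and normal with mean 75 kg and standard deviation 9 kg; `normal_cdf` is a hypothetical helper, and the result from tables may differ slightly because z is rounded there.

```python
import math

n_people = 13
mean_weight = 75.0
sd_weight = 9.0

# Sum of independent normals: means add, variances add.
mean_total = n_people * mean_weight         # 975
sd_total = math.sqrt(n_people) * sd_weight  # sqrt(13 * 9^2)


def normal_cdf(value):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(value / math.sqrt(2)))


z = (1000 - mean_total) / sd_total
p_exceeds_limit = 1 - normal_cdf(z)

print(round(sd_total, 2), round(z, 2), round(p_exceeds_limit, 2))
```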
3. [15 marks]
(a)
$T = \frac{\bar{X}}{s/\sqrt{n}}$.
When H0 is true, $T \sim t_{n-1} = t_{12-1} = t_{11}$.
$\frac{57404}{52278/\sqrt{12}} = 3.8038$ (to 4
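$\left(\bar{x} - t^* \frac{s}{\sqrt{n}},\ \bar{x} + t^* \frac{s}{\sqrt{n}}\right) = \left(57404 - 2.201 \times \frac{52278}{\sqrt{12}},\ 57404 + 2.201 \times \frac{52278}{\sqrt{12}}\right) = (24188, 90620)$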
ii. We are 95% confident that the mean difference in revenue is between
$24188 and $90620; that is, we are 95% confident that on average the revenue at the airport depot is between $24188 and $90620 more than that
at the city depot.
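Both the test statistic and the interval follow from the mean and standard deviation of the 12 monthly differences. A minimal sketch, taking the critical value 2.201 for t(11) from the solution rather than computing it; the variable names are illustrative.

```python
import math

n = 12             # months
mean_diff = 57404  # mean of the monthly differences
sd_diff = 52278    # standard deviation of the differences
t_star = 2.201     # t(11) critical value for 95% confidence, as quoted in the solution

standard_error = sd_diff / math.sqrt(n)

# Paired t statistic for H0: mean difference = 0.
t_stat = mean_diff / standard_error

# 95% confidence interval for the mean difference.
margin = t_star * standard_error
interval = (mean_diff - margin, mean_diff + margin)

print(round(t_stat, 4), round(interval[0]), round(interval[1]))
```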
4. [15 marks]
(a)
i. The correct marginal counts for Goal and Location are below.
                    Goal
Location    Grades   Popular   Sports   Total
Rural         57       50        42      149
Suburban      87       42        22      151
Urban        103       49        26      178
Total        247      141        90      478
$\frac{151 \times 90}{478} = 28.4309623$.
iv. $X^2 \sim \chi^2_4$ when H0 is true.
v. P-Value = P (
2
4
vi. A small P-value is evidence for the claim that goal and location are related.
(b) Counts need to be independent and all expected counts > 10.
(c)
ii. Checking bounds, $n_1\hat{p}_1 = 42 > 10$, $n_1(1 - \hat{p}_1) = 107 > 10$ and $n_2\hat{p}_2 = 26 > 10$, $n_2(1 - \hat{p}_2) = 152 > 10$. Independence would need to be checked by
looking at the sampling protocols.
iii. $\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2, \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\right)$. From output $\hat{p}_1 - \hat{p}_2 = 0.2818792 - 0.1460674 = 0.1358118$.
Estimate $SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = 0.0453794$ with $z^* = 1.96 \Rightarrow m = 0.0889435$,
95% CI for p1
iv. The interval is entirely above zero suggesting evidence that rural children
are more likely to have a goal of sports.
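The numbers in part iii can be reproduced from the table counts. A minimal sketch, assuming group 1 is rural children (42 of 149 chose sports) and group 2 is urban children (26 of 178), which matches the proportions quoted in the solution; the variable names are illustrative.

```python
import math

sports_rural, n_rural = 42, 149  # group 1
sports_urban, n_urban = 26, 178  # group 2

p1_hat = sports_rural / n_rural
p2_hat = sports_urban / n_urban
difference = p1_hat - p2_hat

# Estimated standard error of the difference in sample proportions.
standard_error = math.sqrt(
    p1_hat * (1 - p1_hat) / n_rural + p2_hat * (1 - p2_hat) / n_urban
)

z_star = 1.96  # normal critical value for 95% confidence
margin = z_star * standard_error
interval = (difference - margin, difference + margin)

print(round(difference, 4), round(standard_error, 4), round(margin, 4))
print(round(interval[0], 3), round(interval[1], 3))
```

An interval lying entirely above zero is what supports the conclusion in part iv.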