Instructions to Candidates:
Attempt ALL questions.
Each question is of equal mark value.
Start your solution to each question on a new page.
To ensure full marks show all the steps in working out your
solution. Marks may be deducted for failure to show appropriate
calculations or formulae.
Unless otherwise stated, use a significance level of 5%.
Selected statistical tables are attached to the back of the
examination paper.
Page 1 of 18
N    Mean    SE Mean  StDev   Minimum  Q1     Median  Q3      Maximum
200  0.6150  *drip1*  0.4878  0.000    0.000  1.0000  1.0000  1.0000
200  37.52   *drip2*  15.87   1.00     28.25  38.00   47.00   81.00
200  5.155   *drip3*  1.894   1.000    4.000  5.000   7.000   10.000
200  3.000   *drip4*  1.607   0.000    2.000  3.000   4.000   9.000
ix. The number which should be present at drip2 is (to 3 decimal places)
a. 2.653
b. 0.188
c. 1.122
d. 0.079
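The missing SE Mean entries follow from the StDev and N columns, since the standard error of a sample mean is the standard deviation divided by the square root of the sample size. A minimal sketch, assuming that *drip2* sits in the row with StDev 15.87 and N 200 (the helper name `se_mean` is illustrative only):

```python
import math

def se_mean(stdev: float, n: int) -> float:
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return stdev / math.sqrt(n)

# Row with StDev = 15.87 and N = 200 (assumed to be the drip2 row)
drip2 = se_mean(15.87, 200)
print(round(drip2, 3))  # 1.122
```

Under that assumption the value matches option c.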
x. Based on the above descriptive statistics, which box in the graph below best
represents the variable age?
[Figure: four candidate boxplots labelled Age 1, Age 2, Age 3 and Age 4; vertical axis Data, 0 to 100]
a. Age 1
b. Age 2
c. Age 3
d. Age 4
The doctor is interested in the differences between her male and female patients.
A boxplot is given below of the ages of the patients, split by gender. Use it to
answer question (xi).
[Figure: Boxplot of Age vs Gender; vertical axis Age (0 to 90), horizontal axis Gender (0, 1)]
Variable  Gender  N    Mean   SE Mean  StDev
Age       0       77   36.09  1.98     17.37
          1       123  38.41  1.34     14.85

xii.
xiii.
xiv.
xv.
The doctor is interested in the proportion of patients (regardless of gender) who have
visited 4 or fewer times over the year. She finds that in her sample of 200, 117
patients have visited 4 or fewer times. Based on this information, a 98% confidence
interval for the population proportion visiting 4 or fewer times in a year is calculated
to be
$$p \pm c \sqrt{\frac{p(1 - p)}{n}}.$$
xviii. In the confidence interval formula above, the value p should be replaced by
which of the following numbers?
a. 117/200 - Answer a is correct
b. 83/200
c. 83/117
d. 0.01
xix. In the confidence interval formula above, the value c should be replaced by
which of the following numbers?
a. 0.01
b. 0.02
c. 1.96
d. 2.33 - Answer d is correct
xx. In the confidence interval formula above, the value n should be replaced by
which of the following numbers?
a. 117
b. 83
c. 200 - Answer c is correct
d. Not enough information is available to answer this question.
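Putting the three answers together, the interval can be evaluated numerically. The sketch below assumes the usual large-sample (normal approximation) interval for a proportion and uses the standard library's normal quantile in place of the tabulated 2.33; the function name `proportion_ci` is illustrative only.

```python
import math
from statistics import NormalDist

def proportion_ci(successes: int, n: int, confidence: float):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    # Two-sided critical value, e.g. about 2.33 for 98% confidence
    c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = c * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width, c

low, high, c = proportion_ci(117, 200, 0.98)
print(round(c, 2), round(low, 3), round(high, 3))
```

With 117 of 200 patients, the interval comes out at roughly 0.50 to 0.67.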
[Figure: plot with X on the horizontal axis (ticks from 0 to 12) and a vertical axis running from 0.0 to 0.5]
Variable  N   Mean    SE Mean  StDev   Minimum  Median  Maximum
X         30  3.798   0.629    3.443   0.510    2.045   11.040
Y         30  0.1960  0.0416   0.2279  0.000    0.0800  0.9100
Covariances: X, Y
   X          Y
X  11.855989
Y  0.572271   0.051942
A regression is performed in Minitab, but a minor chemical spill has obscured some
of the output.
Predictor  Coef      SE Coef   T         P
Constant   0.01268   0.04355   *spill2*  0.773
X          0.048269  0.008559  5.64      0.000
Analysis of Variance
Source          DF  SS       MS       F      P
Regression      1   0.80106  0.80106  31.80  0.000
Residual Error  28  0.70526  0.02519
Total           29  1.50632
Unusual Observations
Obs  X    Y       Fit     SE Fit  Residual  St Resid
6    8.2  0.9100  0.4085  0.0475  0.5015    3.31R
[Figure: residual plots. Normal probability plot (Percent vs Standardized Residual); Standardized Residual vs Fitted Value; histogram (Frequency vs Standardized Residual); Standardized Residual vs Observation Order]
X     Fit     SE Fit  95% CI            95% PI
2.00  0.1092  0.0328  (0.0420, 0.1764)  (-0.2228, 0.4412)
4.00  0.2058  0.0290  (0.1463, 0.2652)  (-0.1247, 0.5362)
6.00  0.3023  0.0346  (0.2315, 0.3731)  (-0.0304, 0.6350)
8.00  0.3988  0.0462  (0.3042, 0.4934)  ( 0.0602, 0.7374)
$$r = \frac{\mathrm{cov}(x, y)}{s_x s_y} = \frac{0.572271}{3.443 \times 0.2279} = 0.7293 \text{ to 4 decimal places.}$$
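The correlation calculation can be checked directly from the covariance and standard deviations in the Minitab output. This is only an arithmetic sketch; the variable names are illustrative.

```python
# Values read from the Covariances and Descriptive Statistics output
cov_xy = 0.572271
s_x, s_y = 3.443, 0.2279

# Sample correlation: covariance scaled by both standard deviations
r = cov_xy / (s_x * s_y)
print(round(r, 4))  # 0.7293
```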
c. (3 marks) Calculate the coefficient of determination for the regression
performed. Interpret this value.
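One way to obtain the coefficient of determination from the Analysis of Variance table is as the ratio of the regression sum of squares to the total sum of squares. The sketch below assumes that definition and simply redoes the arithmetic with the SS values printed in the output.

```python
# Sums of squares from the Analysis of Variance table
ss_regression = 0.80106
ss_total = 1.50632

# Coefficient of determination: share of total variation explained by the fit
r_squared = ss_regression / ss_total
print(round(r_squared, 4))
```

On these figures roughly 53% of the variation in Y is explained by its linear relationship with X, which agrees with squaring the correlation of 0.7293.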
Y = 0.01268 +0.048269X
e. (2 marks) What number should be shown at spill2 (Constant row, T
column of Table 1)?
Spill 2 = test statistic for testing the null hypothesis $\beta_0 = 0$:
$$t = \frac{\hat{\beta}_0 - 0}{s_{\hat{\beta}_0}} = \frac{0.01268 - 0}{0.04355} = 0.2912 \text{ to 4dp}$$
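The same ratio of coefficient to standard error produces every entry in the T column, so the formula can be sanity-checked against the X row, where the value 5.64 is visible. A small sketch (the function name `t_statistic` is illustrative only):

```python
def t_statistic(coef: float, se_coef: float, null_value: float = 0.0) -> float:
    """t ratio for testing a regression coefficient against a null value."""
    return (coef - null_value) / se_coef

spill2 = t_statistic(0.01268, 0.04355)     # Constant row
slope_t = t_statistic(0.048269, 0.008559)  # X row, printed as 5.64
print(round(spill2, 4), round(slope_t, 2))  # 0.2912 5.64
```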
f. (3 marks) Write down the null and alternative hypotheses being tested
by the P-value of 0.000 (X row, P column of Table 1). What
conclusion would you draw from this test in terms of the original
variables?
$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$
P-value = 0.000
Conclusion: Reject the null hypothesis. That is, there is a significant linear
relationship between X and Y. The slope of the regression is significantly different
from zero.
g. (3 marks) Comment on the Residual plots and the Unusual
Observations flagged by Minitab. Do you see any cause for concern
about the validity of the model?
Unusual observations: One large standardised residual, from 30 observations is
approximately 3%, so is not a cause for concern, that is, it does not indicate
significant non-normality in the residuals.
Residual plots: The normal probability plot and histogram suggest that the distribution
of residuals has a slight positive skew, i.e. is not normal. This is a violation of the
assumption of normality and so is cause for concern. The residuals in order show no
pattern, and so there is no violation of the assumption of independence. The residuals
vs fitted values show a possible increasing variance with increasing fitted values; this
violates the assumption of constant variance.
So, there is cause for concern about the validity of the model with the normality and
constant variance of the residuals in some doubt.
h. (2 marks) There is interest in finding a 95% interval for the average
spend on R&D overseas for a chemical firm with Australian R&D
spending of 4%. Give the interval required from the output above.
X=4%. Want to find a 95% Confidence interval for E(Y).
From the output, the interval required is (0.1463, 0.2652).
i. (2 marks) Which value of the predictor will give the narrowest possible
prediction interval? Briefly explain your answer.
The value of X that will give the narrowest prediction interval is X =3.798. At this
point, the maximum possible information is available as equal information is available
for X larger than and X smaller than this value. In terms of the calculation formula,
the term $(x_g - \bar{x})^2$ takes its minimum value (0) at this point, giving the narrowest
possible interval.
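The effect of the $(x_g - \bar{x})^2$ term can be seen numerically. The sketch below assumes the standard simple-regression formulas for the standard error of the fit and the prediction interval, takes the residual standard deviation as the square root of MS Residual Error, and uses an assumed table value of 2.048 for t with 28 degrees of freedom; the function names are illustrative only.

```python
import math

n = 30
x_bar = 3.798
s = math.sqrt(0.02519)       # residual standard deviation from MS Residual Error
sxx = (n - 1) * 11.855989    # sum of squares of X about its mean, from var(X)
t_crit = 2.048               # assumed table value t(28, 0.025)

def se_fit(x_g: float) -> float:
    """Standard error of the fitted mean response at x_g."""
    return s * math.sqrt(1 / n + (x_g - x_bar) ** 2 / sxx)

def pi_half_width(x_g: float) -> float:
    """Half-width of the 95% prediction interval at x_g."""
    return t_crit * math.sqrt(s ** 2 + se_fit(x_g) ** 2)

for x_g in (2.0, x_bar, 4.0, 6.0, 8.0):
    print(x_g, round(se_fit(x_g), 4), round(pi_half_width(x_g), 4))
```

The SE Fit values reproduce those in the output to rounding, and the half-width is smallest at the sample mean of X.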
PZ >
= 0.25
128.24
Pooled variance:
$$s^2_{pooled} = \frac{(n_D - 1) s_D^2 + (n_N - 1) s_N^2}{n_D + n_N - 2} = \frac{46 \times 128.24^2 + 52 \times 149^2}{47 + 53 - 2} = 19499.438 \text{ to 3dp}$$
Test Statistic:
$$TS = \frac{(\bar{Y}_N - \bar{Y}_D) - 0}{\sqrt{s^2_{pooled} \left( \frac{1}{n_D} + \frac{1}{n_N} \right)}} = \frac{(44452 - 56493) - 0}{\sqrt{19499.438 \left( \frac{1}{47} + \frac{1}{53} \right)}} = -430.37 \text{ to 2dp}$$
Decision Rule: Compare to a t-distribution with 98 degrees of freedom. For alpha =
5%, we will reject the null hypothesis if $|TS| > t_{98, 0.025} \approx t_{100, 0.025} = 1.984$.
Conclusion: We reject the null hypothesis. There is very strong evidence against the
null hypothesis. The average wage difference between staff with and without degrees
is significantly different from zero.
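The pooled-variance test statistic can be reproduced in a few lines. This is a sketch under the assumption that subscript D denotes staff with degrees (n = 47, s = 128.24, mean 56493) and N denotes staff without (n = 53, s = 149, mean 44452), as the worked numbers suggest; the function name `pooled_t` is illustrative only.

```python
import math

def pooled_t(mean_1, s_1, n_1, mean_2, s_2, n_2):
    """Two-sample t statistic with a pooled variance estimate."""
    df = n_1 + n_2 - 2
    s2_pooled = ((n_1 - 1) * s_1 ** 2 + (n_2 - 1) * s_2 ** 2) / df
    se_diff = math.sqrt(s2_pooled * (1 / n_1 + 1 / n_2))
    return s2_pooled, (mean_1 - mean_2) / se_diff, df

# Group N first, group D second, matching the order in the test statistic
s2, ts, df = pooled_t(44452, 149, 53, 56493, 128.24, 47)
print(round(s2, 3), round(ts, 2), df)
```

The sign of the statistic depends only on the order of subtraction; the decision rule uses its absolute value.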
iii. (4 marks) It is claimed that 50% of the employees of this firm have
university degrees. Do the data support this claim? Perform a formal
hypothesis test at 10% level to answer this question.
$H_0: p = 0.5$
$H_A: p \neq 0.5$
Test Statistic:
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} = \frac{0.47 - 0.5}{\sqrt{\frac{0.5 (1 - 0.5)}{100}}} = -0.6$$
Decision Rule: Compare to a Z distribution. For alpha = 10%, reject the null
hypothesis if |TS|>1.645.
Conclusion: Do not reject the null hypothesis. There is insufficient evidence to dispute
the claim that 50% of staff have degrees.
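The test statistic and a two-sided P-value can be checked with the standard library. A minimal sketch, assuming the usual one-sample z test for a proportion with the null value in the standard error (the function name `one_proportion_z` is illustrative only):

```python
import math
from statistics import NormalDist

def one_proportion_z(p_hat: float, p0: float, n: int):
    """z statistic and two-sided P-value for H0: p = p0."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = one_proportion_z(0.47, 0.5, 100)
print(round(z, 2), round(p_value, 4))
```

Since |z| is well below 1.645 and the P-value is far above 0.10, the calculation agrees with the conclusion above.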
iv. (2 marks) Based on your answer to part (iii), would you expect to find the
value 0.5 within a 90% confidence interval for the population proportion
of employees with degrees? Explain why or why not.
Yes, 50% would be within a 90% confidence interval for p as the null hypothesis was
not rejected against a 2-sided alternative with 10% significance.
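The claim can be confirmed by computing the interval itself. This sketch assumes the usual large-sample interval, which uses the sample proportion in the standard error (the test used the null value 0.5 instead, so the agreement between test and interval is approximate rather than exact in general).

```python
import math
from statistics import NormalDist

p_hat, n = 0.47, 100
z_crit = NormalDist().inv_cdf(0.95)  # about 1.645 for a 90% interval

half_width = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - half_width, p_hat + half_width
print(round(low, 3), round(high, 3), low < 0.5 < high)
```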
v. (4 marks) A university science department wishes to conduct a study into
the average income of all its graduates employed in chemical companies.
Assuming that the population standard deviation of annual incomes is
known to be $128.24, how many former students should they include in
their sample to obtain a 99% confidence interval with maximum width of
$50?
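$$\text{Width} = 2 \times Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$
$$2 \times Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} < 50 \quad \Rightarrow \quad Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} < 25$$
For 99% confidence, $Z_{\alpha/2} = Z_{0.005} = 2.575$, so
$$\frac{\sigma}{\sqrt{n}} < \frac{25}{2.575}$$
Given $\sigma = 128.24$,
$$\sqrt{n} > \frac{2.575 \times 128.24}{25} = 13.209, \quad \text{so } n > 174.47.$$
At least 175 former students should be included in the sample.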
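The sample-size inequality can be solved mechanically by rearranging for n and rounding up. A small sketch using the table value 2.575 from the working above (the function name `sample_size_for_width` is illustrative only):

```python
import math

def sample_size_for_width(sigma: float, z: float, max_width: float) -> int:
    """Smallest n for which 2 * z * sigma / sqrt(n) does not exceed max_width."""
    return math.ceil((2 * z * sigma / max_width) ** 2)

n_required = sample_size_for_width(sigma=128.24, z=2.575, max_width=50)
print(n_required)  # 175
```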
Question 4
With newspaper reports of skyrocketing crime rates, a sample of 150 successful
prosecutions (i.e. crimes for which a conviction has been obtained) in the past 24
months is taken for the purpose of studying the patterns of offender age and type of
crime. The table below gives the data obtained from the sample.
                 Age of offender (in years)
Type of Crime    Under 20    20-40    Over 40
Violent          27          41       14
Nonviolent       12          34       22
a. One criminal file is selected at random from the 150 by a judge for review.
i. (1 mark) Estimate the probability that the file selected relates to a violent
crime.
P(violent) = (27+41+14) / 150 = 82/150
ii. (2 marks) If it is known that the file deals with a violent crime, what is
the probability that it also relates to a criminal under 20 years of age?
P(criminal<20 | violent) = 27/82
b. (2 marks) Are age of offender and type of crime plausibly independent? Explain
your answer using examples (or a counter-example).
If A and B are independent, P(A|B) = P(A).
Here P(Criminal<20|violent) = 27/82 ≠ P(Criminal<20) = 39/150
Therefore, age and nature of crime are not independent.
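The probabilities in parts (a) and (b) can be read off the table programmatically. The sketch below uses exact fractions so the comparison of the conditional and marginal probabilities is not affected by rounding; the dictionary layout is just one convenient way to hold the counts.

```python
from fractions import Fraction

# Sample counts by type of crime and age of offender
counts = {
    "Violent": {"Under 20": 27, "20-40": 41, "Over 40": 14},
    "Nonviolent": {"Under 20": 12, "20-40": 34, "Over 40": 22},
}

total = sum(sum(row.values()) for row in counts.values())
violent_total = sum(counts["Violent"].values())

p_violent = Fraction(violent_total, total)
p_under20 = Fraction(sum(row["Under 20"] for row in counts.values()), total)
p_under20_given_violent = Fraction(counts["Violent"]["Under 20"], violent_total)

print(p_violent, p_under20_given_violent, p_under20)
# Under independence the last two would be equal
print(p_under20_given_violent == p_under20)
```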
c. (3 marks) Estimate the marginal probabilities for age of offender, presenting your
answer in a table form.
Age bracket of offender    Probability
<20                        39/150 = 0.26
20-40                      75/150 = 0.5
>40                        36/150 = 0.24
Another study is performed into non-violent criminals under 20, this time classifying
them by the number of times they have been successfully prosecuted. Upon
examining all records available, the following probability distribution is found to
apply to the group.
Number of Convictions    1       2       3       4
Probability              0.33    0.34    0.22    0.11
d. (4 marks) Find the mean number of convictions among non-violent criminals
under 20 years of age. Is this an observable value, i.e. is it possible that there will
be a non-violent criminal in this age group with that exact number of convictions?
Does this indicate a problem with the data? Explain your answer.
Let X be the number of convictions.
$$E(X) = \sum_{i=1}^{n} x_i \, p(x_i)$$
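Applied to the distribution in the table, the expectation formula is a short weighted sum. A minimal sketch:

```python
# Probability distribution of the number of convictions
pmf = {1: 0.33, 2: 0.34, 3: 0.22, 4: 0.11}

# E(X): each value weighted by its probability
mean = sum(x * p for x, p in pmf.items())
print(round(mean, 2))  # 2.11
```

The result is not a whole number, so no individual offender can have exactly this many convictions; it is a long-run average rather than an observable value.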
So,
$$P(\bar{X} < 3) = P\left( \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < \frac{3 - 2.11}{0.9889 / \sqrt{50}} \right) = P(Z < 6.364) \approx 1$$
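The standard deviation and the z value in this step can be recomputed from the probability distribution. The sketch assumes a sample of n = 50, as the denominator in the working suggests, and a normal approximation for the sample mean.

```python
import math
from statistics import NormalDist

pmf = {1: 0.33, 2: 0.34, 3: 0.22, 4: 0.11}
mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
sd = math.sqrt(variance)

n = 50  # assumed sample size, taken from the working above
z = (3 - mean) / (sd / math.sqrt(n))
prob = NormalDist().cdf(z)
print(round(sd, 4), round(z, 3))
```

A z value this large puts essentially all of the probability below 3.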
Question 5
(a) An unnamed lecturer has a reputation for her lectures running over time, that is,
taking longer than the 50 minutes allocated for her teaching slots. The duration of
her lectures is a random variable best represented by a distribution described by the
following equation: if x is the length of a lecture, in minutes, then
$$f(x) = \begin{cases} 0, & x < 40 \\ \dfrac{x - 40}{150}, & 40 \le x < 55 \\ \dfrac{60 - x}{50}, & 55 \le x < 60 \\ 0, & 60 \le x \end{cases}$$
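A quick numerical check confirms that this piecewise function is a valid density, and gives the chance of a lecture running past the allocated 50 minutes. The sketch uses a simple midpoint rule rather than exact integration; the helper `integrate` is illustrative only.

```python
def f(x: float) -> float:
    """Piecewise density of lecture length in minutes."""
    if 40 <= x < 55:
        return (x - 40) / 150
    if 55 <= x < 60:
        return (60 - x) / 50
    return 0.0

def integrate(g, a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

total_area = integrate(f, 40, 60)  # should be 1 for a density
p_over_50 = integrate(f, 50, 60)   # probability of running over time
print(round(total_area, 6), round(p_over_50, 4))
```

The two branches also meet at x = 55, where both give 0.1, so the density is continuous.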
i. (3 marks) Draw a graph of f(x), clearly marking all axes and points of
interest.
[Figure: graph of f(y); vertical axis from 0 to 0.2, horizontal axis from 47 to 52]
ii. (1 mark) Find the expected length of a randomly selected lecture given by
Professor Good.
$$E(Y) = \frac{a + b}{2} = \frac{47 + 52}{2} = 49.5$$
iii. (1 mark) Find the standard deviation of length of a lecture given by
Professor Good.
$$\mathrm{var}(Y) = \frac{(b - a)^2}{12} = \frac{(52 - 47)^2}{12} = 2.083 \text{ to 3dp.}$$
$$\text{standard deviation}(Y) = \sqrt{2.083} = 1.443 \text{ to 3dp.}$$
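The mean and standard deviation follow from the standard formulas for a continuous uniform distribution. A short sketch, assuming Y is uniform on 47 to 52 minutes as the working above implies:

```python
import math

a, b = 47, 52  # assumed endpoints of the uniform distribution, in minutes

mean = (a + b) / 2
variance = (b - a) ** 2 / 12
sd = math.sqrt(variance)
print(mean, round(variance, 3), round(sd, 3))  # 49.5 2.083 1.443
```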
iv. (2 marks) Find the probability that a lecture given by Professor Good runs
for longer than 50 minutes.
P(Y>50) = 2/5
v. (1 mark) Find the probability that a lecture given by Professor Good takes
exactly 50 minutes.
P(Y=50) = 0
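Both probabilities come from the uniform cumulative distribution function. The sketch below again assumes Y is uniform on 47 to 52; the function name `uniform_cdf` is illustrative only.

```python
def uniform_cdf(y: float, a: float, b: float) -> float:
    """P(Y <= y) for Y uniform on [a, b]."""
    if y <= a:
        return 0.0
    if y >= b:
        return 1.0
    return (y - a) / (b - a)

# P(Y > 50) is the share of the interval lying above 50
p_over_50 = 1 - uniform_cdf(50, 47, 52)

# A single point has zero width, so P(Y = 50) = F(50) - F(50) = 0
p_exactly_50 = uniform_cdf(50, 47, 52) - uniform_cdf(50, 47, 52)
print(round(p_over_50, 4), p_exactly_50)
```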
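$$\bar{Y} \sim N\left( \mu = E(Y), \ \mathrm{var} = \frac{\mathrm{var}(Y)}{n} \right)$$
$$P(\bar{Y} > 50) = P\left( \frac{\bar{Y} - \mu}{\sqrt{\mathrm{var}(Y) / n}} > \frac{50 - 49.5}{1.443 / \sqrt{50}} \right)$$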
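The standardised value and the resulting tail probability can be evaluated with the standard library. This is a sketch only: it assumes a sample of n = 50 lectures, as the denominator in the working suggests, and relies on the normal approximation for the sample mean.

```python
import math
from statistics import NormalDist

mu, sd = 49.5, 1.443  # mean and standard deviation of a single lecture
n = 50                # assumed number of lectures in the sample

se = sd / math.sqrt(n)  # standard error of the sample mean
z = (50 - mu) / se
p_mean_over_50 = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p_mean_over_50, 4))
```

On these assumptions z is about 2.45, so the average of 50 lectures exceeds 50 minutes with a probability of under 1%.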