
# Statistics

CVEN2002/2702

Week 8

This lecture

## 7. Inferences concerning a mean

7.6 Confidence interval on the mean of a distribution, variance unknown
7.7 Prediction intervals
7.8 Confidence interval on a proportion

7.4 in the textbook

CVEN2002/2702 (Statistics)

Dr Justin Wishart

2 / 36

## Confidence interval on the mean of a distribution, variance unknown

Previously we showed how to build confidence intervals for the mean μ of a distribution, assuming that the population variance σ² was known
⇒ this is probably not very realistic!
Suppose now that the population variance σ² is not known
⇒ we can no longer make practical use of the core result

$$Z = \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \overset{(a)}{\sim} N(0,1)$$

However, from the random sample X₁, X₂, …, Xₙ we have a natural estimator of the unknown σ²: the sample variance (Slide 38, Week 2)

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2,$$

which will provide an estimated sample variance s² = (1/(n−1)) Σᵢ (xᵢ − x̄)² upon observation of a sample x₁, x₂, …, xₙ
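As a quick check on the formula above, the sample variance can be computed by hand and compared with Python's `statistics` module, which also uses the n − 1 divisor. The data values here are made up purely for illustration:

```python
import statistics

# Small illustrative sample (hypothetical values)
x = [2.1, 3.4, 2.8, 4.0, 3.1]

n = len(x)
xbar = statistics.mean(x)
# Sample variance with the n-1 divisor, as in the formula above
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
# statistics.variance also divides by n-1, so both agree
assert abs(s2 - statistics.variance(x)) < 1e-12
print(s2)  # 0.497
```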

## Confidence interval on the mean of a normal distribution, variance unknown

A natural procedure is thus to replace σ with the sample standard deviation S, and to work with the random variable

$$T = \sqrt{n}\,\frac{\bar{X} - \mu}{S}$$

In the case of a normal population, Z was just a standardised version of a normal r.v. X̄, and was therefore N(0,1)-distributed
However, T is now a ratio of two random variables (X̄ and S)
⇒ T is not N(0,1)-distributed!
Indeed, T cannot have exactly the same distribution as Z, as the approximation of the constant σ by a random variable S introduces some extra variability
⇒ the random variable T varies more in value from sample to sample than Z (i.e. Var(T) > Var(Z))

## The Student's t-distribution

The first person who realised that replacing σ with an estimate did affect the distribution of Z was William Gosset (1876–1937), a British chemist and mathematician who, in the early 20th century, worked at the Guinness Brewery in Dublin
As the story goes, another researcher at Guinness had previously published a paper containing trade secrets of the Guinness brewery, so Guinness prohibited its employees from publishing any scientific papers, regardless of their content
⇒ Gosset used the pseudonym "Student" for his publications, to avoid their detection by his employer
He showed that, in a normal population, the exact distribution of T is the so-called t-distribution with n − 1 degrees of freedom: T ∼ t_{n−1}
This distribution is now referred to as Student's t-distribution (when it might otherwise have been Gosset's t-distribution)

## The Student's t-distribution

A random variable, say T, is said to follow the Student's t-distribution with ν degrees of freedom, i.e.

$$T \sim t_\nu$$

for some integer ν. Its probability density function is given by

$$f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}; \qquad S_T = \mathbb{R}$$

Note: the Gamma function is given by

$$\Gamma(y) = \int_0^{+\infty} x^{y-1} e^{-x}\,dx, \qquad \text{for } y > 0$$

It can be shown that Γ(y) = (y − 1) Γ(y − 1), so that, if y is a positive integer n, Γ(n) = (n − 1)!
There is no simple expression for the Student's t cdf
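The density above can be evaluated directly with Python's `math.gamma`; this is an illustrative sketch, not part of the course materials. A handy sanity check: for ν = 1 the density reduces to the Cauchy density 1/(π(1 + t²)):

```python
import math

def t_pdf(t, nu):
    """Density of the Student's t-distribution with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + t**2 / nu) ** (-(nu + 1) / 2)

# For nu = 1 this is the Cauchy density, so f(0) = 1/pi
print(t_pdf(0.0, 1))   # ≈ 0.3183
# As nu grows, f(0) approaches the N(0,1) value 1/sqrt(2*pi) ≈ 0.3989
print(t_pdf(0.0, 50))
```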

[Figure: pdf f(t) = F′(t) (left) and cdf F(t) (right) of a Student's t-distribution]

## Student's distributions and standard normal

[Figure: densities of t₁, t₂, t₅, t₁₀, t₅₀ and of N(0,1)]

## The Student's t-distribution

It can be shown that the mean and the variance of the t_ν-distribution are

$$E(T) = 0 \qquad \text{and} \qquad \mathrm{Var}(T) = \frac{\nu}{\nu - 2} \quad (\text{for } \nu > 2)$$

The Student's t distribution is similar in shape to the standard normal distribution in that both densities are symmetric, unimodal and bell-shaped, and the maximum value is reached at 0
However, the Student's t distribution has heavier tails than the normal
⇒ there is more probability of finding the random variable T far away from 0 than there is for Z
This is more marked for small values of ν
As the number of degrees of freedom increases, t_ν-distributions look more and more like the standard normal distribution
In fact, it can be shown that the Student's t distribution with ν degrees of freedom approaches the standard normal distribution as ν → ∞
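The heavier tails can be checked numerically: integrating each density beyond 2 shows more tail mass under t₅ than under N(0,1). This is a rough sketch using trapezoidal integration, with the truncation point and step count chosen arbitrarily:

```python
import math

def t_pdf(t, nu):
    """Student's t density with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + t**2 / nu) ** (-(nu + 1) / 2)

def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-t**2 / 2) / math.sqrt(2 * math.pi)

def tail_prob(pdf, a, b=50.0, steps=20_000):
    """P(T > a), approximated by trapezoidal integration on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * h)
    return total * h

# More mass beyond 2 under t_5 than under N(0,1)
print(tail_prob(lambda t: t_pdf(t, 5), 2.0))  # ≈ 0.051
print(tail_prob(normal_pdf, 2.0))             # ≈ 0.023
```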

## The Student's t-distribution: quantiles

Similarly to what we did for the normal distribution, we can define the quantiles of any Student's t-distribution: for T ∼ t_ν, the quantile t_{ν;α} is such that

$$P(T \le t_{\nu;\alpha}) = \alpha, \qquad \text{i.e.} \qquad P(T > t_{\nu;\alpha}) = 1 - \alpha$$

[Figure: t-distribution density with the quantile t_{ν;α} marked]

Like the standard normal distribution, the symmetry of any t_ν-distribution implies that

$$t_{\nu;1-\alpha} = -t_{\nu;\alpha}$$

For any ν, the main quantiles of interest may be found in the t-distribution critical values tables

## Confidence interval on the mean of a normal distribution, variance unknown

So we have, for any n ≥ 2,

$$T = \sqrt{n}\,\frac{\bar{X} - \mu}{S} \sim t_{n-1}$$

Note: the number of degrees of freedom for the t-distribution is the number of degrees of freedom associated with the estimated variance S² (Slide 38, Week 2)

It is now easy to find a 100 × (1 − α)% confidence interval for μ by proceeding essentially as we did when σ² was known
We may write

$$P\left(-t_{n-1;1-\alpha/2} \le \sqrt{n}\,\frac{\bar{X} - \mu}{S} \le t_{n-1;1-\alpha/2}\right) = 1 - \alpha$$

or

$$P\left(\bar{X} - t_{n-1;1-\alpha/2}\,\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1;1-\alpha/2}\,\frac{S}{\sqrt{n}}\right) = 1 - \alpha$$

## t-confidence interval on the mean of a normal distribution

⇒ if x̄ and s are the sample mean and sample standard deviation of an observed random sample of size n from a normal distribution, a confidence interval of level 100 × (1 − α)% for μ is given by

$$\left[\bar{x} - t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}}\right]$$

This confidence interval is sometimes called the t-confidence interval, as opposed to [x̄ − z_{1−α/2} σ/√n, x̄ + z_{1−α/2} σ/√n] (the z-confidence interval)
Because t_{n−1} has heavier tails than N(0,1), t_{n−1;1−α/2} > z_{1−α/2}
⇒ this reflects the extra variability introduced by the estimation of σ (less accuracy)
Note: one can also define one-sided 100 × (1 − α)% t-confidence intervals (−∞, x̄ + t_{n−1;1−α} s/√n] and [x̄ − t_{n−1;1−α} s/√n, +∞)

## t-confidence interval: example

Example
An article in Materials Engineering describes the results of a tensile adhesion test on 22 U-700 alloy specimens. The load at specimen failure is as follows (in megapascals):
7.6, 8.1, 11.7, 14.3, 14.3, 14.1, 8.3, 12.3, 15.9, 16.4, 11.3, 12.0, 12.9, 15.0, 13.2, 14.6, 13.5, 10.4, 13.8, 15.6, 12.2, 11.2
Construct a 99% confidence interval for the true average load at failure for this type of alloy
Elementary computations give x̄ = 12.67 MPa and s = 2.47 MPa

## t-confidence interval: example

[Figure: normal QQ plot of the data, which provides good support for the assumption that the population is normally distributed]

Since n = 22, we have n − 1 = 21 degrees of freedom for t. In the table, we find t_{21;0.995} = 2.831. The resulting CI is

$$\left[\bar{x} - t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}}\right] = 12.67 \pm 2.831 \times \frac{2.47}{\sqrt{22}} = [11.18, 14.16]$$

⇒ we are 99% confident that the true average load at failure for this type of alloy lies between 11.18 MPa and 14.16 MPa
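The computation in this example can be reproduced in a few lines of Python; the quantile t_{21;0.995} = 2.831 is taken from the table, exactly as on the slide:

```python
import math
import statistics

# Failure loads (MPa) for the 22 U-700 alloy specimens
loads = [7.6, 8.1, 11.7, 14.3, 14.3, 14.1, 8.3, 12.3, 15.9, 16.4,
         11.3, 12.0, 12.9, 15.0, 13.2, 14.6, 13.5, 10.4, 13.8,
         15.6, 12.2, 11.2]

n = len(loads)
xbar = statistics.mean(loads)   # sample mean
s = statistics.stdev(loads)     # sample standard deviation (n-1 divisor)

t_quantile = 2.831              # t_{21;0.995}, read from the t table
half_width = t_quantile * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(f"99% t-CI: [{ci[0]:.2f}, {ci[1]:.2f}]")  # [11.18, 14.16]
```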

## Confidence interval on the mean of an arbitrary distribution, variance unknown

What if the population is not normal?
As in the case σ² known, we can rely on the Central Limit Theorem, which asserts that, for n large, Z = √n (X̄ − μ)/σ approximately follows N(0,1), to deduce a result like

$$T = \sqrt{n}\,\frac{\bar{X} - \mu}{S} \overset{a}{\sim} t_{n-1}$$

from which we could find a CI on μ for n large enough
However, recall that, when ν is large, t_ν is very much like N(0,1)
⇒ in large samples, estimating σ with S has very little effect on the distribution of T, for which the approximation by the standard normal distribution is more than enough:

$$T \overset{a}{\sim} N(0,1)$$

## Confidence interval on the mean of an arbitrary distribution

Consequently, if x̄ and s are the sample mean and standard deviation of an observed random sample of large size n from any distribution, an approximate confidence interval of level 100 × (1 − α)% for μ is

$$\left[\bar{x} - z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}\right]$$

This expression holds regardless of the population distribution, as long as n is large enough
⇒ it is called a large-sample confidence interval
Generally, n should be at least 40 to use this result reliably (the CLT usually holds for n ≥ 30, but a larger sample size is recommended because replacing σ by S still results in some additional variability)
As usual, corresponding one-sided confidence intervals can be defined: (−∞, x̄ + z_{1−α} s/√n] and [x̄ − z_{1−α} s/√n, +∞)

## Confidence interval on the mean of an arbitrary distribution: example

Example
An article in Transactions of the American Fisheries Society reports the results of a study to investigate mercury contamination in largemouth bass. A sample of 53 fish was selected from some Florida lakes, and the mercury concentration in the muscle tissue was measured (in ppm):
1.23, 0.49, 1.08, . . ., 0.16, 0.27
Find a confidence interval on μ, the mean mercury concentration in the muscle tissue of fish
A histogram and a quantile plot for the data are displayed below
⇒ both plots indicate that the distribution of mercury concentration may not be normal (positively skewed)
But the sample is large enough (n = 53) to use the Central Limit Theorem and compute an approximate confidence interval for μ

## Confidence interval on the mean of an arbitrary distribution: example

[Figure: histogram and normal QQ plot of the mercury concentrations]

Elementary computations give x̄ = 0.525 ppm and s = 0.3486 ppm. A large-sample confidence interval is given by [x̄ − z_{1−α/2} s/√n, x̄ + z_{1−α/2} s/√n]
With z_{0.975} = 1.96 and the above values, we have

$$\left[0.525 - 1.96 \times \frac{0.3486}{\sqrt{53}},\ 0.525 + 1.96 \times \frac{0.3486}{\sqrt{53}}\right] = [0.4311, 0.6189]$$

⇒ we are 95% confident that the true average mercury concentration in the muscle tissue of the fish is between 0.4311 and 0.6189 ppm
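A minimal sketch of the large-sample interval, using the summary statistics from this example (the raw data are not reproduced on the slide, so only x̄, s and n are used):

```python
import math

def large_sample_ci(xbar, s, n, z=1.96):
    """Approximate 100(1-alpha)% CI for the mean of an arbitrary
    distribution, valid for large n by the Central Limit Theorem."""
    half = z * s / math.sqrt(n)
    return (xbar - half, xbar + half)

# Mercury example: xbar = 0.525 ppm, s = 0.3486 ppm, n = 53, z_{0.975} = 1.96
lo, hi = large_sample_ci(xbar=0.525, s=0.3486, n=53)
print(f"95% large-sample CI: [{lo:.4f}, {hi:.4f}]")  # [0.4311, 0.6189]
```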

## Confidence intervals for the mean: summary

The several situations leading to different confidence intervals for the mean can be summarised as follows. The first question is: is the population normal? (check from a histogram and a quantile plot, for instance)

- if yes, is σ known?
  - if yes, use an exact z-confidence interval: [x̄ − z_{1−α/2} σ/√n, x̄ + z_{1−α/2} σ/√n]
  - if no, use an exact t-confidence interval: [x̄ − t_{n−1;1−α/2} s/√n, x̄ + t_{n−1;1−α/2} s/√n]
- if no, use an approximate large-sample confidence interval: [x̄ − z_{1−α/2} s/√n, x̄ + z_{1−α/2} s/√n] (provided the sample size is large, say n ≥ 40)

What if the sample size is small and the population is not normal?
⇒ check on a case-by-case basis (beyond the scope of this course)

## Prediction interval for a future observation

In some situations, we may be interested in predicting a future observation of a variable
⇒ different from estimating the mean of the variable!
⇒ instead of a confidence interval, we are after a 100 × (1 − α)% prediction interval on a future observation
As an illustration, suppose that X₁, X₂, …, Xₙ is a random sample from a normal population with mean μ and standard deviation σ
⇒ we wish to predict the value of Xₙ₊₁, a single future observation
As Xₙ₊₁ comes from the same population as X₁, X₂, …, Xₙ, the information contained in the sample should be used to predict Xₙ₊₁
⇒ the predictor of Xₙ₊₁, say X̂, should be a statistic

## Prediction interval for a future observation

We desire the predictor to have expected prediction error equal to 0:

$$E(X_{n+1} - \hat{X}) = 0 \iff E(\hat{X}) = \mu$$

⇒ the predictor X̂ must be an unbiased estimator of μ!
We said that an efficient unbiased estimator of μ was the sample mean, so we take it as predictor:

$$\hat{X} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Now, the variance of the prediction error is

$$\mathrm{Var}(X_{n+1} - \hat{X}) = \mathrm{Var}(X_{n+1} - \bar{X}) = \mathrm{Var}(X_{n+1}) + \mathrm{Var}(\bar{X}) = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2\left(1 + \frac{1}{n}\right)$$

(because Xₙ₊₁ is independent of X₁, X₂, …, Xₙ and so of X̄)

## Prediction interval for a future observation

Finally, because both Xₙ₊₁ and X̄ are normally distributed (normal population), the prediction error Xₙ₊₁ − X̄ is also normally distributed
Hence,

$$Z = \frac{X_{n+1} - \bar{X}}{\sigma\sqrt{1 + \frac{1}{n}}} \sim N(0,1)$$

Replacing the possibly unknown σ with the sample standard deviation S yields

$$T = \frac{X_{n+1} - \bar{X}}{S\sqrt{1 + \frac{1}{n}}} \sim t_{n-1}$$

Manipulating Z and T as we did previously for CIs leads to the 100 × (1 − α)% z- and t-prediction intervals on the future observation:

$$\left[\bar{x} - z_{1-\alpha/2}\,\sigma\sqrt{1 + \tfrac{1}{n}},\ \bar{x} + z_{1-\alpha/2}\,\sigma\sqrt{1 + \tfrac{1}{n}}\right]$$

$$\left[\bar{x} - t_{n-1;1-\alpha/2}\,s\sqrt{1 + \tfrac{1}{n}},\ \bar{x} + t_{n-1;1-\alpha/2}\,s\sqrt{1 + \tfrac{1}{n}}\right]$$

## Prediction interval for a future observation: remarks

Remark 1:
The length of a confidence interval on μ of level 100 × (1 − α)% is 2 z_{1−α/2} σ/√n (see Slide 25, Week 7)
The length of a prediction interval on Xₙ₊₁ of level 100 × (1 − α)% is 2 z_{1−α/2} σ √(1 + 1/n)
Prediction intervals for a single observation will always be longer than confidence intervals for μ, because there is more variability associated with one observation than with an average

Remark 2:
As n gets larger (n → ∞), the length of the CI for μ decreases to 0 (we estimate μ more and more accurately), but this is not the case for a prediction interval: the inherent variability of Xₙ₊₁ never vanishes, even when we have observed many other observations before!

## Prediction interval for a future observation: example

Example
Reconsider the example on Slide 13. Find a 99% confidence interval for the true average load at failure. We plan to test a 23rd specimen. Find a 99% prediction interval on the load at failure for this specimen.

From the data (n = 22) we had found x̄ = 12.67 MPa and s = 2.47 MPa, and a 99% confidence interval for μ was [11.18, 14.16]
Now, t_{21;0.995} = 2.831 (from the table), so that a 99% prediction interval for the next observation is

$$\left[\bar{x} \pm t_{n-1;1-\alpha/2}\,s\sqrt{1 + \tfrac{1}{n}}\right] = 12.67 \pm 2.831 \times 2.47 \times \sqrt{1 + \tfrac{1}{22}} = [5.52, 19.82]$$

⇒ we are 99% confident that the failure load for the next specimen will be between 5.52 and 19.82 MPa
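The prediction interval in this example can be sketched in Python from the same summary statistics, with the table quantile t_{21;0.995} = 2.831 supplied by hand:

```python
import math

def t_prediction_interval(xbar, s, n, t_quantile):
    """100(1-alpha)% t-prediction interval for a single future
    observation from a normal population."""
    half = t_quantile * s * math.sqrt(1 + 1 / n)
    return (xbar - half, xbar + half)

# Alloy example: n = 22, t_{21;0.995} = 2.831 from the table
lo, hi = t_prediction_interval(12.67, 2.47, 22, 2.831)
print(f"99% prediction interval: [{lo:.2f}, {hi:.2f}]")  # [5.52, 19.82]
```

Note how much wider this is than the 99% CI [11.18, 14.16]: the extra factor √(1 + 1/n) ≈ 1.02 multiplies s itself, not s/√n.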

## Inferences concerning proportions

Many engineering problems deal with proportions, percentages or probabilities:
we are concerned with the proportion of defectives in a lot, with the percentage of certain components which will perform satisfactorily during a stated period of time, or with the probability that a newly produced item meets some quality standard
⇒ qualitative information can also be included in statistical studies!
It should be clear that problems concerning proportions, percentages or probabilities are really equivalent: a percentage is merely a proportion multiplied by 100, and a probability is a proportion in an (infinitely) long series of trials (Slide 13, Week 3)
We would like to learn about π, the proportion of the population that has a characteristic of interest, but as usual all we have is a sample of size n from that population
⇒ confidence interval for π

## 7.8 Confidence interval on a proportion

Estimation of a proportion
In this situation, the random variable to study is

$$X = \begin{cases} 1 & \text{if the individual has the characteristic of interest} \\ 0 & \text{if not} \end{cases}$$

which is Bernoulli distributed (see Slide 9, Week 5), with the parameter π being the value of interest: X ∼ Bern(π)
The random sample X₁, X₂, …, Xₙ is a set of n independent Bern(π) random variables
⇒ the number Y of individuals in the sample with the characteristic is

$$Y = \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, \pi)$$

and the sample proportion is P̂ = Y/n

## 7.8 Confidence interval on a proportion

Estimation of a proportion
This sample proportion P̂ is obviously a natural candidate for estimating the population proportion π
From the properties of the Binomial distribution, we know that

$$E(Y) = n\pi \qquad \text{and} \qquad \mathrm{Var}(Y) = n\pi(1-\pi)$$

so that E(P̂) = nπ/n = π and Var(P̂) = (1/n²) Var(Y) = nπ(1−π)/n² = π(1−π)/n
Hence, P̂ is an unbiased and consistent estimator of π:

$$E(\hat{P}) = \pi \qquad \text{and} \qquad \mathrm{Var}(\hat{P}) = \frac{\pi(1-\pi)}{n} \ (\to 0 \text{ as } n \to \infty)$$

⇒ the standard error of P̂ is thus sd(P̂) = √(π(1−π)/n)
Upon observation of a random sample x₁, x₂, …, xₙ, in which y = Σᵢ₌₁ⁿ xᵢ individuals have the characteristic, an estimate of π is

$$\hat{p} = \frac{y}{n}$$

## 7.8 Confidence interval on a proportion

Sampling distribution
We could make inference about π from p̂ using the Binomial distribution of Y. However, it is probably easier to use the Central Limit Theorem (Slides 33–34, Week 7). Indeed:

$$\hat{P} = \frac{Y}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i,$$

so that P̂ is a sample mean, and the CLT guarantees that

$$\sqrt{n}\,\frac{\hat{P} - \pi}{\sqrt{\pi(1-\pi)}} \overset{a}{\sim} N(0,1)$$

if n is large (∼a again stands for "approximately follows")
We also know that the quality of the approximation depends on the symmetry of the initial distribution of the Xᵢ's, here Bern(π)
⇒ π should not be too close to 0 or 1; empirical rule: n p̂ (1 − p̂) > 5

## Confidence interval for a proportion

As the sampling distribution

$$\sqrt{n}\,\frac{\hat{P} - \pi}{\sqrt{\pi(1-\pi)}} \overset{a}{\sim} N(0,1)$$

is just a particular case of √n (X̄ − μ)/σ approximately following N(0,1), we can use (almost) directly the large-sample confidence interval we derived for a mean
Specifically, we have that

$$P\left(-z_{1-\alpha/2} \le \sqrt{n}\,\frac{\hat{P} - \pi}{\sqrt{\pi(1-\pi)}} \le z_{1-\alpha/2}\right) \simeq 1 - \alpha$$

or

$$P\left(\hat{P} - z_{1-\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \le \pi \le \hat{P} + z_{1-\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) \simeq 1 - \alpha$$

⇒ a confidence interval for π takes shape

## Confidence interval for a proportion

Unfortunately, the standard error of P̂, that is the factor √(π(1−π)/n), contains the unknown π
In such a situation, we may replace the unknown value by its estimate, that is, use the estimated standard error of the estimator (see Slide 12, Week 7)

$$\widehat{\mathrm{sd}}(\hat{P}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

in the expression of the confidence interval
Consequently, if p̂ is the sample proportion in an observed random sample of size n, an approximate two-sided confidence interval of level 100 × (1 − α)% for π is given by

$$\left[\hat{p} - z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

As this is based on the CLT and requires n large, it is a large-sample confidence interval for π

## One-sided confidence intervals for a proportion

We may also find one-sided large-sample confidence intervals for the proportion π by a simple modification of the previous development
We find:

$$\left[0,\ \hat{p} + z_{1-\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right] \qquad \text{and} \qquad \left[\hat{p} - z_{1-\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ 1\right]$$

## Choice of the sample size

Since p̂ is the estimate of π, we can define the error in estimating π by p̂ as e = |p̂ − π|. From Slide 29, we are approximately 100 × (1 − α)% confident that this error is less than

$$z_{1-\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}$$

In situations where the sample size can be selected, we may choose n to be 100 × (1 − α)% confident that the error is less than any specified value e:

$$n = \left(\frac{z_{1-\alpha/2}}{e}\right)^2 \pi(1-\pi) \qquad \text{(compare Slide 26, Week 7)}$$

⇒ this depends on π, for which no information is available at this point
Idea: use an upper bound which holds for any value of π
Actually, π(1−π) ≤ 1/4, with equality for π = 1/2, thus with

$$n = \left(\frac{z_{1-\alpha/2}}{2e}\right)^2$$

we are at least 100 × (1 − α)% confident that the error is less than e and this, regardless of the value of π (this is very conservative, though)

## Confidence interval for a proportion: example

Example
In a random sample of 85 car engine crankshaft bearings, 10 have a surface finish that is rougher than the specifications allow. a) Find a 95% confidence interval on the true proportion of produced bearings that exceed the roughness specification; b) How large a sample is required if we want to be 95% confident that the error in estimating π is less than 0.05?

a) The estimate of π is p̂ = y/n = 10/85 = 0.118. Thus, the estimated standard error is

$$\widehat{\mathrm{sd}}(\hat{P}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.118 \times (1 - 0.118)}{85}} = 0.035$$

and an approximate two-sided 95% confidence interval for π is

$$\left[\hat{p} \pm z_{0.975}\,\widehat{\mathrm{sd}}(\hat{P})\right] = [0.118 \pm 1.96 \times 0.035] = [0.049, 0.186]$$

⇒ we are 95% confident that the true proportion of produced bearings outside specifications is between 0.049 and 0.186
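Part a) can be reproduced with a short Python sketch of the large-sample proportion interval:

```python
import math

def proportion_ci(y, n, z=1.96):
    """Approximate large-sample 100(1-alpha)% CI for a proportion,
    based on the CLT with the estimated standard error."""
    p_hat = y / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return (p_hat - z * se, p_hat + z * se)

# Crankshaft-bearing example: 10 rough finishes out of 85
lo, hi = proportion_ci(10, 85)
print(f"95% CI for pi: [{lo:.3f}, {hi:.3f}]")  # [0.049, 0.186]
```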

## Confidence interval for a proportion: example

Example (continued)
b) The previous CI has width 0.137, which is quite large for a CI for π. If we want a CI of width at most 2 × 0.05, we need

$$n = \left(\frac{z_{1-\alpha/2}}{2e}\right)^2 = \left(\frac{1.96}{2 \times 0.05}\right)^2 = 384.16$$

⇒ we need at least 385 observations
Note that this number would guarantee the required accuracy regardless of the true value of π; this is why it is so high (conservative)
(Using p̂ = 0.118 as a preliminary estimate of π, we would have n ≃ (z_{1−α/2}/e)² p̂(1−p̂) = (1.96/0.05)² × 0.118 × 0.882 = 159.93)
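Both sample-size calculations above fit in one small function; without a preliminary guess for π it falls back on the conservative bound π(1−π) ≤ 1/4:

```python
import math

def sample_size_for_proportion(e, z=1.96, p_guess=None):
    """Smallest n so that the estimation error |p_hat - pi| is below e
    with (approximate) confidence matching z. Without a guess for pi,
    uses the conservative bound pi(1-pi) <= 1/4."""
    pq = 0.25 if p_guess is None else p_guess * (1 - p_guess)
    return math.ceil((z / e) ** 2 * pq)

print(sample_size_for_proportion(0.05))                 # conservative: 385
print(sample_size_for_proportion(0.05, p_guess=0.118))  # with preliminary estimate: 160
```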

## Alternative confidence interval for a proportion

The previous confidence interval does not work well near the boundary, i.e., when π is near 0 or 1. Recall the probability statement

$$P\left(-z_{1-\alpha/2} \le \sqrt{n}\,\frac{\hat{P} - \pi}{\sqrt{\pi(1-\pi)}} \le z_{1-\alpha/2}\right) \simeq 1 - \alpha$$

The three-way inequality is quadratic in π; solving that equation gives the following confidence interval*

$$(1-w)\,\hat{p} + w\,\frac{1}{2} \;\pm\; z_{1-\alpha/2}\sqrt{(1-w)\,\frac{\hat{p}(1-\hat{p})}{n + z_{1-\alpha/2}^2} + w\,\frac{1}{4\,(n + z_{1-\alpha/2}^2)}}$$

where

$$w = \frac{z_{1-\alpha/2}^2}{n + z_{1-\alpha/2}^2}$$

* First derived by E. B. Wilson (1927)
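The Wilson interval can be sketched directly from the formula above; applied to the bearing example (10 out of 85), it gives an interval slightly shifted towards 1/2 relative to the standard large-sample interval [0.049, 0.186]:

```python
import math

def wilson_ci(y, n, z=1.96):
    """Wilson (1927) score interval for a proportion; behaves better
    than the standard large-sample interval near 0 or 1."""
    p_hat = y / n
    w = z**2 / (n + z**2)                # shrinkage weight
    centre = (1 - w) * p_hat + w * 0.5   # pulls p_hat towards 1/2
    half = z * math.sqrt((1 - w) * p_hat * (1 - p_hat) / (n + z**2)
                         + w / (4 * (n + z**2)))
    return (centre - half, centre + half)

# Crankshaft-bearing example again: 10 out of 85
lo, hi = wilson_ci(10, 85)
print(f"95% Wilson CI: [{lo:.3f}, {hi:.3f}]")  # [0.065, 0.203]
```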

## 7. Inferences concerning a mean

Objectives
Now you should be able to:
- Construct z- and t-confidence intervals on the mean of a normal distribution, advisedly using either the normal distribution or the Student's t distribution
- Construct large-sample confidence intervals on the mean of an arbitrary distribution with unknown variance
- Explain the difference between a confidence interval and a prediction interval
- Construct prediction intervals for a future observation in a normal population
- Construct confidence intervals on a population proportion

Recommended exercises: Q7, Q9 p.301, Q13, Q15 p.302, Q20 p.303, Q35 p.319, Q39 p.320, Q43(a-b) p.320, Q55 p.328, (optional) Q71, Q73 p.340, Q55 p.238