
CN3421 Statistics Tutorial 3 (Week 5)

1. Let α be the significance level, with all other symbols carrying their conventional
meanings. Consider the equation below and discuss its application:

$$ n = \left( \frac{z_{\alpha/2}\,\sigma}{\bar{x} - \mu_0} \right)^2 $$

This formula estimates the minimum sample size needed for a two-tailed Z-test to conclude
that the difference between the sample mean and the set point µ0 is significant (at level α),
given a known population standard deviation σ.

From the formula we can see that:

(1) A smaller difference (x̄ − µ0) leads to a larger required sample size n. To detect a small
difference between the sample mean and the hypothesized population mean (µ0), one needs a
large sample. Equivalently, a mean obtained from a large sample is a precise estimate of the
population mean (small standard error), so even a small difference between the sample mean
and µ0 will be judged statistically significant.
(2) A larger population standard deviation (σ) leads to a larger required sample size n. If the
population distribution is wide, individual measurements scatter more and the sample mean is
a less precise estimate of the population mean. This uncertainty is reduced by increasing the
sample size.
(3) A larger zα/2 leads to a larger required sample size n. A larger zα/2 corresponds to a
smaller α, i.e., a stricter significance level (higher confidence level), and meeting a stricter
criterion requires more data.
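
To make the three trends concrete, the formula can be evaluated for a few differences. The sketch below is in Python using only the standard library; the function name `min_sample_size` and the input values (σ = 0.5, α = 0.05, and the three differences) are illustrative assumptions, not part of the tutorial data.

```python
import math
from statistics import NormalDist


def min_sample_size(alpha, sigma, diff):
    """Smallest n for which a two-tailed Z-test flags `diff` as significant at level alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2} of the standard normal
    return math.ceil((z * sigma / diff) ** 2)  # round up to a whole number of samples


# Assumed illustrative inputs: sigma = 0.5, alpha = 0.05
for diff in (0.2, 0.1, 0.05):
    print(f"difference = {diff}: n >= {min_sample_size(0.05, 0.5, diff)}")
```

Halving the difference roughly quadruples the required sample size, because n scales with the inverse square of (x̄ − µ0).
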

2. Consider “statistical significance” and “physical significance” through a scenario: a factory
is producing 20-micron-thick protection layers with a known (population) standard deviation
of 0.5 micron. To test whether the machine making the protection layers is working
satisfactorily, a random sample of 50 protection layers is measured, and the average
thickness is found to be 20.05 micron. Is the machine producing unsatisfactory products?
What if the random sample is 500 layers, with the same sample average? When the P-value
becomes “significant”, is the deviation really “significant” for a product compared with the
standard?

(1) H0: µ = µ0 = 20 micron, H1: µ ≠ µ0

(2) The test statistic is

$$ Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{20.05 - 20}{0.5/\sqrt{50}} = \frac{0.05\sqrt{50}}{0.5} = 0.707 $$

(3) The P-value is 0.48 (in Excel: =2*(1-NORM.S.DIST(0.707,TRUE))).


(4) From Sample 1 (size = 50, average = 20.05), we have no strong evidence to conclude that
the machine produces layers (statistically) significantly different from the standard (20 micron).
Thus, we have no strong evidence that the machine is producing unsatisfactory products.
For Sample 2 (size = 500, average = 20.05), the P-value is 0.0253. At a significance level of
0.05, we have strong evidence to conclude that the machine is producing layers (statistically)
significantly different from the standard (20 micron).
Increasing the sample size makes the sample information more precise. Compare the 95%
confidence intervals (CI) of the population mean for these two samples:

$$
\mu = \bar{X} \pm \frac{z_{\alpha/2}}{\sqrt{n}}\,\sigma = 20.05 \pm \frac{1.96}{\sqrt{n}} \times 0.5 =
\begin{cases}
20.05 \pm \dfrac{1.96}{\sqrt{50}} \times 0.5 = 20.05 \pm 0.139, & \text{i.e., } (19.911,\ 20.189) \quad (n = 50) \\
20.05 \pm \dfrac{1.96}{\sqrt{500}} \times 0.5 = 20.05 \pm 0.044, & \text{i.e., } (20.006,\ 20.094) \quad (n = 500)
\end{cases}
$$

For Sample 2, the population mean is confined to a much narrower range, and that range does
not include 20. Statistically, a population mean of exactly 20 is therefore implausible at the
95% confidence level (the mean is very likely different from 20).
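
The Z statistics, P-values, and confidence intervals for both samples can be reproduced with a short script. This is a minimal sketch in Python (standard library only) rather than the Excel route used in the solution; the helper name `z_test_two_tailed` is an assumption made for illustration.

```python
import math
from statistics import NormalDist


def z_test_two_tailed(xbar, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, ci_low, ci_high) for a two-tailed Z-test with known sigma."""
    std_norm = NormalDist()
    se = sigma / math.sqrt(n)                     # standard error of the mean
    z = (xbar - mu0) / se                         # test statistic
    p = 2 * (1 - std_norm.cdf(abs(z)))            # two-tailed P-value
    half = std_norm.inv_cdf(1 - alpha / 2) * se   # CI half-width, z_{alpha/2} * se
    return z, p, xbar - half, xbar + half


# Same sample average (20.05) and sigma (0.5), two sample sizes from the problem
for n in (50, 500):
    z, p, lo, hi = z_test_two_tailed(20.05, 20.0, 0.5, n)
    print(f"n={n}: Z={z:.3f}, P={p:.4f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Only n changes between the two runs, so the shift from a non-significant to a significant P-value comes entirely from the smaller standard error.
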
All of the above concerns the “statistical significance of a difference”, i.e., whether the
data show a detectable difference between the true mean and a reference value. However,
does this difference really matter? For Sample 2, the population mean probably lies between
20.006 and 20.094 micron, so the largest plausible deviation from the standard is 0.094
micron. Does a 0.094 micron difference matter? We call this the “physical significance of a
difference”. One may well accept a batch of layers that deviates by 0.094 micron (about
0.5%) from the standard, in which case the difference is not physically significant. In
industry, a “tolerance” serves as the criterion. For example, suppose a factory produces
tubes with an inner diameter of 1 mm and a tolerance of 0.1 mm. Then any tube whose inner
diameter is between 0.9 mm and 1.1 mm is acceptable. Statistical tools can only tell you,
from the sample information, whether your product’s specification really differs from the
standard; only engineers can tell you whether this difference is physically important.
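
The two questions can be kept apart in code as two separate checks. In the Python sketch below, the 95% CI for Sample 2 is taken from the solution above, while the ±0.2 micron tolerance and the helper `within_tolerance` are hypothetical: the problem does not state a tolerance for the layers.

```python
def within_tolerance(ci_low, ci_high, target, tol):
    """True if the whole confidence interval lies inside target +/- tol."""
    return target - tol <= ci_low and ci_high <= target + tol


# 95% CI for Sample 2 as computed in the solution
ci_low, ci_high = 20.006, 20.094

# Statistical question: does the CI exclude the standard of 20 micron?
statistically_different = not (ci_low <= 20.0 <= ci_high)

# Physical question: is every plausible mean still acceptable?
# The tolerance of 0.2 micron is a made-up value for illustration.
physically_acceptable = within_tolerance(ci_low, ci_high, target=20.0, tol=0.2)

print(statistically_different, physically_acceptable)
```

Under this assumed tolerance the mean is statistically different from 20 micron yet still acceptable; a tighter tolerance could reverse the second answer without changing the first.
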

3. Consider “correlation” versus a “cause-and-effect” relationship: in studying the time of
recovery from disease A, we looked at the records of 2000 patients from all over the world,
categorized them into three groups according to their IQs (low, average, and high IQ groups),
and performed an ANOVA test. The results show that people with higher IQs recovered
significantly faster. What does this result imply?

This result implies that a person’s IQ (independent variable) is correlated with the recovery
time from disease A (dependent variable). It does NOT necessarily mean that IQ has an effect
on the recovery time. The real determining factor of the recovery time could be some other
variable, for example, diet. It is possible that diet influences both IQ and recovery time:
people on a certain diet would then tend to have a higher IQ and a shorter recovery time at
once, so IQ appears correlated with recovery time. Further experiments are needed before
drawing a causal conclusion.
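
A small simulation illustrates how a hidden variable can produce such a correlation. The Python sketch below uses purely synthetic data: the "diet" score, the noise levels, and the linear relationships are all assumptions chosen for illustration and say nothing about real patients. IQ and recovery time are each generated from diet alone, with no direct link between them.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 2000
diet = [rng.gauss(0, 1) for _ in range(n)]        # hidden confounder, arbitrary units
iq = [d + rng.gauss(0, 1) for d in diet]          # depends on diet only
recovery = [-d + rng.gauss(0, 1) for d in diet]   # depends on diet only

r_raw = pearson(iq, recovery)
# Subtract the diet contribution (possible here only because the data are simulated)
r_adjusted = pearson([i - d for i, d in zip(iq, diet)],
                     [r + d for r, d in zip(recovery, diet)])
print(f"raw r = {r_raw:.2f}, diet-adjusted r = {r_adjusted:.2f}")
```

The raw correlation is clearly negative (higher IQ, shorter recovery), yet it disappears once the diet contribution is removed, which is the pattern described in the answer.
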

4. One wants to study the effect of cover design on the sales of a book (same contents and
price). Three different covers (A, B, and C) are designed, and the books are placed in the
checkout line of a supermarket for sale. After a few weeks, the sales records are as given in
the Excel file “Tutorial_3_Data.xlsx” (an empty cell means no data are available).

Use MATLAB and Excel to determine whether the cover design is correlated with the sales of
the book.

MATLAB code:

clear;
% Read the sales of covers A, B and C from columns B, C and D
A=xlsread('Tutorial_3_Data.xlsx','$B:$B');
B=xlsread('Tutorial_3_Data.xlsx','$C:$C');
C=xlsread('Tutorial_3_Data.xlsx','$D:$D');

% Always check how many data points each group has; needed for grouping later
na=numel(A);
nb=numel(B);
nc=numel(C);

V=[A;B;C]; % stack all data into one column vector

% Group labels 1, 2, 3 (column vector, same orientation as V)
group=[ones(na,1);2*ones(nb,1);3*ones(nc,1)];
[p,astat]=anova1(V,group); % p: P-value, astat: ANOVA table

Results:

P-value = 0.0012, a small number. We therefore reject H0 (all group means are equal) in
favor of H1 (at least one mean differs), i.e., sales are correlated with the cover design.

The Excel analysis gives the same P-value and thus the same conclusion. A further
walkthrough can be found in the video.
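
For readers who want to see what the one-way ANOVA computes, the F statistic can be written out by hand. The Python sketch below is an illustration only: the function name `one_way_f` is an assumption, and the sales figures are made-up values, NOT the contents of Tutorial_3_Data.xlsx. The groups have different sizes on purpose, mirroring the empty cells in the data file. The P-value itself requires the F distribution, which MATLAB's anova1 and Excel handle internally and which is not computed here.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical weekly sales per cover (illustrative values, unequal group sizes)
covers = {"A": [12, 15, 11, 14], "B": [18, 20, 17], "C": [13, 12, 16, 15, 14]}
f_stat, df_b, df_w = one_way_f(list(covers.values()))
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

A large F means the group means differ by more than the scatter within groups would explain, which is what a small P-value reports.
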
