PREPARED BY
MOHAMED SALAMA
Training Programme
PART I
INTRODUCTION
What is Statistics?
…includes
describing the problem
gathering data
summarizing data
analyzing the data and communicating meaningful conclusions
Statistics in chemical analysis
Lab Chemists are concerned with the chemical analysis
processes that quantify analytes in different matrices.
DESCRIPTIVE STATISTICS
INFERENTIAL STATISTICS
POPULATION
The collection, or set, of ALL individuals, items, or
events of interest.
SAMPLE
The subset of the population for which we have
data available.
Example
Population: use parameters to summarize features.
Sample: use statistics to summarize features.
ORDINAL –
Ordered categories. Example: year groups (1st
year, 2nd year, 3rd year etc.)
Types of data II
Deterministic Sample
Elements are selected on the basis of some algorithmic
approach, for example selecting every 10th member
of the population.
Danger of biased results.
Random Sample
Each member of the population is selected with a
certain probability.
Statistics is a valuable tool in all sciences.
2. Trueness
The closeness of agreement between the average
value obtained from a large series of test results and
an accepted reference value
Definitions
3. Bias
The difference between the expectation of the test
results and an accepted reference value
9. Repeatability conditions
Conditions where independent test results are
obtained on identical test items in the same
laboratory by the same operator using the same
equipment within short intervals of time.
Definitions
10. Reproducibility
Precision under reproducibility conditions
[Figure: illustration of trueness vs. precision]
PART II
FUNDAMENTALS OF
STATISTICS
Statistical Fundamentals
1. Measures of Central Tendency
1.1 Average
X̄ = (1/n) · Σᵢ₌₁ⁿ xᵢ
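As a numerical sketch of the average formula above, with hypothetical replicate measurements of an analyte concentration:

```python
# Sample mean: x_bar = (1/n) * sum of x_i for i = 1..n.
# Hypothetical replicate measurements (illustrative values only):
measurements = [10.1, 10.3, 9.9, 10.2, 10.0]

n = len(measurements)
x_bar = sum(measurements) / n
print(x_bar)  # 10.1 (up to floating-point rounding)
```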
PART III
ERRORS IN CLASSICAL
ANALYSIS
1. DISTRIBUTION OF
ERRORS
Normal Distribution
Continuous random variable: takes values from an interval of numbers, with no gaps.
Continuous probability distribution: the distribution of a continuous random variable.
The most important continuous probability distribution is the normal distribution.
Normal Distribution
“Bell shaped”
Symmetrical
Mean, median and mode are equal
Interquartile range equals 1.33 σ
The random variable has an infinite range
[Figure: bell-shaped curve of f(X) against X, with mean = median = mode at the centre]
Normal Distribution
Is symmetric about the mean: Mean = Median.
[Figure 6.2.2: 50% of the area lies on each side of the mean]
Normal Distribution
f(X) = (1 / (σ √(2π))) · e^(−(X − μ)² / (2σ²))
f(X): density of random variable X
π ≈ 3.14159; e ≈ 2.71828
μ: population mean
σ: population standard deviation
X: value of random variable X
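The density formula above translates directly into code; this is an illustrative sketch, not part of the original material:

```python
import math

def normal_pdf(x, mu, sigma):
    # f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2*sigma^2))
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The standard normal (mu = 0, sigma = 1) peaks at x = 0 with height
# 1/sqrt(2*pi), about 0.3989.
print(normal_pdf(0.0, 0.0, 1.0))
```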
Normal Distribution
μ shifts the curve along the axis; σ increases the spread and flattens the curve.
[Figure: two normal curves with σ₁ = σ₂ = 6 and means μ₁ = 6, μ₂ = 12]
68% chance of falling between μ − σ and μ + σ; 95% between μ − 2σ and μ + 2σ; 99.7% between μ − 3σ and μ + 3σ.
Normal Distribution
Probability is the area under the curve: P(c ≤ X ≤ d) is the area under f(X) between X = c and X = d.
Normal Distribution
Example: with μ = 5 and σ = 10, the value X = 6.2 standardizes to Z = (X − μ)/σ = (6.2 − 5)/10 = 0.12.
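A minimal sketch of the standardization step, using Python's standard-library `statistics.NormalDist` (the values μ = 5 and σ = 10 are taken from the example above):

```python
from statistics import NormalDist

# Standardization: Z = (X - mu) / sigma, with mu = 5, sigma = 10, X = 6.2.
mu, sigma, x = 5.0, 10.0, 6.2

z = (x - mu) / sigma
print(round(z, 2))  # 0.12

# P(X <= 6.2) equals the standard-normal probability P(Z <= 0.12):
p = NormalDist(mu, sigma).cdf(x)
print(round(p, 4))
```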
Example Problems
Exercises
2. CONFIDENCE
INTERVALS
Confidence Intervals
The problem
How large are the error bounds when
we use data from a sample to estimate
parameters of the underlying population?
Confidence Intervals
Error = X̄ − μ
Z = (X̄ − μ) / σ_X̄, so Error = Z · σ_X̄
μ = X̄ ± Z · σ_X̄
Confidence Intervals
X̄ − Z · σ_X̄ ≤ μ ≤ X̄ + Z · σ_X̄, where σ_X̄ = σ / √n
X̄ ± 1.645 σ_X̄ captures μ for 90% of samples
X̄ ± 1.96 σ_X̄ for 95% of samples
X̄ ± 2.58 σ_X̄ for 99% of samples
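The known-σ interval above can be sketched as follows; the sample summary values (x̄ = 50, σ = 10, n = 25) are hypothetical:

```python
import math
from statistics import NormalDist

# 95% confidence interval for mu when sigma is known:
# x_bar +/- z_{alpha/2} * sigma / sqrt(n)
x_bar, sigma, n = 50.0, 10.0, 25      # hypothetical sample summary
conf = 0.95
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # ~1.96 for 95%

half_width = z * sigma / math.sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width
print(round(lower, 2), round(upper, 2))  # 46.08 53.92
```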
Intervals & Level of Confidence
The level of confidence is the proportion of intervals that contain the unknown population parameter in repeated sampling.
Denoted (1 − α)%, e.g. 90%, 95%, 99%.
α is the probability that the parameter is NOT within the interval in repeated trials (NOT in this trial alone!).
Intervals & Level of Confidence
[Figure: sampling distribution of the mean X̄, with area α/2 in each tail and 1 − α in the centre]
Intervals extend from X̄ − Z · σ_X̄ to X̄ + Z · σ_X̄.
(1 − α)% of such intervals contain μ; α% do not.
Confidence Intervals
Factors affecting interval width:
Data variation, measured by σ
Sample size: σ_X̄ = σ / √n
Level of confidence (1 − α)
Intervals extend from X̄ − Z · σ_X̄ to X̄ + Z · σ_X̄.
Confidence Intervals
CONCLUDING REMARK
X̄ − Z₍α/2₎ · σ/√n ≤ μ ≤ X̄ + Z₍α/2₎ · σ/√n
CI (σ Unknown)
Assumptions
Population standard deviation is unknown.
Population must be normally distributed, or the sample size must be large enough for the central limit theorem to apply.
Use Student’s t Distribution
Confidence Interval Estimate
X̄ − t₍α/2, n−1₎ · S/√n ≤ μ ≤ X̄ + t₍α/2, n−1₎ · S/√n
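A sketch of the t-based interval; the data are the three replicate results for extraction procedure A from the ANOVA example later in this document, and the critical value t₀.₀₂₅,₂ = 4.303 is taken from a standard two-tailed t-table:

```python
import math
from statistics import mean, stdev

# t-based CI: x_bar +/- t_{alpha/2, n-1} * S / sqrt(n).
# Data: replicate results for procedure A (from the ANOVA example below).
data = [10.5, 11.5, 10.7]
n = len(data)
x_bar, s = mean(data), stdev(data)

t_crit = 4.303                      # t_{0.025, df = 2}, two-tailed table value
half_width = t_crit * s / math.sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width
print(round(lower, 2), round(upper, 2))
```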
Student’s t Distribution
[Figure: Student’s t distribution compared with the standard normal; the t curve is bell shaped but has heavier tails]
Degrees of Freedom (df)
Number of observations that are free to vary after the sample mean has been calculated.
Example: the mean of 3 numbers is 2.
X₁ = 1 (or any number)
X₂ = 2 (or any number)
X₃ = 3 (cannot vary)
Mean = 2
degrees of freedom = n − 1 = 3 − 1 = 2
Student’s t Table
Assume n = 3, so df = n − 1 = 2.
For a two-tailed test at α = .10, the upper tail area is α/2 = .05.
Reading the t-table row for df = 2 under upper-tail area .05 gives t = 2.920.
PART IV
SIGNIFICANCE TESTS
Training Programme
1. T-TEST
Student’s t Test
When solving probability problems for the sample mean,
one of the steps was to convert the sample mean
values to z-scores using the following formula:
z = (x̄ − μ_x̄) / σ_x̄, where μ_x̄ = μ and σ_x̄ = σ / √n
What happens if we do not know the population
standard deviation ? If we substitute the population
standard deviation with the sample standard
deviation, s, can we use the standard normal table?
Answer: no.
Student’s t Test
This question was addressed in 1908 when W.S. Gosset
found that if we replace with the sample standard
deviation s, the distribution becomes a t-distribution. If
T = (x̄ − μ) / (s / √n)
then T has a t-distribution with n-1 degrees of freedom.
The t-distribution is similar to the z-curve in that it is bell
shaped, but the shape of the t-distribution changes with
the degrees of freedom.
We will use the T-tables to get the critical t-values at
different levels of and degrees of freedom.
Student’s t Test
1.One-sample t-test
T = (x̄ − μ) / (s / √n)
When using the t-test, TAKE CARE: is it a one-tailed or a two-tailed test?
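A minimal one-sample t-test sketch (the data and the hypothesized mean μ₀ = 10.0 are hypothetical; the critical value 4.303 is t₀.₀₂₅,₂ from a two-tailed table):

```python
import math
from statistics import mean, stdev

def one_sample_t(data, mu0):
    # t = (x_bar - mu0) / (s / sqrt(n))
    n = len(data)
    return (mean(data) - mu0) / (stdev(data) / math.sqrt(n))

# Hypothetical example: test H0: mu = 10.0, two-tailed, alpha = 0.05.
data = [10.5, 11.5, 10.7]
t = one_sample_t(data, 10.0)
t_crit = 4.303                    # t_{0.025, df = 2}
print(round(t, 3), abs(t) > t_crit)   # |t| < t_crit: do not reject H0
```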
Student’s t Test
2. Independent sample t-test (Equal variances)
2. F-TEST
F-Test
The F-test is used to compare the standard deviations of two
samples, i.e. to test whether the populations from which
they come have equal variances.
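A sketch of the variance-ratio statistic; the two data sets are the replicate results for procedures A and B from the ANOVA example below, used here purely for illustration:

```python
from statistics import variance

def f_ratio(sample1, sample2):
    # F = s1^2 / s2^2, with the larger variance on top so F >= 1;
    # F is then compared with a critical value from an F-table.
    v1, v2 = variance(sample1), variance(sample2)
    return max(v1, v2) / min(v1, v2)

a = [10.5, 11.5, 10.7]   # procedure A replicates
b = [9.9, 10.8, 10.8]    # procedure B replicates
F = f_ratio(a, b)
print(round(F, 2))
```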
3. HYPOTHESIS TESTING
Hypothesis Testing
A hypothesis is a claim (assumption) about a population parameter.
Example claim: “The mean GPA of this class is 3.5!”
Examples of parameters are the population mean or proportion.
The parameter must be identified before analysis.
Hypothesis Testing
Assume the population mean age is 50 (H₀: μ = 50).
Identify the population and take a sample; if the sample mean is far from the hypothesized value (e.g. X̄ = 20), REJECT the null hypothesis.
Hypothesis Testing
Sampling distribution of X̄: it is unlikely that we would get a sample mean of this value (X̄ = 20) if in fact the population mean were μ = 50. Therefore, we reject the null hypothesis that μ = 50.
Hypothesis Testing
Jury trial analogy, H₀: innocent.
Do not reject H₀: correct decision (probability 1 − α) if H₀ is true; Type II error (probability β) if H₀ is false.
Reject H₀: Type I error (probability α) if H₀ is true; correct decision, the power of the test (1 − β), if H₀ is false.
In the jury trial: acquitting an innocent defendant is correct, convicting an innocent defendant is the Type I error, and acquitting a guilty defendant is the Type II error.
Hypothesis Testing
If you reduce the probability of one
error, the probability of the other increases,
everything else being unchanged.
Hypothesis Testing
The probability of a Type II error (β) depends on:
True value of the population parameter: β increases when the difference between the hypothesized parameter and its true value decreases.
Significance level: β increases when α decreases.
Population standard deviation: β increases when σ increases.
Sample size: β increases when n decreases.
Hypothesis Testing
Convert the sample statistic (e.g. X̄) to a test statistic
(e.g. a Z, t or F statistic).
Obtain the critical value(s) for a specified α
from a table or computer.
If the test statistic falls in the critical region, reject
H₀.
Otherwise do not reject H₀.
Hypothesis Testing
[Figure: standard normal curve with the rejection region in the lower tail]
Small values of Z don’t contradict H₀, so don’t reject H₀; Z must be significantly below 0 to reject H₀.
Example Problems
Exercises
4. ANALYSIS OF VARIANCE
(ANOVA)
Analysis of Variance (ANOVA)
The statistical tests described previously are
used in the comparison of two sets of data, or to
compare a single sample of measurements with
a standard or reference value. Frequently,
however, it is necessary to compare three or
more sets of data, and in that case we can make
use of a very powerful statistical method with
a great range of applications – Analysis of
Variance (ANOVA).
Analysis of Variance (ANOVA)
If there is only one source of
variation apart from the random
measurement error, a one-way
ANOVA calculation is appropriate;
if there are two sources of
variation we use two-way ANOVA
calculations, and so on.
Analysis of Variance (ANOVA)
Example:
A sample of fruit is analysed for its pesticide content by a
liquid chromatographic procedure, but four different
extraction procedures A–D (solvent extraction with different
solvents, solid-phase extraction, etc.) are used, the
concentration in each case being measured three times. The
results (mg kg⁻¹) are shown in the table below. Is there
any evidence that the four different sample preparation
methods yield different results?
Analysis of Variance (ANOVA)
Results A B C D
1 10.5 9.9 9.9 9.2
2 11.5 10.8 9.1 8.5
3 10.7 10.8 8.9 9.0
Average 10.9 10.5 9.3 8.9
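The one-way ANOVA calculation for this table can be sketched with standard-library Python (the critical value F(3, 8; 0.05) = 4.07 is from an F-table):

```python
from statistics import mean

# One-way ANOVA for the pesticide-extraction table above
# (3 replicate results for each of 4 extraction procedures).
groups = {
    "A": [10.5, 11.5, 10.7],
    "B": [9.9, 10.8, 10.8],
    "C": [9.9, 9.1, 8.9],
    "D": [9.2, 8.5, 9.0],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = mean(all_values)
k = len(groups)              # number of groups (4)
N = len(all_values)          # total number of observations (12)

# Between-group sum of squares: n_i * (group mean - grand mean)^2
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Within-group sum of squares: squared deviations from each group's own mean
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(round(F, 2))  # ~11.33, well above F_crit(3, 8; 0.05) = 4.07, so reject H0
```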
The CHI-Squared Test
Example: the numbers of breakages of glassware by four laboratory workers are compared.
H₀: each worker has the same number of breakages (no difference
in reliability, assuming that the workers use the lab for an equal length of time)
H₁: the workers differ in their numbers of breakages
The CHI-Squared Test
The null hypothesis implies that since the total number of
breakages is 61, the expected number of breakages per worker
is 61 / 4 = 15.25. Obviously it is not possible in practice to
have a non-integral number of breakages (this number is only a
mathematical concept). The question to be answered is
whether the difference between the observed and expected
frequencies is so large that the null hypothesis should be
rejected.
The calculation of chi-squared, χ², the quantity used to test for
a significant difference, is shown in the following table:
The CHI-Squared Test
Observed frequency (O)   Expected frequency (E)   O − E   (O − E)²/E
24                       15.25                    8.75    5.020
…                        …                        …       …
χ² = Σ (O − E)²/E = 8.966
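A sketch of the χ² calculation. Only the first observed count (24) survives in the table above; the other three counts (17, 11, 9) are assumed here so that they total 61 and reproduce χ² ≈ 8.966:

```python
# Chi-squared statistic for the breakage example.
# Only the first observed count (24) appears in the text above; the
# remaining three counts (17, 11, 9) are ASSUMED here so that the total
# is 61 and the quoted chi-squared of 8.966 is reproduced.
observed = [24, 17, 11, 9]
expected = sum(observed) / len(observed)   # 61 / 4 = 15.25 per worker

chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 3))  # 8.967 (quoted as 8.966 in the text)
```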
The CHI-Squared Test
If χ²_calculated exceeds the critical value χ²(probability, df), the null hypothesis is rejected.

Critical values χ² (P = 0.05):
df 1: 3.84; df 2: 5.99; df 3: 7.81; df 4: 9.49; df 5: 11.07;
df 6: 12.59; df 7: 14.07; df 8: 15.51; df 9: 16.92; df 10: 18.31

χ²_calculated = 8.966
χ²_critical (df = 3) = 7.81
χ²_calculated > χ²_critical, so the null hypothesis is rejected at the 5%
significance level, i.e. there is evidence that the workers do differ in their reliability.
6. CONCLUSIONS FROM
SIGNIFICANCE TESTS
Significance Tests - Conclusions
What are the conclusions which may be drawn from
significance test?
Ho , µ = 3.0 %
The solid line shows the sampling
distribution of the mean if the null
hypothesis is true. This sampling
distribution has mean 3.0 and s.e.m.
= 0.03 / √4 %. If the sample mean lies
above the indicated critical value, Xc,
the null hypothesis is rejected.
Thus the shaded region, with area 0.05,
represents the probability of a Type 1
error.
Significance Tests - Conclusions
H1 , µ = 3.05 %
The broken line shows the sampling
distribution of the mean if the
alternative hypothesis is true. Even if
this is the case, the null hypothesis
will be retained if the sample mean
lies below Xc. The probability of this
Type 2 error is represented by the
hatched area.
Significance Tests - Conclusions
The diagram makes clear the
interdependence of the two types of
error. If, for example, the significance
level is changed to P = 0.01 in order
to reduce the risk of a Type 1 error, Xc
will be increased and the risk of a
Type 2 error is also increased.
Conversely, a decrease in the risk of
Type 2 error can only be achieved at
the expense of an increase in the
probability of Type 1 error.
Significance Tests - Conclusions
The only way in which both errors can
be reduced is by increasing the sample
size. The effect of increasing n to 9,
for example, is illustrated in the
figure: the resultant decrease
in the standard error of the mean
produces a decrease in both types of
error for a given value of Xc.
Significance Tests - Conclusions
The probability that a false null hypothesis is
rejected is known as the POWER of the test;
that is, the power of a test is 1 − (the
probability of a Type 2 error).
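The worked example above (μ₀ = 3.00 %, μ₁ = 3.05 %, σ = 0.03 %, n = 4, one-tailed P = 0.05) can be sketched to show how β and the power follow from the critical value Xc; the numerical results below are computed, not quoted from the source:

```python
import math
from statistics import NormalDist

# Type 1 / Type 2 error trade-off for the example above:
# H0: mu = 3.00 %, H1: mu = 3.05 %, sigma = 0.03 %, n = 4.
mu0, mu1, sigma, n = 3.00, 3.05, 0.03, 4
sem = sigma / math.sqrt(n)                 # standard error of the mean, 0.015

alpha = 0.05
x_c = mu0 + NormalDist().inv_cdf(1 - alpha) * sem   # one-tailed critical value
beta = NormalDist(mu1, sem).cdf(x_c)       # P(retain H0 | H1 true), Type 2 error
power = 1 - beta
print(round(x_c, 4), round(beta, 3), round(power, 3))
```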
PART V
OUTLIERS
Outliers – Dixon’s Test