Basics of Statistics
Types of data
Normal distribution
Sample size selection
Hypothesis testing
Introduction to statistical tests: z-test and t-test
Types of data
Ordinal data
Ordinal data is also a type of categorical data, but here the categories are ordered logically: the values can be ranked in order of magnitude, so one can say definitely that one measurement is equal to, less than, or greater than another. Most of the scores and scales used in research are ordinal, for example rating scores/scales for the color, taste, smell, or ease of application of products. Ordinal data are often presented in the form of contingency tables, such as 2 × 2 tables.
Interval data
Interval data has a meaningful order and also has the quality that equal intervals between measurements
represent equal changes in the quantity of whatever is being measured. But these types of data have no
natural zero. An example is the Celsius scale of temperature: since the Celsius scale has no natural zero, we cannot say that 70 °C is twice as hot as 35 °C. On an interval scale, the zero point can be chosen arbitrarily. IQ test scores are also interval data, as they have no natural zero.
Ratio data
Ratio data has all the qualities of interval data (natural order, equal intervals) plus a natural zero point. This
type of data is the one used most frequently in practice. Examples of ratio data are height, weight, length, etc. With this type of data, it can be said meaningfully that 10 m is twice as long as 5 m, and this ratio holds true regardless of which scale the object is measured in (e.g., meters or yards). The reason for this is the presence of a natural zero.
There are various methods for checking whether data follow a normal distribution: plotting a histogram, plotting a box-and-whisker plot, plotting a Q-Q plot, measuring skewness and kurtosis, or using a formal statistical test for normality (Kolmogorov-Smirnov test, Shapiro-Wilk test, etc.). The formal tests, such as Kolmogorov-Smirnov and Shapiro-Wilk, are used frequently to check the distribution of data. All of these tests are based on the null hypothesis that the data are drawn from a population that follows the normal distribution; the P value is examined to assess the alpha error.
If the P value is less than 0.05, the data do not follow the normal distribution and a nonparametric test should be used for such data. The smaller the sample size, the greater the chance of a non-normal distribution.
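Skewness and kurtosis, two of the informal checks listed above, can be sketched with Python's standard library alone. The helper names and the interpretation comment are ours, not from the text; a formal test such as Shapiro-Wilk should still be used for an actual decision.

```python
# Informal normality screen: sample skewness and excess kurtosis.
# Values near 0 are consistent with a normal distribution; this is a
# rough rule of thumb only, not a substitute for a formal test.
from statistics import fmean, pstdev

def skewness(data):
    m, s, n = fmean(data), pstdev(data), len(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def excess_kurtosis(data):
    # Excess kurtosis: the normal distribution itself scores 0.
    m, s, n = fmean(data), pstdev(data), len(data)
    return sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3

sample = [4.1, 4.9, 5.0, 5.2, 5.8, 6.0, 6.1, 6.9]
print(skewness(sample), excess_kurtosis(sample))
```

Note that these use the population (divide-by-n) formulas for simplicity; statistical packages often apply small-sample corrections.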
Frequency Distribution
A frequency distribution is a curve that gives us the frequency of occurrence of a particular data point in an experiment. This is usually the limit of a histogram of frequencies when the number of data points is very large and the results can be treated as varying continuously instead of taking on discrete values.
No. of heads | Value of the four coin flips                          | Total number of ways of getting that many heads
0            | T-T-T-T                                               | 1
1            | H-T-T-T, T-H-T-T, T-T-H-T, T-T-T-H                    | 4
2            | H-H-T-T, H-T-H-T, H-T-T-H, T-H-H-T, T-H-T-H, T-T-H-H  | 6
3            | T-H-H-H, H-T-H-H, H-H-T-H, H-H-H-T                    | 4
4            | H-H-H-H                                               | 1
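The "ways" column of the coin-flip table is the binomial coefficient C(4, k), which can be checked in one line:

```python
# Number of ways to get k heads in 4 coin flips, for k = 0..4,
# is the binomial coefficient C(4, k).
from math import comb

ways = [comb(4, k) for k in range(5)]
print(ways)  # → [1, 4, 6, 4, 1]
```

The counts sum to 2^4 = 16, the total number of equally likely flip sequences.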
The normal distribution occurs very frequently in statistics, economics, and the natural and social sciences, and can be used to approximate many distributions occurring in nature and in the man-made world.
For example, the height of all people of a particular race, the length of all dogs of a
particular breed, IQ, memory and reading skills of people in a general population and
income distribution in an economy all approximately follow the normal probability
distribution shaped like a bell curve.
The theory of normal distribution also finds use in advanced sciences like astronomy,
photonics and quantum mechanics.
The normal distribution can be characterized by the mean and standard deviation. The
mean determines where the peak occurs, which is at 0 in our figure for all the curves.
The standard deviation is a measure of the spread of the normal probability
distribution, which can be seen as differing widths of the bell curves in our figure.
The Formula
The mean is generally represented by μ and the standard deviation by σ. For a perfect normal distribution, the mean, median and mode are all equal. The normal distribution function can be written in terms of the mean and standard deviation as follows:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
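The density formula translates directly into code (the helper name `normal_pdf` is ours):

```python
# Normal probability density: f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# The peak sits at the mean; for the standard normal it is 1/sqrt(2*pi).
print(round(normal_pdf(0), 4))  # → 0.3989
```

The curve is symmetric about μ, which is why the mean, median and mode coincide.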
Hypothesis Testing
The null and alternative hypothesis
The null hypothesis is essentially the "devil's advocate" position: it assumes that whatever you are trying to prove did not happen (hint: it usually states that something equals zero).
For example, the two different teaching methods did not result in different
exam performances (i.e., zero difference). Another example might be that there
is no relationship between anxiety and athletic performance (i.e., the slope is
zero).
The alternative hypothesis states the opposite and is usually the hypothesis you are trying to prove (e.g., the two different teaching methods did result in different exam performances). Initially, you can state these hypotheses in more general terms (e.g., using terms like "effect", "relationship", etc.) before making them precise, as in the teaching methods example.
t-score (n ≤ 30), dividing the constant 300 by n and by n − 1:

n  | n-1 | 300/n | 300/(n-1) | diff
4  | 3   | 75.0  | 100.0     | 25.0
5  | 4   | 60.0  | 75.0      | 15.0
6  | 5   | 50.0  | 60.0      | 10.0
7  | 6   | 42.9  | 50.0      | 7.1
8  | 7   | 37.5  | 42.9      | 5.4
9  | 8   | 33.3  | 37.5      | 4.2
10 | 9   | 30.0  | 33.3      | 3.3
11 | 10  | 27.3  | 30.0      | 2.7
12 | 11  | 25.0  | 27.3      | 2.3
13 | 12  | 23.1  | 25.0      | 1.9
14 | 13  | 21.4  | 23.1      | 1.6
15 | 14  | 20.0  | 21.4      | 1.4
16 | 15  | 18.8  | 20.0      | 1.3
17 | 16  | 17.6  | 18.8      | 1.1
18 | 17  | 16.7  | 17.6      | 1.0
19 | 18  | 15.8  | 16.7      | 0.9
20 | 19  | 15.0  | 15.8      | 0.8
21 | 20  | 14.3  | 15.0      | 0.7
22 | 21  | 13.6  | 14.3      | 0.6
23 | 22  | 13.0  | 13.6      | 0.6
24 | 23  | 12.5  | 13.0      | 0.5
25 | 24  | 12.0  | 12.5      | 0.5
26 | 25  | 11.5  | 12.0      | 0.5
27 | 26  | 11.1  | 11.5      | 0.4
28 | 27  | 10.7  | 11.1      | 0.4
29 | 28  | 10.3  | 10.7      | 0.4
30 | 29  | 10.0  | 10.3      | 0.3
z-score (n > 30), dividing the constant 300 by n and by n − 1:

n  | n-1 | 300/n | 300/(n-1) | diff
31 | 30  | 9.7   | 10.0      | 0.3
32 | 31  | 9.4   | 9.7       | 0.3
33 | 32  | 9.1   | 9.4       | 0.3
34 | 33  | 8.8   | 9.1       | 0.3
35 | 34  | 8.6   | 8.8       | 0.3
36 | 35  | 8.3   | 8.6       | 0.2
37 | 36  | 8.1   | 8.3       | 0.2
38 | 37  | 7.9   | 8.1       | 0.2
39 | 38  | 7.7   | 7.9       | 0.2
40 | 39  | 7.5   | 7.7       | 0.2
41 | 40  | 7.3   | 7.5       | 0.2
42 | 41  | 7.1   | 7.3       | 0.2
43 | 42  | 7.0   | 7.1       | 0.2
44 | 43  | 6.8   | 7.0       | 0.2
45 | 44  | 6.7   | 6.8       | 0.2
46 | 45  | 6.5   | 6.7       | 0.1
47 | 46  | 6.4   | 6.5       | 0.1
48 | 47  | 6.3   | 6.4       | 0.1
49 | 48  | 6.1   | 6.3       | 0.1
50 | 49  | 6.0   | 6.1       | 0.1
51 | 50  | 5.9   | 6.0       | 0.1
52 | 51  | 5.8   | 5.9       | 0.1
53 | 52  | 5.7   | 5.8       | 0.1
54 | 53  | 5.6   | 5.7       | 0.1
55 | 54  | 5.5   | 5.6       | 0.1
56 | 55  | 5.4   | 5.5       | 0.1
57 | 56  | 5.3   | 5.4       | 0.1
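The tables can be regenerated in a few lines. Our reading of them (an interpretation, not stated verbatim in the text) is that they show how little dividing by n versus n − 1 matters once n exceeds about 30:

```python
# Recompute the 300/n vs 300/(n-1) columns for a few sample sizes
# drawn from both tables; the gap shrinks rapidly as n grows.
for n in (4, 10, 30, 57):
    a, b = 300 / n, 300 / (n - 1)
    print(n, round(a, 1), round(b, 1), round(b - a, 1))
```

For n = 4 the difference is 25.0; by n = 30 it is only 0.3, and beyond that it is negligible.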
In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is
wrongly rejected.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on
average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.
A type I error would occur if we concluded that the two drugs produced different effects when in fact there was no
difference between them.
The following table gives a summary of possible results of any hypothesis test:
A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error.
The hypothesis test procedure is therefore adjusted so that there is a guaranteed 'low' probability of
rejecting the null hypothesis wrongly; this probability is never 0. This probability of a type I error can be
precisely computed as
P(type I error) = significance level = Alpha.
The exact probability of a type II error is generally unknown.
If we do not reject the null hypothesis, it may still be false (a type II error), as the sample may not be big enough to identify the falseness of the null hypothesis (especially if the truth is very close to the hypothesis).
For any given set of data, type I and type II errors are inversely related; the smaller the risk of one, the higher the
risk of the other.
               | Decision: Reject H0 | Decision: Don't reject H0
Truth: H0 true | Type I Error        | Right decision
Truth: H1 true | Right decision      | Type II Error

A type I error can also be referred to as an error of the first kind.
Symbol | Meaning                                                           | Example
x̄      | sample mean (average / arithmetic mean)                           | x̄ = (2 + 5 + 9) / 3 = 5.333
s²     | sample variance (estimator of the population variance)            | s² = 4
s      | sample standard deviation (estimator of the population SD)        | s = 2
σ²     | population variance (variance of population values)               | σ² = 4
σ_X    | population standard deviation of the random variable X            | σ_X = 2
Degrees of freedom:
For the z-test, degrees of freedom are not required, since the fixed z-scores of 1.96 and 2.58 are used for the 5% and 1% levels respectively.
For the two-sample t-test (equal or unequal variances): df = (n1 + n2) − 2
For the paired-sample t-test: df = number of pairs − 1
Degrees of freedom, definition: the number of degrees of freedom generally refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from the sample data.
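The degrees-of-freedom rules above, in code (the function names are ours):

```python
# Degrees of freedom for the two t-test variants described above.
def df_two_sample(n1, n2):
    # Two independent samples: (n1 + n2) - 2
    return n1 + n2 - 2

def df_paired(n_pairs):
    # Paired samples: number of pairs - 1
    return n_pairs - 1

print(df_two_sample(25, 20))  # → 43
print(df_paired(10))          # → 9
```

For example, comparing two independent samples of 25 and 20 observations leaves 43 degrees of freedom.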
t-distribution
Statistical tests
t-test
- Two-sample t-test: matched-sample t-test (paired), i.e., before and after means.
- Two-sample t-test: independent-sample t-test (random), i.e., two sample means are compared.
z-test
- Two-sample z-test: matched-sample z-test (paired), i.e., before and after means.
- Two-sample z-test: independent-sample z-test (random), i.e., two sample means are compared.
- Two-sample z-test for proportions (paired), i.e., before and after proportions.
- Two-sample z-test for proportions (independent, random), i.e., two sample proportions are compared.
A retailer wants to compare the sales of his two stores to see whether they performed differently last quarter. Store A had 25 products with an average sale of 70K/month and a standard deviation of 15K. Store B had 20 products with an average sale of 74K/month and a standard deviation of 25K. Did these two stores perform differently? Use the 5% significance level.
1. State the hypotheses:
H0: Store A = Store B
H1: Store A ≠ Store B
2. State the alpha: α = 0.05
3. State the decision rule: if z is less than −1.96 or greater than 1.96, reject the null hypothesis.
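The store comparison can be worked through directly. The two-sample z statistic below uses the standard formula (difference of means over the standard error of the difference); the helper name is ours.

```python
# Two-sample z statistic for the store example:
# means 70 and 74, SDs 15 and 25, n = 25 and 20 (all in thousands).
from math import sqrt

def two_sample_z(m1, s1, n1, m2, s2, n2):
    # z = (m1 - m2) / sqrt(s1^2/n1 + s2^2/n2)
    return (m1 - m2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

z = two_sample_z(70, 15, 25, 74, 25, 20)
print(round(z, 2))  # → -0.63
```

Since |z| ≈ 0.63 is well inside ±1.96, by this calculation we fail to reject the null hypothesis at the 5% level: the two stores did not perform significantly differently.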
There was a significant difference in the effectiveness between the medication group
and the placebo group.
1. What is the probability that a tablet has paracetamol content greater than 195?
2. What is the probability that a tablet has paracetamol content greater than 205?
3. What paracetamol content level corresponds to the 75th percentile?
4. What is the probability that a tablet has paracetamol content between 195 and 215?
5. If a sample of 100 tablets is selected, what is the probability that the mean paracetamol content level is greater than 200?
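The excerpt does not state the tablets' mean and standard deviation, so the values mu = 205 and sigma = 10 below are HYPOTHETICAL, chosen only to illustrate the mechanics of each question with Python's `statistics.NormalDist`:

```python
# HYPOTHETICAL parameters: mu = 205, sigma = 10 are assumptions for
# illustration only; the original problem's values are not given here.
from statistics import NormalDist
from math import sqrt

tablet = NormalDist(mu=205, sigma=10)

p_gt_195  = 1 - tablet.cdf(195)                 # Q1: P(X > 195)
p_gt_205  = 1 - tablet.cdf(205)                 # Q2: P(X > 205), 0.5 at the mean
pct_75    = tablet.inv_cdf(0.75)                # Q3: 75th percentile
p_between = tablet.cdf(215) - tablet.cdf(195)   # Q4: P(195 < X < 215)

# Q5: the mean of n = 100 tablets has standard deviation sigma / sqrt(n).
mean_100 = NormalDist(mu=205, sigma=10 / sqrt(100))
p_mean_gt_200 = 1 - mean_100.cdf(200)

print(p_gt_195, p_gt_205, pct_75, p_between, p_mean_gt_200)
```

Question 5 illustrates the key point: the sampling distribution of the mean is much narrower than the distribution of individual tablets, so the z formula below is applied with the standard error σ/√n in the denominator.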
z = (observation − mean) / standard deviation