
- Basics of Statistics.
- Applications of Statistics in solving industrial dilemmas.
- Basics of Operations Research.
- Applications of OR in solving industrial dilemmas.
- Some more industry examples.

Basics of Statistics

Types of data
Normal distribution
Sample size selection
Hypothesis testing
Introduction to statistical tests: the z-test and t-test.

Types of data

Nominal data/Categorical data

In these kinds of data, observations are given a particular name: for example, a person is observed to be 'male' or 'female', or a drug is named as generic or brand. Nominal data cannot be measured or ordered, but they can be counted. These data are considered categorical data in which the order of the categories is meaningless. Data that consist of two classes, like male/female or dead/alive, are called binomial data, and those that consist of more than two classes, like tablet/capsule/syrup, are known as multinomial data. Data of these types are usually presented in the form of contingency tables, such as 2 × 2 tables.

Ordinal data
Ordinal data is also a type of categorical data, but here the categories are ordered logically. These data can be ranked in order of magnitude: one can say definitely that one measurement is equal to, less than, or greater than another. Most of the scores and scales used in research fall under ordinal data, for example rating scores/scales for the color, taste, smell, or ease of application of products.

Interval data
Interval data has a meaningful order and also has the quality that equal intervals between measurements represent equal changes in the quantity being measured. However, these types of data have no natural zero. An example is the Celsius scale of temperature: because the Celsius scale has no natural zero, we cannot say that 70 °C is twice as hot as 35 °C. On an interval scale, the zero point can be chosen arbitrarily. IQ test scores are also interval data, as they have no natural zero.
Ratio data
Ratio data has all the qualities of interval data (natural order, equal intervals) plus a natural zero point. This type of data is used most frequently; examples of ratio data are height, weight, length, etc. With this type of data, it can be said meaningfully that 10 m of length is twice 5 m. This ratio holds true regardless of the scale in which the object is measured (e.g., meters or yards), because of the presence of the natural zero.

Do Our Data Follow the Normal Distribution or Not?

This is the second prerequisite for selecting an appropriate statistical test. If you know the type of data (nominal, ordinal, interval, or ratio) and the distribution of the data (normal or not normal), selecting a statistical test becomes very easy. There is no need to check the distribution in the case of ordinal and nominal data; the distribution should only be checked for ratio and interval data.
If your data follow the normal distribution, a parametric statistical test should be used; nonparametric tests should only be used when the normal distribution is not followed.

There are various methods for checking normality, among them plotting a histogram, plotting a box-and-whisker plot, plotting a Q-Q plot, measuring skewness and kurtosis, and using a formal statistical test for normality (Kolmogorov-Smirnov test, Shapiro-Wilk test, etc.). Formal statistical tests like Kolmogorov-Smirnov and Shapiro-Wilk are used frequently to check the distribution of data. All these tests are based on the null hypothesis that the data are taken from a population that follows the normal distribution; the P value is examined to judge the alpha error.
If the P value is less than 0.05, the data do not follow the normal distribution, and a nonparametric test should be used on that kind of data. If the sample size is small, the chances of a non-normal distribution increase.
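As a minimal sketch of how such a normality check might look in practice, using Python's scipy.stats with made-up sample data (the sample values and size are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=40)  # hypothetical sample

# Shapiro-Wilk test: H0 = data come from a normally distributed population
stat, p_value = stats.shapiro(data)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("Reject H0: data do not appear normal -> use a nonparametric test")
else:
    print("Fail to reject H0: normality is plausible -> a parametric test is fine")
```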

Which statistical test to use when

Comparison of one group to a hypothetical value:
- Nominal data: chi-square test or binomial test.
- Ratio/interval data (normally distributed): one-sample t-test or one-sample z-test.
Comparison of two groups (paired or unpaired):
- Paired, ratio/interval data (normally distributed): paired t-test / paired z-test / one-way ANOVA.
- Unpaired, ratio/interval data (normally distributed): unpaired t-test / unpaired z-test; for nominal data, Fisher's exact test, i.e., the chi-square test for large samples.
Comparison of more than two groups:
- Ratio/interval data (normally distributed): repeated-measures ANOVA; chi-square test for nominal data.

There are two types of chi-square tests: the goodness-of-fit test (does a coin tossed 100 times turn up heads 50 times and tails 50 times?) and the test of independence (is there a relationship between gender and a perfect SAT score?).
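A minimal sketch of the goodness-of-fit flavor, applied to the coin example above (the observed counts of 58 heads and 42 tails are hypothetical):

```python
from scipy import stats

# Hypothetical outcome of 100 coin tosses: 58 heads, 42 tails
observed = [58, 42]
expected = [50, 50]  # fair-coin expectation

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
# If p < 0.05, the observed counts deviate significantly from a fair coin.
```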

Statistical Hypothesis Testing


Statistical hypothesis testing is used to determine whether an experiment provides enough evidence to reject a proposition.
Suppose you want to study the effect of smoking on the occurrence of lung cancer. If you take a small group, it may happen that no correlation appears at all: you may find many smokers with healthy lungs and many non-smokers with lung cancer.
However, this can happen purely by chance, while in the overall population it isn't true. To remove this element of chance and increase the reliability of our conclusions, we use statistical hypothesis testing.
Here, you first assume a hypothesis that smoking and lung cancer are unrelated. This is called the 'null hypothesis', which is central to any statistical hypothesis testing.
You should then choose a distribution for the experimental group. The normal distribution is one of the most common distributions encountered in nature, but it can differ in special cases.

Confidence Interval in Statistics


In statistics, a confidence interval expresses the amount of error that is allowed for in the statistical data and analysis.
Typical confidence levels:
- One-tailed: > .95 (> 95%), > .99, > .999
- Two-tailed: > .975
Suppose a survey shows that 34% of the people vote for Candidate A. The confidence that this result is accurate for the whole group can never be 100%; for that, the survey would need to cover the entire group. Therefore, if you are looking at, say, a 95% confidence interval, the final result could lie between 30% and 38%. If you want a higher confidence level, say 99%, then the uncertainty in the result increases, say to 28-40%.
In normal statistical analysis, the confidence interval tells us the reliability of the sample mean as an estimate of the population mean.
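The 30-38% range above corresponds to a particular sample size. A minimal sketch of the standard normal-approximation interval for a proportion, assuming a hypothetical n = 550 (which reproduces a margin of about 4 percentage points):

```python
import math

# Hypothetical survey: 34% of n = 550 respondents favor Candidate A
p_hat, n = 0.34, 550
z = 1.96  # two-sided 95% confidence

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"95% CI: {p_hat - margin:.3f} to {p_hat + margin:.3f}")
# With n = 550 this gives roughly 0.30 to 0.38, matching the example above.
```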

Frequency Distribution
Frequency distribution is a curve that gives the frequency of occurrence of a particular data point in an experiment. It is usually the limit of a histogram of frequencies when the number of data points is very large and the results can be treated as varying continuously instead of taking on discrete values.

For example, consider a fair coin that is tossed four times. We want to derive the frequency distribution for the number of heads that can occur. The different possibilities through which these heads might occur are summarized in the table below.
The frequency distribution is easy to see: on average, if the number of flips is very high, then out of every 16 sets of four coin flips, 1 will end up with 0 heads, 4 will end up with 1 head, 6 will end up with 2 heads, 4 will end up with 3 heads, and 1 will end up with all 4 heads. This of course assumes that the coin used for the experiment is fair, with an equal probability of a head and a tail on any given flip.
In the above case, the coin is flipped only 4 times per set. If the coin is tossed many more times, say 100 times, and the frequency distribution drawn, it will look very much like a normal probability distribution in shape.

No. of heads | Value of the four coin flips                          | Total number of ways of getting that many heads
0            | T-T-T-T                                               | 1
1            | H-T-T-T, T-H-T-T, T-T-H-T, T-T-T-H                    | 4
2            | H-H-T-T, H-T-H-T, H-T-T-H, T-H-H-T, T-H-T-H, T-T-H-H  | 6
3            | T-H-H-H, H-T-H-H, H-H-T-H, H-H-H-T                    | 4
4            | H-H-H-H                                               | 1
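A short sketch that reproduces the table's counts from the binomial formula (math.comb is standard-library Python):

```python
import math

n = 4  # four flips of a fair coin
for k in range(n + 1):
    ways = math.comb(n, k)   # number of sequences with exactly k heads
    prob = ways / 2**n       # each of the 16 sequences is equally likely
    print(f"{k} heads: {ways} ways, probability {prob:.4f}")
# Output counts: 1, 4, 6, 4, 1 -- matching the table above.
```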

For sufficiently large sample sizes, it can be shown mathematically, through the central limit theorem, that the distribution of the sample mean is approximately a normal distribution. In such a case, the 95% confidence level occurs at an interval of 1.96 times the standard deviation around the mean.
The normal probability distribution, also called the Gaussian distribution, refers to a family of distributions that are bell-shaped. These are symmetric and peak at the mean, with the probability density decreasing smoothly on either side of the mean, as shown in the figure below.
[Figure: bell curves. μ = mean of the population, σ = standard deviation]

Normal distribution occurs very frequently in statistics, economics, natural and social
sciences and can be used to approximate many distributions occurring in nature and in
the manmade world.

For example, the height of all people of a particular race, the length of all dogs of a
particular breed, IQ, memory and reading skills of people in a general population and
income distribution in an economy all approximately follow the normal probability
distribution shaped like a bell curve.

The theory of normal distribution also finds use in advanced sciences like astronomy,
photonics and quantum mechanics.

The normal distribution can be characterized by the mean and standard deviation. The
mean determines where the peak occurs, which is at 0 in our figure for all the curves.
The standard deviation is a measure of the spread of the normal probability
distribution, which can be seen as differing widths of the bell curves in our figure.
The Formula
The mean is generally represented by μ and the standard deviation by σ. For a perfect normal distribution, the mean, median, and mode are all equal. The normal distribution function can be written in terms of the mean and standard deviation as follows:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Hypothesis Testing
The null and alternative hypothesis
The null hypothesis is essentially the "devil's advocate" position. That is, it
assumes that whatever you are trying to prove did not happen (hint: it usually states that something equals zero).
For example, the two different teaching methods did not result in different
exam performances (i.e., zero difference). Another example might be that there
is no relationship between anxiety and athletic performance (i.e., the slope is
zero).
The alternative hypothesis states the opposite and is usually the hypothesis you
are trying to prove (e.g., the two different teaching methods did result in
different exam performances). Initially, you can state these hypotheses in more
general terms (e.g., using terms like "effect", "relationship", etc.), as shown
below for the teaching methods example:

Some more concepts

- Standard deviation.
- Type I error and Type II error.
- Concept of degrees of freedom.
- Statistically significant results.
- One-tailed and two-tailed tests.

The standard deviation is the square root of the variance.
Variance is a measure of how spread out a data set is.
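A tiny sketch of these two quantities in code, using a made-up data set (the sample variance uses the n − 1 denominator discussed below):

```python
import numpy as np

data = np.array([2.0, 5.0, 9.0])     # hypothetical data set
variance = data.var(ddof=1)          # sample variance, divides by n - 1
std_dev = np.sqrt(variance)          # standard deviation = sqrt(variance)
print(f"variance = {variance:.3f}, std dev = {std_dev:.3f}")
```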

Concept of degrees of freedom

Degrees of freedom = sample size − 1.

The tables below illustrate why dividing by n versus n − 1 matters: they compare 300/n with 300/(n − 1) as n grows. For small n (the t-score range), the two denominators give very different results; by the time n passes 30 (the z-score range), the difference is negligible, which is why the t-distribution converges to the normal distribution.

t-score range (small samples):

n  | n-1 | 300/n | 300/(n-1) | diff
4  | 3   | 75.0  | 100.0     | 25.0
5  | 4   | 60.0  | 75.0      | 15.0
6  | 5   | 50.0  | 60.0      | 10.0
10 | 9   | 30.0  | 33.3      | 3.3
15 | 14  | 20.0  | 21.4      | 1.4
20 | 19  | 15.0  | 15.8      | 0.8
30 | 29  | 10.0  | 10.3      | 0.3

z-score range (large samples):

n  | n-1 | 300/n | 300/(n-1) | diff
31 | 30  | 9.7   | 10.0      | 0.3
40 | 39  | 7.5   | 7.7       | 0.2
50 | 49  | 6.0   | 6.1       | 0.1
57 | 56  | 5.3   | 5.4       | 0.1

(The original slides tabulate every n from 4 to 57; the intermediate rows follow the same steadily shrinking pattern.)

Concept of degrees of freedom


Degrees of freedom are used to determine whether a particular null hypothesis can be rejected based on the number of variables and samples in the experiment. For example, while a sample size of 50 students might not be large enough to obtain significant information, obtaining the same results from a study of 500 samples can be judged as valid.

In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is
wrongly rejected.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on
average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.
A type I error would occur if we concluded that the two drugs produced different effects when in fact there was no
difference between them.
The following table gives a summary of possible results of any hypothesis test:
A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error.
The hypothesis test procedure is therefore adjusted so that there is a guaranteed 'low' probability of
rejecting the null hypothesis wrongly; this probability is never 0. This probability of a type I error can be
precisely computed as
P(type I error) = significance level = Alpha.
The exact probability of a type II error is generally unknown.
If we do not reject the null hypothesis, it may still be false (a type II error), as the sample may not be big enough to identify the falseness of the null hypothesis (especially if the truth is very close to the hypothesis).
For any given set of data, type I and type II errors are inversely related; the smaller the risk of one, the higher the
risk of the other.
Decision summary:

Truth      | Reject H0       | Don't reject H0
H0 (true)  | Type I error    | Right decision
H1 (true)  | Right decision  | Type II error

Outcome 1: We reject the null hypothesis when it is false: good.
Outcome 2: We reject the null hypothesis when, in reality, it is true: Type I error.
Outcome 3: We retain the null hypothesis when, in reality, it is false: Type II error.
Outcome 4: We retain the null hypothesis when it is true: good.

Statistically Significant Results


For most disciplines, the researcher looks for a significance level of 0.05,
signifying that there is only a 5% probability that the observed results and
trends occurred by chance.
For some scientific disciplines, the required level is 0.01: only a 1% probability that the observed patterns occurred due to chance or error. Whatever the level, the significance level determines whether the null hypothesis is rejected in favor of the alternative, a crucial part of hypothesis testing.

The structure of hypothesis testing


1. Define the research hypothesis for the study.
2. Explain how you are going to operationalize (that is, measure or operationally define) what you are studying and set out the variables to be studied.
3. Set out the null and alternative hypotheses (or more than one hypothesis; in other words, a number of hypotheses).
4. Set the significance level.
5. Make a one- or two-tailed prediction.
6. Determine whether the distribution that you are studying is normal (this has implications for the types of statistical tests that you can run on your data).
7. Select an appropriate statistical test based on the variables you have defined and whether the distribution is normal or not.
8. Run the statistical tests on your data and interpret the output.
9. Reject or fail to reject the null hypothesis.

t-test and z-test


One-sample t-tests are used to compare a sample mean with a known population mean.
Two-sample t-tests, on the other hand, are used to compare either independent samples or dependent samples. [They suit limited sample sizes (n < 30), as long as the variables are approximately normally distributed and the variation of scores in the two groups is not reliably different.]
The t-test is also a good choice if you do not know the population's standard deviation. If the standard deviation is known, it would be best to use another type of statistical test, the z-test.
Z-tests are typically applied to large samples (n > 30).

Difference between Z-test, F-test, and T-test


A z-test is used for testing the mean of a population versus a standard, or comparing the means of two populations, with large samples (n ≥ 30), whether or not you know the population standard deviation. It is also used for testing the proportion of some characteristic versus a standard proportion, or comparing the proportions of two populations.
Example: Comparing the average engineering salaries of men versus women.
Example: Comparing the fractions defective from two production lines.
A t-test is used for testing the mean of one population against a standard, or comparing the means of two populations, if you do not know the population standard deviation and you have a limited sample (n < 30). If you know the population standard deviation, you may use a z-test.
Example: Measuring the average diameter of shafts from a certain machine when you have a small sample.
An F-test is used to compare two populations' variances. The samples can be any size. It is the basis of ANOVA.
Example: Comparing the variability of bolt diameters from two machines.
A matched-pair test is used to compare the means before and after something is done to the samples. A t-test is often used because the samples are often small; a z-test is used when the samples are large. The variable analyzed is the difference between the before and after measurements.
Example: The average weight of subjects before and after following a diet for 6 weeks.
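As a hedged sketch of the F-test idea for the bolt-diameter example (the data are made up, and scipy has no one-call two-sample F-test, so the variance ratio and p-value are computed directly from the F distribution):

```python
import numpy as np
from scipy import stats

# Hypothetical bolt diameters from two machines
machine_a = np.array([10.1, 10.3, 9.8, 10.0, 10.2, 9.9])
machine_b = np.array([10.5, 9.4, 10.8, 9.2, 10.6, 9.5])

f_stat = machine_a.var(ddof=1) / machine_b.var(ddof=1)  # ratio of sample variances
df1, df2 = len(machine_a) - 1, len(machine_b) - 1

# Two-sided p-value from the F distribution
p_one_side = stats.f.cdf(f_stat, df1, df2)
p_value = 2 * min(p_one_side, 1 - p_one_side)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```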

Common notation:

Symbol | Meaning                                                                    | Example
x̄      | sample mean (average / arithmetic mean)                                    | x̄ = (2 + 5 + 9) / 3 = 5.333
s²     | sample variance (estimator of the population variance)                     | s² = 4
s      | sample standard deviation (estimator of the population standard deviation) | s = 2
σ²     | population variance (variance of population values)                        | σ² = 4
σ      | population standard deviation (standard deviation of random variable X)    | σX = 2

Degrees of freedom:
- For the z-test, degrees of freedom are not required, since fixed z-scores of 1.96 and 2.58 are used for the 5% and 1% levels respectively.
- For equal- and unequal-variance t-tests: df = (n1 + n2) − 2.
- For the paired-sample t-test: df = number of pairs − 1.
Definition: the number of degrees of freedom generally refers to the number of independent observations in a sample minus the number of population parameters that must be estimated from sample data.


t-distribution

1.96 is the number of standard deviations away from the mean that bounds the central 95% of a normal distribution.
The t-distribution approaches the normal distribution at large sample sizes.
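A quick sketch that makes this convergence concrete, comparing two-tailed 5% critical values of t against the z value of 1.96 (degrees-of-freedom values are chosen arbitrarily for illustration):

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-tailed 5% critical value, about 1.96
for df in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df = {df:4d}: t = {t_crit:.3f} (z = {z_crit:.3f})")
# t falls from about 2.571 at df = 5 toward 1.96 as df grows.
```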

Statistical tests
t-test
- One-sample t-test: population and sample data are compared.
- Two-sample t-test, matched/paired: before and after means are compared.
- Two-sample t-test, independent (random): two sample means are compared.
z-test
- One-sample z-test: population and sample data are compared.
- One-sample z-test for proportions.
- Two-sample z-test, matched/paired: before and after means are compared.
- Two-sample z-test, independent (random): two sample means are compared.
- Two-sample z-test for proportions, paired: before and after proportions are compared.
- Two-sample z-test for proportions, independent (random): two sample proportions are compared.

One-sample t-test

In the population, the average IQ is 100. A team of scientists wants to test a new medication to see whether it has a positive or negative effect on intelligence, or no effect at all. A sample of 30 participants who have taken the medication has a mean IQ of 140 with a standard deviation of 20.
Did the medication affect intelligence? Alpha is 0.05.

1. Define the null and alternative hypotheses.
2. State the alpha.
3. Calculate the degrees of freedom.
4. State the decision rule.
5. Calculate the test statistic.
6. State the results.
7. State the conclusion.

Worked steps (the sketch below carries out the calculation):
1. Define the hypotheses: H0: μ = 100; H1: μ ≠ 100.
2. State the alpha: α = 0.05.
3. Calculate the degrees of freedom: n − 1 = 30 − 1 = 29.
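A minimal sketch of the remaining steps from the summary statistics, using the standard one-sample t formula:

```python
import math
from scipy import stats

mu0, xbar, s, n = 100, 140, 20, 30   # population mean, sample mean, sd, size
alpha, df = 0.05, n - 1

t_stat = (xbar - mu0) / (s / math.sqrt(n))   # t = (x̄ - μ0) / (s / √n)
t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-tailed critical value
p_value = 2 * stats.t.sf(abs(t_stat), df)

print(f"t = {t_stat:.2f}, critical = ±{t_crit:.3f}, p = {p_value:.2e}")
# t ≈ 10.96 far exceeds ±2.045, so we reject H0: the medication affected IQ.
```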

Independent sample t-test


Here we compare two independent samples.

A retailer wants to compare the sales of his two stores, to see whether they performed differently last quarter. Store A had 25 products with an average sale of 70K/month and a standard deviation of 15K. Store B had 20 products with an average sale of 74K/month and a standard deviation of 25K. Did these two stores perform differently? Use a 5% significance level.

1. Define the null and alternative hypotheses.
2. State the alpha.
3. Calculate the degrees of freedom.
4. State the decision rule.
5. Calculate the test statistic.
6. State the results.
7. State the conclusion.

1. Define the null and alternative hypotheses:
H0: μA = μB
H1: μA ≠ μB
2. State the alpha: α = 0.05.
3. Calculate the degrees of freedom: n1 + n2 − 2 = 25 + 20 − 2 = 43. (The sketch below carries out the remaining steps.)
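A hedged sketch of the rest of the test from the summary statistics, using the pooled-variance (equal-variance) form, which matches the df = n1 + n2 − 2 rule above:

```python
import math
from scipy import stats

n1, xbar1, s1 = 25, 70.0, 15.0   # Store A (sales in thousands)
n2, xbar2, s2 = 20, 74.0, 25.0   # Store B
df = n1 + n2 - 2

# Pooled variance and standard error of the difference in means
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))

t_stat = (xbar1 - xbar2) / se
t_crit = stats.t.ppf(0.975, df)
print(f"t = {t_stat:.3f}, critical = ±{t_crit:.3f}")
# |t| ≈ 0.67 < 2.017 here, so we fail to reject H0 at the 5% level.
```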

Two-sample z-test (known population variances)

σ1² = 1.5, n1 = 60, x̄1 = 18
σ2² = 2, n2 = 60, x̄2 = 19
Are the two populations different?
1. H0: μ1 = μ2; H1: μ1 ≠ μ2.
2. State the alpha: α = 0.05.
3. Decision rule: if z is less than -1.96 or greater than 1.96, reject the null hypothesis.
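A short sketch of the z computation for this example (the population variances are known, so the z formula applies directly):

```python
import math
from scipy import stats

var1, n1, xbar1 = 1.5, 60, 18.0
var2, n2, xbar2 = 2.0, 60, 19.0

z = (xbar1 - xbar2) / math.sqrt(var1 / n1 + var2 / n2)
p_value = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.2f}, p = {p_value:.4f}")
# z ≈ -4.14 is well beyond ±1.96, so we reject H0: the populations differ.
```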

Example conclusion (from a medication-versus-placebo comparison): there was a significant difference in effectiveness between the medication group and the placebo group.

The paracetamol content in a certain brand of painkiller is normally distributed with a mean of 205 and a standard deviation of 40.
1. What is the probability that a tablet has a paracetamol content greater than 195?
2. What is the probability that a tablet has a paracetamol content greater than 205?
3. What paracetamol content level corresponds to the 75th percentile?
4. What is the probability that a tablet has a paracetamol content between 195 and 215?
5. If a sample of 100 tablets is selected, what is the probability that the mean paracetamol content is greater than 200?
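A sketch answering all five questions with scipy.stats.norm (question 5 uses the standard error σ/√n for the distribution of the sample mean):

```python
from scipy import stats

mu, sigma = 205, 40
tablet = stats.norm(mu, sigma)

print(f"1. P(X > 195) = {tablet.sf(195):.4f}")            # ≈ 0.5987
print(f"2. P(X > 205) = {tablet.sf(205):.4f}")            # = 0.5 (205 is the mean)
print(f"3. 75th percentile = {tablet.ppf(0.75):.1f}")     # ≈ 232.0
print(f"4. P(195 < X < 215) = {tablet.cdf(215) - tablet.cdf(195):.4f}")  # ≈ 0.1974
mean_100 = stats.norm(mu, sigma / 100**0.5)               # sampling distribution, n = 100
print(f"5. P(mean > 200) = {mean_100.sf(200):.4f}")       # ≈ 0.8944
```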

z=

Observation - mean
Standard deviation

Real-life application of z-scores occurs in usability testing.

[Figure: example data compared between 2004 and 2014]
