
BASIC STATS AND HYPOTHESIS TESTING

GOWTHAM K JAYACHANDIRAN
WHAT IS DATA SCIENCE AND WHY DO WE NEED IT?
Quite simply, it’s about using data to solve problems.

DEFINITELY NOT
THIS!!
THIS IS WHAT WE DO!!
SIMPLY!!
THE PROCESS
THE SCIENCE
WHY DO WE NEED A DATA SCIENTIST?
WHY DO WE NEED STATISTICS IN DATA SCIENCE?
“A Data Scientist is a person who is better at statistics than any programmer and
better at programming than any statistician.”

To become a successful Data Scientist you must know your basics. Math and Stats are the
building blocks of Machine Learning algorithms.

It is important to know the techniques behind various Machine Learning algorithms in order to
know how and when to use them.

Now the question arises, what exactly is Statistics?

“Statistics is a Mathematical Science pertaining to data collection, analysis,
interpretation and presentation”
CATEGORIES IN STATISTICS
Descriptive Statistics:

“Descriptive Statistics uses the data to provide descriptions of the population,
either through numerical calculations or graphs or tables”

Descriptive Statistics helps organize data and focuses on the characteristics of the data,
summarizing them as parameters.
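
As a minimal sketch of the "numerical calculations" mentioned above, the snippet below summarizes a small data set with Python's standard `statistics` module. The `scores` list is made-up illustration data, not taken from the slides.

```python
from statistics import mean, median, mode, stdev

# Hypothetical exam scores, invented for illustration only.
scores = [62, 75, 75, 81, 88, 90, 94]

# Descriptive statistics: a handful of numbers that describe the data.
summary = {
    "mean": mean(scores),      # arithmetic average
    "median": median(scores),  # middle value of the sorted data
    "mode": mode(scores),      # most frequent value
    "stdev": stdev(scores),    # sample standard deviation (n - 1 denominator)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```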
INFERENTIAL STATISTICS
“Inferential Statistics makes inferences and predictions about a population based on a
sample of data taken from the population in question”

Inferential statistics generalizes from a sample to the larger data set it was drawn from and
applies probability to arrive at a conclusion. It allows you to infer parameters of the population
based on sample statistics and to build models on them.
A population is a set of similar items or events.

A sample is a set of data collected and/or selected from a statistical population by a
defined procedure (a subset of the population).
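
A small sketch of the population/sample idea, assuming simple random sampling as the "defined procedure". The population here (the integers 1 to 1000) is a stand-in chosen only for illustration.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: the integers 1..1000.
population = list(range(1, 1001))

# Defined procedure (an assumption here): simple random sampling without replacement.
sample = random.sample(population, k=50)

population_mean = mean(population)  # the parameter we would like to know
sample_mean = mean(sample)          # the statistic we can actually compute

print(f"population mean: {population_mean}")
print(f"sample mean:     {sample_mean}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean; inferential statistics is about quantifying that gap.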
Numerical data is data to which you can apply aggregation functions (sum, average and so on)
and get results that make sense.
Categorical data is character data, or any other data for which aggregation functions give
results that make no sense.
BINOMIAL DISTRIBUTION

“A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or
lose, and where the probability of success and failure is the same for all the trials is called a Binomial
Distribution”

• Each trial is independent, since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss.
• An experiment with only two possible outcomes repeated n times is called binomial.

The properties of a Binomial Distribution are

1. Each trial is independent.
2. There are only two possible outcomes in a trial: either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is the same for all trials. (Trials are identical.)
When the probability of success does not equal the probability of failure, the graph of the binomial
distribution is skewed, with its peak shifted away from the center.

When the probability of success equals the probability of failure (p = 0.5), the graph of the binomial
distribution is symmetric about its center.
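
The properties listed for the binomial distribution lead to the standard probability formula P(X = k) = C(n, k) p^k (1 − p)^(n − k). The formula itself is not on the slides; the sketch below uses it to compare a fair case (p = 0.5) with a biased one (p = 0.2, a value picked only for illustration).

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent, identical trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10
fair = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]    # p(success) == p(failure)
biased = [binomial_pmf(k, n, 0.2) for k in range(n + 1)]  # p(success) != p(failure)

# Crude text plot: one '#' per percentage point of probability.
for k in range(n + 1):
    print(f"k={k:2d}  fair {'#' * round(fair[k] * 100):<26} biased {'#' * round(biased[k] * 100)}")
```

The fair column is symmetric around k = 5, while the biased column piles up near k = 2.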
NORMAL DISTRIBUTION
A normal distribution has the following characteristics:

1. The mean, median and mode of the distribution coincide.
2. The curve of the distribution is bell-shaped and symmetrical about the line x = μ.
3. Exactly half of the values are to the left of the center and the other half to the right.

A standard normal distribution is defined as the normal distribution with mean 0
and standard deviation 1.
EMPIRICAL RULE

• About 68% of the data falls within one standard deviation of the mean.
• About 95% falls within two standard deviations.
• About 99.7% falls within three standard deviations.
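
The three percentages of the empirical rule can be checked against the standard normal distribution. This is a quick sketch using `statistics.NormalDist` from the Python standard library; the helper name `coverage` is ours.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # standard normal: mean 0, sd 1

def coverage(k: float) -> float:
    """Share of a normal population lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.4f}")
```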
WHAT IS HYPOTHESIS?

• A hypothesis is a predictive statement, capable of being tested by scientific methods,
that relates an independent variable to some dependent variable.
• A hypothesis states what we are looking for, and it is a proposition
which can be put to a test to determine its validity.

e.g.
Students who receive counseling will show a greater increase in
creativity than students not receiving counseling.
CHARACTERISTICS OF HYPOTHESIS

• Clear and precise.
• Capable of being tested.
• States the relationship between variables.
• Limited in scope and specific.
• Stated as far as possible in the simplest terms, so that it is
easily understood by all concerned. But one must remember that the
simplicity of a hypothesis has nothing to do with its significance.
• Consistent with most known facts.
• Amenable to testing within a reasonable time. One can’t spend
a lifetime collecting data to test it.
• Explains what it claims to explain; it should have empirical
reference.
NULL HYPOTHESIS

• It is an assertion that we hold as true unless we have sufficient statistical
evidence to conclude otherwise.
• The null hypothesis is denoted by $H_0$.
• If the population mean is equal to the hypothesised mean, then the null hypothesis
can be written as

$$H_0: \mu = \mu_0$$
ALTERNATIVE HYPOTHESIS

• The alternative hypothesis is the negation of the null hypothesis and is denoted by $H_a$.

If the null is given as $H_0: \mu = \mu_0$, then the alternative hypothesis can be written as one of

$$H_a: \mu \neq \mu_0$$
$$H_a: \mu > \mu_0$$
$$H_a: \mu < \mu_0$$
LEVEL OF SIGNIFICANCE AND CONFIDENCE INTERVAL

• The significance level is the percentage risk of rejecting a null hypothesis when it is true.
It is denoted by α and is generally taken as 1%, 5% or 10%.
• (1 − α) is the confidence level: the probability of not rejecting the null hypothesis when it is
true.
Risk of rejecting a Null Hypothesis when it is true

Designation   | Risk α       | Confidence 1 − α | Description
Supercritical | 0.001 (0.1%) | 0.999 (99.9%)    | More than $100 million (large loss of life, e.g. nuclear disaster)
Critical      | 0.01 (1%)    | 0.99 (99%)       | Less than $100 million (a few lives lost)
Important     | 0.05 (5%)    | 0.95 (95%)       | Less than $100 thousand (no lives lost, injuries occur)
Moderate      | 0.10 (10%)   | 0.90 (90%)       | Less than $500 (no injuries occur)
TYPE I AND TYPE II ERROR

Situation     | Decision: Accept Null   | Decision: Reject Null
Null is true  | Correct                 | Type I error (α error)
Null is false | Type II error (β error) | Correct
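
To make the Type I error concrete, here is a small simulation sketch. It draws many samples from a population where the null hypothesis really is true and counts how often a two-sided z-test at α = 0.05 rejects it anyway. All numbers (`MU0`, `SIGMA`, `N`, the number of trials) are arbitrary illustration values.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the sketch is repeatable

ALPHA = 0.05
MU0, SIGMA, N = 50.0, 10.0, 30  # hypothetical population mean, sd and sample size
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical value

def rejects_true_null() -> bool:
    """Draw one sample with the null TRUE and report whether the test rejects it."""
    sample = [random.gauss(MU0, SIGMA) for _ in range(N)]
    z = (mean(sample) - MU0) / (SIGMA / N ** 0.5)
    return abs(z) > z_crit

trials = 2000
type_i_rate = sum(rejects_true_null() for _ in range(trials)) / trials
print(f"observed Type I error rate: {type_i_rate:.3f} (alpha = {ALPHA})")
```

The observed rate should land near α, which is exactly what "risk of rejecting a true null" means.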
TWO TAILED TEST
@ 5% Significance level

Acceptance and rejection regions in case of a two-tailed test.
Suitable when $H_0: \mu = \mu_0$ and $H_a: \mu \neq \mu_0$.

[Figure: normal curve centered at $H_0: \mu = \mu_0$]
• Left tail: rejection region, half the significance level ($\alpha/2 = 0.025$ or 2.5%)
• Center: total acceptance region, or confidence level ($1 - \alpha = 95\%$)
• Right tail: rejection region, half the significance level ($\alpha/2 = 0.025$ or 2.5%)
LEFT TAILED TEST
@ 5% Significance level

Acceptance and rejection regions in case of a left-tailed test.
Suitable when $H_0: \mu = \mu_0$ and $H_a: \mu < \mu_0$.

[Figure: normal curve centered at $H_0: \mu = \mu_0$]
• Left tail: rejection region, significance level ($\alpha = 0.05$ or 5%)
• Remainder: total acceptance region, or confidence level ($1 - \alpha = 95\%$)
RIGHT TAILED TEST
@ 5% Significance level

Acceptance and rejection regions in case of a right-tailed test.
Suitable when $H_0: \mu = \mu_0$ and $H_a: \mu > \mu_0$.

[Figure: normal curve centered at $H_0: \mu = \mu_0$]
• Remainder: total acceptance region, or confidence level ($1 - \alpha = 95\%$)
• Right tail: rejection region, significance level ($\alpha = 0.05$ or 5%)
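
The boundaries between acceptance and rejection regions are the critical values. A minimal sketch, assuming the test statistic follows the standard normal distribution (as in a z-test); the function name `critical_values` and the `tail` labels are ours.

```python
from statistics import NormalDist

def critical_values(alpha: float, tail: str) -> tuple:
    """Critical z value(s) that bound the rejection region for a given tail type."""
    z = NormalDist()  # standard normal
    if tail == "two":
        c = z.inv_cdf(1 - alpha / 2)  # alpha is split evenly between both tails
        return (-c, c)
    if tail == "left":
        return (z.inv_cdf(alpha),)    # all of alpha sits in the left tail
    if tail == "right":
        return (z.inv_cdf(1 - alpha),)  # all of alpha sits in the right tail
    raise ValueError("tail must be 'two', 'left' or 'right'")

for tail in ("two", "left", "right"):
    values = ", ".join(f"{v:+.3f}" for v in critical_values(0.05, tail))
    print(f"{tail:>5}-tailed test at 5%: {values}")
```

At the 5% level this gives roughly ±1.960 for the two-tailed test and −1.645 or +1.645 for the one-tailed tests.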
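PROCEDURE FOR HYPOTHESIS TESTING

1. State the null (H0) and alternate (Ha) hypothesis.
2. State a significance level: 1%, 5%, 10% etc.
3. Decide on a test statistic: z-test, t-test, F-test.
4. Calculate the value of the test statistic.
5. Find the critical (table) value at the given significance level from the table.
6. Compare the critical value with the calculated value:
   • Critical value > calculated value (in magnitude): accept H0.
   • Critical value < calculated value (in magnitude): reject H0.

Equivalently, compute the p-value of the calculated statistic and reject H0 when the p-value is
smaller than the significance level.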
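
The procedure can be sketched end to end for a two-sided z-test, using the z statistic defined later in the deck. The input numbers (sample mean 52, hypothesised mean 50, known sd 5, n = 25) are invented for illustration, and the function name is ours.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(sample_mean: float, mu0: float, sigma: float, n: int,
                     alpha: float = 0.05) -> dict:
    """Two-sided z-test of H0: mu = mu0 against Ha: mu != mu0 (sigma known)."""
    std_normal = NormalDist()
    z = (sample_mean - mu0) / (sigma / sqrt(n))     # step 4: test statistic
    z_crit = std_normal.inv_cdf(1 - alpha / 2)      # step 5: critical value
    p_value = 2 * (1 - std_normal.cdf(abs(z)))      # equivalent p-value route
    return {
        "z": z,
        "z_crit": z_crit,
        "p_value": p_value,
        "reject_h0": abs(z) > z_crit,               # step 6: compare
    }

# Steps 1-3: H0: mu = 50, Ha: mu != 50, alpha = 5%, z-test (hypothetical numbers).
result = z_test_two_sided(sample_mean=52.0, mu0=50.0, sigma=5.0, n=25)
print(result)
```

Both routes agree by construction: `abs(z) > z_crit` holds exactly when `p_value < alpha`.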
Z-TEST FOR TESTING MEANS

Test Condition
• Population normal and infinite
• Sample size large or small
• Population variance is known
• Ha may be one-sided or two-sided

Test Statistics
$$z = \frac{\bar{X} - \mu_{H_0}}{\sigma_p / \sqrt{n}}$$
Z-TEST FOR TESTING MEANS

Test Condition
• Population normal and finite
• Sample size large or small
• Population variance is known
• Ha may be one-sided or two-sided

Test Statistics
$$z = \frac{\bar{X} - \mu_{H_0}}{\left(\sigma_p / \sqrt{n}\right) \times \sqrt{(N - n)/(N - 1)}}$$
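
The only difference between the infinite and finite population versions of the z statistic is the factor √((N − n)/(N − 1)) applied to the standard error. A minimal sketch with invented inputs; the function and parameter names are ours.

```python
from math import sqrt
from typing import Optional

def z_statistic(sample_mean: float, mu0: float, sigma_p: float, n: int,
                population_size: Optional[int] = None) -> float:
    """z statistic for a mean with known population standard deviation sigma_p.

    If population_size (N) is given, the finite population correction
    sqrt((N - n) / (N - 1)) is applied to the standard error.
    """
    standard_error = sigma_p / sqrt(n)
    if population_size is not None:
        standard_error *= sqrt((population_size - n) / (population_size - 1))
    return (sample_mean - mu0) / standard_error

# Hypothetical numbers: sample mean 52, hypothesised mean 50, sigma 5, n = 25.
z_infinite = z_statistic(52.0, 50.0, 5.0, 25)
z_finite = z_statistic(52.0, 50.0, 5.0, 25, population_size=101)

print(f"infinite population: z = {z_infinite:.4f}")
print(f"finite population (N = 101): z = {z_finite:.4f}")
```

Because the correction factor is below 1, it shrinks the standard error and so enlarges the statistic when the sample is a sizeable share of the population.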
Z-TEST FOR TESTING MEANS

Test Condition
• Population is finite and may not be normal
• Sample size is large
• Population variance is unknown
• Ha may be one-sided or two-sided

Test Statistics
$$z = \frac{\bar{X} - \mu_{H_0}}{\left(\sigma_s / \sqrt{n}\right) \times \sqrt{(N - n)/(N - 1)}}$$
T-TEST FOR TESTING MEANS

Test Condition
• Population is infinite and normal
• Sample size is small
• Population variance is unknown
• Ha may be one-sided or two-sided

Test Statistics
$$t = \frac{\bar{X} - \mu_{H_0}}{\sigma_s / \sqrt{n}} \quad \text{with d.f.} = n - 1$$

$$\sigma_s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
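
A short sketch of the t statistic for the infinite, normal population case. The sample values and the hypothesised mean are made up; `statistics.stdev` already uses the n − 1 denominator that appears in the formula for the sample standard deviation.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample: list, mu0: float) -> tuple:
    """Return (t, degrees of freedom) for H0: mu = mu0 with unknown variance."""
    n = len(sample)
    x_bar = mean(sample)
    sigma_s = stdev(sample)  # sqrt(sum((x_i - x_bar)^2) / (n - 1))
    t = (x_bar - mu0) / (sigma_s / sqrt(n))
    return t, n - 1

# Hypothetical small sample and hypothesised mean.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
t, df = t_statistic(sample, mu0=4.0)
print(f"t = {t:.4f} with {df} degrees of freedom")
```

The Python standard library has no t distribution, so the resulting t still has to be compared with a t-table value (or a library such as SciPy, not used here) for the given degrees of freedom.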
T-TEST FOR TESTING MEANS

Test Condition
• Population is finite and normal
• Sample size is small
• Population variance is unknown
• Ha may be one-sided or two-sided

Test Statistics
$$t = \frac{\bar{X} - \mu_{H_0}}{\left(\sigma_s / \sqrt{n}\right) \times \sqrt{(N - n)/(N - 1)}} \quad \text{with d.f.} = n - 1$$

$$\sigma_s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
F-TEST FOR TESTING EQUALITY OF VARIANCES OF TWO NORMAL POPULATIONS

Test conditions
• The populations are normal
• Samples have been drawn randomly
• Observations are independent
• There is no measurement error
• Ha may be one-sided or two-sided

Test statistics
$$F = \frac{\sigma_{s1}^2}{\sigma_{s2}^2} \quad \text{with } (n_1 - 1) \text{ and } (n_2 - 1) \text{ d.f.}$$

$\sigma_{s1}^2$ is the sample estimate for $\sigma_{p1}^2$
$\sigma_{s2}^2$ is the sample estimate for $\sigma_{p2}^2$
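
A minimal sketch of the F statistic as a ratio of two sample variances. Both samples are invented; `statistics.variance` computes the sample variance with an n − 1 denominator, matching the sample estimates in the formula.

```python
from statistics import variance

def f_statistic(sample1: list, sample2: list) -> tuple:
    """Return (F, d.f. numerator, d.f. denominator) for two independent samples."""
    f = variance(sample1) / variance(sample2)  # ratio of sample variance estimates
    return f, len(sample1) - 1, len(sample2) - 1

# Hypothetical samples from two populations assumed to be normal.
sample1 = [2, 4, 4, 4, 5, 5, 7, 9]
sample2 = [1, 2, 3, 4, 5]

f, df1, df2 = f_statistic(sample1, sample2)
print(f"F = {f:.4f} with ({df1}, {df2}) degrees of freedom")
```

As with the t-test, the resulting F is then compared with an F-table value for the same degrees of freedom.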
LIMITATIONS OF THE TEST OF HYPOTHESIS
• Testing of hypothesis is not decision making itself; it only helps
decision making.
• A test does not explain why a difference exists. It only indicates
whether the difference is due to fluctuations of sampling or to other
reasons; it does not tell us what those reasons are.
• Tests are based on probabilities, so their results cannot be
expressed with full certainty.
• Statistical inferences based on significance tests cannot
be said to be entirely correct evidence concerning the truth
of the hypothesis.
CHI-SQUARE TEST
The chi-square test is used to compare categorical variables.
Intuition:
a. A small chi-square value means that the observed data fits the expected counts.
b. A high chi-square value means that the observed data doesn’t fit.

The hypothesis being tested for chi-square is

Null: Variable A and Variable B are independent.
Alternate: Variable A and Variable B are not independent.

The statistic used to measure significance, in this case, is called the chi-square statistic. The
formula used for calculating the statistic is

$$\chi^2 = \sum \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$$

where
$O_{r,c}$ = observed frequency count at level r of Variable A and level c of Variable B
$E_{r,c}$ = expected frequency count at level r of Variable A and level c of Variable B
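
The chi-square formula can be sketched for a contingency table as follows. The slides do not say how the expected counts are obtained; this sketch assumes the usual rule under independence, E = (row total × column total) / grand total. The example tables are invented.

```python
def chi_square_statistic(observed: list) -> tuple:
    """Return (chi-square, degrees of freedom) for a table of observed counts.

    Assumption: expected counts come from the independence rule
    E[r][c] = row_total[r] * col_total[c] / grand_total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for r, row in enumerate(observed):
        for c, o in enumerate(row):
            e = row_totals[r] * col_totals[c] / grand_total
            chi2 += (o - e) ** 2 / e

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof

# Hypothetical 2x2 tables: rows are levels of Variable A, columns levels of Variable B.
dependent_table = [[10, 20], [20, 10]]
independent_table = [[10, 20], [20, 40]]

print(chi_square_statistic(dependent_table))    # large value: data doesn't fit independence
print(chi_square_statistic(independent_table))  # zero: data fits independence exactly
```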
ANOVA

• Basically, you’re testing groups to see if there’s a difference between them.

Examples of when you might want to test different groups:

• A group of psychiatric patients are trying three different therapies: counseling,
medication and biofeedback. You want to see if one therapy is better than the
others.

• A manufacturer has two different processes to make light bulbs. They want to
know if one process is better than the other.

• Students from different colleges take the same exam. You want to see if one
college outperforms the other.
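
The slides give no formula for ANOVA, so the following is only a sketch of the standard one-way ANOVA F statistic: the variance between group means divided by the variance within groups. The three groups stand in for something like the three therapies above, with made-up scores.

```python
from statistics import mean

def one_way_anova_f(groups: list) -> tuple:
    """Return (F, d.f. between, d.f. within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical outcome scores for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F suggests that at least one group mean differs from the others; it does not say which one.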
WHEN TO USE WHAT?
