AND HYPOTHESIS TESTING
GOWTHAM K JAYACHANDIRAN
WHAT IS DATA SCIENCE AND WHY DO WE NEED IT?
Quite simply, it’s about using data to solve problems.
DEFINITELY NOT
THIS!!
THIS IS WHAT WE DO!!
SIMPLY!!
THE PROCESS
THE SCIENCE
WHY DO WE NEED A DATA SCIENTIST?
WHY DO WE NEED STATISTICS IN DATA SCIENCE?
“Data Scientist is a person who is better at statistics than any programmer and
better at programming than any statistician.”
To become a successful Data Scientist you must know your basics. Math and Stats are the
building blocks of Machine Learning algorithms.
It is important to know the techniques behind various Machine Learning algorithms in order to
know how and when to use them.
Descriptive Statistics organizes data and summarizes its characteristics with measures such
as the mean, median and standard deviation.
INFERENTIAL STATISTICS
“Inferential Statistics makes inferences and predictions about a population based on a
sample of data taken from the population in question”
Inferential statistics generalizes from a sample to a larger population and applies probability to arrive at a conclusion.
It allows you to infer parameters of the population from sample statistics and to build models on them.
A population is a set of similar items or events.
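The idea of inferring a population parameter from a sample can be sketched in a few lines of Python. This is only an illustration: the population of heights, its mean of 170 and its spread of 10 are made-up values, not data from these slides.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 heights in cm (made-up parameters).
population = [random.gauss(170, 10) for _ in range(100_000)]

# A sample is a small subset drawn at random from the population.
sample = random.sample(population, 500)

population_mean = mean(population)  # the parameter (usually unknown in practice)
sample_mean = mean(sample)          # the statistic we can actually compute

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean lands close to the population mean even though only 500 of the 100,000 values were inspected, which is the basis of inferential statistics.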
“A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or
lose and where the probability of success and failure is same for all the trials is called a Binomial
Distribution”
• Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss.
• An experiment with only two possible outcomes repeated n number of times is called binomial.
When the probability of success equals the probability of failure (p = 0.5), the graph of the binomial distribution
is symmetric about its mean.
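A minimal sketch of the binomial probabilities for the equal-probability case, assuming 10 tosses of a fair coin (the values of n and p are illustrative choices, not from the slides):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # hypothetical experiment: 10 tosses of a fair coin
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Text histogram: one '#' per percentage point of probability.
for k, prob in enumerate(pmf):
    print(f"{k:2d}  {prob:.4f}  {'#' * round(prob * 100)}")
```

With p = 0.5 the bars mirror each other around k = 5; changing p to, say, 0.2 skews the histogram.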
NORMAL DISTRIBUTION
A Normal distribution is symmetric and bell-shaped, and it has the following characteristics:
• About 68% of the data falls within one standard deviation of the mean.
• About 95% falls within two standard deviations.
• About 99.7% falls within three standard deviations.
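These three percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```

The printed values round to the 68%, 95% and 99.7% quoted above.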
WHAT IS A HYPOTHESIS?
e.g.
Students who receive counseling will show a greater increase in
creativity than students not receiving counseling
CHARACTERISTICS OF HYPOTHESIS
𝐻0: 𝜇 = 𝜇0
ALTERNATIVE HYPOTHESIS
If the Null is given as 𝐻0: 𝜇 = 𝜇0, the alternative hypothesis can be
𝐻𝑎: 𝜇 > 𝜇0
𝐻𝑎: 𝜇 < 𝜇0
LEVEL OF SIGNIFICANCE AND CONFIDENCE INTERVAL
The level of significance is the risk of rejecting a null hypothesis when it is true,
and it is denoted by 𝛼. It is generally taken as 1%, 5% or 10%.
(1 − 𝛼) is the confidence level: the probability of not rejecting the null hypothesis when it is
true.
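To make the link between 𝛼 and a test concrete, the sketch below looks up the two-sided critical z-values for the three usual significance levels. It assumes a two-sided z-test, which is an illustrative choice, and uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

critical_values = {}
for alpha in (0.01, 0.05, 0.10):
    # Two-sided test: alpha is split equally between the two tails.
    critical_values[alpha] = std_normal.inv_cdf(1 - alpha / 2)

for alpha, z_crit in critical_values.items():
    print(f"alpha = {alpha:.2f}  confidence = {1 - alpha:.2f}  critical z = {z_crit:.3f}")
```

A smaller 𝛼 gives a larger critical value, so stronger evidence is needed before the null hypothesis is rejected.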
Risk of rejecting a Null Hypothesis when it is true

Designation     Risk 𝛼          Confidence 1 − 𝛼    Description
Supercritical   0.001 (0.1%)    0.999 (99.9%)       More than $100 million (large loss of life, e.g. nuclear disaster)
Critical        0.01 (1%)       0.99 (99%)          Less than $100 million (a few lives lost)
Important       0.05 (5%)       0.95 (95%)          Less than $100 thousand (no lives lost, injuries occur)
Moderate        0.10 (10%)      0.90 (90%)          Less than $500 (no injuries occur)
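TYPE I AND TYPE II ERROR

Situation        Decision: Accept Null    Decision: Reject Null
Null is true     Correct decision         Type I error (risk 𝛼)
Null is false    Type II error            Correct decision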
𝐻0: 𝜇 = 𝜇0
PROCEDURE FOR HYPOTHESIS TESTING
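1. State the null (Ho) and alternate (Ha) hypotheses.
2. State a significance level: 1%, 5%, 10%, etc.
3. Decide on a test statistic: z-test, t-test, F-test.
4. Calculate the value of the test statistic.
5. Find the critical value at the given significance level from the table.
6. Compare the critical value with the calculated value:
   • Critical value > calculated value: accept (fail to reject) Ho.
   • Critical value < calculated value: reject Ho.
Equivalently, reject Ho when the p-value of the calculated statistic is smaller than the significance level 𝛼.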
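The procedure can be sketched end to end for one simple case: a two-sided, one-sample z-test with a known population standard deviation. The function name `one_sample_z_test` and all the numbers (hypothesized mean 100, standard deviation 15, 36 observations with a sample mean of 106) are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of Ho: mu = mu0 against Ha: mu != mu0 (sigma known)."""
    std_normal = NormalDist()
    # Calculate the value of the test statistic.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Critical value for the chosen significance level (alpha split over two tails).
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # p-value: probability of a statistic at least this extreme if Ho is true.
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    # Decision: reject Ho when the statistic exceeds the critical value.
    reject = abs(z) > critical
    return z, critical, p_value, reject

# Hypothetical data, not taken from the slides.
z, critical, p_value, reject = one_sample_z_test(
    sample_mean=106, mu0=100, sigma=15, n=36, alpha=0.05
)
print(f"z = {z:.2f}, critical = {critical:.2f}, p-value = {p_value:.4f}")
print("Reject Ho" if reject else "Fail to reject Ho")
```

Both decision rules agree here: the calculated z exceeds the critical value, and the p-value is below 𝛼, so Ho is rejected at the 5% level.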
Z-TEST FOR TESTING MEANS
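The statistic used to measure significance, in this case, is called the chi-square statistic. The
formula used for calculating the statistic is χ² = Σ (O − E)² / E, where O is an observed
frequency and E is the corresponding expected frequency.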
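As a hedged illustration, assuming the statistic meant here is Pearson's chi-square, which sums (O − E)² / E over all categories, the calculation looks like this. The observed counts are made up: 60 rolls of a die that is assumed fair under the null hypothesis.

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for faces 1..6 over 60 rolls of a die.
observed = [8, 12, 10, 9, 11, 10]
expected = [10] * 6  # a fair die gives 60 / 6 = 10 per face

chi2 = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2:.2f}")
```

With 6 categories there are 5 degrees of freedom, and the tabulated 5% critical value is about 11.07, so a statistic this small gives no reason to reject the null hypothesis of a fair die.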
A manufacturer has two different processes to make light bulbs. They want to
know if one process is better than the other.
Students from different colleges take the same exam. You want to see if one
college outperforms the other.
WHEN TO USE WHAT?