
Basics of Probability

Inferential Statistics (Optional)


Introduction: Inferential Statistics
Welcome to the module on ‘Inferential Statistics’.

In this module
Many a time, you may require a very large amount of data for your analysis which may need too
much time and resources to acquire. In such situations, you are forced to work with a smaller
sample of the data, instead of having the entire data to work with.

Situations like these arise all the time at big companies like Amazon. For example, say the
Amazon QC department wants to know what proportion of the products in its warehouses are
defective. Instead of going through all of its products (which would be a lot!), the Amazon QC
team can just check a small sample of 1,000 products and then find, for this sample, the defect
rate (i.e. the proportion of defective products). Then, based on this sample's defect rate, the team
can "infer" what the defect rate is for all the products in the warehouses.

This process of “inferring” insights from sample data is called “Inferential Statistics”.
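
As a rough illustration of the QC example above, the sketch below estimates a defect rate from a sample of 1,000 products. The warehouse size (200,000 products) and the true defect rate (3%) are made-up numbers for illustration only; they do not come from the module.

```python
import random

def estimate_defect_rate(population, sample_size, seed=0):
    """Estimate the defect rate of `population` from a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # draw without replacement
    return sum(sample) / sample_size

# Hypothetical warehouse: 200,000 products, 3% defective (1 = defective, 0 = fine).
# Both numbers are assumptions made for this sketch.
population = [1] * 6_000 + [0] * 194_000
true_rate = sum(population) / len(population)

estimate = estimate_defect_rate(population, sample_size=1_000)
print(f"true defect rate: {true_rate:.3f}, estimate from 1,000 products: {estimate:.3f}")
```

The estimate will usually land close to the true rate without matching it exactly, which is why the results of inference are stated in terms of probability.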


Example: exit polls. Results from a smaller sample are extrapolated to a larger population. How do
you estimate the number of samples needed? Inferential statistics uses information from smaller
samples to draw inferences about a larger population. It is used in the food industry, the
pharmaceutical industry and almost every large-scale production industry.

Another example discussed: an office with 30,000 employees. The average commute time from office
to home is collected for only 100 employees. Using this sample, the average commute time of all
30,000 employees is estimated.
Note that even after using inferential statistics, you would only be able to estimate the population
data from the sample data, but not find the exact values. This is because when you don't have the
exact data, you can only make reasonable estimates about it with a limited level of certainty.
Therefore, when certainty is limited, we talk in terms of probability.

Prerequisites
You’ll need to brush up on your concepts of probability before you begin this module,
specifically the following concepts (links provided on the relevant pages):
• Basic definition of probability
• Multiplication rule of probability
• Addition rule of probability
• (Combinatorics)
Guidelines for in-module questions
The in-video and in-content questions for this module are not graded.

People you will hear from in this module


Subject Matter Expert
Tricha Anjali
Associate Professor, IIIT-B
The International Institute of Information Technology, Bangalore, also known as IIIT-B, is one
of India's foremost graduate schools. Through its Integrated M.Tech., M.Tech., M.S. (Research)
and PhD programs in the IT space, it focuses equally on innovation and education.

Reference Ebook
Statistical Inference for Data Science by Brian Caffo.
Note: In this module, some interactive graphics have been sourced from the Seeing
Theory project of Brown University.

Seeing Theory is a project designed and created by Daniel Kunin with support from Brown
University's Royce Fellowship Program. The goal of the project is to make statistics more
accessible to a wider range of students through interactive visualisations.

Introduction: Basics of Probability


Welcome to the session on ‘Basics of Probability’.

In this session
In this session, you will learn a few basic concepts of probability through a specific example.
The broad agenda for this session is as follows:
• Random variables
• Probability distributions
• Expected value
Prerequisites
It has been assumed in this session that you have knowledge of the basic definition of
probability.

You can brush up on these topics using the links given below:
1. Prerequisites for Session 1 (Upgrad Course Statistics Essential Mod 7 Ses 1)
You can also solve the practice questions provided in the Resources section.
1. Optional Questions for Session 1 (Upgrad Course Statistics Essential Mod 7 Ses 1 Quiz)

Guidelines for in-module questions


The in-video and in-content questions for this session are not graded.

People you will hear from in this session


Subject Matter Expert
Tricha Anjali
Associate Professor, IIIT-B

Reference Ebook
Statistical Inference for Data Science by Brian Caffo.

Downloaded and kept as LittleInferenceBook.pdf

Random Variables
Welcome to the first session on inferential statistics! This will be a very interactive session, with
a lot of questions that will compel you to think about a concept, helping you explore it more
actively.

So, let’s get started.


Example: a casino. The house always wins because the machines are designed that way, using
probability. To demonstrate this, we play a game with a bag of 5 balls: 3 red and 2 blue.
- Each participant pulls a ball, notes its colour and puts it back.
- Each participant is allowed 4 draws.
- Whoever pulls a red ball on all 4 draws receives ₹150.
- Everyone else pays a penalty of ₹10 to the house.

Video question: what are all the possible outcomes of the 4 draws?

Recall the original question: In the long run (i.e. if it is played a lot of times), is this game
profitable for the players or for the house? Or will everybody break even in the long run?

Recall that we established a three-step process for answering this question:


1. Find all the possible combinations
2. Find the probability of each combination
3. Use the probabilities to estimate the profit/loss per player

We have completed step 1, i.e. finding all the possible combinations. Now, let’s proceed to step
2, i.e. finding the probability of each combination. What are the steps involved in finding the
probability? Let’s hear more from Professor Tricha on this:


Similarly, X takes a value for each combination.


So, the random variable X basically converts outcomes of experiments to something measurable.
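
For the red ball game, this conversion can be sketched in a few lines. The snippet lists every colour sequence of 4 draws (step 1 of the process) and maps each one to X, the number of red balls. The names `outcomes`, `x` and `sequences_per_value` are illustrative choices, not taken from the course material.

```python
from collections import Counter
from itertools import product

# Every possible sequence of 4 draws, each draw being red ("R") or blue ("B").
outcomes = list(product("RB", repeat=4))

def x(outcome):
    """Random variable X: the number of red balls in one outcome."""
    return outcome.count("R")

# How many colour sequences map to each value of X.
sequences_per_value = Counter(x(o) for o in outcomes)
for value in sorted(sequences_per_value):
    print(f"X = {value}: {sequences_per_value[value]} sequence(s)")
```

There are 16 sequences but only 5 values of X, so working with X is simpler than working with the raw colour sequences. Note that the sequences are not equally likely, because red is more likely than blue on each draw.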

For example, let’s say as a data analyst at a bank, you are trying to find out which of the
customers will default on their loan, i.e. stop paying their loans. Based on some data, you have
been able to make the following predictions:

Customer No.  Yearly Income (in rupees)  Amount of Loan Due (in rupees)  Number of Dependents  Default Prediction (Yes/No)
1             ₹10 lakh                   ₹75 lakh                        3                     Yes
2             ₹15 lakh                   ₹50 lakh                        2                     No
3             ₹20 lakh                   ₹40 lakh                        1                     No

Now, instead of processing the yes/no response, it will be much easier if you define a random
variable X, indicating whether the customer is predicted to default or not. The values will be
assigned according to this rule:

X = 1, if the customer defaults


X = 0, if the customer does not default

Now, the data changes to the following:

Customer No.  Yearly Income (in rupees)  Amount of Loan Due (in rupees)  Number of Dependents  X (random variable)
1             ₹10 lakh                   ₹75 lakh                        3                     1
2             ₹15 lakh                   ₹50 lakh                        2                     0
3             ₹20 lakh                   ₹40 lakh                        1                     0

Now, in this form, the table is entirely quantified, i.e. converted to numbers. Now that the data
is entirely in quantitative terms, it becomes possible to perform a number of different kinds of
statistical analyses on it.
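
A minimal sketch of this encoding for the three customers in the table. The dictionary layout and the function name `to_random_variable` are assumptions made for illustration.

```python
# Default predictions copied from the table above, keyed by customer number.
default_prediction = {1: "Yes", 2: "No", 3: "No"}

def to_random_variable(prediction):
    """Map a Yes/No default prediction to X (1 = default, 0 = no default)."""
    return 1 if prediction == "Yes" else 0

x_values = {customer: to_random_variable(p) for customer, p in default_prediction.items()}
predicted_defaults = sum(x_values.values())  # counting defaults is now a simple sum
print(x_values, predicted_defaults)
```

Once the column is numeric, quantities such as the number or share of predicted defaults reduce to sums and averages of X.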

Probability Distributions - I
Recall that, in our UpGrad game example, we need to find out if the game would be profitable
for the players or for us (i.e. the house) in the long run. The three-step process for this is:
1. Find all the possible combinations
2. Find the probability of each combination
3. Use the probabilities to estimate the profit/loss per player

So far, we have completed step 1, and are on step 2, i.e. finding the probability of each
combination. For this purpose, we defined a random variable X which helped us convert the
outcomes of our experiment to something measurable. Now, let’s actually find the probability of
each of these combinations.


The game was played with 75 people.


Probability
Based on this histogram, can you tell what the probability will be for X = 2?
The correct answer is 0.35. Let's listen to Prof. Tricha as she tells us why.
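
The histogram from the video is not reproduced in these notes, but the answer can be cross-checked against theory. The sketch below assumes the four draws are independent with P(red) = 3/5 (3 red balls out of 5, drawn with replacement), so that X follows a binomial distribution. The observed frequencies from 75 real games may differ slightly from these values.

```python
from fractions import Fraction
from math import comb

P_RED = Fraction(3, 5)  # 3 red balls out of 5, drawn with replacement
DRAWS = 4

def prob_red_count(k):
    """P(X = k): probability of exactly k red balls in DRAWS independent draws."""
    return comb(DRAWS, k) * P_RED**k * (1 - P_RED) ** (DRAWS - k)

distribution = {k: prob_red_count(k) for k in range(DRAWS + 1)}
for k, p in distribution.items():
    print(f"P(X = {k}) = {float(p):.4f}")
```

Under these assumptions P(X = 2) = 216/625 = 0.3456, which rounds to the 0.35 given above.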

(Give your answer as a number rounded to 2 digits after the decimal point.)


Figure 1 - Image for In-video Question

You can play around with the Shiny app to better understand frequency distributions.
A sample R program (if you have time, code it in Python):

Discrete Random Variables


by kshitij.jain@upgrad.com


• discrete_rv_casino_distribution.R

library(shiny)
library(ggplot2)
library(ggthemes)

# Define server logic
shinyServer(function(input, output) {
  trials <- 100

  # Bag of 5 balls: 1 = red (3 balls), 0 = blue (2 balls)
  balls <- c(1, 1, 1, 0, 0)

  # X for each game: number of red balls in 4 draws with replacement
  number_of_red_balls <- vector(mode = "integer", length = trials)
  for (n in 1:trials) {
    s <- sample(balls, 4, replace = TRUE)
    number_of_red_balls[n] <- sum(s)
  }

  # Frequency bar chart of X for the games played so far
  output$redPlot <- renderPlot({
    # Takes the value of 'action' from input
    t <- min(trials, input$action)
    red <- data.frame(number_of_red_balls[1:t])
    colnames(red) <- "red"

    r <- red$red
    s <- summary(factor(r, levels = c(0, 1, 2, 3, 4)))
    agg <- sum(s)
    s <- data.frame(s)
    s$x1 <- c(0, 1, 2, 3, 4)
    ggplot(s, aes(x = factor(x1), y = s, width = 0.18)) +
      geom_bar(stat = "identity", fill = "#f98866", width = 0.25) +
      theme_minimal() +
      coord_cartesian(ylim = c(0, 40)) +
      xlab("X (Number of red balls)") +
      ylab("Frequency") +
      theme(legend.position = "none") +
      theme(text = element_text(size = 14))
  })

  # Values of X for the games played so far
  output$value <- renderPrint({
    number_of_red_balls[1:min(trials, input$action)]
  })

  # Number of games played so far
  output$length <- renderPrint({
    length(number_of_red_balls[1:min(trials, input$action)])
  })

  # Separate demo plot: 50 numbers sampled from 1 to 100
  output$plot1 <- renderPlot({
    qplot(sample(1:100, 50))
  })
})
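
As the note above suggests, the simulation part of this script can be coded in Python. The sketch below is not a port of the Shiny app: it only reproduces the sampling and the frequency count, and prints text bars instead of a ggplot chart. The function names and the seed are illustrative assumptions.

```python
import random
from collections import Counter

def simulate_red_ball_counts(trials=100, draws=4, seed=None):
    """Play the game `trials` times; return the number of red balls in each game."""
    rng = random.Random(seed)
    balls = [1, 1, 1, 0, 0]  # 1 = red, 0 = blue, as in the R script
    # rng.choices samples with replacement, like sample(..., replace = TRUE) in R
    return [sum(rng.choices(balls, k=draws)) for _ in range(trials)]

def frequency_table(counts, draws=4):
    """Frequency of each value of X from 0 to `draws`, including values never seen."""
    seen = Counter(counts)
    return {value: seen.get(value, 0) for value in range(draws + 1)}

counts = simulate_red_ball_counts(trials=100, seed=42)
for value, freq in frequency_table(counts).items():
    print(f"X = {value}: {'#' * freq} ({freq})")
```

Changing `seed` or `trials` gives a different frequency distribution each time, which mirrors what the interactive app shows as more games are played.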

Probability Distributions - II
So basically, a probability distribution is ANY form of representation that tells us the
probability for all possible values of X. It could be any of the following:
• A table

Figure 2 - Tabular Form of Probability Distribution

• A chart

Figure 3 - Bar Chart Form of Probability Distribution

• An equation
P(x) = x/21
(for x = 1, 2, 3, 4, 5 and 6)
Feedback :
This is not a valid distribution. P(X = 0) has been given as -0.05. However, there is no such thing as
a negative probability. Also, the total probability is 1.1, which does not make sense. The total of all
probabilities in a probability distribution should be equal to 1.
Hence, in a valid, complete probability distribution, there are no negative values, and the total of
all probability values adds up to 1. These two conclusions follow from the basic definition of
probability.
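
These two conditions are easy to check mechanically. The sketch below uses a hypothetical helper, `is_valid_distribution`, and applies it to the equation form P(x) = x/21 given earlier and to a made-up invalid example. Only the -0.05 entry and the total of 1.1 echo the feedback above; the other values in that example are invented for illustration.

```python
from fractions import Fraction
from math import isclose

def is_valid_distribution(probabilities):
    """True if no probability is negative and the probabilities sum to 1."""
    values = list(probabilities)
    total = float(sum(values))
    return all(p >= 0 for p in values) and isclose(total, 1.0, abs_tol=1e-9)

# The equation form: P(x) = x/21 for x = 1, ..., 6.
equation_form = [Fraction(x, 21) for x in range(1, 7)]
print(is_valid_distribution(equation_form))

# A made-up invalid example: one negative entry and a total of 1.1.
invalid_example = [-0.05, 0.35, 0.5, 0.3]
print(is_valid_distribution(invalid_example))
```

The equation form passes because 1 + 2 + 3 + 4 + 5 + 6 = 21, so the six probabilities add up to exactly 1.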

Also, if you recall, we discussed that the probability distribution and the frequency distribution
have exactly the same shape, just with different scales: each probability is the corresponding
frequency divided by the total number of observations. You can try it out in this interactive
app. The graph on the left shows the frequency distribution and the one on the right shows the
probability distribution.
Now, let’s say that a company’s management is pondering over investing in a certain project.
Before doing this, it wants to use probability to find whether it can safely expect to make a profit.
Whether the company makes a profit or not will actually depend on which economic cycle is
going on, i.e. recession, boom, and so on.

Based on the opinions of some experts, the following table is created:

Economic Cycle  Probability
Recession       0.1
Normal          0.7
Boom            0.2

Suppose, as an analyst in the investment division, you have been asked to find the answer to the
question: “Can the company expect to make a profit or not? Should it invest in this project?”

However, in this form, the table is of no help at all. Hence, let’s quantify it using a random
variable. Since you are interested in whether the company will profit or not, let’s define X as X =
Net revenue of the project.
Now, through some calculations, a fellow analyst of the company has found what the net revenue
would be for each of these scenarios. She creates a probability distribution with this data:

X (Net Revenue of Project, in ₹ crores)  P(x)
-305                                     0.1
+15                                      0.7
+95                                      0.2

Now, you finally have a probability distribution for X, the net revenue of the project. Using this
probability distribution, you can find the answer to our original question - “Can the company
expect a profit from this project? Or should it expect a loss?” However, to answer it, you will
have to learn the concept of expected value, which is what we will cover next.

Expected Value - I
Again, let’s go back to the three-step process we followed to find whether the UpGrad red ball
game was profitable for the players or for the house:
1. Find all the possible combinations
2. Find the probability of each combination
3. Use the probabilities to estimate the profit/loss per player
Now that we have completed steps 1 and 2, let’s move on to step 3 where we will use the
probabilities we calculated to estimate the profit/loss per player.


How do we work out whether the four-red-ball game is profitable for the house? We will see the calculation.

So, the expected value for a variable X is the value of X we would “expect” to get after
performing the experiment once. It is also called the expectation, average, and mean value.
Mathematically speaking, for a random variable X that can take values x1, x2, x3, ..., xn, the
expected value (EV) is given by:

EV(X) = x1*P(X=x1) + x2*P(X=x2) + x3*P(X=x3) + ... + xn*P(X=xn)
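
The formula translates directly into code. In the sketch below, `expected_value` is a hypothetical helper, and the distribution for the red ball game is the theoretical one under the assumption of independent draws with P(red) = 3/5. It is not the distribution measured in the video, so the result (2.4) differs slightly from the 2.385 quoted in these notes.

```python
from fractions import Fraction

def expected_value(distribution):
    """EV(X) = sum of x * P(X = x) over every value x that X can take."""
    return sum(x * p for x, p in distribution.items())

# Theoretical distribution of X (number of red balls in 4 draws), assuming
# independent draws with P(red) = 3/5. This is an assumption, not the
# distribution observed in the video.
red_balls = {
    0: Fraction(16, 625),
    1: Fraction(96, 625),
    2: Fraction(216, 625),
    3: Fraction(216, 625),
    4: Fraction(81, 625),
}
print(float(expected_value(red_balls)))
```

The same helper works for any distribution stored as a mapping from values to probabilities.
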
As you may recall, for our red ball game, the expected value came out to be 2.385. What does
that mean? How does that even help us with our original question, which was how much money,
on average, are the players expected to make?

Let’s explore that in the following video.


So, on average, each player wins ₹11.28 per game, which is a loss for the house. If a lot of
people play, the house will lose a lot of money.

How can the house make a profit?

The expected value of the money won by a player should be negative. To achieve this, you can:
A) Decrease the prize money.
B) Increase the penalty.
C) Decrease the player's chances of winning.
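
The effect of these options can be explored with a small sketch. It assumes the winning probability is the theoretical chance of 4 reds in 4 independent draws, (3/5)^4 = 0.1296. With that assumption the expected winnings come out near ₹10.74 rather than the ₹11.28 quoted above, which is based on the probabilities used in the video. The function names and the alternative penalty of ₹30 are illustrative.

```python
def expected_winnings(prize, penalty, p_win):
    """Expected money won by a player in one game."""
    return prize * p_win - penalty * (1 - p_win)

def break_even_prize(penalty, p_win):
    """Prize at which the expected winnings are exactly zero."""
    return penalty * (1 - p_win) / p_win

# Assumption: p_win is the theoretical chance of 4 reds in 4 draws.
p_win = (3 / 5) ** 4

print(round(expected_winnings(150, 10, p_win), 2))  # current rules
print(round(break_even_prize(10, p_win), 2))        # option A: prize must drop below this
print(round(expected_winnings(150, 30, p_win), 2))  # option B: a hypothetical higher penalty
```

Any prize below the break-even value, with the penalty unchanged, makes the expected winnings negative, so the house profits in the long run.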

The expected value should be interpreted as the average value you get after the experiment has
been conducted an infinite number of times. For example, the expected value for the number of
red balls is 2.385. This means that if we conduct the experiment (play the game) infinite times,
the average number of red balls per game would end up being 2.385. You can try it out in this
interactive app.

So, you can clearly see that, after a large number of simulations, the average value does, in fact,
converge towards the expected value, which is 2.385.
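
The interactive app is not available in these notes, but the same convergence can be seen with a short simulation. The sketch assumes independent draws with replacement from 3 red and 2 blue balls; under that assumption the long-run average is about 2.4, close to the 2.385 quoted above. The function name and the seed are illustrative.

```python
import random

def average_red_balls(games, seed=0):
    """Average number of red balls per game over `games` simulated games."""
    rng = random.Random(seed)
    balls = [1, 1, 1, 0, 0]  # 1 = red, 0 = blue
    total = sum(sum(rng.choices(balls, k=4)) for _ in range(games))
    return total / games

for games in (10, 100, 10_000, 100_000):
    print(f"{games:>7} games: average = {average_red_balls(games):.3f}")
```

With only a few games the average can be far from the expected value; it settles down as the number of games grows.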

Expected Value - II
Let’s try it for a different activity. Let’s say that you’re throwing a die. You’ve defined X as the
number obtained upon throwing it once. By calculations, you would find that the expected value
for this is 3.5. Let’s see what our simulations show:
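
In place of the interactive simulation, the sketch below computes the expected value of a fair six-sided die and compares it with the average of simulated throws. The number of throws (60,000) and the seed are arbitrary choices.

```python
import random
from fractions import Fraction

faces = range(1, 7)
expected = sum(face * Fraction(1, 6) for face in faces)  # each face has P = 1/6

rng = random.Random(1)
rolls = [rng.randint(1, 6) for _ in range(60_000)]
simulated_average = sum(rolls) / len(rolls)
print(float(expected), round(simulated_average, 2))
```

Note that 3.5 is not a value the die can ever show; the expected value is a long-run average, not a prediction for a single throw.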
Now, you may recall that the expected value of a player’s winnings after playing our game once
was ₹11.28. We said that reducing the prize money and/or increasing the penalty for our game
might make the expected value negative. Let’s see how much we need to reduce the prize money or
increase the penalty.
Recall the problem you saw earlier, where we were asked by a company to suggest whether it
should invest in a given project or not. We had made this probability distribution for X, the net
revenue of the project.

X (Net Revenue of Project, in ₹ crores)  P(x)
-305                                     0.1
+15                                      0.7
+95                                      0.2

Now, we are in a position to find the expected value for X, the return of the project. This is
called the expected return. If it comes out to be negative, we can say that the project is not
worth investing in.
The expected value of X, which is also called the expected return, is equal to

(-305)*P(X=-305) + (+15)*P(X=+15) + (+95)*P(X=+95)
= (-305)*0.1 + (+15)*0.7 + (+95)*0.2
= -30.5 + 10.5 + 19
= -1 crore rupees.

So, the expected return of the project is -1 crore rupees. Hence, we can conclude that the project
is not worth investing in.
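
The same calculation as a short sketch, using exact fractions to avoid floating-point rounding. The variable names are illustrative, and the decision rule (invest only if the expected return is positive) is the simple rule used in this example.

```python
from fractions import Fraction

# Net revenue (in ₹ crores) and its probability, copied from the table above.
net_revenue = {
    -305: Fraction("0.1"),
    15: Fraction("0.7"),
    95: Fraction("0.2"),
}

expected_return = sum(x * p for x, p in net_revenue.items())
decision = "invest" if expected_return > 0 else "do not invest"
print(f"expected return = {float(expected_return)} crore -> {decision}")
```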

You can find more examples of expected returns here.

Summary: Basics of Probability


In the first section, you learnt how to quantify the outcomes of events by using random
variables.

For example, recall that we quantified the colours of the balls we would get after playing our
game by assigning a value of X to each outcome. We did so by defining X as the number of red
balls we would get after playing the game once.

Figure 4 - Quantifying Using Random Variables


Next, we found the probability distribution, which was a distribution giving us the
probability for all possible values of X.
We created this distribution in a tabular form:

Figure 5 - Tabular Form of Probability Distribution


We also created it in a bar chart form:

Figure 6 - Bar Chart Form of Probability Distribution


You saw that, in the bar chart form, we were able to visualise the probability in a much better
way. Thus, this form is used more widely as it helps you see trends easily.

Then, we went on to find the expected value for X, the money won by a player after playing the
game once. The expected value (EV) for X was calculated using the formula:

EV(X) = x1*P(X=x1) + x2*P(X=x2) + x3*P(X=x3) + ... + xn*P(X=xn)

Another way of writing this is

EV(X) = Σ xi*P(X=xi), summed over i = 1 to n

Calculating the answer this way, we found the expected value to be +11.28.

In other words, if we conduct the experiment (play the game) infinite times, the average
money won by a player would be ₹11.28. Hence, we decided that we should either decrease the
prize money or increase the penalty to make the expected value of X negative. A negative
expected value would imply that, on average, a player would be expected to lose money and the
house would profit.

Practice Questions
These questions are not graded.

Let's say that Rajya Laxmi Bank has given student loans to 10,000 people. However, the
government wants to ensure that the bank is not giving away very risky loans and, hence, wants
to know the “Expected Loss” of the bank’s student loan portfolio.

The expected loss is basically the expected value of the money lost by the bank due to people
defaulting on their loans, i.e. not paying their EMIs.

The data that the bank uses to calculate the expected loss looks like this:

Customer No.  Exposure at Default (in ₹ lakh)  Recovery (%)  Probability of Default
00001         11.50                            20%           0.007

Here,
• Exposure at default (EAD) is the total money owed by the customer in case of default.
• Recovery (R) is the percentage of the exposed money that the bank would be able to recover.
  ◦ For example, for the customer above, the bank would recover 20% of the exposed money,
    i.e. 20% * 11.5 = ₹2.3 lakh.
• Probability of default (PD) is the probability that the customer will default. This is
  calculated for each customer using a number of factors such as family income, university
  attended, etc.
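
As a sketch of how these three quantities combine for a single customer: if the customer defaults (probability PD), the bank loses the part of the exposure it cannot recover, EAD * (1 - R); otherwise it loses nothing. The expected loss is then PD * EAD * (1 - R). This reading of the definitions and the function name are assumptions, not the module's official solution.

```python
def expected_loss(exposure_at_default, recovery, probability_of_default):
    """Expected loss for one loan: PD * EAD * (1 - R)."""
    loss_if_default = exposure_at_default * (1 - recovery)
    return probability_of_default * loss_if_default

# Customer 00001 from the table above (EAD in ₹ lakh).
loss = expected_loss(
    exposure_at_default=11.50,
    recovery=0.20,
    probability_of_default=0.007,
)
print(f"expected loss: ₹{loss:.4f} lakh")
```

For this customer the bank would lose ₹9.2 lakh on default, so the expected loss is 0.007 * 9.2 = ₹0.0644 lakh. Summing the per-customer values gives the expected loss of the whole portfolio.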
Inferential Statistics - Student Loan (file download)

In fact, banks are also required to report the unexpected loss. If you want to read more on how
the expected loss and unexpected loss are calculated, you can find that here.
Practice Questions II
A roulette wheel is a game found in many casinos. Let’s go through it and understand how the
roulette wheel's design ensures that, in the long run, the player always loses and the house
always wins.

This is a European roulette wheel.


Figure 7 - European Roulette Wheel

It contains the numbers 0 to 36 written in an irregular sequence. The players can bet on any
number from 0 to 36. For example, let’s say that Kriti bets £100 on the number 5. Now, a ball
would be dropped into the wheel, which is then given a spin. If the ball lands on the pocket
marked 5, Kriti would win (£100)*36 = £3,600, resulting in net winnings of £3,600 - £100 =
£3,500. However, if the ball lands on any other pocket, Kriti would not win anything, resulting in
net winnings of £0 - £100 = - £100.

Let’s see what the expected value is for Kriti’s net winnings if she plays this game and bets £100
on the number 5.
This is an American roulette wheel.
The only difference between the American roulette wheel and the one you saw before, i.e. the
European roulette wheel, is that, in addition to the numbers 0 to 36, it contains a number ‘00’.
The player can bet on any number from 1 to 36.
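
The expected net winnings on both wheels can be sketched as follows. The sketch assumes every pocket is equally likely and that the American wheel pays the same 36 times the bet as the European one, since the number of pockets is described as the only difference. The function name is illustrative.

```python
from fractions import Fraction

def expected_net_winnings(bet, pockets, payout_multiple=36):
    """Expected net winnings of a single-number bet on a wheel with `pockets` pockets."""
    p_win = Fraction(1, pockets)
    net_if_win = bet * payout_multiple - bet
    net_if_lose = -bet
    return net_if_win * p_win + net_if_lose * (1 - p_win)

european = expected_net_winnings(100, pockets=37)  # numbers 0 to 36
american = expected_net_winnings(100, pockets=38)  # numbers 0 to 36, plus 00
print(f"European: £{float(european):.2f}, American: £{float(american):.2f}")
```

Both values are negative (-100/37 and -100/19 pounds per £100 bet), and the extra pocket on the American wheel makes the player's expected loss larger.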

If you want to read more on this topic, specifically, how other games' designs ensure a negative
expected value, go through this link.
