PGP Bootcamp - Statistics

April 2019
Abhishek Rishabh
ISB
Topics
• Basic Probability Theory

• Descriptive Statistics

• Normal Distribution

• Relationship between variables


Basic Probability Theory
Introducing Random Variable
• A random variable is a variable that takes on one of multiple
different values, each occurring with some probability.

• Example –
• Toss a coin. – What happens?
• You can either get heads or tails.
• What about the probability of getting heads or tails?
• How do you know that?
Examples
• Suppose you roll a die. – What happens?

• What is the random variable here?


• What do you see after the die roll?

• What is the probability of getting any value?

• How do you know that?


Examples
• Suppose you roll a die and toss a coin together. – What
happens?

• What is the random variable here?


• What do you see after the coin toss and die roll?

• What is the probability of getting any value?

• How do you know that?


Some terms used in probability theory
• Random Experiment - Any experiment or a process whose
outcome cannot be predicted with certainty.

• Basic outcome – a possible outcome of a random experiment.

• Sample Space - The collection of all possible outcomes of a
random experiment.

• Event – A subset of the sample space. The set of outcomes we are
interested in.
Example of terms
• Tossing a coin: (Random Experiment)
• Basic outcomes = {H}, {T}
• Sample space = {H, T}
• Event: Getting a tail={T}

• Rolling a die: (Random Experiment)


• Basic outcomes = {1}, {2}, {3}, {4}, {5}, {6}
• Sample space = {1,2,3, 4, 5, 6}
• Event: Getting an odd number = {1, 3, 5}
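
The die example can be sketched in Python. This is only an illustration: it assumes a fair die, so every basic outcome is equally likely, and the names `sample_space`, `event` and `probability` are made up for this sketch.

```python
from fractions import Fraction

# Sample space for one roll of a die (assumed fair).
sample_space = {1, 2, 3, 4, 5, 6}

# Event: getting an odd number, a subset of the sample space.
event = {x for x in sample_space if x % 2 == 1}

def probability(event, sample_space):
    """P(event) when all basic outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

print(sorted(event))                     # [1, 3, 5]
print(probability(event, sample_space))  # 1/2
```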
Real life example
• Random Experiment: Who is going to die in Avengers Infinity War movie?

• Sample Space: { No one dies,
Only Iron Man dies,
Only Captain America dies,
Only Vision dies,
Iron Man and Captain America die,
Iron Man and Vision die,
Captain America and Vision die,
All three die }

• Basic outcomes: {}, {(I)}, {(C)}, {(V)}, {(I,C)}, {(I,V)}, {(C,V)}, {(I,C,V)}

• Event: Iron Man dies – {(I), (I,C), (I,V), (I,C,V)}


Types of random variables
• A random variable can be either a discrete random variable or a continuous random variable.
• A discrete random variable can only assume values that are distinct and separate, i.e. it takes countable values. Only some values are meaningful; the gaps between possible values are not just a matter of measurement precision.

• A continuous r.v. can take any value within some interval of numbers. It is measured, not counted.

• Discrete random variables:


• No. of new born children per year in Delhi
• No. of new leprosy cases per year in India

• Continuous random variables:


• Height, Weight, Temperature etc.
Probability Distribution: Discrete and
Continuous
• A probability distribution is a rule that identifies possible outcomes of a
random variable and assigns a probability to each outcome.

• A discrete distribution has a finite (or countably infinite) number of possible values.


• e.g. face value of a card

• A continuous distribution has all possible values in some range.


• e.g. sales per month, height of students in this class

• Continuous distributions are nicer to deal with and are good approximations
when there are a large number of possible values
Probability Mass Function (PMF)
• What is it?

• Probability Mass Function (PMF)


• A set of probability values p(x), one assigned to each of the values x taken by the
discrete random variable

• 0 ≤ p(x) ≤ 1 and Σ p(x) = 1 (summing over all x)

• Probability: P(X = x) = p(x)
Probability Mass Function(PMF)
• Example – Rolling a die.

x      p(x)
1      p(x=1) = 1/6
2      p(x=2) = 1/6
3      p(x=3) = 1/6
4      p(x=4) = 1/6
5      p(x=5) = 1/6
6      p(x=6) = 1/6
Total  1.0

[Figure: bar chart of p(x) against x = 1, ..., 6; every bar has height 1/6]

$\sum_{\text{all } x} P(x) = 1$
Cumulative Distribution Function(CDF)
• What is it?

• F(x) accumulates all of the probability less than or equal to x


Cumulative Distribution Function(CDF)
• Example – Rolling a die.

x   Prob(x≤A)
1   Prob(x≤1) = 1/6
2   Prob(x≤2) = 2/6
3   Prob(x≤3) = 3/6
4   Prob(x≤4) = 4/6
5   Prob(x≤5) = 5/6
6   Prob(x≤6) = 6/6

[Figure: step plot of the cumulative probability against x = 1, ..., 6, rising from 1/6 to 1.0]
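
A small Python sketch of the PMF and CDF tables for the die, assuming a fair die. The variable names are illustrative; exact fractions are used so the running totals match the table (note that Python prints 2/6 in lowest terms as 1/3).

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]

# PMF: p(x) = 1/6 for every face of a fair die.
pmf = {x: Fraction(1, 6) for x in values}

# CDF: F(x) = P(X <= x), the running total of the PMF.
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

# The probabilities over all x must sum to 1.
assert sum(pmf.values()) == 1

for x in values:
    print(x, pmf[x], cdf[x])
```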
Example
• The number of patients seen in the ER in any given hour is a
random variable represented by x. The probability distribution
for x is:
x        10   11   12   13   14
Prob(x)  .4   .2   .2   .1   .1

Find the probability that in a given hour:

a. Exactly 14 patients arrive: Prob(x=14) = .1
b. At least 12 patients arrive: Prob(x≥12) = (.2 + .1 + .1) = .4
c. At most 11 patients arrive: Prob(x≤11) = (.4 + .2) = .6
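
The three ER answers can be checked by summing the PMF over the values that satisfy each condition. The helper `prob` below is a hypothetical name used only for this sketch.

```python
# PMF of the number of patients per hour, taken from the table.
pmf = {10: 0.4, 11: 0.2, 12: 0.2, 13: 0.1, 14: 0.1}

def prob(condition):
    """Sum p(x) over the values x that satisfy the condition."""
    return sum(p for x, p in pmf.items() if condition(x))

exactly_14 = prob(lambda x: x == 14)
at_least_12 = prob(lambda x: x >= 12)
at_most_11 = prob(lambda x: x <= 11)

# Rounded to hide floating-point noise.
print(round(exactly_14, 2), round(at_least_12, 2), round(at_most_11, 2))
```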
Probability Density Function (pdf)
• The first way to describe a continuous r.v. is in
terms of its pdf, which is usually denoted by f(x).

• The pdf of a r.v. X has the following two
characteristics:
a) The total area lying under the pdf curve is
equal to 1.
b) The probability that X lies between two
given values a and b is equal to the
area under the curve between a and b.
Set Theory
• A set is a collection of objects, which are the elements of the set. If S is a
set and x is an element of S, we write x ∈ S. If x is not an element of S, we write x ∉ S.

• A set can have no elements, in which case it is called the empty set,
denoted by ∅.

• Sets can be specified in a variety of ways

• For example, the set of all possible outcomes of a die roll is {1, 2, 3, 4, 5, 6}
Set Theory
• The complement of a set A is denoted by A’

• e.g. S={1,2,3,4,5,6,7,8} , A={1,4,7} , A’=?

• The intersection of A and B is denoted by A ∩ B

• If S={1,2,3,4,5,6,7,8} , A ={4,5,6,7,8} ,
B={1,4,6,7},
A ∩ B = ?
Venn Diagrams and Basic Set Operations
Example
• Toss a coin three times. What is the probability of at least two heads?

• There are 8 possible outcomes which, if the coin is unbiased, should all be equally likely:
HHH, HHT, HTH, THH, HTT, THT, TTH, TTT

• Two or more heads result from 4 of these outcomes: HHH, HHT, HTH, THH.
• The probability of two or more heads is, therefore:
Probability = 4/8 = 1/2
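
The counting argument can be reproduced by enumerating the outcomes. This sketch assumes an unbiased coin, so all 8 outcomes are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All sequences of three tosses: ('H','H','H'), ('H','H','T'), ...
outcomes = list(product("HT", repeat=3))

# Event: at least two heads.
favourable = [o for o in outcomes if o.count("H") >= 2]

p_at_least_two_heads = Fraction(len(favourable), len(outcomes))
print(len(outcomes), len(favourable), p_at_least_two_heads)  # 8 4 1/2
```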
Probability and Set Theory
• If A and B are events then

• Probability of A

A B

P(A) = Blue Area ÷ Total Area


Probability and Set Theory
• If A and B are events then

• Probability of A ∪ B

[Venn diagram: overlapping events A and B]

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)


Question
• The manager of a factory claims that among his 400 employees:
‾ 312 got a pay rise last year
‾ 248 got increased pension benefits last year
‾ 173 got both pension benefits and pay rise last year
‾ 13 got neither

• Using last year’s figures as your guide to this year’s prospects,
calculate the probability of:

a)Getting a pay rise


b)Not getting a pay rise
c)Getting both a pay rise and pension benefits
d)Getting no pay rise or benefit increase
e)Getting a pay rise or benefits
Answer
Let A ~ Pay rise, B ~ Benefits

[Venn diagram: A only = 139, A and B = 173, B only = 75, neither = 13]

P(A) = (139+173) ÷ (139+173+75+13) = 312/400 = 0.78

P(not A) = P(A’) = 1 - 312/400 = 0.22

P(A ∩ B) = 173/400 = 0.4325 - pay rise and benefits

P(A’ ∩ B’) = 13/400 = 0.0325 - no rise or benefits

P(A ∪ B) = 1 - 13/400 = 387/400 = 0.9675 - pay rise or benefits
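
As a cross-check, the five probabilities can be computed directly from the counts given in the question. This is a sketch with illustrative variable names; it also verifies that the four regions of the Venn diagram add up to the 400 employees.

```python
from fractions import Fraction

total = 400
rise, benefits, both, neither = 312, 248, 173, 13

# Rise only + both + benefits only + neither must equal the total.
assert (rise - both) + both + (benefits - both) + neither == total

p_rise = Fraction(rise, total)                            # a) pay rise
p_not_rise = 1 - p_rise                                   # b) no pay rise
p_both = Fraction(both, total)                            # c) rise and benefits
p_neither = Fraction(neither, total)                      # d) neither
p_either = p_rise + Fraction(benefits, total) - p_both    # e) rise or benefits

for label, p in [("a", p_rise), ("b", p_not_rise), ("c", p_both),
                 ("d", p_neither), ("e", p_either)]:
    print(label, float(p))
```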


Conditional Probability
• These are the probabilities calculated on the basis that something
has already happened

• For example :

—The probability that I will pay my electricity bill given that I have
just been paid.

• If these two events are A and B, then they are not INDEPENDENT, and
we write P(A|B), read as P(A given B)
Conditional Probability
If B has already happened then our event must be somewhere in B

A B
BUT, How can A happen if our event must be in the B space ?

We can only be in the following Space on our Venn Diagram


A B

And so our probability P(A|B) is the ratio Green Space ÷ Red Space:

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
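
A minimal sketch of the conditional probability ratio, using a fair die as a hypothetical example that is not in the slides: A is "even number" and B is "greater than 3".

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even number (assumed example event)
B = {4, 5, 6}   # greater than 3 (assumed example event)

def p(event):
    """Probability of an event when outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# P(A|B) = P(A and B) / P(B)
p_a_given_b = p(A & B) / p(B)
print(p_a_given_b)  # 2/3
```

Once B is known to have happened, only the outcomes 4, 5 and 6 remain possible, and two of them are even.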
Descriptive Statistics
Descriptive Statistics – Meaning
• Descriptive statistics are used to describe the basic features of
the data in a study. They provide simple summaries about the
sample and the measures. Together with simple graphics
analysis, they form the basis of virtually every quantitative
analysis of data.
Descriptive Statistics – Example

Data File: GMAT.jmp


Descriptive Statistics
• Measures of central tendency: Mean, Median, Mode

• Measures of Spread: Range, Variance, Standard Deviation

• Measures of relative position: Percentile Score


• Mean= =
• Variance =2 =
• Standard Deviation =  =
• percentile value of x = (number of values less than or equal to x / total
number of values) * 100
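
These measures can be tried out with Python's standard `statistics` module. The list `scores` is hypothetical data invented for this sketch, and the population forms of variance and standard deviation (dividing by N) are assumed; `percentile_of` is an illustrative helper that follows the percentile definition above.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

mean = statistics.mean(scores)            # 5
median = statistics.median(scores)        # 4.5
mode = statistics.mode(scores)            # 4
variance = statistics.pvariance(scores)   # population variance: 4
std_dev = statistics.pstdev(scores)       # population SD: 2.0
value_range = max(scores) - min(scores)   # 7

def percentile_of(x, data):
    """Percentage of values less than or equal to x."""
    return 100 * sum(v <= x for v in data) / len(data)

print(mean, median, mode, variance, std_dev, value_range)
print(percentile_of(5, scores))  # 75.0
```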
Histograms
• A histogram is used to summarize discrete or continuous data.
In other words, it provides a visual interpretation of numerical
data by showing the number of data points that fall within a
specified range of values (called “bins”).
Histograms- Example
Boxplot
• Box plots (also called box-whisker plots) give a good
graphical image of the concentration of the data. They also
show how far the extreme values are from most of the data. A
box plot is constructed from five values: the minimum value, the
first quartile, the median, the third quartile, and the maximum
value.
Boxplot – Example
• Data set {3, 7, 8, 5, 12, 14, 21, 13, 18}.

• Minimum: 3, Q1 : 6, Median: 12, Q3 : 16, and Maximum: 21.

• The box part represents the interquartile range and contains
approximately the middle 50% of all the data
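
The five values for this data set can be reproduced with `statistics.quantiles` (Python 3.8+). Quartile conventions differ between tools; this sketch assumes the "exclusive" method, which matches the Q1 and Q3 quoted above.

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

# n=4 splits the data into quartiles and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)   # (3, 6.0, 12.0, 16.0, 21)
print("IQR:", q3 - q1)       # IQR: 10.0
```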
Expected Value of a Random Variable
• Expected value is just the average or mean (µ) of random
variable x.

• It’s also how we expect X to behave on-average over the long run

• Example: Your apartment can give you a rent of 30000 with
probability 0.2, 25000 with probability 0.4 and 20000 with
probability 0.4. What is the expected value of the rent from your
apartment?
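
A short sketch of the rent calculation: multiply each rent by its probability and add the results. The dictionary name `rent_pmf` is illustrative.

```python
# Rent values and their probabilities, from the example.
rent_pmf = {30000: 0.2, 25000: 0.4, 20000: 0.4}

# E(X) = sum over x of x * prob(x)
expected_rent = sum(rent * p for rent, p in rent_pmf.items())

# 30000*0.2 + 25000*0.4 + 20000*0.4 = 6000 + 10000 + 8000
print(round(expected_rent, 2))  # 24000.0
```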
Computing mean and variance for Discrete
Random Variables
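• X is a discrete random variable. Assume we know the probabilities Prob(X=x) for all values x. Call this prob(x).
• Mean (Expected Value):

$\mu = E(X) = \sum_{x} x \cdot \text{prob}(x)$

• Standard Deviation:
• First compute the variance:

$\sigma^2 = E[(X-\mu)^2] = \sum_{x} (x-\mu)^2 \cdot \text{prob}(x) = E(X^2) - E(X)^2$

• And then take the square root:

$\sigma = \sqrt{\sigma^2}$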
Question
• Linda is a sales associate at a large auto dealership. At her commission rate of
25% of gross profit on each vehicle she sells, Linda expects to earn $350 for
each car sold and $400 for each truck or SUV sold. Linda motivates herself by
using probability estimates of her sales. For a sunny Saturday in April, she
estimates her car and truck sales as follows:
Cars Sold          0     1     2     3
Probability        0.3   0.4   0.2   0.1

Trucks/SUV Sold    0     1     2
Probability        0.4   0.5   0.1

• What is Linda’s expected income?


Answer
• Linda’s expected income from cars =
0.3*0 + 0.4*350 + 0.2*700 + 0.1*1050 = $385

• Linda’s expected income from trucks/SUVs =
0.4*0 + 0.5*400 + 0.1*800 = $280

Linda’s total expected income = $385 + $280 = $665
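
The same answer in Python, as a sketch: compute the expected number of each vehicle type sold, then multiply by the earnings per vehicle. The helper `expected_value` and the variable names are illustrative. Adding the two expected incomes relies on E[X + Y] = E[X] + E[Y], which holds whether or not car and truck sales are independent.

```python
def expected_value(pmf):
    """E(X) = sum of x * prob(x) for a discrete PMF given as a dict."""
    return sum(x * p for x, p in pmf.items())

cars_pmf = {0: 0.3, 1: 0.4, 2: 0.2, 3: 0.1}
trucks_pmf = {0: 0.4, 1: 0.5, 2: 0.1}

income_per_car = 350
income_per_truck = 400

expected_car_income = income_per_car * expected_value(cars_pmf)
expected_truck_income = income_per_truck * expected_value(trucks_pmf)
expected_income = expected_car_income + expected_truck_income

print(round(expected_car_income, 2),
      round(expected_truck_income, 2),
      round(expected_income, 2))
```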


Computing mean and variance for
Continuous Random Variables
• It’s a bit more complex than for discrete random variables.

• The expected value of a continuous random variable X is defined by

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$

• Variance is given by:

$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - E(X))^2 f(x)\,dx$
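
To make the integrals concrete, here is a rough numerical sketch. The pdf f(x) = 2x on [0, 1] is a hypothetical example chosen for illustration, and `integrate` is a simple midpoint-rule helper written for this sketch, not a library function. For this pdf the exact values are E(X) = 2/3 and Var(X) = 1/18.

```python
def integrate(g, a, b, n=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def f(x):
    """Hypothetical pdf: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2 * x

total_area = integrate(f, 0, 1)                                # should be 1
mean = integrate(lambda x: x * f(x), 0, 1)                     # E(X)
variance = integrate(lambda x: (x - mean) ** 2 * f(x), 0, 1)   # Var(X)

print(round(total_area, 4), round(mean, 4), round(variance, 4))
```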


Normal Distribution
Normal Distribution (Bell Curve)
• What is it?

• Loose definition – A probability distribution that is symmetric
about the mean and has data near the mean more frequent
than data away from the mean.

• Looks like a bell, therefore also called a bell curve.


Normal Distribution (Bell Curve)
• What does it look like?
Normal Distribution (Bell Curve)
• What is ‘normal’ about the normal distribution?

• Early statisticians noticed the same shape coming up over and over again in different
distributions, so they named it the normal distribution.

• A lot of things can be represented using the normal distribution, such as the height of students.

• https://galtonboard.com/probabilityexamplesinlife
Normal Distribution (Bell Curve)
• Some properties of normal distributions

• Normal distributions are symmetric around their mean.

• The mean, median, and mode of a normal distribution are equal.

• The area under the normal curve is equal to 1.0.

• Normal distributions are denser in the center and less dense in the tails.

• Normal distributions are defined by two parameters, the mean (μ) and the standard
deviation (σ).
Probability Calculations for the Normal
“Model”
• The probability associated with any single value of the random variable is Zero. Why?
• Probability of values being in a range = Area under the pdf curve in that range

P(x1 ≤ X ≤ x2) = P(X≤ x2) - P(X≤ x1)

• Area under the entire curve = P(- ∞ ≤ X≤ + ∞) = 1


• How to calculate P(X≤ x) ? Perhaps a table? Single table for all normal distributions?
Standard Normal
• Standard Normal Distribution:

• A r.v. Z follows the standard normal distribution
if Z is normally distributed with mean μ = 0 and std
dev σ = 1.0. In this case we write:

Z ~ N(0, 1)

• We usually reserve the symbol Z for a r.v. that obeys the standard
normal distribution; we can also call it a standard normal
r.v.
Z-scores, Standard Normal Distribution
• For every value (x) of the random variable X, we
can calculate its Z-score:

$z = \frac{x - \mu}{\sigma}$

• Interpretation – How many standard deviations
away is the value from the mean?

• If X~N(μ, σ²), then

$Z = \frac{X - \mu}{\sigma}$

• Z-scores have a normal distribution with μ=0
and σ=1
i.e. Z ~ N(0,1)
• Standard Normal Distribution

• P(X ≤ x) = P(Z ≤ z)
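
The identity P(X ≤ x) = P(Z ≤ z) can be checked numerically with `statistics.NormalDist` (Python 3.8+). The values μ = 100, σ = 15 and x = 130 are hypothetical numbers chosen for this sketch.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mu) / sigma

mu, sigma = 100, 15   # hypothetical parameters of X
x = 130

z = z_score(x, mu, sigma)                 # 2.0
p_via_x = NormalDist(mu, sigma).cdf(x)    # P(X <= x) under N(mu, sigma^2)
p_via_z = NormalDist().cdf(z)             # P(Z <= z) under N(0, 1)

print(z, round(p_via_x, 4), round(p_via_z, 4))
```

Both probabilities agree, which is why a single table for the standard normal is enough for every normal distribution.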
Z Table
• What is it?

• Table that represents cumulative probabilities of the standard
normal distribution.

• http://www.z-table.com/
Z Table
• Find P(Z < 2.13)

• Find P(Z> 2.14)

• Find P(1<Z<1.67)
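
Instead of reading a printed table, the three exercises can be checked with the standard normal CDF from `statistics.NormalDist` (Python 3.8+). This is a sketch for verifying table look-ups, and the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

p_less_than_2_13 = Z.cdf(2.13)             # P(Z < 2.13)
p_greater_than_2_14 = 1 - Z.cdf(2.14)      # P(Z > 2.14) = 1 - P(Z <= 2.14)
p_between = Z.cdf(1.67) - Z.cdf(1.0)       # P(1 < Z < 1.67)

print(round(p_less_than_2_13, 4),
      round(p_greater_than_2_14, 4),
      round(p_between, 4))
```

A Z table gives roughly 0.9834, 0.0162 and 0.1112 for these three quantities.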
Relationship between variables
Relationship - Meaning
• Relationship means how changing one variable affects other
variables.

• Consider, two variables price and sales.

• What ‘generally’ happens to sales when we increase price of a good?

• What ‘generally’ happens to sales when we decrease price of a good?


Dimensions of relationship
• Direction of relationship

• Form of relationship

• Strength of relationship

• What is the direction, form and strength of relationship in the sales and
price relationship example?

• Generally you can’t answer, until you have data.


Dimensions of relationship
Example – Sales v/s Ad Exp.
Ad Exp  Sales    Ad Exp  Sales
5       46       8       52
6       62       2       15
2       9        1       10
6       48       3       7
13      66       2       5
11      179      1       6
12      120      2       3
13      59       8       80
14      170      9       75
6       54       11      125
6       54       12      100
7       16       13      60

[Figure: scatter plot of Sales (vertical axis, 0 to 200) against Ad Exp (horizontal axis, 0 to 15)]
Covariance
• Covariance can be used to find the direction of relationship
between two variables but not the strength.

• Random variables with zero covariance are called uncorrelated

• All independent variables are uncorrelated
• Not all uncorrelated variables are independent

• It is difficult to establish the strength of the relationship using
covariance because it depends on the unit of measurement
Covariance – Math Formula

$\mathrm{Cov}(X,Y) = S_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$

Xi – the values of the X-variable

Yi – the values of the Y-variable

X̄ – the mean of the X-variable
Ȳ – the mean of the Y-variable
n – the number of the data points
Covariance - Example
mean of x is (98+87+90+85+95+75)/6 = 88.33.
mean of y is (15+12+10+10+16+7)/6 = 11.67.

X    Y
98   15
87   12
90   10
85   10
95   16
75   7

The sum of the products (Xi - X̄)(Yi - Ȳ) over the six data points is 125.67.

The final step is to divide by (n-1) = 6 - 1 = 5.

125.67/5 = 25.13

This is positive. Hence, X and Y have a positive relationship.
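
The covariance for this data can be recomputed step by step, dividing by n - 1 as in the formula. The variable names are illustrative. Carried at full precision, the sum of products is about 125.67 and the covariance about 25.13.

```python
x = [98, 87, 90, 85, 95, 75]
y = [15, 12, 10, 10, 16, 7]
n = len(x)

x_bar = sum(x) / n   # about 88.33
y_bar = sum(y) / n   # about 11.67

# Sum of products of deviations from the means.
sum_of_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample covariance: divide by n - 1.
covariance = sum_of_products / (n - 1)

print(round(x_bar, 2), round(y_bar, 2),
      round(sum_of_products, 2), round(covariance, 2))
```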
Correlation
• Correlation tells us the strength and direction of the linear relationship between variables.
• Correlation is dimensionless because it is standardized using standard
deviations. (Formula on next slide)

• It always takes a value between -1 and 1


• Close to +1 implies a linear relationship with a positive slope
• Close to -1 implies a linear relationship with a negative slope
• Close to 0 implies that there is no linear relationship

• Correlation captures the association between two variables at a time


Correlation - Formula

$\mathrm{Correl}(X,Y) = r_{XY} = \frac{\mathrm{Cov}(X,Y)}{SD(X) \cdot SD(Y)} = \frac{S_{XY}}{S_X S_Y}$

Cov(X,Y) – Covariance between X and Y

SD(X) – standard deviation of X

SD(Y) – standard deviation of Y
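
A sketch of the correlation formula, reusing the six (X, Y) pairs from the covariance example as input. It uses `statistics.stdev`, the sample standard deviation with an n - 1 divisor, so that it matches the n - 1 in the covariance; the variable names are illustrative.

```python
import statistics

x = [98, 87, 90, 85, 95, 75]
y = [15, 12, 10, 10, 16, 7]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# Sample covariance S_XY.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations S_X and S_Y.
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

r_xy = s_xy / (s_x * s_y)
print(round(r_xy, 3))   # about 0.912: a strong positive linear relationship
```

Unlike the covariance, this value does not change if X or Y is rescaled to a different unit of measurement.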


Correlation - Example
Correlation - Example

NO Correlation!
Linear Combination of Random Variables
• Construct W = aX + bY; a, b are any constants; X, Y are two random variables
• Mean: E[W] = aE[X] + bE[Y]
• Variance: Var[W] = a²Var[X] + b²Var[Y] + 2abCov[X,Y]

• The variance of the combination increases or decreases depending on the sign
of the covariance term (+ or -)

• If X and Y are independent then Var[W] = a²Var[X] + b²Var[Y]

• If X and Y are independent then Var[X+Y] = Var[X] + Var[Y]


Example
• A manufacturer supplies bottles to two wholesalers A and B. The number
of bottles ordered by A varies normally with mean 10000 and standard
deviation 3000. The number of bottles ordered by B varies normally with
mean 15000 and standard deviation 4000. Given that the orders placed
by A and B are uncorrelated, what are the mean and the standard
deviation of the demand faced by the manufacturer?
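
One way to work this example is to apply the linear combination formulas with a = b = 1 and Cov[X,Y] = 0. The function `linear_combination` is a hypothetical helper written for this sketch.

```python
import math

def linear_combination(a, mean_x, sd_x, b, mean_y, sd_y, cov_xy=0.0):
    """Mean and standard deviation of W = aX + bY."""
    mean_w = a * mean_x + b * mean_y
    var_w = a**2 * sd_x**2 + b**2 * sd_y**2 + 2 * a * b * cov_xy
    return mean_w, math.sqrt(var_w)

# Total demand = orders from A + orders from B, uncorrelated (cov = 0).
mean_demand, sd_demand = linear_combination(1, 10000, 3000, 1, 15000, 4000)
print(mean_demand, sd_demand)   # 25000 5000.0
```

The standard deviations do not simply add (3000 + 4000 = 7000): it is the variances that add, giving sqrt(3000² + 4000²) = 5000.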
