
Statistics for Business and Economics
Module 1: Probability Theory and Statistical Inference
Spring 2010
Lecture 3: Continuous probability distributions
Priyantha Wijayatunga, Department of Statistics, Umeå University
priyantha.wijayatunga@stat.umu.se
These materials are adapted from copyrighted lecture slides (© 2009 W.H. Freeman and Company) from the homepage of the book:
The Practice of Business Statistics: Using Data for Decisions, Second Edition,
by Moore, McCabe, Duckworth and Alwan.

Continuous probability distributions

Probability density

Uniform probability density

Normal distributions, standard normal distribution

Law of large numbers

Sampling distributions

The mean and standard deviation of sample mean

The central limit theorem

Recall: Discrete Probability Distributions

Let X denote the number of days a student comes to class (in a week).
The probability distribution is

$$P(X = x) = p(x) = \begin{cases} 0.1 & \text{if } x = 1 \\ 0.2 & \text{if } x = 2 \\ 0.2 & \text{if } x = 3 \\ 0.3 & \text{if } x = 4 \\ 0.2 & \text{if } x = 5 \end{cases}$$

Then:
1) What is the probability that a student comes to class more than 3 days?
2) What is the probability that a student comes to class 2 or 3 days?
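
Both questions reduce to adding the probabilities of the outcomes that make up the event. A minimal sketch, assuming the five listed probabilities belong to x = 1, ..., 5 in that order (the variable names are illustrative):

```python
# Probability mass function from the slide: days attended -> probability.
pmf = {1: 0.1, 2: 0.2, 3: 0.2, 4: 0.3, 5: 0.2}

# A valid discrete distribution sums to 1 (up to floating-point rounding).
total = sum(pmf.values())

# 1) More than 3 days: add the probabilities of x = 4 and x = 5.
p_more_than_3 = sum(p for x, p in pmf.items() if x > 3)

# 2) Exactly 2 or 3 days: add the probabilities of x = 2 and x = 3.
p_2_or_3 = pmf[2] + pmf[3]

print(round(total, 10), round(p_more_than_3, 10), round(p_2_or_3, 10))
```

Under that reading of the table, the answers are 0.3 + 0.2 = 0.5 and 0.2 + 0.2 = 0.4.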

Continuous Probability Distributions

A continuous random variable X takes all values in an interval.
Example: There is an infinity of numbers between 0 and 1 (e.g., 0.001, 0.4, 0.0063876).

The probability distribution of a continuous random variable is described by a density curve (also called a density function or probability density).
The probability of any event is the area under the density curve for the values of X that make up the event.

[Figure: uniform density curve for the variable X.]
The probability that X falls between 0.3 and 0.7 is the area under the density curve for that interval:
$P(0.3 \le X \le 0.7) = (0.7 - 0.3) \times 1 = 0.4$

Density function:
$f(x) = 1$ for $0 \le x \le 1$
$f(x) = 0$ for $x < 0$ or $x > 1$

Intervals

All continuous probability distributions assign probability 0 to every individual outcome. Only intervals can have a positive probability, represented by the area under the density curve for that interval.

The probability of a single value is zero (the uniform density curve has height 1):
$P(X = 1) = (1 - 1) \times 1 = 0$

The probability of an interval is the same whether boundary values are included or excluded:
$P(0 \le X \le 0.5) = (0.5 - 0) \times 1 = 0.5$
$P(0 < X < 0.5) = (0.5 - 0) \times 1 = 0.5$
$P(0 \le X < 0.5) = (0.5 - 0) \times 1 = 0.5$

$P(X < 0.5 \text{ or } X > 0.8) = P(X < 0.5) + P(X > 0.8) = 1 - P(0.5 \le X \le 0.8) = 0.7$

Assigning Probabilities: intervals of outcomes

A sample space may contain all numbers within a range.

For continuous outcomes, the probability model is a density curve.

The area under the entire density curve is equal to 1.

The probability model assigns probabilities as areas under the density curve.

Assigning Probabilities: intervals of outcomes

If all possible outcomes are equally likely (for example, every value from 0 to 1 is equally likely), the model is the uniform density curve (uniform probability distribution) on [0, 1].

Probabilities are computed as areas:
$P(0.3 \le X \le 0.7) = 0.4$
Similarly, $P(X < 0.5 \text{ or } X > 0.8) = 0.5 + 0.2 = 0.7$

General uniform probability distribution

If the outcomes are equally likely for any value between two numbers a and b (the random variable X can take any value between a and b), where a < b, then the probability density of X is

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

Ex: The number of minutes that a student takes to solve a math problem is known to be any number between 10 and 20, with equal chances.
Find the probability that a student takes more than 6 but less than 12 minutes to solve a given math problem.
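
For a uniform density, a probability is the length of the part of the interval that lies inside [a, b], multiplied by the height 1/(b - a). The sketch below is one way to write that down; the helper name `uniform_prob` is hypothetical. In the math-problem example only the stretch from 10 to 12 minutes contributes, because the density is 0 below 10.

```python
def uniform_prob(lo: float, hi: float, a: float, b: float) -> float:
    """P(lo < X < hi) for X uniform on [a, b]: overlap length times height 1/(b - a)."""
    if not a < b:
        raise ValueError("need a < b")
    # Clip the requested interval to [a, b]; the density is 0 outside it.
    left = max(lo, a)
    right = min(hi, b)
    overlap = max(0.0, right - left)
    return overlap / (b - a)

# Math-problem example: X uniform on [10, 20], interval (6, 12).
p_minutes = uniform_prob(6, 12, 10, 20)

# Unit-interval example: X uniform on [0, 1], interval (0.3, 0.7).
p_unit = uniform_prob(0.3, 0.7, 0, 1)

print(p_minutes, p_unit)
```

The first probability works out to (12 - 10)/(20 - 10) = 0.2, and the second to 0.4 up to floating-point rounding.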

Continuous random variable and population distribution

The shaded area under a density curve shows the proportion, or %, of individuals in a population with values of X between $x_1$ and $x_2$.

Because the probability of drawing one individual at random depends on the frequency of this type of individual in the population, the probability is also the shaded area under the curve.

[Figure: density curve with the shaded area showing the % of individuals with X such that $x_1 < X < x_2$.]

Normal probability models

Normal probability models look like:
[Figure: Normal density curve.]

The scores of students on the ACT college entrance examination in a recent year had the normal distribution with mean $\mu = 18.6$ and standard deviation $\sigma = 5.9$.
What is the probability that a randomly chosen student scores 21 or higher?
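
The ACT question can be checked numerically once the cutoff is standardized. This is a sketch using `statistics.NormalDist` from the Python standard library in place of a printed table; the variable names are illustrative.

```python
from statistics import NormalDist

# ACT scores modelled as Normal with mean 18.6 and standard deviation 5.9.
act = NormalDist(mu=18.6, sigma=5.9)

# Standardize the cutoff, as a table lookup would require.
z = (21 - act.mean) / act.stdev

# P(X >= 21) is the area to the right: 1 minus the area to the left.
p_21_or_higher = 1 - act.cdf(21)

print(f"z = {z:.2f}, P(X >= 21) = {p_21_or_higher:.4f}")
```

The result is roughly 0.34, so about a third of students score 21 or higher under this model.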

Normal probability distributions

The probability distribution of many random variables is a normal distribution. It shows what values the random variable can take and is used to assign probabilities to those values.

Example: Probability distribution of women's heights.
Here, since we chose a woman randomly, her height, X, is a random variable.

To calculate probabilities with the normal distribution, we will standardize the random variable (z-score) and use Table A.

Normal distributions

Normal or Gaussian distributions are a family of symmetrical, bell-shaped density curves defined by a mean $\mu$ (mu) and a standard deviation $\sigma$ (sigma): $N(\mu, \sigma)$.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}$$

$e = 2.71828\ldots$, the base of the natural logarithm
$\pi = 3.14159\ldots$

A family of density curves

Here the means are the same ($\mu = 15$) while the standard deviations are different ($\sigma = 2$, 4, and 6).

Here the means are different ($\mu = 10$, 15, and 20) while the standard deviations are the same ($\sigma = 3$).

The 68-95-99.7 rule

About 68% of all observations are within 1 standard deviation ($\sigma$) of the mean ($\mu$).

About 95% of all observations are within 2 $\sigma$ of the mean $\mu$.

Almost all (99.7%) observations are within 3 $\sigma$ of the mean.

[Figure: Normal curve with the inflection point marked, for mean $\mu = 64.5$ and standard deviation $\sigma = 2.5$, i.e. $N(\mu, \sigma) = N(64.5, 2.5)$.]
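
The three percentages in the rule are rounded areas under the standard Normal curve. A short sketch, again using the standard library's `NormalDist` (the helper name `within_k_sd` is hypothetical), computes them and confirms that the same proportion applies to N(64.5, 2.5):

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

def within_k_sd(k: float) -> float:
    """Area under N(0, 1) between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: within_k_sd(k) for k in (1, 2, 3)}
for k, area in coverage.items():
    print(f"within {k} sd: {area:.4f}")

# The same proportion holds for N(64.5, 2.5): the interval 64.5 +/- 2 * 2.5
# covers the same area as -2 to +2 on the standard scale.
heights = NormalDist(mu=64.5, sigma=2.5)
p_59_5_to_69_5 = heights.cdf(69.5) - heights.cdf(59.5)
```

The exact areas are close to 0.683, 0.954 and 0.997, which the rule rounds to 68%, 95% and 99.7%.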

The standard Normal distribution

Because all Normal distributions share the same properties, we can standardize our data to transform any Normal curve $N(\mu, \sigma)$ into the standard Normal curve $N(0, 1)$.

[Figure: the $N(64.5, 2.5)$ curve is transformed into the $N(0, 1)$ curve of standardized height (no units).]

For each x we calculate a new value, z (called a z-score).

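Standardizing: calculating z-scores

A z-score measures the number of standard deviations that a data value x is from the mean $\mu$.

$$z = \frac{x - \mu}{\sigma}$$

When x is 1 standard deviation larger than the mean, then z = 1:
for $x = \mu + \sigma$, $z = \dfrac{\mu + \sigma - \mu}{\sigma} = \dfrac{\sigma}{\sigma} = 1$

When x is 2 standard deviations larger than the mean, then z = 2:
for $x = \mu + 2\sigma$, $z = \dfrac{\mu + 2\sigma - \mu}{\sigma} = \dfrac{2\sigma}{\sigma} = 2$

When x is larger than the mean, z is positive.
When x is smaller than the mean, z is negative.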

Ex. Women's heights

Women's heights follow the $N(\mu, \sigma) = N(64.5, 2.5)$ distribution. What percent of women are shorter than 67 inches tall (that's 5'7")?

mean $\mu$ = 64.5"
standard deviation $\sigma$ = 2.5"
x (height) = 67"

[Figure: $N(64.5, 2.5)$ curve with the unknown area to the left of x = 67 shaded; $\mu = 64.5$ corresponds to z = 0 and x = 67 to z = 1.]

We calculate z, the standardized value of x:

$$z = \frac{x - \mu}{\sigma}, \qquad z = \frac{67 - 64.5}{2.5} = \frac{2.5}{2.5} = 1 \;\Rightarrow\; \text{1 standard deviation from the mean}$$

Because of the 68-95-99.7 rule, we can conclude that the percent of women shorter than 67" should be, approximately, 0.68 + half of (1 - 0.68) = 0.84, or 84%.

What is the probability, if we pick one woman at random, that her height will be some value X? For instance, between 68 and 70 inches, $P(68 < X < 70)$?
Because the woman is selected at random, X is a random variable.

$N(\mu, \sigma) = N(64.5, 2.5)$, with $z = \dfrac{x - \mu}{\sigma}$

As before, we calculate the z-scores for 68 and 70.

For x = 68", $z = \dfrac{68 - 64.5}{2.5} = 1.4$; the area to the left of z = 1.4 is 0.9192.

For x = 70", $z = \dfrac{70 - 64.5}{2.5} = 2.2$; the area to the left of z = 2.2 is 0.9861.

The area under the curve for the interval [68" to 70"] is 0.9861 - 0.9192 = 0.0669.
Thus, the probability that a randomly chosen woman falls into this range is 6.69%.
$P(68 < X < 70) = 6.69\%$
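
The interval calculation can be reproduced without a printed table. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for Table A; the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)  # women's heights in inches

# z-scores for the two endpoints.
z_68 = (68 - heights.mean) / heights.stdev
z_70 = (70 - heights.mean) / heights.stdev

# Area between the endpoints = area left of 70 minus area left of 68.
p_between = heights.cdf(70) - heights.cdf(68)

print(f"z = {z_68:.1f} and {z_70:.1f}, P(68 < X < 70) = {p_between:.4f}")
```

The computed area agrees with the table-based 0.0669 to about three decimal places.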

Using Table A

Table A gives the area under the standard Normal curve to the left of any z value.

[Figure: excerpt of Table A with three entries highlighted.]
0.0082 is the area under N(0,1) left of z = -2.40
0.0080 is the area under N(0,1) left of z = -2.41
0.0069 is the area under N(0,1) left of z = -2.46

Percent of women shorter than 67"

For z = 1.00, the area under the standard Normal curve to the left of z is 0.8413.

[Figure: $N(\mu, \sigma) = N(64.5, 2.5)$ curve with area of about 0.84 to the left and about 0.16 to the right of x = 67 (z = 1); $\mu = 64.5$.]

Conclusion:
84.13% of women are shorter than 67".
By subtraction, 1 - 0.8413, or 15.87% of women are taller than 67".

Tips on using Table A

Because the Normal distribution is symmetrical, there are 2 ways that you can calculate the area under the standard Normal curve to the right of a z value.

area right of z = area left of -z

area right of z = 1 - area left of z

[Figure: standard Normal curve with z = -2.33 marked; area = 0.0099 to its left and area = 0.9901 to its right.]
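
Both routes to a right-tail area give the same number, which a few lines of code can confirm. This is a sketch in which `area_left` is a hypothetical helper standing in for a Table A lookup:

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1), the distribution tabulated in Table A

def area_left(z: float) -> float:
    """Area under N(0, 1) to the left of z (what Table A lists)."""
    return std_normal.cdf(z)

z = -2.33
right_by_symmetry = area_left(-z)       # area right of z = area left of -z
right_by_complement = 1 - area_left(z)  # area right of z = 1 - area left of z

print(f"{right_by_symmetry:.4f} {right_by_complement:.4f}")
```

For z = -2.33, both expressions come to about 0.9901.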

Tips on using Table A

To calculate the area between 2 z-values, first get the area under N(0,1) to the left for each z-value from Table A.
Then subtract the smaller area from the larger area.

A common mistake made by students is to subtract the two z-values themselves, but the Normal curve is not uniform.

area between $z_1$ and $z_2$ = area left of $z_1$ - area left of $z_2$ (with $z_1 > z_2$)

The area under N(0,1) for a single value of z is zero.
(Try calculating the area to the left of z minus that same area!)

The National Collegiate Athletic Association (NCAA) requires Division I athletes to score at least 820 on the combined math and verbal SAT exam to compete in their first college year. The SAT scores of 2003 were approximately normal with mean 1026 and standard deviation 209.
What proportion of all students would be NCAA qualifiers (SAT $\ge$ 820)?

$x = 820$, $\mu = 1026$, $\sigma = 209$

$$z = \frac{x - \mu}{\sigma} = \frac{820 - 1026}{209} = \frac{-206}{209} \approx -0.99$$

Table A: the area under N(0,1) to the left of z = -0.99 is 0.1611, or approx. 16%.

area right of 820 = total area - area left of 820 = 1 - 0.1611 = 0.8389, or approx. 84%

Note: The actual data may contain students who scored exactly 820 on the SAT. However, the proportion of scores exactly equal to 820 is 0 for a normal distribution; this is a consequence of the idealized smoothing of density curves.

The NCAA defines a "partial qualifier", eligible to practice and receive an athletic scholarship but not to compete, as a student whose combined SAT score is at least 720.
What proportion of all students who take the SAT would be partial qualifiers? That is, what proportion have scores between 720 and 820?

$x = 720$, $\mu = 1026$, $\sigma = 209$

$$z = \frac{x - \mu}{\sigma} = \frac{720 - 1026}{209} = \frac{-306}{209} \approx -1.46$$

Table A: the area under N(0,1) to the left of z = -1.46 is 0.0721, or approx. 7%.

area between 720 and 820 = area left of 820 - area left of 720 = 0.1611 - 0.0721 = 0.0890, or approx. 9%

About 9% of all students who take the SAT have scores between 720 and 820.
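
The two SAT calculations can be reproduced in a few lines. The sketch below rounds z to two decimals to mimic a Table A lookup, then repeats the interval calculation without rounding; `table_a_area` is a hypothetical helper and `NormalDist` comes from the Python standard library.

```python
from statistics import NormalDist

sat = NormalDist(mu=1026, sigma=209)  # 2003 SAT scores, as given above
std_normal = NormalDist()             # N(0, 1), the Table A distribution

def table_a_area(x: float) -> float:
    """Area left of x, rounding z to 2 decimals the way a Table A lookup does."""
    z = round((x - sat.mean) / sat.stdev, 2)
    return std_normal.cdf(z)

left_820 = table_a_area(820)   # z = -0.99
left_720 = table_a_area(720)   # z = -1.46

p_qualifier = 1 - left_820       # scores of 820 or more
p_partial = left_820 - left_720  # scores between 720 and 820

# Without rounding z, the answer shifts slightly in the third decimal.
p_partial_exact = sat.cdf(820) - sat.cdf(720)

print(f"{p_qualifier:.4f} {p_partial:.4f} {p_partial_exact:.4f}")
```

Rounding z is what ties the answer to the printed table; software that skips the rounding gives about 9% as well.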

The cool thing about working with normally distributed data is that we can manipulate it and then find answers to questions that involve comparing seemingly non-comparable distributions.

We do this by standardizing the data. All this involves is changing the scale so that the mean is now 0 and the standard deviation is 1. If you do this to different distributions, it makes them comparable.

$$z = \frac{x - \mu}{\sigma} \quad\Rightarrow\quad N(0, 1)$$

Finding a value when given a proportion

Backward normal calculations: We may also want to find the observed range of values that corresponds to a given proportion under the curve.
For that, we use Table A backward:

we first find the desired area/proportion in the body of the table

we then read the corresponding z-value from the left column and top row

For an area to the left of 1.25% (0.0125), the z-value is -2.24.

Backward Normal Calculations

Miles per gallon ratings of compact cars (2001 models) follow approximately the N(25.7, 5.88) distribution. How many miles per gallon must a vehicle get to place in the top 10% of all 2001 model compact cars?

1. z = 1.28 is the standardized value with area 0.9 to its left and 0.1 to its right.

2. Unstandardize:
$$\frac{x - 25.7}{5.88} = 1.28$$
Solving for x gives $x = 25.7 + 1.28 \times 5.88 \approx 33.2$ miles per gallon.
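
Reading Table A backward corresponds to evaluating the inverse of the cumulative distribution function. A minimal sketch of the two steps, using `NormalDist.inv_cdf` from the Python standard library (the variable names are illustrative):

```python
from statistics import NormalDist

mpg = NormalDist(mu=25.7, sigma=5.88)  # compact-car mileage model

# Step 1: the z-value with area 0.90 to its left (Table A read backward).
z_top10 = NormalDist().inv_cdf(0.90)

# Step 2: unstandardize, x = mu + z * sigma.
x_cutoff = mpg.mean + z_top10 * mpg.stdev

# The same cutoff in one step, directly on the unstandardized distribution.
x_direct = mpg.inv_cdf(0.90)

print(f"z = {z_top10:.2f}, cutoff = {x_cutoff:.1f} mpg")
```

The one-step call is convenient in practice, while the two-step version mirrors the hand calculation.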

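Other Standard Normal probability tables

[Figure: standard normal density curve (horizontal axis Z, vertical axis density) with the upper-tail area P(Z > 1.87) = 0.03 marked.]

If $X \sim N(10, 0.3)$, then what is $P(X \ge 11.025)$? The calculation treats 0.3 as the variance, so the standard deviation is $\sqrt{0.3}$.

$$P(X \ge 11.025) = P\!\left(\frac{X - 10}{\sqrt{0.3}} \ge \frac{11.025 - 10}{\sqrt{0.3}}\right) = P(Z \ge 1.87) = 1 - P(Z < 1.87) = 1 - 0.9693 = 0.0307$$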
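
The standardized value 1.87 only comes out if 0.3 is read as the variance of X rather than its standard deviation. The sketch below makes that assumption explicit and shows the alternative reading for contrast; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

variance = 0.3  # assumption: the second parameter of N(10, 0.3) is the variance
x_dist = NormalDist(mu=10, sigma=sqrt(variance))

z = (11.025 - x_dist.mean) / x_dist.stdev
p_upper = 1 - x_dist.cdf(11.025)  # upper-tail area P(X >= 11.025)

# For contrast: reading 0.3 as the standard deviation gives a much larger z.
z_if_sd = (11.025 - 10) / 0.3

print(f"z = {z:.2f}, upper tail = {p_upper:.4f}, z if 0.3 were sd = {z_if_sd:.2f}")
```

With the variance reading, z is about 1.87 and the tail area about 0.03, matching the worked answer.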

Assessing the Normality of data

One way to assess if a distribution is indeed approximately normal is to plot the data on a normal quantile plot.
The data points are ranked and the percentile ranks are converted to z-scores with Table A. The z-scores are then used for the x axis, against which the data are plotted on the y axis of the normal quantile plot.

If the distribution is indeed normal, the plot will show a straight line, indicating a good match between the data and a normal distribution.

Systematic deviations from a straight line indicate a non-normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
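
The coordinates of a normal quantile plot can be computed directly from the recipe above: rank the data, turn ranks into percentile ranks, and convert those to z-scores. The sketch below is one possible version; the function name, the plotting position (i - 0.5)/n, and the sample values are all assumptions made for illustration.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (z, value) pairs for a normal quantile plot.

    Percentile ranks use the plotting position (i - 0.5) / n, which is one
    common convention and an assumption here; the slides do not specify one.
    """
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    return [
        (std_normal.inv_cdf((i - 0.5) / n), value)
        for i, value in enumerate(ordered, start=1)
    ]

# Hypothetical sample, invented purely for illustration.
sample = [61.0, 62.5, 63.0, 64.0, 64.5, 65.0, 66.0, 67.5, 68.0]
points = normal_quantile_points(sample)
for z, value in points:
    print(f"{z:6.2f}  {value}")
```

Plotting the pairs with z on the x axis and the data on the y axis gives the quantile plot; points that fall close to a straight line suggest approximate normality.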

[Figure: Normal quantile plot of the earnings of 15 black female hourly workers at National Bank.] This distribution is roughly Normal except for one low outlier.

[Figure: Normal quantile plot of the salaries of Cincinnati Reds players on opening day of the 2000 season.] This distribution is skewed to the right.

Law of large numbers

As the number of randomly drawn observations in a sample increases, the mean of the sample, $\bar{x}$, gets closer and closer to the population mean $\mu$.
This is the law of large numbers. It is valid for any population.

Note: We often intuitively expect predictability over a few random observations, but that expectation is wrong. The law of large numbers only applies to really large numbers.
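
A small simulation shows the running sample mean settling toward the population mean. The sketch assumes a uniform population on [0, 1] (mean 0.5), chosen only for convenience; the seed and the helper name `running_means` are likewise illustrative.

```python
import random

def running_means(draw, n):
    """Sample mean after each of n draws from the zero-argument function `draw`."""
    total = 0.0
    means = []
    for i in range(1, n + 1):
        total += draw()
        means.append(total / i)
    return means

rng = random.Random(2010)  # fixed seed so the run is repeatable
population_mean = 0.5      # mean of the uniform distribution on [0, 1]
means = running_means(rng.random, 100_000)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(means[n - 1], 4))
```

The early running means can wander noticeably, while the later ones stay close to 0.5, which is the behaviour the law describes.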

Reminder: What is a sampling distribution?

The sampling distribution of a statistic is the distribution of all possible values taken by the statistic when all possible samples of a fixed size n are taken from the population. It is a theoretical idea; we do not actually build it.

The sampling distribution of a statistic is the probability distribution of that statistic.

Sampling distribution of sample mean

We take many random samples of a given size n from a population with mean $\mu$ and standard deviation $\sigma$.

Some sample means will be above the population mean $\mu$ and some will be below, making up the sampling distribution.

[Figure: histogram of some sample averages, and the sampling distribution of $\bar{x}$.]

For any population with mean $\mu$ and standard deviation $\sigma$:

The mean of the sampling distribution of $\bar{x}$ is equal to the population mean $\mu$.

The standard deviation of the sampling distribution of $\bar{x}$ is $\sigma/\sqrt{n}$, where n is the sample size.

Mean and standard deviation of sample mean

Mean of a sampling distribution of $\bar{x}$

There is no tendency for a sample mean to fall systematically above or below $\mu$, even if the distribution of the raw data is skewed. Thus, the mean of the sampling distribution is an unbiased estimate of the population mean $\mu$: it will be "correct on average" in many samples.

Standard deviation of a sampling distribution of $\bar{x}$

The standard deviation of the sampling distribution is smaller than the standard deviation of the population by a factor of $\sqrt{n}$. Averages are less variable than individual observations. Also, the results of large samples are less variable than the results of small samples.

For normally distributed populations

When a variable in a population is normally distributed, the sampling distribution of the sample mean $\bar{x}$ for all possible samples of size n is also normally distributed.

If the population is $N(\mu, \sigma)$, then the sample means distribution is $N(\mu, \sigma/\sqrt{n})$.

[Figure: the population distribution and the sampling distribution.]

The central limit theorem

Central Limit Theorem: When randomly sampling from any population with mean $\mu$ and standard deviation $\sigma$, when n is large enough, the sampling distribution of $\bar{x}$ is approximately normal: $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$.

[Figure, four panels: population with strongly skewed distribution; sampling distribution of $\bar{x}$ for n = 2 observations; sampling distribution of $\bar{x}$ for n = 10 observations; sampling distribution of $\bar{x}$ for n = 25 observations.]

The central limit theorem

[Figure, two panels: density of the Bin(5, 0.7) population; histogram of 1000 sample means of 50-sized samples (horizontal axis: sample mean, vertical axis: density).]

From a highly skewed distribution (mean = 3.5, sd = 1.024695), draw random samples with n = 50 and compute their sample means.
The relative frequency distribution of the sample means is approximately normal (bell shaped), with mean = 3.50164 and sd = 0.1471508.
For comparison, the theoretical value is $\sigma/\sqrt{n} = 1.024695/\sqrt{50} = 0.1449138$.
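
The experiment behind these numbers can be repeated in a few lines. The sketch below assumes the population is Bin(5, 0.7), which has the quoted mean 3.5 and standard deviation 1.024695; the seed and helper name are arbitrary, so the simulated mean and sd will differ slightly from the values quoted above.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(3)  # fixed seed; any seed gives a similar picture

def draw_binomial(trials=5, p=0.7):
    """One Bin(5, 0.7) value: count successes in 5 independent trials."""
    return sum(rng.random() < p for _ in range(trials))

n, replications = 50, 1000
sample_means = [
    mean(draw_binomial() for _ in range(n)) for _ in range(replications)
]

pop_sd = sqrt(5 * 0.7 * 0.3)       # standard deviation of Bin(5, 0.7)
theoretical_se = pop_sd / sqrt(n)  # sigma / sqrt(n)

print(round(mean(sample_means), 4), round(stdev(sample_means), 4),
      round(theoretical_se, 7))
```

The standard deviation of the simulated sample means should land near the theoretical $\sigma/\sqrt{n}$, and a histogram of `sample_means` should look roughly bell shaped.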

IQ scores: population vs. sample

In a large population of adults, the mean IQ is 112 with standard deviation 20. Suppose 200 adults are randomly selected for a market research campaign.
The distribution of the sample mean IQ is:

A) Exactly normal, mean 112, standard deviation 20
B) Approximately normal, mean 112, standard deviation 20
C) Approximately normal, mean 112, standard deviation 1.414
D) Approximately normal, mean 112, standard deviation 0.1

Answer: C) Approximately normal, mean 112, standard deviation 1.414, because $\sigma/\sqrt{n} = 20/\sqrt{200} \approx 1.414$.

Application

Hypokalemia is diagnosed when blood potassium levels are low, below 3.5 mEq/dl. Let's assume that we know a patient whose measured potassium levels vary daily according to a normal distribution $N(\mu = 3.8, \sigma = 0.2)$.
If only one measurement is made, what is the probability that this patient will be misdiagnosed as hypokalemic?

$$z = \frac{x - \mu}{\sigma} = \frac{3.5 - 3.8}{0.2} = -1.5, \qquad P(z < -1.5) = 0.0668 \approx 7\%$$

If instead measurements are taken on 4 separate days and averaged, what is the probability of such a misdiagnosis?

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{3.5 - 3.8}{0.2/\sqrt{4}} = -3, \qquad P(z < -3) = 0.0013 \approx 0.1\%$$

Note: Make sure to standardize (z) using the standard deviation for the sampling distribution.
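
Both misdiagnosis probabilities follow from one function once the standard deviation is divided by the square root of n. This is a sketch with a hypothetical helper `p_misdiagnosis`, using `NormalDist` from the Python standard library:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, threshold = 3.8, 0.2, 3.5  # patient's mean, sd, and diagnostic cutoff

def p_misdiagnosis(n: int) -> float:
    """P(mean of n daily measurements < threshold) for a N(mu, sigma) patient."""
    sampling_dist = NormalDist(mu=mu, sigma=sigma / sqrt(n))
    return sampling_dist.cdf(threshold)

p_one_day = p_misdiagnosis(1)
p_four_days = p_misdiagnosis(4)

print(f"{p_one_day:.4f} {p_four_days:.4f}")
```

Averaging four measurements halves the standard deviation, which moves the cutoff from 1.5 to 3 standard deviations below the mean and cuts the risk from about 7% to about 0.1%.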

Income distribution

Let's consider the very large database of individual incomes from the Bureau of Labor Statistics as our population. It is strongly right skewed.

We take 1000 simple random samples (SRSs) of 100 incomes, calculate the sample mean for each, and make a histogram of these 1000 means.

We also take 1000 SRSs of 25 incomes, calculate the sample mean for each, and make a histogram of these 1000 means.

Which histogram corresponds to the samples of size 100? 25?

How large a sample size?

It depends on the population distribution. More observations are required if the population distribution is far from normal.

A sample size of 25 is generally enough to obtain an approximately normal sampling distribution despite strong skewness or even mild outliers in the population.

A sample size of 40 will typically be good enough to overcome extreme skewness and outliers.

In many cases, n = 25 isn't a huge sample. Thus, even for strange population distributions we can assume a normal sampling distribution of the mean and work with it to solve problems.