Business Stats Review

Kiriakos Vlahos
ALBA 2016
Session overview
Descriptive Statistics
The Normal Distribution
Sampling
Confidence Intervals
Hypothesis Testing

What is statistics good for?
Descriptive Statistics
Collect
Organize
Summarize
Display
Analyze
Inferential Statistics
Predict and forecast values of population parameters
Test hypotheses (draw conclusions) about values of population parameters
Make decisions

It is a capital mistake to theorize before you have the data.


Sherlock Holmes

Samples and Populations
A population consists of the set of all
measurements in which we are interested.
A sample is a subset of the measurements
selected from the population.
A census is a complete enumeration of every
item in a population.

Descriptive Statistics
Measures of location
Percentiles and quartiles
Mean, median and mode (measures of central tendency)
Measures of variability
Range
Interquartile range
Variance
Standard deviation
Other summary measures
Skewness

Measures of central tendency: The arithmetic mean

Mean
Commonly referred to as average
The sum of the observed values divided by the
number of observations.
(in Excel: =AVERAGE(range) )
Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Other measures of central
tendency
Median
Middle value when sorted in terms of magnitude
(in Excel: =MEDIAN(range) )
50% of the values lie below the median
Mode
Most frequently occurring value (in Excel: =MODE(range) )

Measures of central tendency - Example
Length of Stay (LOS) of 11 hospital patients in days:
5, 5, 6, 6, 6, 8, 9, 9, 9, 9, 38
Sum = 110
Mean = 110/11 = 10
Median = 8
Mode = 9
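The length-of-stay figures can be checked with a few lines of Python. This is only an illustrative sketch using the standard `statistics` module; the variable names are ours, and the comparison without the 38-day stay is an added illustration of how one outlier moves the mean far more than the median.

```python
import statistics

# Length of stay (days) for the 11 patients listed on the slide.
los_days = [5, 5, 6, 6, 6, 8, 9, 9, 9, 9, 38]

mean_los = statistics.mean(los_days)      # sum of values / number of values
median_los = statistics.median(los_days)  # middle value of the sorted data

# Illustration (not on the slide): drop the single 38-day stay.
without_outlier = [d for d in los_days if d != 38]
mean_without_outlier = statistics.mean(without_outlier)
median_without_outlier = statistics.median(without_outlier)

print("all patients:     mean =", mean_los, " median =", median_los)
print("without outlier:  mean =", mean_without_outlier,
      " median =", median_without_outlier)
```

The mean drops noticeably once the outlier is removed, while the median barely changes.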

On the meaning of average
Things look slightly better for
the government if you take
the median rather than the
average income. The median
is the mid-point between the
highest and the lowest
incomes in a given group and
considered by statisticians to
be a more reliable figure.
Daily Telegraph

Income distributions

For 2006. Source: Wikipedia


Measures of Variability or
Dispersion
Range
Difference between maximum and minimum
values
Interquartile Range (IQR)
Difference between third and first quartile (Q3 -
Q1)
Variance
Average of the squared deviations from the mean (the sample version divides by n - 1)
Standard Deviation
Square root of the variance

Quartiles and the Interquartile
Range
Quartiles are the percentage points that
break down the ordered data set into
quarters.
The first quartile is the point below which lie
1/4 of the data.
The second quartile is the point below which
lie 1/2 of the data. This is also called the
median.
The third quartile is the point below which lie
3/4 of the data.
Variance and Standard Deviation

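Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$, with $\sigma = \sqrt{\sigma^2}$
Excel functions: =VARP(range), =STDEVP(range)

Sample Variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$, with $s = \sqrt{s^2}$
Excel functions: =VAR(range), =STDEV(range)
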
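The difference between the population and sample formulas is just the divisor (N versus n - 1). A minimal sketch with Python's `statistics` module, reusing the length-of-stay data from the earlier example purely for illustration:

```python
import statistics

# Length-of-stay data from the central tendency example (illustration only).
los_days = [5, 5, 6, 6, 6, 8, 9, 9, 9, 9, 38]

pop_var = statistics.pvariance(los_days)   # divides by N      (Excel: =VARP)
samp_var = statistics.variance(los_days)   # divides by n - 1  (Excel: =VAR)
pop_sd = statistics.pstdev(los_days)       # Excel: =STDEVP
samp_sd = statistics.stdev(los_days)       # Excel: =STDEV

print(f"population variance = {pop_var:.3f}, SD = {pop_sd:.3f}")
print(f"sample variance     = {samp_var:.3f}, SD = {samp_sd:.3f}")
```

The sample versions are always slightly larger, because the same sum of squared deviations is divided by a smaller number.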
Calculation of Sample Variance
$\bar{x} = 15.85$
$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} = \frac{378.55}{20 - 1} = 19.923$
$s = \sqrt{s^2} = \sqrt{19.923} = 4.46$

But what does it actually mean?

Empirical Rule

For roughly bell-shaped and symmetric distributions, approximately:
68% of observations lie within 1 standard deviation of the mean
95% lie within 2 standard deviations of the mean
Almost all (about 99.7%) lie within 3 standard deviations of the mean

Comparing measures of
dispersion
Range
Advantages: Easily understood, familiar
Disadvantages: Based on two observations; grossly distorted by outliers; descriptive measure only

Inter-quartile range
Advantages: Easily understood
Disadvantages: Not well known; descriptive measure only

Variance
Advantages: Mathematically tractable
Disadvantages: Wrong units; no intuitive appeal

Standard deviation
Advantages: Mathematically tractable; well known because of its use in various theories; standard measure of risk in finance
Disadvantages: Too involved for descriptive purposes

Skewness
[Figures: a symmetric distribution; a distribution skewed to the left; a distribution skewed to the right]
Measure of asymmetry of a frequency distribution
Negatively skewed or skewed to left
Symmetric or unskewed
Positively skewed or skewed to right

Coefficient of skewness
$Sk = \frac{3(\bar{x} - \text{median})}{s}$

Discrete and Continuous Random Variables

A discrete random variable:
counts occurrences
has a countable number of possible values
has discrete jumps between successive values
has measurable probability associated with individual values
probability is height

For example: Binomial, n = 3, p = .5
x      P(x)
0      0.125
1      0.375
2      0.375
3      0.125
Total  1.000

A continuous random variable:
measures (e.g.: height, weight, speed, value, duration, length)
has an infinite number of possible values
moves continuously from value to value
has no measurable probability associated with individual values
probability is area

For example: Minutes to Complete Task
[Figure: density of the minutes needed to complete a task, over 1 to 6 minutes, with the area between 2 and 3 shaded]
In this case, the shaded area represents the probability that the task takes between 2 and 3 minutes.

The Normal Distribution

The normal distribution is defined by just two parameters, $N(\mu, \sigma^2)$:
$\mu$: mean (location)
$\sigma^2$: variance (spread)

Properties of Normal distribution:
bell-shaped curve
values close to mean most likely
small probability of extreme values
symmetric, over- and under-estimates equally likely (no skew)

Example: Assume that we have a forecast of an exchange rate against the dollar in three months' time (call this X, a random variable whose value we cannot know in advance).
Forecasted value: $1.65 per unit of the foreign currency, with SD (of the forecast error) = 5c

Assume that the future exchange rate, measured in cents, follows a Normal distribution: X ~ N(165, 5^2)

IQ distribution

Source : mindhacks.com

The Normal Probability Distribution
The normal probability density function:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, for $-\infty < x < \infty$
where $e = 2.7182818\ldots$ and $\pi = 3.14159265\ldots$
[Figure: Normal Distribution with $\mu = 0$, $\sigma = 1$; f(x) plotted against x from -5 to 5]

The formula was constructed by the famous German mathematician Carl Friedrich Gauss (1777-1855). This is why the normal distribution is also called the Gaussian distribution.
No need to know, remember or use the function!!

Normal Probabilities (Empirical Rule)
The probability that a normal random variable will be within 1 standard deviation from its mean (on either side) is 0.6826, or approximately 0.68.
The probability that a normal random variable will be within 2 standard deviations from its mean is 0.9544, or approximately 0.95.
The probability that a normal random variable will be within 3 standard deviations from its mean is 0.9974.
[Figure: Standard Normal Distribution, f(z) against z from -5 to 5]

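These three probabilities can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `prob_within` is ours. The results match the slide's table-based figures to about three decimal places, with small differences in the fourth decimal due to table rounding.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k):
    """P(-k < Z < k) for a standard normal Z."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

# Probability of falling within 1, 2 and 3 standard deviations of the mean.
within = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```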
Application of the empirical rule
Exchange rate forecast: X ~ N(165, 5^2)
About a two-thirds (68%) chance of being within one standard deviation of the mean (160 - 170)
5% chance of being more than two standard deviations from the mean (2.5% below 155, 2.5% above 175)
=> 95% chance of being within +/- two standard deviations of the mean (between 155 and 175)
[Figure: normal curve over the axis 150, 155, 160, 165, 170, 175, 180, i.e. m-3s, m-2s, m-s, m, m+s, m+2s, m+3s]

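For the exchange rate forecast, the same probabilities can be computed directly from a normal distribution with mean 165 and standard deviation 5 (both in cents). A short sketch, again assuming `statistics.NormalDist`; the variable names are illustrative:

```python
from statistics import NormalDist

# Exchange rate in cents: mean 165, standard deviation 5.
rate = NormalDist(mu=165, sigma=5)

p_160_170 = rate.cdf(170) - rate.cdf(160)   # within one SD of the mean
p_155_175 = rate.cdf(175) - rate.cdf(155)   # within two SDs of the mean
p_below_155 = rate.cdf(155)                 # lower tail beyond two SDs
p_above_175 = 1 - rate.cdf(175)             # upper tail beyond two SDs

print(f"P(160 < X < 170) = {p_160_170:.3f}")
print(f"P(155 < X < 175) = {p_155_175:.3f}")
print(f"P(X < 155) = {p_below_155:.3f}, P(X > 175) = {p_above_175:.3f}")
```

The empirical rule's 68%, 95% and 2.5% are rounded versions of these values.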
The Standard Normal Distribution
The standard normal random variable, Z, is the normal random variable with mean $\mu = 0$ and standard deviation $\sigma = 1$: Z ~ N(0, 1^2).
[Figure: Standard Normal Distribution, f(z) against z from -5 to 5, centered at $\mu = 0$ with $\sigma = 1$]

Normal Probability Distributions
All of these are normal probability density functions, though each has a different mean and variance.

W ~ N(40, 1^2): Normal Distribution with $\mu = 40$, $\sigma = 1$
X ~ N(30, 5^2): Normal Distribution with $\mu = 30$, $\sigma = 5$
Y ~ N(50, 3^2): Normal Distribution with $\mu = 50$, $\sigma = 3$
Z ~ N(0, 1^2): Normal Distribution with $\mu = 0$, $\sigma = 1$
[Figures: the four density curves f(w), f(x), f(y) and f(z)]

Consider:
P(39 <= W <= 41)
P(25 <= X <= 35)
P(47 <= Y <= 53)
P(-1 <= Z <= 1)
The probability in each case is an area under a normal probability density function.

The Transformation of Normal Random Variables
The transformation of X to Z: $Z = \frac{X - \mu_x}{\sigma_x}$
(1) Subtraction: $(X - \mu_x)$
(2) Division by $\sigma_x$
The inverse transformation of Z to X: $X = \mu_x + Z\sigma_x$
[Figure: a Normal Distribution with $\mu = 50$, $\sigma = 10$ (X from 0 to 100) is mapped onto the Standard Normal Distribution with $\sigma = 1.0$ (Z from -5 to 5)]

Using the Normal Transformation
The monthly starting salaries of recent MBA graduates follow the normal distribution with a mean of $4,000 and a standard deviation of $400. What is the z-value for a salary of $4,400?

z = (X - mean) / (standard deviation) = (4,400 - 4,000) / 400 = 1.00

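The z-value calculation is a one-line function. The sketch below also shows, as an added illustration not asked for on the slide, the share of graduates expected to earn more than $4,400 under the stated normal model; `z_score` is a name chosen here for convenience.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

z = z_score(4400, mu=4000, sigma=400)

# Added illustration: probability of a salary above $4,400, i.e. P(Z > z).
p_above = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P(salary > 4400) = {p_above:.4f}")
```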
Finding probabilities of the Standard Normal Distribution: P(Z > 1.56)

Standard Normal Probabilities (the table gives the upper-tail area P(Z > z))

z   0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014

Look in the row labeled 1.5 and the column labeled .06 to find P(Z > 1.56) = .0594
Finding probabilities of the Standard Normal Distribution: P(Z < -2.47)

To find P(Z < -2.47):
Find the table area for 2.47: P(Z > 2.47) = 0.0068
P(Z < -2.47) = P(Z > 2.47) = 0.0068, since the normal distribution is symmetric

z    ...  .06    .07    .08
2.3  ...  0.0091 0.0089 0.0087
2.4  ...  0.0069 0.0068 0.0066
2.5  ...  0.0052 0.0051 0.0049

[Figure: Standard Normal Distribution; the area to the left of -2.47 equals the table area to the right of 2.47, which is 0.0068]
Finding Probabilities of the Standard Normal Distribution: P(1 < Z < 2)

To find P(1 < Z < 2):
1. Find table area for 2.00: P(Z > 2.00) = 0.0228
2. Find table area for 1.00: P(Z > 1.00) = 0.1587
3. P(1 < Z < 2) = P(Z > 1.00) - P(Z > 2.00) = 0.1587 - 0.0228 = 0.1359

z    .00    ...
0.9  0.1841 ...
1.0  0.1587 ...
1.1  0.1357 ...
1.9  0.0287 ...
2.0  0.0228 ...
2.1  0.0179 ...

[Figure: Standard Normal Distribution with the area between 1 and 2 shaded; P(1 < Z < 2) = 0.1587 - 0.0228 = 0.1359]
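The three table lookups above (an upper tail, a lower tail by symmetry, and an interval) can be reproduced without the printed table. A minimal sketch with `statistics.NormalDist`; the helper `upper_tail` is a name introduced here and mirrors the table's P(Z > z) convention.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def upper_tail(z):
    """Upper-tail area P(Z > z), the quantity tabulated on the slides."""
    return 1 - std_normal.cdf(z)

p_above_156 = upper_tail(1.56)               # P(Z > 1.56)
p_below_m247 = std_normal.cdf(-2.47)         # P(Z < -2.47) = P(Z > 2.47)
p_between = upper_tail(1.0) - upper_tail(2.0)  # P(1 < Z < 2)

print(f"P(Z > 1.56)  = {p_above_156:.4f}")
print(f"P(Z < -2.47) = {p_below_m247:.4f}")
print(f"P(1 < Z < 2) = {p_between:.4f}")
```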
Finding Values of the Standard Normal Random Variable: P(Z < z) = 0.90

To find z such that P(Z < z) = .90:
1. Find a probability as close as possible to P(Z > z) = 1 - 0.9 = 0.10 in the table of standard normal probabilities.
2. Then determine the value of z from the corresponding row and column.

z   0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823

The closest entry is 0.1003, in row 1.2 and column .08:
P(Z > 1.28) = .10
P(Z < 1.28) = 1 - 0.1 = .90
z = 1.28
[Figure: Standard Normal Distribution; the area to the left of 1.28 is 0.90 and the area to the right is 0.10]
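Reading the table "backwards" corresponds to the inverse of the cumulative distribution function. As a rough check, assuming `statistics.NormalDist.inv_cdf` (Python 3.8+):

```python
from statistics import NormalDist

std_normal = NormalDist()

# Value z such that P(Z < z) = 0.90; the table lookup gives 1.28.
z_90 = std_normal.inv_cdf(0.90)

# Going forwards again recovers the probability we started from.
check = std_normal.cdf(z_90)

print(f"z = {z_90:.4f}, P(Z < z) = {check:.2f}")
```

The exact value has more decimals than the two-decimal table can show, which is why the table entry nearest to 0.10 is 0.1003 rather than 0.1000.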
Statistics is a Science of Inference
Statistical Inference:
Predict and forecast values of population parameters...
Test hypotheses about values of population parameters...
Make decisions...
...on the basis of sample statistics derived from limited and incomplete sample information.

Make generalizations about the characteristics of a population on the basis of observations of a sample, a part of a population.

Examples of sampling
A bank wants to find
what customers think about redesigned shops
how many employees would be prepared to work on Saturdays for an extra fee
what percentage of the buying public are aware of the existence of a new banking product
how worried customers are about the security of online banking
In all of these cases dealing with the whole population is either impractical or too expensive, and sampling is the only option.

US Election 1948

Based on telephone polls, newspapers were so certain that Dewey would beat Truman that they declared him the winner the day after the election. Of course, as we know, Truman won the election.

Unbiased and biased sampling
Unbiased sample: unbiased, representative sample drawn at random from the entire population.
Biased sample: biased, unrepresentative sample drawn from people who have cars and/or telephones.
[Figures: a population of Democrats and Republicans, with the sample drawn either from the whole population or only from people who have phones and/or cars]

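Inferential statistics
[Diagram: Sampling (sample size) takes us from the Population, with parameters $\mu$, $\sigma$, $p$, to the Sample. Summarising the sample Data gives the Statistics $\bar{x}$, $s$, $\hat{p}$. Inference leads from the statistics back to the population, either as a Confidence Interval or as Hypothesis Testing (H0, Ha: reject?).]
Say not, "I have found the truth," but rather, "I have found a truth."
--Kahlil Gibran, The Prophet
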
Sampling Distributions
The sampling distribution of a statistic is the probability distribution of all possible values the statistic may assume, when computed from random samples of the same size, drawn from a specified population.
The sampling distribution of $\bar{X}$ is the probability distribution of all possible values the random variable $\bar{X}$ may assume when a sample of size n is taken from a specified population.

Sampling
Population: $\mu = 32.0$, $\sigma = 3.3$
Take a sample of size 5, n = 5
[Figure: three different samples of size 5 drawn from the same population, with sample means $\bar{x} = 31.6$, $\bar{x} = 33.0$ and $\bar{x} = 35.2$]
Variation!
Variation among sample means is measured by their standard deviation
Standard deviation of sample means is called standard error
Sampling Distributions (Continued)
Uniform population of integers from 1 to 8:
$\mu = 4.5$
$\sigma^2 = 5.25$
$\sigma = 2.2913$
[Figure: Uniform Distribution (1,8); each value X = 1, ..., 8 has probability P(X) = 0.125]

Sampling Distributions (Continued)
There are 8*8 = 64 different but equally-likely samples of size 2 that can be drawn from a uniform population of the integers from 1 to 8. Each of these samples has a sample mean. For example, the mean of the sample (1,4) is 2.5, and the mean of the sample (8,4) is 6.

Samples of Size 2 from Uniform (1,8)
   1   2   3   4   5   6   7   8
1  1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8
2  2,1 2,2 2,3 2,4 2,5 2,6 2,7 2,8
3  3,1 3,2 3,3 3,4 3,5 3,6 3,7 3,8
4  4,1 4,2 4,3 4,4 4,5 4,6 4,7 4,8
5  5,1 5,2 5,3 5,4 5,5 5,6 5,7 5,8
6  6,1 6,2 6,3 6,4 6,5 6,6 6,7 6,8
7  7,1 7,2 7,3 7,4 7,5 7,6 7,7 7,8
8  8,1 8,2 8,3 8,4 8,5 8,6 8,7 8,8

Sample Means from Uniform (1,8), n = 2
   1   2   3   4   5   6   7   8
1  1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5
2  1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
3  2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
4  2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
5  3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5
6  3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
7  4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5
8  4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
Sampling Distributions (Continued)
The probability distribution of the sample mean is called the sampling distribution of the sample mean.

Sampling Distribution of the Mean:
$\mu_{\bar{X}} = 4.5$
$\sigma^2_{\bar{X}} = 2.625$
$\sigma_{\bar{X}} = 1.6202$
[Figure: P($\bar{X}$) for $\bar{X}$ = 1.0, 1.5, ..., 8.0, peaking at 4.5]

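The figures for the sampling distribution can be verified by brute force: list all 64 equally likely samples of size 2, compute each sample mean, and summarise the results. A small sketch using only the standard library; the variable names are ours.

```python
from itertools import product
from math import sqrt
from statistics import mean, pvariance

population = list(range(1, 9))  # the integers 1..8, each equally likely

# All 8 * 8 = 64 ordered samples of size 2 (sampling with replacement).
samples = list(product(population, repeat=2))
sample_means = [(a + b) / 2 for a, b in samples]

pop_var = pvariance(population)       # population variance
mu_xbar = mean(sample_means)          # mean of the sample means
var_xbar = pvariance(sample_means)    # variance of the sample means
sd_xbar = sqrt(var_xbar)              # standard error of the mean

print(len(samples), mu_xbar, var_xbar, round(sd_xbar, 4))
print("population variance / n =", pop_var / 2)
```

The variance of the sample means equals the population variance divided by the sample size (here n = 2), which is the general result stated on the following slides.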
Properties of the Sampling Distribution of the Sample Mean
Comparing the population distribution and the sampling distribution of the mean:
The sampling distribution is more bell-shaped and symmetric.
Both have the same center.
The sampling distribution of the mean is more compact, with a smaller variance.
[Figures: Uniform Distribution (1,8) for X = 1, ..., 8, and the Sampling Distribution of the Mean for $\bar{X}$ = 1.0, 1.5, ..., 8.0]

Population Parameters and the Sampling Distribution of the Sample Mean

The expected value of the sample mean is equal to the population mean:
$E(\bar{X}) = \mu_{\bar{X}} = \mu_X$

The variance of the sample mean is equal to the population variance divided by the sample size:
$V(\bar{X}) = \sigma^2_{\bar{X}} = \frac{\sigma^2_X}{n}$

The standard deviation of the sample mean, known as the standard error of the mean, is equal to the population standard deviation divided by the square root of the sample size:
$SD(\bar{X}) = \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$

Sampling from a Normal Population
When sampling from a normal population with mean $\mu$ and standard deviation $\sigma$, the sample mean, $\bar{X}$, has a normal sampling distribution:
$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

This means that, as the sample size increases, the sampling distribution of the sample mean remains centered on the population mean, but becomes more compactly distributed around that population mean.
[Figure: Sampling Distribution of the Sample Mean; the normal population compared with the sampling distributions for n = 2, n = 4 and n = 16]

The Central Limit Theorem
When sampling from a population with mean $\mu$ and finite standard deviation $\sigma$, the sampling distribution of the sample mean will tend to a normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ as the sample size becomes large (n >= 30).

For large enough n: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
[Figures: the sampling distribution of $\bar{X}$ for n = 5, n = 20 and large n]

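A quick simulation makes the theorem concrete. The sketch below is our own illustration, not part of the slides: it draws many samples of size 30 from a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen arbitrarily) and checks that the sample means cluster around the population mean with a spread close to sigma / sqrt(n).

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

POP_MEAN = 1.0   # mean of the exponential population (rate 1)
POP_SD = 1.0     # its standard deviation
N = 30           # sample size
REPEATS = 5000   # number of samples drawn

def one_sample_mean(n):
    """Mean of n independent draws from the skewed population."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

sample_means = [one_sample_mean(N) for _ in range(REPEATS)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = POP_SD / math.sqrt(N)

print(f"mean of sample means = {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"SD of sample means   = {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the underlying population is far from normal.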
The Central Limit Theorem Applies to Sampling Distributions from Any Population
[Figure: four population shapes (Normal, Uniform, Skewed, General), each shown with the sampling distribution of $\bar{X}$ for n = 2 and n = 30]

Some words by W. J.
Youden
The
normal
law of error
stands out in
the experience of
mankind as one of the
broadest generalisations of
natural philosophy. It serves
as the guiding instrument in
researches in the physical and social
sciences and in medicine, agriculture,
and engineering. It is an indispensable
tool for the analysis and the interpretation of the
basic data obtained by observation and experiment.

The Central Limit Theorem - Example

Mercury makes a 2.4 liter V-6 engine, the Laser XRi, used in speedboats. The company's engineers believe the engine delivers an average power of 220 horsepower and that the standard deviation of power delivered is 15 HP. A potential buyer intends to sample 100 engines (each engine is to be run a single time). What is the probability that the sample mean will be less than 217 HP?

$P(\bar{X} < 217) = P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{217 - \mu}{\sigma/\sqrt{n}}\right) = P\left(Z < \frac{217 - 220}{15/\sqrt{100}}\right) = P\left(Z < \frac{217 - 220}{15/10}\right) = P(Z < -2) = 0.0228$

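The engine calculation can be reproduced in a few lines. This sketch assumes, as the slide does, that the sample mean of 100 engines is approximately normal with standard error sigma / sqrt(n); variable names are illustrative.

```python
import math
from statistics import NormalDist

mu = 220      # believed mean power (HP)
sigma = 15    # standard deviation of power (HP)
n = 100       # number of engines sampled

standard_error = sigma / math.sqrt(n)          # 15 / 10 = 1.5 HP
z = (217 - mu) / standard_error                # standardised value
p_below_217 = NormalDist(mu, standard_error).cdf(217)

print(f"SE = {standard_error}, z = {z}, P(sample mean < 217) = {p_below_217:.4f}")
```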
Types of Estimators
Point Estimate
A single-valued estimate.
A single element chosen from a sampling distribution.
Conveys little information about the actual value of the
population parameter or about the accuracy of the
estimate.
Confidence Interval or Interval Estimate
An interval or range of values believed to include the
unknown population parameter.
Associated with the interval is a measure of the
confidence we have that the interval does indeed
contain the parameter of interest.
Confidence Interval or
Interval Estimate
A confidence interval or interval estimate is a range or interval of numbers
believed to include an unknown population parameter. Associated with the
interval is a measure of the confidence we have that the interval does indeed
contain the parameter of interest.

A confidence interval or interval estimate has two components:
A range or interval of values
An associated level of confidence

Confidence Interval for $\mu$ when $\sigma$ is known
If the population distribution is normal, the sampling distribution of the mean is normal.
If the sample is sufficiently large, regardless of the shape of the population distribution, the sampling distribution is approximately normal (Central Limit Theorem).
In either case:
$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} < \bar{x} < \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$
or
$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$
[Figure: Standard Normal Distribution: 95% Interval, f(z) against z from -4 to 4]

Confidence Interval for $\mu$ when $\sigma$ is Known (Continued)
Before sampling, there is a 0.95 probability that the interval
$\mu \pm 1.96\frac{\sigma}{\sqrt{n}}$
will include the sample mean (and 5% that it will not).

Conversely, after sampling, approximately 95% of such intervals
$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$
will include the population mean (and 5% of them will not).

That is, $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$ is a 95% confidence interval for $\mu$.

Critical Values of z and Levels of Confidence

Confidence level $(1 - \alpha)$    $\alpha/2$    $z_{\alpha/2}$
0.99                               0.005         2.576
0.98                               0.010         2.326
0.95                               0.025         1.960
0.90                               0.050         1.645
0.80                               0.100         1.282

[Figure: Standard Normal Distribution with area $(1 - \alpha)$ between $-z_{\alpha/2}$ and $z_{\alpha/2}$, and area $\alpha/2$ in each tail]

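The critical values in this table come from the inverse of the standard normal distribution: for confidence level 1 - alpha, the critical value cuts off an area of alpha/2 in the upper tail. A minimal sketch, with `z_critical` as a helper name introduced here:

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value leaving (1 - confidence) / 2 in each tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

levels = (0.99, 0.98, 0.95, 0.90, 0.80)
critical_values = {c: round(z_critical(c), 3) for c in levels}

for c, z in critical_values.items():
    print(f"confidence {c:.2f}: z = {z:.3f}")
```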
The Level of Confidence and the Width of the Confidence Interval
When sampling from the same population, using a fixed sample size, the higher the confidence level, the wider the confidence interval.

80% Confidence Interval: $\bar{x} \pm 1.28\frac{\sigma}{\sqrt{n}}$
95% Confidence Interval: $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$
[Figures: two Standard Normal Distributions, with the central 80% and the central 95% shaded]

The Sample Size and the Width of the Confidence Interval

When sampling from the same population, using a fixed confidence level, the larger the sample size, n, the narrower the confidence interval.
[Figures: Sampling Distribution of the Mean with the 95% Confidence Interval for n = 20 and for n = 40; the n = 40 distribution is taller and narrower]

Example

Population consists of the Fortune 500 Companies (Fortune Web Site), as ranked by Revenues. You are trying to find out the average Revenues for the companies on the list. The population standard deviation is $15,056.37. A random sample of 30 companies obtains a sample mean of $10,672.87. Give a 95% and 90% confidence interval for the average Revenues.

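One possible way to work this example is sketched below. It applies the known-sigma interval (sample mean plus or minus z times sigma / sqrt(n)) to the figures given in the question; the function name `z_interval` is ours, and the units are whatever units the slide's revenue figures are quoted in.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

xbar = 10672.87    # sample mean revenue
sigma = 15056.37   # population standard deviation
n = 30             # sample size

ci_95 = z_interval(xbar, sigma, n, 0.95)
ci_90 = z_interval(xbar, sigma, n, 0.90)

print(f"95% CI: {ci_95[0]:.2f} to {ci_95[1]:.2f}")
print(f"90% CI: {ci_90[0]:.2f} to {ci_90[1]:.2f}")
```

As expected from the earlier slide on confidence levels, the 90% interval sits inside the 95% interval.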
Sampling Error: $\sigma$ Unknown
We want to estimate the mean of the population, $\mu$
The standard deviation of the population, $\sigma$, is unknown
We estimate that true average with the sample mean, $\bar{X}$
n is the sample size
The sampling error is estimated with:
$SE = \frac{s}{\sqrt{n}}$
where s is the standard deviation of the sample

Overview: Confidence intervals for sample mean

Population Standard Deviation ($\sigma$):      Known                                      Unknown
Large sample                                   $\bar{x} \pm z\frac{\sigma}{\sqrt{n}}$     $\bar{x} \pm z\frac{s}{\sqrt{n}}$
Small sample, Normal population                $\bar{x} \pm z\frac{\sigma}{\sqrt{n}}$     $\bar{x} \pm t\frac{s}{\sqrt{n}}$
Small sample, non-Normal population            Non-parametric tests

NB: z means look up value from standard Normal table N(0,1)
t means look up value from t-dist with (n-1) degrees of freedom

Confidence Interval in
Excel
Data > Data Analysis > Descriptive Statistics:
Female Salary

Mean 62576.92
Standard Error 1438.86
Median 62650
Mode 60900
Standard Deviation 7336.774
Sample Variance 53828246
Kurtosis -0.09267
Skewness -0.33406
Range 30800
Minimum 45600
Maximum 76400
Sum 1627000
Count 26
Confidence Level(95.0%) 2963.385

Confidence Level = $t\frac{s}{\sqrt{n}}$
Confidence Interval = Mean $\pm$ Confidence Level

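The Excel output above already contains everything needed for the interval. The sketch below simply recombines the reported numbers: it recomputes the standard error from the standard deviation and count, and uses Excel's reported "Confidence Level(95.0%)" as the half-width rather than looking up a t value itself (the standard library has no t distribution). The implied t multiplier is derived only as a sanity check.

```python
import math

# Figures copied from the Excel Descriptive Statistics output.
mean_salary = 62576.92
std_dev = 7336.774
count = 26
half_width = 2963.385   # Excel's "Confidence Level(95.0%)" = t * s / sqrt(n)

standard_error = std_dev / math.sqrt(count)   # Excel reports 1438.86
implied_t = half_width / standard_error       # t multiplier Excel must have used

ci_lower = mean_salary - half_width
ci_upper = mean_salary + half_width

print(f"standard error = {standard_error:.2f}, implied t = {implied_t:.3f}")
print(f"95% CI: {ci_lower:.2f} to {ci_upper:.2f}")
```

The implied multiplier is a little above 1.96, as one would expect for a t value with n - 1 = 25 degrees of freedom.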
Hypothesis Testing

The concept of hypothesis testing


Formulating the hypotheses
Selecting the significance level
Testing for population means
Testing for population proportions
p-values
Testing for population differences

The court analogy
The defendant is presumed to be innocent
until proven guilty beyond reasonable doubt.
What do we mean by reasonable doubt?
Cot death case
O.J. Simpson
Al Capone

Decision-Making
In statistics you can't prove anything. You try to show that the alternative is highly unlikely.

One hypothesis is maintained to be true until a decision is made to reject it as false on the basis of collected evidence:
Guilt is proven beyond a reasonable doubt
Innocence appears to be highly improbable

What is a hypothesis?
A hypothesis is a statement or assertion about the state of nature (about the true value of an unknown population parameter):
The accused is innocent
$\mu = 100$
Every hypothesis implies its contradiction or alternative:
The accused is guilty
$\mu \neq 100$
A hypothesis is either true or false, and you may fail to reject it or you may reject it on the basis of information:
Trial testimony and evidence
Sample data

Statistical Hypothesis Testing

A null hypothesis, denoted by H0, is an assertion about one or more population parameters. This is the assertion we hold to be true until we have sufficient statistical evidence to conclude otherwise.
H0: $\mu = 100$
The alternative hypothesis, denoted by H1, is the assertion of all situations not covered by the null hypothesis.
H1: $\mu \neq 100$
H0 and H1 are:
Mutually exclusive
Only one can be true.
Exhaustive
Together they cover all possibilities, so one or the other must be true.

The Null Hypothesis, H0

The null hypothesis:
Often represents the status quo situation or an existing belief.
Is maintained, or held to be true, until a test leads to its rejection in favor of the alternative hypothesis.
Is accepted as true or rejected as false on the basis of a consideration of a test statistic.
It is what we are trying to reject!

The process of hypothesis testing
Step 1: State null and alternate hypotheses

Step 2: Select a level of significance

Step 3: Identify the test statistic

Step 4: Formulate a decision rule

Step 5: Take a sample, arrive at a decision

Do not reject null Reject null and accept alternate

The Concepts of Hypothesis Testing
A test statistic is a sample statistic computed from sample data. The value of the test statistic is used in determining whether or not we may reject the null hypothesis.
The decision rule of a statistical hypothesis test is a rule that specifies the conditions under which the null hypothesis may be rejected.

Consider H0: $\mu = 100$. We may have a decision rule that says: Reject H0 if the sample mean is less than 95 or more than 105.

In a courtroom we may say: The accused is innocent until proven guilty beyond a reasonable doubt.

Decision Making
A decision may be correct in two ways:
Fail to reject a true H0
Reject a false H0
A decision may be incorrect in two ways:
Type I Error: Reject a true H0
The probability of a Type I error is denoted by $\alpha$ (the level of significance).
Type II Error: Fail to reject a false H0
The probability of a Type II error is denoted by $\beta$ (the power of the test is $1 - \beta$).

Type I and Type II Errors

A contingency table illustrates the possible outcomes of a statistical hypothesis test.

Decision Making (example)

A decision to fail to reject or reject a hypothesis may be:
H0: defendant is innocent
Correct
A true hypothesis is not rejected
An innocent defendant may be acquitted
A false hypothesis is rejected
A guilty defendant may be convicted
Incorrect
A true hypothesis is rejected (Type I error)
An innocent defendant may be convicted
A false hypothesis is not rejected (Type II error)
A guilty defendant may be acquitted

Question: What happens in court cases as you raise the burden of evidence?

Choosing the significance
level
If it is of paramount importance to avoid a
Type I error then we should choose a low
significance level e.g. 1%
If a Type II error is also costly then we need to
strike a balance and choose a higher
significance level
The most commonly used value is 5%

Testing Population Means

Cases in which the test statistic is Z:
$\sigma$ is known and the population is normal.
the sample size is at least 30. (The population need not be normal)

The formula for calculating Z is:
$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

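Overview: tests of sample mean

Population Standard Deviation ($\sigma$):      Known                                               Unknown
Large sample                                   $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$     $z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Small sample, Normal population                $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$     $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Small sample, non-Normal population            Non-parametric tests

NB: z means look up value from standard Normal table N(0,1)
t means look up value from t-dist with (n-1) degrees of freedom
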
The rejection region
The rejection region is the range of values that will lead us to reject the null hypothesis if the test statistic should fall within this region.
The rejection region is designed so that, before the sampling takes place, our test statistic will have a probability $\alpha$ of falling within the rejection region if the null hypothesis is true.
The non-rejection (acceptance) region consists of all values not included in the rejection region.

Example
A company that delivers packages within a large metropolitan area claims that it
takes an average of 28 minutes for a package to be delivered from your door to the
destination. Suppose that you want to carry out a hypothesis test of this claim.

Set the null and alternative hypotheses:
H0: μ = 28
H1: μ ≠ 28

Collect sample data:
n = 100
x̄ = 31.5
s = 5

Construct a 95% confidence interval for the average delivery times of all packages:

$$\bar{x} \pm z_{.025}\frac{s}{\sqrt{n}} = 31.5 \pm 1.96\,\frac{5}{\sqrt{100}} = 31.5 \pm 0.98 = [30.52,\ 32.48]$$

We can be 95% sure that the average time for all packages is between 30.52 and 32.48 minutes.
Since the asserted value, 28 minutes, is not in this 95% confidence interval, we may reasonably
reject the null hypothesis at the 5% significance level.
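
The interval in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `confidence_interval` is hypothetical, and it assumes, as the slide does, that the Normal approximation applies with s in place of σ.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(x_bar, s, n, confidence=0.95):
    """Two-sided z-based confidence interval for the population mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

low, high = confidence_interval(x_bar=31.5, s=5, n=100)
print(f"95% CI: [{low:.2f}, {high:.2f}]")

claimed_mean = 28
inside = low <= claimed_mean <= high
print("Do not reject H0" if inside else "Reject H0 at the 5% level")
```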

Picturing Hypothesis Testing

[Figure: number line showing the population mean under H0, μ = 28, and the 95% confidence interval around the observed sample mean, running from 30.52 to 32.48 and centered on x̄ = 31.5.]

It seems reasonable to reject the null hypothesis, H0: μ = 28, since the hypothesized value lies outside
the 95% confidence interval. If we're 95% sure that the population mean is between 30.52 and 32.48
minutes, it's very unlikely that the population mean is actually 28 minutes.

Note that the population mean may be 28 (the null hypothesis might be true), but then the observed
sample mean, 31.5, would be a very unlikely occurrence. There's still the small chance (α = .05) that we
might reject the true null hypothesis. α represents the level of significance of the test.

Relationship to confidence intervals
If the observed sample mean falls within the non-rejection (acceptance) region, then you fail to reject
the null hypothesis. Construct a 95% non-rejection region around the hypothesized population
mean, and compare it with the 95% confidence interval around the observed sample mean:

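95% non-rejection region around the population mean:

$$\mu_0 \pm z_{.025}\frac{s}{\sqrt{n}} = 28 \pm 1.96\,\frac{5}{\sqrt{100}} = 28 \pm 0.98 = [27.02,\ 28.98]$$

95% confidence interval around the sample mean:

$$\bar{x} \pm z_{.025}\frac{s}{\sqrt{n}} = 31.5 \pm 1.96\,\frac{5}{\sqrt{100}} = 31.5 \pm 0.98 = [30.52,\ 32.48]$$

[Figure: number line showing the non-rejection region from 27.02 to 28.98 around μ0 = 28, and the confidence interval from 30.52 to 32.48 around x̄ = 31.5.]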

The non-rejection region and the confidence interval are the same width, but centered on different
points. In this instance, the non-rejection region does not include the observed sample mean, and the
confidence interval does not include the hypothesized population mean.

The Decision Rule
[Figure: The Hypothesized Sampling Distribution of the Mean. The non-rejection region, with area .95, runs from 27.02 to 28.98 and is centered on μ0 = 28; the lower and upper rejection regions each have area .025.]

Construct a (1 - α) non-rejection region around the hypothesized population mean.
Do not reject H0 if the sample mean falls within the non-rejection region (between the
critical points).
Reject H0 if the sample mean falls outside the non-rejection region (inside the rejection
region).
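
The decision rule can be sketched in Python as follows, using the delivery-time numbers from the earlier example (μ0 = 28, s = 5, n = 100, x̄ = 31.5). The function names are hypothetical and the code relies only on the standard library.

```python
from math import sqrt
from statistics import NormalDist

def non_rejection_region(mu0, s, n, alpha=0.05):
    """(1 - alpha) non-rejection region around the hypothesized mean mu0."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical point of N(0, 1)
    margin = z * s / sqrt(n)
    return mu0 - margin, mu0 + margin

def decide(x_bar, mu0, s, n, alpha=0.05):
    """Apply the two-tailed decision rule to an observed sample mean."""
    low, high = non_rejection_region(mu0, s, n, alpha)
    return "do not reject H0" if low <= x_bar <= high else "reject H0"

low, high = non_rejection_region(mu0=28, s=5, n=100)
print(f"Non-rejection region: [{low:.2f}, {high:.2f}]")
print(decide(x_bar=31.5, mu0=28, s=5, n=100))
```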

Critical Values of z and Levels of Confidence

Conf. level (1 - α)    α/2      z(α/2)
0.99                   0.005    2.576
0.98                   0.010    2.326
0.95                   0.025    1.960
0.90                   0.050    1.645
0.80                   0.100    1.282

[Figure: Standard Normal Distribution, f(z) against z, with area α/2 in each tail beyond -z(α/2) and z(α/2).]
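
These critical values do not have to be looked up; they can be computed from the inverse cumulative distribution function of the standard Normal. A minimal sketch with Python's standard library follows (the function name `two_tailed_critical_z` is hypothetical).

```python
from statistics import NormalDist

def two_tailed_critical_z(confidence):
    """Critical value z(alpha/2) leaving alpha/2 in each tail of N(0, 1)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

critical = {c: two_tailed_critical_z(c) for c in (0.99, 0.98, 0.95, 0.90, 0.80)}
for confidence, z in critical.items():
    print(f"{confidence:.2f}  {(1 - confidence) / 2:.3f}  {z:.3f}")
```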

Example
An insurance company believes that, over the last few years, the mean liability insurance per board
seat in companies defined as small companies has been $2000. Using α = 0.01, test this hypothesis
using Growth Resources, Inc. survey data (sample size 100 and sample mean $2700).

H0: μ = 2000
H1: μ ≠ 2000

n = 100
x̄ = 2700
s = 947

For α = 0.01, the critical values of z are ±2.576.

Do not reject H0 if: [-2.576 ≤ z ≤ 2.576]
Reject H0 if: [z < -2.576] or [z > 2.576]

The test statistic is:

$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{2700 - 2000}{947/\sqrt{100}} = \frac{700}{94.7} = 7.39 \Rightarrow \text{Reject } H_0$$
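
The same two-tailed test can be checked in Python. This is a sketch with the figures from the example (x̄ = 2700, μ0 = 2000, s = 947, n = 100, α = 0.01); the helper name `z_statistic` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(x_bar, mu0, s, n):
    """Distance of the sample mean from mu0, in standard-error units."""
    return (x_bar - mu0) / (s / sqrt(n))

alpha = 0.01
z = z_statistic(x_bar=2700, mu0=2000, s=947, n=100)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value

reject = abs(z) > z_crit
print(f"z = {z:.2f}, critical values = +/-{z_crit:.3f}")
print("Reject H0" if reject else "Do not reject H0")
```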

Example: Continued

[Figure: The Standard Normal Distribution. The non-rejection region, with area .99, lies between -2.576 and 2.576; the lower and upper rejection regions each have area .005. The test statistic z = 7.39 falls far inside the upper rejection region.]

Since the test statistic falls in the upper rejection region, H0 is rejected, and we
may conclude that the average insurance liability per board seat in
small companies is not equal to $2000.

1-Tailed and 2-Tailed Tests

The tails of a statistical test are determined by the need for an action.

If action is to be taken if a parameter is either greater than or less than some value a, then
the alternative hypothesis is that the parameter is not equal to a, and the test is a two-tailed
test.
H0: μ = 50
H1: μ ≠ 50

If action is to be taken if a parameter is greater than some value a, then the alternative
hypothesis is that the parameter is greater than a, and the test is a right-tailed test.
H0: μ ≤ 50
H1: μ > 50

If action is to be taken if a parameter is less than some value a, then the alternative
hypothesis is that the parameter is less than a, and the test is a left-tailed test.
H0: μ ≥ 50
H1: μ < 50

Rejection region for different types of tests

[Figure: three Standard Normal Distribution curves, each with a non-rejection region of area .99.
Two-tailed test: lower and upper rejection regions of area .005 each, beyond -2.576 and 2.576.
Right-tailed test: rejection region of area .01 to the right of 2.33.
Left-tailed test: rejection region of area .01 to the left of -2.33.]

Critical Values of z and Levels of Confidence (one-tailed tests)

Conf. level (1 - α)    α       z(α)
0.99                   0.01    2.326
0.98                   0.02    2.054
0.95                   0.05    1.645
0.90                   0.10    1.282
0.80                   0.20    0.842

[Figure: Standard Normal Distribution, f(z) against z, with area α in the tail beyond z(α).]

Example
An automatic bottling machine fills cola into two-liter (2000 cc) bottles. A consumer advocate wants to
test the null hypothesis that the average amount filled by the machine into a bottle is at least 2000 cc. A
random sample of 40 bottles coming out of the machine was selected and the exact content of the
selected bottles was recorded. The sample mean was 1999.6 cc. The population standard deviation is
known from past experience to be 1.30 cc.
Test the null hypothesis at the 5% significance level.

H0: μ ≥ 2000
H1: μ < 2000

n = 40
x̄ = 1999.6
σ = 1.3

For α = 0.05, the critical value of z is -1.645.

Do not reject H0 if: [z ≥ -1.645]
Reject H0 if: [z < -1.645]

The test statistic is:

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{1999.6 - 2000}{1.3/\sqrt{40}} = -1.95 \Rightarrow \text{Reject } H_0$$

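
The left-tailed bottling test can be verified with a short Python sketch. It uses the numbers given in the example (x̄ = 1999.6, μ0 = 2000, σ = 1.3, n = 40, α = 0.05) and only the standard library. Note that the test statistic is negative, so it lies in the lower tail.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
x_bar, mu0, sigma, n = 1999.6, 2000, 1.3, 40

z = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic
z_crit = NormalDist().inv_cdf(alpha)    # left-tail critical value, about -1.645

reject = z < z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.3f}")
print("Reject H0" if reject else "Do not reject H0")
```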
The p-Value

The p-value is the probability of obtaining a value of the test statistic as
extreme as, or more extreme than, the actual value obtained, when the null
hypothesis is true.

The p-value is the smallest level of significance, α, at which the null hypothesis
may be rejected using the obtained value of the test statistic.

Policy: When the p-value is less than α, reject H0.

Reporting the p-value allows the reader to choose her own level of
significance.

The p-Value and Hypothesis Testing
The further away in the tail of the distribution the test statistic falls, the smaller is the
p-value and, hence, the more convinced we are that the null hypothesis is false and should be
rejected.

In a right-tailed test, the p-value is the area to the right of the test statistic if the test statistic
is positive.

In a left-tailed test, the p-value is the area to the left of the test statistic if the test statistic is
negative.

In a two-tailed test, the p-value is twice the area to the right of a positive test statistic or to
the left of a negative test statistic.

For a given level of significance, α:

Reject the null hypothesis if and only if α ≥ p-value
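
The three rules for tail areas can be written as one small function. This is a sketch for z tests only, using Python's standard library; the function name `p_value` and its `tail` argument are hypothetical. It is applied to the test statistics of the two worked examples, z = -1.95 (left-tailed bottling test) and z = 7.39 (two-tailed insurance test).

```python
from statistics import NormalDist

def p_value(z, tail):
    """p-value of a z test; tail is 'left', 'right' or 'two'."""
    phi = NormalDist().cdf  # cumulative distribution function of N(0, 1)
    if tail == "left":
        return phi(z)             # area to the left of z
    if tail == "right":
        return phi(-z)            # area to the right of z, by symmetry
    if tail == "two":
        return 2 * phi(-abs(z))   # twice the area in the nearer tail
    raise ValueError("tail must be 'left', 'right' or 'two'")

p_bottling = p_value(-1.95, "left")
p_insurance = p_value(7.39, "two")
print(f"Bottling test:  p = {p_bottling:.4f}")
print(f"Insurance test: p = {p_insurance:.2e}")
print("Reject H0 at 5%" if p_bottling <= 0.05 else "Do not reject H0 at 5%")
```

The bottling result is rejected at the 5% level but would not be rejected at the 1% level, which illustrates why reporting the p-value is more informative than reporting only the decision.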

The p-Value: Rules of Thumb
When the p-value is smaller than 0.01, the result is called very significant.

When the p-value is between 0.01 and 0.05, the result is called significant.

When the p-value is between 0.05 and 0.10, the result is considered by some as
marginally significant (and by most as not significant).

When the p-value is greater than 0.10, the result is considered not significant.

Testing for Differences
Applications for testing differences between samples
Difference in average running cost for different makes of vehicle
Difference in average salary between different groups of employees
Difference in profits between regions, managers, etc

Chances are if we take two different samples there will be some difference
Is this due to chance alone (sampling error) or is there a significant difference?

Approach:
measure the difference in the average (cost, salary, etc.) of each group
calculate the sampling error as a standard error for this statistic
quantify the sampling error by estimating a confidence interval for the difference
alternatively, perform a specific hypothesis test using Excel

Comparisons of Two Population Means: Test Statistic

Large-sample test statistic for the difference between two population means:

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

The term (μ1 - μ2)0 is the difference between μ1 and μ2 under the null hypothesis. It
is equal to zero in situations I and II, and it is equal to the prespecified value D in
situation III. The term in the denominator is the standard deviation of the
difference between the two sample means (it relies on the assumption that the
two samples are independent). This test also allows for unequal variances.
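
As a minimal sketch, the large-sample statistic can be coded directly from the formula. The function name `two_sample_z` and the input numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from the slides.

```python
from math import sqrt

def two_sample_z(mean1, mean2, var1, var2, n1, n2, hypothesized_diff=0.0):
    """Large-sample z statistic for the difference between two population means."""
    # Standard deviation of the difference between two independent sample means
    standard_error = sqrt(var1 / n1 + var2 / n2)
    return ((mean1 - mean2) - hypothesized_diff) / standard_error

# Hypothetical samples: means 52 and 50, variances 16 and 25, sizes 64 and 100
z = two_sample_z(mean1=52, mean2=50, var1=16, var2=25, n1=64, n2=100)
print(f"z = {z:.3f}")

# Testing against a prespecified difference D = 1 instead of 0
z_d = two_sample_z(52, 50, 16, 25, 64, 100, hypothesized_diff=1.0)
print(f"z = {z_d:.3f}")
```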

Comparison of Salaries

Female Salary    Male Salary
57,000           79,400
61,300           67,400
62,000           66,500
70,100           72,600
45,600           63,600
71,200           74,500
64,700           76,400
53,800           67,900
60,900           61,600
62,700           75,500
76,400           64,500
57,900           73,400
68,200           76,100
65,800           72,200
60,300           69,600
62,600           53,100
67,000           65,500
62,700           78,400
54,700           77,600
71,400           82,000
50,400
71,800
59,800
80,800
64,100           74,800
70,400           71,000
53,100
60,900

[Figure: histogram "Female and Male Salary", frequency against salary, with separate series for Female and Male.]
Discrimination?
Excel: Tools > Data Analysis > t-Test: Two-Sample Assuming Unequal Variances

Discrimination?
Output from Excel:
t-Test: Two-Sample Assuming Unequal Variances

                                Female Salary    Male Salary
Mean                            62577            71008
Variance                        53828246         52349493
Observations                    26               24
Hypothesized Mean Difference    0
df                              48
t Stat                          -4.09
P(T<=t) one-tail                0.01%
t Critical one-tail             1.68
P(T<=t) two-tail                0.02%
t Critical two-tail             2.01

t Stat is the difference in the group means, measured in units of the SE (standard error)

H0: average male salary - average female salary = 0 (two-tailed)
The absolute value of t Stat, 4.09, is greater than the critical value of 2.01 => reject H0,
conclude that there is a significant difference
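
The t Stat and df in the Excel output can be reproduced from the summary statistics alone. The slides do not show how Excel computes them; the sketch below assumes the standard unequal-variance (Welch) statistic with Welch-Satterthwaite degrees of freedom rounded to the nearest integer. The function name `welch_t` is hypothetical, and the critical value 2.01 is taken from the Excel output rather than computed.

```python
from math import sqrt

def welch_t(mean1, var1, n1, mean2, var2, n2, hypothesized_diff=0.0):
    """t statistic and approximate degrees of freedom for unequal variances."""
    v1, v2 = var1 / n1, var2 / n2  # squared standard error of each sample mean
    t = (mean1 - mean2 - hypothesized_diff) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary statistics from the Excel output (female first, then male)
t, df = welch_t(62577, 53828246, 26, 71008, 52349493, 24)
t_critical_two_tail = 2.01  # from the Excel output

print(f"t Stat = {t:.2f}, df = {round(df)}")
significant = abs(t) > t_critical_two_tail
print("Reject H0" if significant else "Do not reject H0")
```

The sign of t Stat is negative because the female mean is entered first; the two-tailed decision depends only on its absolute value.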