
E370

5/13/17
Descriptive Statistics

Statistics
  Sampling Methods
    Probability: Simple random, Pseudo random, Stratified, Systematic, Cluster
    Non-probability: Convenience, Judgment
  Data Types
    Qualitative or Categorical: Nominal, Ordinal
    Quantitative or Numerical: Discrete, Continuous
  Graphical Types
    Qualitative: Bar/column, Pie, Pareto Diagram
    Quantitative: Histogram, Frequency polygon, Ogive, Stem-n-Leaf Plot
  Descriptive Statistics
    Qualitative: Mode (Nominal & Ordinal), Median (Ordinal only)
    Quantitative
      Center: Mean, Median, Mode
      Spread: Range, Variance, Standard Deviation, Coefficient of Variation
      Shape: Skewness, Symmetric, Uni-, bi-modal, etc.
Measures of Center or Central
Tendency
the location of the data, the middle of the data,
the balance point of the data, the most common
element in the data.
Spread
how similar to one another are the observations
within the data set, how distant they are from
one another, how far observations are from
some fixed point.
Shape
how the mass of the data is arranged over the
distribution

Dimensions of Data
General statistical symbol
conventions: Symbols differ for
populations and samples
Populations are described by parameters
Parameters are represented by Greek letters
For example, μ = mu = population mean
σ = sigma = population standard deviation
N=Population Size
Samples are described by statistics
Statistics are represented by Latin letters
For example, x̄ = X-bar = sample mean
s = sample standard deviation
n = sample size

Symbols
Mean
The arithmetic average, the center
of balance for the data.

Median
The middle of the ordered data set

Mode
The most common value

Summary Methods--Center
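
The three measures of center can be computed with Python's standard `statistics` module. This is only a minimal sketch; the data set is the small Xi column (1, 2, 3, 3, 4, 5) that appears in the summary-statistic tables of this deck.

```python
import statistics

# Small data set taken from the Xi column used in these slides.
data = [1, 2, 3, 3, 4, 5]

mean = statistics.mean(data)      # arithmetic average: sum / count
median = statistics.median(data)  # middle of the ordered data
mode = statistics.mode(data)      # most common value

print(mean, median, mode)
```

For this data set all three measures coincide at 3, which is what a symmetric distribution would suggest.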

The sum of all the observations in a
data set divided by the number of
observations.

Any real number


Unique
Inclusive (uses every observation)
Balanced (the balance point of the data)
Sensitive to extreme values

Mean Characteristics
It is the value that divides the ordered data set into two
parts of equal size with respect to the number of
observations. Specifically, 50% of the observations are
larger than it and 50% are smaller.

Unique

Any real number

More applicable

Insensitive

Exclusive

Median Characteristics
The value with the highest
frequency.
Universal use
Simple
Only hope for nominal data
Insensitive

Highly unstable
Doesn't always exist.
Sometimes there is more than one mode.
Small changes in observations can dramatically
change the mode.

Mode Characteristics
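
The instability of the mode can be illustrated with `statistics.multimode`, which returns every value tied for the highest frequency. The observations below are hypothetical and chosen only to show the behavior described above.

```python
import statistics

# Hypothetical observations, chosen only to illustrate the behavior.
no_repeats = [1, 2, 3, 4]        # every value ties, so no useful mode
two_modes = [1, 1, 2, 3, 3]      # two values share the highest frequency
before = [1, 1, 1, 9, 9, 5]
after = [1, 1, 9, 9, 9, 5]       # a single observation changed from 1 to 9

modes_no_repeats = statistics.multimode(no_repeats)
modes_two = statistics.multimode(two_modes)
mode_before = statistics.multimode(before)
mode_after = statistics.multimode(after)

print(modes_no_repeats, modes_two, mode_before, mode_after)
```

Changing one observation moves the mode from 1 to 9, while the mean of the same data moves only from about 4.33 to about 5.67.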
Summary Statistic Behavior
Xi        Xi + 7    2Xi
1         1+7=8     2*1=2
2         9         4
3         10        6
3         10        6
4         11        8
5         12        10
μ = 3     μ = 10    μ = 6
Md = 3    Md = 10   Md = 6
Mo = 3    Mo = 10   Mo = 6

Irritating Data

Compare these
formulas.

Other Mean Formulas


Age      Freq. fi   Rel. Freq.   Cum. Freq.   Mark
10~15    1          0.03         1            12.5
15~20    11         0.31         12           17.5
20~25    14         0.40         26           22.5
25~30    5          0.14         31           27.5
30~35    3          0.09         34           32.5
35~40    1          0.03         35           37.5

Age of Mother at birth of 1st child


Age      Class Mark   Absolute Frequency   Product
10~15    12.5         1                    12.5
15~20    17.5         11                   192.5
20~25    22.5         14                   315
25~30    27.5         5                    137.5
30~35    32.5         3                    97.5
35~40    37.5         1                    37.5
Sum                                        792.5
Sum/n                                      22.64

Estimating Means of Grouped Data--Absolute Frequencies
Age      Class Mark   Relative Frequency   Product
10~15    12.5         0.03                 0.36
15~20    17.5         0.31                 5.50
20~25    22.5         0.40                 9.00
25~30    27.5         0.14                 3.93
30~35    32.5         0.09                 2.79
35~40    37.5         0.03                 1.07
Sum                                        22.64

Estimating Means of Grouped Data--Relative Frequencies
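
The two grouped-data tables can be reproduced with a short sketch. It assumes the class marks and absolute frequencies from the age-of-mother table and shows that weighting by absolute frequency (then dividing by n) and weighting by relative frequency give the same estimated mean.

```python
# Class marks and absolute frequencies from the age-of-mother table.
class_marks = [12.5, 17.5, 22.5, 27.5, 32.5, 37.5]
frequencies = [1, 11, 14, 5, 3, 1]

n = sum(frequencies)  # total number of observations

# Absolute-frequency form: sum(f_i * m_i) / n
mean_abs = sum(f * m for f, m in zip(frequencies, class_marks)) / n

# Relative-frequency form: sum((f_i / n) * m_i)
rel_freqs = [f / n for f in frequencies]
mean_rel = sum(r * m for r, m in zip(rel_freqs, class_marks))

print(n, round(mean_abs, 2), round(mean_rel, 2))
```

Both forms are estimates, because each observation is replaced by the mark (midpoint) of its class.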
Range
How far apart the highest value and the
lowest value in the set are, thus, Range =
(Max-Min)
Variance
It measures the average squared distance an
observation is from the mean
Standard Deviation
Roughly the typical distance an observation is from
the mean; the square root of the variance
Coefficient of Variation
A measure of relative dispersion, dispersion
relative to the mean.

Summary Methods--Spread
Simple and intuitive

It doesn't use all the data, so it
tells nothing about how the data
falls between the high and the low
point

It is sensitive to extreme values,
just like the mean

Range Characteristics
Interquartile Range
The distance between the first and
third quartiles of the data set. IQR = Q3 - Q1
A quartile is a percentile, but instead of
dividing the data into 100 levels it
divides it into 4
The first quartile is the 25th percentile
The third quartile is the 75th percentile
The IQR cuts off the smallest 25% and
the largest 25% of the data, removing
outliers.

A Range Alternative
Ordered data (n = 40):
0    10   20   35
1    10   21   35
2    12   22   35
4    13   22   38
5    13   22   39
5    13   24   45
6    14   24   50
7    16   25   56
9    17   26   60
9    19   33   63

L25 = (40+1)*(25/100) = 10.25
10th Obs = 9
11th Obs = 10
10 - 9 = 1
1 * .25 = .25
Q1 = 9.25
L75 = (40+1)*(75/100) = 30.75
Q3 = 34.5
IQR = 34.5 - 9.25 = 25.25

Calculate the IQR
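
The location method used in this calculation, L = (n + 1) * (p / 100) with linear interpolation between neighbouring observations, can be sketched as follows. The function name `percentile_by_location` is a hypothetical helper, not something from the slides; the data are the 40 ordered observations from the exercise.

```python
def percentile_by_location(sorted_data, p):
    """Percentile via L = (n + 1) * p / 100, interpolating between neighbours."""
    n = len(sorted_data)
    location = (n + 1) * p / 100      # 1-based position in the ordered data
    lower = int(location)             # whole part: index of the lower observation
    fraction = location - lower       # fractional part used for interpolation
    if lower < 1:
        return sorted_data[0]
    if lower >= n:
        return sorted_data[-1]
    low_value = sorted_data[lower - 1]
    high_value = sorted_data[lower]
    return low_value + fraction * (high_value - low_value)

data = sorted([
    0, 1, 2, 4, 5, 5, 6, 7, 9, 9,
    10, 10, 12, 13, 13, 13, 14, 16, 17, 19,
    20, 21, 22, 22, 22, 24, 24, 25, 26, 33,
    35, 35, 35, 38, 39, 45, 50, 56, 60, 63,
])

q1 = percentile_by_location(data, 25)
q3 = percentile_by_location(data, 75)
iqr = q3 - q1
print(q1, q3, iqr)
```

Other software may use a different quartile convention and return slightly different values; `statistics.quantiles(data, n=4)` with its default `method="exclusive"` follows the same (n + 1) location rule.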



Why are there two formulas?


Conceptually, a sample does not
include all the information that a
population does
Samples tend to UNDER estimate the
variability found in a population.
If we divide by a slightly smaller
number (n - 1) instead of n, we get a
slightly larger variance.

Variance
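
The formulas themselves did not survive in this text, so the following is a sketch of the standard population and sample variance definitions (divide by N versus divide by n - 1) using Python's `statistics` module. The data set is the Xi column from the summary-statistic tables.

```python
import statistics

data = [1, 2, 3, 3, 4, 5]

# Population variance: sum of squared deviations divided by N.
pop_var = statistics.pvariance(data)

# Sample variance: same sum of squared deviations divided by n - 1.
samp_var = statistics.variance(data)

print(round(pop_var, 3), samp_var)
```

The squared deviations sum to 10 in both cases; dividing by 6 gives about 1.667, while dividing by 5 gives 2, the slightly larger sample estimate.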
It is a unique value and uses all
information in the data set

It has desirable mathematical
properties

It is an average, thus it has the
same failings as the mean, that is,
it is sensitive to outliers

It is difficult to interpret
Variance Characteristics

The close relative of the
variance (its square root)
It is no easier to calculate--you
have to get a variance first--it is
just easier to interpret.
It is measured in the same units
as the data is measured
It is also sensitive to outliers

Standard Deviation
Calculating relative frequencies is
the closest to dispersion one can
get with categorical data.
What if you want to compare two
data sets and they are in different
units, or they are in different
magnitudes?
The Coefficient of Variation (CV) is
a measure of relative dispersion.

Is everything covered?

Eliminates units and enables
comparisons
Eliminates the effect of differences
of magnitude.
Often the best choice for
comparative dispersion
Concerns
Not usable for data with a 0 mean
Inappropriate for data that can be
negative.
Coefficient of Variation
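
A minimal sketch of the coefficient of variation, assuming the usual definition CV = (standard deviation / mean) * 100, which is not spelled out in the slide text. The inputs are the means and standard deviations of the boys' and men's weights from the summary table later in this deck.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the CV as a percentage: (standard deviation / mean) * 100."""
    if mean == 0:
        # Matches the concern listed above: CV is not usable with a 0 mean.
        raise ValueError("CV is undefined when the mean is 0")
    return std_dev / mean * 100

cv_boys = coefficient_of_variation(7.93, 54.78)
cv_men = coefficient_of_variation(21.81, 172.52)
print(round(cv_boys, 1), round(cv_men, 1))
```

Although the men's standard deviation is much larger in absolute terms, their CV is smaller, so relative to their mean the men's weights are the more uniform.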
Summary Statistic Behavior
Xi            Xi + 7        2Xi
1             8             2
2             9             4
3             10            6
3             10            6
4             11            8
5             12            10
μ = 3         μ = 10        μ = 6
Md = 3        Md = 10       Md = 6
Mo = 3        Mo = 10       Mo = 6
range = 4     range = 4     range = 8
σ² = 1.667    σ² = 1.667    σ² = 6.667
σ = 1.291     σ = 1.291     σ = 2.582
More irritated data
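
The table's pattern can be checked with a short sketch: adding a constant shifts the measures of center but leaves the measures of spread alone, while multiplying by a constant scales the center, range and standard deviation by that constant and the variance by its square. The helper `summarize` is a hypothetical name, and the population formulas are used because they reproduce the table's 1.667 and 1.291.

```python
import statistics

def summarize(values):
    """Center and spread measures, using population variance and std. dev."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }

x = [1, 2, 3, 3, 4, 5]
shifted = [xi + 7 for xi in x]   # Xi + 7
scaled = [2 * xi for xi in x]    # 2Xi

base_stats = summarize(x)
shifted_stats = summarize(shifted)
scaled_stats = summarize(scaled)

for name, stats in [("Xi", base_stats), ("Xi + 7", shifted_stats), ("2Xi", scaled_stats)]:
    print(name, {key: round(value, 3) for key, value in stats.items()})
```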
Skewness

Measures the degree of asymmetry in a data
set.
Pearson's 2nd Skewness Coefficient
Ranges from -3 to 3 usually.
Reflects the general result that
when x̄ > Md, the data is right skewed, Sk > 0
when x̄ < Md, the data is left skewed, Sk < 0
when x̄ = Md, the data is un-skewed, i.e., symmetric,
Sk = 0
This is not a rule, rather a rule of thumb.
Statisticians know that the size of the sample
and the value of the mode affect the skewness.

Shape Methods
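
The coefficient's formula is missing from this text; the sketch below assumes the standard definition of Pearson's second skewness coefficient, Sk = 3 * (mean - median) / standard deviation. The inputs are the boys' and men's weight statistics from the summary table in this deck, and the results agree with the skewness values listed there.

```python
def pearson_second_skewness(mean, median, std_dev):
    """Pearson's 2nd coefficient: 3 * (mean - median) / standard deviation."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return 3 * (mean - median) / std_dev

sk_boys = pearson_second_skewness(54.78, 53, 7.93)
sk_men = pearson_second_skewness(172.52, 171.5, 21.81)
print(round(sk_boys, 2), round(sk_men, 2))
```

Both values are positive because each mean exceeds its median, which indicates right skew; the boys' weights are the less symmetric of the two groups.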
Right Skewed Histogram

Left Skewed Histogram

Symmetric Histogram
Weights:                        Boys     Men
Mean                            54.78    172.52
Median                          53       171.5
Mode                            52       171
Standard Deviation              7.93     21.81
Sample Variance                 62.91    475.48
Skewness                        0.67     0.14
Range                           32       103
Minimum                         41       126
Maximum                         73       229
Sum                             2739     8626
Count                           50       50
Coefficient of Variation (%)    14.5     12.6

Which group of males has the more uniform weight? How do you know?
Which group is least symmetric?

Some Descriptive Statistics
Chebyshev's Theorem
% OBS ≥ 1 - 1/k²
k = number of standard deviations, k > 1
Universal application
Provides minimum guarantee

Empirical or Normal Rule
±1σ: find about 68% of observations
±2σ: find about 95% of observations
±3σ: find about 99.7% of observations
Only bell-shaped and symmetric distributions.
Only integer multiples of σ (1, 2, 3)

Methods for estimating probabilities
A Chebyshev example: If k = 1.5, at least 1 - 1/1.5² ≈ 55.6% of observations lie within 1.5 standard deviations of the mean.
What minimum percent of
observations does Chebyshev
predict for ±2σ?

Within how many standard
deviations will at least 44% of
observations lie?

Density Estimates: Chebyshev
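
Both Chebyshev questions can be worked with a short sketch of the bound 1 - 1/k² and its inverse. The function names are hypothetical helpers; the inverse simply solves 1 - 1/k² = fraction for k.

```python
import math

def chebyshev_min_fraction(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2

def chebyshev_k_for_fraction(fraction):
    """Smallest k guaranteeing at least `fraction` of observations within k std. devs."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    return math.sqrt(1 / (1 - fraction))

within_2 = chebyshev_min_fraction(2)        # at least 75% within 2 std. devs
within_1_5 = chebyshev_min_fraction(1.5)    # at least about 55.6%
k_for_44 = chebyshev_k_for_fraction(0.44)   # about 1.34 standard deviations

print(within_2, round(within_1_5, 4), round(k_for_44, 3))
```

Because the theorem makes no assumption about the distribution's shape, these are minimum guarantees and are usually well below what a bell-shaped distribution actually delivers.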


The weights of a part Samsung Electronics receives from suppliers have a
mean of 40 micrograms, a standard deviation of 3 micrograms, and a
bell-shaped symmetric distribution.
Approximately what percent of parts weigh between 34 and 37 micrograms?

Density Estimates: Empirical Rule
Use the Empirical Rule to isolate the area shaded in red, which is the
percentage of parts weighing between 34 and 37 mcg.
The red area is half the difference in area between ±1σ and ±2σ.
The area within ±1σ is 68%.
The area within ±2σ is 95%.
The difference in area is (95% - 68%) = 27%.
Half of 27% is 13.5%.

Density Estimates: Empirical Rule
The weights of a part Samsung Electronics receives from suppliers have a
mean of 40 micrograms, a standard deviation of 3 micrograms, and a
bell-shaped symmetric distribution.
Samsung rejects parts that weigh more than 46 micrograms or less than
37 micrograms.
Approximately what percentage of parts does Samsung routinely accept?

Density Estimates: Empirical Rule
There are two ways to think about this. The first is adding the area of
those parts accepted using the Empirical Rule.
The area shaded in blue is the percentage of accepted parts.
The blue area is the area within ±1σ, which is 68%, plus half the
difference in area between ±1σ and ±2σ.
The difference in area is (95% - 68%) = 27%, half of which is
(27%/2) = 13.5%.
Total area is 68% + 13.5%, or 81.5%.

Density Estimates: Empirical Rule
There are two ways to think about this, and the second is subtracting
from 100% the area of those rejected using the Empirical Rule.
The area shaded in black is the percentage of rejected parts. The black
area is the area outside ±2σ, 100% - 95% = 5%, plus half the difference
in area between ±1σ and ±2σ.
The difference in area is (95% - 68%) = 27%, half of which is
(27%/2) = 13.5%.
Total area of black is 5% + 13.5% = 18.5%.
The blue area is 100% - 18.5%, or 81.5%.

Density Estimates: Empirical Rule
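
The area arithmetic in the two Samsung exercises can be sketched as a small lookup on the Empirical Rule percentages. The helper names are hypothetical, and the sketch only accepts limits that fall a whole number of standard deviations (0 to 3) from the mean, since the rule gives no areas in between.

```python
# Percent of observations within +/- k standard deviations (Empirical Rule).
WITHIN = {0: 0.0, 1: 68.0, 2: 95.0, 3: 99.7}

def signed_half_area(k):
    """Percent between the mean and k std. devs; negative when k is below the mean."""
    half = WITHIN[abs(k)] / 2
    return half if k >= 0 else -half

def empirical_percent_between(low, high, mean, std_dev):
    """Approximate percent of observations between two limits, by the Empirical Rule."""
    k_low = (low - mean) / std_dev
    k_high = (high - mean) / std_dev
    for k in (k_low, k_high):
        if k != int(k) or abs(k) > 3:
            raise ValueError("limits must lie 0 to 3 whole standard deviations from the mean")
    return signed_half_area(int(k_high)) - signed_half_area(int(k_low))

# Parts weighing between 34 and 37 mcg (from -2 to -1 standard deviations).
band = empirical_percent_between(34, 37, mean=40, std_dev=3)

# Accepted parts, between 37 and 46 mcg (from -1 to +2 standard deviations).
accepted = empirical_percent_between(37, 46, mean=40, std_dev=3)

print(band, accepted)
```

Splitting each symmetric area in half and signing it by side of the mean lets one subtraction handle both the single band and the asymmetric acceptance interval.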


The weights of a part Samsung Electronics receives from suppliers have a
mean of 40 micrograms, a standard deviation of 3 micrograms, and a
bell-shaped symmetric distribution.
What is the approximate range of the weights of this particular part?

Density Estimates: Empirical Rule
Think of the standard deviation as a unit of distance along the number
line, here 1σ = 3 mcg.
The Empirical Rule says that virtually 100% of observations will fall
within ±3σ.
The range is the distance between the maximum value and the minimum.
Since ±3σ spans 6σ, the range is approximately 6 * 3 mcg, or 18 mcg.

Density Estimates: Empirical Rule


Methods for One Variable
Three dimensions of data
Measures of Center
Mean, median and mode
Measures of Spread
Range, IQR, variance, standard deviation and
coefficient of variation.
Measures and concepts for Shape
Pearson's skewness coefficient
Symmetric, right or positive skewed, left or
negative skewed
Estimating probabilities
Chebyshev's Theorem and the Empirical Rule

Descriptive Statistics
