
PROBABILITY AND

STATISTICS

Note: Most of the Slides were taken from


Elementary Statistics: A Handbook of Slide
Presentation prepared by Z.V.J. Albacea, C.E.
Reano, R.V. Collado, L.N. Comia and N.A.
Tandang in 2005 for the Institute of Statistics,
CAS, UP Los Banos


Statistics is like a bikini; what it reveals is suggestive, what it conceals is vital.
Session 1.2


Realities about Statistics

The man in the street distrusts statistics and despises [his image of] statisticians, those who
diligently collect irrelevant facts and figures and use them to manipulate society.
"There are three kinds of lies: lies, damned lies, and statistics." (Mark Twain)

One cannot go about without statistics.

"Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital." (Aaron Levenstein)

Session 1.3


Florence Nightingale on Statistics

...the most important science in the whole world: for upon it depends the practical
application of every other science and of every art: the one science essential to all political
and social administration, all education, all organization based on experience, for it only
gives results of our experience.
To understand God's thoughts, we must study statistics, for these are the measures of His
purpose.
Session 1.4

Session 1.5

Definition of Statistics
Plural sense: numerical facts, e.g. CPI, peso-dollar exchange rate
Singular sense: a scientific discipline consisting of theory and methods for processing
numerical information that one can use when making decisions in the face of uncertainty.

Session 1.6


History of Statistics
The term statistics came from the Latin phrase "ratio status", which means the study of
practical politics or the statesman's art.
In the middle of the 18th century, the term statistik (a term due to Achenwall) was used, a
German term defined as "the political science of several countries".
From statistik it became statistics, defined as a statement in figures and facts of the present
condition of a state.
Session 1.7


Application of Statistics
Diverse applications

"During the 20th Century statistical thinking and methodology have become the
scientific framework for literally dozens of fields including education, agriculture,
economics, biology, and medicine, and with increasing influence recently on the hard
sciences such as astronomy, geology, and physics. In other words, we have grown
from a small obscure field into a big obscure field." (Brad Efron)
Session 1.8


Application of Statistics

Comparing the effects of five kinds of fertilizers on the yield of a particular variety of corn
Determining the income distribution of Filipino families
Comparing the effectiveness of two diet programs
Prediction of daily temperatures
Evaluation of student performance
Session 1.9


Two Aims of Statistics


Statistics aims to uncover
structure in data, to explain
variation
Descriptive
Inferential

Session 1.10


Areas of Statistics
Descriptive statistics: methods concerned with collecting, describing, and analyzing a set
of data without drawing conclusions (or inferences) about a large group

Inferential statistics: methods concerned with the analysis of a subset of data leading to
predictions or inferences about the entire set of data

Session 1.11


Example of Descriptive Statistics


Present the Philippine population by constructing a
graph indicating the total number of Filipinos counted
during the last census by age group and sex

Session 1.12

Example of Inferential Statistics


A new milk formulation designed to improve the psychomotor
development of infants was tested on randomly selected infants.

Based on the results, it was concluded that the new milk formulation is
effective in improving the psychomotor development of infants.

Session 1.13


Inferential Statistics
[Diagram: a Smaller Set (n units/observations) is taken from a Larger Set (N units/observations);
Inferences and Generalizations are made from the smaller set about the larger set]

Session 1.14


Key Definitions

A universe is the collection of things or observational units under consideration.

A variable is a characteristic observed or measured on every unit of the universe.

A population is the set of all possible values of the variable.

Session 1.15

Key Definitions

Parameters are numerical measures that describe the population or universe of interest.
Usually denoted by Greek letters: μ (mu), σ (sigma), ρ (rho), λ (lambda), τ (tau),
θ (theta), α (alpha) and β (beta).

Statistics are numerical measures of a sample.
Session 1.16

Types of Variables
Qualitative variable: non-numerical values

Quantitative variable: numerical values
a. Discrete: countable
b. Continuous: measurable
c. Constant
Session 1.17

Levels of Measurement
1. Nominal: Numbers or symbols used to classify
2. Ordinal scale: Accounts for order; no indication of distance between positions
3. Interval scale: Equal intervals; no absolute zero
4. Ratio scale: Has absolute zero


Session 1.18

Methods of Collecting Data

Objective Method

Subjective Method

Use of Existing Records

Session 1.19


Methods of Presenting Data


Textual
Tabular
Graphical
Session 1.20


Summary Measures

Location: Maximum, Minimum; Central Tendency (Mean, Median, Mode); Percentile, Quartile, Decile
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Skewness
Kurtosis

Session 1.21


Measures of Location
A Measure of Location summarizes a data set by giving a typical value within
the range of the data values that describes its location relative to the entire data set.
Some Common Measures:
Minimum, Maximum
Central Tendency
Percentiles, Deciles, Quartiles

Session 1.22


Maximum and Minimum

Minimum is the smallest value in the data set, denoted as MIN.

Maximum is the largest value in the data set, denoted as MAX.
Session 1.23


Measure of Central Tendency

A single value that is used to identify the center of the data
It is thought of as a typical value of the distribution
Precise yet simple
Most representative value of the data

Session 1.24


Mean

Most common measure of the center


Also known as arithmetic average

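Population Mean

$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$

Sample Mean

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$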

Session 1.25


Properties of the Mean

may not be an actual observation in the data set
can be applied in at least interval level
easy to compute
every observation contributes to the value of the mean
Session 1.26

Properties of the Mean

subgroup means can be combined to come up with a group mean
easily affected by extreme values

[Figure: dot plots on two number lines, one from 0 to 10 (Mean = 5) and one extended to 14 (Mean = 6)]
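
The property that subgroup means can be combined into a group mean can be sketched as a size-weighted average. This is only an illustration: the function name `combined_mean` and the two subgroups are hypothetical and do not come from the slides.

```python
def combined_mean(means, sizes):
    """Combine subgroup means into one group mean, weighting each by its subgroup size."""
    if len(means) != len(sizes) or not sizes:
        raise ValueError("means and sizes must be non-empty and of equal length")
    total = sum(m * n for m, n in zip(means, sizes))  # recovers the grand total
    return total / sum(sizes)

# Hypothetical subgroups: 10 students averaging 80 and 30 students averaging 90.
group_mean = combined_mean([80, 90], [10, 30])
print(group_mean)  # 87.5
```

Weighting by size matters here: the unweighted average of 80 and 90 would be 85, which is not the mean of the 40 students taken together.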

Session 1.27

Median

Divides the observations into two equal parts

If the number of observations is odd, the median is the middle number.
If the number of observations is even, the median is the average of the 2 middle numbers.

Sample median is denoted as $\tilde{x}$, while population median is denoted as $\tilde{\mu}$.

Session 1.28

Properties of a Median

may not be an actual observation in the data set
can be applied in at least ordinal level
a positional measure; not affected by extreme values

[Figure: dot plots on two number lines, one from 0 to 10 and one extended to 14; Median = 5]
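
The contrast between the mean and the median under an extreme value can be checked with a short sketch. The two data sets below are hypothetical (they are not read off the slides' figure); they were chosen so that the results match the values quoted on the slides.

```python
from statistics import mean, median

# Hypothetical data sets, chosen for illustration only.
base = [1, 3, 5, 7, 9]
with_extreme = [1, 3, 5, 7, 14]  # largest value replaced by an extreme one

print(mean(base), median(base))                  # 5 5
print(mean(with_extreme), median(with_extreme))  # 6 5
```

The mean moves from 5 to 6 because every observation contributes to it, while the median stays at 5 because it depends only on position.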

Session 1.29


Mode

occurs most frequently
nominal average
may or may not exist

[Figure: two dot plots; on the number line from 0 to 14, Mode = 9; on the number line from 0 to 6, No Mode]
Session 1.30

Properties of a Mode

can be used for qualitative as well as quantitative data
may not be unique
not affected by extreme values
can be computed for ungrouped and grouped data
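
A minimal sketch of these properties for ungrouped data follows. The helper `modes` and its sample inputs are hypothetical, and it assumes one common convention: when no value repeats, the data set is reported as having no mode.

```python
from collections import Counter

def modes(data):
    """Return all most-frequent values; an empty list means no mode exists."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:  # assumed convention: no value repeats, so there is no mode
        return []
    return [value for value, c in counts.items() if c == top]

print(modes([1, 3, 5, 7, 9]))                            # []
print(modes([1, 3, 9, 9, 11]))                           # [9]
print(modes(["red", "blue", "red", "green", "blue"]))    # ['red', 'blue']
```

The last call shows both that the mode works on qualitative data and that it need not be unique.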

Session 1.31

Mean, Median & Mode


Use the mean when:
sampling stability is desired
other measures are to be computed

Session 1.32

Mean, Median & Mode


Use the median when:
the exact midpoint of the distribution is desired
there are extreme observations

Session 1.33

Mean, Median & Mode


Use the mode when:
the "typical" value is desired
the dataset is measured on a nominal scale

Session 1.34


Percentiles

Numerical measures that give the position of a data value relative to the entire data set.
Divide an array (raw data arranged in increasing or decreasing order of magnitude) into 100 equal parts.
The jth percentile, denoted as Pj, is the data value in the data set that separates the
bottom j% of the data from the top (100-j)%.

Session 1.35


EXAMPLE
Suppose LJ was told that, relative to the other scores on a certain test, his score was the
95th percentile.
This means that 95% of those who took the test had scores less than or equal to LJ's score,
while 5% had scores higher than LJ's.
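
The interpretation in this example can be sketched as a percentile-rank calculation: the percentage of scores less than or equal to a given score. The function name `percentile_rank` and the 20 scores are hypothetical and serve only to illustrate the idea.

```python
def percentile_rank(scores, value):
    """Percentage of scores that are less than or equal to `value`."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_or_below = sum(1 for s in scores if s <= value)
    return 100 * at_or_below / len(scores)

# Hypothetical scores of 20 test takers: 1, 2, ..., 20.
scores = list(range(1, 21))
print(percentile_rank(scores, 19))  # 95.0
```

Here 19 of the 20 scores are at or below 19, so a score of 19 sits at the 95th percentile and only one score (5%) is higher.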
Session 1.36


Deciles

Divide an array into ten equal parts, each part having ten percent of the distribution of the
data values, denoted by Dj.

The 1st decile is the 10th percentile; the 2nd decile is the 20th percentile, and so on.
Session 1.37


Quartiles

Divide an array into four equal parts, each part having 25% of the distribution of the data
values, denoted by Qj.
The 1st quartile is the 25th percentile; the 2nd quartile is the 50th percentile, also the
median; and the 3rd quartile is the 75th percentile.
Session 1.38


Measures of Variation
A measure of variation is a single value that is used to describe the spread of the distribution
A measure of central tendency alone does not uniquely describe a distribution
Session 1.39


A look at dispersion
[Figure: dot plots of three data sets on a common axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57

Session 1.40


Two Types of Measures of Dispersion
Absolute Measures of Dispersion:
Range
Inter-quartile Range
Variance
Standard Deviation

Relative Measure of Dispersion:
Coefficient of Variation
Session 1.41


Range (R)
The difference between the maximum and minimum value in a data set, i.e.
R = MAX - MIN
Example: Pulse rates of 15 male residents of a certain village
54 58 58 60 62 65 66 71
74 75 77 78 80 82 85
R = 85 - 54 = 31
Session 1.42


Some Properties of the Range

The larger the value of the range, the more dispersed the observations are.
It is quick and easy to understand.
A rough measure of dispersion.
Session 1.43


Inter-Quartile Range (IQR)


The difference between the third quartile and first quartile, i.e.
IQR = Q3 - Q1
Example: Pulse rates of 15 residents of a certain village
54 58 58 60 62 65 66 71
74 75 77 78 80 82 85

IQR = 78 - 60 = 18
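
The range and IQR of the pulse-rate example can be reproduced with the sketch below. It assumes the quartile convention that places the j-th quartile at position j(n + 1)/4 of the sorted array, which is what `statistics.quantiles` does with `method="exclusive"`; other conventions can give slightly different quartiles.

```python
from statistics import quantiles

# Pulse rates from the example, in increasing order.
pulse = [54, 58, 58, 60, 62, 65, 66, 71, 74, 75, 77, 78, 80, 82, 85]

data_range = max(pulse) - min(pulse)

# Assumed convention: quartile j sits at position j(n + 1)/4 in the sorted data.
q1, q2, q3 = quantiles(pulse, n=4, method="exclusive")
iqr = q3 - q1

print(data_range)   # 31
print(q1, q3, iqr)  # 60.0 78.0 18.0
```

With n = 15, the positions are 4, 8 and 12, so Q1 = 60, the median is 71, and Q3 = 78, in agreement with the slide.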
Session 1.44


Some Properties of IQR

Reduces the influence of extreme values.

Not as easy to calculate as the Range.

Session 1.45


Variance
important measure of variation
shows variation about the mean

Population variance

$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

Sample variance

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Session 1.46


Standard Deviation (SD)

most important measure of variation
square root of Variance
has the same units as the original data

Population SD

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

Sample SD

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Session 1.47


Computation of Standard Deviation


Data: 10 12 14 15 17 18 18 24

n = 8, Mean = 16

$s = \sqrt{\frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + (18-16)^2 + (18-16)^2 + (24-16)^2}{7}} = 4.309$
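
The computation can be verified with a short sketch that applies the sample SD formula step by step and then compares the result with the standard library's `statistics.stdev`. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(data)
x_bar = sum(data) / n

# Squared deviations from the mean, summed and divided by n - 1.
squared_deviations = [(x - x_bar) ** 2 for x in data]
s_manual = sqrt(sum(squared_deviations) / (n - 1))

print(x_bar)                  # 16.0
print(round(s_manual, 3))     # 4.309
print(round(stdev(data), 3))  # 4.309
```

The squared deviations sum to 130, so s is the square root of 130/7. Note that the value 18 occurs twice in the data and contributes two terms to the sum.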

Session 1.48


Remarks on Standard Deviation

If there is a large amount of variation, then on average, the data values will be
far from the mean. Hence, the SD will be large.
If there is only a small amount of variation, then on average, the data
values will be close to the mean. Hence, the SD will be small.
Session 1.49


Comparing Standard Deviation


[Figure: dot plots of three data sets on a common axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57

Session 1.50


Comparing Standard Deviation


Example: Team A - Heights of five marathon players in inches
65 65 65 65 65
Mean = 65
s = 0

Session 1.51

Comparing Standard Deviation


Example: Team B - Heights of five marathon players in inches
62 67 66 70 60
Mean = 65
s = 4.0

Session 1.52

Properties of Standard Deviation

It is the most widely used measure of dispersion. (Chebyshev's Inequality)
It is based on all the items and is rigidly defined.
It is used to test the reliability of measures calculated from samples.
The standard deviation is sensitive to the presence of extreme values.
It is not easy to calculate by hand (unlike the range).
Session 1.53


Chebyshev's Rule
It permits us to make statements about the percentage of observations that must be within
a specified number of standard deviations of the mean.
The proportion of any distribution that lies within k standard deviations of the mean is at
least $1 - 1/k^2$, where k is any positive number larger than 1.
This rule applies to any distribution.
Session 1.54


Chebyshev's Rule
For any data set with mean (μ) and standard deviation (SD), the following statements apply:
At least 75% of the observations are within 2SD of its mean.
At least 88.9% of the observations are within 3SD of its mean.
Session 1.55


Illustration

[Figure: region within 2SD of the mean, labeled "At least 75%"]

At least 75% of the observations are within 2SD of its mean.
Session 1.56


Example
The midterm exam scores of 100 STAT 1 students
last semester had a mean of 65 and a standard
deviation of 8 points.
Applying Chebyshev's Rule, we can say that:
1. At least 75% of the students had scores
between 49 and 81.
2. At least 88.9% of the students had scores
between 41 and 89.
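
The two statements in this example follow directly from the bound 1 - 1/k^2 with k = 2 and k = 3. A minimal sketch, with hypothetical function names `chebyshev_bound` and `chebyshev_interval`:

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

def chebyshev_interval(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd, to which the bound applies."""
    return mean - k * sd, mean + k * sd

for k in (2, 3):
    low, high = chebyshev_interval(65, 8, k)
    print(f"k={k}: at least {chebyshev_bound(k):.1%} of scores lie between {low} and {high}")
# k=2: at least 75.0% of scores lie between 49 and 81
# k=3: at least 88.9% of scores lie between 41 and 89
```

The bound is a guaranteed minimum for any distribution; for a particular data set the actual percentage within the interval is often larger.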

Session 1.57


Coefficient of Variation (CV)

measure of relative variation
usually expressed in percent
shows variation relative to mean
used to compare 2 or more groups
Formula:
$CV = \frac{\text{SD}}{\text{Mean}} \times 100\%$
Session 1.58


Comparing CVs
Stock A: Average Price = P50
SD = P5
CV = 10%
Stock B: Average Price = P100
SD = P5
CV = 5%
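
The comparison of the two stocks can be reproduced with a small sketch of the CV formula. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: (SD / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean

print(coefficient_of_variation(5, 50))   # 10.0  (Stock A)
print(coefficient_of_variation(5, 100))  # 5.0   (Stock B)
```

Both stocks have the same SD of P5, but relative to its average price Stock A varies twice as much as Stock B.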

Session 1.59


Measure of Skewness

Describes the degree of departure of the distribution of the data from symmetry.
The degree of skewness is measured by the coefficient of skewness, denoted as SK
and computed as,

$SK = \frac{3(\text{Mean} - \text{Median})}{\text{SD}}$

Session 1.60

What is Symmetry?
A distribution is said to be symmetric about the mean if the distribution to the left of
the mean is the mirror image of the distribution to the right of the mean. Likewise, a
symmetric distribution has SK = 0 since its mean is equal to its median and its mode.
Session 1.61

Measure of Skewness
SK > 0
positively skewed

SK < 0
negatively skewed
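
The coefficient SK = 3(Mean - Median)/SD and its sign can be illustrated with a short sketch. It assumes the sample standard deviation is used for SD, and the data sets are hypothetical.

```python
from statistics import mean, median, stdev

def pearson_skewness(data):
    """SK = 3 * (mean - median) / SD, using the sample SD (an assumption here)."""
    return 3 * (mean(data) - median(data)) / stdev(data)

# Hypothetical data sets chosen for illustration.
sk_right = pearson_skewness([1, 3, 5, 7, 14])  # long tail to the right
sk_sym = pearson_skewness([1, 3, 5, 7, 9])     # symmetric about 5

print(round(sk_right, 3))  # 0.6
print(round(sk_sym, 3))    # 0.0
```

In the first data set the extreme value 14 pulls the mean (6) above the median (5), giving SK > 0; in the second the mean and median coincide, giving SK = 0.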

Session 1.62

Measure of Kurtosis

Describes the extent of peakedness or flatness of the distribution of the data.
Measured by coefficient of kurtosis (K) computed as,

$K = \frac{\sum_{i=1}^{N} (X_i - \mu)^4}{N\sigma^4} - 3$

Session 1.63


Measure of Kurtosis
K=0
mesokurtic

K>0
leptokurtic

K<0
platykurtic
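
A sketch of this classification is given below. It assumes the common population "excess kurtosis" definition (fourth central moment divided by the squared variance, minus 3), which is consistent with K = 0 being mesokurtic; the function names and the data are hypothetical.

```python
def excess_kurtosis(data):
    """Population excess kurtosis: fourth central moment over variance squared, minus 3."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # population variance
    m4 = sum((x - mu) ** 4 for x in data) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2**2 - 3

def classify(k, tol=1e-9):
    """Label a coefficient of kurtosis using the K = 0, K > 0, K < 0 rule."""
    if abs(k) <= tol:
        return "mesokurtic"
    return "leptokurtic" if k > 0 else "platykurtic"

k = excess_kurtosis([1, 2, 3, 4, 5])  # hypothetical, evenly spread data
print(round(k, 2), classify(k))  # -1.3 platykurtic
```

Evenly spread values have no pronounced peak, so the coefficient is negative and the data are classified as platykurtic.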
Session 1.64

Box-and-Whiskers Plot

Concerned with the symmetry of the distribution and incorporates measures of location
in order to study the variability of the observations.
Also called a box plot; it displays the 5-number summary (Min, Q1, Q2, Q3, and Max).
Suitable for identifying outliers.
Session 1.65

Box-and-Whiskers Plot
The diagram is made up of a box which
lies between the first and third
quartiles.
The whiskers are the straight lines extending from the ends of the box to
the smallest and largest values that are not outliers.

Session 1.66


Steps to Construct a Box-and-Whiskers plot


Step 1: Draw a rectangular box whose left edge is at the Q1 and whose
right edge is at the Q3 so the box width is the IQR. Then draw a
vertical line segment inside the box where the median is found.

[Figure: box drawn from Q1 = 75 to Q3 = 85, with the median (Md) at 78]
Session 1.67


Steps to Construct a Box-and-Whiskers plot


Step 2: Place marks at distances 1.5 × IQR from either end of the box. (1.5 × IQR = 15)

[Figure: box with Q1 = 75, Md = 78, Q3 = 85, and marks at 60 and 100, each 1.5 × IQR from the box]

Session 1.68

Steps to Construct a Box-and-Whiskers plot
Step 3: Draw the horizontal line segments known as the whiskers from each end of the
box to the largest and smallest values in the data set that are not outliers.
(An observation more than 1.5 × IQR beyond either end of the box is an outlier.)
If the largest and smallest values in the data set are outliers, extend the whiskers until
1.5 × IQR from either end of the box.
Session 1.69


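Steps to Construct a Box-and-Whiskers plot
Step 4: For every outlier, draw a dot. If two or more dots have the same values, draw the
dots side by side.

[Figure: completed box plot; axis labels 55, 60, 75 (Q1), 78 (Md), 85 (Q3), 98, 100, with marks
1.5 × IQR from each end of the box and dots for the outliers]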
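
The construction steps can be sketched numerically: compute the marks 1.5 × IQR from the box, find the whisker ends, and list the outliers. The function `box_plot_summary` and the observations are hypothetical; the quartiles Q1 = 75 and Q3 = 85 are taken from the slides' example and are passed in rather than computed, since quartile conventions differ.

```python
def box_plot_summary(data, q1, q3):
    """Marks (fences), whisker ends and outliers for a box plot, given the quartiles."""
    step = 1.5 * (q3 - q1)
    lower_fence, upper_fence = q1 - step, q3 + step
    # Whiskers reach the smallest and largest values that are not outliers.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Hypothetical observations, for illustration only.
values = [55, 55, 62, 75, 76, 78, 80, 85, 90, 98]
summary = box_plot_summary(values, q1=75, q3=85)
print(summary["fences"])    # (60.0, 100.0)
print(summary["whiskers"])  # (62, 98)
print(summary["outliers"])  # [55, 55]
```

With IQR = 10 the marks fall at 60 and 100, so the two observations at 55 are drawn as side-by-side dots and the whiskers stop at 62 and 98.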
Session 1.70