
# PROBABILITY AND STATISTICS

Note: Most of the slides were taken from Elementary Statistics: A Handbook of Slide Presentation, prepared by Z.V.J. Albacea, C.E. Reano, R.V. Collado, L.N. Comia and N.A. Tandang in 2005 for the Institute of Statistics, CAS, UP Los Banos.

Statistics is like a bikini: what it reveals is suggestive, what it conceals is vital.
Session 1.2

The man in the street distrusts statistics and despises [his image of] statisticians, those who diligently collect irrelevant facts and figures and use them to manipulate society.

"There are three kinds of lies: lies, damned lies, and statistics." (Mark Twain)

One cannot go about without statistics.

"Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital." (Aaron Levenstein)

Session 1.3

...the most important science in the whole world: for upon it depends the practical application of every other science and of every art: the one science essential to all political and social administration, all education, all organization based on experience, for it only gives results of our experience.

To understand God's thoughts, we must study statistics, for these are the measures of His purpose.
Session 1.4

Session 1.5

## Definition of Statistics

Plural sense: numerical facts, e.g. the CPI or the peso-dollar exchange rate.

Singular sense: a scientific discipline consisting of theory and methods for processing numerical information that one can use when making decisions in the face of uncertainty.

Session 1.6

## History of Statistics

The term statistics came from the Latin phrase ratio status, which means the study of practical politics or the statesman's art.

In the middle of the 18th century, the term statistik (a term due to Achenwall) was used, a German term defined as the political science of several countries.

From statistik it became statistics, defined as a statement in figures and facts of the present condition of a state.
Session 1.7

## Application of Statistics

Diverse applications

During the 20th century, statistical thinking and methodology have become the scientific framework for literally dozens of fields including education, agriculture, economics, biology, and medicine, and with increasing influence recently on the hard sciences such as astronomy, geology, and physics. In other words, we have grown from a small obscure field into a big obscure field.
Session 1.8

## Application of Statistics

Comparing the effects of five kinds of fertilizers on the yield of a particular variety of corn
Determining the income distribution of Filipino families
Comparing the effectiveness of two diet programs
Prediction of daily temperatures
Evaluation of student performance
Session 1.9

## Two Aims of Statistics

Statistics aims to uncover structure in data and to explain variation.
Descriptive
Inferential

Session 1.10

## Areas of Statistics

Descriptive statistics: methods concerned with collecting, describing, and analyzing a set of data without drawing conclusions (or inferences) about a large group

Inferential statistics: methods concerned with the analysis of a subset of data leading to predictions or inferences about the entire set of data

Session 1.11

## Example of Descriptive Statistics

Present the Philippine population by constructing a
graph indicating the total number of Filipinos counted
during the last census by age group and sex

Session 1.12

## Example of Inferential Statistics

A new milk formulation designed to improve the psychomotor
development of infants was tested on randomly selected infants.

Based on the results, it was concluded that the new milk formulation is
effective in improving the psychomotor development of infants.

Session 1.13

## Inferential Statistics

[Diagram: a Larger Set (N units/observations) and a Smaller Set (n units/observations), linked by Inferences and Generalizations.]

Session 1.14

## Key Definitions

A universe is the collection of things or observational units under consideration.

A variable is a characteristic observed or measured on every unit of the universe.

A population is the set of all possible values of the variable.

Session 1.15

## Key Definitions

Parameters are numerical measures that describe the population or universe of interest. They are usually denoted by Greek letters: μ (mu), σ (sigma), ρ (rho), λ (lambda), τ (tau), θ (theta), α (alpha) and β (beta).

Statistics are numerical measures of a sample.
Session 1.16

## Types of Variables

Qualitative variable: non-numerical values

Quantitative variable: numerical values
a. Discrete: countable
b. Continuous: measurable

Constant
Session 1.17

## Levels of Measurement

1. Nominal
2. Ordinal scale: accounts for order; no indication of distance between positions
3. Interval scale
4. Ratio scale

Session 1.18

## Methods of Collecting Data

Objective Method

Subjective Method

Session 1.19

Textual
Tabular
Graphical
Session 1.20

## Summary Measures

Location: Maximum, Minimum; Central Tendency (Mean, Median, Mode); Percentile, Quartile, Decile
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Skewness
Kurtosis

Session 1.21

## Measures of Location

A Measure of Location summarizes a data set by giving a typical value within the range of the data values that describes its location relative to the entire data set.
Some Common Measures:
Minimum, Maximum
Central Tendency
Percentiles, Deciles, Quartiles

Session 1.22

Minimum is the smallest value in the data set, denoted as MIN.

Maximum is the largest value in the data set, denoted as MAX.
Session 1.23

A measure of central tendency is a single value that is used to identify the center of the data
it is thought of as a typical value of the distribution
precise yet simple
most representative value of the data

Session 1.24

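## Mean

Most common measure of the center
Also known as the arithmetic average

Population Mean

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

Sample Mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$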

Session 1.25

## Properties of the Mean

may not be an actual observation in the data set
can be applied to data measured on at least the interval level
easy to compute
every observation contributes to the value of the mean
Session 1.26

subgroup means can be combined to come up with a group mean
easily affected by extreme values

[Figure: two dot plots, one on an axis from 0 to 10 with Mean = 5 and one on an axis extended to 14 with Mean = 6.]
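
The effect of an extreme value on the mean, and the way subgroup means combine, can be sketched in Python. The data values below are hypothetical, chosen only so that the means agree with the figure (5 and 6); they are not recovered from the slide.

```python
from statistics import mean

# Hypothetical data: the second set replaces 9 with an extreme value, 14.
base = [1, 3, 5, 7, 9]
with_extreme = [1, 3, 5, 7, 14]

mean_base = mean(base)
mean_extreme = mean(with_extreme)
print(mean_base, mean_extreme)  # 5 6

# Combining subgroup means: weight each subgroup mean by its size.
group_a, group_b = [1, 3, 5], [7, 9]
combined = (len(group_a) * mean(group_a) + len(group_b) * mean(group_b)) / (
    len(group_a) + len(group_b)
)
print(combined)  # 5.0, the same as the mean of all five values
```

A single large observation moves the mean from 5 to 6 even though four of the five values are unchanged.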

Session 1.27

## Median

Divides an array into two equal parts.
If the number of observations is odd, the median is the middle number.
If the number of observations is even, the median is the average of the 2 middle numbers.

The sample median is denoted as $\tilde{x}$, while the population median is denoted as $\tilde{\mu}$.

Session 1.28

## Properties of a Median

may not be an actual observation in the data set
can be applied to data measured on at least the ordinal level
a positional measure; not affected by extreme values

[Figure: dot plots on an axis from 0 to 10 and on an axis extended to 14; Median = 5.]
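
The odd/even rule for the median can be sketched as a small Python function. The function name `median_of` and the data values are hypothetical illustrations, not taken from the slides.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_of([9, 1, 5, 3, 7])       # sorted: 1 3 5 7 9 -> middle value
even_case = median_of([1, 3, 5, 7])         # average of 3 and 5
with_extreme = median_of([1, 3, 5, 7, 14])  # extreme value does not move the median
print(odd_case, even_case, with_extreme)
```

Replacing 9 with 14 leaves the median unchanged, which is what "positional measure" means in practice.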

Session 1.29

## Mode

occurs most frequently
nominal average
may or may not exist

[Figure: two dot plots, one on an axis from 0 to 14 with Mode = 9 and one on an axis from 0 to 6 with No Mode.]
Session 1.30

## Properties of a Mode

can be used for qualitative as well as quantitative data
may not be unique
not affected by extreme values
can be computed for ungrouped and grouped data
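
A minimal sketch of finding the mode, assuming the convention that a data set in which every value occurs exactly once has no mode. The function name `modes` and all data values are hypothetical; the input is assumed to be non-empty.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values; an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:  # every value occurs once: treat as no mode (assumed convention)
        return []
    return [value for value, count in counts.items() if count == top]


no_mode = modes([0, 1, 2, 3, 4, 5, 6])
single_mode = modes([1, 5, 9, 9, 9, 12])
# Works for qualitative data too, and the mode need not be unique.
two_modes = modes(["red", "blue", "red", "green", "blue"])
print(no_mode, single_mode, two_modes)
```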

Session 1.31

## Mean, Median & Mode

Use the mean when:

sampling stability is desired
other measures are to be computed

Session 1.32

## Mean, Median & Mode

Use the median when:

the exact midpoint of the distribution is desired
there are extreme observations

Session 1.33

## Mean, Median & Mode

Use the mode when:

the "typical" value is desired
the dataset is measured on a nominal scale

Session 1.34

## Percentiles

Numerical measures that give the relative position of a data value relative to the entire data set.
Divide an array (raw data arranged in increasing or decreasing order of magnitude) into 100 equal parts.
The jth percentile, denoted as $P_j$, is the data value in the data set that separates the bottom j% of the data from the top (100-j)%.

Session 1.35

## Example

Suppose LJ was told that, relative to the other scores on a certain test, his score was the 95th percentile.
This means that 95% of those who took the test had scores less than or equal to LJ's score, while 5% had scores higher than LJ's.
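
The interpretation of a percentile as "percent of scores at or below a value" can be checked with a short Python sketch. The scores, LJ's score, and the function name `percent_at_or_below` are all hypothetical; the slides give no actual test data.

```python
def percent_at_or_below(scores, value):
    """Percentage of scores that are less than or equal to the given value."""
    return 100 * sum(1 for s in scores if s <= value) / len(scores)


# Hypothetical class of 20 test scores: 51, 52, ..., 70.
scores = list(range(51, 71))
lj_score = 69  # hypothetical score for LJ

at_or_below = percent_at_or_below(scores, lj_score)
above = 100 - at_or_below
print(at_or_below, above)  # 95.0 5.0
```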
Session 1.36

## Deciles

Divide an array into ten equal parts, each part having ten percent of the distribution of the data values, denoted by $D_j$.

The 1st decile is the 10th percentile; the 2nd decile is the 20th percentile, and so on.
Session 1.37

## Quartiles

Divide an array into four equal parts, each part having 25% of the distribution of the data values, denoted by $Q_j$.
The 1st quartile is the 25th percentile; the 2nd quartile is the 50th percentile, also the median; and the 3rd quartile is the 75th percentile.
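
The relationship between percentiles, deciles, and quartiles can be sketched with Python's `statistics.quantiles`. The data are the 15 pulse rates used in the range example of this deck. The interpolation rule (the function's default "exclusive" method) is an assumption; the slides do not state which rule they use, and other conventions can give slightly different values.

```python
from statistics import quantiles

pulse = [54, 58, 58, 60, 62, 65, 66, 71, 74, 75, 77, 78, 80, 82, 85]

percentiles = quantiles(pulse, n=100)  # P1 .. P99; P_j is percentiles[j - 1]
deciles = quantiles(pulse, n=10)       # D1 .. D9
quartiles = quantiles(pulse, n=4)      # Q1, Q2, Q3

# D_j is the (10 j)th percentile; Q_j is the (25 j)th percentile.
deciles_from_percentiles = percentiles[9::10]
quartiles_from_percentiles = [percentiles[24], percentiles[49], percentiles[74]]

print(quartiles)  # [60.0, 71.0, 78.0]
```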
Session 1.38

## Measures of Variation

A measure of variation is a single value that is used to describe the spread of the distribution.
A measure of central tendency alone does not uniquely describe a distribution.
Session 1.39

## A Look at Dispersion

[Figure: dot plots of three data sets on a common axis from 11 to 21.]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57

Session 1.40

## Two Types of Measures of Dispersion

Absolute Measures of Dispersion:
Range
Inter-quartile Range
Variance
Standard Deviation

Relative Measure of Dispersion:
Coefficient of Variation
Session 1.41

## Range (R)

The difference between the maximum and minimum value in a data set, i.e.
R = MAX - MIN
Example: Pulse rates of 15 male residents of a certain village
54 58 58 60 62 65 66 71
74 75 77 78 80 82 85
R = 85 - 54 = 31
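
The range computation for the pulse-rate example can be reproduced in a few lines of Python; the variable names are illustrative only.

```python
pulse = [54, 58, 58, 60, 62, 65, 66, 71, 74, 75, 77, 78, 80, 82, 85]

# R = MAX - MIN
data_range = max(pulse) - min(pulse)
print(data_range)  # 31
```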
Session 1.42

The larger the value of the range, the more dispersed the observations are.
It is quick and easy to understand.
A rough measure of dispersion.
Session 1.43

## Inter-Quartile Range (IQR)

The difference between the third quartile and first quartile, i.e.
IQR = Q3 - Q1
Example: Pulse rates of 15 residents of a certain village
54 58 58 60 62 65 66 71
74 75 77 78 80 82 85

IQR = 78 - 60 = 18
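
A sketch of the IQR computation in Python. It assumes the quartiles are located with the default "exclusive" rule of `statistics.quantiles`, which reproduces Q1 = 60 and Q3 = 78 for these 15 values; the slides do not say which quartile rule was used.

```python
from statistics import quantiles

pulse = [54, 58, 58, 60, 62, 65, 66, 71, 74, 75, 77, 78, 80, 82, 85]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(pulse, n=4)
iqr = q3 - q1
print(q1, q3, iqr)  # 60.0 78.0 18.0
```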
Session 1.44

extreme values.

as the Range.

Session 1.45

## Variance

important measure of variation
shows variation about the mean

Population variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Session 1.46

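## Standard Deviation

most important measure of variation
square root of Variance
has the same units as the original data

Population SD

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

Sample SD

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$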
Session 1.47

## Computation of Standard Deviation

Data: 10 12 14 15 17 18 18 24

n = 8, Mean = 16

$$s = \sqrt{\frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + (18-16)^2 + (18-16)^2 + (24-16)^2}{7}} = 4.309$$
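
The worked computation can be checked step by step in Python, using the sample formula with n - 1 in the denominator. The variable names are illustrative; `statistics.stdev` is used only as a cross-check.

```python
import math
from statistics import stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
x_bar = sum(data) / n                                  # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)       # sum of squared deviations
s = math.sqrt(sum_sq_dev / (n - 1))                    # sample standard deviation

print(n, x_bar, sum_sq_dev, round(s, 3))  # 8 16.0 130.0 4.309
```

All eight observations contribute a squared deviation, including both values of 18.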

Session 1.48

If there is a large amount of variation, then on average, the data values will be far from the mean. Hence, the SD will be large.

If there is only a small amount of variation, then on average, the data values will be close to the mean. Hence, the SD will be small.
Session 1.49

[Figure: dot plots of three data sets on a common axis from 11 to 21.]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57

Session 1.50

## Comparing Standard Deviation

Example: Team A - Heights of five marathon players in inches

65 65 65 65 65

Mean = 65
s = 0

Session 1.51

## Comparing Standard Deviation

Example: Team B - Heights of five marathon players in inches

62 67 66 70 60

Mean = 65
s = 4.0
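
The two teams can be compared directly in Python. Both have the same mean height, so only the standard deviation separates them. This is a small sketch using the sample standard deviation from the `statistics` module.

```python
from statistics import mean, stdev

team_a = [65, 65, 65, 65, 65]
team_b = [62, 67, 66, 70, 60]

mean_a, mean_b = mean(team_a), mean(team_b)
s_a, s_b = stdev(team_a), stdev(team_b)

print(mean_a, mean_b)  # 65 65
print(s_a, s_b)        # 0.0 4.0
```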

Session 1.52

The standard deviation is the most widely used measure of dispersion (Chebyshev's Inequality).
It is based on all the items and is rigidly defined.
It is used to test the reliability of measures calculated from samples.
It is sensitive to the presence of extreme values.
It is not easy to calculate by hand (unlike the range).
Session 1.53

## Chebyshev's Rule

It permits us to make statements about the percentage of observations that must be within a specified number of standard deviations from the mean.
The proportion of any distribution that lies within k standard deviations of the mean is at least $1 - 1/k^2$, where k is any positive number larger than 1.
This rule applies to any distribution.
Session 1.54

## Chebyshev's Rule

For any data set with mean (μ) and standard deviation (SD), the following statements apply:
At least 75% of the observations are within 2SD of its mean.
At least 88.9% of the observations are within 3SD of its mean.
Session 1.55

## Illustration

[Figure: at least 75% of the observations are within 2SD of its mean.]
Session 1.56

## Example

The midterm exam scores of 100 STAT 1 students last semester had a mean of 65 and a standard deviation of 8 points.
Applying Chebyshev's Rule, we can say that:
1. At least 75% of the students had scores between 49 and 81.
2. At least 88.9% of the students had scores between 41 and 89.
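
The numbers in this example follow from the bound 1 - 1/k² and the interval mean ± k·SD. The Python sketch below reproduces them; the function names `chebyshev_bound` and `chebyshev_interval` are hypothetical helpers.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be larger than 1")
    return 1 - 1 / k**2


def chebyshev_interval(mean, sd, k):
    """Interval that extends k standard deviations on either side of the mean."""
    return mean - k * sd, mean + k * sd


mean_score, sd_score = 65, 8

for k in (2, 3):
    low, high = chebyshev_interval(mean_score, sd_score, k)
    percent = round(100 * chebyshev_bound(k), 1)
    print(f"At least {percent}% of scores lie between {low} and {high}")
```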

Session 1.57

## Coefficient of Variation (CV)

measure of relative variation
usually expressed in percent
shows variation relative to mean
used to compare 2 or more groups
Formula:

$$\text{CV} = \frac{\text{SD}}{\text{Mean}} \times 100\%$$
Session 1.58

## Comparing CVs

Stock A: Average Price = P50, SD = P5, CV = 10%
Stock B: Average Price = P100, SD = P5, CV = 5%
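
The two CVs can be reproduced with a one-line function. The function name is illustrative. Both stocks have the same SD, so the difference in CV comes entirely from the difference in average price.

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: standard deviation relative to the mean."""
    return 100 * sd / mean


cv_a = coefficient_of_variation(sd=5, mean=50)   # Stock A
cv_b = coefficient_of_variation(sd=5, mean=100)  # Stock B
print(cv_a, cv_b)  # 10.0 5.0
```

Relative to its average price, Stock A is twice as variable as Stock B.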

Session 1.59

## Measure of Skewness

Describes the degree of departure of the distribution of the data from symmetry.
The degree of skewness is measured by the coefficient of skewness, denoted as SK and computed as

$$\text{SK} = \frac{3(\text{Mean} - \text{Median})}{\text{SD}}$$
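
A minimal Python sketch of the coefficient of skewness as defined here, 3(Mean - Median)/SD. It assumes the sample standard deviation is used for SD, which the slide does not specify. The data sets are hypothetical.

```python
from statistics import mean, median, stdev


def skewness_coefficient(values):
    """SK = 3 * (mean - median) / SD, using the sample SD (an assumption)."""
    return 3 * (mean(values) - median(values)) / stdev(values)


symmetric = [1, 3, 5, 7, 9]      # mean = median = 5
right_tailed = [1, 3, 5, 7, 14]  # mean 6, median 5, sample SD 5

sk_symmetric = skewness_coefficient(symmetric)
sk_right = skewness_coefficient(right_tailed)
print(sk_symmetric, sk_right)
```

The symmetric set gives SK = 0, while the set with a long right tail gives a positive SK because the mean is pulled above the median.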

Session 1.60

## What is Symmetry?

A distribution is said to be symmetric about the mean if the distribution to the left of the mean is the mirror image of the distribution to the right of the mean. Likewise, a symmetric distribution has SK = 0 since its mean is equal to its median and its mode.
Session 1.61

## Measure of Skewness

SK > 0: positively skewed
SK < 0: negatively skewed

Session 1.62

## Measure of Kurtosis

Describes the extent of peakedness or flatness of the distribution of the data.
Measured by the coefficient of kurtosis (K), computed as

$$K = \frac{\sum_{i=1}^{N} (X_i - \mu)^4}{N \sigma^4} - 3$$

Session 1.63

## Measure of Kurtosis

K = 0: mesokurtic
K > 0: leptokurtic
K < 0: platykurtic
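
A sketch of the classification in Python, assuming K is the excess kurtosis: the fourth central moment divided by the squared population variance, minus 3. That assumption is consistent with K = 0 being mesokurtic, but the exact formula used in the slides is not confirmed. The function names and data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Population fourth moment over squared variance, minus 3 (assumed definition)."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    fourth_moment = sum((x - mu) ** 4 for x in values) / n
    return fourth_moment / variance**2 - 3


def classify(k, tol=1e-9):
    if abs(k) <= tol:
        return "mesokurtic"
    return "leptokurtic" if k > 0 else "platykurtic"


flat = [1, 2, 3, 4, 5]            # evenly spread values
peaked = [0, 5, 5, 5, 5, 5, 10]   # most values at the center, two far out

k_flat = excess_kurtosis(flat)
k_peaked = excess_kurtosis(peaked)
print(classify(k_flat), classify(k_peaked))
```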
Session 1.64

## Box-and-Whiskers Plot

Concerned with the symmetry of the distribution and incorporates measures of location in order to study the variability of the observations.
Also called a box plot; it displays the 5-number summary (Min, Q1, Q2, Q3, and Max).
Suitable for identifying outliers.
Session 1.65

## Box-and-Whiskers Plot

The diagram is made up of a box which lies between the first and third quartiles.
The whiskers are the straight lines extending from the ends of the box to the smallest and largest values that are not outliers.

Session 1.66

## Steps to Construct a Box-and-Whiskers plot

Step 1: Draw a rectangular box whose left edge is at the Q1 and whose right edge is at the Q3 so the box width is the IQR. Then draw a vertical line segment inside the box where the median is found.

[Figure: box with Q1 = 75, Md = 78, and Q3 = 85.]
Session 1.67

## Steps to Construct a Box-and-Whiskers plot

Step 2: Place marks at distances 1.5 IQR from either end of the box. (1.5 IQR = 15)

[Figure: box with Q1 = 75, Md = 78, and Q3 = 85; marks at 60 and 100, each 1.5 IQR from an end of the box.]

Session 1.68

## Steps to Construct a Box-and-Whiskers plot

Step 3: Draw the horizontal line segments known as the whiskers from each end of the box to the largest and smallest values in the data set that are not outliers. (An observation more than 1.5 IQR beyond either end of the box is an outlier.)
If the largest and smallest values in the data set are outliers, extend the whiskers to 1.5 IQR from either end of the box.
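
The 1.5 IQR rule can be sketched in Python using the quartiles from the figure (Q1 = 75, Q3 = 85). The list of observations and the function names are hypothetical, since the underlying data set is not given in the slides.

```python
def fences(q1, q3, k=1.5):
    """Marks placed k * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, q1, q3):
    """Separate observations inside the marks from outliers beyond them."""
    low, high = fences(q1, q3)
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inside, outliers


q1, q3 = 75, 85
low_mark, high_mark = fences(q1, q3)  # 60.0 and 100.0, since 1.5 * IQR = 15

# Hypothetical observations, for illustration only.
values = [55, 62, 70, 75, 78, 85, 90, 98, 104]
inside, outliers = split_outliers(values, q1, q3)

# Whiskers end at the smallest and largest values that are not outliers.
whisker_ends = (min(inside), max(inside))
print(low_mark, high_mark, whisker_ends, outliers)
```

With these values the whiskers stop at 62 and 98, and 55 and 104 would be drawn as dots.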
Session 1.69

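## Steps to Construct a Box-and-Whiskers plot

Step 4: For every outlier, draw a dot. If two or more dots have the same values, draw the dots side by side.

[Figure: completed box plot with Q1 = 75, Md = 78, and Q3 = 85, marks at 60 and 100 (1.5 IQR from each end of the box), axis labels at 55 and 98, and dots marking outliers.]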
Session 1.70