Statistics and Probability: Bill Thaddeus Padasas

STATISTICS AND PROBABILITY
Bill Thaddeus Padasas

CHAPTER 1
INTRODUCTION TO STATISTICS
The word statistics can be viewed in two contexts:
Singular sense: statistics as a science
Plural sense: statistics as the actual numbers derived from the data

Introduction to Statistics
The word statistics can be viewed in two contexts:
Singular sense: a body of knowledge concerned with the collection, organization, presentation, analysis, and interpretation of data.
Plural sense: a collection of facts and figures or processed data.

Roles of Statistics in Decision Making
Statistics provides tools for
> Designing experiments or surveys
> Analyzing data
Statistics can help us answer questions like
1. What options to choose?
2. How do we make a choice?
3. Why choose such options?
Statistics is used in every field of science.
Consider the following problems:
Agricultural problem:
Is new grain seed or fertilizer more productive?
Economics:
What will be the unemployment rate next year?
Medical problem:
What is the right dosage of a drug for treatment?
Technical problem:
How to improve quality of product?
Political science:
How accurate are the election polls?
Two Broad Categories of Statistics
Descriptive Statistics
Inferential Statistics
Descriptive Statistics
- used to describe a mass of data in a clear, concise, and informative way.
- deals with the methods of organizing, summarizing, and presenting data.
Describes: Population, Sample, Results
Compares: Sample to Population, Sub-groups, Correlations
Summarizes: Central Tendency, Spread
Inferential Statistics
- concerned with making generalizations about the characteristics of a
larger set where only a part is examined.
Smaller set (n units or observations) → Inferences and Generalizations → Larger set (N units or observations)
CHAPTER 2
THE POPULATION
A. BASIC CONCEPTS
DATA
- Facts and figures that are collected, presented, and analyzed
- Can be numeric or non-numeric
- Must be contextualized
Who, What (and in what units), When, Where, Why, How

UNIVERSE
- A collection or set of all individuals or entities whose characteristics are to
be studied.
- Answers the question “Who?”
Types of Universe
1. Finite - when the elements of the universe can be counted for a given time
period.
2. Infinite – when the number of elements of the universe is unlimited.

VARIABLE
- Attribute or characteristic of interest measurable on each and every unit of the universe
- Answers the question “What?”
Variable
  Qualitative
  Quantitative
    Discrete
    Continuous
Types of Variables
1. Qualitative
- assumes values that are not numerical but can be categorized.
- categories may be identified by either non-numerical descriptions or by
numeric codes.
2. Quantitative
- indicates the quantity or amount of a characteristic
- data are always numeric
- can be discrete or continuous
Types of Quantitative Variables
1. Discrete
- variable with a finite or countable number of possible values
2. Continuous
- variable that assumes any value in a given interval
POPULATION – set of all possible values of the variable.
UNIVERSE    POPULATION
U1          Y1
U2          Y2
U3          Y3
...         ...
UN          YN
SAMPLE – subset of the population or universe
(Diagram: the SAMPLE drawn inside the UNIVERSE / POPULATION.)
LEVELS OF MEASUREMENT
Data may be classified into four hierarchical levels of measurement:
Nominal
Ordinal
Interval
Ratio
NOMINAL
Data collected are labels, names, or categories.
Frequencies or counts of observations belonging to the same category can be obtained.
It is the lowest level of measurement.
Examples: name, gender
ORDINAL
Data collected are labels with implied ordering.
The difference between two data labels is meaningless.
Examples: grade level, socio-economic status

INTERVAL
Data can be ordered or ranked.
The difference between two data values is meaningful.
Data at this level may lack an absolute zero.
Examples: IQ score, temperature (in °C and °F)

RATIO
Data have all the properties of the interval scale.
The number zero indicates the absence of the characteristic being measured.
It is the highest level of measurement.
Examples: age (in years), weekly food allowance (in pesos), height (in cm)
Statistical analysis depends on the variable's level of measurement.
Knowledge of the level of measurement serves as a guide in determining the appropriate algebraic operations and, consequently, the statistical tools that can be used for analysis.
Inappropriate analysis leads to erroneous conclusions, which may lead to dangerous consequences brought about by wrong decisions.
Methods of Data Collection
Objective Method
Subjective Method
Use of Existing Records
B. METHODS OF DATA COLLECTION
OBJECTIVE METHOD
The data are collected through measurement, counting, or observation.
This method requires the use of a measuring or counting instrument.
SUBJECTIVE METHOD
The information is provided by identified respondents.
The instrument used to gather data may take the form of a questionnaire.
The researcher collects the data by:
> conducting personal interviews, either face-to-face or by telephone; or
> gathering responses using mailed questionnaires.
USE OF EXISTING RECORDS
This method uses data which have been previously collected by another person or
institution for some other purpose.
TYPES OF DATA
1. Primary
- data which were acquired directly from the source.
2. Secondary
- data which were not acquired directly from the source.
Methods of Data Presentation
1. Textual
2. Tabular
3. Graphical
C. METHODS OF DATA PRESENTATION
1. TEXTUAL
✓ A narrative form of describing characteristics of the universe or population based on the data collected and organized, by giving highlights
✓ Applicable only when presenting new information
2. TABULAR
✓ Data are organized into classes or categories by rows and/or columns and appropriate
pieces of information are found in the cells of the table.
✓ Relatively more information can be presented and trends can be easily seen.
✓ Some details are lost when data are summarized in tabular form.
PARTS OF A STATISTICAL TABLE
1. Table heading – includes the table number and title.
2. Caption – designates the information contained in the columns.
3. Body – main part of the table containing the information or figures presented.
4. Stubs/Classes – categories which describe the data, usually found at the left-hand side of the table.
3. GRAPHICAL PRESENTATION
✓ It provides visual presentation of the distributional properties of the data.

✓ This is the most efficient way of presenting trends.
✓ Some details are lost in using this type of presentation.
PIE CHART
Figure 1. Percentage distribution of browser usage in Europe.

BAR GRAPH
Figure 2. World birth rate in ASEAN countries from 1990 to 2008

LINE GRAPH
SCATTER PLOT
Figure: Height vs Weight of Grade 11 Students (x-axis: Height (cm), 155 to 180; y-axis: Weight (kg), 0 to 90).
STEM AND LEAF PLOT
✓ Presents data in ordered form and provides an idea of the shape of the distribution of a set of quantitative data
✓ Best for a small number of observations with values greater than zero
✓ Also called stemplot

STEM AND LEAF PLOT
The performance scores received by the vehicles in a car show are as follows:
90 80 75 80 80 80 90
90 100 100 75 70 80 80
STEPS IN CONSTRUCTING A STEM-AND-LEAF PLOT
1. Arrange data in ascending or descending order.
70 75 75 80 80 80 80
80 80 90 90 90 100 100
2. Split each datum into a leaf value, which is the last digit, and a stem value, which consists of the remaining digits.
Examples: 75 has stem 7 and leaf 5; 100 has stem 10 and leaf 0.
3. List the stems vertically in increasing or decreasing order.
4. Draw a vertical line to the right of the stems.
 7 |
 8 |
 9 |
10 |
5. For each stem, write its leaves to the right of the vertical line in ascending order.
 7 | 0 5 5
 8 | 0 0 0 0 0 0
 9 | 0 0 0
10 | 0 0
Figure 1. Distribution of the performance scores received by the vehicles in a car show.
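The five steps above can be sketched in Python. This is a minimal illustration, assuming non-negative whole-number data; the function name `stem_and_leaf` is hypothetical and not part of the lecture. Stems with no leaves are still listed so that gaps in the distribution remain visible.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot as strings.

    Assumes non-negative integers (an assumption of this sketch).
    """
    ordered = sorted(values)                  # Step 1: ascending order
    leaves_by_stem = defaultdict(list)
    for value in ordered:
        stem, leaf = divmod(value, 10)        # Step 2: last digit is the leaf
        leaves_by_stem[stem].append(leaf)

    rows = []
    lowest, highest = min(leaves_by_stem), max(leaves_by_stem)
    for stem in range(lowest, highest + 1):   # Step 3: stems in increasing order
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        # Steps 4-5: vertical line, then the leaves in ascending order
        rows.append(f"{stem:>2} | {leaves}".rstrip())
    return rows


# Performance scores of the vehicles in the car show example
scores = [90, 80, 75, 80, 80, 80, 90, 90, 100, 100, 75, 70, 80, 80]
for row in stem_and_leaf(scores):
    print(row)
```

The printed rows match the plot built by hand in Step 5.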
STEM AND LEAF PLOT
IMPORTANT FEATURES:
✓ Reveals the center of the distribution
✓ Illustrates overall shape of the distribution like symmetry and spread
✓ Shows marked deviations from the overall shapes (gaps, extreme values)
ANSWER THIS
The raw scores of Grade 11 students on their Earth and Life Science First Term Assessment are as follows:
36 53 41 52 56
25 25 21 36 42
60 40 50 54 46
43 51 55 53
30 53 54 56
Make a stem-and-leaf plot based on these given values.
D. DESCRIPTIVE MEASURES
DESCRIPTIVE MEASURES
Values that are used to summarize the characteristics of a universe or population

Some of these are measures of:
Location
Dispersion
Skewness
Kurtosis
➢ Measures of Location
Summarizes a data set by giving a “typical value” within the range of the
data values that describes its location relative to entire data set.
Some common measures:

✓ Minimum, Maximum
✓ Measures of Central Tendency
✓ Percentiles, Deciles, Quartiles
➢ Minimum and Maximum
Minimum is the smallest value in the data set, denoted as MIN.

Maximum is the largest value in the data set, denoted as MAX.
➢ Measures of Central Tendency
Value(s) about which the set of observations tends to cluster
Also called an average
Most commonly used averages:

✓ Mean
✓ Median
✓ Mode
➢ Mean (Arithmetic Mean)
The sum of all observations in the data set divided by the total number of observations
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
where x_i = ith observation of the variable X and
N = total number of observations in the data set.
SOME PROPERTIES OF THE MEAN
❖ There is only one mean in a given data set.
❖ It is defined only for quantitative data.
❖ It reflects the magnitude of every observation.
❖ It is easily affected by the presence of extreme values.
❖ The sum of the deviations from the mean is equal to zero.
❖ The means of different sets/groups of comparable data may be combined when properly weighted.
Find the mean of the sample data
2 -1 0 2
Answer: 0.75
A random sample of ten students is taken from the student body of
a college and their GPAs are recorded as follows:
1.90 3.00 2.53 3.71 2.12 1.76 2.71 1.39 4.00 3.33
Answer: 2.645
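The two answers above can be checked with a short Python sketch of the mean formula. The function name `mean` is only illustrative. The last lines also check the property that the deviations from the mean sum to zero.

```python
def mean(values):
    """Arithmetic mean: the sum of all observations divided by their number."""
    values = list(values)
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


small_sample = [2, -1, 0, 2]
gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

print(mean(small_sample))        # 0.75
print(round(mean(gpas), 3))      # 2.645

# Property: the deviations from the mean sum to zero
# (up to floating-point rounding for non-integer data)
m = mean(small_sample)
print(sum(x - m for x in small_sample))
```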
➢ Median
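The middle value when the data values are arranged in ascending or descending order of magnitude.
$$Md = \begin{cases} x_{\left(\frac{N+1}{2}\right)}, & \text{if } N \text{ is odd,} \\[4pt] \dfrac{x_{\left(\frac{N}{2}\right)} + x_{\left(\frac{N}{2}+1\right)}}{2}, & \text{if } N \text{ is even.} \end{cases}$$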
SOME PROPERTIES OF THE MEDIAN
❖ There is only one median for a data set.
❖ It is not amenable to further computations.
❖ The sum of the absolute deviations of the observations from a value, say c, is smallest when c is equal to the median.
That is, $\sum_{i=1}^{N} |x_i - c|$ is minimum when c = Md.
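A minimal Python sketch of the median rule, together with a numerical check of the absolute-deviation property. The function names are hypothetical; the data are the two samples used earlier for the mean.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (N + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions N / 2 and N / 2 + 1


def total_abs_deviation(values, c):
    """Sum of |x_i - c| over all observations."""
    return sum(abs(x - c) for x in values)


small_sample = [2, -1, 0, 2]
gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

print(median(small_sample))          # 1.0
print(round(median(gpas), 2))        # 2.62

# The total absolute deviation is smaller at the median than at other trial values of c
for c in (median(gpas), 2.0, 3.0):
    print(round(c, 2), round(total_abs_deviation(gpas, c), 2))
```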

➢ Mode
The value(s) in the data set which occur most frequently, denoted by Mo.
Example:
Consider the following scores of 10 students in SA #4
10 10 10 8 8 10 5 8 10 10
Mo = 10
SOME PROPERTIES OF THE MODE
❖ It may or may not exist.
❖ If it exists, there can be more than one mode for a given data set.
❖ It is determined by the frequency and not by the values of the observations.
❖ It is applicable for both quantitative and qualitative data.
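The properties above can be illustrated with a small Python sketch. The function name `modes` is hypothetical, and the convention that a data set has no mode when every distinct value occurs equally often is an assumption of this sketch, not a rule stated in the lecture.

```python
from collections import Counter


def modes(values):
    """Return the list of most frequent values (empty if no mode exists).

    Assumed convention: if there are several distinct values and all of them
    occur equally often, the data set is treated as having no mode.
    """
    counts = Counter(values)
    if not counts:
        return []
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    highest = max(counts.values())
    # Counter keeps first-appearance order, so the result follows the data order
    return [value for value, count in counts.items() if count == highest]


scores = [10, 10, 10, 8, 8, 10, 5, 8, 10, 10]
print(modes(scores))                         # [10]
print(modes(["a", "b", "a", "b", "c"]))      # ['a', 'b'], two modes, qualitative data
print(modes([1, 2, 3]))                      # [], no mode
```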

PRACTICE!
1. Find the mean and the median for the LDL cholesterol level
in a sample of ten heart patients.
132 162 133 145 148

139 147 160 150 153
2. Find the mean and the median for the LDL cholesterol level
in a sample of ten heart patients on a special diet.
127 152 138 110 152

113 131 148 135 158
PRACTICE!
3. Five laboratory mice with thymus leukemia are observed for a predetermined period of 500 days. After 500 days, four mice have died but the fifth one survives. The recorded survival times for the five mice are
493 421 222 378 500*
where 500* indicates that the fifth mouse survived for at least 500 days but the survival time (i.e., the exact value of the observation) is unknown.
a. Can you find the sample mean for the data set? If so, find it. If not, why not?
b. Can you find the sample median for the data set? If so, find it. If not, why not?
PERCENTILES
❖ Numerical measures that give the position of a data value relative to the entire data set.
❖ Divide an array (raw data arranged in increasing or decreasing order of magnitude) into 100 equal parts.
❖ The jth percentile, denoted as Pj, is the data value that separates the bottom j% of the data from the top (100-j)%.
FINDING THE jTH PERCENTILE
1. Arrange the data values in ascending order.
2. Find the location of Pj in the arranged list by computing
$$L = \left(\frac{j}{100}\right) \times N$$
where N = the total number of data values,
j = percentile of interest, and
L = location.
3. (a) If L is a whole number, then Pj is the mean of the data values in position L and position L + 1.
(b) If L is not a whole number, Pj is taken as the data value in the next higher whole number position.
Remark: Percentiles are generally computed for large data sets.
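The three-step procedure can be sketched in Python as follows. The function name `percentile` is hypothetical. The sketch assumes j is a whole number from 1 to 99, because rule 3(a) needs a position L + 1 that does not exist when j = 100. The car show scores are reused only as sample data.

```python
def percentile(values, j):
    """jth percentile using the location rule L = (j / 100) * N.

    Assumes j is an integer with 0 < j < 100 (a restriction of this sketch).
    """
    if not 0 < j < 100:
        raise ValueError("this sketch only handles 0 < j < 100")
    ordered = sorted(values)                  # Step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("empty data set")
    # Step 2: L = j * N / 100, split into whole part and remainder
    whole, remainder = divmod(j * n, 100)
    if remainder == 0:
        # Step 3(a): mean of the values in positions L and L + 1 (counting from 1)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Step 3(b): value in the next higher whole-number position
    return ordered[whole]


scores = [90, 80, 75, 80, 80, 80, 90, 90, 100, 100, 75, 70, 80, 80]
print(percentile(scores, 10))   # L = 1.4  -> position 2  -> 75
print(percentile(scores, 25))   # L = 3.5  -> position 4  -> 80
print(percentile(scores, 50))   # L = 7    -> mean of positions 7 and 8 -> 80.0
print(percentile(scores, 75))   # L = 10.5 -> position 11 -> 90
```

Using `divmod(j * n, 100)` rather than the floating-point product avoids rounding problems when deciding whether L is a whole number.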

DECILES
❖ These values divide an array into ten equal parts, each part
having ten percent of the data values, denoted by Dj.
❖ The 1st decile is the 10th percentile; the 2nd decile is the
20th percentile and so on.
DECILES
Remember:
In computing for the deciles, follow the procedure in the
computation of the equivalent percentiles.
D1 = P10    D6 = P60
D2 = P20    D7 = P70
D3 = P30    D8 = P80
D4 = P40    D9 = P90
D5 = P50    D10 = P100
QUARTILES
❖ These values divide an array into four equal parts, each part
having 25% of the data values, denoted by Qj.
❖ The 1st quartile is the 25th percentile; the 2nd quartile is

the 50th percentile (median), and the 3rd quartile is the 75th
percentile.
QUARTILES
Remember:
Similarly, in the computation of the quartiles we use the
procedure in the computation of the equivalent percentiles.
Q1 = P25
Q2 = P50 = Median = D5
Q3 = P75
Q4 = P100
ILLUSTRATION
Two different groups of 10 students were given identical exams. The scores of the students are shown as
Group A
 6 | 5 6 7 8
 7 | 1 3 4 7 7 7
Mean: 71.5
Median: 72.0
Mode: 77.0
Group B
 4 | 2
 5 | 4 8
 6 | 2 7
 7 | 7 7
 8 | 5
 9 | 3
10 | 0
Mean: 71.5
Median: 72.0
Mode: 77.0
➢ Measures of Dispersion
✓ A quantity that describes the spread or variability of the observations in a

given data set.
✓ The higher the value, the greater the variability in the data set.
✓ Some common measures are:

•Range
•Inter-quartile Range
•Variance
•Standard Deviation
•Coefficient of Variation
TYPES OF MEASURES OF DISPERSION
Absolute Measures of Dispersion:

Range
Inter-Quartile Range
Variance
Standard Deviation
Relative Measure of Dispersion:

Coefficient of Variation
RANGE
The difference between the maximum and minimum values in a data set, i.e.
R = MAX – MIN
Example: In the set of scores of the Group A students
 6 | 5 6 7 8
 7 | 1 3 4 7 7 7
R = MAX – MIN
= 77 – 65
= 12
PROPERTIES OF RANGE
✓ It is quick and easy to understand.
✓ It is a rough measure of dispersion.
✓ It is usually reported together with the median.

INTER-QUARTILE RANGE (IQR)
The difference between the third quartile and first quartile, i.e.
IQR = Q3 – Q1
Example: In the set of scores of the Group A students
 6 | 5 6 7 8
 7 | 1 3 4 7 7 7
PROPERTIES OF IQR
✓ Not affected by the presence of extreme values
✓ Not as easy to calculate as the range
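As a hedged illustration, the range and IQR of the Group A scores can be computed in Python. The Group A values are read off the stem-and-leaf display, and the quartiles follow the percentile location rule given earlier; the function names are hypothetical.

```python
def percentile(values, j):
    """jth percentile (integer 0 < j < 100) using the location rule L = (j / 100) * N."""
    ordered = sorted(values)
    whole, remainder = divmod(j * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]


def data_range(values):
    """R = MAX - MIN."""
    return max(values) - min(values)


def interquartile_range(values):
    """IQR = Q3 - Q1, with Q1 = P25 and Q3 = P75."""
    return percentile(values, 75) - percentile(values, 25)


# Group A scores read from the stem-and-leaf display
group_a = [65, 66, 67, 68, 71, 73, 74, 77, 77, 77]

print(data_range(group_a))            # 77 - 65 = 12
print(percentile(group_a, 25))        # L = 2.5 -> position 3 -> 67
print(percentile(group_a, 75))        # L = 7.5 -> position 8 -> 77
print(interquartile_range(group_a))   # 77 - 67 = 10
```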

ILLUSTRATION
DATA SET 1 DATA SET 2
{-10, 0, 10, 20, 30} {8, 9, 10, 11, 12}

VARIANCE (σ²)
The average squared difference of the observations from the mean
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where x_i = ith observation in the data set,
μ = mean of the data set,
N = total number of observations in the data set
PROPERTIES OF VARIANCE
✓ One of the most useful measures of dispersion
✓ All observations contribute in the computation
✓ Always non-negative
✓ Comes in the square of the unit of measure of the given set of values
STANDARD DEVIATION (σ)
• A measure of the typical deviation of the observations from the mean
• The positive square root of the variance, $\sigma = \sqrt{\sigma^2}$
• Same unit of measure as that of the observations
• Usually reported with the mean
COEFFICIENT OF VARIATION
A relative measure that indicates the magnitude of variation relative to the magnitude of the mean, expressed in percent, denoted as CV,
$$CV = \frac{\sigma}{\mu} \times 100\%$$
PROPERTIES OF COEFFICIENT OF VARIATION
✓ Unitless
✓ Used to compare dispersion of two or more data sets with the same or different units
✓ The higher the CV, the more variable is the data set
relative to its mean
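The variance, standard deviation, and coefficient of variation formulas can be applied to Data Set 1 and Data Set 2 from the earlier illustration. This is a minimal Python sketch with hypothetical function names. It uses the population formulas (division by N), as in the definitions above.

```python
import math


def population_variance(values):
    """Average squared difference of the observations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_sd(values):
    """Positive square root of the variance."""
    return math.sqrt(population_variance(values))


def coefficient_of_variation(values):
    """CV = (sigma / mu) * 100, in percent; undefined when the mean is zero."""
    mu = sum(values) / len(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return population_sd(values) / mu * 100


data_set_1 = [-10, 0, 10, 20, 30]
data_set_2 = [8, 9, 10, 11, 12]

for name, data in (("Data Set 1", data_set_1), ("Data Set 2", data_set_2)):
    print(name,
          population_variance(data),
          round(population_sd(data), 3),
          round(coefficient_of_variation(data), 1))
```

Both data sets have a mean of 10, yet the variances are 200 and 2, which is the point of the illustration: equal centers can hide very different spreads.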
CHEBYSHEV’S RULE
✓ It permits us to make statements about the percentage of observations that must be within a specified number of standard deviations from the mean.
✓ The proportion of any distribution that lies within k standard deviations of the mean is at least 1 − 1/k², where k is any positive number larger than 1.
✓ For any data set with mean (μ) and standard deviation (σ), the following statements apply:
✓ At least 75% of the observations are within 2σ of its mean.
✓ At least 88.9% of the observations are within 3σ of its mean.
ILLUSTRATION
At least 75% of the observations are within 2σ of its mean.

EXAMPLE
1. Find the fraction of all numbers of a data set with a mean of 60 and standard deviation of 4 that must lie between 52 and 68.
2. Suppose that the average score on a math SA is 84 with a standard deviation of 4 points. According to Chebyshev's Rule, at least what percent of the tests have a grade of at least 72 and at most 96?
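Both examples can be worked with a short Python sketch of Chebyshev's Rule. The function names are hypothetical, and the helper assumes the interval is symmetric about the mean, which holds for both examples.

```python
import math


def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule requires k > 1")
    return 1 - 1 / k ** 2


def bound_for_interval(mean, sd, low, high):
    """Chebyshev bound for an interval assumed to be symmetric about the mean."""
    if not math.isclose(mean - low, high - mean):
        raise ValueError("this sketch only handles intervals symmetric about the mean")
    k = (high - mean) / sd        # number of standard deviations to each endpoint
    return chebyshev_lower_bound(k)


# Example 1: mean 60, sd 4, interval 52 to 68 -> k = 2
print(bound_for_interval(60, 4, 52, 68))             # 0.75, i.e. at least 3/4

# Example 2: mean 84, sd 4, interval 72 to 96 -> k = 3
print(round(bound_for_interval(84, 4, 72, 96), 3))   # 0.889, i.e. at least 88.9%
```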
SYMMETRY
A distribution is said to be symmetric about the mean if the distribution to the left of the mean is the "mirror image" of the distribution to the right of the mean.
μ = Md = Mo
MEASURE OF SKEWNESS
✓ Describes the degree of departure of the distribution from symmetry
✓ The degree of skewness is measured by the coefficient of skewness, denoted by SK
Pearsonian second-order coefficient of skewness:
$$SK = \frac{3(\text{Mean} - \text{Median})}{\sigma}$$
✓ A symmetric distribution has SK = 0 since its mean is equal to its median and its mode.
MEASURE OF SKEWNESS
Positively skewed: Mo < Md < μ, SK > 0. There are some extremely high values in the data set.
Negatively skewed: μ < Md < Mo, SK < 0. There are some extremely low values in the data set.
MEASURE OF KURTOSIS
✓ Describes the extent of peakedness or flatness of the distribution
✓ Measured by the coefficient of kurtosis, denoted as K
$$K = \frac{\sum_{i=1}^{N} (x_i - \mu)^4}{N\sigma^4} - 3$$
MEASURE OF KURTOSIS
K = 0: mesokurtic
K > 0: leptokurtic
K < 0: platykurtic
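The two coefficients can be sketched in Python with the standard library `statistics` module, using the population standard deviation as in the formulas above. The function names and the small right-skewed data set are illustrative assumptions, not examples from the lecture.

```python
import statistics


def pearson_skewness(values):
    """SK = 3 * (mean - median) / sigma (Pearsonian second-order coefficient)."""
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (statistics.mean(values) - statistics.median(values)) / sigma


def kurtosis_coefficient(values):
    """K = sum((x - mu)^4) / (N * sigma^4) - 3."""
    n = len(values)
    mu = statistics.mean(values)
    variance = statistics.pvariance(values, mu)
    if variance == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth_moment = sum((x - mu) ** 4 for x in values) / n
    return fourth_moment / variance ** 2 - 3


symmetric = [8, 9, 10, 11, 12]       # mean = median = 10
right_skewed = [1, 2, 3, 4, 100]     # hypothetical data with one extremely high value

print(pearson_skewness(symmetric))               # 0.0, symmetric
print(pearson_skewness(right_skewed) > 0)        # True, positively skewed
print(round(kurtosis_coefficient(symmetric), 2)) # -1.3, platykurtic (K < 0)
```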
BOX AND WHISKERS PLOT
✓ Indicates the symmetry of the distribution and incorporates measures of location to describe the variability of the observations
✓ Also called a boxplot; it displays the 5-number summary (Min, Q1, Q2, Q3, and Max)
✓ Used for identifying outliers

BOX AND WHISKERS PLOT
✓ The diagram is made up of a box which lies between the first and third quartiles.
✓ The whiskers are the straight lines extending from the ends of the box to the smallest and largest data values that are not outliers.
STEPS IN CONSTRUCTING A BOX-AND-WHISKERS PLOT
Step 1: Draw a rectangular box with its left edge at Q1 and its right edge at Q3, so the box length is the IQR. Then draw a vertical line segment inside the box where the median (Md) is found.
Step 2: Place marks at distances 1.5 IQR from both ends of the box, that is, at Q1 - 1.5 IQR and at Q3 + 1.5 IQR.
Step 3: Draw the horizontal line segments or the "whiskers" from each end of the box to the largest and smallest data values that are not outliers.
Notes:
✓ An observation below Q1 - 1.5 IQR or above Q3 + 1.5 IQR is an outlier.
✓ If the largest and smallest data values are outliers, extend the whiskers until 1.5 IQR from either end of the box.
Step 4: Represent each outlier by a dot. For outliers having the same values, place the dots one on top of the other.
(Figures: the plot is drawn either horizontally or vertically, with marks at Q1 - 1.5 IQR, Q1, Md, Q3, and Q3 + 1.5 IQR.)
COMPARISON OF DISTRIBUTIONS USING
BOX-AND-WHISKERS PLOTS
(Figure: box-and-whiskers plots of Data Set A and Data Set B.)
Number  Data
1       113
2       116
3       119
4       121
5       124
6       124
7       125
8       126
9       126
10      126
11      127
12      127
13      128
14      129
15      130
16      130
17      131
18      132
19      133
20      136
Minimum = 113
Maximum = 136
1st Quartile = 124
2nd Quartile = 126.5
3rd Quartile = 130
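The summary values above, and the quantities needed to draw the plot, can be reproduced with a Python sketch. It assumes the percentile location rule from earlier in this chapter; the function names and dictionary keys are hypothetical. Whiskers are taken to the most extreme values that are not outliers, as in Step 3.

```python
def percentile(values, j):
    """jth percentile (integer 0 < j < 100) using the location rule L = (j / 100) * N."""
    ordered = sorted(values)
    whole, remainder = divmod(j * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]


def box_plot_summary(values):
    """Quartiles, 1.5 IQR marks, whisker ends, and outliers for a box-and-whiskers plot."""
    ordered = sorted(values)
    q1, md, q3 = (percentile(ordered, j) for j in (25, 50, 75))
    iqr = q3 - q1
    lower_mark = q1 - 1.5 * iqr                # Step 2: marks 1.5 IQR from the box
    upper_mark = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_mark <= x <= upper_mark]
    outliers = [x for x in ordered if x < lower_mark or x > upper_mark]
    return {
        "min": ordered[0], "q1": q1, "median": md, "q3": q3, "max": ordered[-1],
        "iqr": iqr, "lower_mark": lower_mark, "upper_mark": upper_mark,
        "whisker_low": inside[0], "whisker_high": inside[-1],   # Step 3
        "outliers": outliers,                                    # Step 4: drawn as dots
    }


data = [113, 116, 119, 121, 124, 124, 125, 126, 126, 126,
        127, 127, 128, 129, 130, 130, 131, 132, 133, 136]

summary = box_plot_summary(data)
for key, value in summary.items():
    print(key, value)
```

With these rules the IQR is 6, the marks fall at 115 and 139, and the minimum value 113 is flagged as an outlier, so the lower whisker ends at 116.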
ILLUSTRATION
Consider the activity of rolling a die.
1. What are the possible outcomes?
2. Can we repeat the activity?
3. If I roll the die now, can you predict the exact number of dots that can be observed?
RANDOM EXPERIMENT
A process of drawing observations capable of repetition under the same conditions with well-defined possible outcomes.
Note: Although the possible outcomes are well-defined, the outcome of a particular trial is unpredictable.
SAMPLE SPACE
✓ It is a set or collection of all possible outcomes of a random experiment.
✓ It may either be finite or infinite.
✓ Elements of the sample space are referred to as sample points.
EVENT
✓ It is a subset of the sample space.
✓ It may either be simple or compound.
✓ Observing an element of an event indicates the occurrence of the event.
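The die-rolling illustration can be written as a small Python sketch, with the sample space and events as sets. The event names are hypothetical examples. A simple event is taken here to contain exactly one sample point and a compound event more than one.

```python
import random

# Sample space of the random experiment "roll a die once"
sample_space = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space
roll_is_six = {6}            # simple event: exactly one sample point
roll_is_even = {2, 4, 6}     # compound event: more than one sample point


def roll(rng):
    """One trial of the experiment: the possible outcomes are known, the result is not."""
    return rng.choice(sorted(sample_space))


rng = random.Random()        # unseeded, so each run may give a different outcome
outcome = roll(rng)

# The event occurs when the observed outcome is one of its elements
print("outcome:", outcome)
print("even number observed:", outcome in roll_is_even)
print("six observed:", outcome in roll_is_six)
```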
