
STATISTICS AND PROBABILITY

Bill Thaddeus Padasas


CHAPTER 1
INTRODUCTION TO STATISTICS

The word statistics can be viewed in two contexts

Singular sense: Statistics as a science

Plural sense: Statistics as actual numbers derived from the data


Introduction to Statistics

The word statistics can be viewed in two contexts

Singular sense: A body of knowledge concerned with the collection, organization, presentation, analysis, and interpretation of data.

Plural sense: A collection of facts and figures or processed data.


Roles of Statistics in Decision Making

Statistics provides tools in

> Designing experiments or surveys


> Analyzing data

Statistics can help us answer questions like

1. What options to choose?
2. How do we make a choice?
3. Why choose such options?

Statistics is used in every field of science.

Consider the following problems:

Agricultural problem:
Is new grain seed or fertilizer more productive?

Economics:
What will be the unemployment rate next year?

Medical problem:
What is the right dosage of a drug for treatment?

Technical problem:
How to improve the quality of a product?

Political science:
How accurate are the election polls?
Two Broad Categories of Statistics

Descriptive Statistics

Inferential Statistics
Two Broad Categories of Statistics

Descriptive Statistics
- used to describe a mass of data in a clear, concise, and informative way.
- deals with the methods of organizing, summarizing, and presenting data.

Describes: Population, Sample, Results

Compares: Sample to Population, Sub-groups, Correlations

Summarizes: Central Tendency, Spread
Two Broad Categories of Statistics

Inferential Statistics
- concerned with making generalizations about the characteristics of a
larger set where only a part is examined.

Smaller set (n units or observations) → Inferences and Generalizations → Larger set (N units or observations)
CHAPTER 2
THE POPULATION
A. BASIC CONCEPTS

DATA
- Facts and figures that are collected, presented, and analyzed
- Can be numeric or non-numeric
- Must be contextualized

Who, What (and in what units), When, Where, Why, How


A. BASIC CONCEPTS

UNIVERSE
- A collection or set of all individuals or entities whose characteristics are to
be studied.
- Answers the question “Who?”

Types of Universe

1. Finite - when the elements of the universe can be counted for a given time
period.

2. Infinite – when the number of elements of the universe is unlimited.


A. BASIC CONCEPTS

VARIABLE
- Attribute or characteristic of interest measurable on each and every unit of the universe
- Answers the question “What?”

Variable
- Qualitative
- Quantitative
  - Discrete
  - Continuous
A. BASIC CONCEPTS

Types of Variables

1. Qualitative
- assumes values that are not numerical but can be categorized.
- categories may be identified by either non-numerical descriptions or by
numeric codes.

2. Quantitative
- indicates the quantity or amount of a characteristic
- data are always numeric
- can be discrete or continuous
A. BASIC CONCEPTS

Types of Quantitative Variables

1. Discrete
- variable with a finite or countable number of possible values

2. Continuous
- variable that assumes any value in a given interval
A. BASIC CONCEPTS

POPULATION – set of all possible values of the variable.

UNIVERSE    POPULATION
U1          Y1
U2          Y2
U3          Y3
...         ...
UN          YN
A. BASIC CONCEPTS

SAMPLE – subset of the population or universe

SAMPLE

UNIVERSE / POPULATION
A. BASIC CONCEPTS

LEVELS OF MEASUREMENT

Data may be classified into four hierarchical levels of measurement:

Nominal
Ordinal
Interval
Ratio
A. BASIC CONCEPTS

NOMINAL

Data collected are labels, names, or categories.

Frequencies or counts of observations belonging to the same category can be obtained.
It is the lowest level of measurement.

Examples: name, gender
A. BASIC CONCEPTS

ORDINAL

Data collected are labels with implied ordering.

The difference between two data labels is meaningless.

Examples: grade level, socio-economic status


A. BASIC CONCEPTS

INTERVAL

Data can be ordered or ranked.

The difference between two data values is meaningful.
Data at this level lack an absolute zero.

Examples: IQ score, temperature (in °C or °F)


A. BASIC CONCEPTS

RATIO

Data have all the properties of the interval scale.

The number zero indicates the absence of the characteristic being measured.
It is the highest level of measurement.

Examples: age (in years), weekly food allowance (in pesos), height (in cm)
Statistical analysis depends on the variable’s level of measurement.

Knowing the level of measurement serves as a guide in determining the appropriate algebraic operations and, consequently, the statistical tools that can be used for analysis.

Inappropriate analysis leads to erroneous conclusions, which may lead to dangerous consequences brought about by wrong decisions.
Methods of Data Collection

Objective Method
Subjective Method
Use of Existing Records
B. METHODS OF DATA COLLECTION

OBJECTIVE METHOD

The data are collected through measurement, counting, or by observation.


This method requires the use of a measuring or counting instrument.
B. METHODS OF DATA COLLECTION

SUBJECTIVE METHOD

The information is provided by identified respondents.


The instrument used to gather data may take the form of a questionnaire.
The researcher collects the data by:
> conducting personal interviews, either face-to-face or by telephone; or
> gathering responses using mailed questionnaires.
B. METHODS OF DATA COLLECTION

USE OF EXISTING RECORDS

This method uses data which have been previously collected by another person or
institution for some other purpose.
B. METHODS OF DATA COLLECTION

TYPES OF DATA

1. Primary
- data which were acquired directly from the source.

2. Secondary
- data which were not acquired directly from the source.
Methods of Data Presentation

1. Textual

2. Tabular

3. Graphical
C. METHODS OF DATA PRESENTATION

1.TEXTUAL

✓ A narrative form of describing characteristics of the universe or population based on the data collected and organized by giving highlights

✓ Applicable only when presenting new information


C. METHODS OF DATA PRESENTATION

2.TABULAR

✓ Data are organized into classes or categories by rows and/or columns and appropriate
pieces of information are found in the cells of the table.

✓ Relatively more information can be presented and trends can be easily seen.

✓ Some details are lost when data are summarized in tabular form.
PARTS OF A STATISTICAL TABLE

1. Table heading – includes the table number and title.

2. Caption – designates the information contained in the columns.

3. Body – main part of the table containing the information or figures presented.

4. Stubs/Classes – categories which describe the data, usually found at the left-hand side of the table.
[Figure: labeled parts of a statistical table (table heading, body, stubs/classes)]
C. METHODS OF DATA PRESENTATION

3. GRAPHICAL PRESENTATION

✓ It provides visual presentation of the distributional properties of the data.


✓ This is the most efficient way of presenting trends.
✓ Some details are lost in using this type of presentation.
PIE CHART

Figure 1. Percentage distribution of browser usage in Europe.


BAR GRAPH

Figure 2. World birth rate in ASEAN countries from 1990 to 2008


LINE GRAPH
SCATTER PLOT

Height vs Weight of Grade 11 Students (x-axis: Height in cm; y-axis: Weight in kg)
STEM AND LEAF PLOT

✓ Presents data in ordered form and provides an idea of the shape of the distribution of a set of quantitative data

✓ Best for small number of observations with values greater than zero

✓ Also called stemplot


STEM AND LEAF PLOT

The performance scores received by the vehicles in a car show are as follows:

90 80 75 80 80 80 90
90 100 100 75 70 80 80
STEPS IN CONSTRUCTING
A STEM-AND-LEAF PLOT

1. Arrange data in ascending or descending order.

70 75 75 80 80 80 80
80 80 90 90 90 100 100
STEPS IN CONSTRUCTING
A STEM-AND-LEAF PLOT

2. Split each datum into a leaf value, which is the last digit, and a stem value, which consists of the remaining digits.

Examples:

Datum   Stem   Leaf
75      7      5
100     10     0
STEPS IN CONSTRUCTING
A STEM-AND-LEAF PLOT

3. List the stems vertically in increasing or decreasing order.

4. Draw a vertical line to the right of the stems.

 7 |
 8 |
 9 |
10 |
STEPS IN CONSTRUCTING
A STEM-AND-LEAF PLOT

5. For each stem, write its leaves to the right of the vertical line in ascending order.

 7 | 0 5 5
 8 | 0 0 0 0 0 0
 9 | 0 0 0
10 | 0 0

Figure 1. Distribution of the performance scores received by the vehicles in a car show.
STEM AND LEAF PLOT

IMPORTANT FEATURES:

✓ Reveals the center of the distribution

✓ Illustrates overall shape of the distribution like symmetry and spread

✓ Shows marked deviations from the overall shape (gaps, extreme values)
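
The five construction steps can be sketched in Python. This is a minimal illustration, assuming non-negative integer data whose last digit serves as the leaf; the function name `stem_and_leaf` is hypothetical and not part of the original slides. Stems with no leaves are still listed so that gaps remain visible.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot for non-negative integers."""
    rows = defaultdict(list)
    for v in sorted(values):            # Step 1: arrange in ascending order
        stem, leaf = divmod(v, 10)      # Step 2: last digit is the leaf
        rows[stem].append(leaf)
    stems = range(min(rows), max(rows) + 1)   # Step 3: stems in increasing order
    width = len(str(max(rows)))
    # Steps 4 and 5: vertical bar, then the leaves in ascending order
    return [
        f"{stem:>{width}} | {' '.join(str(leaf) for leaf in rows.get(stem, []))}".rstrip()
        for stem in stems
    ]

# Car show performance scores from the example above
scores = [90, 80, 75, 80, 80, 80, 90, 90, 100, 100, 75, 70, 80, 80]
plot = stem_and_leaf(scores)
for row in plot:
    print(row)
```

The printed rows should match Figure 1.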
ANSWER THIS

The raw scores of Grade 11 students on their Earth and Life Science First Term Assessment are as follows:

36 53 41 52 56
25 25 21 36 42
60 40 50 54 46
43 51 55 53
30 53 54 56

Make a Stem and Leaf Plot based on these given values.
D. DESCRIPTIVE MEASURES

DESCRIPTIVE MEASURES

Values that are used to summarize the characteristics of a universe or population


Some of these are measures of:
Location
Dispersion
Skewness
Kurtosis
D. DESCRIPTIVE MEASURES

➢ Measures of Location

Summarizes a data set by giving a “typical value” within the range of the
data values that describes its location relative to the entire data set.

Some common measures:


✓ Minimum, Maximum
✓ Measures of Central Tendency
✓ Percentiles, Deciles, Quartiles
D. DESCRIPTIVE MEASURES

➢ Minimum and Maximum

Minimum is the smallest value in the data set, denoted as MIN.


Maximum is the largest value in the data set, denoted as MAX.
D. DESCRIPTIVE MEASURES

➢ Measures of Central Tendency

Value(s) about which the set of observations tends to cluster

Also called an average

Most commonly used averages:


✓ Mean
✓ Median
✓ Mode
➢ Mean (Arithmetic Mean)

The sum of all observations in the data set divided by the total number of observations:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

where $x_i$ = ith observation of the variable X and
N = total number of observations in the data set.
SOME PROPERTIES OF THE MEAN

❖ There is only one mean in a given data set.

❖ It is defined only for quantitative data.

❖ It reflects the magnitude of every observation.

❖ It is easily affected by the presence of extreme values.

❖ The sum of the deviations from the mean is equal to zero.

❖ The means of different sets/groups of comparable data may be combined when properly weighted.
Find the mean of the sample data

2 -1 0 2

Answer: 0.75
A random sample of ten students is taken from the student body of
a college and their GPAs are recorded as follows:

1.90 3.00 2.53 3.71 2.12 1.76 2.71 1.39 4.00 3.33

Answer: 2.645
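
Both answers can be checked with a few lines of Python. This is only a sketch of the definition (sum divided by count); the helper name `mean` is illustrative. It also checks the property that the deviations from the mean sum to zero, up to floating-point rounding.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    values = list(values)
    return sum(values) / len(values)

sample = [2, -1, 0, 2]
gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

print(mean(sample))            # 0.75
print(round(mean(gpas), 3))    # 2.645

# The deviations from the mean sum to zero (up to rounding error)
deviation_sum = sum(x - mean(gpas) for x in gpas)
print(abs(deviation_sum) < 1e-9)
```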
➢ Median

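The middle value when the data values are arranged in ascending or descending order of magnitude.

$$Md = \begin{cases} x_{\frac{N+1}{2}}, & \text{if } N \text{ is odd,} \\[4pt] \dfrac{x_{\frac{N}{2}} + x_{\frac{N}{2}+1}}{2}, & \text{if } N \text{ is even.} \end{cases}$$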
SOME PROPERTIES OF THE MEDIAN

❖ There is only one median for a data set.

❖ It is not amenable to further computations.

❖ The sum of the absolute deviations of the observations from a value, say c, is smallest when c is equal to the median. That is, $\sum_{i=1}^{N} |x_i - c|$ is minimum when c = Md.
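
The odd and even cases of the median can be sketched as follows. The function name `median` and the three-value list are illustrative assumptions; the four-value list reuses the sample data from the mean example.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (N + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions N/2 and N/2 + 1

print(median([7, 1, 5]))       # odd N (hypothetical data): 5
print(median([2, -1, 0, 2]))   # even N: (0 + 2) / 2 = 1.0
```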


➢ Mode

The value(s) in the data set which occur most frequently, denoted by Mo.

Example:
Consider the following scores of 10 students in SA #4

10 10 10 8 8 10 5 8 10 10

Mo = 10
SOME PROPERTIES OF THE MODE

❖ It may or may not exist.

❖ If it exists, there can be more than one mode for a given data set.

❖ It is determined by the frequency and not by the values of the observations.

❖ It is applicable for both quantitative and qualitative data.
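
These properties can be illustrated with a short Python sketch. The function name `modes` is hypothetical, and the convention that a data set has no mode when every distinct value occurs equally often is an assumption made here, not something stated in the slides. The letter and number lists other than the SA #4 scores are made-up illustrations.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value stands out."""
    counts = Counter(values)
    top = max(counts.values())
    # Assumed convention: if all distinct values are equally frequent, there is no mode
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([10, 10, 10, 8, 8, 10, 5, 8, 10, 10]))  # SA #4 scores: [10]
print(modes(["A", "B", "B", "A", "C"]))              # qualitative, two modes
print(modes([1, 2, 3]))                              # no mode under the assumed convention
```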


PRACTICE!

1. Find the mean and the median for the LDL cholesterol level
in a sample of ten heart patients.

132 162 133 145 148


139 147 160 150 153

2. Find the mean and the median for the LDL cholesterol level
in a sample of ten heart patients on a special diet.

127 152 138 110 152


113 131 148 135 158
PRACTICE!

3. Five laboratory mice with thymus leukemia are observed for a predetermined period of 500 days. After 500 days, four mice have died but the fifth one survives. The recorded survival times for the five mice are

493 421 222 378 500*

where 500* indicates that the fifth mouse survived for at least 500 days but the survival time (i.e., the exact value of the observation) is unknown.

a. Can you find the sample mean for the data set? If so, find it. If not, why not?

b. Can you find the sample median for the data set? If so, find it. If not, why not?
PERCENTILES

❖ Numerical measures that give the relative position of a data value relative to the entire data set.

❖ Divide an array (raw data arranged in increasing or decreasing order of magnitude) into 100 equal parts.

❖ The jth percentile, denoted as Pj, is the data value that separates the bottom j% of the data from the top (100-j)%.
FINDING THE jTH PERCENTILE

1. Arrange the data values in ascending order.

2. Find the location of Pj in the arranged list by computing

$$L = \left(\frac{j}{100}\right) \times N$$

where N = the total number of data values,
j = percentile of interest, and
L = location.
FINDING THE jTH PERCENTILE

3.
(a) If L is a whole number, then Pj is the mean of the data values in position L and position L + 1.

(b) If L is not a whole number, Pj is taken as the data value in the next higher whole number position.

Remark: Percentiles are generally computed for large data sets.
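
The three steps can be sketched in Python as below. The function name `percentile` is illustrative. Integer arithmetic is used to decide whether L is a whole number, so j is assumed to be an integer from 1 to 100. Returning the largest value for P100 (where position L + 1 does not exist) is an assumption, since the slides do not cover that case. The data are the car show scores used in the stem-and-leaf example.

```python
def percentile(values, j):
    """jth percentile using the location rule L = (j / 100) * N."""
    if not 1 <= j <= 100:
        raise ValueError("j must be an integer from 1 to 100")
    ordered = sorted(values)                 # Step 1: ascending order
    n = len(ordered)
    whole, remainder = divmod(j * n, 100)    # Step 2: L = whole + remainder / 100
    if remainder:                            # Step 3(b): next higher position
        return ordered[whole]                # position whole + 1, counting from 1
    if whole >= n:                           # assumed handling of P100
        return ordered[-1]
    return (ordered[whole - 1] + ordered[whole]) / 2   # Step 3(a)

scores = [90, 80, 75, 80, 80, 80, 90, 90, 100, 100, 75, 70, 80, 80]
print(percentile(scores, 25))   # L = 3.5, so position 4: 80
print(percentile(scores, 50))   # L = 7, mean of positions 7 and 8: 80.0
print(percentile(scores, 90))   # L = 12.6, so position 13: 100
```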


DECILES

❖ These values divide an array into ten equal parts, each part
having ten percent of the data values, denoted by Dj.

❖ The 1st decile is the 10th percentile; the 2nd decile is the
20th percentile and so on.
DECILES

Remember:
In computing for the deciles, follow the procedure in the
computation of the equivalent percentiles.

D1 = P10    D6 = P60
D2 = P20    D7 = P70
D3 = P30    D8 = P80
D4 = P40    D9 = P90
D5 = P50    D10 = P100
QUARTILES

❖ These values divide an array into four equal parts, each part
having 25% of the data values, denoted by Qj.

❖ The 1st quartile is the 25th percentile; the 2nd quartile is the 50th percentile (median), and the 3rd quartile is the 75th percentile.
QUARTILES

Remember:
Similarly, in the computation of the quartiles we use the
procedure in the computation of the equivalent percentiles.

Q1 = P25
Q2 = P50 = Median = D5
Q3 = P75
Q4 = P100
ILLUSTRATION

Two different groups of 10 students were given identical exams. The scores of the students are shown as

Group A
 6 | 5 6 7 8
 7 | 1 3 4 7 7 7

Mean: 71.5
Median: 72.0
Mode: 77.0

Group B
 4 | 2
 5 | 4 8
 6 | 2 7
 7 | 7 7
 8 | 5
 9 | 3
10 | 0

Mean: 71.5
Median: 72.0
Mode: 77.0
D. DESCRIPTIVE MEASURES

➢ Measures of Dispersion

✓ A quantity that describes the spread or variability of the observations in a given data set.

✓ The higher the value, the greater the variability in the data set.

✓ Some common measures are:


•Range
•Inter-quartile Range
•Variance
•Standard Deviation
•Coefficient of Variation
TYPES OF MEASURES OF DISPERSION

Absolute Measures of Dispersion:


Range
Inter-Quartile Range
Variance
Standard Deviation

Relative Measure of Dispersion:


Coefficient of Variation
RANGE

The difference between the maximum and minimum values in a data set, i.e.

R = MAX – MIN

Example: In the set of scores of the Group A students

6 | 5 6 7 8
7 | 1 3 4 7 7 7

R = MAX – MIN
  = 77 – 65
  = 12
PROPERTIES OF RANGE

✓ It is quick and easy to understand.

✓ It is a rough measure of dispersion.

✓ It is usually reported together with the median.


INTER-QUARTILE RANGE (IQR)

The difference between the third quartile and first quartile, i.e.

IQR = Q3 – Q1

Example: In the set of scores of the Group A students

6 | 5 6 7 8
7 | 1 3 4 7 7 7
PROPERTIES OF IQR

✓ Not affected by the presence of extreme values

✓ Not as easy to calculate as the range
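
As a worked check of the Group A example, the sketch below computes the range and the IQR, taking Q1 and Q3 as P25 and P75 with the location rule L = (j/100) × N from the percentile section. The helper name `percentile` is illustrative and assumes an integer j below 100.

```python
def percentile(values, j):
    """jth percentile via L = (j / 100) * N, for integer j with 1 <= j < 100."""
    ordered = sorted(values)
    whole, remainder = divmod(j * len(ordered), 100)
    if remainder:                    # L not whole: take the next higher position
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2   # L whole: average two positions

group_a = [65, 66, 67, 68, 71, 73, 74, 77, 77, 77]

data_range = max(group_a) - min(group_a)
q1 = percentile(group_a, 25)   # L = 2.5, so position 3
q3 = percentile(group_a, 75)   # L = 7.5, so position 8
iqr = q3 - q1

print(data_range)   # 12
print(q1, q3, iqr)  # 67 77 10
```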


ILLUSTRATION

DATA SET 1 DATA SET 2

{-10, 0, 10, 20, 30} {8, 9, 10, 11, 12}


VARIANCE (σ²)

The average squared difference of the observations from the mean

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

where $x_i$ = ith observation in the data set,
μ = mean of the data set,
N = total number of observations in the data set
PROPERTIES OF VARIANCE

✓ One of the most useful measures of dispersion

✓ All observations contribute in the computation

✓ Always non-negative

✓ Comes in the square of the unit of measure of the given set of values
STANDARD DEVIATION (σ)

• Roughly, the typical deviation of the observations from the mean

• The positive square root of the variance,

$$\sigma = \sqrt{\sigma^2}$$

• Same unit of measure as that of the observations


• Usually reported with the mean
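
The two illustration data sets share the same mean but differ greatly in spread, which the following sketch makes concrete. It applies the population formulas (division by N) directly; the function names are illustrative.

```python
import math

def population_variance(values):
    """Average squared deviation from the mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_sd(values):
    """Positive square root of the population variance."""
    return math.sqrt(population_variance(values))

data_set_1 = [-10, 0, 10, 20, 30]
data_set_2 = [8, 9, 10, 11, 12]

print(population_variance(data_set_1), round(population_sd(data_set_1), 3))  # 200.0 14.142
print(population_variance(data_set_2), round(population_sd(data_set_2), 3))  # 2.0 1.414
```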
COEFFICIENT OF VARIATION

A relative measure that indicates the magnitude of variation relative to the magnitude of the mean, expressed in percent, denoted as CV,

$$CV = \frac{\sigma}{\mu} \times 100\%$$

PROPERTIES OF COEFFICIENT OF VARIATION

✓ Unitless

✓ Used to compare dispersion of two or more data sets with the same or different units

✓ The higher the CV, the more variable is the data set
relative to its mean
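
A short sketch of the CV for the two illustration data sets, using the standard library's `statistics.pstdev` (population standard deviation) and `statistics.mean`. The function name `coefficient_of_variation` is illustrative, and a nonzero mean is assumed.

```python
import statistics

def coefficient_of_variation(values):
    """CV = (sigma / mu) * 100, in percent; assumes a nonzero mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

data_set_1 = [-10, 0, 10, 20, 30]
data_set_2 = [8, 9, 10, 11, 12]

cv1 = coefficient_of_variation(data_set_1)
cv2 = coefficient_of_variation(data_set_2)
print(round(cv1, 2))  # 141.42
print(round(cv2, 2))  # 14.14
```

Both sets have mean 10, so the tenfold difference in CV reflects only the difference in standard deviation.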
CHEBYSHEV’S RULE

✓ It permits us to make statements about the percentage of observations that must be within a specified number of standard deviations from the mean.

✓ The proportion of any distribution that lies within k standard deviations of the mean is at least 1 − 1/k², where k is any positive number larger than 1.
CHEBYSHEV’S RULE

✓ For any data set with mean (μ) and standard deviation (σ),
the following statements apply:

✓ At least 75% of the observations are within 2σ of its mean.

✓ At least 88.9% of the observations are within 3σ of its mean.
ILLUSTRATION

At least 75% of the observations are within 2σ of its mean.


EXAMPLE

1. Find the fraction of all numbers of a data set with a mean of 60 and standard deviation of 4 that must lie between 52 and 68.

2. Suppose that the average score on a math SA is an 84 with a standard deviation of 4 points. According to Chebyshev’s Rule, at least what percent of the tests have a grade of at least 72 and at most 96?
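
Both examples reduce to finding k and then evaluating 1 − 1/k². The sketch below does this in Python; the function names are illustrative, and the helper assumes the interval is symmetric about the mean.

```python
import math

def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 1 - 1 / k ** 2

def k_for_interval(mean, sd, low, high):
    """Number of standard deviations spanned by an interval symmetric about the mean."""
    if not math.isclose(mean - low, high - mean):
        raise ValueError("interval must be symmetric about the mean")
    return (high - mean) / sd

k1 = k_for_interval(60, 4, 52, 68)   # Example 1: k = 2
k2 = k_for_interval(84, 4, 72, 96)   # Example 2: k = 3

print(chebyshev_lower_bound(k1))              # 0.75, i.e. at least 3/4
print(round(chebyshev_lower_bound(k2), 3))    # 0.889, i.e. at least 88.9%
```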
SYMMETRY

A distribution is said to be symmetric about the mean if the distribution to the left of the mean is the “mirror image” of the distribution to the right of the mean.

μ = Md = Mo
MEASURE OF SKEWNESS

✓ Describes the degree of departure of the distribution from symmetry

✓ The degree of skewness is measured by the coefficient of skewness, denoted by SK

Pearsonian second-order coefficient of skewness:

$$SK = \frac{3\,(\text{Mean} - \text{Median})}{\sigma}$$

✓ A symmetric distribution has SK=0 since its mean is equal to its median and its mode.
MEASURE OF SKEWNESS

Mo < Md < μ: SK > 0, positively skewed. There are some extremely high values in the data set.

μ < Md < Mo: SK < 0, negatively skewed. There are some extremely low values in the data set.
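
As an illustration, the Pearsonian coefficient can be computed for the Group A exam scores (mean 71.5, median 72.0). This sketch uses the standard library's `statistics` module with the population standard deviation; the function name `pearson_skewness` is illustrative.

```python
import statistics

def pearson_skewness(values):
    """Pearsonian second-order coefficient: 3 * (mean - median) / sigma."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sigma = statistics.pstdev(values)    # population standard deviation
    return 3 * (mean - median) / sigma

group_a = [65, 66, 67, 68, 71, 73, 74, 77, 77, 77]
sk = pearson_skewness(group_a)
print(round(sk, 3))   # about -0.332: slightly negatively skewed
```

The negative sign agrees with the ordering μ < Md < Mo (71.5 < 72.0 < 77.0) for this group.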
MEASURE OF KURTOSIS

✓ Describes the extent of peakedness or flatness of the distribution

✓ Measured by the coefficient of kurtosis, denoted as K

$$K = \frac{\sum_{i=1}^{N} (x_i - \mu)^4}{N\sigma^4} - 3$$
MEASURE OF KURTOSIS

K=0
mesokurtic

K>0
leptokurtic

K<0
platykurtic
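
A minimal sketch of the coefficient of kurtosis, applied to the two data sets from the dispersion illustration. The function names are illustrative; the classification thresholds follow the K = 0, K > 0, K < 0 rule above, with an exact comparison against zero that real data will rarely hit.

```python
def kurtosis(values):
    """K = sum((x - mu)^4) / (N * sigma^4) - 3, using population moments."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    fourth = sum((x - mu) ** 4 for x in values)
    return fourth / (n * variance ** 2) - 3     # sigma^4 equals variance squared

def classify(k):
    if k > 0:
        return "leptokurtic"
    if k < 0:
        return "platykurtic"
    return "mesokurtic"

k1 = kurtosis([-10, 0, 10, 20, 30])
k2 = kurtosis([8, 9, 10, 11, 12])
print(round(k1, 2), classify(k1))   # -1.3 platykurtic
print(round(k2, 2), classify(k2))   # -1.3 platykurtic
```

The two sets give the same K even though their spreads differ, since K depends on the shape of the distribution and not on its scale.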
BOX AND WHISKERS PLOT

✓ Indicates the symmetry of the distribution and incorporates measures of location to describe the variability of the observations

✓ Also called a boxplot; it displays the 5-number summary (Min, Q1, Q2, Q3, and Max)

✓ Used for identifying outliers


BOX AND WHISKERS PLOT

✓ The diagram is made up of a box which lies between the first and third quartiles.

✓ The whiskers are the straight lines extending from the ends of the box to the smallest and largest data values that are not outliers.
STEPS IN CONSTRUCTING A
BOX-AND-WHISKERS PLOT

Step 1:
Draw a rectangular box with its left edge at Q1 and its right edge at Q3, so the box length is the IQR.

Then draw a vertical line segment inside the box where the median (Md) is found.
STEPS IN CONSTRUCTING A
BOX-AND-WHISKERS PLOT
Step 2: Place marks at distances 1.5 IQR from both ends of the box, that is, at Q1 − 1.5 IQR and Q3 + 1.5 IQR.
STEPS IN CONSTRUCTING A
BOX-AND-WHISKERS PLOT

Step 3: Draw the horizontal line segments or the “whiskers” from each end of the box to the largest and smallest data values that are not outliers.

Notes:
✓ An observation below Q1 − 1.5 IQR or above Q3 + 1.5 IQR is an outlier.
✓ If the largest and smallest data values are outliers, extend the whiskers until 1.5 IQR from either end of the box.
STEPS IN CONSTRUCTING A
BOX-AND-WHISKERS PLOT

Step 4: Represent each outlier by a dot. For outliers having the same values, place the dots one on top of the other.
COMPARISON OF DISTRIBUTIONS USING
BOX-AND-WHISKERS PLOTS

Data Set A

Data Set B
COMPARISON OF DISTRIBUTIONS USING
BOX-AND-WHISKERS PLOTS
Number  Data
1       113
2       116
3       119
4       121
5       124
6       124
7       125
8       126
9       126
10      126
11      127
12      127
13      128
14      129
15      130
16      130
17      131
18      132
19      133
20      136

Minimum = 113
1st Quartile = 124
2nd Quartile = 126.5
3rd Quartile = 130
Maximum = 136
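
Using the quartiles reported for this data set, the fences, outliers, and whisker ends from Steps 2 to 4 can be worked out as in the sketch below. The variable names are illustrative, and the quartiles are taken as given rather than recomputed.

```python
data = [113, 116, 119, 121, 124, 124, 125, 126, 126, 126,
        127, 127, 128, 129, 130, 130, 131, 132, 133, 136]
q1, q3 = 124, 130                 # quartiles as reported for this data set

iqr = q3 - q1                     # box length
lower_fence = q1 - 1.5 * iqr      # Step 2: marks 1.5 IQR from the box ends
upper_fence = q3 + 1.5 * iqr

# An observation beyond either fence is an outlier (Step 4: drawn as a dot)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Step 3: whiskers reach the most extreme values that are not outliers
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(iqr, lower_fence, upper_fence)   # 6 115.0 139.0
print(outliers)                        # [113]
print(whisker_low, whisker_high)       # 116 136
```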
ILLUSTRATION

Consider the activity of rolling a die.

1. What are the possible outcomes?


2. Can we repeat the activity?
3. If I roll the die now, can you predict the exact number of dots that can be observed?
RANDOM EXPERIMENT

A process of drawing observations capable of repetition under the same conditions with well-defined possible outcomes.

Note: Although the possible outcomes are well-defined, the outcome of a particular trial is unpredictable.
SAMPLE SPACE

✓ It is a set or collection of all possible outcomes of a random experiment.

✓ It may either be finite or infinite.

✓ Elements of the sample space are referred to as sample points.
EVENT

✓ It is a subset of the sample space.

✓ It may either be simple or compound.

✓ Observing an element of an event indicates the occurrence of the event.
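
For the die-rolling activity, these ideas map naturally onto Python sets. This is only an illustrative sketch: the event names and the helper `occurred` are hypothetical, and a simple event is taken here to mean an event containing exactly one sample point.

```python
import random

sample_space = {1, 2, 3, 4, 5, 6}    # all possible outcomes of one roll

even = {2, 4, 6}                     # compound event: more than one sample point
three = {3}                          # simple event: exactly one sample point

def occurred(event, outcome):
    """An event occurs when the observed outcome is one of its elements."""
    return outcome in event

# Every event is a subset of the sample space
print(even <= sample_space, three <= sample_space)

# One trial of the random experiment: the outcome cannot be predicted in advance
outcome = random.choice(sorted(sample_space))
print(outcome, occurred(even, outcome), occurred(three, outcome))
```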
