Vous êtes sur la page 1sur 113

CHAPTER 1: FREQUENCY DISTRIBUTION AND GRAPHS

When conducting a statistical study, the researcher must gather data for the particular variable
under study. To describe situations, draw conclusions, or make inferences about events, the
researcher must organize the data in some meaningful way.

The most convenient method of organizing data is to construct a frequency distribution. After
organizing the data, the researcher must present them so they can be understood by those who
will benet from reading the study. The most useful method of presenting the data is by
constructing statistical charts and graphs.

This chapter explains how to organize data by constructing frequency distributions and how to
present the data by constructing charts and graphs.

1.1 Organizing data

Consider a wastewater expert who collects 50 samples of coloured wastewater from a company
and analyzed their pH content (rounded to the nearest whole number) as:

1 2 6 7 12 13 2 6 9 5

14 3 3 11 4 13 1 10 5 2

4 12 4 5 8 6 5 14 5 2

1 4 3 1 9 2 10 11 4 10

3 14 8 2 4 14 1 3 2 6

Data in this form is known as raw data. Very little information can be obtained from the raw
data unless it is organized.

Raw data can be organized by constructing frequency distribution. Frequency distribution is the
organization of raw data in table form, using classes and frequencies.
Now, a more general observation can be obtained from the data. Example, more than 60% of
wastewater samples are acidic.

Two types of frequency distributions are used. These are categorical, and grouped distributions.

1.1.1 Categorical frequency distribution


This is used for data that can be placed in specific categories. Data such as field of study,
religious affiliation, and sports discipline preferred would use categorical frequency distribution.

Example 1.1
25 army inductees were given a blood test to determine their blood type. The data set is as
follows: A, B, B, AB, O, O, O, B, AB, B, B, B, O, A, O, A, O, O, O, AB, AB, A, O, B, and A.

Construct a frequency distribution for the data.

1.1.2 Grouped frequency distribution

When the variable is quantitative and large, the data can be grouped into classes. Consider a
distribution of the number of hours boat batteries last below.

In the 31 37 class limit, 31 is the lower class limit and 37 is the upper class limit. In the second
column are class boundaries. They are used to separate the classes so that there are no gaps in the
frequency distribution.

The rule of thumb for deciding class limits and class boundaries is that the class limits should
have the same decimal place value as the data, but class boundaries have one addition place
value and end in a 5.

Class width: upper (or lower) class limit of one class upper (or lower) class limit of the next
class. For example, the class width in the distribution = 31 24 = 30.5 23.5 = 7.

The following guidelines can be considered when constructing grouped frequency distribution:

1. There should be between 5 and 20 classes.


2. The class width should be an odd number.
3. The classes must be mutually exclusive.
4. The classes must be continuous.
5. The classes must be exhaustive.
6. The classes must be of equal width.

However, grouped frequency distribution can be constructed from other means.

Example 1.2

The following data represent the record high temperatures for each of the 50 states in the US.
Construct a grouped frequency distribution for the data using 7 classes.

112 100 127 120 134 118 105 110 109 112

110 118 117 116 118 122 114 114 105 109

107 112 114 115 118 117 118 122 106 110

116 108 110 121 113 120 119 111 104 111

120 113 120 117 105 110 118 112 114 114

Solution
The following steps can be used in constructing the grouped frequency distribution

- The range, R = highest value lowest value = 134 100 = 34


- Select the number of classes desired (usually between 5 20). 7 was chosen.
- Find class width: class width = R/(number of class) = 34/7 = 4.9. Round it up to 5.
- Select a starting point for the lowest class limit. This can be the smallest data value or any
convenient number less than smallest data value (100).
- Add the width to the lowest score taken from the starting point to get the lower limit of the next
class. Keep adding until there are 7 classes, i.e 100, 105, 110, etc.
- Subtract one unit from the lower limit of the second class to get the upper limit of the first
class. Then add the width to each upper limit to get all the upper limits. The first class is 100
104. The second class is 105 109, etc.
- Subtract 0.5 from each lower class limit and add 0.5 to each upper class limit to obtain the class
boundaries. These should be 99.5-104.5, 104.5-109.5,etc.
- Tally the data. Find the numerical frequencies from the tallies
- Add the frequencies cumulatively from one class to the other to form cumulative frequencies.
The results are summarized in the table below.

Class limits Class Tally Frequency Cumulative


boundaries frequency
100 104 99.5 104.5 2 2
105 109 104.5 109.5 8 10
110 114 109.5 114.5 18 28
115 119 114.5 119.5 13 41
120 124 119.5 124.5 7 48
125 129 124.5 129.5 1 49
130 134 129.5 134.5 1 50

1.2 Histogram, frequency polygon and ogive

After you have organized the data into a frequency distribution, you can present them in
graphical form. The purpose of graphs in statistics is to convey the data to the viewers in
pictorial form. It is easier for most people to comprehend the meaning of data presented
graphically than data presented numerically in tables or frequency distributions. This is
especially true if the users have little or no statistical knowledge. Statistical graphs can be used to
describe the data set or to analyze it. Graphs are also useful in getting the audiences attention in
a publication or a speaking presentation. They can be used to discuss an issue, reinforce a critical
point, or summarize a data set. They can also be used to discover a trend or pattern in a situation
over a period of time. The three most commonly used graphs in research are histogram,
frequency polygon, and cumulative frequency graph (also known as ogive)

1.2.1 Histogram

The histogram is a graph that displays the data by using contiguous vertical bars (unless the
frequency of a class is 0) of various heights to represent the frequencies of the classes.

Example 1.3
Construct a histogram to represent the data in example 1.2

Solution
The class boundaries and frequencies are shown below.

A plot of frequency against class boundaries, known as histogram, is shown in figure 1.1 below.

Figure 1.1: Histogram plot

1.2.2 Frequency polygon

The frequency polygon is a graph that displays the data by using lines that connect points plotted
for the frequencies at the midpoints of the classes. The frequencies are represented by the heights
of the points.
Example 1.4
Using the frequency distribution given in example 1.2, construct a frequency polygon.

Solution
The only missing link here is the class midpoint. This is calculated as the average of the class
boundaries. The complete table is shown below.

A plot of frequency against class midpoint, known as frequency polygon, is shown in figure 1.2.

Figure 1.2: Frequency polygon plot

1.2.3 Ogive

The ogive is a graph that represents the cumulative frequencies for the classes in a frequency
distribution. It is also known as cumulative frequency graph. The cumulative frequency is the
sum of the frequencies accumulated up to the upper boundary of a class in the distribution.
Example 1.5
Construct an ogive for the frequency distribution in example 1.2.

Solution
The cumulative frequency for each class is shown in the table below.

A plot of cumulative frequency against upper class boundary, known as ogive, is shown in figure
1.3.

Figure 1.3: Ogive plot

1.2.4 Relative frequency graphs

The histogram, the frequency polygon, and the ogive shown previously were constructed by
using frequencies in terms of the raw data. These distributions can be converted to distributions
using proportions instead of raw data as frequencies. These types of graphs are called relative
frequency graphs.
Figure 1.4: Relative frequency plot
Example 1.6
Construct a histogram, frequency polygon, and ogive using relative frequencies for the
distribution (shown here) of the miles that 20 randomly selected runners ran during a given
week.

Solution
The frequencies of each class are converted to relative frequencies by dividing each frequency by
the total frequency. The procedure for drawing relative frequency graphs follows the same order
as normal frequency graphs. The relative frequencies are shown here.

The relative frequency graphs are shown in figure 1.4.

1.3 Other types of graphs

Other graphs such as bar graphs, Pareto charts, time series graphs, and pie graphs are sometimes
used in statistics to give information.
1.3.1 Bar graph

When the data are qualitative or categorical, bar graphs can be used to represent the data. A bar
graph can be drawn using either horizontal or vertical bars.

1.3.2 Pareto graphs

When the variable displayed on the horizontal axis is qualitative or categorical, a Pareto chart
can also be used to represent the data (see figure 1.5 b).

A Pareto chart is used to represent a frequency distribution for a categorical variable, and the
frequencies are displayed by the heights of vertical bars, which are arranged in order from
highest to lowest.

Figure 1.5: Bar and Pareto charts

1.3.3 Time series charts

When data are collected over a period of time, they can be represented by a time series graph. A
time series graph represents data that occur over a specific period of time (figure 1.5 c).
1.3.4 Pie charts

The purpose of the pie graph is to show the relationship of the parts to the whole by visually
comparing the sizes of the sections. Percentages or proportions can be used. The variable is
nominal or categorical.

A pie graph is a circle that is divided into sections or wedges according to the percentage of
frequencies in each category of the distribution.

Figure 1.6: Time series and pie charts

1.3.5 Misleading graphs

Graphs which do not give full information about a variable can be misleading. Misleading graphs
are used in the marketing sector to convince people about a product or service. Care must be
taken when dealing with misleading graphs!
ASSIGNMENT 1

The data represent the ages of our Presidents at the time they were first inaugurated.

57 61 57 57 58 57 61 54 68
51 49 64 50 48 65 52 56 46
54 49 51 47 55 55 54 42 51
56 55 51 54 51 60 62 43 55
56 61 52 69 64 46 54 47

1. Were the data obtained from a population or a sample? Explain your answer.
2. What was the age of the oldest President?
3. What was the age of the youngest President?
4. Construct a frequency distribution for the data. (Use 7 classes)
5. Are there any peaks in the distribution?
6. ldentify any possible outliers.
7. Write a brief summary of the nature of the data as shown in the frequency distribution.

Solution
CHAPTER 2: DESCRIPTIVE STATISTICS

Chapter 1 showed how you can gain useful information from raw data by organizing them into a
frequency distribution and then presenting the data by using various graphs. This chapter shows
the statistical methods that can be used to summarize data. The most familiar of these methods is
the nding of averages.

The word average is ambiguous, since several different methods can be used to obtain an
average. Loosely stated, the average means the center of the distribution or the most typical case.
Measures of average are also called measures of central tendency and include the mean, median,
and mode.

In addition to knowing the average, you must know how the data values are dispersed. That is,
do the data values cluster around the mean, or are they spread more evenly throughout the
distribution? The measures that determine the spread of the data values are called measures of
variation, or measures of dispersion. These measures include the range, variance, and standard
deviation.

Another set of measures is necessary to describe data. These measures are called measures of
position. They tell where a specic data value falls within the data set or its relative position in
comparison with other data values. The most common position measures are percentiles, deciles,
and quartiles.

Finally, this chapter will look at exploratory data analysis. These techniques include dot plot,
stem and leaf plot, and box plot (five number summary).

2.1 Measures of central tendency

2.1.1 Mean

The mean, also known as the arithmetic average, is found by adding the values of the data and
dividing by the total number of values. It is given as
where and n are the sample mean and number of samples respectively. The population mean,
, is also given as

where N represents the total number of values in the population.

Example 2.1
Using the frequency distribution in example 1.6, find the mean. The data represent the number of
miles run during one week for a sample of 20 runners.

Solution
The procedure for finding the mean for grouped data assumes that the mean of all the raw data
values in each class is equal to the midpoint of the class. The complete table for the calculation
of the mean is shown below.

Class boundaries Frequency (f) Midpoint (Xm) f.Xm


5.5 10.5 1 8 8
10.5 15.5 2 13 26
15.5 20.5 3 18 54
20.5 25.5 5 23 115
25.5 30.5 4 28 112
30.5 35.5 3 33 99
35.5 40.5 2 38 76
= n = 20 . = 490

. 490
= = = 24.5
20
Alternatively, assume mean method can be used.

2.1.2 Median
The median is the midpoint of data array. To find the median, the data values must first be
arranged in ascending or descending order. The middle value in the arrangement gives the
median.

Example 2.2
The ages of 10 university students are given below. Find the median.

18 24 20 35 19 23 26 23 19 20

Solution
Arrange the numbers in ascending order as: 18, 19, 19, 20, 20, 23, 23, 24, 26, and 35. Since the
number of students is 10, the median will lie between the 5th and 6th positions. Therefore,

20 + 23
= = 21.5
2

Example 2.3
Using the frequency distribution in example 1.6, find the median.

Solution

The median class is the 4th class since 1 + 2 + 3 + 5 = 11. The median for grouped data is given
by

( )
= 1 + ( 2 1
)

where L1 = lower class boundary of the median class


N = Number of items in the data set (total frequency)
( f)1 = sum of frequencies of the class lower than the median class
c = size of median class
fmedian = frequency of median class

In this problem, L1 = (20.5+20.5)/2 = 20.5; N = 20; ( f)1 = 6; c = 5; fmedian = 5. Therefore,

20 6
= 20.5 + ( 2 ) 5 = 24.5
5
2.1.3 Mode

The mode is the value that occurs most in the data set. A data set can have more than one mode
or no mode at all.

Example 2.4
Eleven different automobiles were tested at a speed of 15 miles per hour for stopping distances.
The data, in feet, are shown below. Find the mode.

15 18 18 18 20 22 24 24 24 26 26

Solution
Since 18 and 24 both occurred three times, the modes are 18 and 24 feet. This data set is said to
be bimodal.

Example 2.5
Using the frequency distribution in example 1.6, find the mode.

Solution
The modal class is the 4th class. The mode for the grouped data is given as

1
= 1 + ( )
1 + 2

where L1 = lower class boundary of the modal class


1 = excess of modal frequency over the next lower class
2 = excess of modal frequency over the next higher class
c = size of modal class

In this problem, L1 = 20.5; 1 = 5 3 = 2; 2 = 5 4 = 1; c = 5. Therefore,

2
= 20.5 + ( ) 5 = 23.8
2+1
2.1.4 Midrange

The midrange is a rough estimate of the middle. It is found by adding the lowest and highest
values in the data set and dividing by 2. It is a very rough estimate of the average and can be
affected by one extremely high or low value.

2.1.5 The weighted mean

Sometimes, you must nd the mean of a data set in which not all values are equally represented.
Under such circumstances, the weighted mean is used. The weighted mean of a variable is
obtained as

Example 2.6
A student received an A in Experimental Data Analysis (3 credits), a C in Engineering Drawing
(3 credits), a B in Analytical Chemistry (4 credits), and a D in Introduction to Information
Technology (2 credits). Assuming A = 4 grade points, B = 3 grade points, C = 2 grade points, D
= 1 grade point, and F = 0 grade points. Find the students grade point average.

Solution

Course Credits (w) Grade (X)


Exp. Data Analysis 3 A (4 points)
Eng. Drawing 3 C (2 points)
Analytical Chemistry 4 B (3 points)
Intro. to IT 2 D (1 point)

(3 4) + (3 2) + (4 3) + (2 1) 32
= = = = 2.7
3+3+4+2 12
2.2 Measures of Variation

Example 2.7

A testing lab wishes to test two experimental brands of outdoor paint to see how long each one
would last before fading. The testing lab makes six gallons of each paint to test. Since different
chemical agents are added to each group and only six cans are involved, these two groups
constitute two small populations. The results (in months) follow. Is the mean a good statistic to
determine the best paint?

Solution
The mean for brand A

The mean for brand B

Figure 2.1 a: Variation of paint in brand A


Figure 2.1 b: Variation of paint in brand B

Since the means of the two brands are equal, one might conclude that both brands of paint last
equally well. However, when the data sets are examined graphically, a different conclusion
might be drawn. As can be seen from the pictures, even though the means are the same for both
brands, the spread, or variation, is quite different. The pictures show that brand B performs more
consistently; it is less variable.

For the spread or variability of a data set, three measures are commonly used: range, variance,
and standard deviation.

2.2.1 Range

The range is the simplest and it is defined as:

where R is the range.

2.2.2 Variance and standard deviation

The variance is the average of the squares of the distance of each value from the mean. The
population variance, , is defined as

2
( )2
=

where xi = individual value
= population mean
N = population mean

The sample variance, denoted by s2 is

( )2
2 =
1

Where = sample mean


n = sample size

The standard deviation is the square root of the variance. The population standard deviation, ,
is defined as

( )2
2
= =

Example 2.8
Find the variance and standard deviation for the amount of European auto sales for the six years
sample shown. The data are in millions of dollars.

11.2, 11.9, 12.0, 12.8, 13.4, 14.3

Solution

xi (xi - ) (xi - )2
11.2 - 1.4 1.96
11.9 - 0.7 0.49
12.0 - 0.6 0.36
12.8 0.2 0.04
13.4 0.8 0.64
14.3 1.7 2.89
xi = 75.60 (xi - )2 = 6.38
The mean,

= = 12.60

The variance,

2
( )2 6.38
= = = 1.28
1 61

The standard deviation, = 1.28 = 1.13

Example 2.9
Find the variance and standard deviation for the frequency distribution in example 1.6.

Solution
The procedure for finding the variance and standard deviation for grouped data is similar to that
for finding the mean for grouped data, and it uses the midpoint of each class.

Class boundaries Frequency (f) Midpoint (xm) f . xm f . xm2


5.5 10.5 1 8 8 64
10.5 15.5 2 13 26 338
15.5 20.5 3 18 54 972
20.5 25.5 5 23 115 2645
25.5 30.5 4 28 112 3136
30.5 35.5 3 33 99 3267
35.5 40.5 2 38 76 2888
f = n = 20 f . xm = 490 f . xm2 = 13310

The variance is
2
2
. [( . )2 /] 13310 [(490)2 /20]
= = = 68.7 2
1 20 1

The standard deviation is s = 68.7 = 8.3


2.2.3 Coefficient of variation

A statistic that allows one to compare standard deviations or variation when the units are
different is called coefficient of variation. The coefficient of variation is given as


= 100% = 100%

for samples for populations

Example 2.10
The mean of the number of sales of cars over a three month period is 87 and the standard
deviation is 5. The mean of the commissions is $5225 and the standard deviation is $773.
Compare the variations of the two.

Solution

The coefficients of variation are

5
= = 100 = 5.7%
87

773
= = 100 = 14.8%
5225

Since the coefficient of variation is larger for commission, the commissions are more variable
than the sales.

2.2.4 Chebyshevs theorem

As stated previously, the variance and standard deviation of a variable can be used to determine
the spread, or dispersion, of a variable. That is, the larger the variance or standard deviation, the
more the data values are dispersed.

Chebyshevs theorem states that the proportion of values from a data set that will fall within k
standard deviations of the mean will be at least 1 1/k2, where k is a number greater than 1 (k is
not necessarily an integer).

The table below summarizes the Chebyshevs theorem


k Proportion = 1 1/k2 Percentage (%) Spread of data values
2 75 2
3 8/9 89 3

The theorem can be applied to any distribution regardless of its shape.

Figure 2.2: Chebyshevs distribution

Other types of distribution are shown in figure 2.3.

2.2.5 Empirical (Normal) rule

Chebyshevs theorem applies to any distribution regardless of its shape. However, when a
distribution is bell-shaped (or what is called normal), the following statements, which make up
the empirical rule, are true.

1. Approximately 68% of the data values will fall within one standard deviation of the mean.

2. Approximately 95% of the data values will fall within two standard deviations of the mean.

3. Approximately 99.7% of the data values will fall within three standard deviations of the mean.

The empirical distribution is shown in figure 2.4


Figure 2.3: Types of distribution
Figure 2.4: Empirical distribution

2.3 Measures of Position

The measures of position include standard scores, percentiles, deciles, and quartiles.

2.3.1 Standard score or z score

Two different units can be compared using a relative standard similar to both. This comparison
uses the mean and standard deviation and is called standard score or z score. The z score
formula is given below. The formula can be used for both samples and populations.

Example 2.11
A student scored 65 in Process Calculations test that had a mean of 50 and a standard deviation
of 10; she scored 30 in Thermodynamics test with a mean of 25 and a standard deviation of 5.
Compare her relative positions in the two tests.

Solution
Calculate the z score for both Process Calculations and Thermodynamics. The z score Process
Calculations is 1.5 and that of Thermodynamics is 1.0. Since the z score of Process Calculations
is higher, her relative position in the Process Calculations class is higher than her relative
position in the Thermodynamics class.
2.3.2 Percentiles

A percentile P is an integer (1 P 99) such that Pth percentile is a value where P% of the data
values are less than or equal to the value and 100 P% of the data values are greater than or
equal to the value. The percentile corresponding to a given value (x) is calculated as

( ) + 0.5
= 100%

Example 2.12
The frequency distribution for the systolic blood pressure readings (in mm Hg) of 200 randomly
selected university students follows.
Class boundaries Frequency
89.5 104.5 24
104.5 119.5 62
119.5 134.5 72
134.5 149.5 26
149.5 164.5 12
164.5 179.5 4

a. Construct a percentile graph.


b. Find the percentile rank of a blood reading of 130.
c. What blood pressure corresponds to the 40th percentile?

Solution
To draw a percentile graph, cumulative percent is calculated. The complete table is shown below

Class boundaries Frequency Cumulative frequency Cumulative percent


89.5 104.5 24 24 12
104.5 119.5 62 86 43
119.5 134.5 72 158 79
134.5 149.5 26 184 92
149.5 164.5 12 196 98
164.5 179.5 4 200 100
From the graph,

(b) A blood pressure of 130 corresponds to approximately the 70th percentile.


(b) A person with a blood pressure of 118 is at the 40% percentile.

Example 2.13
A teacher gives a 20-point test to 10 students. The scores are shown here. Find the percentile
rank of a score of 12.

18, 15, 12, 6, 8, 2, 3, 5, 20, 10

Solution
Arrange the data in order from lowest to highest: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
Since there are 6 numbers below the data value 12,

6 + 0.5
= 100 = 65%
10

Therefore, a score of 12 did better than 65% of the class.

2.3.3 Quartiles and deciles


Quartiles divide the distribution into four groups, denoted as Q1, Q2, and Q3. Q1 corresponds to
the 25th percentile, Q2 is the same as the 50% percentile or the median, and Q3 corresponds to
the 75th percentile.

Deciles divide the distribution into 10 groups. They are denoted as D1, D2, , D9 and they
correspond to P10, P20, , P90.

Quartiles and Percentiles of grouped frequency distribution

- Lower quartile

L1 = lower class boundary of lower quartile class


(f)1 = sum of frequencies of all classes lower than Q1 class
f1 = frequency of Q1 class
c1 = size of Q1 class

- Upper quartile

L3 = lower class boundary of upper quartile class


(f)3 = sum of frequencies of all classes lower than Q3 class
f3 = frequency of Q3 class
c3 = size of Q3 class

- Kth percentile
LK = lower class boundary of the Kth class

(f)K = sum of frequencies of all classes lower than Kth class

f3 = frequency of Kth class

c3 = size of Kth class

Try

Using the frequency distribution in example 1.6, find

(a) the lower quartile


(b) the upper quartile
(c) the 40th percentile

2.3.4 Outliers

An outlier is an extremely high or extremely low data value when compared with the rest of the
data values. An outlier can strongly affect the mean and standard deviation of a variable.

2.4 Exploratory data analysis

The purpose of exploratory data analysis is to examine data in order to find out what information
can be discovered. Usually, a dot plot (for small observations), stem and leaf plot, and box plot
are used.

2.4.1 Stem and leaf plot

The stem and leaf plot is a method of organizing data and is a combination of sorting and
graphing. It has the advantage over grouped frequency distribution of retaining the actual data
while showing them in graphic form.
Example 2.14

The levels of lead (Pb) in ppm determined in 20 fishes in a laboratory are shown below.

25, 31, 20, 32, 13


14, 43, 02, 57, 23
36, 32, 33, 32, 44
32, 52, 44, 51, 45

(a) Construct a stem and leaf plot for the data


(b) What is the median lead level?
(c) Calculate modal lead level.
(d) Calculate the lower quartile.

Solution
(a) Construct the stem and leaf plot as follows:

1. Sort the data in ascending order:


02, 13, 14, 20, 23, 25, 31, 32, 32, 32, 32, 33, 36, 43, 44, 44, 45, 51, 52, 57.
2. Separate the data according to the first digit as shown
02 13,14 20,23,25 31,32,32,32,32,33,36 43,44,45 51,52,57
3. List the leading digit as stem in a vertical column.
4. Record the trailing digit as leaf beside the stem in another column.
5. Record the number of leaves as frequency in the third column.

Leading digit (stem) Trailing digit (leaf) Frequency


0 2 1
1 3 4 2
2 0 3 5 3
3 1 2 2 2 2 3 6 7
4 3 4 4 5 4
5 1 2 7 3
f = 20

(b) The median will lie between the 10th and 11th positions. Hence, median = 32

(c) Mode = 32 because it occurred most

(d) The lower quartile is between the 5th and 6th positions. Hence, Q1 = 24
Note: When the data values are in the hundreds, such as 325, the stem will be 32 and the leaf
will be 5.

2.4.2 Box plot

The stem-and-leaf provides general visual impressions about a data set, while numerical values
such as and s are missing. The box plot is a graphical display that simultaneously describes
several important features of a data set, such as center, spread, departure from symmetry, and
identification of unusual observations or outliers.

A box plot is shown below.

Whisker extends to smallest data Whisker extends to largest data


point within 1.5IQR from Q1 point within 1.5IQR from Q3

Q1 Q2 Q3

Outliers Outliers Extreme outlier

1.5 IQR 1.5 IQR IQR 1.5 IQR 1.5 IQR

Constructing a box plot involves the use of five specific values. These are the lowest value of the
data set, the lower quartile, the median, the upper quartile, and the highest value of the data set.
These values are called a five-number summary of the data set.

Example 2.15
A stockbroker recorded the number of clients she saw each day over an 11-day period. The data
are shown below. Construct a box plot for the data.

33, 38, 43, 30, 29, 40, 51, 27, 42, 23, 31

Solution
1. Arrange the data in ascending order: 23, 27, 29, 30, 31, 33, 38, 40, 42, 43, 51.
2. Identify the minimum value: minimum value = 23
3. Find the median: Q2 = 33
4. Find the lower quartile Q1 = 29
5. Find the upper quartile Q3 = 42
6. Identify the maximum value: maximum value = 51
7. Plot these values on a straight line.

29 33 42
23 51

20 25 30 35 40 45 50

Example 2.16
A dietitian is interested in comparing the sodium content of real cheese with the sodium content
of a cheese substitute. The data for two random samples are shown. Compare the distributions,
using box plots.

Real cheese Cheese substitute


310 420 45 40 270 180 250 290
220 240 180 90 130 260 340 310
Solution

The box plots are shown below


ASSIGNMENT 2

The average undergraduate grade point average (GPA) for the 25 top-ranked medical schools is
listed below.
3.80 3.77 3.70 3.74 3.70
3.86 3.76 3.68 3.67 3.57
3.83 3.70 3.80 3.74 3.67
3.78 3.74 3.73 3.65 3.66
3.75 3.64 3.78 3.73 3.64

Find
(a) (i) the mean, (ii) the median, (iii) the mode, and (iv) the midrange.
(b) Draw a box plot for the distribution.
(c) Determine the number of outliers in the distribution.

Solution
CHAPTER 3: ANALYSIS OF ERROR

Suppose you what to determine the density of a mineral by measuring its mass (4.635 0.002 g)
and volume (1.13 0.05 mL). Density is mass per unit volume: 4.635g/1.13mL = 4.101 g/mL.
Now, what is the uncertainty in the computed density? How many significant figures should be
used for the density? This chapter discusses errors and their propagation during measurements.

3.1 Accuracy and precision

Accuracy is the degree of agreement between the measured value and the true value. An
absolute true value is seldom known. A more realistic definition of accuracy, then, would assume
it to be the agreement between a measured value and the accepted true value.

We can, by good analytical technique, such as making comparisons against a known standard
sample of similar composition, arrive at a reasonable assumption about the accuracy of a method,
within the limitations of the knowledge of the known sample (and of the measurements). The
accuracy to which we know the value of the standard sample is ultimately dependent on some
measurement that will have a given limit of certainty in it.

Precision is defined as the degree of agreement between replicate measurements of the same
quantity. That is, the repeatability of a results. The precision may be expressed as the standard
deviation, the coefficient of variation, the range of the data, or as a confidence interval (e.g.,
95%) about the mean value.

Good precision does not assure good accuracy. This would be the case, for example, if there
were systematic error in the analysis.

Figure 3.1: accuracy versus precision

These concepts can be illustrated with a target, as in figure 3.1. Suppose you are at target practice
and you shoot the series of bullets that all land in the bulls eye (left target). You are both precise
and accurate. In the middle target, you are precise (steady hand and eye), but inaccurate. In the
right target you are imprecise and therefore probably inaccurate. So we see that good precision is
needed for good accuracy, but it does not guarantee it.

Example 3.1

Two teams of three students each count the number of people entering the main gate of a football
stadium in the 5-minute period just prior to kickoff. Their individual and team counts are

Team A Team B

Akos: 577 people Kuevor: 577 people


Dinitia: 579 people Quarshiga: 585 people
Adjei: 581 people Adu-Gyamerah: 593 people

An electronic counter indicates that 581 people passed through the gate in the designated time
period.

(a) Which persons count is the most accurate?


(b) Which teams count is the most accurate?
(c) Which persons count is the most precise?
(d) Which teams count is the most precise?
Solution

(a) Adjeis count is the most accurate because his count is the same as the electronic count.
(b) The average count of 579 for team A is closer to 581 than the average count of 585 for
team B. Thus, team A made the more accurate count.
(c) Precision does not apply here because each person made only one count.
(d) Team As counts range from 577 to 581, which gives a spread of 4. Team Bs counts
range from 577 to 593, which gives a spread of 16. The smaller spread makes team As
count more precise.
3.1.1 Ways of expressing accuracy

There are various ways and units in which the accuracy of a measurement can be expressed, an
accepted true value for comparison being assumed.

a. Absolute error (Eabs)


It is the difference between the experimental value (xi) and true or accepted value (xt). It has
signs and experimental units.

= (2.1)

b. Relative error (Er)


It is the absolute error expressed as percentage of the true value. Or the absolute error corrected
for the size of the measurement or expressed as the fraction, percentage, or parts-per-thousand of
the true value. Er has a sign but no units.

( )
= 100 (2.2)

Parts per hundred (pph) = 100

Parts per thousand (ppt) = 1000

Parts per million (ppm) = 1000000

Example 3.2

The results of an analysis are 36.97 g, compared with the acceptable value of 37.06 g. What is
the relative error in parts per thousand?

Solution

Absolute error, Eabs = 36.97 37.06 = -0.09 g

0.09
Relative error, = 1000% = -2.4ppt
37.06
3.2 Types of errors

Two main classes of errors can affect the accuracy or precision of a measured quantity. They are
determinate and indeterminate errors.

3.2.1 Determinate errors (systematic errors)

Determinate errors are those that are determinable and can either be avoided or corrected. They
may be constant, as in the case of an uncalibrated weight that is used in all weighing. Or, they
may be variable such as a buret whose volume readings are in error by different amounts at
different volumes.

Systematic errors affect the accuracy of measurement.

Some examples of determinate errors are:

Instrument errors - failure to calibrate, degradation of parts in the instrument, power


fluctuations, etc. Instrumental errors can be corrected or minimized by proper calibration
and instrument measurement.
Method errors - errors due to no ideal physical or chemical behavior - completeness and
speed of reaction, interfering side reactions, sampling problems. Method errors can be
reduced by proper method development.
Personal errors - occur where measurements require judgment; result from prejudice,
color acuity problems. Personal errors can be minimized by proper training and
experience.

2.2.2 Indeterminate errors (random errors)

Indeterminate errors, often called accidental or random errors, represent the experimental
uncertainty that occurs in any measurement. These errors are revealed by small differences in
successive measurements made by the same analyst under identical conditions, and cannot be
predicted or estimated.

These accidental errors will follow a random distribution; therefore, mathematical laws of
probability can be applied to arrive at some conclusion regarding the most probable result of a
series of measurement. Indeterminate errors should follow a normal distribution, or Gaussian
curve. Such a curve is shown in figure 2.2. The symbol represents the standard deviation of an
infinite population of measurements, and this measure of precision defines the spread of the
normal population distribution.

Figure 3.2: Normal error curve

3.2.3 Gross errors

Examples of gross errors are instrument breakdown, loss of a crucial sample, and discovery that
a "pure" reagent was actually contaminated.

Gross errors cause an experimental value to be discarded. These lead to outliers and require
statistical techniques before they can be rejected.

3.3 Significant figures

The number of significant figures is the minimum number if digits needed to write a given value
in scientific notation without loss of accuracy. For example, the number 142.7 has four
significant figures because it can be written as 1.427 102. If you write 1.4270 102 (which has
five significant numbers), you imply that there is another number after the digit 7.
Zeros are significant when they occur (1) in the middle of a number or (2) at the end of a number
on the right-hand side of a decimal point. The bolded zeros in the following numbers are
significant: 106, 0.0106, 0.106, and 0.1060.

Figure 3.3: Scale of a spectrophotomer

The last significant digit (farthest to the right) in a measured quantity always has some
associated uncertainty. There is uncertainty in any measured quantity even if the measuring
instrument has a digital readout that does not fluctuate. For instance, the needle of the
spectrophotometer in figure 3.3 appears to be at an absorbance value of 0.234. This number has
three significant figures because the numbers 2 and 3 are completely certain but the number 4 is
an estimate. The value might be read as 0.233 or 0.235 by other people.

3.3.1 Addition and subtraction

1. If the numbers to be added or subtracted have equal number of digits, the answer goes to the
same decimal place as in any of the individual numbers:

1. 104

+ 1. 104

4. 104

The number of significant figures in the answer may exceed or be less than in the original data.

5.345 7. 1014

+ 6.728 - 6. 1014

12.073 0. 1014
2. If the numbers being added do not have the same number of significant figures, we are limited
by the least-certain number.

18.99840509

+ 83.798

102.79640509

The correct answer will stand as 102.796 as detated by the second number.

3. If the addition or subtraction of numbers are expressed in scientific notation, all numbers
should first be expressed with the same exponent:

1.632 105 1.632 105

+ 4.107 103 + 0.04107 105

+ 0.984 106 + . 105

. 105

3.3.2 Multiplication and division

In multiplication and division, we are normally limited to the number of digits contained in the
number with the fewest significant figures.

3.26 105 4.3179 1012 34.60

1.78 . 1019 2.46287

. 105 . 106 14.05

3.3.3 Logarithms and antilogarithms

A logarithm is composed of a characteristic and a mantissa. The characteristic is the integer part
and the mantissa is the decimal part.

For logarithms, the number of digits in mantissa of log x = number of significant figures in x.

log x = n Characteristic Mantissa


log 339 = 2.530 2 0.530
log 3.39 105 = -4.470 -4 0.470
For antilogarithms, the number of significant figures in the antilogarithm should equal he
number of digits in the mantissa.

antilog (-3.42) = 10 -3.42 = . 104

3.4 Propagation of uncertainty from random error

For most experiments, we need to perform arithmetic operation on numbers, each of which has
random error. The uncertainty is likely to propagate from one operation to the other. The
uncertainties can be monitored as described mathematically in the table below.

Type of Example Standard deviation of y


Calculation

Addition or y=a+bc = 2 + 2 + 2
(2.8)
Subtraction


Multiplication y = a.b/c = ( )2 + ( )2 + ( )2

(2.9)
Or division


Exponential y = ax = .

(2.10)


Logarithm = 10 = 0.434

(2.11)

Antilogarithm = 10 = 2.303.

(2.12)

a, b and c are experimental variables with Sa, Sb and Sc as standard deviations respectively

Table 3.1: Formulas for uncertainty propagation

3.4.1 Absolute and relative uncertainty

Absolute uncertainty expresses the margin of uncertainty associated with a measurement. If the
estimated uncertainty in reading a calibrated buret is 0.02 mL, we say that 0.02 mL is the
absolute uncertainty associated with the reading.

Relative uncertainty compares the size of the absolute uncertainty with the size of its associated
measurement. The relative uncertainty of a buret reading of 12.350.02 mL is a dimensionless
quotient:

0.02
= = = 0.002
12.35

Example 3.3
The volume delivered by a buret is the difference between the initial and final reading. If the
uncertainty in each reading is 0.02 mL, what is the uncertainty in the volume delivered if the
initial and final reading are 0.05(0.02) mL and 17.88(0.02) mL respectively?

Solution
17.88 (0.02)

- 0.05 (0.02)

17.83 (S) mL
SVol,delivered = 0.022 + 0.022 = 0.028 0.03 mL

Volume delivered = 17.83 (0.03) mL

Example 3.4

Calculate the uncertainty in the number of millimoles of chloride contained in 250.0 mL of a


sample when three aliquots of 25.00 mL are titrated with silver nitrate with the following results:
36.78, 36.82, 36.75 mL. The molarity of the AgNO3 solution is 0.11670.0002 M.

Solution

The absolute mean volume is

36.78 + 36.82 + 36.75


= 36.78
3
The standard deviation

Volume, xi
( )
36.78 0.00 0.0000
36.82 0.04 0.0016
36.75 -0.03 0.0009
( )2 = 0.0025

0.0025
= = 0.035
31

Therefore, the mean volume = 36.78 0.04 mL

mmol of Cl- titrated = (0.1167 0.0002 mmol/mL of Ag)( 36.78 0.04 mL) = 4.29 ?

Using table 2.1, y = 4.29 and Sy is calculated as

0.1167 2 36.78 2

= ( ) +( )
4.29 0.0002 0.04


= (2.9 106 ) + (0.90 106 )
4.29

= 3.8 106 = 1.9 103
4.29

Therefore, = 4.29 0.0019 = 0.0082

The absolute uncertainty in mmol of Cl- is 4.29 0.0082 mmol

Hence, mmol of Cl- in 250 mL = 10(4.292 0.0082) = 42.90 0.08 mmol

3.5 Law of propagation of uncertainty

Let R be a given function of independent variables

R = f (x1, x2, x3, , xn)

Also, let wR be the uncertainty in R. We can therefore write

wR = f (w1, w2, w3, , wn)

The uncertainty in R is given as

1
2 2 2 2

= ( . 1 ) + ( . 2 ) + ( . )
1 2 3 3

1
2 2

= ( . )

Example 3.5

A temperature measuring device which works according to the resistance in a copper wire is
correlated as:

= 0 [1 + ( 20)]

Where:

R0 is the resistance at 20 oC = 6 0.3%


is the temperature coefficient of resistance = 0.004 0.01%
T is temperature of copper wire = 30 oC 1%

Compute the resistance of the copper wire and its uncertainty.

Solution

0 = 0.003 , = 0.0001 and = 0.01

But
= 0 [1 + ( 20)] = 6[1 + 0.004(30 20)] = 6.24

Also,


= 1 + ( 20) = 1 + 0.004(30 20) = 1.04
0


= 0 ( 20) = 6(30 20) = 60


= 0 = (0.004 6) = 0.024

We also know

1
2 2 2 2

= ( . 0 ) + ( . ) + ( . )
0

1
= (1.04 0.003)2 + (60 0.0001)2 + (0.024 0.01)2 2 = 0.00677

Therefore, the resistance is

= 6.24 0.677%
CHAPTER 4: NORMAL DISTRIBUTION

In chapters two and three, the normal distribution was briefly introduced. Many random
continuous variables have distribution that are bell-shaped and are called approximately
normally distributed variables. Such distributions are also known as the Bell curve or the
Gaussian distribution.

When the data values are evenly distributed about the mean, the distribution is said to be
symmetrical. When majority of the values falls to the left or right of the mean, the distribution is
said to be skewed. Figures 4.1 a, b, and c show the different forms of distribution.

The tail of the curve indicates the direction of skewnes (right is positive, left is negative).

4.1 Properties of the normal distribution

The normal distribution is a continuous, symmetric, bell-shaped distribution of a variable. The


mathematical equation for the normal distribution is

2 (2 2 )
()
=
2
In applied statistics, tables are used instead of the equation. Again, the area under the curve is
more important than the frequencies. Finally, the shape and position of the normal distribution
curve depend on two parameters, mean and standard deviation.

Below are the summary of the properties of the Normal distribution.

1. The normal distribution curve is bell-shaped.


2. The mean, median, and mode are equal and located at the center of the distribution.
3. The distribution curve is unimodal.
4. The curve is symmetrical about the mean.
5. The curve is continuous.
6. The curve never touches the x-axis.
7. The total area under the curve is approximately 1.00 or 100%.
8. The area under the curve that lies within one standard deviation of the mean is approximately
0.68 or 68%; within two standard deviations, about 0.95 or 95%; and within three standard
deviations, about 0.997 or 99.7%.

4.2 Checking for Normality

To check whether a distribution is normal or approximately normal, the following steps are used:

a. Draw a histogram for the data and check its shape. If the histogram is not approximately bell-
shaped, then the data are not normally distributed.

b. Check the skewness of the data by using Pearson coefficient of skewness (PC) or Pearsons
index of skewness.

3( )
=

If PC +1 (positively skewed) and PC -1 (negatively skewed), it can be concluded that the


data are significantly skewed.

3. Check for outliers! One or more outliers can affect the normality.
Example 4.1
A survey of 18-high technology firms showed the number of days inventory they had on hand.
Determine if the data is approximately normally distributed.

5 29 34 44 45 63 68 74 74

81 88 91 97 98 113 118 151 158

Solution

- Construct a frequency distribution and draw a histogram for the data.

The histogram is approximately bell-shaped, so we can conclude that the distribution is


approximately normal.

-Check for skewness: For these data set, = 79.5, median = 77.5, and s = 40.5. Therefore,

3(79.5 77.5)
= = 0.148
40.5

The distribution is not significantly skewed.

-Check for outliers: Q1 = 45, Q3 = 98 and IQR = 53. An outlier will be a data value less than 45
1.5(53) = -34.5 or a data value larger than 98 + 1.5(53) = 177.5

In this case, there are no outliers! Generally, the distribution is approximately normal.
4.3 The standard normal distribution

Since there can be thousands of normal distribution curves (due to the different mean and
standard deviation of variables), one would have to have a table of areas for each variable for
practical applications. To simplify this situation, the standard normal distribution is used.

The standard normal distribution is a normal distribution with a mean of 0 and a standard
deviation of 1.

Figure 4.2: Standard normal distribution

All normally distributed variables can be transformed into the standard normally distributed
variable by using the formula for standard score:


= =

When the normal distribution is transformed into standard normal distribution, it can be used to
solve practical application problems.

4.3.1 Area under standard normal curve

Example 4.2
Find the area to the left of z = 2.06

Solution
Draw and represent the area as shown the figure

We are looking for the area under the standard normal distribution to the left of z = 2.06. Look
up for the area between 0 and 2.06. From the standard normal distribution table (Appendix Table
B-1), this area is 0.4803. For the entire area to the left, add 0.5000. Therefore, the required area
is 0.9803. Hence, 98.03% of the area is less than z = 2.06.

Example 4.3
Find the area to the right of z = - 1.19

Solution

We are looking for the area to the right of z = 1.19. Look up to the area between 0 and 1.19. It is
0.3830 (Appendix Table B-1). Therefore, required area is 0.8830. Hence, 88.30% of the area
under the Standard normal distribution curve is to the right of z = -1.19

Example 4.3
Find the area between z = +1.68 and z = -1.37.

Solution
Look up for the areas between 0 and 1.68 and 0 and 1.37 and add the areas. The area between 0
and 1.68 is 0.4535 and that between -1.37 and 0 is 0.4147. Therefore, the required area is 0.8682.
4.3.2 Normal distribution curve as a probability distribution curve

The normal distribution curve can be used as a probability distribution curve for normally
distributed variables. The area under the normal distribution curve corresponds to a probability.
That is, if it were possible to select any z value at random, the probability of choosing, say,
between 0 and 2.00 would be the same as the area under the curve between 0 and 2.00. In this
case, the area is 0.4772. Therefore, the probability of selecting any z value between 0 and 2.00 is
0.4772.

For probabilities, a special notation is used. For example, if one wants to find the probability of
any z value between 0 and 2.00, the probability is written as P(0 < z < 2.00).

Example 4.6
Find the probability for each.
a. P(0 z 2.32)
b. P(z 1.65)
c. P(z 1.91)

Solution

a. P(0 z 2.32) = 0.4898 or 48.98%


b. P(z 1.65) = 0.9505 or 95.05%
c. P(z 1.91) = 0.0287 or 2.87%

4.3 Application of Normal distribution

Example 4.7
Each month, an American household generate 28 pounds of newspaper for garbage or recycling.
Assuming the standard deviation is 2 pounds. If a household is selected at random, find the
probability of its generating
a. Between 27 and 31 pounds per month
b. More than 30.2 pounds per month
Solution
The two z-values are

27 28 31 28
1 = = 0.5 2 = = 1.5
2 2

The area between z = 0 and z = -0.5 is 0.1915 and the area between z = 0 and z = 1.5 is 0.4332.
Therefore, the total area = 0.1915 + 0.4332 = 0.6247. Hence, the probability that a randomly
selected household generates between 27 and 31 pounds of newspaper per month is 62.47%.

(b) The z value for x = 30.2 is 1.1. The area between z = 0 and z = 1.1 is 0.3643. Therefore, the
actual area = 0.5000 0.3643 = 0.1357. Hence, the probability that a randomly selected
household will accumulate more than 30.2 pounds of newspaper is 13.57%.

Example 4.8
The American Automobile Association reports thats that the average time it takes to respond to
an emergency call is 25 minutes. Assume the variable is approximately normally distributed and
the standard deviation is 4.5 minutes. If 80 calls are randomly selected, approximately how many
will be responded to in less than 15 minutes?

Solution
The z value for x = 15 is -2.22. The area between z = 0 and z = -2.22 is 0.4868. Therefore, the
actual area = 0.5000 0.4868 = 0.0132. The number of calls that will be made in less than 15
minutes will be (80 calls)(0.0132) = 1.056. Hence, approximately 1 call be responded to in less
than 15 minutes.

Example 4.9
In order to qualify for a police academy, candidates must score in the top 10% on a general
abilities test. The test has a mean of 200 and standard deviation of 20. Find the lowest possible
score to qualify. Assume the test scores are normally distributed.

Solution

10% or 0.100 represents the area to the right of the normal distribution for a text score of X. The
area between z = 0 and z-value of the test score will be 0.5000 0.1000 = 0.4000. From standard
normal distribution table, z = 1.28 gives a corresponding area of 0.3997 ( 0.4000).
Next, calculate the X score from the z value.

200
1.28 =
20

= 226

A score of 226 will be used as cutoff. Anyone who scores 226 or higher qualifies for the
academy.

Example 4.10
For a medical study, a researcher wishes to select people in the middle 60% of the population
based on blood pressure. If the mean systolic blood pressure is 120 and the standard deviation is
8, find the upper and lower readings that would qualify people to participate in the study.

Solution
Since a middle area of 0.6000 is required, the test values will have an area of 0.3000 on each side
of the mean. The closest z value for an area of 0.3000 is 0.84. Therefore, the z = 0.84.
Calculating the test score, we have

1 120 2 120
0.84 = 0.84 =
8 8

Hence, x1 = 126.72 and x2 = 113.28

Therefore, the middle 60% will have blood pressure readings of 113.28 < x < 126.72.

4.4 Binomial distribution

Many types of probability problems have only two outcomes or can be reduced to two
outcomes. For example, when a coin is tossed, it can land heads or tails. When a baby is born, it
will be either male or female. In a basketball game, a team either wins or loses. A true/false item
can be answered in only two ways, true or false. A multiple-choice question, even though there
are four or ve answer, choices can be classied as correct or incorrect. Situations like these are
called binomial experiments.
A binomial experiment is a probability experiment that satises the following four requirements:

1. There must be a xed number of trials.


2. Each trial can have only two outcomes or outcomes that can be reduced to two outcomes.
These outcomes can be considered as either success or failure.
3. The outcomes of each trial must be independent of one another.
4. The probability of a success must remain the same for each trial.

A binomial experiment and its results give rise to a special probability distribution called the
binomial distribution.

If p = probability of success and q = 1 p = probability of failure that an event will occur in a


single trial, then the probability that the event will happen exactly x times in n trials (i.e x
successes and n - x failures) is given by

!
() = ( ) =
! ( )!

Where x = 0, 1, 2, , n; n! = n(n-1)(n-2)1; 0! = 1

This discrete probability distribution is often called binomial distribution since for x = 0, 1, 2,
, n corresponds to successive terms of the binomial formula or binomial expansion.


( + ) = + ( ) 1 + ( ) 2 2 + +
1 2

Example 4.11
A fair coin is tossed 6 times. What is the probability of:
(a) Getting exactly 2 heads?
(b) At least 4 heads.

Solution

n = 6; p = 1/2 ; q=

(a)

6 6! 1 2 1 4
( = 2) = ( ) 2 62 = ( ) ( ) = 0.2344
2 2! (6 2)! 2 2
(b)

6 6 6
( 4) = ( ) 4 2 + ( ) 5 + ( ) 6
4 5 6

15 6 1 11
( 4) = + + = = 0.3438
64 64 66 32

4.4.1 Normal approximation to binomial distribution

The normal distribution is used to solve problems involving binomial distribution since when n is
large (say, 100), the calculations are too difficult to do by hand using the binomial distribution.

Statistians agree that the normal approximation should be used only when n.p and n.q are both
greater than, or equal to 5. Again, correction for continuity may be used in the normal
approximation. A correction for continuity is a correction employed when a continuous
distribution is used to approximate a discrete distribution.

Binomial Normal
When finding Use
P(X = a) P(a 0.5 < X < a + 0.5)
P(X a) P(X > a 0.5)
P(X > a) P(X > a + 0.5)
P(X a) P(X < a + 0.5)
P(X < a) P(X < a 0.5)

The formulas for the mean and standard deviation for the binomial distribution are

= =
Table 4.1 summarizes the Normal approximation to Binomial distribution.

Example 4.12
A magazine reported that 6% of American drivers read the newspaper while driving. If 300
drivers are selected at random, find the probability that exactly 25 say they read the newspaper
while driving.

Solution
Here, p = 0.06, q = 0.94, and n = 300
n.p = (300)(0.06) = 18 and n.q = (300)(0.94) = 282

Since both n.p and n.q are greater than 5, the normal distribution can be used.

The mean and standard deviation of the binomial distribution are


= 300 0.06 = 18 = 300 0.06 0.94 = 4.11

Next, we write the problem in probability notation: P(X = 25). Convert the problem to Normal
distribution and solve it: P(24.5 < X < 25.5).

The z values are

24.5 18 25.5 18
1 = = 1.82 2 = = 1.58
4.11 4.11

The area for z1 = 1.82 is 0.4656 and that for z2 = 1.58 is 0.4429. Therefore, the required area will
be 0.4656 0.4429 = 0.0227. Hence the probability that exactly 25 people read the newspaper
while driving is 2.27%.

Example 4.13
Of the members of a bowling league, 10% are widowed. If 200 bowling league members are
selected at random, find the probability that 10 or more will be widowed.

Solution

p = 0.10; q = 0.90; n = 200

n.p = (200)(0.10) = 20 and n.q = (200)(0.9) = 180

The mean and standard deviation of the binomial distribution are

= 200 0.10 = 20 = 200 0.10 0.90 = 4.24

Problem in binomial probability notation: P(X 10). Convert to Normal distribution: P(X > 9.5).
The z value is

9.5 20
= = 2.48
4.24

The area between z = 0 and z = 2.48 is 0.4934. Therefore, required area = 0.4934 + 0.5000 =
0.9934.

The probability of 10 or more widowed people in a random sample of 200 bowling league
members is 99.34%.
ASSIGNMENT 3

The average annual salary for all Ghanaian teachers is 47,750. Assume that the distribution is
normal and the standard deviation is 5,680. Find the probability that a randomly selected
teacher earns
a. Between 35,000 and 45,000 a year
b. More than 40,000 a year
c. If you were applying for a teaching position and were offered 31,000 a year, how would you
feel (based on this information)? Explain

Solution
CHAPTER 5: CONFIDENCE INTERVALS AND SAMPLE SIZE

5.1 Confidence intervals for the mean ( known or n 30) and sample size

5.1.1 Confidence interval

The sample mean is almost always different from the population mean due to sampling error.
Therefore, how good is a point estimate? There is no way of knowing how close a particular
point estimate is to the population mean. This, therefore, places some doubt on the accuracy of
point estimates. For this reason, statisticians prefer another type of estimate, called an interval
estimate.

An interval estimate of a parameter is an interval or a range of values used to estimate the


parameter. Either the interval contains the parameter or it does not. A degree of confidence
(usually a percent) can be assigned before an interval estimate is made. For instance, you may
wish to be 95% confident that the interval contains the true population mean. Another question
then arises. Why 95%? Why not 99 or 99.5%? If you desire to be more confident, such as 99 or
99.5% confident, then you must make the interval larger.

The confidence level of an interval estimate of a parameter is the probability that the interval
estimate will contain the parameter, assuming that a large number of samples are selected and
that the estimation process on the same parameter is repeated.
A confidence interval is a specific interval estimate of a parameter determined by using data
obtained from a sample and by using the specific confidence level of the estimate.

The relationship between and the confidence level is that the stated confidence level is the
percentage equivalent to the decimal value of 1 , and vice versa.

The formula for calculating the confidence interval of the mean for a specific is given by


2 ( ) < < + 2 ( )


The term 2 ( ) is called the maximum error of estimate. For a specific value, say = 0.05,

95% of the sample means will fall within this error value on either side of the population mean as
shown in figure 5.1.
Figure 5.1: 95% Confidence interval

When the original variable is normally distributed and is known, the standard normal
distribution can be used to find confidence internals regardless of the size of the sample. When n
30, the distribution of means will be approximately normal even if the original distribution of
the variable departs from normality. Also, if n 30, s can be substituted for in the formula for
confidence intervals.

Example 5.1
The president of a large university wishes to estimate the average age of the students presently
enrolled. From past studies, the standard deviation is known to be 2 years. A sample of 50
students is selected, and the mean is found to be 23.2 years. Find the 95% confidence interval of
the population mean.

Solution
For 95% confidence level, = 0.05 and z/2 = 1.96 (Table E). Hence, substituting into the
formula


2 ( ) < < + 2 ( )

2 2
23.2 1.96 ( ) < < 23.2 + 1.96 ( )
50 50
22.6 < < 23.8 23.2 0.6
Hence, the president can say, with 95% confidence, that the average age of the students is
between 22.6 and 23.8 years, based on 50 students.

Example 5.2
The following data represent a sample of the assets (in millions of Ghana cedis) of 30 credit
unions in Accra. Find the 90% confidence interval of the mean.

12.23 16.56 4.39


2.89 1.24 2.17
13.19 9.16 1.42
73.25 1.91 14.64
11.59 6.69 1.06
8.74 3.17 18.13
7.92 4.78 16.85
40.22 2.42 21.58
5.01 1.47 12.24
2.27 12.77 2.76

Solution
The mean and standard deviation of the data are calculated as 11.091 and 14.405 respectively.
Since n 30, s = and the standard normal distribution can be used in estimating the confidence
internal. Since 90% confidence level was desired, = 0.10 and z/2 = 1.65 (Table E). Substituting
into the formula

14.405 14.405
11.091 1.65 ( ) < < 11.091 + 1.65 ( )
30 30
6.752 < < 15.430 11.091 4.339

Hence, one can be 90% confident that the population mean of the assets of all credit unions is
between 6.752 million and 15.430 million, based on a sample of 30 credit unions.

5.1.2 Sample size

Sample size is obtained from the maximum error estimate formula as:
/2 . 2
=( )

where E = there error margin

Example 5.3
The Provost of the college asked the Experimental Data Analysis teacher to estimate the average
age of the students at their college. How large a sample is necessary? The statistics teacher
would like to be 99% confident that the estimate should be accurate within one year. From a
previous study, the standard deviation of the ages is known to be 3 years.

Solution
Since = 0.01, z/2 = 2.58, and E = 1 year. Substituting in the formula, one get

/2 . 2 2.58 3 2
=( ) =( ) = 59.9
1

Therefore, in order to be 99% confident that the estimation is within 1 year of the true mean age,
the teacher needs a sample size of at least 60 students.

5.2 Confidence interval for the mean ( unknown or n < 30)

When is known and the variable is normally distributed or when is unknown and n 30, the
standard normal distribution is used to estimate the confidence intervals for the mean. However,
in many situations, is unknown and n < 30. In such situations, s = but a different distribution
called the t distribution must be used.

The t distribution shares some characteristics of the normal distribution and differs from it in
others. The t distribution is similar to the standard normal distribution in these ways:
1. It is bell-shaped.
2. It is symmetric about the mean.
3. The mean, median, and mode are equal to 0 and are located at the center of the distribution.
4. The curve never touches the x axis.

The t distribution differs from the standard normal distribution in the following ways:
1. The variance is greater than 1.
2. The t distribution is actually a family of curves based on the concept of degrees of freedom,
which is related to sample size.
3. As the sample size increases, the t distribution approaches the standard normal distribution.

Figure 5.2: The t family of curves

The formula for calculating the confidence interval of the mean for a specific is given by:


2 ( ) < < + 2 ( ) . = 1

Example 5.4
Ten randomly selected automobiles were stopped, and the tread depths of the right front tire were
measured. The mean was 0.32 inch, and the standard deviation was 0.08 inch. Find the 95%
confidence interval of the mean depth. Assume that the variable is approximately normal.

Solution

Since is unknown and n < 30, s = and the t distribution must be used for 95%. Hence, with 9
d.f, t/2 = 2.262 (Table F). Hence the confidence interval is calculated as


2 ( ) < < + 2 ( )

0.08 0.08
0.32 2.262 ( ) < < 0.32 + 2.262 ( )
10 10

0.26 < < 0.38


Therefore, one can be 95% confident that the population mean tread depth of all right front tires
is between 0.26 and 0.38 inch based on a sample of 10 tires.

5.3 Confidence intervals for variances and standard deviations


In sections 5.1 and 5.2, confidence intervals were calculated for means. This section will explain
how to find confidence intervals for variances and standard deviations. In statistics, the variance
and standard deviation of a variable are as important as the mean.

To calculate these confidence intervals, a new statistical distribution is needed. It is called the
chi-square distribution. The chi-square variable is similar to the t variable in that its distribution
is a family of curves based on the number of degrees of freedom. The symbol for chi-square is 2
(Greek letter chi, pronounced ki). Several of the distributions are shown in Figure 5.3, along
with the corresponding degrees of freedom. The chi-square distribution is obtained from the
values of (n - 1)s2/2 when random samples are selected from a normally distributed population
whose variance is 2.
A chi-square variable cannot be negative, and the distributions are skewed to the right. At about
100 degrees of freedom, the chi-square distribution becomes somewhat symmetric. The area
under each chi-square distribution is equal to 1.00, or 100%.

Figure 5.3: Chi-square distribution

Two different values are used in the formula because the distribution is not symmetric. One value
is found on the left side of the table, and the other is on the right.
Figure 5.4: Chi-square distribution for d.f = n 1

To find confidence intervals for variances and standard deviations, you must assume that the
variable is normally distributed.
The formulas for the confidence intervals are shown here. For variance,

( 1) 2 ( 1) 2
2 < 2 < 2 . = 1

For standard deviation,

( 1) 2 ( 1) 2
2 << 2 . = 1

Example 5.5
2 2
Find the values for and for a 90% confidence interval when n = 25.

Solution
2
To find , subtract 1 - 0.90 = 0.10 and divide by 2 to get 0.05.
2
To find , subtract 1 - 0.05 to get 0.95. Hence, use the 0.95 and 0.05 columns and the row
corresponding to 24 degrees of freedom (Table G)

Hence, the values are


2
= 36.415
2
= 13.848

Example 5.6
Find the 95% confidence interval for the variance and standard deviation of the glycerin content
of soaps manufactured if a sample of 20 soaps has a standard deviation of 1.6 milligrams.

Solution
Since = 0.05, the two critical values, respectively, for the 0.025 and 0.975 levels for 19 degrees
of freedom are 32.852 and 8.907. The 95% confidence interval for the variance is found by
substituting in the formula.
( 1) 2 2
( 1) 2
2 < < 2

(20 1)(1.6)2 (20 1)(1.6)2


< 2 <
32.852 8.907

1.5 < 2 < 5.5

For standard deviation,

1.5 < < 5.5

1.2 < < 2.3


CHAPTER 6: HYPOTHESIS TESTING

Statistical Hypothesis testing is a decision making process for evaluating claims about a
population. Hypotheses concerning parameters such as the mean and proportions can be
investigated.

Three methods used to test hypotheses are


1. Traditional method
2. P-value method
3. Confidence interval method

6.1 Steps in Hypothesis testing Traditional method

Every hypothesis testing situation begins with the statement of a hypothesis. There are two types
of statistical hypotheses for each situation: the null hypothesis and the alternative hypothesis.

The null hypothesis, H0, is a statistical hypothesis that states that there is no difference between
a parameter and a specific value, or that there is no difference between two parameters.

The alternative hypothesis, H1, is a statistical hypothesis that states the existence of a difference
between a parameter and a specific value, or that there is a difference between two parameters.

6.1.1 Stating the hypothesis

Lets consider the preamble and the following cases.

A chemist invents an additive but he doesnt know whether the additive will increase, have no
effect or will decrease the life of automobile battery. If the mean lifetime of the automobile
battery is 36 months, the following hypotheses can be written.

Case A: If the chemist wants to verify whether the additive will change the mean lifetime of
automobile battery, then

0 : = 36 1 : 36
This test is called a two-tailed test since the possible side effects of the additive will either
increase or decrease the lifetime of automobile battery.

Case B: If the chemist wants to verify whether the additive will increase the mean lifetime of
automobile battery, then

0 : 36 1 : > 36

This test is called right-tailed since the interest is to increase only.

Case C: If the chemist wants to verify whether the additive will decrease the mean lifetime of
automobile battery, then

0 : 36 1 : < 36

This test is called left-tailed since the interest is to decrease only.

Example 6.1
State the null and alternative hypotheses for each conjecture.

a. A researcher thinks that if expectant mothers use vitamin pills, the birth weight of their babies
will increase. The average birth weight of the population is 8.6 pounds.

b. An engineer hypothesizes that the mean number of defects can be decreased in a


manufacturing process of compact discs by using robots instead of humans for certain tasks. The
mean number of defective discs per 1000 is 18.

c. A psychologist feels that playing soft music during a test will change the results of the test.
The psychologist is not sure whether the grades will be higher or lower. In the past, the mean of
the scores was 73.

Solution

a. 0 : 8.6 1 : > 8.6


b. 0 : 18 1 : < 18
c. 0 : = 73 1 : 73
6.1.2 Testing the hypothesis

In hypothesis testing, the mean is computed from the data of the sample and is compared to the
population mean. Then, a decision is made to reject or not to reject the null hypothesis on the
basis of the value obtained from the statistical test. If the difference is significant, the null
hypothesis is rejected. If it is not, the null hypothesis is not rejected.

In hypothesis testing, there are four possible outcomes. The four outcomes are shown in figure
6.1.

H0 True H0 False

Reject Error Correct


H0 Type I decision

Do not
Correct Error
reject
H0
decision Type II

Figure 6.1: Possible outcomes of a hypothesis test

If the null hypothesis is true and it is rejected, a type I error is committed. On the other hand, if
the null hypothesis is not rejected when it is false, a type II error is committed.

The decision to reject or not to reject doesnt prove anything. The only way to prove something
statistically is to use the entire population, which, in most cases is not possible. The decision is
then made on the basis of probabilities.

The level of significance is the maximum probability of committing a type I error. This
probability is represented by . That is P(type I error) = .

The probability of a type II error is represented by . That is P(type II error) = . In most


hypothesis-testing situations, cannot easily be computed.
Next, the researcher decides what level of significance to use. With this, a critical value is
selected from a table for the appropriate test. The critical value determines the critical and
noncritical regions.

Example 6.2
Using the appropriate table, find the critical value(s) for each situation, draw the appropriate
diagram and indicate the critical and noncritical regions.

a. A left-tailed test with = 0.10


b. A two-tailed test with = 0.02
c. A right-tailed test with = 0.005.

Solution
a. For this left-sided test, the area between z and 0 = 0.5000 0.1000 = 0.4000. The z value for
0.3997 ( 0.4000) is 1.28. Since it is a left-sided test, z = - 1.28

b. Since it is a two-sided test, /2 = 0.01 is used. The area between z and 0 on either sides of the
mean = 0.5000 0.010 = 0.4900. The z value for 0.4901 (0.4900) is 2.33

c. Since it is a right-sided test, the area between z and 0 = 0.5000 0.0050 =0.4950. The z value
for 0.4950 is +1.28.

6.2 z test for mean

The z test is a statistical test for the mean of a population. It can be used when n 30, or when
the population is normally distributed and is known. The formula for the z test is

where = sample mean


= hypothesized population mean
= population standard deviation
n = sample size
The denominator is the standard error of the mean.

Example 6.3
A researcher reports that the average salary of petrochemical engineer in Ghana is more than
Gh 42,000 per year. A sample of 30 petrochemical engineers has a mean salary of Gh 43,260.
At = 0.05, test the claim that the petrochemical engineer earn more than Gh 42,000 a year.
The standard deviation of the population is Gh 5,230.

Solution

Step 1: State the hypotheses and identify the claim

0 : 42,000 1 : > 42,000()

Step 2: Find the critical value. Since the test is right-tailed and = 0.05, the critical value z =
+1.65.

Step 3: Compute the test value.

43,260 42,000
= = = 1.32
5230/30

Step 4: Make the decision. Since the test value (+1.32) is less than the critical value (+1.65), it is
in the noncritical region, the decision is, Do not reject the null hypothesis.

Step 5: Summarize the results. There is not enough evidence to support the claim that
petrochemical engineers in Ghana earn more than 42,000 a year.

Example 6.4
The Medical Rehabilitation Education Foundation reports that the average cost of rehabilitation
for stroke victims is $24,672. To see if the average cost of rehabilitation is different at a
particular hospital, a researcher selected a random sample of 35 stroke victims at the hospital and
found that the average cost of their rehabilitation is $25,226. The standard deviation of the
population is $3,251. At = 0.01, can it be concluded that the average cost of stroke
rehabilitation at a particular hospital is different from $24,672?
Solution

Step 1: 0 : = $24,672 1 : $24,672()

Step 2: Since the test is two-tailed and = 0.01, the critical values are +2.58 and -2.58

Step 3: The computed test value z = 1.01

Step 4: Do not reject the null hypothesis since the test value falls within the noncritical region.

Step 5: There is not enough evidence to support the claim that the average cost of rehabilitation
at the particular hospital is different from $24,672.

6.3 P-value method for hypothesis testing

The P-value (or probability value) is the probability of getting a sample statistic (such as the
mean) in the direction of the alternative hypothesis when the null hypothesis is true. In other
words, the P-value is the actual area under the standard normal distribution curve representing
the probability of a particular sample statistic occurring if the null hypothesis is true.

When the hypothesis test is two-tailed, the area in one tail is doubled. The following rules can be
used in the decision of hypothesis test.

1. If P-value , reject the null hypothesis


2. If P-value > , do not reject the null hypothesis

Example 6.5
A researcher wishes to test the claim that the average age of bodyguards in Accra is greater than
24 years. She selects a sample of 36 guards and finds the mean of the sample to be 24.7 years,
with a standard deviation of 2 years. Is there evidence to support the claim at = 0.05? Use the
P-value method.

Solution
Step 1: 0 : 24 1 : > 24()
Step 2: The computed test value z = 2.10
Step 3: From the standard normal distribution, at z = 2.10, the area is 0.4821. The P-value is the
area in the right tail. Therefore, P = 0.5000 0.4821 = 0.0179
Step 4: Since the P-value < , the decision is to reject the null hypothesis.
Step 5: There is not enough evidence to support the claim that the average age of bodyguards in
Accra is greater than 24 years.

Example 6.6
A researcher claims that the average wind speed in Kumasi is 8 miles/h. A sample of 32 days has
an average wind speed of 8.2 miles/h. The standard deviation of the sample is 0.6 mile/h. At =
0.05, is there enough evidence to reject the claim? Use the P value method.

Solution
Step 1: 0 : = 8() 1 : 8
Step 2: The computed test value z = 1.89
Step 3: From the standard normal distribution, at z = 1.89, the area is 0.4706. The P-value is the
area in the right tail. Therefore, area in right tail = 0.5000 0.4706 = 0.0294. The P-value = 2
0.0294 = 0.0588.
Step 4: Since the P-value > , the decision is not to reject the null hypothesis.
Step 5: There is not enough evidence to reject the claim that the average wind speed in Kumasi is
8 miles/h.

6.4 t-test for mean

The t-distribution has been explained in chapter 5. The t-test is a statistical test for the mean of a
population and is used when the population is normally or approximately normally distributed,
is unknown, and n < 30. The formula for the t-test is


= . = 1

Example 6.7
A job placement director claims that the average starting salary for nurses is 24,000. A sample
of 10 nurses has a mean of 23,450 and the standard deviation of 400. Is there enough evidence
to reject the directors claim at = 0.05?
Solution
Step 1: 0 : = 24,000 () 1 : 24,000
Step 2: At = 0.05 and d.f = 9, the critical t-values are +2.262 and -2.262.
Step 3: Compute the test value.

23,450 24,000
= = = 4.35
400/10

Step 4: Since t = -4.35 is in the critical region, the decision is to reject the null hypothesis.
Step 5: There is enough evidence to reject the claim that the starting salary of nurses is $24,000.

Example 6.8
A physician claims that joggers maximal volume oxygen uptake is greater than the average of
all adults. A sample of 15 joggers has a mean of 40.6 ml/kg and a standard deviation of 6 ml/kg.
If the average of all adults is 36.7 ml/kg, is there enough evidence to support the physicians
claim at = 0.05? Use the P-value method.

Solution
Step 1: 0 : 36.7 1 : > 36.7()
Step 2: Compute the test value.

40.6 36.7
= = = 2.517
6/15

Step 3: With d.f = 14 and a right-tailed test, the P-value will lie between = 0.025 and = 0.01.
The correct P-value is 0.012.
Step 4: Since P-value < , reject the null hypothesis.
Step 5: There is not enough evidence to support the claim that the joggers maximal volume
oxygen intake is greater than 36.7 ml/kg.

6.5 Confidence interval and hypothesis testing

Example 6.9
Sugar is packed in 5-kg bags. An inspector suspects the bags may not contain 5 kg of sugar. A
sample of 50 bags produces a mean of 4.6 kg and a standard deviation of 0.7 kg. Is there enough
evidence to conclude that the bags do not contain 5 kg as stated at = 0.05? Also, find the 95%
confidence interval of the true mean.

Solution
Step 1: 0 : = 5 1 : 5 ()
Step 2: Since the test is two-tailed and = 0.05, the critical values are +1.96 and -1.96
Step 3: The computed test value z = -4.04
Step 4: Since -4.04 < -1.96, the null hypothesis is rejected.
Step 5: There is enough evidence to support the claim that the bags do not weigh 5 kg.

The 95% confidence for the mean is given by


2 ( ) < < + 2 ( )

0.7 0.7
4.6 1.96 ( ) < < 4.6 + 1.96 ( )
50 50

4.4 < < 4.8

95% confidence interval of does not contain the hypothesized value = 5. Hence, there is
agreement between the hypothesis and the confidence interval.

Example 6.10
A researcher claims that if adult cattle feed on a special diet, they will have an average weight of
200 kg. A sample of 10 cattle has an average weight of 198.2 kg and a standard deviation of 3.3
kg. At = 0.05, can the claim be rejected? Also, find the 95% confidence interval of the true
mean.

Solution
0 : = 200 (claim) 1 : 200
Since the test is two-tailed, = 0.05 and d.f = 9, the critical values are +2.262 and -2.262.
The computed test value t = -1.72
The null hypothesis is not rejected. There is not enough evidence to reject the claim.
The 95% confidence for the mean is given by

2 ( ) < < + 2 ( )

3.3 3.3
198.2 2.262 ( ) < < 198.2 + 2.262 ( )
10 10

195.8 < < 200.6

95% confidence interval of does contain the hypothesized value = 200.


ASSIGNMENT 4

The mean salary of federal government employees on the General Schedule is $59,593. The
average salary of 30 state employees who do similar work is $58,800 with = $1500. At the
0.01 level of significance, can it be concluded that state employees earn on average less than
federal employees?
Solution
CHAPTER 7: TESTING THE DIFFERENCE BETWEEN TWO MEANS AND
VARIANCES

The basic concepts of hypothesis testing were explained in Chapter 6. With the z, t, and
2 tests, a sample mean and variance can be compared to a specific population mean and variance
to determine whether the null hypothesis should be rejected.
There are, however, many instances when researchers wish to compare two sample means, using
experimental and control groups. For example, the average lifetimes of two different brands of
bus tires might be compared to see whether there is any difference in tread wear. Two different
brands of fertilizer might be tested to see whether one is better than the other for growing plants.

In the comparison of two means, the same basic steps for hypothesis testing shown in Chapter 6
are used, and the z and t tests are also used. When comparing two means by using the t test, the
researcher must decide if the two samples are independent or dependent. Finally, two variances
can be compared by using an F test. The concepts of independent and dependent samples as well
as the F-test will be explained in this chapter.

7.1 Test the difference between two means for large samples

If a researcher is interested in comparing the means of two groups, the following hypotheses can
be written

0 : 1 = 2 1 : 1 2

where 1 = mean of group 1

2 = mean of group 2

Another way of stating the hypotheses for the situation is

0 : 1 2 = 0 1 : 1 2 0

Assumptions for testing the difference between 2 means

1. The samples must be independent of each other. That is, there should be no relationship
between the subjects in each sample.
2. The populations from where the samples were obtained must be normally distributed, and
the standard deviations of the variable must be known, or the sample sizes must be
greater than or equal to 30.

The formula for the z test for comparing two means from independent populations is

(1 2 ) (1 2 )
=
2 2
1 + 2
1 2

Such tests can also be one-tailed, using the following hypotheses.

Right-tailed Left-tailed
0 : 1 2 0 : 1 2 0 0 : 1 2 0 : 1 2 0
1 : 1 > 2 1 : 1 2 > 0 1 : 1 < 2 1 : 1 2 < 0

If 12 and 22 are not known, the variances obtained from each sample, 12 and 22 can be used but
both samples sizes must be 30 or more.

When one or both sample sizes are less than 30 and 1 and 2 are unknown, the t-test must be
used.

The formula for confidence interval for difference between two means for large samples is

12 22 12 22
(1 2 ) 2 + < 1 2 < (1 2 ) + 2 +
1 2 1 2

Example 7.1
A survey found that the average hotel room rate in Takoradi is Gh 88.42 and the average hotel
room rate in Kumasi is Gh 80.61. Assume that the data were obtained from two samples of 50
hotels each and that the standard deviations were Gh 5.62 and Gh 4.83 respectively. At =
0.05, can it be concluded that there is significant difference in the rates? Also, find the 95%
confidence interval for the difference between the means.
Solution
Step 1: 0 : 1 = 2 and 1 : 1 2 ()
Step 2: At = 0.05, the critical z values are +1.96 and -1.96
Step 3: The test value is
(1 2 ) (1 2 ) (88.42 80.61) (0)
= = = 7.45
2 2
2 22 5.62 + 4.83
1 50 50
1 + 2

Step 4: Reject the null hypothesis at = 0.05.


Step 5: There is enough evidence to support the claim that the means are not equal. Hence, there
is significant difference between the rates.

For 95% confidence interval and when z/2 = 1.96, use the formula

2 2 2 2
(1 2 ) 2 1 + 2 < 1 2 < (1 2 ) + 2 1 + 2
1 2 1 2

5.622 4.832
(88.42 80.61) (1.96) + < 1 2
50 50

5.622 4.832
< (88.42 80.61) + (1.96) +
50 50

5.76 < 1 2 < 9.86

Since the confidence interval does not contain 0, the decision is to reject the null hypothesis.

Example 7.2

A researcher hypothesizes that the average number of sports colleges offer for males is greater
than the average number of sports colleges offer for females. A sample of the number of sports
offered by colleges is shown. At = 0.10, is there enough evidence to support the claim? Use the
P-value method.
Males Females
6 11 11 8 15 6 8 11 13 8
6 14 8 12 18 7 5 13 14 6
6 9 5 6 9 6 5 5 7 6
6 9 18 7 6 10 7 6 5 5
15 6 11 5 5 16 10 7 8 5
9 9 5 5 8 7 5 5 6 5
8 9 6 11 6 9 18 13 7 10
9 5 11 5 8 7 8 5 7 6
7 7 5 10 7 11 4 6 8 7
10 7 10 8 11 14 12 5 8 5

Solution
Step 1: 0 : 1 2 and 1 : 1 > 2 ()
Step 2: For males, 1 = 8.6 and s1 = 3.3; and for females 2 = 7.9 and s2 = 3.3.
Step 3: The computed z value = 1.06.
Step 4: For z = 1.06, the area is 0.3554. Therefore, P = 0.5000 0.3554 = 0.1446
Step 5: Do not reject the null hypothesis.
Step 6: There is not enough evidence to support the claim that colleges offer more sports for
males than they do for females.

7.2 Testing the difference between two variances

For the comparison of two variances or standard deviations, an F test is used. If two independent
samples are selected at random from two normally distributed populations in which the variances
2
are equal (12 = 22 ) and if the variances 12 and 22 are computed as 12 , the sampling distribution
2

of the variances is called the F distribution (see figure 7.1).

The formula for F test is given as

12
=
22
Where 12 is the larger of the two variances. The F test has two terms for degrees of freedom: the
numerator, n1 1, and the denominator, n2 1, where n1 is the sample size from which the larger
variance was obtained.

Figure 7.1: Shapes of several curves for F distribution

7.2.1 Characteristics of F-distribution

1. The values of F cannot be negative, because variances are always positive or zero.
2. The distribution is positively skewed.
3. The mean value of F is approximately equal to 1.
4. The F distribution is a family of curves based on the degrees of freedom of the variance of the
numerator and the degrees of freedom of the variance of the denominator.

Note the following when using the F test:


1. The larger variance should always be designated as s12.
2. For a two-tailed test, the value must be divided by 2 and the critical value be placed on the
right side of the F test.
3. When the degrees of freedom cannot be found in tables, the closest value on the smaller side
should be used.
Example 7.3
A medical researcher wishes to see whether the variances of the heart rates (in beats/min) of
smokers are different from the variances of heart rates of people who do not smoke. Two
samples were selected, and the data are as shown. Using = 0.05, is there enough evidence to
support this claim?
Smokers Nonsmokers
n1 = 26 n2 = 18
12 = 36 22 = 10

Solution
Step 1: 0 : 12 = 22 and 1 : 12 22 ()
Step 2: Since = 0.05, a value of 0.025 will be used. Here, the d.f.N = 25 and d.f.D = 17. The
critical f value is 2.56 (Table H) (d.f.N = 24 was used)

Step 3: The test value is

12 36
= 2= = 3.6
2 10
Step 4: The decision is to reject the null hypothesis since 3.6 > 2.54.
Step 5: There is enough evidence to support the claim that the variances of heart rates of smokers
and nonsmokers are different.

Example 7.4
The CEO of an airport hypothesizes that the variance for the number of passengers for American
airports is greater than the variance for then number of passengers for foreign airports. At =
0.10, is there enough evidence to support the hypothesis? The data in millions of passengers per
year is shown for selected airports. Use the P-value method. Assume the variable is normally
distributed.
American airports Foreign airports
36.8 73.5 60.7 51.2
72.4 61.2 42.7 38.6
60.5 40.1
Solution
Step 1: 0 : 12 22 and 1 : 12 > 22 ()
Step 2: The calculated variances are 12 = 246.38 and 22 = 95.87. Therefore, F-test value = 2.57.
Step 3: Using d.f.N = 5 and d.f.D = 3, the table below can be constructed.

0.10 0.05 0.025 0.01 0.005


F 5.31 9.01 14.88 28.24 45.39

Since 2.57 < 5.31, the P-value is greater than 0.01. The approximate value is 0.234.

Step 4: The decision is not to reject the null hypothesis.


Step 5: There is not enough evidence to support the claim that the variance in the number of
passengers for American airport is greater than the variance in the number of passengers for
foreign airports.

7.3 Testing the difference between two means: small independent samples

When one or both samples sizes are less than 30, 1 and 2 are unknown, the samples are
independent, the t-test must be used.

There are two different options for the use of t-test. One option is used when the variances of the
populations are not equal, and the other is used when the variances are equal. To determine
whether the two sample variances are equal, the F-test is used as explained in section 7.2.

The formulas for testing the difference between two means when the samples are smaller are
given as

1. When the variances are assumed to be unequal:

(1 2 ) (1 2 )
=
2 2
1+ 2
1 2

where the d.f are equal to the smaller of n1 1 or n2 1.

2. When the variances are assumed to be equal:


(1 2 ) (1 2 )
=
(1 1)12 + (2 1)22 1 1
+
1 + 2 2 1 2

where d.f are equal to n1 + n2 2.

The formulas for confidence interval for difference between two means for smaller samples are

1. When the variances are assumed to be unequal:

2 2 2 2
(1 2 ) 2 1 + 2 < 1 2 < (1 2 ) + 2 1 + 2
1 2 1 2

d.f = the smaller value of n1 1 or n2 1.

2. When the variances are assumed to be equal:

(1 1)12 + (2 1)22 1 1
(1 2 ) 2 . + < 1 2
1 + 2 2 1 2

(1 1)12 + (2 1)22 1 1
< (1 2 ) + 2 . +
1 + 2 2 1 2

d.f = n1 + n2 2.

Example 7.5
The average size of a farm in Nkoranza, Brong Ahafo region, is 191 acres. The average size of a
farm in New Edubiase, Ashante region, is 199 acres. Assume that the data were obtained from
two samples with standard deviations of 38 acres and 12 acres, respectively, and sample sizes of
8 and 10, respectively. Can it be concluded at = 0.05 that the average size of the farms in the
two regions is different? Assume the populations are normally distributed. Also, calculate the
95% confidence interval for the means.

Solution
The approach here is to use the F-test to determine whether or not the variances are equal. The
null hypothesis states that the variances are equal.
Step 1: 0 : 12 = 22 () and 1 : 12 22
Step 2: The F critical value obtained from tables for 7 and 9 d.f and = 0.05 is 4.82. (Remember
for two-tailed test, /2 is used for F test!).
Step 3: The computed F test value = 10.03.
Step 4: Decision is to reject the null hypothesis since 10.03 > 4.82. Hence the variances are not
equal.
Step 5: State the hypotheses and identify the claim from the means.

0 : 1 = 2 0 : 1 2 ()

Step 6: Find the critical values. Since the test is two-tailed, = 0.05, variances are not equal, the
d.f are the smaller of n1 1 and n2 1. In this case, the d.f = 8 1 = 7. Hence, from tables, the
critical t values are +2.365 or -2.365.
Step 7: Compute the t test value. Since variances are unequal,

(1 2 ) (1 2 ) (191 199) (0)


= = = 0.57
2 2
2 22 38 + 12
1 8 10
1 + 2

Step 8: Decision is not to reject the null hypothesis since -0.57 > 2.365
Step 9: There is not enough evidence to support the claim that the average size of the farms is
different.

For 95% confidence level, the internal is calculated as

2 2 2 2
(1 2 ) 2 1 + 2 < 1 2 < (1 2 ) + 2 1 + 2
1 2 1 2

382 122 382 122


(191 199) (2.365) + < 1 2 < (191 199) + (2.365) +
8 10 8 10

41.02 < 1 2 < 25.02


Since 0 is contained in the interval, the decision is not to reject the null hypothesis.

Example 7.6
A researcher wishes to determine whether the salaries of professional nurses employed by
private hospitals are higher than those of nurses employed by government owned hospitals. She
selects a sample of nurses from each type of hospital and calculates the means and standard
deviations of their salaries. At = 0.01, can she conclude that the private hospitals pay more than
the government hospitals? Assume that the populations are approximately normally distributed.
Use the P-value method.

Private Government
1 = 26,800 2 = 25,400
s1 = 600 s2 = 450
n1 = 10 n2 = 8
Solution

The F-test will be used to determine whether or not the variances are equal. The null hypothesis
states that the variances are equal.
Step 1: 0 : 12 = 22 () and 1 : 12 22
Step 2: Compute the F critical value. The compute value, F = 1.78
Step 3: Find the P-value. Using d.f.N = 9 and d.f.D = 7, the approximate P-value = 0.460.
Step 4: Decision is not to reject the null hypothesis since 0.460 > 0.01. Hence the variances are
equal.
Step 5: State the hypotheses and identify the claim from the means.

0 : 1 2 0 : 1 > 2 ()

Step 6: Compute the t test value. Since variances are equal,


(1 2 ) (1 2 ) (26,800 25,400) (0)
= =
2 2
( 1)12 + (2 1)22 1 1 (10 1)600 + (8 1)450 1 + 1
1 + 10 + 8 2 10 8
1 + 2 2 1 2

= 5.47
Step 7: From tables, the P-value for t = 5.47 with d.f = 16 (i.e., 10 + 8 2) is 0.00002.
Step 8: Decision is to reject the null hypothesis since P-value < 0.01.
Step 9: There is enough evidence to support the claim that the salaries paid to nurses employed
by private hospitals are higher than those paid to nurses employed by government owned
hospitals.

7.4 Testing the difference between two means: small dependent samples

When the samples are dependent, a special t test for dependent means is used. This test employs
the difference in values of the matched pairs. The hypotheses are as follows, where D
represents differences between the expected mean of the matched pairs.

Two-tailed Left-tailed Right-tailed


0 : = 0 0 : 0 0 : 0
1 : 0 1 : < 0 1 : > 0

The general procedure for finding the test value involves the following:

Firstly, find the differences (D) in the values of the data pairs.

) of the differences, using the formula


Secondly, find the mean (

Thirdly, find the standard deviation (sD) of the differences using

2 ( )2

=
1

Next, find the estimated standard error ( ) of the differences, using

Finally, find the test value, using




= . = 1

The confidence interval for the mean difference is


/2
+ /2
< < . = 1

Example 7.7
A physical education director claims by taking a special vitamin, a weight lifter can increase his
strength. Eight athletes are selected and given a test of strength, using the standard bench press.
After two weeks of regular training, supplemented with the vitamin, they are tested again. Test
the effectiveness of the vitamin regimen at = 0.05. Each value in the data that follow represents
the maximum number of pounds the athlete can bench press. Assume that the variable is
approximately normally distributed.

Athlete 1 2 3 4 5 6 7 8
Before (x1) 210 230 182 205 262 253 219 216
After (x2) 219 236 179 204 270 250 222 216

Solution

Step 1: State the hypotheses and identify the claim

0 : 0 1 : < 0 ()

Step 2: Find the critical value. At = 0.05 and d.f = 7, the critical value of the left-tailed test = -
1.895.

Step 3: Compute the test value using the completed table below

19
=
= = 2.375
8
Before (x1) After (x2) D = x1 x2 D2 = (x1 x2)2
210 219 -9 81
230 236 -6 36
182 179 +3 9
205 204 +1 1
262 270 -8 64
253 250 +3 9
219 222 -3 9
216 216 0 0
= 19 2 = 209

2 (19)2
2 ( ) 209
= 8
= = 4.84
1 81


2.375 0
= = = 1.388
4.84/8

Step 4: Decision is not to reject the null hypothesis at = 0.05 since -1.388 > -1.895
Step 5: There is not enough evidence to support the claim that the vitamin increases the strength
of weight lifters.

Example 7.8
A dietitian wishes to see if a persons cholesterol level will change if the diet is supplemented by
a certain mineral. Six subjects were presented and then took the mineral supplement for six-week
period. The results are shown in the table (Cholesterol level is measured in mg/dl). Can it be
concluded that the cholesterol level has been changed at = 0.10? Assume the variable is
approximately normally distributed. Also, find the 90% confidence interval.

Subject 1 2 3 4 5 6
Before (x1) 210 235 208 190 172 244
After (x2) 190 170 210 188 173 228

Solution
Step 1: State the hypotheses and identify the claim

0 : = 0 1 : 0 ()
Step 2: Find the critical value. At = 0.10 and d.f = 5, the critical value = 2.015.

Step 3: Compute the test value using the completed table below.

Before (x1) After (x2) D = x1 x2 D2 = (x1 x2)2


210 190 +20 400
235 170 +65 4225
208 210 -2 4
190 188 +2 4
172 173 -1 1
244 228 +16 256
= 100

2 = 4890
100
=
= = 16.7
6

2 (100)2
2 ( ) 4890
= 6 = 25.4
=
1 61


16.7 0
= = = 1.610
25.4/6

Step 4: Decision is not to reject the null hypothesis at = 0.10 since 1.610 falls within the
noncritical region.
Step 5: There is not enough evidence to support the claim that the mineral changes a persons
cholesterol level.

For 95% confidence interval with d.f = 5, we have


/2
+ /2
< <

25.4 25.4
16.7 2.015 < < 16.7 + 2.015
6 6

4.19 < < 37.59

Since 0 is contained in the interval, the decision is not to reject the null hypothesis.
CHAPTER 8: CORRELATION AND REGRESSION

Correlation is a statistical method used to determine whether a relationship between variables


exists or not. Regression is a statistical method used to describe the nature of the relationship
between variables that is, positive or negative, linear or nonlinear.

Statisticians use a measure called coefficient of correlation to determine whether two or more
variables are related and also to determine the strength of the relationship between the variables.
There are two types of relationships: simple and multiple. In simple relationship, there are only
two variables under study. In multiple relationship, there are many variables under study.

8.1 Scatter plots

A scatter plot is a graph of ordered pairs (x,y) of numbers consisting of the independent variable,
x, and the dependent variable, y. The independent variable is the variable in the regression that
can be controlled or manipulated. A dependent variable is the variable in the regression that
cannot be controlled or manipulated.

A scatter plot is a visual way to describe the nature of the relationship between an independent
and dependent variables.

Example 8.1

(a) Construct a scatter plot for the data obtained in a study of age and systolic blood pressure of 6
randomly selected subjects. The data are shown in the following table.

Subject Age, x Pressure, y


A 43 128
B 48 120
C 56 135
D 61 143
E 67 141
F 70 152

(b) Construct a scatter plot for the data obtained in a study on the number of absences and final
grades of 7 randomly selected students from a statistics class. The data are shown below.
Students # of absences, x Final grade, y
A 6 82
B 2 86
C 15 43
D 9 74
E 12 58
F 5 90
G 8 78

(c) Construct a scatter plot for the data obtained in a study on the number of hours 9 people
exercise each week and the amount of milk (in ounces) each person consumes per week.

Subject # of absences, x Final grade, y


A 3 48
B 0 8
C 2 32
D 5 64
E 8 10
F 5 32
G 10 56
H 2 72
I 1 48

Solution

160 100

150 90
Final grade

80
140
Pressure

70
130
60
120 50

110 40
40 50 60 70 80 0 5 10 15 20
Age Number of absences

(a) (b)
80
70

Ounces of milk
60
50
40
30
20
10
0
0 2 4 6 8 10
Hours of excersie

(c)

It can be seen from solution example 8.1 (a) that the plot suggests a positive relationship since as
a persons age increases with blood pressure. The plot in solution example 8.1 (b) suggests a
negative relationship. These two plots suggest linear relationship since the point seem to fit a
straight line, although not perfectly. Finally, the plot in solution example 8.1 (c) shows no
specific relationship.

8.2 Correlation
8.2.1 Correlation coefficient

There are several types of correlation coefficients. They show relationships between variables.
The one used here is called the Pearson product moment correlation coefficient (PPMC). It is
given as

( ) ( )( )
= = #
[( 2 ) ( )2 ][( 2 ) ( )2 ]

The range of the correlation coefficient is from -1 to +1. If there is a strong positive linear
relationship between the variables, the value of r will be close to +1. If there is a strong negative
linear relationship between the variables, the value of r will be close to -1. When there is no
relationship between the variables or only a weak relationship, the value of r will be close to 0.
Example 8.2
Compute the correlation coefficients for the data in example 8.1 (a), (b), and (c).

Solution
(a) Construct and complete the table as shown

Subject Age, x Pressure, y xy x2 y2


A 43 128 5,504 1,849 16,384
B 48 120 5,760 2,304 14,400
C 56 135 7,560 3,136 18,225
D 61 143 8,723 3,721 20,449
E 67 141 9,447 4,489 19,881
F 70 152 10,640 4,900 23,104
= 345 2
= 819 = 47,634 = 20,399 2
= 112,443

The correlation coefficient is calculated as

( ) ( )( )
=
[( 2 ) ( )2 ][( 2 ) ( )2 ]

(6)(47,634) (345)(819)
= = 0.897
[(6)(20,399) 3452 ][(6)(112,443) 8192 ]

This suggests a strong positive relationship between age and blood pressure.

(b)

Subject # of absences, x grade, y xy x2 y2


A 6 82 492 36 6,724
B 2 86 172 4 7,396
C 15 43 645 225 1,849
D 9 74 666 81 5,476
E 12 58 696 144 3,364
F 5 90 450 25 8,100
G 8 78 624 64 6,084
= 57 = 511 = 3,745 2
= 579 2
= 38,993

(7)(3,745) (57)(511)
= = 0.944
[(7)(579) 572 ][(7)(38,993) 5112 ]
This suggests a strong negative relationship between a students final grade and number of
absence a student makes.

(c)

Subject Hours, x Amount, y xy x2 y2


A 3 48 144 9 2,304
B 0 8 0 0 64
C 2 32 64 4 1,024
D 5 64 320 25 4,096
E 8 10 80 64 100
F 5 32 160 25 1,024
G 10 56 560 100 3,136
H 2 72 144 4 5,184
I 1 48 48 1 2,304
= 36 = 370 = 1,520 2 = 232 2 = 19,236

(9)(1,520) (36)(370)
= = 0.067
[(9)(232) 362 ][(9)(19,236) 3702 ]

This value indicates a weak positive correlation.

8.2.2 Significance of correlation coefficient

When there is a strong relationship between variables, the value of r is close to +1 or -1. It is
close to zero when there is no or weak relationship. When r 0, there are two possibilities: either
the value of r is high enough to conclude that there is significant linear relationship between the
variables or the value of r is due to chance.

The population correlation coefficient is computed from taking all possible (x, y) pairs: it is
denoted as (rho).

In hypothesis testing, one of the following is true:

H0: = 0 This null hypothesis means that there is no correlation between the x and y
variables in the population.

H1: 0 This alternative hypothesis means that there is a significant correlation between
the x and y variables in the population.
When the null hypothesis is rejected at a specific level, it means that there is a significant
difference between r and 0. When the null hypothesis is not rejected, it means the value of r is
not significantly different from 0 and it is due to chance.

The formula for the t test for correlation coefficient is given as

2
= . = 2
1 2

Example 8.3 (Method 1)


Test the significance of the correlation coefficient found in example 8.1 (a). Use = 0.05 and r =
0.897.

Solution
Step 1: State the hypotheses
0 : = 0 1 : 0

Step 2: Find the critical values. Since = 0.05 and there are 6 2 = 4 d.f, the critical values
obtained from tables are 2.776.

Step 3: Compute the test value

2 62
= = (0.897) = 4.059
1 2 1 0.8972

Step 4: Make your decision: Reject the null hypothesis since 4.059 > 2.776.

Step 5: Summarize the results: There is a significant relationship between age and blood pressure

Example 8.4 (Method 2)


Using the appropriate tables, test the significance of the correlation coefficient, r = 0.067
obtained in example 1.8 (c), at = 0.01.

Solution

Since the sample size is 9, there are 7 d.f. When = 0.01, the value obtained from tables is 0.798
(Table I). Since r = 0.067, the null hypothesis is not rejected. Hence, there is not enough
evidence to say that there is a significant linear relationship between the variables.

Note: The P-value (Method 3) can also be used.

8.3 Regression

In order to determine the mathematical relationship between two variables, it is important to


obtain the line of best fit. The line of best fit gives the regression equation. Best fit means that
the sum of the squares of the vertical distances from each point to the line is at a minimum. Line
of best fit enables the researcher to see the trend and make predictions on the basis of the data.
When r is not significant, making predictions using the regression line is meaningless.

8.3.1 Line of best fit

Consider a data set that must be fitted to a straight line. The equation of a straight line is y = a0 +
a1x. The deviation, d, between the data set and the values from the equation is

= [ ]2 = [ (0 + 1 )]2

Also,

= 2 [ (0 + 1 )] (1)
0

= 2 [ (0 + 1 )] ()
1
At the minimum point,

=0 =0
0 1
Hence,

0 + 1 =

0 + 1 2 =

Solving for a0 and a1 simultaneously, we have


( )( 2 ) ( )( )
0 =
( 2 ) ( )2

( ) ( )( )
1 =
( 2 ) ( )2

This method of obtaining the regression equation is known as least squares method.

Example 8.5
Find the equation of the regression line for the data in example 8.1 (a).

Solution
The values needed for the equation are

n = 6, = 345 , = 819 , = 47,634 , 2 = 20,399 , 2 = 112,443

Substituting into the formulas, we have

( )( 2 ) ( )( ) (819)(20,399) (345)(47,634)
0 = = = 81.048
( 2 ) ( )2 (6)(20,399) (345)2

( ) ( )( ) (6)(47,634) (345)(819)
1 = = = 0.964
( 2 ) ( )2 (6)(20,399) (345)2

Hence, the equation of the regression line is y = 81.048 + 0.964x.

160

150

140
Pressure

130

120

110
40 50 60 70 80
Age
8.4 Coefficient of determination and standard error of estimate

8.4.1 Types of variation for the regression model

The total variation is given as

( )2 = ( )2 + ( )2

( )2 = explained variation. Most of the variations can be explained by the relationship.

( )2 = unexplained variation. This variation is due to chance.

When all the data values fall on the regression line, ( )2 = ( )2 and r will +1 or -1.

8.4.2 Coefficient of determination

The coefficient of determination is the ratio of the explained variation to the total variation and it
is denoted as r2. That is


2 =

The coefficient of determination can also be obtained by squaring the correlation coefficient. The
range of r2 values span from -1 to +1 and has the same significance as correlation coefficient. In
modern statistical software such as Spreadsheet programs, Matlab, Mini Tab, SPSS, etc.,
coefficient of determination is preferred compared to correlation coefficient.

8.4.3 Standard error of estimate

The standard error estimate, Sest, is the standard deviation of the observed y values about the
predicted y values. It is given as

( )2
=
2

8.4.4 Residual plot


The values are called residuals (sometimes called the prediction errors). These values
can be plotted with the x values, and the plot, called a residual plot, can be used to determine
how well the regression line can be used to make predictions.

To interpret a residual plot, you need to determine if the residuals form a pattern. Consider the
figures below. If the residual values are more or less evenly distributed about the line =
0, as shown in figure 8.1 (a), then the relationship between x and y is linear and the regression
line can be used to make predictions.

Figure 8.1: Types of residual plots

Figure 8.1 (b) shows that the variance of the residuals increases as the values of x increase. This
means that the regression line is not suitable for predictions.
Figure 8.1 (c) shows a curvilinear relationship between the x values and the residual values;
hence, the regression line is not suitable for making predictions.
Figure 8.1 (d) shows that as the x values increase, the residuals increase and become more
dispersed. This means that the regression line is not suitable for making predictions.

Example 8.6
A researcher collects the following data and determines that there is a significant relationship
between the age of a copy machine and its monthly maintenance cost. The regression equation is

y = 55.57 + 8.13x. Find

(a) the coefficient of determination.


(b) the standard error of estimate.

Machine Age, x (years) Monthly cost, y


A 1 62
B 2 78
C 3 70
D 4 90
E 5 93
F 6 103

Solution
Construct a table as shown
x y y y y (y y)2 (y -
)2 )2
(y -
1 62 63.70 -1.70 2.89 359.7344 427.1111
2 78 71.83 6.17 38.069 117.4333 21.7778
3 70 79.96 -9.96 99.202 7.3260 160.4444
4 90 88.09 1.91 3.648 29.4125 53.7778
5 93 96.22 4.91 24.108 183.6928 106.7778
6 103 104.35 -1.35 1.823 470.1669 413.4444

= 496 , ( )2 = 169.7392, ( )2 = 1167.766, ( )2 = 1183.333

Also,

496
= = = 82.67
6

1167.766
2 = = = 0.9868
1183.333

(b) Substitute in the formula and find Sest.


( )2 169.7392
= = = 6.51
2 62

The standard deviation of the observed values about the predicted values is 6.51.

8.5 Equations of approximating curves

Apart from straight lines, sometimes, engineers must fit data to other types of curves. The
variables, x and y are the independent and dependent variables respectively. Here are some
common types of approximating curves and their equations.

Straight line = 0 + 1

Parabola, or quadratic curve = 0 + 1 + 2 2

Cubic curve = 0 + 1 + 2 2 + 3 3

nth-degree curve = 0 + 1 + 2 2 + +

Example 8.7
Wheat is grown on 9 equal-sized plots. The amount of fertilizer put on each plot is given in the
table below. Fit a parabolic curve y = a0 + a1x + a2x2 to the data.

# Amount of Fertilizer (x) # Amount of Fertilizer (x)


Wheat (y) Wheat (y)
1 2.4 1.2 6 5.2 5.0
2 3.4 2.3 7 4.9 5.5
3 4.4 3.3 8 4.4 6.1
4 5.1 4.1 9 3.9 6.9
5 5.5 4.8

Solution
For the parabolic equation, = 0 + 1 + 2 2 , the following equations can be derived

= 0 + 1 + 2 2
= 0 + 1 2 + 2 3

2 = 0 2 + 1 3 + 2 4

Construct the table below

2 2 3 4
1.2 2.4 1.44 2.88 3.456 1.728 2.0736
2.3 3.4 5.29 7.82 17.986 12.167 27.9841
3.3 4.4 10.89 14.52 47.916 35.937 118.5921
4.1 5.1 16.81 20.91 85.731 68.921 282.5761
4.8 5.5 23.04 26.40 126.720 110.592 530.8416
5.0 5.2 25.00 26.00 130.000 125.000 625.0000
5.5 4.9 30.25 26.95 148.225 166.375 915.0625
6.1 4.4 37.21 26.84 163.724 226.981 1384.584
6.9 3.9 47.61 26.91 185.679 328.509 2266.712
39.2 39.2 197.54 179.23 909.437 1076.21 6153.426

Simplified as:

39.2 = 90 + 39.21 + 197.542

179.23 = 39.20 + 197.541 + 1076.212

909.437 = 197.540 + 1076.211 + 6153.4262

Solving these equations simultaneously will give

a0 = -0.2009, a1 = 2.2659, and a2 = -0.2421

Hence, the equation of the regression line is = 0.2009 + 2.2659 0.2421 2

8.6 Multiple linear relationships

The degree of relationship existing between three or more variables is called multiple
correlation. The fundamental principles involved in problems of multiple correlation are
analogues to those of simple correlation as explained in section 8.3.

The general form of the multiple regression equation with k independent variables is
= 0 + 1 1 + 2 2 + +

For two independent variables, the formula for the multiple correlation coefficient, R, is

2 + 2 2
1 1 1 . 2 . 1 2
=
1 21 2

Where ryx1 is the value of the correlation coefficient for the variables y and x1; ryx2 is the value of
the correlation coefficient for the variables y and x2; and rx1x2 is the value of the correlation
coefficient for the variables x1 and x2.

Example 8.8
A nursing instructor wishes to see whether a students grade point average and age are related to
the students score on the state board nursing exams. She selects 5 students and obtains the
following data.
Student GPA, x1 Age, x2 Score, y
A 3.2 22 550
B 2.7 27 570
C 2.5 24 525
D 3.4 28 670
E 2.2 23 490

(a) Find the regression equation


(b) Find the regression coefficient R.

Solution
(a) The regression equation is = 0 + 1 1 + 2 2 . The following equations can be written,

= 0 + 1 1 + 2 2

1 = 0 1 + 1 12 + 2 1 2

2 = 0 2 + 1 1 2 + 2 22
Construct the table below


3.2 22 550 1760 12100 10.24 484 70.4
2.7 27 570 1539 15390 7.29 729 72.9
2.5 24 525 1312.5 12600 6.25 576 60.0
3.4 28 670 2278 18760 11.56 784 95.2
2.2 23 490 1078 11270 4.884 529 50.6
14.0 124 2805 7967.5 70120 40.18 3102 349.1

Hence,

2805 = 50 + 141 + 1242

7967.5 = 140 + 40.181 + 349.12

70120 = 1240 + 349.11 + 31022

Solving the equations simultaneously will give

a0 = -44.810, a1 = 87.64, and a2 = 14.533

Hence, the equation of the regression line is = 44.810 + 87.6401 + 14.5332

(b)

The values of the correlation coefficients are

1 = 0.845

2 = 0.791

1 2 = 0.371

Substituting in the formula, we have

2 + 2 2
1 1 1 . 2 . 1 2
=
1 21 2
(0.845)2 + (0.791)2 2(0.845)(0.791)(0.371)
= = 0.98
1 0.3712

8.7 Linearization of nonlinear equations

Nonlinear equations can be linearized by:

i. Series expansion: an approximate method


ii. Transformation or grouping of variables such that the transformed function assumes a linear
form.

Illustration 1: xy =5
1
Linearized form: = 5 ()

A plot of y against 1/x is a straight line with 5 as gradient.

Illustration 2:
2
=( )+4

Linearized form:
2
=
( 4)
A plot y2/(y 4) against x is a straight line with 1 as the slope.

Illustration 3: y = abx

Linearized form: log y = log a + x log b

A plot of log y against x is a straight line with log b as slope and log a as intercept.

Illustration 4: y = axb
Linearized form: log y = log a + b log x

A plot of log y against log x is a straight line with b as slope and log a as intercept.

Example 8.9
An index of the rate of growth of microorganism is the generation time (g). In the logarithmic
phase of microbial growth, number of microorganism (N) change with of growth (t) according
to:

= 0 2/

Find the generation time of a bacterial culture that shows the following numbers with time of
growth:

Number (N) Time of growth (mins)


980 0
1700 10
4000 30
6200 40

Solution

= 0 + ()2 = 0 + 1

Where y = log N, x = t, a = slope = log 2/g, b = log N0

= =
0 2.9912 0 0 8.947
10 3.2304 100 32.304 10.436
30 3.6021 900 108.062 12.975
40 3.7923 1600 151.696 14.382
80 13.6161 2600 292.062 46.740
Applying the linear regression formula,

( )( 2 ) ( )( ) (13.6161)(2600) (80)(292.062)
0 = = = 3.010
( 2 ) ( )2 (4)(2600) (80)2

( ) ( )( ) (4)(292.062) (80)(13.6161)
1 = 2) 2
= 2
= 0.01974 1
( ( ) (4)(2600) (80)

Therefore,
log 2
= 0.01974

2
= = 15.2
0.01974 1
ASSIGNMENT 5

The data shown are the number of items of a new product sold over a period of 15 months at a
certain store.

Month, x 1 3 6 8 10 12 15
# of items sold, y 10 12 15 19 20 21 21

Fit the following equations to the data above. In each case, calculate the coefficient of
determination and explain which equation is better.

(i) =

(ii) = +

Solution

Vous aimerez peut-être aussi