Chapter 2
A frequency distribution is the organization of raw data (data in original form) in table form,
using classes and frequencies.
For example, suppose a researcher wished to study the number of kilometres that the
employees of a factory travel to work each day. The researcher would first have to collect the
data by asking each employee the approximate distance from his/her home to the factory. When
data are collected in original form, they are called raw data. In this case, the raw data are:
The researcher organizes the data by constructing a frequency distribution; the frequency is
the number of values in a specific class of the distribution. For this set of data, a frequency
distribution is shown as:
Notes
1. This frequency distribution has 6 classes.
2. For the first class, the class lower limit is 1 and the class upper limit is 3.
3. The class width is 3 (the class width for a class in a frequency distribution is found by
subtracting the lower (or upper) class limit of one class from the lower (or upper) class
limit of the next class).
Class boundaries: the class boundaries are used to separate the classes so that there are no
gaps in the frequency distribution. The gaps are due to the limits; for example, there is a gap
between 3 and 4.
The basic rule of the class boundaries is that the class limits should have the same decimal
place value as the original data, but the class boundaries should have one additional place
value and end in a 5. For example, if the values in the data set are whole numbers, such as 34,
32, 36, the limits of a class might be 31-37, and the boundaries are then 30.5-37.5.
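The rule above can be sketched in a few lines of Python. This is only an illustration: the function name class_boundaries and its unit parameter are assumptions made here, not part of the notes, and the boundary is taken to lie half a recording unit outside each class limit.

```python
from decimal import Decimal

def class_boundaries(lower_limit, upper_limit, unit="1"):
    """Return (lower_boundary, upper_boundary) for one class.

    `unit` (hypothetical parameter) is the precision of the recorded data,
    given as a string: "1" for whole numbers, "0.1" for one decimal place.
    Each boundary lies half a unit outside the corresponding class limit.
    """
    half = Decimal(unit) / 2
    lower = Decimal(str(lower_limit)) - half
    upper = Decimal(str(upper_limit)) + half
    return lower, upper

# Whole-number data with class limits 31-37, as in the text.
low, high = class_boundaries(31, 37)
print(low, high)  # 30.5 37.5
```

Decimal arithmetic is used so that boundaries such as 30.5 or 2.45 are represented exactly rather than as binary floating-point approximations.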
Example 2
The average quantitative entrance examination scores for the top 30 graduate schools of
engineering are listed below. Construct a frequency distribution with six classes.
Solution
We follow the same steps as in Example 1. The lowest value is 746 and the highest value is 780,
so the range is R = 780 - 746 = 34. The class width is therefore 34/6 ≈ 5.67, rounded up to
6. The frequency distribution is thus given by:
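The class-width arithmetic of this solution can be checked with a short sketch. The function name class_width is an assumption for illustration; it simply divides the range by the requested number of classes and rounds up.

```python
import math

def class_width(lowest, highest, num_classes):
    """Range divided by the number of classes, rounded up to a whole number."""
    data_range = highest - lowest
    return math.ceil(data_range / num_classes)

# Example 2: lowest score 746, highest score 780, six classes.
print(class_width(746, 780, 6))  # 6
```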
The histogram is a graph that displays the data using vertical bars of various heights to
represent the frequencies of the classes.
Consider Example 1, which has the following frequency distribution for the record high
temperatures of each of the 50 states.
Class boundaries    Frequency    Class midpoint
99.5-104.5          2            102
104.5-109.5         8            107
109.5-114.5         18           112
114.5-119.5         13           117
119.5-124.5         7            122
124.5-129.5         1            127
129.5-134.5         1            132
A frequency distribution summarizes the data by grouping them into different classes or
intervals. Dividing each class frequency by the total number of observations gives the
proportion of the set of observations in each of the classes. A table listing relative
frequencies is called a relative frequency distribution. The relative frequency distribution for the
data of the above table, showing the midpoints of each class interval, is given in the table below.
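As a minimal sketch of this calculation, the snippet below divides each class frequency by the total. It uses the seven high-temperature class frequencies and midpoints listed earlier in this chapter as illustrative input; the function name relative_frequencies is an assumption.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the total number of observations."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

# Class midpoints and frequencies from the high-temperature distribution.
midpoints = [102, 107, 112, 117, 122, 127, 132]
frequencies = [2, 8, 18, 13, 7, 1, 1]

for midpoint, rel in zip(midpoints, relative_frequencies(frequencies)):
    print(f"{midpoint:>5}  {rel:.2f}")
```

The relative frequencies always sum to 1, which is a quick check that no class was missed.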
The information provided by a relative frequency distribution in tabular form is easier to grasp if
presented graphically. Using the midpoints of each interval and the corresponding relative
frequencies, we construct a relative frequency histogram as shown in Figure 1 below.
A distribution is said to be symmetric if it can be folded along a vertical axis so that the two
sides coincide. A distribution that lacks symmetry with respect to a vertical axis is said to be
skewed. The distribution illustrated in Figure 3(a) is said to be skewed to the right since it has a
long right tail and a much shorter left tail. In Figure 3(b) we see that the distribution is
symmetric, while in Figure 3(c) it is skewed to the left.
If our primary purpose in looking at the data is to determine the general shape or form of the
distribution, it will seldom be necessary to construct a relative frequency histogram. There are
several other types of graphical tools and plots that are used. These are discussed in Chapter 3.
Thus, we have shown how one can gain information from raw data by organizing them into a
frequency distribution and then presenting the data by using graphs. In Chapter 4, we are going
to study the statistical methods that can be used to summarize data. The most familiar of these
methods is the finding of averages.
Discrete and Continuous Data
Frequency Graphs of Discrete Data
Consider the number of defective items in successive samples of six items each. The data are
summarized in the table below.
Number of defectives, xi    Frequency, fi
0                           48
1                           10
2                           2
>2                          0
These data can be shown graphically in a very simple form because they involve discrete data,
as opposed to continuous data, and only a few different values exist. The variable is discrete in
the sense that only certain values are possible. In this case, the number of defective items in a
group of six must be an integer rather than a fraction. The number of defective items in each
group of this example is only 0, 1, or 2. The frequencies of these numbers are shown above.
The corresponding frequency graph is shown in Figure 4 below. The isolated spikes correspond to
the discrete character of the variate.
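A rough text version of such a frequency graph can be produced as follows. This is only a sketch: the spikes are drawn horizontally rather than vertically, and the function name spike_chart and its scale parameter are assumptions made for illustration.

```python
# Frequencies of 0, 1 and 2 defectives from the table above.
defectives = {0: 48, 1: 10, 2: 2}

def spike_chart(freqs, scale=2):
    """Return one text line per value, with one '|' per `scale` counts (rounded up)."""
    lines = []
    for value, count in sorted(freqs.items()):
        marks = "|" * -(-count // scale)  # ceiling division
        lines.append(f"{value}: {marks} {count}")
    return lines

for line in spike_chart(defectives):
    print(line)
```

Each value gets its own isolated row, with nothing drawn between rows, which mirrors the fact that no values between 0, 1 and 2 are possible.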
An empirical relation which gives an approximate value of the appropriate number of classes is
Sturges' Rule:
number of class intervals = 1 + 3.3 log10(N)    (2.1)
where N is the total number of observations in the sample or population.
The procedure is to start with the range, the difference between the largest and the smallest
items in the set of observations. Then the constant class width is given approximately by
dividing the range by the approximate number of class intervals from equation 2.1. Round off
the class width to a convenient number.
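The procedure just described can be sketched as two small functions. The function names and the sample numbers (100 observations ranging from 10 to 86) are hypothetical and chosen only to illustrate equation 2.1; the final rounding to a convenient width is left to judgement.

```python
import math

def sturges_class_count(n):
    """Approximate number of class intervals from equation 2.1."""
    return 1 + 3.3 * math.log10(n)

def suggested_class_width(smallest, largest, n):
    """Range divided by the approximate number of class intervals."""
    return (largest - smallest) / sturges_class_count(n)

# Hypothetical sample: 100 observations between 10 and 86.
k = sturges_class_count(100)
width = suggested_class_width(10, 86, 100)
print(f"about {k:.1f} classes, class width about {width:.1f}")
```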
The class boundaries must be clear with no gaps and no overlaps. For problems in this course
choose the class boundaries halfway between possible magnitudes. This gives a definite and
fair boundary. For example, if the observations are recorded to one decimal place, the
boundaries should end in five in the second decimal place. If 2.4 and 2.5 are possible
observations, a class boundary might be chosen as 2.45. The smallest class boundary should
be chosen at a convenient value a little smaller than the smallest item in the set of observations.
Each class midpoint is halfway between the corresponding class boundaries.
Then the number of items in each class should be tallied and shown as class frequency in a
table called a grouped frequency table. The relative frequency is the class frequency divided by
the total of all the class frequencies, which should agree with the total number of items in the set
of observations. The cumulative frequency is the total of all class frequencies smaller than a
class boundary. The class boundary rather than class midpoint must be used for finding
cumulative frequency because we can see from the table how many items are smaller than a
class boundary, but we cannot know how many items are smaller than a class midpoint unless
we go back to the original data. The relative cumulative frequency is the fraction (or percentage)
of the total number of items smaller than the corresponding upper class boundary.
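The construction of a grouped frequency table can be sketched as below. The function name grouped_frequency_table, the dictionary keys, and the eight sample observations are all hypothetical; the sketch assumes the boundaries were chosen halfway between possible values, so no observation falls exactly on a boundary.

```python
from bisect import bisect_left

def grouped_frequency_table(data, boundaries):
    """Tally `data` into classes defined by ascending `boundaries`.

    Returns one dict per class with its boundaries, midpoint, frequency,
    relative frequency, cumulative frequency and relative cumulative frequency.
    """
    counts = [0] * (len(boundaries) - 1)
    for x in data:
        i = bisect_left(boundaries, x) - 1  # index of the class containing x
        if not 0 <= i < len(counts):
            raise ValueError(f"{x} lies outside the class boundaries")
        counts[i] += 1

    total = sum(counts)
    rows, running = [], 0
    for lower, upper, freq in zip(boundaries, boundaries[1:], counts):
        running += freq  # items smaller than the upper class boundary
        rows.append({
            "lower": lower,
            "upper": upper,
            "midpoint": (lower + upper) / 2,
            "frequency": freq,
            "relative": freq / total,
            "cumulative": running,
            "relative_cumulative": running / total,
        })
    return rows

# Hypothetical observations recorded to one decimal place.
sample = [2.4, 2.6, 2.7, 3.1, 3.3, 3.4, 3.4, 3.8]
table = grouped_frequency_table(sample, [2.25, 2.75, 3.25, 3.75, 4.25])
for row in table:
    print(row)
```

The cumulative frequency is accumulated against the upper class boundary, in line with the reasoning above that boundaries, not midpoints, determine how many items are smaller.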
Example 4
The thickness of a particular metal part of an optical instrument was measured on 121
successive items as they came off a production line under what was believed to be normal
conditions. The results are shown in the table below.
Thickness is a continuous variable, since any number at all in the appropriate range is a
possible value. The data in the above table are given to two decimal places, but it would be
possible to measure to greater or lesser precision. The number of possible results is infinite. The
mass of numbers is very difficult to comprehend.
Now let us apply the grouped frequency approach to the numbers. The largest item in the table
is 3.57, and the smallest is 3.21, so the range is 0.36.
The number of class intervals according to Sturges' Rule should be approximately
1 + 3.3 log10(121) = 7.87. Then the class width should be approximately 0.36 / 7.87 = 0.0457. Let us
choose a convenient class width of 0.05. The thicknesses are stated to two decimal places, so
the class boundaries should end in five in the third decimal.
Let us choose the smallest class boundary, then, as 3.195. The resulting grouped frequency
table is shown below.
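For reference, the class boundaries implied by these choices (first boundary 3.195, class width 0.05, largest item 3.57) can be generated with a short sketch. The function name make_boundaries is an assumption; Decimal is used so that the third-decimal boundaries stay exact.

```python
from decimal import Decimal

def make_boundaries(start, width, largest):
    """Class boundaries from `start` in steps of `width` until `largest` is covered."""
    start, width, largest = Decimal(start), Decimal(width), Decimal(largest)
    boundaries = [start]
    while boundaries[-1] < largest:
        boundaries.append(boundaries[-1] + width)
    return boundaries

boundaries = make_boundaries("3.195", "0.05", "3.57")
midpoints = [(lower + upper) / 2 for lower, upper in zip(boundaries, boundaries[1:])]

print(len(boundaries) - 1, "classes")
print([str(b) for b in boundaries])
print([str(m) for m in midpoints])
```

With these inputs the sketch yields eight classes, close to the 7.87 suggested by Sturges' Rule.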
Grouped Frequency Table for Thicknesses
In this table the class frequency is obtained by counting the tally marks for each class. This
becomes easier if we divide the tally marks into groups of five as shown in the table. The
relative frequency is simply the class frequency divided by the total number of items in the table,
i.e. the total frequency, which is 121 in this case. The cumulative frequency is obtained by
adding together all the class frequencies for classes with values smaller than the current upper
class boundary. Thus, in the third line of the table, the cumulative frequency of 40 is the sum of
the class frequencies 2, 14 and 24. The corresponding relative cumulative frequency would be
40/121 = 0.331, or 33.1%. The cumulative frequency in the last line must be equal to the total
frequency. From the table the mode is given by the class midpoint of the class with the largest
class frequency, 3.370 mm. The mean, median and mode, 3.369, 3.37 and 3.370 mm, are in
close agreement. This indicates that the distribution is approximately symmetrical.
Graphical representations of grouped frequency distributions are usually more readily
understood than the corresponding tables. Some of the main characteristics of the data can be
seen in histograms and cumulative frequency diagrams. A histogram is a bar graph in which the
class frequency or relative class frequency is plotted against values of the quantity being
studied, so the height of the bar indicates the class frequency or relative class frequency. Class
midpoints are plotted along the horizontal axis.
In principle, a histogram for continuous data should have the bars touching one another.
However, the bars are often shown separated, and some computer software does not allow the
bars to touch one another.
The histogram for the data is shown in Figure 5 for a class width of 0.05 mm as already
calculated. Relative class frequency is shown on the right-hand scale.
Histograms for class widths of 0.03 mm and 0.10 mm are shown in Figures 6 and 7 for
comparison.
Of these three, the class width of 0.05 mm in Figure 5 seems most satisfactory (in agreement
with Sturges' Rule).
Cumulative frequencies are shown in the last column of the table. A cumulative frequency
diagram is a plot of cumulative frequency vs. the upper class boundary, with successive points
joined by straight lines. A cumulative frequency diagram for the thicknesses is shown in Figure 8.
Example 5
A sample of 120 electrical components was tested by operating each component continuously
until it failed. The time to the nearest hour at which each component failed was recorded. The
results are shown in the table below.
Times to Failure of Electrical Components, hours
Once again, frequency grouping is needed to make sense of this mass of data. When the data
are sorted in order of increasing magnitude, the largest value is found to be 5312 hours and the
smallest is 3 hours. Then the range is 5312 - 3 = 5309 hours. There are 120 data points.
Applying Sturges' Rule, equation 2.1 indicates that the number of class intervals should be
approximately 1 + 3.3 log10(120) = 7.86. Then the class width should be approximately
5309/7.86 = 675 hours. A more convenient class width is 600 hours.
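The arithmetic of this example can be reproduced with the sketch below. The helper classes_needed is a name introduced here, and the first class boundary of 0.5 hours is an assumption inferred from the data being recorded to the nearest hour; it is not stated in this passage.

```python
import math

n = 120
smallest, largest = 3, 5312

# Sturges' Rule (equation 2.1) and the class width it suggests.
k = 1 + 3.3 * math.log10(n)
suggested_width = (largest - smallest) / k

def classes_needed(first_boundary, width, largest_value):
    """Number of equal-width classes needed to cover values up to `largest_value`."""
    return math.ceil((largest_value - first_boundary) / width)

print(round(k, 2), round(suggested_width))
# Assumed first boundary of 0.5 hours with the convenient width of 600 hours.
print(classes_needed(0.5, 600, largest))
```

Choosing 600 rather than 675 hours trades one additional class for boundaries that are easier to read.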
Grouped Frequency Table for Failure Times
The cumulative frequency diagram of Figure 8 is S-shaped, with its slope first increasing and
then decreasing, whereas the cumulative frequency diagram of Figure 10 shows the slope
generally decreasing over its full length.
Now the mean, median and mode for the data (corresponding to Figures 9 and 10) will be
calculated and compared. The mean is 140746/120 = 1173 hours. The median is the average
of the two middle items in order of magnitude, 869 and 877, so it is 873 hours. The mode according
to the table is the midpoint of the class with the largest frequency, 300.5 hours, but of course the
value would vary a little if the class width or starting class boundary were changed. Since Figure 9
shows that the distribution is very asymmetrical or skewed, it is not surprising that the mean,
median and mode are so widely different.
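These three values can be checked with a few lines of arithmetic. The sketch uses only the totals quoted above; the modal class boundaries of 0.5 and 600.5 hours are an assumption inferred from the stated midpoint of 300.5 hours and the class width of 600 hours.

```python
total_hours = 140746          # sum of all failure times, from the text
n = 120                       # number of components

mean = total_hours / n        # about 1173 hours
median = (869 + 877) / 2      # average of the two middle items

modal_class = (0.5, 600.5)    # assumed boundaries of the class with the largest frequency
mode = sum(modal_class) / 2   # class midpoint

print(round(mean), median, mode)
print(mean > median > mode)
```

The ordering mean > median > mode is what one would typically expect for a distribution with a long right tail, since a few very long failure times pull the mean upward more than the median.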