
CHAPTER 1 DATA DESCRIPTION AND NUMERICAL MEASURES

Learning Outcomes

At the end of this chapter, the students will be able to:


1. Understand the basic concepts of statistics.
2. Describe a data set graphically and numerically.

1.1 INTRODUCTION

Statistics is a field of study that involves collecting, presenting, analyzing and
interpreting data as a basis for explanation, description and comparison. There are
two types of statistics: descriptive statistics and inferential statistics.

Descriptive statistics involves organizing, displaying and describing data by using
tables, graphs and summary measures. Inferential statistics uses sample results to
help make decisions or predictions about a population.

Population refers to all the elements that are of interest for data collection.
Sample refers to a certain number of elements that have been chosen from a
population for observation. A sample is a subset of a population.

An element or member is a specific subject or object about which information is
collected. A variable is a characteristic under study that assumes different values
for different elements. An observation is the value of a variable for an element.

A data set is a collection of observations on one or more variables. An ungrouped
data set contains information on each member of a sample or population, while a data
set whose values are grouped into classes is called a grouped data set. Outliers are
values that are very small or very large relative to the majority of the values in a
data set.

A population parameter is a numerical measure computed from population data, and a
sample statistic is a numerical measure computed from sample data.

There are two types of variables: qualitative and quantitative. A qualitative
variable cannot assume any numerical value but can be classified into two or more
nonnumeric categories. Common examples of qualitative variables are color, brand of
car and gender. A quantitative variable is a variable that can be measured
numerically. Common examples of quantitative variables are height, number of
students and time.

Quantitative variables can be classified into two groups: discrete and continuous. A
discrete variable is a variable whose values are countable. Common examples of
discrete variables are number of cars, number of houses and number of defective
products. A continuous variable is a variable that can assume an infinite number of
values between any two specific values. Its values are obtained by measuring and
often include fractions and decimals. Common examples of continuous variables are
temperature, salary and weight.

1.2 DESCRIBING QUALITATIVE DATA

Before a graph is constructed, a data set can be described by using frequency,
relative frequency and percentage. Frequency is the number of observations in a
category or in a class and is denoted by f. Relative frequency is the proportion of
the total number of observations that falls in each category. Relative frequency and
percentage can be calculated by using these formulas:

$$\text{Relative frequency, } Rf = \frac{f}{\sum f}$$

$$\text{Percentage} = \text{Relative frequency} \times 100\%$$

A frequency distribution lists all categories and the number of observations that
belong to each category.

Example 1.1:
The terms above are illustrated in the table below:

Table 1.1 Frequency, Relative frequency, and Percentage Distributions

Category   Frequency   Relative Frequency   Percentage
White       5          5/50 = 0.1           10%
Blue       15          15/50 = 0.3          30%
Green      25          25/50 = 0.5          50%
Red         5          5/50 = 0.1           10%
Total      50          1.00                 100%

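As a minimal sketch (not part of the original text), the columns of Table 1.1 could be computed in Python as follows. The names `observations` and `frequency_table` are illustrative assumptions, and the list of observations is rebuilt from the frequencies in the table.

```python
from collections import Counter

# Observations rebuilt from the frequencies in Table 1.1 (their order is arbitrary).
observations = ["White"] * 5 + ["Blue"] * 15 + ["Green"] * 25 + ["Red"] * 5


def frequency_table(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)
    total = sum(counts.values())  # this is the sum of f
    return {
        category: (f, f / total, 100 * f / total)
        for category, f in counts.items()
    }


table = frequency_table(observations)
for category, (f, rf, pct) in table.items():
    print(f"{category:<6} {f:>3}  {rf:.2f}  {pct:.0f}%")
```

The same function can be applied to any list of category labels, such as the grades in Exercise 1.1.
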
Qualitative data can be displayed by using a bar graph or a pie chart. A bar graph is
a graph made of bars whose heights represent the frequencies or percentages of a
population or a sample belonging to different categories. A pie chart is a circle
divided into portions that represent the relative frequencies or percentages of a
population or a sample belonging to different categories. Figures 1.1 and 1.2 show a
bar graph and a pie chart respectively.

Figure 1.1 Example of bar graph (vertical axis: frequency, from 0 to 30; horizontal
axis: White, Blue, Green, Red)

Figure 1.2 Example of pie chart (White: 5, 10%; Blue: 15, 30%; Green: 25, 50%;
Red: 5, 10%)


Exercise 1.1:

The grades of 50 students in an IT subject are as follows:

A C A C C B B B C B
D D D B C C C D C C
B C B C B E D A E B
D A C C C C B C C A
D D D A B A C C B C

a. Prepare a frequency distribution table.
b. Calculate the relative frequencies and percentages for all categories.
c. What percentage of these students got an A or an E?
d. Construct a bar graph for the frequency distribution.

Exercise 1.2:

The following are the responses of 30 students from a statistics class who were asked
to evaluate their instructor. The students were asked to choose one of five answers:
Excellent (E), Above average (AA), Average (A), Below average (B) and Poor (P).

AA B AA A P A P A A E
E AA B AA A E AA AA AA A
E AA AA A E P AA B A AA

a. Construct a frequency distribution table.
b. Calculate the relative frequencies and percentages for all categories.
c. What percentage of these students ranked this instructor as excellent or
above average?
d. Draw a pie chart for the percentage distribution.

1.3 DESCRIBING QUANTITATIVE DATA

For quantitative data, an interval that includes all values that fall within two
numbers, the lower and upper limits, is called a class. The classes always represent a
variable.

A frequency distribution for quantitative data lists all classes and the number of
values that belong to each class. Data presented in the form of a frequency
distribution are called grouped data.

Class intervals can be classified into two groups: exclusive and inclusive. An
exclusive class interval has no gap between it and the next class interval and takes
the form $a \le x < b$. Examples of exclusive class intervals are 5 - 10, 10 - 15,
15 - 20, 20 - 25, 25 - 30. An inclusive class interval has a gap between it and the
next class interval and takes the form $a \le x \le b$. Examples of inclusive class
intervals are 72 - 74, 75 - 77, 78 - 80, 81 - 83, 84 - 86.

The class midpoint is the midpoint between the lower limit and the upper limit of a
class, and a class boundary is the midpoint between the upper limit of one class and
the lower limit of the next class.

Class width is the difference between the upper boundary and the lower boundary of a
class. It can also be obtained by finding the difference between two consecutive
lower class limits. Table 1.2 shows an example of inclusive class intervals with
their midpoints, widths and boundaries.

Example 1.2:
Table 1.2: Inclusive class interval

Class limits   Class midpoint   Class width   Class boundaries
72 - 74        73               3             71.5 - 74.5
75 - 77        76               3             74.5 - 77.5
78 - 80        79               3             77.5 - 80.5
81 - 83        82               3             80.5 - 83.5
84 - 86        85               3             83.5 - 86.5

In each row, the class limits give the lower limit and the upper limit, and the class
boundaries give the lower boundary and the upper boundary.
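
The definitions of midpoint, boundary and width can be checked against Table 1.2 with a short Python sketch. The function name `describe_classes` is a hypothetical helper, and the code assumes inclusive classes with the same gap between every pair of consecutive classes.

```python
def describe_classes(limits):
    """Return midpoint, width and boundaries for inclusive class limits.

    `limits` is a list of (lower limit, upper limit) pairs in increasing order.
    Assumption: the gap between consecutive classes is the same everywhere.
    """
    gap = limits[1][0] - limits[0][1]  # e.g. 75 - 74 = 1
    rows = []
    for lower, upper in limits:
        # A boundary lies halfway between the upper limit of one class
        # and the lower limit of the next class.
        lower_boundary = lower - gap / 2
        upper_boundary = upper + gap / 2
        rows.append({
            "limits": (lower, upper),
            "midpoint": (lower + upper) / 2,
            "width": upper_boundary - lower_boundary,
            "boundaries": (lower_boundary, upper_boundary),
        })
    return rows


limits = [(72, 74), (75, 77), (78, 80), (81, 83), (84, 86)]
rows = describe_classes(limits)
for row in rows:
    print(row)
```
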

We can describe quantitative data by using relative frequency and percentage, as we
did in Section 1.2. An example of grouped data is shown in Table 1.3.

Example 1.3:

Table 1.3: Frequency, relative frequency and percentage distribution

Class limits   Frequency   Relative Frequency   Percentage
72 - 74        14          14/50 = 0.28         28%
75 - 77         6          6/50 = 0.12          12%
78 - 80        12          12/50 = 0.24         24%
81 - 83         8          8/50 = 0.16          16%
84 - 86        10          10/50 = 0.20         20%
Total          50          1.00                 100%

A cumulative frequency distribution gives the total number of values that fall below
the upper boundary of each class. We can also describe quantitative data by using
cumulative relative frequency and cumulative percentage.

$$\text{Cumulative relative frequency} = \frac{\text{Cumulative frequency}}{\text{Total observations in the data set}}$$

$$\text{Cumulative percentage} = \text{Cumulative relative frequency} \times 100\%$$

An example of cumulative frequencies is shown in Table 1.4.

Example 1.4:

Table 1.4: Cumulative frequency, cumulative relative frequency and cumulative
percentage distribution

Class limits   Cumulative Frequency        Cumulative Relative Frequency   Cumulative Percentage
72 - 74        14                          14/50 = 0.28                    28%
75 - 77        14 + 6 = 20                 20/50 = 0.40                    40%
78 - 80        14 + 6 + 12 = 32            32/50 = 0.64                    64%
81 - 83        14 + 6 + 12 + 8 = 40        40/50 = 0.80                    80%
84 - 86        14 + 6 + 12 + 8 + 10 = 50   50/50 = 1.00                    100%
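
The running totals in Table 1.4 can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are assumptions, and the frequencies are those of Table 1.3.

```python
from itertools import accumulate

# Class frequencies for the classes 72-74, 75-77, 78-80, 81-83, 84-86.
frequencies = [14, 6, 12, 8, 10]

# Each cumulative frequency is the sum of the frequencies up to that class.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]  # total number of observations in the data set

cumulative_relative = [c / total for c in cumulative]
cumulative_percentage = [100 * c / total for c in cumulative]

for c, cr, cp in zip(cumulative, cumulative_relative, cumulative_percentage):
    print(f"{c:>3}  {cr:.2f}  {cp:.0f}%")
```
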

Quantitative data can be displayed by using a histogram or a polygon. A histogram is
a graph in which the classes are marked on the horizontal axis and the frequencies,
relative frequencies or percentages are marked on the vertical axis. The bars are
drawn adjacent to each other, and their heights represent the frequencies, relative
frequencies or percentages of the classes. Class boundaries can also be used to mark
the horizontal axis of a histogram. A polygon is constructed by plotting, above the
midpoint of each class, a point whose height equals the frequency, relative frequency
or percentage of that class, and then joining these points with straight lines.
Figures 1.3 and 1.4 show examples of a histogram and a polygon respectively.

Figure 1.3 Example of Histogram

Figure 1.4 Example of polygon


To graph a cumulative distribution for a variable, an ogive can be constructed by
joining points whose heights equal the cumulative frequencies, cumulative relative
frequencies or cumulative percentages of the respective classes. The upper
boundaries of the classes are marked on the horizontal axis. Figure 1.5 shows an
example of an ogive.

Figure 1.5 Example of ogive


Exercise 1.3:

Cixon Corporation manufactures computer terminals. The following data are the
numbers of computer terminals produced at the company for a sample of 30 days.

24 32 27 23 33 33 29 25 23 28
21 26 31 22 27 33 27 23 28 29
31 35 34 22 26 28 23 35 31 27

a. Construct a frequency distribution table by using the classes 21-23, 24-26, ...
b. Calculate the relative frequencies and percentages for all classes.
c. Construct a histogram and a polygon for the percentage distribution.
d. Using the frequency distribution table, prepare the cumulative frequency,
cumulative relative frequency and cumulative percentage distributions.
e. Construct an ogive for the cumulative frequency distribution.

1.4 Measures for Ungrouped Data

1.4.1 Measures of Central Tendency for Ungrouped Data

A measure of central tendency gives the center of a histogram. It also shows the
central point around which observations tend to cluster.

The mean is the most frequently used measure of central tendency. It is also called
the average.

To find the mean for ungrouped data, divide the sum of all values by the number of
observations in the data set. Thus,

Mean for population data:

$$\mu = \frac{\sum x}{N}$$

Note: $\sum x$ is the sum of all observations and $N$ is the population size.

Mean for sample data:

$$\bar{x} = \frac{\sum x}{n}$$

Note: $\sum x$ is the sum of all observations and $n$ is the sample size.

Example 1.5:

The following data give the prices of ten laptop computers sold recently in a shop.

1200 2600 1660 2199 3599 900 2600 2299 2299 1200

Find the mean sale price for these laptop computers.

Solution:

$$\bar{x} = \frac{\sum x}{n} = \frac{1200 + 2600 + 1660 + 2199 + 3599 + 900 + 2600 + 2299 + 2299 + 1200}{10} = \frac{20556}{10} = 2055.6$$

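The calculation in Example 1.5 can be checked with a short Python sketch. The function name `mean` and the list `prices` are illustrative; the standard library function `statistics.mean` could be used instead.

```python
# Prices of the ten laptops from Example 1.5.
prices = [1200, 2600, 1660, 2199, 3599, 900, 2600, 2299, 2299, 1200]


def mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


print(sum(prices), mean(prices))
```
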
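To find the median for ungrouped data, arrange the data set in increasing order and
take the middle value. Writing $x_k$ for the $k$-th value of the ordered data set,

$$\text{Median} = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{n/2} + x_{(n/2)+1}}{2}, & \text{if } n \text{ is even} \end{cases}$$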

Example 1.6:

The following data give the weights (in grams) of eight pen drives produced by Cixon
Company, selected randomly in a week.
200 210 199 2203 198 195 204 195

Find the median.

Solution:

Rearrange them into increasing order:

195 195 198 199 200 204 210 2203

$$\text{Median} = \frac{x_{n/2} + x_{(n/2)+1}}{2} = \frac{x_4 + x_5}{2} = \frac{199 + 200}{2} = 199.5$$
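
The rule for odd and even $n$ can be sketched in Python as follows. The function name `median` is illustrative (the standard library also offers `statistics.median`); note that Python lists are indexed from 0, so the $k$-th ordered value is `ordered[k - 1]`.

```python
# Weights of the eight pen drives from Example 1.6.
weights = [200, 210, 199, 2203, 198, 195, 204, 195]


def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value, which is index n // 2 when counting from 0.
        return ordered[middle]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median(weights))
```
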

To find the mode for ungrouped data, find the value that occurs most often in the
data set. In other words, this value has the highest frequency in the data set. If no
value occurs more often than the others, the data set is said to contain no mode. A
data set can also contain more than one mode.

Example 1.7:

The following data give the wireless transfer speeds (in Mbps) in a certain area.

11 54 54 11 11 11 300 200 11 11

Find the mode.


Solution:

Mode = 11
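
A minimal Python sketch of the mode is given below. The function name `modes` is hypothetical, and the code assumes the convention that a data set in which all values occur equally often has no mode; it returns a list so that more than one mode can be reported.

```python
from collections import Counter

# Wireless transfer speeds (in Mbps) from Example 1.7.
speeds = [11, 54, 54, 11, 11, 11, 300, 200, 11, 11]


def modes(values):
    """Return the list of modes; an empty list means the data set has no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if every distinct value has the same frequency,
    # no value stands out, so the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(speeds))
```
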

Exercise 1.4:

The following data give the number of computer keyboards assembled at the Cixon
Corporation for a sample of 20 days.

45 52 48 41 56 46 44 42 48 53
43 50 45 50 56 56 56 41 38 50

Calculate the mean, median and mode.


Exercise 1.5:

The following data give the number of hours spent studying by randomly selected
college students during the past week.

1.83 2.91 4.31 1.78 4.16 6.52 7.04 6.91 1.69 15.44

a. Calculate the mean and median for these data.


b. Do these data contain an outlier? If yes, drop this value and recalculate the
mean and median. Which of the two summary measures changes by a larger
amount when you drop the outlier?
c. Is the mean or the median a better summary measure for these data?

1.4.2 Relationship between the mean, median and mode

Knowing the values of the mean, median and mode of a data set gives us some idea
about the shape of its frequency curve.

a. Symmetric distribution: mean = median = mode

b. Positively skewed distribution (skewed to the right): mode < median < mean

c. Negatively skewed distribution (skewed to the left): mean < median < mode

(Each case is illustrated by a frequency curve, with the variable on the horizontal
axis and the frequency on the vertical axis.)

1.4.3 Measures of Dispersion for Ungrouped Data

The measures of central tendency do not give a complete description of the
distribution of a data set. Two data sets with the same mean may have completely
different spreads, because the variation among the observations in one data set may
be larger or smaller than in the other. We therefore also need measures that describe
the spread of a data set; these are called measures of dispersion. The greater the
variation between the observations, the more they are spread out.

Range is the simplest measure of dispersion. It gives the difference between the
largest and the smallest value in a data set. Thus,

Range = Largest value - Smallest value

The standard deviation is the most used measure of dispersion. The value of the
standard deviation tells how closely the values of a data set are clustered around the
mean. A larger standard deviation indicates a larger spread around the mean, while a
smaller standard deviation indicates a smaller spread around the mean. Figure 1.6
shows the spread around the mean for a sample data set.

Figure 1.6 Spread around the mean value (two curves, one with s = 5 and one with
s = 3)


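The variance is obtained by squaring the value of the standard deviation. The
standard deviation for ungrouped data is given as follows.

Standard deviation for population data:

$$\sigma = \sqrt{\frac{1}{N}\sum (x - \mu)^2} = \sqrt{\frac{1}{N}\left(\sum x^2 - \frac{(\sum x)^2}{N}\right)}$$

Variance for population data = $\sigma^2$

Standard deviation for sample data:

$$s = \sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)}$$

Variance for sample data = $s^2$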

Example 1.8:

The following data give the time spent (in seconds) on experiments testing an
optimization algorithm.
3.4 2.5 4.8 2.9 3.6
2.8 3.3 5.6 3.7 2.8
4.4 4.0 5.2 3.0 4.8

Find the range, standard deviation and variance for the time spent.
Solution:

Range = 5.6 - 2.5 = 3.1

$$s = \sqrt{\frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)} = \sqrt{\frac{228.28 - \frac{56.8^2}{15}}{14}} = 0.9709$$

$$s^2 = 0.9427$$
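
The figures in Example 1.8 can be verified with the sketch below, which also shows that the definition formula and the shortcut formula for the sample variance agree. The variable names are illustrative assumptions. In these data the smallest observation is 2.5 and the largest is 5.6.

```python
from math import sqrt

# Time spent (in seconds) from Example 1.8.
times = [3.4, 2.5, 4.8, 2.9, 3.6,
         2.8, 3.3, 5.6, 3.7, 2.8,
         4.4, 4.0, 5.2, 3.0, 4.8]

n = len(times)
data_range = max(times) - min(times)

# Definition formula: squared deviations from the sample mean, divided by n - 1.
x_bar = sum(times) / n
variance_definition = sum((x - x_bar) ** 2 for x in times) / (n - 1)

# Shortcut formula: uses only the sum of x and the sum of x squared.
sum_x = sum(times)
sum_x2 = sum(x * x for x in times)
variance_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

std_dev = sqrt(variance_shortcut)

print(round(data_range, 1), round(std_dev, 4), round(variance_shortcut, 4))
```
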

Exercise 1.6:

A binarization algorithm in image processing has been tested on two different types
of images. The time spent (in seconds) for the algorithm to finish has been recorded
for each type of image.

Type A   2.07 2.14 2.22 2.03 2.21 2.03
         2.05 2.18 2.09 2.14 2.11 2.02
Type B   2.52 2.15 2.49 2.03 2.37 2.05
         1.99 2.42 2.08 2.42 2.29 2.01

a. Compute range, standard deviation and variance for the time spent for each
type of image.
b. What can we conclude based on the values obtained in (a)?
1.5 Measures for Grouped Data

A grouped data set is obtained when the data collected have been grouped into classes.

Mean for population data:

$$\mu = \frac{\sum mf}{\sum f} = \frac{\sum mf}{N}$$

Mean for sample data:

$$\bar{x} = \frac{\sum mf}{\sum f} = \frac{\sum mf}{n}$$

Note: $m$ is the midpoint of each class, $f$ is the frequency of each class, $N$ is
the population size and $n$ is the sample size.

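Standard deviation for population data:

$$\sigma = \sqrt{\frac{1}{\sum f}\left(\sum m^2 f - \frac{(\sum mf)^2}{\sum f}\right)} = \sqrt{\frac{1}{N}\left(\sum m^2 f - \frac{(\sum mf)^2}{N}\right)}$$

Variance for population data = $\sigma^2$

Standard deviation for sample data:

$$s = \sqrt{\frac{1}{(\sum f) - 1}\left(\sum m^2 f - \frac{(\sum mf)^2}{\sum f}\right)} = \sqrt{\frac{1}{n-1}\left(\sum m^2 f - \frac{(\sum mf)^2}{n}\right)}$$

Variance for sample data = $s^2$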


Example 1.9:

Wireless mice have been randomly inspected for their battery life (in months) in a
manufacturing company.

Class limits   Frequency
1.5 - 1.9       2
2.0 - 2.4       1
2.5 - 2.9       4
3.0 - 3.4      15
3.5 - 3.9      10
4.0 - 4.4       5
4.5 - 4.9       3

Find the mean, standard deviation and variance for the above data.

Solution:

Class limits   Frequency   Midpoint (m)   mf      m^2 f
1.5 - 1.9       2          1.7             3.4      5.78
2.0 - 2.4       1          2.2             2.2      4.84
2.5 - 2.9       4          2.7            10.8     29.16
3.0 - 3.4      15          3.2            48      153.6
3.5 - 3.9      10          3.7            37      136.9
4.0 - 4.4       5          4.2            21       88.2
4.5 - 4.9       3          4.7            14.1     66.27
Total          40                        136.5    484.75

$$\bar{x} = \frac{\sum mf}{n} = \frac{136.5}{40} = 3.4125$$

$$s = \sqrt{\frac{1}{n-1}\left(\sum m^2 f - \frac{(\sum mf)^2}{n}\right)} = \sqrt{\frac{484.75 - \frac{136.5^2}{40}}{39}} = 0.6969$$

$$s^2 = 0.4857$$

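The working of Example 1.9 can be reproduced with the following sketch, which applies the sample formulas for grouped data. The list `classes` and the other names are illustrative assumptions; each entry holds the class limits and the class frequency.

```python
from math import sqrt

# ((lower limit, upper limit), frequency) for each class in Example 1.9.
classes = [((1.5, 1.9), 2), ((2.0, 2.4), 1), ((2.5, 2.9), 4), ((3.0, 3.4), 15),
           ((3.5, 3.9), 10), ((4.0, 4.4), 5), ((4.5, 4.9), 3)]

# The midpoint m of each class stands in for every observation in that class.
midpoints = [(lower + upper) / 2 for (lower, upper), _ in classes]
frequencies = [f for _, f in classes]

n = sum(frequencies)                                               # sum of f
sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))        # sum of mf
sum_m2f = sum(m * m * f for m, f in zip(midpoints, frequencies))   # sum of m^2 f

grouped_mean = sum_mf / n
grouped_variance = (sum_m2f - sum_mf ** 2 / n) / (n - 1)  # sample formula
grouped_std_dev = sqrt(grouped_variance)

print(round(grouped_mean, 4), round(grouped_std_dev, 4), round(grouped_variance, 4))
```

For population data, the divisor `n - 1` would be replaced by `n`, following the population formula given earlier in this section.
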
Exercise 1.7:

The following table gives the frequency distribution of the amount of internet bills
for April 2014 for a sample of 66 families.

Amount of Internet Bill (RM)   Number of Families
40 to less than 70             14
70 to less than 100            17
100 to less than 130           21
130 to less than 160           10
160 to less than 190            4

Calculate the mean, variance, and standard deviation.


SUMMARY

Variable
  Qualitative variable (e.g. color, gender)
  Quantitative variable
    Discrete (e.g. number of cars, number of books)
    Continuous (e.g. height, weight)

TUTORIAL

1. Indicate whether each of the following constitutes a population or a sample.
a. Ages of all members of a family.
b. Number of days missed by all employees of a company during the past
month.
c. Marital status of 50 persons selected from a large city.
d. Number of computers sold during the past week at all computer stores in
Taman Tasik Utama.
e. Scores of 100 students in a statistics class.

2. Indicate whether each of the following variables is qualitative or quantitative.
If a variable is quantitative, classify it as discrete or continuous.
a. Number of persons in a family.
b. Types of cars owned by families.
c. Color of eyes.
d. Monthly phone bills.
e. Length of a frog's jump.

3. The operations manager at a cereal packaging plant said that, in her experience,
typically there are nine reasons that result in the production of unacceptable
cereal cartons at the end of the packaging process: broken carton (R), bulging
carton (G), cracked carton (C), dirty carton (D), hole in carton (H), improper
package weight (I), printing error (P), unreadable label (U), and unsealed box
top (S). The raw data below represent a sample of 50 unacceptable cereal
cartons taken from the past week's production, and the reasons for
nonconformance are indicated:
U G U S H D D R I U

S U S U G C S U D R

S U D U S S D P R S

I S U D G S S U S D

G S C U D D S S S U

a. Prepare a frequency distribution table.
b. Calculate the relative frequencies and percentages for all categories.
c. Construct a bar graph for the frequency distribution.
d. Construct a pie chart for the percentage distribution.

4. The following data give the repair costs (in RM) for 30 cars randomly selected
from a list of cars that were involved in collisions.

2300 750 2500 410 555 1576

2460 1795 2108 897 989 1866

2105 335 1344 1159 1236 1395

6108 4995 5891 2309 3950 3950

6655 4900 1320 2901 1925 6896

a. Construct a frequency distribution. Take RM1 as the lower limit of the first
class and RM1400 as the width of each class.
b. Compute the relative frequencies and percentages for all classes.
c. Draw a histogram and a polygon for the relative frequency distribution.
d. What are the class boundaries and the width of the fourth class?
e. Construct the cumulative frequency, cumulative relative frequency and
cumulative percentage distribution.
f. Construct an ogive for the cumulative frequency distribution.

5. The following data are the lengths of life (in hours) of a sample of 100-watt
light bulbs produced by Manufacturer A and a sample of 100-watt light bulbs
produced by Manufacturer B.

Manufacturer A                 Manufacturer B
 684  697  720  773  821        819  836  888  897  903
 831  835  848  852  852        907  912  918  942  943
 859  860  868  870  876        952  959  962  986  992
 893  899  905  909  911        994 1004 1005 1007 1015
 922  924  926  926  938       1016 1018 1020 1022 1034
 939  943  946  954  971       1038 1072 1077 1077 1082
 972  977  984 1005 1014       1096 1100 1113 1113 1116
1016 1041 1052 1080 1093       1153 1154 1174 1188 1230

a. Construct the frequency distribution for each manufacturer. Start with 650
for Manufacturer A and 750 for Manufacturer B. Choose class width of
100.
b. Construct the percentage distribution for each manufacturer.
c. Construct the cumulative percentage distributions for each manufacturer.
d. Draw the percentage histograms in separate graphs.
e. Draw the percentage polygons in one graph.
f. Draw the ogives on cumulative percentage distributions in one graph.
g. Which manufacturer has bulbs that have a longer life, Manufacturer A or
Manufacturer B? Explain.

6. The following data give the number of computer terminals produced at a
certain company for a sample of 10 days.

24 32 27 23 35 33 29 21 23 28

Calculate the mean, median and mode for these data.

7. The mean score on a statistics test for eight students is 76. The scores of seven of
these eight students are 81, 65, 93, 66, 71, 84 and 80. Find the score of the
eighth student.

8. Consider the following two data sets.

Data set I: 4, 8, 15, 9, 11
Data set II: 8, 16, 30, 18, 22

Notice that each value of the second data set is obtained by multiplying the
corresponding value of the first data set by 2. Calculate the mean for each of
these two data sets. Comment on the relationship between the two means.

9. The following data give the weight (in kg) lost by 15 new members of a health
club at the end of their first two months of membership.

5 10 8 7 25 12 5 14 11 10
21 9 8 11 180

Compute the mean, median, mode, range, variance and standard deviation.

10. Consider the following two data sets.

Data set I: 12 25 37 8 41
Data set II: 19 32 44 15 48

Note that each value of the second data set is obtained by adding 7 to the
corresponding value of the first data set. Calculate the standard deviation for
each of these two data sets using the formula for sample data. Comment on the
relationship between the two standard deviations.

11. The following table gives the frequency distribution of the amounts of internet
bills for the month of November 2015 for a sample of 50 families.

Amount of Internet Bill (RM)   Number of families
20 to less than 40              8
40 to less than 60             13
60 to less than 80             17
80 to less than 100             9
100 to less than 120            3

Calculate the mean, variance and standard deviation.

12. The following table gives the grouped data on the weights of all 100 books at a
small library in 2013.

Weight (kg)           Number of books
3 to less than 5       5
5 to less than 7      30
7 to less than 9      40
9 to less than 11     20
11 to less than 13     5

Find the mean, variance and standard deviation.
