CHAPTER III

Description of Data
Learning Objectives:
Given the learning materials and activities of this chapter, students will be able to:
• Calculate the mean, weighted mean, and median, and determine the mode to describe the center of a set of data.
• Calculate the range, variance, mean deviation, and standard deviation.
• Describe the shape of a distribution using measures of skewness and kurtosis.
• Choose the appropriate descriptive measure for a data set and explain its usability and limitations.

Measures of Central Tendency


Measures of central tendency, also known as measures of central location, summarize a set of
observations with a single central value. They are used to facilitate the comparison of
two or more data sets.
Mean/Arithmetic Mean/Average
This is the most common measure of central tendency. It is the sum of all the
observed values divided by the number of observations. The formulas are given below.
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Example 1: Five judges give their scores on the performance of a gymnast as follows: 8, 9, 9, 9,
and 10. Find the mean score of the gymnast.
Solution: Let $X_i$ be the score given by the ith judge in the population. Add the scores given by the
5 judges and divide by 5. Thus, the mean score is
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{8 + 9 + 9 + 9 + 10}{5} = \frac{45}{5} = 9$$
Therefore, the mean score of the gymnast is 9.
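The arithmetic in Example 1 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and `statistics.mean` from the standard library is shown next to the by-hand computation for comparison.

```python
from statistics import mean

scores = [8, 9, 9, 9, 10]  # judges' scores from Example 1

# Sum of the observed values divided by the number of observations
manual_mean = sum(scores) / len(scores)

# The standard library gives the same result
library_mean = mean(scores)

print(manual_mean, library_mean)
```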
Mean from Grouped Data
When the data are presented in a grouped frequency distribution table and the original raw data
are not accessible, it is still possible to approximate the mean using the formula below.
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$$
Where:
$f_i$ – the frequency of the ith class
$x_i$ – the class mark of the ith class
k – the total number of classes/class intervals
n – the total frequency
Example 2: The following table presents the frequency distribution of the weight of 75 pieces of
luggage in pounds. Approximate the sample mean weight of the luggage.
| Weight (pounds) | No. of luggage (f) | Class mark (x) | fx |
|---|---|---|---|
| 31.5 – 41.4 | 9 | 36.45 | 328.05 |
| 41.5 – 51.4 | 8 | 46.45 | 371.60 |
| 51.5 – 61.4 | 4 | 56.45 | 225.80 |
| 61.5 – 71.4 | 32 | 66.45 | 2126.40 |
| 71.5 – 81.4 | 14 | 76.45 | 1070.30 |
| 81.5 – 91.4 | 5 | 86.45 | 432.25 |
| 91.5 – 101.4 | 3 | 96.45 | 289.35 |
| Total | 75 | | 4843.75 |

Substitute the values into the formula. The mean weight is
$$\bar{x} = \frac{\sum_{i=1}^{7} f_i x_i}{n} = \frac{4843.75}{75} \approx 64.58 \text{ pounds}$$
Thus, the mean weight of the luggage is approximately 64.58 pounds.
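The grouped-mean computation of Example 2 can be sketched in Python as follows. The function name `grouped_mean` and the tuple layout are assumptions made for illustration; each class mark is taken as the midpoint of the class limits, as in the table.

```python
# (lower limit, upper limit, frequency) for each class in Example 2
classes = [
    (31.5, 41.4, 9),
    (41.5, 51.4, 8),
    (51.5, 61.4, 4),
    (61.5, 71.4, 32),
    (71.5, 81.4, 14),
    (81.5, 91.4, 5),
    (91.5, 101.4, 3),
]


def grouped_mean(classes):
    """Approximate the mean from class marks and frequencies."""
    n = sum(f for _, _, f in classes)
    # Class mark = midpoint of the class limits
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / n


print(round(grouped_mean(classes), 2))  # 64.58
```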
Weighted Mean
The weighted mean is used when the individual observed values vary in their degree of
importance. Each observation is assigned a weight according to its relative importance.
If we assign a weight w to each observation x, then the weighted mean is given by
$$\text{weighted mean} = \frac{\sum wx}{\sum w}$$
Example 3: A government agency gives scholarship grants to employees taking graduate studies.
Courses in graduate studies earn credits of 1, 2, 3, 4, or 5 units. An employee gets a partial scholarship
for the next semester with an average of 1.5 to 1.75, and a full scholarship if the average is
better than 1.5 (that is, numerically lower than 1.5). What kind of scholarship will the 2 employees get given their grades for the
previous semester? The data are given below:

| Subject | Units | Grade of Employee A | Grade of Employee B |
|---|---|---|---|
| A | 1 | 1.0 | 2.0 |
| B | 2 | 1.25 | 1.75 |
| C | 3 | 1.5 | 1.5 |
| D | 4 | 1.75 | 1.25 |
| E | 5 | 2.0 | 1.0 |

The average grade of employee A is:
$$\text{weighted mean} = \frac{1(1.0) + 2(1.25) + 3(1.5) + 4(1.75) + 5(2.0)}{1+2+3+4+5} = \frac{25}{15} \approx 1.67$$
The average grade of employee B is:
$$\text{weighted mean} = \frac{1(2.0) + 2(1.75) + 3(1.5) + 4(1.25) + 5(1.0)}{1+2+3+4+5} = \frac{20}{15} \approx 1.33$$
Thus, employee B will get a full scholarship grant while employee A will get a partial scholarship.
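A short Python sketch of Example 3 follows. The helper names `weighted_mean` and `scholarship` are hypothetical, and the rule that a lower average is better is an assumption read off the example's conclusion.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def scholarship(average):
    """Assumed rule from Example 3: lower averages are better."""
    if average < 1.5:
        return "full"
    if average <= 1.75:
        return "partial"
    return "none"


units = [1, 2, 3, 4, 5]
grades_a = [1.0, 1.25, 1.5, 1.75, 2.0]
grades_b = [2.0, 1.75, 1.5, 1.25, 1.0]

avg_a = weighted_mean(grades_a, units)
avg_b = weighted_mean(grades_b, units)

print(round(avg_a, 2), scholarship(avg_a))  # 1.67 partial
print(round(avg_b, 2), scholarship(avg_b))  # 1.33 full
```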
Median
The median is the midpoint of the data array. It will either be a specific observed value or
fall between two values. The formula to determine the median depends on whether the number
of observations, n, is odd or even.
Case 1. Number of observations is odd
The median is the observation in the middle of the array, that is, the observation in position (n+1)/2. The formula is:
$$\text{median} = x_{\left(\frac{n+1}{2}\right)}$$
Case 2. Number of observations is even
The median is the average of the two middle values in the array, the observations in positions n/2 and n/2 + 1. The formula is:
$$\text{median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$$
Example 4: The following data are the total receipts of 7 mining companies (in million pesos):
1.3, 6.6, 10.5, 12.6, 50.7, 4.7, 7.3. Determine the median value.
Solution: Arrange the observations from lowest to highest.
Array: 1.3, 4.7, 6.6, 7.3, 10.5, 12.6, 50.7
Since n = 7 is odd, the median is
$$\text{median} = X_{\left(\frac{7+1}{2}\right)} = X_{\left(\frac{8}{2}\right)} = X_4 = 7.3 \text{ million pesos}$$
Therefore, companies with total receipts of less than 7.3 million pesos belong in the
lower half of the ordered observations.
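The two cases of the median formula can be sketched in Python. The function `median_by_position` is an illustrative name; it converts the 1-based positions of the formulas to Python's 0-based list indices, and `statistics.median` is shown as a cross-check.

```python
from statistics import median


def median_by_position(data):
    """Median using the odd/even position formulas."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation (1-based)
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


receipts = [1.3, 6.6, 10.5, 12.6, 50.7, 4.7, 7.3]  # Example 4

print(median_by_position(receipts))  # 7.3
print(median(receipts))              # 7.3
```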
Median from Grouped Data
The approximate median from a frequency distribution is obtained using the
formula below:
$$\text{median} = l_{mc} + w\left[\frac{\frac{n}{2} - cf_{<mc}}{f_{mc}}\right]$$
Where:
n – the total frequency
$cf_{<mc}$ – cumulative frequency of the class preceding the median class
$f_{mc}$ – frequency of the median class
$l_{mc}$ – lower boundary of the median class
w – the class size/class width
Example 5: The following table presents the frequency distribution of the weight of 75 pieces of
luggage in pounds. Approximate the sample median weight of the luggage.
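| Weight (pounds) | No. of luggage (f) | Less-than cumulative frequency (<cf) | Note |
|---|---|---|---|
| 31.5 – 41.4 | 9 | 9 | |
| 41.5 – 51.4 | 8 | 17 | |
| 51.5 – 61.4 | 4 | 21 | $cf_{<mc} = 21$ |
| 61.5 – 71.4 | 32 | 53 | median class, $f_{mc} = 32$ |
| 71.5 – 81.4 | 14 | 67 | |
| 81.5 – 91.4 | 5 | 72 | |
| 91.5 – 101.4 | 3 | 75 | |
| Total | 75 | | |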

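The median class is the class that contains the observation in position n/2 = 75/2 = 37.5, which is
61.5 – 71.4 with lower boundary 61.45. The median weight of the luggage is:
$$\text{median} = 61.45 + (10)\left[\frac{37.5 - 21}{32}\right] \approx 66.61 \text{ pounds}$$
Therefore, about 50% of the pieces of luggage weigh 66.61 pounds or less.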
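The grouped-median formula can be sketched in Python as below. The function name and argument layout are assumptions; the lower class boundaries are entered directly, following the example's use of 61.45 as the boundary of the class 61.5 – 71.4.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Approximate median: l_mc + w * (n/2 - cf_before) / f_mc."""
    n = sum(freqs)
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lb, f in zip(lower_boundaries, freqs):
        # The median class is the first class whose cumulative
        # frequency reaches n/2
        if f > 0 and cf_before + f >= n / 2:
            return lb + width * (n / 2 - cf_before) / f
        cf_before += f
    raise ValueError("empty frequency table")


lower_boundaries = [31.45, 41.45, 51.45, 61.45, 71.45, 81.45, 91.45]
freqs = [9, 8, 4, 32, 14, 5, 3]

print(round(grouped_median(lower_boundaries, freqs, 10), 2))  # 66.61
```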
Mode
The mode is the most frequently observed value in the data set. It is the observed value that
occurs the greatest number of times.
Example 6: Consider the heights in inches of 10 basketball players: 70, 70, 75, 75, 72, 72, 71, 72,
75, 72. Determine the mode.
Solution: The height 72 occurs four times, more often than any other value.
Thus, the mode is 72 inches. This implies that the most frequent height among the 10 basketball
players is 72 inches.
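Counting occurrences, as in Example 6, can be sketched with the Python standard library. `collections.Counter` tallies each value, and `statistics.multimode` returns every value tied for the highest count, which helps when a data set has more than one mode.

```python
from collections import Counter
from statistics import mode, multimode

heights = [70, 70, 75, 75, 72, 72, 71, 72, 75, 72]  # Example 6

counts = Counter(heights)
print(counts.most_common(1))  # [(72, 4)]
print(mode(heights))          # 72
print(multimode(heights))     # [72]
```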
Mode from Grouped Data
To approximate the mode from a frequency distribution table, the formula is:
$$\text{mode} = l_{moc} + w\left[\frac{d_1}{d_1 + d_2}\right]$$
Where:
$l_{moc}$ – the lower class boundary of the modal class
w – the class size/class width
$d_1$ – the difference between the frequency of the modal class and the frequency of the class
preceding it
$d_2$ – the difference between the frequency of the modal class and the frequency of the class
succeeding it
Example 7: The following table presents the frequency distribution of the weight of 75 pieces of
luggage in pounds. Approximate the sample modal weight of the luggage.
| Weight (pounds) | No. of luggage (f) |
|---|---|
| 31.5 – 41.4 | 9 |
| 41.5 – 51.4 | 8 |
| 51.5 – 61.4 | 4 |
| 61.5 – 71.4 | 32 |
| 71.5 – 81.4 | 14 |
| 81.5 – 91.4 | 5 |
| 91.5 – 101.4 | 3 |
| Total | 75 |

$d_1 = 32 - 4 = 28$ and $d_2 = 32 - 14 = 18$
The modal class is 61.5 – 71.4 because it is the class with the highest frequency of 32.
Thus, the approximate mode is:
$$\text{mode} = 61.45 + (10)\left[\frac{28}{28 + 18}\right] \approx 67.54 \text{ pounds}$$
Therefore, the most frequent weight of the luggage is approximately 67.54 pounds.
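A minimal Python sketch of the grouped-mode formula is given below. The function name is illustrative, and treating a missing neighbouring class as having frequency 0 is an assumption for the case where the modal class is the first or last class; the text does not cover that case.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Approximate mode: l_moc + w * d1 / (d1 + d2)."""
    # Index of the modal class (highest frequency)
    i = max(range(len(freqs)), key=lambda j: freqs[j])
    # Assumption: a missing neighbouring class counts as frequency 0
    prev_f = freqs[i - 1] if i > 0 else 0
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = freqs[i] - prev_f
    d2 = freqs[i] - next_f
    return lower_boundaries[i] + width * d1 / (d1 + d2)


lower_boundaries = [31.45, 41.45, 51.45, 61.45, 71.45, 81.45, 91.45]
freqs = [9, 8, 4, 32, 14, 5, 3]

print(round(grouped_mode(lower_boundaries, freqs, 10), 2))  # 67.54
```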
Measures of Location
A measure of location tells us the percentage of observations in the
collection whose values are less than or equal to it. These measures of location are also called
quantiles or fractiles. The percentile, decile, and quartile are the most common measures of location.
Percentile
Percentiles are measures of location or position used in educational and health-related fields
to indicate the position of an individual in a group. This type of fractile divides the ordered
observations into 100 equal parts. Under the weighted average estimate method, the position of the kth percentile in the ordered data is:
$$\text{position of } p_k = \frac{k(n+1)}{100}$$
If this position is an integer i, then $p_k = X_i$. If it is not an integer, the weighted average estimate uses simple
interpolation between the two neighbouring observed values, using the formula below:
$$p_k = (1 - m)X_i + mX_{(i+1)}$$
Where:
m – the fractional part of the position
i – the integer part of the position
k – the desired percentile
Example 1: The following data are the number of years of operation of 20 mining
companies: 4, 6, 7, 5, 6, 30, 23, 25, 20, 21, 17, 18, 17, 19, 11, 10, 10, 8, 20, 16. Determine the
95th percentile.
Solution: Arrange the data set in order.
4, 5, 6, 6, 7, 8, 10, 10, 11, 16, 17, 17, 18, 19, 20, 20, 21, 23, 25, 30
Compute the position of $p_{95}$: $\frac{95(20+1)}{100} = 19.95$. Since 19.95 is not an integer, take i = 19 and m = 0.95, so $p_{95}$ is computed by
$$p_{95} = (1 - 0.95)X_{19} + 0.95X_{(19+1)} = 0.05(25) + 0.95(30) = 1.25 + 28.5 = 29.75 \text{ years}$$
Therefore, we can say that 95 percent of the 20 mining companies have been operating for less
than 29.75 years.
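The weighted average estimate can be sketched in Python as follows. The function name `percentile` is illustrative, and clamping to the smallest or largest observation when the position falls outside 1..n is an assumption, since the text does not say how to handle that case.

```python
def percentile(data, k):
    """kth percentile by the weighted average estimate method."""
    ordered = sorted(data)
    n = len(ordered)
    position = k * (n + 1) / 100
    i = int(position)   # integer part
    m = position - i    # fractional part
    # Assumption: clamp when the position falls outside 1..n
    if i < 1:
        return ordered[0]
    if i >= n:
        return ordered[-1]
    # X_i is ordered[i - 1] because Python lists are 0-based
    return (1 - m) * ordered[i - 1] + m * ordered[i]


years = [4, 6, 7, 5, 6, 30, 23, 25, 20, 21,
         17, 18, 17, 19, 11, 10, 10, 8, 20, 16]

print(round(percentile(years, 95), 2))  # 29.75
```

Note that library routines such as `statistics.quantiles` support several interpolation conventions, so their results may differ slightly from this method depending on the option chosen.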
Percentile from Grouped Data
To approximate the kth percentile, $P_k$, from a frequency distribution table, we use the formula below:
$$P_k = l_{p_k} + w\left[\frac{\frac{nk}{100} - cf_{<p_k}}{f_{p_k}}\right]$$
Where:
n – the total frequency
$cf_{<p_k}$ – cumulative frequency of the class preceding the $p_k$ class
$f_{p_k}$ – frequency of the $p_k$ class
$l_{p_k}$ – lower class boundary of the $p_k$ class
w – the class size/class width
k – the percentile of interest
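As an illustration, the grouped percentile formula can be sketched in Python and applied to the luggage frequency table used in the earlier examples. The function name is an assumption, and the $p_k$ class is taken to be the first class whose cumulative frequency reaches nk/100. The 50th percentile reproduces the grouped median.

```python
def grouped_percentile(lower_boundaries, freqs, width, k):
    """Approximate P_k: l + w * (n*k/100 - cf_before) / f."""
    n = sum(freqs)
    target = n * k / 100
    cf_before = 0  # cumulative frequency before the current class
    for lb, f in zip(lower_boundaries, freqs):
        if f > 0 and cf_before + f >= target:
            return lb + width * (target - cf_before) / f
        cf_before += f
    raise ValueError("k must be between 0 and 100")


# Luggage table: lower class boundaries and frequencies
lower_boundaries = [31.45, 41.45, 51.45, 61.45, 71.45, 81.45, 91.45]
freqs = [9, 8, 4, 32, 14, 5, 3]

print(round(grouped_percentile(lower_boundaries, freqs, 10, 50), 2))  # 66.61
print(round(grouped_percentile(lower_boundaries, freqs, 10, 90), 2))  # 82.45
```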
Decile
The deciles divide the ordered observations into ten equal parts. For instance, the first decile,
$D_1$, is the number that divides the bottom 10% of the data from the top 90%. To obtain the deciles,
divide the data set into tenths and then determine the numbers dividing the tenths. Under the weighted average estimate method, the position of the kth decile in the ordered data is:
$$\text{position of } D_k = \frac{k(n+1)}{10}$$
If this position is an integer i, then $D_k = X_i$. If it is not an integer, the weighted average estimate uses simple
interpolation between the two neighbouring observed values, using the formula below:
$$D_k = (1 - m)X_i + mX_{(i+1)}$$
Where:
m – the fractional part of the position
i – the integer part of the position
k – the desired decile
Decile from Grouped Data
To approximate the kth decile from a frequency distribution table, the formula is:
$$D_k = l_{D_k} + w\left[\frac{\frac{nk}{10} - cf_{<D_k}}{f_{D_k}}\right]$$
Where:
n – the total frequency
$cf_{<D_k}$ – cumulative frequency of the class preceding the $D_k$ class
$f_{D_k}$ – frequency of the $D_k$ class
$l_{D_k}$ – lower class boundary of the $D_k$ class
w – the class size/class width
k – the decile of interest

Quartile
The quartiles divide the ordered observations into four (4) equal parts. Under the weighted average estimate method, the position of the kth quartile in the ordered data is:
$$\text{position of } Q_k = \frac{k(n+1)}{4}$$
If this position is an integer i, then $Q_k = X_i$. If it is not an integer, the weighted average estimate uses simple
interpolation between the two neighbouring observed values, using the formula below:
$$Q_k = (1 - m)X_i + mX_{(i+1)}$$
Where:
m – the fractional part of the position
i – the integer part of the position
k – the desired quartile
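Percentiles, deciles, and quartiles share the same position-and-interpolation rule and differ only in the number of equal parts (100, 10, or 4). The sketch below uses a hypothetical helper `quantile(data, k, parts)` and applies it to the mining company data from the percentile example. Clamping positions outside 1..n is an assumption.

```python
def quantile(data, k, parts):
    """kth quantile when the ordered data are split into `parts` parts."""
    ordered = sorted(data)
    n = len(ordered)
    position = k * (n + 1) / parts
    i = int(position)   # integer part
    m = position - i    # fractional part
    # Assumption: clamp when the position falls outside 1..n
    if i < 1:
        return ordered[0]
    if i >= n:
        return ordered[-1]
    return (1 - m) * ordered[i - 1] + m * ordered[i]


years = [4, 6, 7, 5, 6, 30, 23, 25, 20, 21,
         17, 18, 17, 19, 11, 10, 10, 8, 20, 16]

q1, q2, q3 = (quantile(years, k, 4) for k in (1, 2, 3))
print(q1, q2, q3)                         # 7.25 16.5 20.0
print(round(quantile(years, 1, 10), 2))   # first decile: 5.1
```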
Quartile from Grouped Data
To approximate the kth quartile from a frequency distribution table, the formula is:
$$Q_k = l_{Q_k} + w\left[\frac{\frac{nk}{4} - cf_{<Q_k}}{f_{Q_k}}\right]$$
Where:
n – the total frequency
$cf_{<Q_k}$ – cumulative frequency of the class preceding the $Q_k$ class
$f_{Q_k}$ – frequency of the $Q_k$ class
$l_{Q_k}$ – lower class boundary of the $Q_k$ class
w – the class size/class width
k – the quartile of interest
Measures of Variability
A measure of dispersion (or variability) is a descriptive summary measure that characterizes the
data set in terms of how varied the observations are from each other. It helps us form a mental
picture of the spread of the data set. If its value is small, the observations
are not too different from each other. If its value is large, the observations
are widely spread out from the center. Here are some of the measures of variability.
Range
Range is the simplest and easiest to use measure of dispersion. It is the difference between
the highest and the lowest observation in a data set. The formula is:
$$\text{Range } (R) = \text{highest value } (HV) - \text{lowest value } (LV)$$
Example 1: The weights of five rabbits, in pounds, are shown below. Find the range.
8 10 12 14 15
Solution: The lightest rabbit weighs 8 pounds and the heaviest weighs 15 pounds. Thus, the range of
the weights of the rabbits is
Range = 15 – 8 = 7 pounds
Range from Grouped Data
The range from grouped data is the difference between the upper class limit of the last class
interval and the lower class limit of the first class interval. The formula is:
$$\text{range} = UCL_{HCI} - LCL_{LCI}$$
Where:
$UCL_{HCI}$ – upper class limit of the highest class interval
$LCL_{LCI}$ – lower class limit of the lowest class interval
Example 2: The following table presents the frequency distribution of the weight of 75 pieces of
luggage in pounds. Approximate the sample range of the luggage.
| Weight (pounds) | No. of luggage (f) |
|---|---|
| 31.5 – 41.4 | 9 |
| 41.5 – 51.4 | 8 |
| 51.5 – 61.4 | 4 |
| 61.5 – 71.4 | 32 |
| 71.5 – 81.4 | 14 |
| 81.5 – 91.4 | 5 |
| 91.5 – 101.4 | 3 |
| Total | 75 |

Solution: The upper class limit of the highest class interval is 101.4 and the lower class limit of the lowest class interval is 31.5.
Thus, the range is
Range = 101.4 – 31.5 = 69.9 pounds.
Standard Deviation
The standard deviation is the most common measure of variation. It indicates how
closely the values of a given data set are clustered around the mean. The basic formulas to calculate
the standard deviation are given below.
Sample standard deviation:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Variance
The variance is the square of the standard deviation: $s^2$ for a sample and $\sigma^2$ for a population.
Example 3: The final scores of 5 students were recorded as follows: 80, 88, 92, 90 and 85.
Determine the variance and standard deviation.
Solution: Consider the scores as x's and compute the average.

| Score (x) | x – mean | (x – mean)² |
|---|---|---|
| 80 | 80 – 87 = -7 | 49 |
| 88 | 88 – 87 = 1 | 1 |
| 92 | 92 – 87 = 5 | 25 |
| 90 | 90 – 87 = 3 | 9 |
| 85 | 85 – 87 = -2 | 4 |
| Total | | $\sum (x - \bar{x})^2 = 88$ |

Steps:
1. Calculate the average score: $\bar{x} = \frac{80+88+92+90+85}{5} = 87$.
2. Calculate the deviation of each score from the mean (x – mean).
3. Square each deviation from the mean.
4. Take the sum of the squared deviations from the mean.
5. Substitute the values into the formula.

Thus, the variance is $s^2 = \frac{88}{5-1} = 22$ and the standard deviation is
$$s = \sqrt{\frac{88}{5-1}} = \sqrt{22} \approx 4.69$$
Therefore, on average the individual observed values deviate around 4.69 units away from the
mean.
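The five steps of Example 3 can be sketched in Python. The variable names are illustrative, and `statistics.variance` and `statistics.stdev` (both sample versions, dividing by n – 1) are used as a cross-check.

```python
from statistics import mean, stdev, variance

scores = [80, 88, 92, 90, 85]  # Example 3

# Step 1: average score
x_bar = mean(scores)
# Steps 2-4: deviations from the mean, squared, then summed
sum_sq_dev = sum((x - x_bar) ** 2 for x in scores)
# Step 5: substitute into the sample formulas (divide by n - 1)
s2 = sum_sq_dev / (len(scores) - 1)
s = s2 ** 0.5

print(sum_sq_dev, s2, round(s, 2))  # 88 22.0 4.69

# Cross-check against the standard library
assert variance(scores) == s2
assert round(stdev(scores), 2) == round(s, 2)
```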
Standard Deviation from Grouped Data
The procedure is similar to that of finding the mean for grouped data, and it uses the
midpoint of each class. The formula is
$$s = \sqrt{\frac{n \sum f x^2 - \left(\sum f x\right)^2}{n(n - 1)}}$$

Example 4: For 108 randomly selected high school students, the following IQ frequency
distribution table was obtained.

| IQ interval | Frequency |
|---|---|
| 90 – 98 | 6 |
| 99 – 107 | 22 |
| 108 – 116 | 43 |
| 117 – 125 | 28 |
| 126 – 134 | 9 |

Find the variance and standard deviation.
Solution: Make a table. Find the class mark (midpoint) of each class. Multiply each midpoint
by the frequency of its class. Square the midpoint of each class, then multiply the
frequency by the squared midpoint. Lastly, find the sum of the frequencies, the
sum of the products of the frequencies and class marks, and the sum of the products
of the frequencies and the squared class marks. Finally, substitute the values into the formula.

| IQ interval | Frequency (f) | Class mark (x) | f·x | x² | f·x² |
|---|---|---|---|---|---|
| 90 – 98 | 6 | 94 | 564 | 8,836 | 53,016 |
| 99 – 107 | 22 | 103 | 2,266 | 10,609 | 233,398 |
| 108 – 116 | 43 | 112 | 4,816 | 12,544 | 539,392 |
| 117 – 125 | 28 | 121 | 3,388 | 14,641 | 409,948 |
| 126 – 134 | 9 | 130 | 1,170 | 16,900 | 152,100 |
| Total | n = 108 | | $\sum fx = 12{,}204$ | | $\sum fx^2 = 1{,}387{,}854$ |

Then, the standard deviation is
$$s = \sqrt{\frac{108(1{,}387{,}854) - (12{,}204)^2}{(108)(107)}} \approx 9.07$$
while the variance is
$$s^2 = (9.07)^2 \approx 82.26$$
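The grouped computation of Example 4 can be checked with a short Python sketch. The tuple layout and variable names are assumptions for illustration; class marks are computed as midpoints of the class limits.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each IQ class in Example 4
classes = [
    (90, 98, 6),
    (99, 107, 22),
    (108, 116, 43),
    (117, 125, 28),
    (126, 134, 9),
]

n = sum(f for _, _, f in classes)
# Class mark x = midpoint of the class limits
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)

# Grouped-data formula: [n * sum(f x^2) - (sum(f x))^2] / [n (n - 1)]
var = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
s = sqrt(var)

print(n, sum_fx, sum_fx2)          # 108 12204.0 1387854.0
print(round(var, 2), round(s, 2))  # 82.26 9.07
```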

Mean Deviation/Absolute Mean Deviation


The mean deviation measures the average extent by which the individual values in a distribution
deviate from the mean of that distribution. The formula is shown below:
$$MAD = \frac{\sum |x - \text{mean}|}{n}$$
Where:
MAD – the mean absolute deviation
x – the individual score in the data set
n – the total number of items
Example 5: Find the mean absolute deviation for the following scores: 22, 24, 26, 32, 30, and 28.
Solution: Construct a table. Find the sum of the scores and the mean. Take the deviation of each score
from the mean, then its absolute value. Compute the sum of the absolute deviations from the mean.
Substitute the values into the formula.

| Score (x) | x – mean | absolute value of (x – mean) |
|---|---|---|
| 22 | 22 – 27 = -5 | 5 |
| 24 | 24 – 27 = -3 | 3 |
| 26 | 26 – 27 = -1 | 1 |
| 28 | 28 – 27 = 1 | 1 |
| 30 | 30 – 27 = 3 | 3 |
| 32 | 32 – 27 = 5 | 5 |
| $\sum x = 162$ | | $\sum |x - \text{mean}| = 18$ |

The mean is 162/6 = 27. Then, the mean absolute deviation is
$$MAD = \frac{18}{6} = 3$$
Therefore, on average each score deviates around 3 units away from the mean.
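A minimal Python sketch of Example 5 follows. The function name `mean_absolute_deviation` is illustrative and is not a standard library function.

```python
def mean_absolute_deviation(data):
    """Average absolute deviation of the values from their mean."""
    x_bar = sum(data) / len(data)
    return sum(abs(x - x_bar) for x in data) / len(data)


scores = [22, 24, 26, 32, 30, 28]  # Example 5

print(sum(scores) / len(scores))        # 27.0
print(mean_absolute_deviation(scores))  # 3.0
```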
Mean Absolute Deviation from Grouped Data
When the data are presented in a frequency distribution table, the mean absolute
deviation is computed in the following manner.
$$\text{mean absolute deviation} = \frac{\sum f|x - \text{mean}|}{n}$$
Where:
f – the frequency of each class
x – the class mark of each class
n – the total frequency
An example of computing the mean absolute deviation from grouped data is left as an exercise.
Coefficient of Variation
Measures of variability such as the standard deviation measure absolute variability,
not relative dispersion. They can only be used to compare data sets that have the same unit of
measurement. Samples often differ in their units of measurement, however, and
the coefficient of variation overcomes this problem by expressing the standard deviation as a percentage of the mean. The formula is
$$cv = \frac{s}{\text{mean}} \times 100\%$$
Where:
s – the sample standard deviation
mean – the sample mean
cv – the coefficient of variation
Example 1: The average score of the students in an Algebra class is 110, with a standard deviation
of 5, while the average score of students in a Biology class is 106, with a standard deviation of 4.
Which class is more variable in terms of score?
Solution: Calculate the coefficient of variation of each class.
Algebra class:
$$cv = \frac{5}{110} \times 100\% \approx 4.55\%$$
Biology class:
$$cv = \frac{4}{106} \times 100\% \approx 3.77\%$$
Since the coefficient of variation for the Algebra class is larger than that of the Biology class,
the scores of students in the Algebra class are more variable than the scores in the Biology class.
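The comparison in this example can be sketched in Python. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(s, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return s / mean * 100


cv_algebra = coefficient_of_variation(5, 110)
cv_biology = coefficient_of_variation(4, 106)

print(round(cv_algebra, 2))  # 4.55
print(round(cv_biology, 2))  # 3.77
print("Algebra" if cv_algebra > cv_biology else "Biology")  # Algebra
```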
Measures of Skewness
A measure of skewness is a single value that indicates the degree and direction of
asymmetry. The computed value of skewness may be interpreted as follows:
sk = 0 – symmetric distribution
sk > 0 – positively skewed distribution
sk < 0 – negatively skewed distribution

The formula is:
$$sk = \frac{3(\text{mean} - \text{median})}{s}$$
Where:
sk – the coefficient of skewness
s – the sample standard deviation
As a rule of thumb, some statisticians suggest that a distribution whose coefficient of skewness falls within the range from -3 to
+3 is approximately normal.
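As an illustration, the skewness formula can be applied in Python to the five final scores from the standard deviation example (80, 88, 92, 90, 85). Reusing that data set here is a choice made for this sketch, not part of the original text.

```python
from statistics import mean, median, stdev

scores = [80, 88, 92, 90, 85]

# sk = 3 * (mean - median) / s, with s the sample standard deviation
sk = 3 * (mean(scores) - median(scores)) / stdev(scores)

print(mean(scores), median(scores))  # 87 88
print(round(sk, 2))                  # -0.64
```

The negative value indicates a slight negative skew, since the mean (87) lies below the median (88).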
Measure of Kurtosis
Distribution curves may be almost symmetrical yet differ in the sharpness of
their peaks. This property of curves can be described using the measure of kurtosis. The formula
to compute the measure of kurtosis is
$$k = \frac{\sum (x - \text{mean})^4}{n s^4}$$
Where:
n – the number of observations in the sample
s – the sample standard deviation
A symmetrical distribution falls into one of three types. If the kurtosis is exactly 3, the distribution is
mesokurtic. If the kurtosis is greater than 3, then the distribution is leptokurtic. And finally, if the
kurtosis is less than 3, then the distribution is platykurtic.
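The kurtosis formula can be sketched in Python using the five final scores from the standard deviation example (80, 88, 92, 90, 85). The data choice and the helper name `kurtosis_type` are assumptions for illustration; the classification compares the result against 3.

```python
from statistics import mean, stdev

scores = [80, 88, 92, 90, 85]

n = len(scores)
x_bar = mean(scores)
s = stdev(scores)  # sample standard deviation

# k = sum((x - mean)^4) / (n * s^4)
k = sum((x - x_bar) ** 4 for x in scores) / (n * s ** 4)


def kurtosis_type(k):
    """Classify the peak of a distribution by comparing k with 3."""
    if k > 3:
        return "leptokurtic"
    if k < 3:
        return "platykurtic"
    return "mesokurtic"


print(round(k, 2), kurtosis_type(k))  # 1.29 platykurtic
```

With only five observations the result is purely a demonstration of the arithmetic, not a reliable description of shape.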
