

Instructions for Chapter 3


Prepared by Dr. Guru-Gharana
I will omit the sections on Box-and-Whiskers, Covariance, Correlation, and the Least Squares.
We will come back to some of these topics (except Box-and-Whiskers) when we start Chapter 13
on Regression.

Terminology and Conventions


Values related to the Population are called Population Parameters (which are generally not
known), and a value estimated from the observed sample is called a Sample Statistic. A
population parameter is a characteristic of the population, and a sample statistic is a
characteristic of the sample. The word "Statistics" is used both for the subject and as the plural of
"statistic". The general convention is to use Greek letters like μ, σ, π, ρ, and β for unknown
population parameters and suitable English letters for the corresponding sample statistics. The
sample statistics are used to make inferences about the unobservable population parameters. If a
single value is estimated for any variable, we call it a point estimate. If an interval of values is
estimated, indicating a range of possible values in which the population parameter may fall
with some measurable probability, then we call it a Confidence Interval.

Four Characteristics of a Distribution


A random variable is any variable whose numerical values depend on chance or have associated
probabilities. If the values and corresponding probabilities of a random variable are listed, or
tabulated, or depicted graphically, we call it a probability distribution (or simply a Distribution).
Since a random variable and associated probability distribution involve a large number of values,
we want to summarize them by some summary measures. In general, we are interested in four
characteristics of a distribution: a) Where is it located? (found by Measures of Location or
Central Tendency); b) How widely is it dispersed or spread out? (found by Measures of
Variation); c) How symmetrical are the two tails around the central value(s)? (found by the
Measure of Skewness); and d) How high is the mound or hump, which generally appears
towards the middle? (found by Measures of Kurtosis).
(These four measures are also referred to as First, Second, Third and Fourth Moments,
respectively)
In this chapter we will focus mainly on the first two measures, although we will also learn briefly
about the third characteristic.

Measures of Central Tendency


One of the key summary statistics is the Measure of Central Tendency which indicates the
value which is centrally located in the data, that is, towards the middle. The three popular

measures of Central Tendency are the Mean, the Median and the Mode. The Mean is by far the
most popular and the most widely used measure, followed by the Median.

The Mean
The Mean is also known as the Arithmetic Mean or the Average or the Expected Value (in
Advanced Statistics). Sometimes, the name may be quite different depending on the context. For
example, the Per Capita Income of a country or city is in fact the Mean income.
The (general) formula for the Mean is very simple: it is the sum of all values divided by the total
number of items (or observations) under study. Thus,

    μ = (Σ Xi) / N    (summing Xi over i = 1 to N)

is the formula for the population mean. Replace μ by X-bar (X with a bar over it) for the sample
mean. Thus, for the sample,

    X-bar = (Σ Xi) / n    (summing Xi over i = 1 to n)

If you have problems typing X with a bar over it, just type X-bar in your answers or copy it from
my Instructions.
Example1: Suppose a variable X has 12 values for twelve months of any year:
2,5,6,8,9,10,12,13,10,7,5, and 3, then the sample mean (or the annual average) is:
X-bar = (2+5+6+8+9+10+12+13+10+7+5+3)/12 = 7.5
Note that the mean is not equal to any of the actual values in the data. This is perhaps one of the drawbacks of the Mean.
Another drawback is that the Mean is very sensitive to extreme values (or outliers). For example,
if the value in only the sixth month (somehow) jumped to 100 instead of being 10, then the
overall annual average or the Mean would jump to 15. Thus one extreme value could drive the
average to greater than all other values in the whole series! This works similarly for outliers in
the lower end. The average household or per capita income of a large city where a few super
billionaires happen to live could be quite misleading.
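If you happen to know a little Python, you can verify both the mean and its sensitivity to outliers in a couple of lines. This sketch is optional and not part of the course; it uses only the built-in statistics module:

```python
import statistics

# Monthly values from Example 1
data = [2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, 3]
print(statistics.mean(data))  # 7.5

# Replace the sixth month's value (10) with the outlier 100
outlier_data = data.copy()
outlier_data[5] = 100
print(statistics.mean(outlier_data))  # 15.0 -- one extreme value pulls the mean way up
```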
So, the Mean has problems as a measure of the central tendency. Why is it so popular then?
Before you lose total respect for this measure, let me discuss some of the nice properties it does
have.
1. The mean takes into account all the values in the data. Change in any value, other things
remaining the same, will change the mean. That is, the mean does not ignore any value (or
information contained in the data). This is not true of other measures of Central tendency such as
Median and Mode which we will discuss below.
2. The mean is unique. One data set has only one mean. This is not true of Mode.

3. The mean indicates the balancing point or center of gravity. If equal weights were distributed at
distances indicated by the values in the above series, then the whole structure could be exactly
balanced by putting a finger (or other support) under the point corresponding to the Mean, 7.5. So
it is the point which balances the weights on the two sides.
4. A very useful property (used later in advanced statistical applications) is that the sum of
deviations around the mean is exactly zero. Let us try this for the above series.

X     X - X-bar
2     -5.5
5     -2.5
6     -1.5
8      0.5
9      1.5
10     2.5
12     4.5
13     5.5
10     2.5
7     -0.5
5     -2.5
3     -4.5
--    ----
90     0
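Property 4 (the deviations summing to zero) is easy to confirm numerically. A quick optional check in Python:

```python
# Values from Example 1 and their mean
data = [2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, 3]
mean = sum(data) / len(data)           # 7.5
deviations = [x - mean for x in data]
print(sum(deviations))                 # 0.0 -- the deviations around the mean cancel out
```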

Note that the formula given above has to be modified for grouped data. For grouped data apply
the weighted mean formula:

    μ = (Σ fi Xi) / (Σ fi)

where fi is the frequency corresponding to the value Xi and the denominator is the total frequency
(or the total number of observations). (Replace μ by X-bar to obtain the equivalent formula for the
sample.)
Example 2: Suppose we have grouped data as shown in the first two columns below. Then the
third column is calculated as the product of the first two to derive the mean.

X      Frequency   X*Frequency
5      2           10
7      4           28
8      7           56
10     5           50
12     2           24
Total  20          168

Therefore, X-bar = 168/20 = 8.4
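The weighted (frequency) mean above can also be checked with a short, optional Python sketch; the column products and totals match the table:

```python
# Grouped data from Example 2: each value with its frequency
values = [5, 7, 8, 10, 12]
freqs  = [2, 4, 7, 5, 2]

total = sum(f * x for f, x in zip(freqs, values))  # 168, the X*Frequency column total
n = sum(freqs)                                     # 20, the total frequency
print(total / n)                                   # 8.4
```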

If we have grouped data with classes or intervals instead of single X values with frequencies,

then we first find the class midpoints and then the Mean by treating the class midpoints like the
X column above.
For example, the mean for the grouped data given below for 30 items is calculated after finding
the class midpoints first:
Example 3

Class   Class Midpt (Mi)   Frequency   Mi*Frequency
3-7     5                  3           15
7-11    9                  10          90
11-15   13                 12          156
15-19   17                 5           85
Total                      30          346

Therefore, X-bar = 346/30 = 11.53
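The same recipe works for class data once the midpoints stand in for X. An optional Python check for Example 3:

```python
# Classes from Example 3 represented by their midpoints
midpoints = [5, 9, 13, 17]
freqs     = [3, 10, 12, 5]

# Weighted mean using the midpoints as the X values
mean = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)
print(round(mean, 2))  # 11.53
```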

Note that in the case of grouped data the deviations from the mean will add up to zero only after
multiplying by the respective frequencies. The above examples are examples of a frequency-weighted
mean. But the weights could be anything else instead of frequency; you just replace
the frequencies by the weights in the above formula to get the weighted average or mean. For
example, the weights could be the relative frequency or probability associated with the individual
values. If the weights are relative frequencies in decimals or probabilities, then the only new thing
to remember is that the denominator would be exactly 1. (Can you guess why? You guessed it
right: the sum of all relative frequencies or probabilities has to be 1.) Or the weights could be the
credit hours for each course you take, to derive the weighted average score.
Later (in chapters 6 and 7) we will discuss discrete and continuous random variables and
probability distributions. The above formula applies only to discrete variables. For continuous
variables (such as the normal random variable) we need to replace the summation sign by the
integral sign (Calculus! Yikes!). But don't get scared: you will not be asked to use integrals in
this course. You just have to be aware of it.

The Median
Next to the Mean, the Median is also popular in some applications, such as the comparison of
cities based on Median Household Income. But I will be brief on this topic because of its limited
use in most cases (especially in the advanced applications of Statistics).
Interestingly, the Median is indeed the central value in that it divides the whole population (or
sample) in two halves. It is such a value that 50 percent of the values are equal to or less than this
value and 50 percent are greater than or equal to it. To calculate the Median, order the series of
values in increasing or decreasing order. If the number of observations is odd, the Median is the
middlemost value (the ((N+1)/2)th value). If the number of observations is even, then the
Median is the average of the two middle-most values (the (N/2)th and the ((N/2)+1)th values).

Example of Median calculation:


Suppose there are twelve values as in the original series above. The values have to be first
rearranged in order of magnitude (increasing or decreasing). Let us arrange them in increasing
order: 2, 3, 5, 5, 6, 7, 8, 9, 10, 10, 12, and 13. Note that repeated values are listed consecutively.
Now the average of the two middle items after ordering them in ascending order (sixth and
seventh) is exactly equal to 7.5 which is also what the Mean found for us. This is the value such
that six items lie to its left and six to its right.
If, however, we had only eleven values of X: 2, 5, 6, 8, 9, 10, 12, 13, 10, 7, and 5. Then the
values have to be first rearranged in order of magnitude (increasing or decreasing). Let us
arrange them in increasing order: 2, 5, 5, 6, 7, 8, 9, 10, 10, 12, and 13. Now the middlemost
value or the Median is 8, which is the sixth value: (11+1)/2. This is the value such that six items
lie on the left tail (including itself) and six items on the right tail (including itself).
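Python's statistics.median follows exactly the procedure described above (it sorts the data internally, then takes the middle value or the average of the middle two). An optional check of both examples:

```python
import statistics

even_series = [2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, 3]  # 12 values
odd_series  = [2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5]     # 11 values

print(statistics.median(even_series))  # 7.5, the average of the 6th and 7th ordered values
print(statistics.median(odd_series))   # 8, the 6th ordered value
```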
The Median is insensitive to extreme values as long as the ranking is not disturbed. For example,
we could have 130 in place of 13 and the median would still be the same. This insensitivity is
regarded as a good property by the textbook, but statisticians don't look at it that way. Any
measure which ignores the information contained in the data is not considered highly desirable.
For example, we could change the values on the two sides of the Median in any way, just keeping
the order intact, and the median would not change. This does not sound right because it is
treating ratio-scale values almost like ordinal values only. The problem with the Mean is that
it is too sensitive, but the problem with the Median is that it is too insensitive. Statisticians would
rather have the first problem, because there are reasonable cures for outliers and extreme values,
but simply ignoring the values and only accounting for the relative ranking is not a very desirable
property. Therefore, in all advanced applications of Statistics, it is the Mean, not the Median,
which is the universally used measure.
The Mode
The Mode is simply the value with the highest number of occurrences (or the largest frequency).
A distribution can be multimodal; that is, there can be two or more values with equally high
frequency. In the above series there are two modes: 5 and 10. This is called bi-modal. There is
not much use for the Mode because it ignores most information in the data.

Relative positions of Mean, Median and Mode: In a symmetric distribution (which


has the two tails around the middle value identically shaped or mirror images) the Mean and
Median are equal and if there is only one mode (only one peak) it is also equal to the mean and
the Median. If the distribution is skewed to the right (that is, the right side has the longer tail; in
other words, there are a few extremely large values) then the relationship is Mode < Median <
Mean. If the distribution is skewed to the left (a long left tail, or a few extremely small
values) then the relationship is Mean < Median < Mode. Note that right-skewed distributions are
also called positively skewed and left-skewed distributions are called negatively skewed. If the
distribution is highly skewed then the Mean is unduly influenced by the few extreme values.
[Figure: a right-skewed continuous distribution, with Mode < Median < Mean]


A left skewed distribution would have the long tail to the left and the order of magnitude will be
Mean < Median < Mode. I am illustrating left and right skewed distributions together in the
diagram below.

[Figure: a left-skewed and a right-skewed distribution side by side]

The normal distribution is the most widely known example of a symmetric distribution as shown
below:

[Figure: the symmetric normal curve, with Mean = Median = Mode at the center]

Measures of Variation or Spread or Dispersion

Simply knowing the Central Tendency is not enough. For example, there are some
underdeveloped small countries whose per capita income levels are similar to those of the USA,
but only a few dozen families in those countries own almost everything and the rest of the
country is living in poverty. Similarly in stocks we need to know not only the expected (or
average) return but also the variability of the return, which measures the risk. Looking only at
return and not the risk would be financially disastrous (as also shown by the recent financial
crisis). So, measures of variation or spread are also important.
Among the various measures of Variation, the Variance (or its square root: the Standard
Deviation) is by far the most widely used measure in most applications. And it is based on the
Mean! So I will mention other measures only briefly and focus more on the Variance and its
derivatives.
The Range is simply the difference between the largest and the smallest values. It gives some
idea of the spread but only considers the two extreme values and ignores the rest: a highly
undesirable property.
To understand Interquartile Range we need to know percentiles and quartiles. It is very
straightforward and does not need further explanation. The Interquartile range estimates the
interval which contains the middle 50 percent of the values. There is very little use for such a
measure which ignores most of the information contained in the data. So I will now move to the
most important measure of Variation.

The Variance and the Standard Deviation


The population variance, denoted by σ² (pronounced "sigma squared"), is simply the sum of squared
deviations of the individual values from the Mean divided by the number of items. Its positive
square root is called the standard deviation (σ). We rarely know the population mean or variance.
Most often we deal with samples. The sample variance, denoted s² ("s squared"), is defined as:

    s² = Σ (Xi - X-bar)² / (n - 1)    (summing over i = 1 to n)

If you have grouped data, simply multiply the squared deviations from the Mean by the
respective frequencies, add, and divide by the total frequency minus one.
Then the formula becomes:

    s² = Σ fi (Xi - X-bar)² / (n - 1)

where fi refers to the ith class frequency.

The reason we divide by n-1 in the case of samples instead of n (the number of observations)
is to obtain an unbiased estimator of the Variance. We lose one degree of freedom in the
calculation of the sample variance because we have to use the data once to calculate the Mean
before we can calculate the variance, as if using the data depreciates it! You don't need to fully
understand this statistical jargon. Simply learn the rule that in the sample variance formula
we divide by n-1 instead of n.
Let us calculate the variance for Example 1, for which we have already calculated the Mean.

X: 2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, and 3; the calculated Mean is 7.5, and n = 12.

X      Xi - X-bar   (Xi - X-bar)²
2      -5.5         30.25
5      -2.5         6.25
6      -1.5         2.25
8       0.5         0.25
9       1.5         2.25
10      2.5         6.25
12      4.5         20.25
13      5.5         30.25
10      2.5         6.25
7      -0.5         0.25
5      -2.5         6.25
3      -4.5         20.25
90      0           131

Thus the Sample Variance is equal to 131/(12-1) = 11.909
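Python's statistics.variance uses the same n-1 divisor as the sample formula, so it reproduces the table's result. An optional check:

```python
import statistics

data = [2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, 3]

# Library version (n-1 divisor, like the sample variance formula)
print(round(statistics.variance(data), 3))  # 11.909

# Same value from first principles
mean = sum(data) / len(data)                # 7.5
ssq = sum((x - mean) ** 2 for x in data)    # 131.0, the squared-deviation column total
print(round(ssq / (len(data) - 1), 3))      # 11.909
```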


For grouped data we use this formula:

    s² = Σ fi (Xi - X-bar)² / (n - 1)

Let us calculate the sample variance for the grouped data in Example 2 above:

X      Freq(f)   X*f   X - X-bar   (X - X-bar)²   f(X - X-bar)²
5      2         10    -3.4        11.56          23.12
7      4         28    -1.4        1.96           7.84
8      7         56    -0.4        0.16           1.12
10     5         50     1.6        2.56           12.80
12     2         24     3.6        12.96          25.92
Total  20        168                              70.8

X-bar = 168/20 = 8.4
s² = 70.8/(20-1) = 3.726
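The frequency-weighted variance formula translates directly into code. An optional Python check of the Example 2 table:

```python
# Grouped data from Example 2
values = [5, 7, 8, 10, 12]
freqs  = [2, 4, 7, 5, 2]

n = sum(freqs)                                        # 20
mean = sum(f * x for f, x in zip(freqs, values)) / n  # 8.4
# Frequency-weighted sum of squared deviations (about 70.8)
ssq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, values))
print(round(ssq / (n - 1), 3))  # 3.726
```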

If you have grouped data with classes, then first find the class midpoints and then the Mean and
then the variance by treating the class midpoints like the X column above, and use the formula
given above. This is done for Example 3 below.
Class   Class Midpt (Mi)   Freq   Mi*Freq   Mi - X-bar   (Mi - X-bar)²   fi(Mi - X-bar)²
3-7     5                  3      15        -6.53        42.64           127.92
7-11    9                  10     90        -2.53        6.40            64.0
11-15   13                 12     156        1.47        2.16            25.92
15-19   17                 5      85         5.47        29.92           149.6
Total                      30     346                                    367.44

X-bar = 346/30 = 11.53
s² = 367.44/(30-1) = 12.67

The Standard Deviation


The problem with the Variance is that it is based on squared terms (to get rid of negative signs in
some deviations; otherwise the sum would always be zero). As a consequence the calculated
value is in squared units, which often make no sense. For example, if you started with dollars, you
end up with squared dollars, which cannot be interpreted meaningfully. Therefore, statisticians
have devised a measure called the standard deviation, which is simply the positive square root
of the Variance and is denoted by s for the sample. (Again note that the sample Variance formula
gives an unbiased estimator, which will be explained in later chapters, but the operation of taking
the square root destroys this property. That is, the Standard Deviation is not an unbiased estimator
of the corresponding population parameter.)
Thus, for the ungrouped data of the first example we have s = √11.909 = 3.451.
For the grouped data in Example 2, the sample standard deviation is s = √3.726 = 1.93.
For the grouped data in Example 3, s = √12.67 = 3.56.
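Taking the positive square root of each sample variance gives the three standard deviations just quoted. An optional Python check:

```python
import math

# Square roots of the three sample variances computed above
print(round(math.sqrt(11.909), 3))  # 3.451  (Example 1)
print(round(math.sqrt(3.726), 3))   # 1.93   (Example 2)
print(round(math.sqrt(12.67), 2))   # 3.56   (Example 3)
```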
Instead of going through all these calculations for the ungrouped data, we could simply enter the
data in Excel, invoke the Add-in, go to MegaStat, select Descriptive Statistics, select the
range of cells for the input, and get all the results in one go as follows (for Example 1):

Descriptive statistics               X
count                                12
mean                                 7.50
sample variance                      11.91
sample standard deviation            3.45
minimum                              2
maximum                              13
range                                11
sum                                  90.00
deviation sum of squares (SSX)       131.00

This gives all the important measures we have discussed so far. I want you to also learn it the
hard way (that is, using a calculator and the formulas)!
However, for grouped data this is a little tricky. You have to first convert the grouped data into
data of individual values and follow the above procedure. For example, in the case of the grouped
data of the second example you would type the value of X equal to 5 two times (for frequency 2)
and the value 7 four times, and so on. I find using a calculator and the formula easier when the
data is grouped in frequencies.
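The expand-the-frequencies trick described above (typing 5 twice, 7 four times, and so on) is also easy to script. An optional Python sketch that mirrors it:

```python
import statistics

# Grouped data from Example 2, expanded into individual values,
# just as you would type them into an Excel column
values = [5, 7, 8, 10, 12]
freqs  = [2, 4, 7, 5, 2]
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]

print(len(expanded))                            # 20 individual observations
print(statistics.mean(expanded))                # 8.4
print(round(statistics.variance(expanded), 3))  # 3.726
```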

The Empirical Rule


The most frequently used probability distribution in practice is the Normal distribution. We will
learn about it in more detail later. Here it is sufficient to know that this distribution is the famous
bell-shaped curve, which is symmetrical and looks like the following.

[Figure: the bell-shaped normal curve]

For a normal distribution, knowing only two parameters (the mean and the standard deviation or
variance) is sufficient to derive the whole distribution. For any normal distribution with mean μ and
standard deviation σ (or variance σ²) the following is approximately true:

i.   The interval around the mean between μ - σ and μ + σ contains about 68.27 percent
     (more than two-thirds) of all the values. That is, one standard deviation around the mean
     includes 68.27 percent of values.
ii.  Two standard deviations around the mean include 95.45 percent of values, and
iii. Three standard deviations around the mean include 99.73 percent (that is, nearly all) of the
     values.

The three intervals above are also called the corresponding Confidence or Tolerance intervals.
For example, the 95.45% confidence interval would be constructed by subtracting from, and adding
to, the mean two times the standard deviation. As an illustration, if the mean is 40 and the standard
deviation is estimated to be 3, then the range 40 ± 3 is expected to contain 68.27 percent of
items. In other words, 68.27 percent of items are expected to fall between 37 and 43. Similarly,
95.45 percent of items are expected to fall between 34 and 46, and so on. These simple facts are
also called the Empirical Rules and can be safely applied to most distributions which may not be
exactly normal but are not very skewed.
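The empirical-rule intervals in the illustration above can be generated mechanically: subtract and add k standard deviations to the mean. An optional Python sketch (the 40 and 3 are the illustration's numbers, not fixed constants):

```python
mean, sd = 40, 3  # the mean and standard deviation from the illustration

# Empirical-rule intervals for k = 1, 2, 3 standard deviations
for k, pct in [(1, 68.27), (2, 95.45), (3, 99.73)]:
    low, high = mean - k * sd, mean + k * sd
    print(f"about {pct}% of items between {low} and {high}")
```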

Chebyshev's Theorem or Chebyshev's Inequality


Chebyshev's theorem gives the percentage of items that lie within a specified interval
around the mean regardless of the shape of the distribution. Chebyshev's inequality guarantees
that in any probability distribution no more than 1/k² of the distribution's values can be more
than k standard deviations away from the mean. The inequality is quite general because it can be
applied to completely arbitrary distributions (unknown except for mean and variance). For
example, it can be used to prove the weak law of large numbers. It is very general but is of little
practical importance because the resulting intervals are too wide unless the standard deviation is
quite small.


Let X be a random variable with expected value μ and finite variance σ². Then for any real
number k > 0, Chebyshev's inequality guarantees that Pr(|X - μ| ≥ kσ) ≤ 1/k².
In simple words, for any distribution, the proportion of the values that lie within k
standard deviations of the mean is at least 1 - 1/k², where k is any constant greater than 1. Only
the case k > 1 provides useful information (when k ≤ 1 the right-hand side is greater than or
equal to one, so the inequality says nothing because a probability can never be greater than 1). As
an illustration, using k = 2 shows that at least 75% of the values lie in the interval (μ - 2σ, μ +
2σ). The empirical rule says around 95.45 percent lie in this interval. This is a more precise
statement compared to Chebyshev's rule.
Similarly, if we knew (or could safely assume) that the distribution is normal, then we could say
(by looking at the Normal Table, as explained in a later chapter) that at least 75% of items lie in
the interval (μ - 1.16σ, μ + 1.16σ). This is a comparatively much tighter (or more precise) interval than
what Chebyshev's rule can give. Therefore, the Empirical Rule based on the Normal Distribution
(discussed above) is much more popular than Chebyshev's rule.
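Chebyshev's lower bound 1 - 1/k² is a one-liner, which makes it easy to see how loose it is compared with the normal-based figures. An optional Python sketch:

```python
def chebyshev_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations (any distribution, k > 1)."""
    return 1 - 1 / k**2

print(chebyshev_bound(2))            # 0.75  -> at least 75% within 2 standard deviations
print(round(chebyshev_bound(3), 4))  # 0.8889 -> at least about 88.9% within 3
```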

The Z-Score: Subtracting the mean from a given value of a random variable and dividing by
the standard deviation is also called standardization, and the resulting value is called the Z-score.
If we do this for a normal random variable we obtain the Standard Normal distribution or the
Z-Distribution. The z-scores tell us how far a given value lies from the mean (to the left or right) in
multiples of the standard deviation.
Example: If the mean is 40 and the standard deviation is 4, then the Z-score of 45 is (45 - 40)/4
= 1.25, and the Z-score of 38 is (38 - 40)/4 = -0.5
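Standardization is a single arithmetic step, so it reads naturally as a tiny function. An optional Python sketch reproducing the example:

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean (negative means below it)."""
    return (x - mean) / sd

print(z_score(45, 40, 4))  # 1.25
print(z_score(38, 40, 4))  # -0.5
```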

The Coefficient of Variation


Sometimes it is difficult to compare two distributions based on the Standard Deviation (or
Variance) because of the different units or sizes involved. Therefore the Coefficient of Variation is
defined as the ratio of the standard deviation to the Mean times 100, a value in percentage (unit
free) which can be used to compare the variability (or risk) in distributions with different
Means, Variances, and units. Since this measure is unit free, meaningful comparisons can be
made among different distributions regardless of the scale or units involved.

    Coefficient of Variation in percentage = (s / X-bar) * 100

Example: If the standard deviation is 10 and the mean is 40, then the Coefficient of Variation is
(10/40)*100 = 25%. If a second distribution has a standard deviation of 50 and a mean equal to 400,
then its Coefficient of Variation is (50/400)*100 = 12.5%.

Thus the second distribution seems to have less relative variability compared to the first
distribution notwithstanding its larger standard deviation.
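The coefficient of variation is likewise a one-line ratio; scripting it makes the unit-free comparison explicit. An optional Python sketch with the two distributions above:

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean (unit free)."""
    return (sd / mean) * 100

print(coefficient_of_variation(10, 40))   # 25.0
print(coefficient_of_variation(50, 400))  # 12.5
```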
