Degrees of Freedom
The notes below, and on the next tabs, explain what's going on behind the scenes.
On the other hand, the variance of sample data, used as an estimate of the population
variance, is computed as:

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
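As a cross-check outside the worksheet, here is a stdlib-Python sketch of the two denominators, using the first five values from the sample column on this tab:

```python
import statistics

# First five values from the sample column on this tab.
data = [9.34, 9.16, 5.27, 9.86, 8.90]

n = len(data)
xbar = sum(data) / n
ss = sum((x - xbar) ** 2 for x in data)   # sum of squared deviations

pop_style = ss / n          # divide by n   (the "(?)" formula below)
sample_var = ss / (n - 1)   # divide by n-1 (the sample variance s^2)

# The stdlib agrees: pvariance divides by n, variance by n-1.
assert abs(pop_style - statistics.pvariance(data)) < 1e-9
assert abs(sample_var - statistics.variance(data)) < 1e-9
```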
Why, you might well wonder, is the denominator in the second formula n-1 instead of n?
Why don't we instead use:
(?)   \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}
Imagine that you knew the value of μ, the population mean. Then it would be natural -
and correct - to estimate the population variance using the formula:

(!)   \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}
Since we don't know μ, we cheat just a little and use the sample mean x̄
instead. This means that the same data is being used to make two
distinct estimates, first the mean and then the variance. When we do this,
typically the numbers all fit together a bit too well.
We'll use Excel's Solver tool to see this. Listed to the right are
20 random observations sampled from a population where the
characteristic being studied is uniformly distributed between 5
and 15. The true population mean is 10.

sample: 9.34, 9.16, 5.27, 9.86, 8.90, 13.75, 13.52, 6.44, 5.64, 10.70, 11.69, 12.38, ...

Consider the function:

f(w) = \sum_{i=1}^{n} (x_i - w)^2

If w = 10.000, this function takes the value 149.634.
To find the value of w which minimizes this function, we select
Tools, Solver, Solve (I've already set up the problem), and OK.
Go ahead, try it.
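Solver's answer can be verified outside Excel: for any data set, the w that minimizes this sum of squares is the sample mean. A sketch using the sample values visible on this tab (a subset of the 20):

```python
# Sample values visible on this tab.
data = [9.34, 9.16, 5.27, 9.86, 8.90, 13.75,
        13.52, 6.44, 5.64, 10.70, 11.69, 12.38]

def f(w, xs=data):
    """Sum of squared deviations of the sample around a candidate value w."""
    return sum((x - w) ** 2 for x in xs)

xbar = sum(data) / len(data)

# f is a parabola in w, so a coarse grid search over [5, 15] lands on the
# same minimizer that Solver finds: the sample mean.
grid = [i / 1000 for i in range(5000, 15001)]
best = min(grid, key=f)
assert abs(best - xbar) < 0.001
```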
Except in the unlikely case that the sample mean is precisely equal to the
true population mean, the sum of the squared deviations of the sample
observations around the sample mean will be strictly smaller than the sum of
the squared deviations around the true mean. Because of this, formula (?)
is typically an underestimate of (!), and some adjustment must be made. It turns out
that dividing by (n-1) instead of n scales the result up just enough to offset
the downward bias created by using the same data to estimate both the
population mean and the population variance. (This is demonstrated via
simulation on the next tab.)
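A rough Python analogue of that simulation (assumed setup: repeated samples of size 20 from the Uniform(5, 15) population used on this tab):

```python
import random

random.seed(1)
N_TRIALS, n = 20000, 20
sum_div_n, sum_div_n1 = 0.0, 0.0

for _ in range(N_TRIALS):
    xs = [random.uniform(5, 15) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    sum_div_n += ss / n          # the biased "(?)" estimate
    sum_div_n1 += ss / (n - 1)   # the adjusted estimate

true_var = (15 - 5) ** 2 / 12    # variance of Uniform(5, 15): 100/12 ≈ 8.33
print(sum_div_n / N_TRIALS)      # should average below true_var
print(sum_div_n1 / N_TRIALS)     # should average near true_var
```

Dividing by n shrinks the average toward (n-1)/n of the true variance, while the n-1 denominator recovers it.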
One useful way to think of all this is to picture the original n sample
observations as each being "free" to take any value. Once we compute an
estimate (such as the sample mean, our estimate of the population mean)
using the data, however, one "degree of freedom" is lost, i.e., for any
sample to yield this particular estimate, once n-1 of the observations were
freely determined, the last observation would have a forced value. Therefore,
any subsequent estimates (such as our estimate of the population variance)
which are made using this first estimate will be based on data with only n-1
remaining degrees of freedom.
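The "forced value" idea can be made concrete with a short sketch (using the first five sample values from this tab): once the sample mean and n-1 of the observations are known, the last observation is pinned down.

```python
data = [9.34, 9.16, 5.27, 9.86, 8.90]
n = len(data)
xbar = sum(data) / n

# Pretend we only know xbar and the first n-1 observations:
known = data[:-1]
forced_last = n * xbar - sum(known)   # the only value consistent with xbar

assert abs(forced_last - data[-1]) < 1e-9
```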
Note: The downward bias in the example could also be eliminated by drawing
two separate samples, using one to estimate the population mean and
the other to estimate the population variance (and standard deviation). But this
is less efficient than using a single sample and then adjusting for the lost
degree of freedom in the second estimate.
Caution
The following tabs contain my best effort to demystify the notion of "degrees of freedom"
and the t-distribution. Read them for personal interest. As we'll see when the course
moves forward into regression analysis, dealing with all of this will ultimately be quite
simple and mechanical. Reading the following tabs is not mandatory.
Adjusting for 1 Lost Degree of Freedom
1-TDIST(~,9,TRUE)

standard deviations above 0:   0.5   1   1.5   2   2.5   3
is Z below this?                 1   1     1   1     1   1
is T below this?                 1   1     1   1     1   1
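The grid above appears to tally, trial by trial, whether a standard normal variable (Z) and a t statistic with 9 degrees of freedom (T) fall below each cutoff, with the 1-TDIST(~,9,TRUE) cell giving a theoretical tail area. A stdlib-Python sketch of the same comparison (assumed setup: samples of size 10 from a standard normal, hence 9 degrees of freedom):

```python
import math
import random
from statistics import NormalDist

random.seed(7)
nd = NormalDist()
cutoffs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
n, trials = 10, 50000

t_below = {c: 0 for c in cutoffs}
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    t = xbar / (s / math.sqrt(n))        # t statistic with 9 degrees of freedom
    for c in cutoffs:
        t_below[c] += t < c

for c in cutoffs:
    z_prob = nd.cdf(c)               # P(Z < c), exact
    t_prob = t_below[c] / trials     # P(T < c), simulated
    print(f"{c:4.1f}  Z: {z_prob:.3f}  T: {t_prob:.3f}")
```

Because the t distribution has heavier tails than the normal, T falls below each positive cutoff slightly less often than Z does.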