

Degrees of Freedom

For most practical purposes (including managerial applications), it suffices to let a computer count "degrees of freedom," and to use the "approximately 2" multiplier the computer provides in order to determine the margin of error in an estimate (at the 95%-confidence level).

The notes below, and on the next tabs, explain what's going on behind the scenes.

The variance of a population characteristic is computed using the formula:


$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
On the other hand, the variance of sample data, used as an estimate of the population
variance, is computed as:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
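As a quick aside that isn't in the original workbook, here is a minimal Python/NumPy sketch of the two formulas; the five-value array is arbitrary and purely illustrative.

```python
import numpy as np

data = np.array([9.0, 11.0, 10.0, 8.0, 12.0])  # arbitrary illustrative values

# Treating the array as an entire population: divide by N, and the array's own
# mean plays the role of the population mean.
pop_var = ((data - data.mean()) ** 2).sum() / len(data)
assert np.isclose(pop_var, np.var(data, ddof=0))   # NumPy's default

# Treating the array as a sample from a larger population: divide by n - 1.
sample_var = ((data - data.mean()) ** 2).sum() / (len(data) - 1)
assert np.isclose(sample_var, np.var(data, ddof=1))

print(pop_var, sample_var)  # the n-1 version is always the larger of the two
```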
Why, you might well wonder, is the denominator in the second formula n-1 instead of n? Why don't we instead use:
$$(?)\qquad \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

Imagine that you knew the value of μ, the population mean. Then it would be natural, and correct, to estimate the population variance using the formula:

$$(!)\qquad \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
Since we don't know μ, we cheat just a little and use the sample mean x̄ instead. This means that the same data is being used to make two distinct estimates, first the mean and then the variance. When we do this, typically the numbers all fit together a bit too well.
We'll use Excel's Solver tool to see this. Listed below are 20 random observations sampled from a population where the characteristic being studied is uniformly distributed between 5 and 15. The true population mean is 10.

    sample: 9.34, 9.16, 5.27, 9.86, 8.90, 13.75, 13.52, 6.44, 5.64, 10.70,
            11.69, 12.38, 12.44, 5.33, 7.09, 10.04, 9.36, 8.01, 9.42, 14.10

    sample mean = 9.621

Consider the function:

$$\sum_{i=1}^{n} (x_i - w)^2$$

If w = 10.000, this function takes the value 149.634. To find the value of w which minimizes this function, we select Tools, Solver, Solve (I've already set up the problem), and OK. Go ahead, try it.

The resulting value of w is called the "least-squares fit" to the sample data. [In fact, regression analysis uses the same least-squares-fit approach to estimate the coefficients of the regression model, and we've just done the equivalent of a regression analysis with no independent variables.]

The striking fact is that this value of w is exactly equal to the sample mean, i.e., the sample mean is the single number which "fits" the entire sample better than any other number.
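If you'd rather not open Solver, the same minimization can be sketched in Python (an illustrative equivalent, not the workbook's actual setup); scipy's minimize_scalar plays the role of Solver here.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# The 20 observations listed above (drawn from a uniform distribution on [5, 15]).
x = np.array([9.34, 9.16, 5.27, 9.86, 8.90, 13.75, 13.52, 6.44, 5.64, 10.70,
              11.69, 12.38, 12.44, 5.33, 7.09, 10.04, 9.36, 8.01, 9.42, 14.10])

def sum_sq(w):
    """Sum of squared deviations of the sample around the candidate value w."""
    return ((x - w) ** 2).sum()

print(sum_sq(10.0))               # roughly 149.6, as in the worksheet

result = minimize_scalar(sum_sq)  # numerical minimization, like Solver
print(result.x, x.mean())         # the minimizer coincides with the sample mean (~9.62)
```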

Except in the unlikely case that the sample mean is precisely equal to the
true population mean, the sum of the squared deviations of the sample
observations around the sample mean will be strictly smaller than the sum of
the squared deviations around the true mean. Because of this, (?) typically
is an underestimate of (!), and some adjustment must be made. It turns out
that dividing by (n-1) instead of n scales the result up just enough to offset
the downward bias created by using the same data to estimate both the
population mean and the population variance. (This is demonstrated via
simulation on the next tab.)

One useful way to think of all this is to picture the original n sample observations as each being "free" to take any value. Once we compute an estimate (such as the sample mean, our estimate of the population mean) using the data, however, one "degree of freedom" is lost: for any sample to yield this particular estimate, once n-1 of the observations were freely determined, the last observation would have a forced value. (For example, if n = 3 and the sample mean is 5, then once the first two observations are chosen, the third is forced to equal 15 minus their sum.) Therefore, any subsequent estimates (such as our estimate of the population variance) which are made using this first estimate will be based on data with only n-1 remaining degrees of freedom.

Many statistical techniques (in particular, regression analysis) entail making several estimates from the same sample data. Consequently, throughout much of what we'll be doing, the "counting" of degrees of freedom will be necessary in order to correct for the bias in our estimates that would otherwise result from everything fitting together a bit too well.

Note: The downward bias in the example could also be eliminated by drawing two separate samples, using one to estimate the population mean and the other to estimate the population variance (and standard deviation). But this is less efficient than using a single sample and then adjusting for the lost degree of freedom in the second estimate.

Caution

The following tabs contain my best effort to demystify the notion of "degrees of freedom" and the t-distribution. Read them for personal interest; as we'll see when the course moves forward into regression analysis, dealing with all of this will ultimately be quite simple and mechanical. Reading the following tabs is not mandatory.
Adjusting for 1 Lost Degree of Freedom

Here, we're drawing a random sample of size 10 from a normally-distributed population. That the observations are normally distributed is not important; they could be drawn from a population with any distribution.

    50    μ, population mean
    5     σ, population standard deviation
    25    σ², population variance
    10    n, sample size

The sum of the squared deviations from the sample mean, and from the true mean, are both computed. As discussed on the previous tab, the first sum is always smaller than the second. (Press 'F9' to resample.)

    sample       (xi - x̄)²     (xi - μ)²
    46.641         5.655         11.280
    45.666        11.246         18.782
    53.586        20.854         12.861
    48.882         0.019          1.251
    50.516         2.239          0.266
    56.205        51.625         38.498
    52.954        15.477          8.725
    51.296         5.180          1.679
    45.349        13.471         21.628
    39.101        98.371        118.780
    x̄ = 49.020   sum 224.136   sum 233.748

50,000 simulations are used to estimate the average values of both sums.

    monitored cell              $I$20 (around x̄)   $J$20 (around μ)
    mean                        224.968             250.105
    sample standard deviation   106.498             112.269
    minimum                     11.513              17.701
    maximum                     969.745             1096.328
    number of simulation runs   50,000              50,000

In the latter case (deviations around the true mean), the sum of the squared deviations is, on average, just about 10 times the true variance. But the sum of the squared deviations around the sample mean is only, on average, about 9 = 10-1 times the true variance.

This illustrates why we divide by 9, and more generally by n-1, to estimate the true variance (and standard deviation) from our sample data.
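A minimal Python sketch of the same experiment, assuming NumPy is available (it mirrors the workbook's simulation rather than reproducing its cell formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, runs = 50.0, 5.0, 10, 50_000

samples = rng.normal(mu, sigma, size=(runs, n))
xbar = samples.mean(axis=1, keepdims=True)

ss_around_xbar = ((samples - xbar) ** 2).sum(axis=1)  # around the sample mean
ss_around_mu = ((samples - mu) ** 2).sum(axis=1)      # around the true mean

# Averages come out near 9*25 = 225 and 10*25 = 250, respectively, so dividing
# the first sum by n - 1 = 9 recovers the true variance of 25.
print(ss_around_xbar.mean(), ss_around_mu.mean())
```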


What Is This "t-Distribution"?

A central idea in estimation is that

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

is approximately normally distributed, and therefore

$$\Pr\!\left( \left| \bar{X} - \mu \right| \le 1.96\,\sigma / \sqrt{n} \right) \approx 95\%.$$

The problem, of course, is that we don't know σ. We deal with this by cheating, and using s instead. But, if we let S be the random variable that eventually takes the value s,

$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$

isn't quite normally distributed. In particular, if our sample yields an underestimate for σ, T can take values quite a bit more extreme than Z does.

In the simulation below, note how T takes, at times, much more extreme values than Z does.

    50    μ, population mean
    5     σ, population standard deviation
    10    n, sample size

    sample: 55.105, 53.432, 42.415, 60.553, 53.878, 51.465, 49.134, 47.256, 50.426, 52.317
    x̄ = 51.598        s = 4.856
    σ/√n = 1.581       s/√n = 1.536
    Z = 1.011          T = 1.041

    monitored cell              $I$23 (Z)    $J$23 (T)
    mean                        0.000        -0.001
    sample standard deviation   1.000        1.135
    minimum                     -4.701       -12.928
    maximum                     4.891        10.813
    number of simulation runs   1,000,000    1,000,000
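A minimal Python sketch of the same comparison, again an equivalent of the workbook's simulation rather than its actual formulas:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, runs = 50.0, 5.0, 10, 1_000_000

samples = rng.normal(mu, sigma, size=(runs, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)          # sample standard deviation

z = (xbar - mu) / (sigma / np.sqrt(n))   # uses the true sigma
t = (xbar - mu) / (s / np.sqrt(n))       # uses s instead

# Z has standard deviation about 1.00; T's is noticeably larger (about 1.13),
# and T's extremes (min/max) are far wider than Z's, just as in the worksheet.
for name, v in (("Z", z), ("T", t)):
    print(name, v.mean().round(3), v.std().round(3), v.min().round(2), v.max().round(2))
```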
What Is This "t-Distribution"?

Here, we'll again draw a sample of size 10 from a normally-distributed population. This time, the normality of the population distribution is important if the sample size is no larger than a couple of dozen observations; for larger samples, any distribution would yield similar conclusions. (What we need is for X̄ to be roughly normally-distributed, which will be the case if either the individual observations come from a normal distribution, or the sample size is large enough for the Central Limit Theorem to kick in.)

The simulation estimates the left-tail probabilities for the distributions of Z and T.

Notice that these probabilities are all smaller for the distribution of T. This shows us that the upper tail of the distribution of T is "fatter" than the upper tail of the normal distribution, which is what we'd expect from our previous simulation. (Symmetrically, the lower tail of the distribution of T is fatter as well.)

    50    μ, population mean
    5     σ, population standard deviation
    10    n, sample size

    sample: 50.603, 53.250, 53.863, 47.356, 42.877, 50.704, 39.803, 57.888, 49.635, 54.789
    x̄ = 50.077
    σ/√n = 1.581    (x̄ - μ) / (σ/√n) = 0.049
    s/√n = 1.743    (x̄ - μ) / (s/√n) = 0.044

left-tail probabilities for Z (monitored cells $M$5:$R$5)

    standard deviations above 0     0.5        1          1.5        2          2.5        3
    is Z below this? (this draw)    1          1          1          1          1          1
    simulated mean                  69.15%     84.14%     93.30%     97.72%     99.38%     99.87%
    sample standard deviation       46.19%     36.53%     25.00%     14.93%     7.87%      3.62%
    minimum                         0          0          0          0          0          0
    maximum                         1          1          1          1          1          1
    number of simulation runs       1,000,000  1,000,000  1,000,000  1,000,000  1,000,000  1,000,000
    exact, NORMDIST(~,0,1,TRUE)     69.15%     84.13%     93.32%     97.72%     99.38%     99.87%

left-tail probabilities for T (monitored cells $M$6:$R$6)

    standard deviations above 0     0.5        1          1.5        2          2.5        3
    is T below this? (this draw)    1          1          1          1          1          1
    simulated mean                  68.58%     82.84%     91.62%     96.14%     98.30%     99.24%
    sample standard deviation       46.42%     37.70%     27.71%     19.26%     12.94%     8.67%
    minimum                         0          0          0          0          0          0
    maximum                         1          1          1          1          1          1
    number of simulation runs       1,000,000  1,000,000  1,000,000  1,000,000  1,000,000  1,000,000
    exact, 1-TDIST(~,9,TRUE)        68.55%     82.83%     91.61%     96.17%     98.31%     99.25%
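The same left-tail comparison can be sketched in Python; the exact values come from scipy.stats rather than Excel's NORMDIST/TDIST, but they should agree with the worksheet's numbers.

```python
import numpy as np
from scipy.stats import norm, t as t_dist

rng = np.random.default_rng(2)
mu, sigma, n, runs = 50.0, 5.0, 10, 1_000_000
cutoffs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

samples = rng.normal(mu, sigma, size=(runs, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)
z = (xbar - mu) / (sigma / np.sqrt(n))
tt = (xbar - mu) / (s / np.sqrt(n))

# Simulated left-tail frequencies vs. exact normal and t(9) probabilities.
for c in cutoffs:
    print(f"{c:.1f}:  Z {np.mean(z < c):.4f} (exact {norm.cdf(c):.4f})"
          f"   T {np.mean(tt < c):.4f} (exact {t_dist.cdf(c, df=n - 1):.4f})")
```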


What Is This "t-Distribution"?

Excel's TDIST(t,df,TRUE) gives the probability distribution of T. (Somewhat perversely, this Excel function is designed to give right-tail probabilities.)

There is one t-distribution for every different number of degrees of freedom. The number of degrees of freedom in our example is 10-1 = 9.

Notice that the upper tail probability for the normal distribution, 2 or more standard deviations above the mean, is about 2.28%, and therefore going up and down 2 standard deviations captures a bit more than 95% of the distribution. But for the t-distribution with 9 degrees of freedom, the upper tail probability 2 or more standard deviations above the mean is greater than 3%, so going up and down two standard deviations captures less than 95% of the distribution.

To capture 95%, we need to go out more than 2 standard deviations. Indeed, the TINV function shows us that we need to go up and down 2.262 standard deviations.

As the number of degrees of freedom becomes very large, the t-distribution approaches the normal distribution, and our "approximately 2" multiplier approaches 1.96.

    50    μ, population mean
    5     σ, population standard deviation
    10    n, sample size

    sample: 52.812, 47.018, 41.948, 56.293, 53.975, 55.643, 50.251, 51.278, 49.240, 52.370
    x̄ = 51.083
    σ/√n = 1.581    (x̄ - μ) / (σ/√n) = 0.685
    s/√n = 1.354    (x̄ - μ) / (s/√n) = 0.800
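Outside Excel, the same quantities can be computed with scipy.stats (a sketch under the assumption that scipy is available; the workbook itself uses the TDIST and TINV cells):

```python
from scipy.stats import norm, t

df = 9  # 10 observations, one degree of freedom spent on the sample mean

print(norm.sf(2))        # ~0.0228: normal upper tail beyond 2 standard deviations
print(t.sf(2, df))       # ~0.038:  the t(9) upper tail is noticeably fatter

print(norm.ppf(0.975))   # ~1.96:  the familiar normal multiplier
print(t.ppf(0.975, df))  # ~2.262: what Excel's TINV(0.05, 9) reports
```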
(The left-tail probability tables for Z and T from the previous tab's 1,000,000-run simulation are repeated on this tab as well.)
