
Real Statistics Using Excel

Everything you need to do real statistical analysis using Excel

Basic Concepts for ANOVA


We start with the one factor case. We will define the concept of factor elsewhere, but
for now we simply view this type of analysis as an extension of the t tests that are
described in Two Sample t-Test with Equal Variances and Two Sample t-Test with
Unequal Variances. We begin with an example which is an extension of Example 1 of
Two Sample t-Test with Equal Variances.

Example 1: A marketing research firm tests the effectiveness of three new flavorings
for a leading beverage using a sample of 30 people, divided randomly into three groups
of 10 people each. Group 1 tastes flavor 1, group 2 tastes flavor 2 and group 3 tastes
flavor 3. Each person is then given a questionnaire which evaluates how enjoyable the
beverage was. The scores are as in Figure 1. Determine whether there is a perceived
significant difference between the three flavorings.

Figure 1 – Data for Example 1

Our null hypothesis is that any difference between the three flavors is due to chance.

H0: μ1 = μ2 = μ3

We interrupt the analysis of this example to give some background, after which we will
resume the analysis.

Definition 1: Suppose we have k samples, which we will call groups (or treatments);
these are the columns in our analysis (corresponding to the 3 flavors in the above
example). We will use the index j for these. Each group consists of a sample of size nj.
The sample elements are the rows in the analysis. We will use the index i for these.

Suppose the jth group sample consists of the elements xij for i = 1, …, nj, and so the total sample consists of all the elements xij for j = 1, …, k and i = 1, …, nj, with total sample size n = n1 + ⋯ + nk.
We will use the abbreviation x̄j for the mean of the jth group sample (called the group
mean) and x̄ for the mean of the total sample (called the total or grand mean).

Let the sum of squares for the jth group be

SSj = Σi (xij − x̄j)²
We now define the following terms:

SST is the sum of squares for the total sample, i.e. the sum of the squared deviations from the grand mean. SSW is the sum of squares within the groups, i.e. the sum of the squared deviations of the elements from their group means, summed across all the groups. SSB is the sum of squares between the group sample means, i.e. the weighted sum of the squared deviations of the group means from the grand mean:

SST = Σj Σi (xij − x̄)²   SSW = Σj SSj = Σj Σi (xij − x̄j)²   SSB = Σj nj (x̄j − x̄)²
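These three sums of squares can be computed directly from their definitions. Here is a minimal Python sketch (the function name and the sample data in the note below are illustrative, not from the article):

```python
# Sum-of-squares decomposition for one-factor ANOVA.
# The function name and any sample data are illustrative, not from the article.

def sum_squares(groups):
    """Return (SST, SSW, SSB) for a list of groups (lists of numbers)."""
    all_vals = [x for g in groups for x in g]
    grand_mean = sum(all_vals) / len(all_vals)
    # SST: squared deviations of every element from the grand mean
    sst = sum((x - grand_mean) ** 2 for x in all_vals)
    # SSW: squared deviations of each element from its own group mean
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    # SSB: group-size-weighted squared deviations of group means from the grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return sst, ssw, ssb
```

For the two illustrative groups [1, 2, 3] and [2, 3, 4], this returns SST = 5.5, SSW = 4.0 and SSB = 1.5, and indeed 5.5 = 4.0 + 1.5, as expected from SSB = SST − SSW.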

Where n = n1 + ⋯ + nk is the total sample size, we also define the following degrees of freedom

dfT = n − 1   dfB = k − 1   dfW = n − k

Finally we define the mean squares as

MST = SST / dfT   MSB = SSB / dfB   MSW = SSW / dfW

and so dfT = dfB + dfW.
Observation: Clearly MST is the variance for the total sample. MSW is the weighted average of the group sample variances (using the group df as the weights). MSB is the "between sample" variance, i.e. the weighted variance of the group means x̄1, …, x̄k around the grand mean, with the group sizes nj as the weights.

Property 1: If a sample is made as described in Definition 1, with the xij independently and normally distributed and with all σj² equal, then SSW/σ² has a chi-square distribution with dfW degrees of freedom, and, when the null hypothesis holds, SSB/σ² and SST/σ² have chi-square distributions with dfB and dfT degrees of freedom respectively.

Property 2: SST = SSW + SSB

Definition 2: Using the terminology from Definition 1, we define the structural model as follows. First we express the group means in terms of the total mean: μj = μ + αj where αj denotes the effect of the jth group (i.e. the departure of the jth group mean from the total mean). We have a similar expression for the sample: x̄j = x̄ + aj.

The null hypothesis is now equivalent to

H0: αj = 0 for all j

Similarly, we can represent each element in the sample as xij = μ + αj + εij where εij
denotes the error for the ith element in the jth group. As before we have the sample
version xij = x̄ + aj + eij where eij is the counterpart to εij in the sample.

Also εij = xij – (μ + αj) = xij – μj and similarly, eij = xij – (x̄ + aj) = xij – x̄j.

Observation: Since x̄j is the mean of the elements in the jth group, it follows that

Σi eij = Σi (xij − x̄j) = 0

for any j, as well as

Σj Σi eij = 0 and Σj nj aj = Σj nj (x̄j − x̄) = 0

If all the groups are equal in size, say nj = m for all j, then Σj aj = 0, i.e. the mean of the group means is the total mean. Also, in this case, SSB = m · Σj (x̄j − x̄)².
Property 3: SSW = Σj Σi eij², i.e. SSW is the sum of the squared error terms eij.

Observation: Click here for a proof of Properties 1, 2 and 3.

Observation: MSB is a measure of variability of the group means around the total mean.
MSW is a measure of the variability of each group around its mean, and, by Property 3,
can be considered a measure of the total variability due to error. For this reason, we will
sometimes replace MSW, SSW and dfW by MSE, SSE and dfE.

In fact, E[MSW] = σ². If the null hypothesis is true, then αj = 0 for all j, and so E[MSB] = σ², while if the alternative hypothesis is true, then αj ≠ 0 for some j, and so E[MSB] = σ² + Σj nj αj² / (k − 1) > σ².

If the null hypothesis is true then MSW and MSB are both measures of the same error and
so we should expect F = MSB / MSW to be around 1. If the null hypothesis is false we
expect that F > 1 since MSB will estimate the same quantity as MSW plus group effects.

In conclusion, if the null hypothesis is true, and so the population means μj for the k
groups are equal, then any variability of the group means around the total mean is due to
chance and can also be considered error.
Thus the null hypothesis becomes equivalent to H0: σB = σW (or in the one-tail test, H0:
σB ≤ σW). We can therefore use the F-test (see Two Sample Hypothesis Testing of
Variances) to determine whether or not to reject the null hypothesis.

Theorem 1: If a sample is made as described in Definition 1, with the xij independently and normally distributed and with all μj equal and all σj² equal, then

F = MSB / MSW ~ F(dfB, dfW)

Proof: The result follows from Property 1 and Theorem 1 of F Distribution.

Example 1 (continued): We now resume our analysis of Example 1 by calculating F and testing it as in Theorem 1.

Figure 2 – ANOVA for Example 1

Under the null hypothesis, the three group means are equal, and as we can see from Figure 2, the group variances are roughly the same. Thus we can apply Theorem 1. To calculate F we first calculate SSB and SSW. Per Definition 1, SSW is the sum of the group SSj (located in cells J7:J9). E.g. SS1 (in cell J7) can be calculated by the formula =DEVSQ(A4:A13). SSW (in cell F14) can therefore be calculated by the formula =SUM(J7:J9).

The formula =DEVSQ(A4:C13) can be used to calculate SST (in cell F15), and then per Property 2, SSB = SST – SSW = 492.8 – 415.4 = 77.4. By Definition 1, dfT = n – 1 = 30 – 1 = 29, dfB = k – 1 = 3 – 1 = 2 and dfW = n – k = 30 – 3 = 27. Each SS value can be divided by the corresponding df value to obtain the MS values in cells H13:H15. F is then MSB / MSW = 38.7/15.4 = 2.5. We now test F as we did in Two Sample Hypothesis Testing of Variances, namely:

p-value = FDIST(F, dfB, dfW) = FDIST(2.5, 2, 27) = .099596 > .05 = α


Fcrit = FINV(α, dfB, dfW) = FINV(.05, 2, 27) = 3.35 > 2.5 = F

Either of these shows that we can’t reject the null hypothesis that all the means are
equal.

As explained above, the null hypothesis can be expressed by H0: σB ≤ σW, and so the
appropriate F test is a one-tail test, which is exactly what FDIST and FINV provide.
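The whole computation of the test statistic can be assembled from the pieces above. Here is a hedged Python sketch of the F calculation; the p-value step (FDIST in Excel) is omitted, and the function name and any sample data are illustrative, not from the article:

```python
# One-way ANOVA F statistic F = MSB / MSW, computed from the definitions.
# The function name and any sample data are illustrative.

def anova_f(groups):
    k = len(groups)
    all_vals = [x for g in groups for x in g]
    n = len(all_vals)
    grand = sum(all_vals) / n
    # between- and within-group sums of squares
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    msb = ssb / (k - 1)   # between-groups mean square, df = k − 1
    msw = ssw / (n - k)   # within-groups mean square, df = n − k
    return msb / msw
```

For the two illustrative groups [1, 2, 3] and [2, 3, 4], this gives F = 1.5; in practice the resulting F would then be compared with the critical value from the F(dfB, dfW) distribution, as in the example above.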

We can also calculate SSB as the sum of the squared deviations of the group means from the grand mean, where each squared deviation is weighted by the group's size. Since all the groups have the same size this can be expressed as =DEVSQ(H7:H9)*F7.

SSB can also be calculated as =DEVSQ(G7:G9)/F7. This works as long as all the groups have the same size.

Excel Data Analysis Tool: Excel’s Anova: Single Factor data analysis tool can also be
used to perform analysis of variance. We show the output for this tool in Example 2
below.

The Real Statistics Resource Pack also contains a similar supplemental data analysis
tool which provides additional information. We show how to use this tool in Example 1
of Confidence Interval for ANOVA.

Example 2: A school district uses four different methods of teaching their students how
to read and wants to find out if there is any significant difference between the reading
scores achieved using the four methods. It creates a sample of 8 students for each of the
four methods. The reading scores achieved by the participants in each group are as
follows:

Figure 3 – Data and output from Anova: Single Factor data analysis tool
This time the p-value = .04466 < .05 = α, and so we reject the null hypothesis and conclude that there are significant differences between the methods (i.e. not all four methods have the same mean).

Note that although the variances are not the same, as we will see shortly, they are close
enough to use ANOVA.

Observation: We next review some of the concepts described in Definition 2 using Example 2.

Figure 4 – Error terms for Example 2

From Figure 4, we see that

 x̄ = total mean = AVERAGE(B4:E11) = 72.03 (cell F12)
 mean of the group means = AVERAGE(B12:E12) = 72.03 = total mean
 mean of the error terms eij = 0 (cell F13)
 mean of the error terms eij within the jth group = 0 for all j (cells H12 through K12)

We also observe that Var(e) = VAR(H4:K11) = 162.12, and so by Property 3,

SSW = Σj Σi eij² = Var(e) · (n − 1) = 162.12 · 31 = 5,025.7

and so

MSW = SSW / dfW = 5,025.7 / 28 = 179.5

which agrees with the value given in Figure 3.

Observation: In both ANOVA examples, all the group sizes were equal. This doesn’t
have to be the case, as we see from the following example.
Example 3: Repeat the analysis for Example 2, where the last participant in group 1 and the last two participants in group 4 left the study before their reading tests were recorded.

Figure 5 – Data and analysis for Example 3

Using Excel's data analysis tool we see that p-value = .07276 > .05, and so we cannot reject the null hypothesis; i.e. we cannot conclude that there is a significant difference between the means of the four methods.

Observation: MSW can also be calculated as a generalized version of Theorem 1 of Two Sample t-Test with Equal Variances. There we had the pooled variance

s² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)

Generalizing this, we have

MSW = Σj (nj − 1)sj² / (n − k)

From Figure 6, we see that we obtain a value for MSW in Example 3 of 177.1655, which
is the same value that we obtained in Figure 5.
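This pooled-variance form of MSW is easy to check numerically. The following is a minimal Python sketch (the function name and any sample data are illustrative, not from the article):

```python
# MSW as the df-weighted average of the group sample variances,
# generalizing the two-sample pooled variance. Names and data are illustrative.

def pooled_msw(groups):
    def sample_var(g):
        m = sum(g) / len(g)
        return sum((x - m) ** 2 for x in g) / (len(g) - 1)
    num = sum((len(g) - 1) * sample_var(g) for g in groups)
    den = sum(len(g) - 1 for g in groups)   # = n − k = dfW
    return num / den
```

For the two illustrative groups [1, 2, 3] and [2, 4, 6] (sample variances 1 and 4, each with df = 2), this gives MSW = (2·1 + 2·4)/4 = 2.5.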

Figure 6 – Alternative calculation of MSW


Observation: As we did in Example 1 we can calculate SSB as SSB = SST – SSW. We now show an alternative way of calculating SSB for Example 3.

Figure 7 – Alternative calculation of SSB

We first find the total mean (the value in cell P10 of Figure 7), which can be calculated
either as =AVERAGE(A4:D11) from Figure 5 or =SUMPRODUCT(O6:O9,P6:P9)/O10
from Figure 7. We then calculate the square of the deviation of each group mean from
the total mean. E.g. for group 1, this value (located in cell Q6) is given by =(P6-P10)^2.
Finally, SSB can now be calculated as =SUMPRODUCT(O6:O9,Q6:Q9).
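The SUMPRODUCT approach of Figure 7 can be sketched in Python as follows; the group sizes play the role of O6:O9 and the group means of P6:P9 (the function name and any sample data are illustrative, not from the article):

```python
# SSB via group sizes and group means, valid for unequal group sizes.
# The function name and any sample data are illustrative.

def ssb_weighted(groups):
    sizes = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    # grand mean as the size-weighted average of the group means
    # (the SUMPRODUCT(O6:O9,P6:P9)/O10 step)
    grand = sum(s * m for s, m in zip(sizes, means)) / sum(sizes)
    # the SUMPRODUCT(O6:O9,Q6:Q9) step
    return sum(s * (m - grand) ** 2 for s, m in zip(sizes, means))
```

For the illustrative unequal-size groups [1, 2, 3] and [5, 7], the weighted grand mean is 3.6 and SSB ≈ 19.2.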

Real Statistics Functions: The Real Statistics Resource Pack contains the following
supplemental functions for the data in range R1:

SSW(R1, b) = SSW   dfW(R1, b) = dfW   MSW(R1, b) = MSW
SSBet(R1, b) = SSB   dfBet(R1, b) = dfB   MSBet(R1, b) = MSB
SSTot(R1) = SST   dfTot(R1) = dfT   MSTot(R1) = MST
ANOVA(R1, b) = F = MSB / MSW   ATEST(R1, b) = p-value

Here b is an optional argument. When b = True (default) then the columns denote the
groups/treatments, while when b = False, the rows denote the groups/treatments. This
argument is not relevant for SSTot, dfTot and MSTot (since the result is the same in
either case).

These functions ignore any empty or non-numeric cells.
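When porting these calculations outside Excel, the same skipping of empty or non-numeric cells must be done explicitly. Here is a Python sketch (the exact filtering rule used by the Resource Pack is an assumption; the function name is illustrative):

```python
# Drop empty or non-numeric cells from each column before computing ANOVA terms.
# The exact filtering rule used by the Real Statistics functions is an assumption.

def clean_groups(columns):
    return [
        [x for x in col if isinstance(x, (int, float)) and not isinstance(x, bool)]
        for col in columns
    ]
```

For example, clean_groups([[1, 2, None], [3, "", 4.5]]) returns [[1, 2], [3, 4.5]], after which the group sizes nj, and hence dfW = n − k, reflect only the recorded scores.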

For example, for the data in Example 3, MSW(A4:D11) = 177.165 and ATEST(A4:D11) = 0.07276 (referring to Figure 5).

Real Statistics Data Analysis Tool: As mentioned above, the Real Statistics Resource
Pack also contains the Single Factor Anova and Follow-up Tests data analysis tool
which is illustrated in Examples 1 and 2 of Confidence Interval for ANOVA.
