
# The One-Way Analysis of Variance (ANOVA) Process for STAT 461 Students at Penn State University

## Audience and Purpose

The audience for this document is students majoring in Statistics at Penn State University who will be taking STAT 461, the ANOVA class offered at Penn State. This class is fundamental for any student in the Applied Statistics option of the statistics major. The purpose of this document is to help students understand the statistics behind what the ANOVA process does and why, so that they learn not only the right steps to get an answer but also the reason for each step. Because the purpose of the paper is not to teach students how to carry out the test, I won't explain how to perform each step; that would be more of an instruction set. Instead, I will explain how each step helps in reaching a statistical conclusion.

## Scope
This document will describe how the ANOVA process works, why each step is taken, and how each step helps us come to a sound statistical conclusion. Students need to understand what each step does and why it is taken so that they are better able to identify which situations call for an ANOVA test.

## Introduction
The ANOVA process is used by statisticians to compare the means of a response across several groups. It has its roots in agricultural research, mainly in testing the effects of different treatments on crops. A one-way ANOVA is a statistical procedure for comparing the means of multiple treatment levels. It rests on three assumptions: the variation within each treatment level is the same, the observations within each treatment level are normally distributed, and the data points are gathered randomly and independently. What is being tested is whether the means of all the treatment levels are the same. The conclusion drawn from the test is whether or not at least one treatment differs from the others.

## Process
The main idea behind the ANOVA test is the difference between within-group variation and between-group variation. Within-group variation describes how much the data points inside each treatment level vary around that treatment's own mean. Between-group variation describes how much the treatment levels, taken as groups, vary from one another. Knowing this is very important when an ANOVA test is conducted, because the ratio of between-group variation to within-group variation is what drives the final conclusion.
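One way to see where the two kinds of variation come from is to split each data point's distance from the overall mean into two pieces. The notation is an assumption made for this sketch: $y_{ij}$ is the $j$-th data point in treatment level $i$, $\bar{y}_i$ is the mean of treatment level $i$, and $\bar{y}$ is the mean of the entire data set.

$$
y_{ij} - \bar{y} \;=\; \underbrace{(\bar{y}_i - \bar{y})}_{\text{between group}} \;+\; \underbrace{(y_{ij} - \bar{y}_i)}_{\text{within group}}
$$

The first piece is the same for every data point in a treatment level, while the second piece changes from point to point inside that level.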

### Step 1
The first step of an ANOVA test is to state the null and alternative hypotheses. These hypotheses are important because they show what is being tested. The null hypothesis is the same for every one-way ANOVA: all of the treatment means are equal. The null hypothesis is what is assumed to be true before the test, so it is natural to start by supposing that the treatments make no difference. The alternative hypothesis is that at least one of the means is different from another mean. This is the logical opposite of the null hypothesis, and it is the claim the test is trying to find evidence for: if the data are inconsistent with the null hypothesis, the null is rejected in favor of the alternative. The alternative only requires one mean to differ because the reason for running an ANOVA is to find out whether any treatment stands apart, so a single differing mean is enough for a significant result. This step is normally written as:

Image source: http://forrest.psych.unc.edu/research/vista-frames/help/lecturenotes/lecture10/anova6.html
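In standard notation, and assuming $k$ treatment levels with population means $\mu_1, \dots, \mu_k$ (symbols chosen here for illustration), the two hypotheses can be sketched as:

$$
\begin{aligned}
H_0 &: \mu_1 = \mu_2 = \cdots = \mu_k \\
H_a &: \mu_i \neq \mu_j \ \text{for at least one pair } i \neq j
\end{aligned}
$$

Note that $H_a$ does not say which mean differs or how many do; it only says the means are not all equal.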

### Step 2
The second step of an ANOVA test is to calculate the F statistic for the data, which is used to make the decision on the null hypothesis. Several quantities need to be calculated before the F statistic can be.

The first is the Sum of Squares of the Treatment (SST), which measures the between-group variation. This is the number that gets the most attention because it shows how much variability there is between the means of the groups, which helps us decide whether or not the means are different. The SST measures between-group variation because it compares each group mean to the mean of the entire data set.

The next is the Sum of Squares of the Error (SSE), which measures the within-group variation. It does so because it compares every data point within a treatment to the mean of that treatment, showing how much each data point varies from its own treatment mean.

After the SSE and SST are calculated, the Sum of Squares Total (SSTotal) follows easily because it is just the sum of the two. This makes sense because the total variation in a data set is made up of only the within-group variation and the between-group variation; there is no other type of variation present.

Finally, the degrees of freedom for the SSE and SST need to be calculated. Degrees of freedom should be a familiar concept for you. For the SST, the degrees of freedom are the number of treatment levels (k) minus 1, because the SST is built from k values (one sample mean per treatment level) that are each compared to the single overall mean. For the SSE, the degrees of freedom are the total number of data points in the data set minus the number of treatment levels. This is because each treatment contributes one less than its sample size, and adding these together across treatments gives that formula.
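The quantities described above can be sketched as formulas. The notation is assumed for illustration: $k$ is the number of treatment levels, $n_i$ is the number of data points in treatment level $i$, $N$ is the total number of data points, $y_{ij}$ is the $j$-th data point in treatment level $i$, $\bar{y}_i$ is the mean of treatment level $i$, and $\bar{y}$ is the mean of the entire data set.

$$
\begin{aligned}
\text{SST} &= \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2, & df_{\text{SST}} &= k - 1 \\
\text{SSE} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, & df_{\text{SSE}} &= \sum_{i=1}^{k} (n_i - 1) = N - k \\
\text{SSTotal} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \text{SST} + \text{SSE}
\end{aligned}
$$

The two degrees of freedom add up to $N - 1$, which is the degrees of freedom of the SSTotal.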

The next calculations are the Mean Squared Error (MSE) and the Mean Squared Treatment (MST). The sums of squares cannot be compared directly because they are built from different numbers of terms: the SSE adds up a squared deviation for every data point, while the SST is based on only one mean per treatment. Dividing the SST and SSE by their respective degrees of freedom puts both on the same scale, an average amount of variation per degree of freedom, and removes the inflation that comes from simply having more data points or more treatments. Once the MSE and MST are calculated, it is finally time to calculate the F statistic, which is used to determine whether or not the differences in means are significant. The F statistic is calculated by dividing the MST by the MSE. This gives the ratio of between-group variation to within-group variation. If the null hypothesis is true, both mean squares estimate the same variance and F should be close to 1; an F well above 1 suggests the group means differ by more than chance alone would explain. The F statistic is then compared to an F critical value, which is explained in the next section. All of these values are then put into a table that looks like the table below.

Image source: http://www.mesacc.edu/~derdy61351/230_web_book/module4/anova/index.html
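To make the pieces of the ANOVA table concrete, here is a minimal Python sketch that computes them from scratch using only the standard library. The data set, the treatment labels, and the function name `anova_table` are all made up for illustration; they are not part of the course material.

```python
from statistics import mean

# Hypothetical crop yields for three fertilizer treatments (made-up numbers).
yields = {
    "A": [4, 5, 6],
    "B": [6, 7, 8],
    "C": [9, 10, 11],
}


def anova_table(groups):
    """Compute the entries of a one-way ANOVA table from a dict of samples."""
    all_points = [y for sample in groups.values() for y in sample]
    grand_mean = mean(all_points)
    k = len(groups)            # number of treatment levels
    n_total = len(all_points)  # total number of data points

    # Between-group variation: each treatment mean vs. the overall mean,
    # counted once for every data point in that treatment.
    sst = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
    # Within-group variation: each data point vs. its own treatment mean.
    sse = sum((y - mean(s)) ** 2 for s in groups.values() for y in s)

    df_treatment = k - 1
    df_error = n_total - k
    mst = sst / df_treatment   # mean square for treatments
    mse = sse / df_error       # mean square for error
    return {
        "SST": sst, "SSE": sse, "SSTotal": sst + sse,
        "df_treatment": df_treatment, "df_error": df_error,
        "MST": mst, "MSE": mse, "F": mst / mse,
    }


table = anova_table(yields)
print(f"{'Source':<10}{'SS':>8}{'df':>4}{'MS':>8}{'F':>8}")
print(f"{'Treatment':<10}{table['SST']:>8.2f}{table['df_treatment']:>4}"
      f"{table['MST']:>8.2f}{table['F']:>8.2f}")
print(f"{'Error':<10}{table['SSE']:>8.2f}{table['df_error']:>4}"
      f"{table['MSE']:>8.2f}")
print(f"{'Total':<10}{table['SSTotal']:>8.2f}"
      f"{table['df_treatment'] + table['df_error']:>4}")
```

For these made-up numbers the treatment means are 5, 7, and 10, which gives SST = 38 on 2 degrees of freedom, SSE = 6 on 6 degrees of freedom, and F = 19 / 1 = 19.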

### Step 3
The next step is to find an F critical value. This is done by using an F table. The F table takes into account the degrees of freedom of the SSE and SST as well as the significance level of the test. The significance level of a test should be familiar to you. It is chosen by the statistician and reflects how much risk of wrongly rejecting a true null hypothesis the researcher is willing to accept. Common significance levels are .05 and .01. Once the degrees of freedom for both SSE and SST are known along with the significance level, an F table such as the one below is used to look up the F critical value.

The F critical value is the value of F at which the ratio of the MST to the MSE becomes significant at those specific degrees of freedom and that significance level. The significance level matters because the smaller it is, meaning the more confidence the test demands, the higher the F critical value will be, since it takes more evidence to declare the null hypothesis inconsistent with the data. The degrees of freedom matter because they take into account the size of the data set and the number of treatments. The larger a sample is, the more precisely the variances are estimated, so a smaller F ratio is enough to be convincing. In general, the larger the degrees of freedom are, the smaller the F critical values become.

Image source: http://mips.stanford.edu/courses/stats_data_analsys/lesson_5/234_7_n.html
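Software can stand in for a printed F table. The following is a minimal sketch, assuming SciPy is installed; it uses `scipy.stats.f.ppf`, the inverse cumulative distribution function of the F distribution. The helper name `f_critical` and the degrees of freedom shown are chosen here purely for illustration.

```python
from scipy.stats import f


def f_critical(alpha, df_treatment, df_error):
    """Value of F that cuts off an upper tail of area alpha."""
    # ppf is the inverse CDF, so 1 - alpha leaves alpha in the upper tail.
    return f.ppf(1 - alpha, df_treatment, df_error)


# A smaller significance level pushes the critical value up.
print(f"alpha=.05, df=(2, 6):  {f_critical(0.05, 2, 6):.2f}")   # 5.14
print(f"alpha=.01, df=(2, 6):  {f_critical(0.01, 2, 6):.2f}")   # 10.92

# More error degrees of freedom pull the critical value down.
print(f"alpha=.05, df=(2, 60): {f_critical(0.05, 2, 60):.2f}")  # 3.15
```

The three printed values show both effects described above: tightening the significance level from .05 to .01 roughly doubles the critical value here, while increasing the error degrees of freedom from 6 to 60 lowers it.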

### Step 4
The last step is to determine whether or not the test has given significant results. This is the easy part of the analysis. All that has to be done is to compare the calculated F value to the critical F value. If the calculated F is greater than the critical F, then the result is significant and the null hypothesis can be rejected. This means that at least one mean is different from the others. If the calculated F is less than the critical F, then the result is not significant and the null hypothesis is not rejected. This is because the F critical value marks the point beyond which the calculated results are no longer consistent with the null hypothesis. Any calculated F greater than the F critical value is therefore inconsistent with the null hypothesis, and the differences among the means are considered statistically significant.
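The comparison in this step can be sketched in a few lines of Python, assuming SciPy is installed. `scipy.stats.f_oneway` runs a one-way ANOVA and returns the F statistic together with a p-value. The samples and the significance level below are made up for illustration.

```python
from scipy.stats import f, f_oneway

# Hypothetical measurements for three treatment levels (made-up numbers).
samples = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
alpha = 0.01  # significance level chosen for this illustration

# f_oneway returns the calculated F statistic and its p-value.
f_stat, p_value = f_oneway(*samples)

# Degrees of freedom: treatments minus 1, and data points minus treatments.
df_treatment = len(samples) - 1
df_error = sum(len(s) for s in samples) - len(samples)
f_crit = f.ppf(1 - alpha, df_treatment, df_error)

# Reject the null hypothesis when the calculated F exceeds the critical F.
reject_null = f_stat > f_crit
print(f"F = {f_stat:.2f}, F critical = {f_crit:.2f}, p = {p_value:.4f}")
print("Reject the null hypothesis" if reject_null
      else "Do not reject the null hypothesis")
```

Comparing the calculated F to the critical F gives the same decision as comparing the p-value to the significance level: F is greater than the critical value exactly when the p-value is smaller than alpha. Rejecting the null here says only that the means are not all equal, not which treatment is the different one.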

## Conclusion
The main idea of a one-way ANOVA test is to test whether or not different treatments lead to different mean responses in a set of data. It is called an Analysis of Variance test because it compares the variance within the treatment levels to the variance between the treatment levels to see if the result is significant. Finding out whether or not treatment means differ is very important in many industries, such as agriculture, production, and medicine. Hopefully, with this new understanding of how the ANOVA process works, you will be able to use this testing process better and understand the results that the test gives.
