
PAIRED T-TESTS FOR COMPARING RESULTS: A tutorial on how to do them in Excel

Dr Michael Madden, National University of Ireland, Galway, 2003 & 2005
http://www.it.nuigalway.ie/m_madden

It is essential in Machine Learning to be able to compare results from different algorithms or variations statistically, to decide which is best for a given application. Here are the results of applying three algorithms to the same dataset, where the dataset was divided 10 times into training and testing sets (perhaps independently subsampled from a very large dataset). The key point is that we assume all training/test sets are independent of each other, and that the same data was used to train and test each algorithm, so the results are paired. Each result is a statistic such as percentage accuracy on the test set.

Notes:
1) If the independence assumption is false (e.g. repeated splits of all the data into training and testing sets, or cross-validation), use a corrected resampled t-test (not covered here).
2) If you have a set of pre-computed differences, e.g. from multiple cross-validation runs averaged over repeated folds, see "When Differences Have Been Pre-Computed" below.

Run    1       2       3       4       5       6       7       8       9       10
Alg1   95.108  94.073  95.014  94.073  95.202  94.638  95.108  95.202  95.014  94.544
Alg2   96.707  95.767  94.920  96.049  96.143  96.613  95.673  96.896  96.990  96.425
Alg3   96.143  95.767  95.296  95.767  95.767  96.613  95.861  96.707  96.049  95.955

We want to see whether any of these result sets is better than any of the others. To do this, we will do pairwise comparisons of them, giving three pairs. Comparing two result sets, our null hypothesis is that there's no statistically significant difference between them. There's an elaborate way and a simpler way to do this. Before we start, here's what the raw results look like. How do you think they compare?
[Chart: test-set accuracy (%) of Alg1, Alg2 and Alg3 on each of the 10 runs; y-axis from 93.0 to 97.5]

The Elaborate Way


We'll compare the result sets for Alg1 and Alg2. You may need to install the Analysis ToolPak for this to work, under Tools - Add-Ins. Using Tools - Data Analysis, select "t-test: Paired two sample for means". Select the variable ranges, set the hypothesized mean difference to 0, set alpha to one minus the desired confidence level (95% => alpha=0.05), and select where you want the output to appear. (Note: if you change the data you have to repeat the procedure; the table is not updated automatically.)
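For reference, the statistic that Excel computes in this procedure is the standard paired t (this formula is not shown in the Excel output, but it is the textbook definition): given n paired results with per-pair differences d_i,

    t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad \mathrm{df} = n - 1,

where \bar{d} is the mean of the differences and s_d is their sample standard deviation. For Alg1 vs Alg2 below, \bar{d} = 94.7976 - 96.2183 = -1.4207 and s_d ≈ 0.713, giving t ≈ -6.302 on 9 degrees of freedom.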

The result of this covers several lines (my annotations follow the arrows):

t-Test: Paired Two Sample for Means

                              Variable 1   Variable 2
Mean                          94.7976      96.2183
Variance                      0.194812     0.412564
Observations                  10           10
Pearson Correlation           0.17494
Hypothesized Mean Difference  0
df                            9             <- Degrees of freedom
t Stat                        -6.302187     <- t-statistic
P(T<=t) one-tail              7.03E-05
t Critical one-tail           1.833113
P(T<=t) two-tail              0.000141      <- Note 2
t Critical two-tail           2.262157      <- Note 1

Note 1: We use a two-tailed test if either set of results might be better. If |t Stat| > t Critical, we reject the null hypothesis that the means are the same, and conclude that the difference between the algorithms is significant. Conclusion: these two sets of results are not equally good, at the 95% confidence level.

Note 2: This is the value of alpha at which the hypothesis would be rejected. Thus, our conclusion holds up to a confidence level of 1 - P(T<=t) = 99.986%.
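As a cross-check, here is a minimal sketch (mine, not part of the original tutorial; it assumes NumPy and SciPy are available) that reproduces the key numbers in the table above for the Alg1 vs Alg2 comparison:

    # Reproduce Excel's "t-Test: Paired Two Sample for Means" for Alg1 vs Alg2.
    import numpy as np
    from scipy import stats

    alg1 = np.array([95.108, 94.073, 95.014, 94.073, 95.202,
                     94.638, 95.108, 95.202, 95.014, 94.544])
    alg2 = np.array([96.707, 95.767, 94.920, 96.049, 96.143,
                     96.613, 95.673, 96.896, 96.990, 96.425])

    df = len(alg1) - 1                                 # degrees of freedom: n - 1 = 9
    t_stat, p_two_tail = stats.ttest_rel(alg1, alg2)   # paired t-test; p is two-tailed

    print(f"t Stat              {t_stat:.6f}")                    # ~ -6.302187
    print(f"P(T<=t) one-tail    {p_two_tail / 2:.2e}")            # ~ 7.03e-05
    print(f"t Critical one-tail {stats.t.ppf(1 - 0.05, df):.6f}") # ~ 1.833113
    print(f"P(T<=t) two-tail    {p_two_tail:.6f}")                # ~ 0.000141
    print(f"t Critical two-tail {stats.t.ppf(0.975, df):.6f}")    # ~ 2.262157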

The Simpler Way


Excel has a TTEST worksheet function that gives us the bottom line from the elaborate way: P(T<=t), the significance level up to which the hypothesis holds. It has four parameters: (1) the range of the first result set; (2) the range of the second result set; (3) the number of tails; (4) the type [paired = 1]. Another way of thinking of the p-value, P(T<=t), is that it is the probability of observing a difference at least as large as that in the samples when no such difference exists in the populations from which the samples are drawn.

We can use this to test each pair of result sets:

Pair   Set 1   Set 2   P(T<=t)   Confidence
1      Alg1    Alg2    0.014%    99.986%
2      Alg2    Alg3    9.812%    90.188%
3      Alg1    Alg3    0.007%    99.993%

The formula here is: =TTEST(D17:D26,E17:E26,2,1)
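(In Excel 2010 and later this function is named T.TEST, with the same four arguments; the old TTEST name is kept for compatibility.)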

We have to decide on our confidence level. 95% is most commonly used, and 99% is also popular. 95% corresponds to a p-value threshold of 5%, meaning that we accept a 1 in 20 chance of incorrectly rejecting the null hypothesis. If the calculated p-value is lower than this threshold, we reject the null hypothesis.
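To make the decision rule concrete, here is a short sketch (mine, not from the tutorial) that runs all three paired comparisons from the table above and applies the 5% threshold:

    # Apply the 95%-confidence decision rule to all three pairwise comparisons.
    import numpy as np
    from scipy import stats

    results = {
        "Alg1": np.array([95.108, 94.073, 95.014, 94.073, 95.202,
                          94.638, 95.108, 95.202, 95.014, 94.544]),
        "Alg2": np.array([96.707, 95.767, 94.920, 96.049, 96.143,
                          96.613, 95.673, 96.896, 96.990, 96.425]),
        "Alg3": np.array([96.143, 95.767, 95.296, 95.767, 95.767,
                          96.613, 95.861, 96.707, 96.049, 95.955]),
    }

    alpha = 0.05  # 95% confidence level
    for a, b in [("Alg1", "Alg2"), ("Alg2", "Alg3"), ("Alg1", "Alg3")]:
        _, p = stats.ttest_rel(results[a], results[b])  # two-tailed, paired
        verdict = "significant difference" if p < alpha else "no significant difference"
        print(f"{a} vs {b}: p = {p:.5f} -> {verdict}")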

Note 3: Excel's TTEST formula can return a #DIV/0! error (division by zero) in two situations: 1) both algorithms have identical results [no difference]; 2) an extremely large difference between the results [infinity]. Obviously, these two situations are very different from each other! If you get a #DIV/0! result, you should examine the data to determine whether it is evidence that the null hypothesis should be accepted or rejected.
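One quick way to examine the data in that case is to look directly at the per-run differences, since the test breaks down exactly when those differences have zero variance. A sketch (mine, with hypothetical data, not from the tutorial):

    # Diagnose a #DIV/0 from TTEST by inspecting the paired differences.
    import numpy as np

    # Hypothetical results where Alg B beats Alg A by exactly 1.0 on every run:
    alg_a = np.array([95.0, 94.0, 96.0])
    alg_b = alg_a + 1.0

    diffs = alg_a - alg_b
    if np.isclose(diffs.std(ddof=1), 0.0):   # zero variance => Excel's #DIV/0
        if np.allclose(diffs, 0.0):
            print("All differences are 0: identical results [no difference].")
        else:
            print("Constant non-zero difference: one algorithm wins every run [infinity].")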

Conclusions:
On this particular dataset, at a confidence level of 95%, from the table above:
- Alg1 and Alg2 are not equally good
- Alg1 and Alg3 are not equally good
- Alg2 and Alg3 are equally good

When Differences Have Been Pre-Computed


In some situations, you'll have a single list of numbers representing the individual measured differences between two algorithms on a single dataset. In particular, this arises when you have performed 10 runs of 10-fold cross-validation. Statistically, this is no problem; in fact, when performing a paired t-test, the standard procedure is to calculate the differences between pairs of observations and perform the test on those differences, not on the original observations.

Unfortunately, Excel does not provide a function for a one-sample t-test. However, since we are really testing that the differences between paired observations have a mean of 0, we can construct a column of zeros and perform a paired test between it and the list of computed differences. Below, I have a column of zeros and three columns of differences between the results from the three algorithms (Alg1 - Alg2, etc.).

Zeros   Alg1-Alg2   Alg2-Alg3   Alg1-Alg3
0       -1.599       0.564      -1.035
0       -1.694       0.000      -1.694
0        0.094      -0.376      -0.282
0       -1.976       0.282      -1.694
0       -0.941       0.376      -0.565
0       -1.975       0.000      -1.975
0       -0.565      -0.188      -0.753
0       -1.694       0.189      -1.505
0       -1.976       0.941      -1.035
0       -1.881       0.470      -1.411

Now we can perform a t-test comparing the zeros to each of the sets of differences:

Pair   Set         P(T<=t)   Confidence
1      Alg1-Alg2   0.014%    99.986%
2      Alg2-Alg3   9.812%    90.188%
3      Alg1-Alg3   0.007%    99.993%

The formula here is: =TTEST(D17:D26,E17:E26,2,1)

These results are exactly the same as before, so the same conclusions are drawn. If you do the test 'the elaborate way', you of course get the same t-statistic results. However, the individual variable statistics (mean, variance) will be different, since the raw numbers are different, and the Pearson Correlation figure will be undefined, since the column of zeros gives a divide-by-zero error.
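Outside Excel the column-of-zeros workaround is unnecessary; as a final sketch (mine, not the tutorial's), a one-sample t-test on the differences gives exactly the same answer:

    # A paired t-test is just a one-sample t-test on the differences vs mean 0.
    import numpy as np
    from scipy import stats

    diffs = np.array([-1.599, -1.694, 0.094, -1.976, -0.941,
                      -1.975, -0.565, -1.694, -1.976, -1.881])  # Alg1 - Alg2

    t1, p1 = stats.ttest_1samp(diffs, popmean=0.0)          # one-sample test vs 0
    t2, p2 = stats.ttest_rel(diffs, np.zeros_like(diffs))   # column-of-zeros trick

    print(p1, p2)   # both ~ 0.00014, i.e. confidence ~ 99.986%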
