
Pre-print version. Full article published as:


Perezgonzalez, J. D. (2014). A reconceptualization of significance testing. Theory &
Psychology, 24(6), 852-859. doi:10.1177/0959354314546157

A Reconceptualization of Significance Testing

Jose D. Perezgonzalez
Massey University, New Zealand

Author's biography:
Dr. Jose Perezgonzalez is a lecturer at Massey University's School of Aviation in New
Zealand. His research interests encompass Human Factors and Psychology, including
sensemaking, research methods, and statistics. Address: School of Aviation, Massey
University, SST8.18, Manawatu Campus, P.O. Box 11-222, Palmerston North 4442,
New Zealand. Phone: +6463505326, E-mail: j.d.perezgonzalez@massey.ac.nz


Abstract

Significance testing has been controversial since Neyman and Pearson published their
procedure for testing statistical hypotheses. Fisher, who popularized tests of significance,
was the first to notice the emerging confusion between that procedure and his own, yet he
could not stop their hybridization into what is nowadays known as Null Hypothesis
Significance Testing (NHST). Here I hypothesize why similar attempts to clarify matters
have also failed: namely, because both procedures are designed to be confused. Their names
may not match their purpose; both use null hypotheses and levels of significance, yet for
different goals; and p-values, errors, alternative hypotheses, and significance properly apply
to only one procedure yet are commonly used with both. I also propose a
reconceptualization of the procedures to prevent further confusion.

Keywords: Significance testing; Null Hypothesis Significance Testing; NHST; Statistical
hypotheses; Fisher; Neyman-Pearson


A Reconceptualization of Significance Testing

Hager's (2013) article on the statistical theories of Fisher and of Neyman and Pearson is the
latest in a string of exhortations trying to sort out the confusion between those two main
theories for testing research data. Hager starts his abstract with "Most of the debates around
statistical testing suffer from a failure to identify clearly the features specific to the theories
invented by Fisher and by Neyman and Pearson" (p. 251). However, there is evidence in the
literature that those specific features have been covered, most clearly by Hubbard (2004),
Gigerenzer (2004), Louçã (2008), and, of course, Fisher (1925, 1955, 1959, 1960) and
Neyman and Pearson (1928, 1933). One particular point of confusion between the theories,
that between p-values and Type I errors, has also been covered by Hubbard (2004, 2011),
Goodman (1999), and Christensen (2005), among others. To Hager's credit, his article is the
best to date at exhaustively identifying those specific features.
Furthermore, historical analyses of the hybridization of Fisher's and Neyman-Pearson's
approaches have been carried out by Halpin and Stam (2006), Huberty (1993), Hubbard
(2004), and Johnstone (1986). And, among proposed solutions, there are mathematical
bridges between both approaches, such as Berger's (2003) proposal of conditioning Neyman-
Pearson's error probabilities on Fisher's strength of evidence, and Schweder's (1988)
proposal of a Fisher-like significance version of Neyman-Pearson's tests; philosophical
bridges, such as Lehmann's (1993) attempt at a unified theory, and Jones and Tukey's (2000)
proposal of testing three alternative decisions rather than null hypotheses; exhortations to
give up testing in favor of confidence intervals (Gigerenzer, 2004), meta-analysis (Schmidt,
1992), or Bayesian hypothesis testing (Hubbard & Bayarri, 2003); and exhortations to better
statistical education provided by statisticians (Hubbard, 2011). So far, most solutions have
been unable to resolve the confusion inherent in significance testing.
An interesting observation when studying this topic is that the confusion between the two
approaches is as old as Neyman-Pearson's theory itself. Fisher was the first to criticize such
confusion (most clearly in 1955), doing so even before Lindquist's textbook hybridizing both
theories into the Null Hypothesis Significance Testing (NHST) procedure was printed back
in 1940 (Halpin & Stam, 2006); yet he could not stop the hybridization wave.
I see an insidious factor that may explain why the confusion persists despite the best
efforts to straighten it out: the theories are unwittingly designed to be confused, insofar as
they use the same tests, similar procedures, and similar concepts. Little can be done
regarding tests and procedures; resolving the confusion thus rests on redesigning the
conceptual framework of both theories (as Pearson put it in 1955, "I would agree that some
of our wording may have been chosen inadequately", p. 206). It does not matter how much
ink is spent on trying to correct the wave of hybridization: the basic design of both theories
will by default confuse the unwary and the expert alike, so much so that, among the latter, it
has happened to the American Psychological Association (2010), Wilkinson and the Task
Force on Statistical Inference (1999), and Krueger (2001), who are rather Fisherian, and to
Kline (2004), Nickerson (2000), Wainer (1999), Cortina and Dunlap (1997), Frick (1996),
Cohen (1988, 1994), Schmidt (1992), and Rosnow and Rosenthal (1989), who are rather
Neyman-Pearsonian.
The goal of this paper is to propose such a conceptual redesign. This covers the names of the
procedures, the role of the null and alternative hypotheses, levels of significance, p-values
and errors, and the statistical significance of results and its interpretation. The article ends
with worked examples that put theory into practice, describing the same results as they
would be interpreted following Fisher's approach and following Neyman-Pearson's
approach.

The Name of the Procedures

When doing significance testing, Fisher was interested in finding noteworthy results and in
assessing the strength of such evidence. Using his approach, the researcher is prepared to pay
attention to statistically significant results and ignore the rest (Fisher, 1960). Therefore,
Fisher's tests of (statistical) significance is an appropriate name for this procedure.
In contrast, Neyman and Pearson's interest was in deciding which hypothesis, among
competing ones, to accept as most likely. They called their procedure tests of statistical
hypotheses (e.g., Neyman & Pearson, 1933), a denomination which certainly differs from
Fisher's but which does not create a clear conceptual separation between the two. After all,
Fisher also uses statistical hypotheses, yet neither procedure actually tests hypotheses; both
test research data against a statistical hypothesis assumed to be true (Gigerenzer, 2004). The
reference to testing hypotheses is thus misleading, and the procedure benefits from being
renamed Neyman-Pearson's tests of acceptance (Fisher, 1955).

Null Hypotheses

The core of Fisher's procedure is a null hypothesis, which represents a theoretical random
distribution generated ad hoc from information provided by the data (e.g., variance) as well
as by theory (e.g., the normal curve) (Fisher, 1959). The procedure locates the research
results within this distribution and assesses their theoretical probability. Research results
with a low probability of occurrence are taken as evidence against the hypothesis, nullifying
it as an explanation of those results (Hubbard & Bayarri, 2003). Calling it the null hypothesis
(H0) is, thus, appropriate.
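To make the idea concrete, below is a minimal simulation sketch, in Python, of how such a
null distribution may be generated ad hoc and the research results located within it. The
group sizes and observed statistic are illustrative assumptions, borrowed from the worked
example at the end of this article (t(50) = 2.34).

```python
# A hypothetical sketch of Fisher's logic: generate the null
# distribution ad hoc and locate the observed result within it.
import numpy as np

rng = np.random.default_rng(0)
n1 = n2 = 26          # two groups; df = n1 + n2 - 2 = 50
t_obs = 2.34          # observed test statistic (illustrative)

def t_stat(x, y):
    """Two-sample t statistic with pooled variance."""
    sp2 = ((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1)) \
          / (len(x) + len(y) - 2)
    return (x.mean() - y.mean()) / np.sqrt(sp2 * (1/len(x) + 1/len(y)))

# H0: both groups are random samples from the same population.
null_ts = np.array([t_stat(rng.standard_normal(n1),
                           rng.standard_normal(n2))
                    for _ in range(100_000)])

# Theoretical probability of results this extreme or more, given H0.
print((null_ts >= t_obs).mean())   # ~ .012
```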
Neyman-Pearson's procedure works in a somewhat similar manner, although it uses
at least two hypotheses representing distributions to be tested in the long run, using repeated
sampling procedures on the same population (Fisher, 1955). The procedure selects the
hypothesis of greatest interest as the one on which to carry out the test, and this hypothesis is
also called the null hypothesis. Yet this approach is not about nullifying a hypothesis but
about deciding among competing hypotheses (Neyman & Pearson, 1928). Thus, the concept
of a null hypothesis is inaccurate here. Neyman-Pearson's procedure benefits from
substituting the concept of a main hypothesis (HM) instead.

Alternative Hypotheses

Fisher's procedure does not contemplate an alternative hypothesis at all (Fisher, 1960).
Somehow, however, there is some emptiness in this procedure on two accounts: on the one
hand, the interest is in a research (substantive) hypothesis even when the test is done on its
complementary statistical null hypothesis (Hager, 2013); on the other hand, there is very
little that can be said about significant results other than that the null hypothesis has been
rejected. The need to fill this emptiness is strong enough that a concept such as the
alternative hypothesis fits well, even when inappropriate. To avoid confusion, Fisher's
procedure could include a reference to the research hypothesis to fill such a gap, as well as to
make researchers aware that the null is proposed as a way of making inferences about the
research hypothesis in the first place; however, we should avoid calling it the alternative
hypothesis.
Neyman-Pearson's procedure requires the use of one or more alternative hypotheses,
which are specific and represent long-run distributions (Neyman & Pearson, 1928).
Nowadays, this procedure is used in such a manner that only the power of the alternative
hypothesis (1 - β) is ever considered. Thus, although the alternative hypothesis is, in
principle, specific, it is often portrayed as an unspecified hypothesis (i.e., as merely being
'unequal' to the main hypothesis); as a consequence, accepting the alternative hypothesis
does not tell us much beyond that fact. Neyman-Pearson's alternative hypothesis (HA) can
certainly retain its name under this procedure, although it should be made explicit, if only by
providing information about its power. (Put otherwise, if the alternative hypothesis is not so
specified, the test is carried out as a de facto Fisher's test.)
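As a sketch of what such explicitness might look like in practice, the snippet below specifies
the alternative hypothesis as a noncentral t distribution and reports its power. The effect size
and group sizes are assumptions borrowed from the worked example at the end of this
article.

```python
# A sketch of an explicit alternative hypothesis: under Neyman-Pearson,
# H_A can be specified as a noncentral t distribution and reported
# together with its power. All inputs are illustrative assumptions.
from scipy import stats

d = 0.8                  # hypothesized effect size under H_A
n1 = n2 = 26             # group sizes
df = n1 + n2 - 2         # 50
alpha = 0.05

ncp = d * (n1 * n2 / (n1 + n2)) ** 0.5    # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha, df)       # one-tailed cut-off
power = stats.nct.sf(t_crit, df, ncp)     # P(accept H_A | H_A true)
print(f"power (1 - beta) ~ {power:.2f}")  # ~ .88 for these inputs
```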

Levels of Significance

Fisher's procedure uses levels of significance for ascertaining the noteworthiness of a result
(Fisher, 1925). Fisher did not put forward any particular notation for them, but researchers
tend to write them as alpha (α). Most researchers also find it convenient to work with fixed
levels of significance (e.g., 5%, 1%). These features have parallels in Neyman-Pearson's
approach. To avoid confusion, Fisher's levels of significance, which are fit-for-purpose,
should be called as such, and sig could be used as shorthand notation when required (thus
avoiding alpha). Convenient levels of significance may be used (Fisher, 1960), although
these should be treated as flexible rather than as strict cut-off points (i.e., the difference
between 5% and 6% is not too critical under Fisher's approach). Furthermore, a gradation in
levels of significance is appropriate (such as significant and highly significant), as it reflects
the relative strength of the evidence against the null hypothesis.
Neyman-Pearson's procedure does not work with evidence against the null but with
long-run error probabilities. Neyman and Pearson sought to keep small the Type I error (the
error committed when wrongly accepting the alternative hypothesis; see note 1), whose
probability is known as alpha (α). Alpha needs to be set in conjunction with beta (β, the
probability of making a Type II error, that of wrongly accepting the main hypothesis). For
convenience, Neyman and Pearson also worked with fixed levels of alpha (e.g., 5%, 1%).
They also called alpha the significance level (Neyman & Pearson, 1933). To avoid
confusion, acceptance level can substitute for the latter, as it is a name more appropriate to
its functioning as the cut-off point for deciding between hypotheses. This procedure also
retains exclusive use of the concepts of alpha (α), beta (β), and power (1 - β; see note 2).
Convenient levels of acceptance may be used; however, whatever level of acceptance is
chosen, it is fixed throughout the research (it represents a fixed error risk), it is strict, not
flexible, and it cannot accept any gradation (i.e., a result cannot be highly accepted, or
accepted with a particular proportion of error, nor can alpha become a 'roving alpha', as
Goodman, 1993, called it).
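A minimal sketch of this joint setting of error risks, assuming Python's statsmodels library
and the design figures used in the worked example at the end of this article, would solve for
the sample size that keeps both risks at their chosen levels before the research is run:

```python
# Solving for the minimum sample size that fixes alpha and beta
# jointly, before any data are collected. Inputs are assumptions
# taken from this article's worked Neyman-Pearson example.
import math
from statsmodels.stats.power import TTestIndPower

n1 = TTestIndPower().solve_power(effect_size=0.8,  # large effect (d)
                                 alpha=0.05,       # Type I error risk
                                 power=0.80,       # 1 - beta
                                 ratio=1.0,        # equal group sizes
                                 alternative='larger')  # one-tailed
n1 = math.ceil(n1)
print(n1, 2 * n1)   # ~ 21 participants per group, ~ 42 in total
```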

P-values

To Fisher, p-values represent the probability of the observed and more extreme results,
always assuming that the null hypothesis is true. They also represent the strength of the
evidence against the null (Fisher, 1960), so that knowing the exact p-value is very
informative indeed. Properly speaking, p-values only make sense under Fisher's approach
and should be used exclusively with it. The preferred method is to report the exact p-value
(e.g., p = .012) (Gigerenzer, 2004), although reporting p-values bound to levels of
significance (e.g., p < .05, or '***' for p < .001) may be convenient in some circumstances
(e.g., in tables).
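For instance, a minimal Python sketch of an exact Fisherian p-value, assuming the
t(50) = 2.34 result from the worked example below, is simply the tail probability of the null
distribution:

```python
# Exact p-value: probability of the observed or more extreme results,
# assuming the null hypothesis (here, a t distribution with df = 50).
from scipy import stats

t_obs, df = 2.34, 50
p = stats.t.sf(t_obs, df)    # one-tailed: P(T >= t_obs | H0)
print(f"t({df}) = {t_obs}, p = {p:.3f}")   # p = .012
```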
Under Neyman-Pearson's approach, p-values are neither necessary nor routinely
calculated, although they may be used as proxies for deciding when a result falls in the alpha
region, so as to accept the alternative hypothesis. The p-value does not provide any strength
of evidence (i.e., the alternative hypothesis can only be accepted; it cannot be accepted
slightly or strongly) (Gigerenzer, 2004). Therefore, when p-values are used, they should be
stripped of their numerical value and simply be reported bound to the selected alpha level
(e.g., p < α, or p < .05).
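By contrast, a Neyman-Pearson sketch of the same result (again assuming the worked
example's figures) discards the numerical p-value altogether and reports only the decision
against the cut-off:

```python
# Neyman-Pearson use: compare the statistic against the critical
# value fixed by alpha; the exact p-value is never reported.
from scipy import stats

t_obs, df, alpha = 2.34, 50, 0.05
t_crit = stats.t.ppf(1 - alpha, df)   # one-tailed critical value

if t_obs > t_crit:
    print(f"t({df}) = {t_obs}, p < {alpha}: accept the alternative hypothesis")
else:
    print(f"p >= {alpha}: accept the main hypothesis (if power is adequate)")
```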

Errors

Fisher's procedure is aimed at novel, single research projects. Although Type I errors
are plausible, they are of little practical relevance given the ad hoc nature of the theoretical
distributions on which the tests are carried out. Any interpretation of p-values as the
probability of making a Type I error is, thus, inaccurate. Because Fisher's procedure does not
contemplate alternative hypotheses, Type II errors are not possible under this approach
(Fisher, 1955).
Type I and Type II errors are concepts first introduced by Neyman and Pearson
(1928), although they were rather interested in minimizing the probability of their
occurrence (α and β, respectively) in the long run under repeated testing (i.e., it is not
possible to ascertain whether an error has been made in any particular trial). Thus, the
concepts of Type I and Type II errors, and their associated probabilities alpha (α) and beta
(β), are exclusive to Neyman-Pearson's approach.


Significant Results

As discussed above, research results can only be statistically significant under Fisher's
procedure. They can also be more or less significant according to the strength of the
evidence they represent against the null hypothesis (Fisher, 1960).
When using Neyman-Pearson's procedure, a research result cannot be significant,
properly speaking. Instead, a result that falls within the alpha region is simply a result on
whose basis the alternative hypothesis will be accepted.

Interpretation

Fisher's results can only be interpreted according to an unavoidable duality: either a rare
chance event occurred or the null hypothesis does not explain the research results (Fisher,
1959). From here an inductive inference may be made regarding whether the results support
the substantive research hypothesis or whether further research is needed.
Neyman-Pearson's results can only be interpreted as a decision: the research results
support either the alternative hypothesis or, if power is adequate, the main hypothesis (if
power is not adequate, nothing can be concluded about the latter) (Neyman, 1953).

Examples

Below are two interpretations of how the same results may appear in a research article,
depending on whether one follows Fisher's procedure or Neyman-Pearson's.


This was a novel, single research study, and statistical analyses were conducted
according to Fisher's procedure, using a conventional level of significance of 5% (sig
= .05). Results showed a highly significant difference in performance between the
control and experimental groups, in favor of the latter, t(50) = 2.34, p = .012, one-
tailed. As the probability of obtaining these or greater results is low under the null
hypothesis, we rejected the latter and inferred that the results reflected a true
difference between groups.

This study forms part of a repeated sampling program, and statistical analyses were
conducted according to Neyman-Pearson's procedure. The acceptance criterion was
set at 5% (α = .05), and a priori research power was set at 80% (1 - β = .80) for an
estimated large effect size (d = 0.8), requiring a minimum total sample size of 42
participants. The observed results fell in the critical alpha region, t(50) = 2.34, p < α,
one-tailed; thus we accepted the alternative hypothesis and concluded that the
experimental treatment had a definite positive impact on performance.

Final Note

Null Hypothesis Significance Testing (NHST) is, by and large, a mismatch of two
incompatible statistical philosophies (e.g., Halpin & Stam, 2006). A reasonable factor in
such confusion is the conceptual framework under which both philosophies were developed
and published. The negative impact of that shared conceptual framework is pervasive
enough that even expert users of statistics confuse the theory of Fisher and the theory of
Neyman and Pearson (e.g., Cohen, 1994). Although NHST may be waning in psychology,
its role in future publications remains open to editorial policy (APA, 2010), while its
historical footprint in the publication record is here to stay, thus perpetuating the confusion
despite the best efforts to straighten it out (e.g., Hager, 2013). The proposed solution is to
reengineer that conceptual framework in a way that not only clarifies the confusion at
present but also offers a tool for reinterpreting past and future NHST-based articles in the
light of their most probable underlying philosophy: either Fisher's tests of significance or
Neyman-Pearson's tests of acceptance. Hopefully, it may also serve as a tool for researchers
to follow their chosen philosophy coherently throughout their work.

References

American Psychological Association (2010). Publication manual of the American
Psychological Association (6th ed.). Washington, DC: APA.
Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical
Science, 18(1), 1-32.
Christensen, R. (2005). Testing Fisher, Neyman, Pearson, and Bayes. The American
Statistician, 59(2), 121-126.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale,
NJ: Lawrence Erlbaum.
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49(12), 997-1003.
Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing.
Psychological Methods, 2(2), 161-172.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, U.K.: Oliver and
Boyd.
Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal
Statistical Society, Series B (Methodological), 17(1), 69-78.
Fisher, R. A. (1959). Statistical methods and scientific inference (2nd ed.). Edinburgh, U.K.:
Oliver and Boyd.
Fisher, R. A. (1960). The design of experiments (7th ed.). Edinburgh, U.K.: Oliver and Boyd.
Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods,
1(4), 379-390.
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587-606.
Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: Implications for
epidemiology of a neglected historical debate. American Journal of Epidemiology,
137(5), 485-496.
Goodman, S. N. (1999). Toward evidence-based medical statistics. 1: The p-value fallacy.
Annals of Internal Medicine, 130(12), 995-1004.
Hager, W. (2013). The statistical theories of Fisher and of Neyman and Pearson: A
methodological perspective. Theory & Psychology, 23(2), 251-270.
Halpin, P. F., & Stam, H. J. (2006). Inductive inference or inductive behavior: Fisher and
Neyman-Pearson approaches to statistical testing in psychological research
(1940-1960). American Journal of Psychology, 119(4), 625-653.
Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p's and α's in
psychological research. Theory & Psychology, 14(3), 295-327.
Hubbard, R. (2011). The widespread misinterpretation of p-values as error probabilities.
Journal of Applied Statistics, 38(11), 2617-2626.
Hubbard, R., & Bayarri, M. J. (2003). Confusion over measures of evidence (p's) versus
errors (α's) in classical statistical testing. The American Statistician, 57(3), 171-178.
doi:10.1198/0003130031856
Huberty, C. J. (1993). Historical origins of statistical testing practices: The treatment of
Fisher versus Neyman-Pearson views in textbooks. Journal of Experimental
Education, 61(4), 317-333.
Johnstone, D. J. (1986). Tests of significance in theory and practice. Journal of the Royal
Statistical Society, Series D (The Statistician), 35(5), 491-504.
Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test.
Psychological Methods, 5(4), 411-414. doi:10.1037//1082-989X.5.4.411
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in
behavioral research. Washington, DC: APA.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method.
American Psychologist, 56(1), 16-26. doi:10.1037//0003-066X.56.1.16
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One
theory or two? Journal of the American Statistical Association, 88(424), 1242-1249.
Lindquist, E. F. (1940). Statistical analysis in educational research. Boston, MA: Houghton
Mifflin.
Louçã, F. (2008). The widest cleft in statistics: How and why Fisher opposed Neyman and
Pearson (Working Papers WP/02/2008/DE/UECE). Lisbon, Portugal: School of
Economics and Management, Technical University of Lisbon. Retrieved from
https://www.repository.utl.pt/bitstream/10400.5/2327/1/wp022008.pdf
Neyman, J. (1953). First course in probability and statistics. New York, NY: Henry Holt.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for
purposes of statistical inference: Part I. Biometrika, 20(A), 175-263.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical
hypotheses. Philosophical Transactions of the Royal Society of London, Series A,
231, 289-337.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and
continuing controversy. Psychological Methods, 5(2), 241-301.
doi:10.1037//1082-989X.5.2.241
Pearson, E. S. (1955). Statistical concepts in their relation to reality. Journal of the Royal
Statistical Society, Series B (Methodological), 17(2), 204-207.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of
knowledge in psychological science. American Psychologist, 44(10), 1276-1284.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and
cumulative knowledge in psychology. American Psychologist, 47(10), 1173-1181.
Schweder, T. (1988). A significance version of the basic Neyman-Pearson theory for
scientific hypothesis testing. Scandinavian Journal of Statistics, 15(4), 225-242.
Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological
Methods, 4(2), 212-213.
Wilkinson, L., & the Task Force on Statistical Inference (1999). Statistical methods in
psychology journals: Guidelines and explanations. American Psychologist, 54(8),
594-604.


Endnotes

1. Or, equivalently, the error committed when wrongly rejecting the main hypothesis.

2. Interestingly enough, Neyman used the notation β for power, and 1 - β for the
probability of making a Type II error (Neyman, 1953).
