Some thoughts
"Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution" (Schmidt & Hunter, 1997, p. 37). "The almost universal reliance on merely refuting the null hypothesis is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology" (Meehl, 1978, p. 817). "A potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring" (Meehl again). Cohen (1994) suggested that "Statistical Hypothesis Inference Testing" produces a more appropriate acronym. What is NHST? What isn't it? And what is the problem?
Fisher vs. Neyman-Pearson
Fisher's approach
1) State the research question. 2) State the null hypothesis. 3) Construct a sampling distribution based on the null hypothesis and calculate the test statistic. 4) Note the associated probability of obtaining that test statistic (i.e., p(D|H0)). 5) Use the p-value to decide whether to reject the null or come to no conclusion.
Example: 1. Do guys and gals differ in their assessment of John McCain's electability? 2. The difference in mean electability ratings is zero. 3. Depends on sample size, but the t-distribution would be appropriate. 4. You found a difference; what is the probability of a difference that large if you were expecting no difference? 5. Based on that observed p-value, do you reject or fail to reject?
H0: μ_male − μ_female = 0
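To make the steps concrete, here is a minimal sketch of the electability example as an independent-samples t-test. The ratings are simulated placeholder values (not real survey data), and the group sizes are arbitrary.

```python
# Sketch of Fisher's steps for the electability example.
# The "ratings" below are simulated placeholders, not real survey data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
male_ratings = rng.normal(loc=5.0, scale=2.0, size=40)     # hypothetical ratings
female_ratings = rng.normal(loc=4.4, scale=2.0, size=40)

# Steps 3-4: test statistic and p(D | H0) -- the probability of a difference
# at least this extreme if the true mean difference were zero.
t_stat, p_value = stats.ttest_ind(male_ratings, female_ratings)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Step 5 (Fisher): treat the p-value as graded evidence against H0,
# rather than as a binary accept/reject decision.
```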
P-value
Note that the p-value in this sense is used as a measure of disbelief in the null hypothesis.
It is not, however, a probability of the null hypothesis being true: it is p(D|H0), not p(H0|D).
However, it is confounded by sample size and so cannot be considered sufficient evidence against the null by itself
All else being equal, increasing N decreases the p-value
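A small simulation illustrates the point; the effect size and sample sizes below are arbitrary choices for demonstration.

```python
# Sketch: with the same small population effect, larger samples give smaller p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_diff = 0.2   # fixed population difference, in SD units

for n in (20, 80, 320, 1280):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(true_diff, 1.0, size=n)
    t_stat, p_value = stats.ttest_ind(a, b)
    print(f"n per group = {n:4d}   p = {p_value:.4f}")
# The p-value typically shrinks as n grows, even though the population
# effect (0.2 SD) never changes.
```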
Central Limit Theorem
Unfortunately, it is often interpreted without regard to sample size, unless of course the sample size is small: "It would have been significant, I swear!"
For example, if you obtained a difference of 2 and it was not large enough to reject "no difference," how do you know that the true population difference isn't 1, or any of the infinitely many other values near zero?
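An interval estimate makes this concrete. The numbers below (an observed difference of 2, a standard error of 1.5, 38 degrees of freedom) are hypothetical.

```python
# Sketch: a non-significant difference is compatible with many non-zero values.
# The observed difference, standard error, and df below are hypothetical.
from scipy import stats

diff, se, df = 2.0, 1.5, 38
t_obs = diff / se
p_value = 2 * stats.t.sf(abs(t_obs), df)        # two-sided p-value
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
print(f"p = {p_value:.3f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
# The interval runs from roughly -1 to 5: zero is plausible, but so are 1, 2, 3...
# Failing to reject H0 does not show that the true difference is zero.
```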
However, you do see researchers accepting the null hypothesis and then drawing conclusions from it.
Not the way to do things according to Fisher or logic
Neyman-Pearson approach (same example): 4. You found a difference and in this case have an observed t-statistic. 5. Is your observed t more extreme than the critical t value (i.e., does it fall into the region of rejection)?
YES? Reject the null. NO? Act as if the null were true.
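In code, the decision looks something like the following sketch; alpha, the degrees of freedom, and the observed t are hypothetical values.

```python
# Sketch of the Neyman-Pearson decision rule: fix alpha in advance and check
# whether the observed t falls in the region of rejection.
from scipy import stats

alpha, df = 0.05, 58     # hypothetical: two groups of 30, df = n1 + n2 - 2
t_obs = 2.10             # hypothetical observed t-statistic

t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
if abs(t_obs) > t_crit:
    print(f"|t| = {abs(t_obs):.2f} > t_crit = {t_crit:.2f}: reject H0")
else:
    print(f"|t| = {abs(t_obs):.2f} <= t_crit = {t_crit:.2f}: act as if H0 were true")
# The exact p-value plays no role; only the region of rejection matters.
```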
Key differences
Fisher
- No alternative hypothesis, no talk of alpha/beta; the decision is based on the observed p.
- Level of significance (the p-value). How determined: early Fisher said to set some acceptable standard, say .05; later Fisher said to state the exact level as a communication to other researchers.
- Epistemic interpretation about the likelihood of the null hypothesis (how much should we disbelieve the null); p is a property of the data.
- Non-significant result: do nothing. You cannot prove the null; you can only falsify it (to some extent).

N-P
- Alternative hypothesis, plus alpha/beta/power concerns that determine the design of the study; the decision is based on whether the observed t-statistic falls into the region of rejection (if not, accept the null).
- Level of significance (alpha): must be set before the experiment so that it can be interpreted as a long-run frequency of error (Type I).
- Behavioristic interpretation (reject or don't) that refers to repeated experimentation; alpha is a property of the test/design.
- The actual observed p-value doesn't matter; the statistic either falls in the region of rejection or it doesn't.

So now that we have this new sort of thing to worry about (alpha), how do we make it more confusing? Set the standard level at .05.
Probability: Fisher
The probability obtained tells us:
If the null hypothesis were true, this is the probability of obtaining a sample statistic of the kind observed
Probability: N-P
State of the world vs. research decision:
H0 true:  Reject H0 = Type I error (p = alpha);  Accept H0 = correct acceptance (p = 1 - alpha)
H0 false: Reject H0 = correct rejection (p = 1 - beta);  Accept H0 = Type II error (p = beta)
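The long-run interpretation of this table can be illustrated by simulation; the effect size, sample size, and number of repetitions below are arbitrary.

```python
# Sketch: estimating the long-run rates in the decision table by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, reps = 0.05, 30, 2000

def rejection_rate(true_diff):
    """Proportion of simulated experiments in which H0 is rejected."""
    count = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            count += 1
    return count / reps

type_i = rejection_rate(0.0)   # H0 true: rejections are Type I errors (about alpha)
power = rejection_rate(0.8)    # H0 false (0.8 SD effect): correct rejections (1 - beta)
print(f"estimated alpha = {type_i:.3f}, power = {power:.3f}, beta = {1 - power:.3f}")
```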
Belief in replicability
P(D|H0) ≠ P(H0|D)
Problems!
Rejecting the null mimics a valid argument form: if H0 is true, then these data would not occur; the data occurred; therefore H0 is false. That argument is true by denying the consequent (modus tollens). Unfortunately, this is not how hypothesis testing actually takes place.
Gives us the illusion of probabilistic proof
The problem is that we make the first statement probabilistic, and that changes everything
Hypothesis Testing
If a person is an American, then he is not a member of Congress. (FALSE)
This person is a member of Congress.
Therefore, he is not an American.
This is a valid argument, but unsound, because the first premise is false.
If a person is an American, then he is probably not a member of Congress. (TRUE)
This person is a member of Congress.
Therefore, he is not an American.
This is the form hypothesis testing takes, and though the first premise is now true, the argument is logically invalid.
p(H | D) = [p(D | H) × p(H)] / [p(D | H) × p(H) + p(D | ~H) × p(~H)]
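A quick numerical example (all probabilities invented for illustration) shows how different p(D|H0) and p(H0|D) can be.

```python
# Sketch: Bayes' theorem with invented numbers, to show p(D|H0) != p(H0|D).
p_H0 = 0.5        # prior probability that the null hypothesis is true
p_D_H0 = 0.04     # p(D|H0): probability of data this extreme if H0 is true
p_D_notH0 = 0.30  # p(D|~H0): probability of the same data if H0 is false

p_notH0 = 1 - p_H0
p_H0_D = (p_D_H0 * p_H0) / (p_D_H0 * p_H0 + p_D_notH0 * p_notH0)
print(f"p(D|H0) = {p_D_H0:.2f}, but p(H0|D) = {p_H0_D:.2f}")
# A "significant" p(D|H0) of .04 does not mean the null has a 4% chance of
# being true; under these assumptions its posterior probability is about .12.
```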
Replication
Another misinterpretation that comes from the confusion over the probability of a hypothesis: some think the probability of replication can be estimated with an NHST p-value. P(D|H0) does not imply anything about p(replication) either.
Belief in the "law of small numbers" (Kahneman & Tversky): some believe that significant results arising from presumably representative (though relatively small) samples are automatically strong findings that will most likely replicate.
The fact is that the significant result may be very unlikely to replicate
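A simulation sketch of this point, with an arbitrary modest effect and small samples: among studies that happened to reach p < .05, how often does an identical follow-up also reach p < .05?

```python
# Sketch: replication rate of "significant" findings under a modest true effect.
# Effect size, sample size, and alpha are arbitrary choices for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, true_diff, alpha, reps = 25, 0.35, 0.05, 4000

def is_significant():
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_diff, 1.0, n)
    return stats.ttest_ind(a, b).pvalue < alpha

originals = [is_significant() for _ in range(reps)]
# Rerun only the studies that "got published" (were significant the first time).
replications = [is_significant() for was_sig in originals if was_sig]
print(f"replication rate among significant originals: {np.mean(replications):.2f}")
# With small samples and a modest effect this rate is typically far below the
# near-certainty that intuition (the "law of small numbers") expects.
```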
Also, just because we reject the null does not imply a theory is correct
The statistical test rarely reflects the actual research idea.
Also mistaken: that failure to reject the null means the population effect size is zero. Absence of evidence is not evidence of absence.
The effect size may be identical in two studies, but because of sample size, one rejects and one doesn't (see the sketch below).
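For instance, using summary statistics only (an identical observed difference of 0.3 SD in both studies; the numbers are hypothetical):

```python
# Sketch: two hypothetical studies with the same observed effect size (d = 0.3)
# but different sample sizes reach different "significance" verdicts.
from scipy import stats

for n in (20, 200):
    t_stat, p_value = stats.ttest_ind_from_stats(mean1=0.3, std1=1.0, nobs1=n,
                                                 mean2=0.0, std2=1.0, nobs2=n)
    print(f"n per group = {n:3d}: d = 0.30, p = {p_value:.4f}")
# Same effect size in both studies, but only the larger one crosses p < .05:
# the reject/don't-reject decision reflects N as much as the effect itself.
```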
So NHST does have utility and is itself not to blame for its misuse and misinterpretation
- Bayesian inference (which does give P(H|D), but has its own problems)
- More focus on confidence intervals
- Effect sizes
- Better use of NHST
- Graphs and more descriptive statistics
Solutions: Bayesian
Benefits: probabilities and intervals that make sense
Gives P(H|D)
More overlap suggests less difference between estimates, and may lead to more thinking about the size of the differences rather than about statistical significance.
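As a minimal sketch of the Bayesian alternative, here is a conjugate normal-normal update for a mean difference; the observed difference, its standard error, and the prior are all hypothetical choices, and real analyses would usually use richer models.

```python
# Sketch: conjugate normal-normal Bayesian update for a mean difference.
# Observed difference, standard error, and prior are hypothetical.
import numpy as np
from scipy import stats

d_obs, se = 0.40, 0.20            # observed difference and its standard error
prior_mean, prior_sd = 0.0, 1.0   # weakly informative prior on the true difference

post_prec = 1 / prior_sd**2 + 1 / se**2
post_mean = (prior_mean / prior_sd**2 + d_obs / se**2) / post_prec
post_sd = np.sqrt(1 / post_prec)

ci_low, ci_high = stats.norm.interval(0.95, loc=post_mean, scale=post_sd)
p_positive = stats.norm.sf(0, loc=post_mean, scale=post_sd)   # P(true diff > 0 | data)
print(f"95% credible interval = ({ci_low:.2f}, {ci_high:.2f}), "
      f"P(diff > 0 | D) = {p_positive:.2f}")
# Unlike p(D|H0), these are direct statements about the hypothesis given the data.
```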
Solutions: General
Don't forget to use your noggin when conducting analyses; don't let SPSS or textbooks tell you what it means. There are other ways to analyze data without using NHST, but don't fall into the same trap of rigid thinking with those either. Focus on effect sizes and interval estimation, report as much information as possible, and let others know exactly why you came to your conclusions (a sketch of such reporting follows below). Collect good data (not as easy as it sounds) and have good theories and clear ideas driving the motivation for your research. Let the data tell their story. Replicate whenever possible, and validate your own data.
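As one example of reporting an effect size and an interval estimate together, here is a sketch with simulated placeholder data (group means, SDs, and sizes are invented).

```python
# Sketch: report an effect size and an interval estimate, not just a p-value.
# The data are simulated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
a = rng.normal(0.5, 1.0, 60)
b = rng.normal(0.0, 1.0, 60)

diff = a.mean() - b.mean()
pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
sp = np.sqrt(pooled_var)                       # pooled standard deviation
d = diff / sp                                  # Cohen's d
se = sp * np.sqrt(1 / len(a) + 1 / len(b))     # standard error of the difference
t_crit = stats.t.ppf(0.975, len(a) + len(b) - 2)
print(f"difference = {diff:.2f}, d = {d:.2f}, "
      f"95% CI = ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
# Readers can judge the size and precision of the effect, not just whether p < .05.
```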
Use valid and reliable measures of the construct under investigation. Use the test of a nil hypothesis as a preliminary step at most. Test approximate null hypotheses (an equivalence-test sketch follows below). Make appropriate decisions based on the situation:
It's not just about Type I error at .05.
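One concrete way to test an approximate (rather than nil) null hypothesis is an equivalence test such as the two one-sided tests (TOST) procedure. The sketch below uses simulated placeholder data and an arbitrary equivalence margin, and assumes a scipy version that supports the alternative argument of ttest_ind.

```python
# Sketch: two one-sided tests (TOST) for an approximate null hypothesis,
# asking whether the true difference lies within +/- margin.
# Data and the margin are placeholders; requires scipy >= 1.6 for `alternative`.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a = rng.normal(0.05, 1.0, 100)
b = rng.normal(0.00, 1.0, 100)
margin = 0.4   # differences smaller than this are considered negligible

# Test 1: is the difference greater than -margin?  (shift a up by the margin)
p_lower = stats.ttest_ind(a + margin, b, alternative="greater").pvalue
# Test 2: is the difference less than +margin?     (shift a down by the margin)
p_upper = stats.ttest_ind(a - margin, b, alternative="less").pvalue

p_tost = max(p_lower, p_upper)
verdict = "equivalent within the margin" if p_tost < 0.05 else "equivalence not shown"
print(f"TOST p = {p_tost:.4f}: {verdict}")
```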
Resources
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In Keren & Lewis (Eds.), Data Analysis in the Behavioral Sciences.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Hubbard, R., & Bayarri, M. J. (2003). Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing. The American Statistician, 57(3), 171-178.
Oakes, M. (1986). Statistical Inference: A Commentary for the Social and Behavioral Sciences. Chichester: John Wiley & Sons.
Abelson, R. P. (1995). Statistics as Principled Argument. Mahwah, NJ: Erlbaum.